Risk Analysis

DOI: 10.1111/risa.12272

The Aggregation of Expert Judgment: Do Good Things Come to Those Who Weight?

Fergus Bolger1,∗ and Gene Rowe2

Good policy making should be based on available scientific knowledge. Sometimes this knowledge is well established through research, but often scientists must simply express their judgment, and this is particularly so in risk scenarios that are characterized by high levels of uncertainty. Usually in such cases, the opinions of several experts will be sought in order to pool knowledge and reduce error, raising the question of whether individual expert judgments should be given different weights. We argue—against the commonly advocated "classical method"—that no significant benefits are likely to accrue from unequal weighting in mathematical aggregation. Our argument hinges on the difficulty of constructing reliable and valid measures of substantive expertise upon which to base weights. Practical problems associated with attempts to evaluate experts are also addressed. While our discussion focuses on one specific weighting scheme that is currently gaining in popularity for expert knowledge elicitation, our general thesis applies to externally imposed unequal weighting schemes more generally.

KEY WORDS: Aggregation; calibration; expert judgment; knowledge elicitation; risk assessment

1. INTRODUCTION

One current mantra often chanted in national and international contexts is the need for more evidence-based policy making. Sometimes, pertinent evidence about a particular policy option is strong, being based on substantial empirical research, but often, published results are either equivocal, or do not exist at all—and yet a policy decision, such as on how to manage a particular risk, is still needed. In such cases, our hypothetical policymaker must rely on expert judgment, usually of scientists. This is frequently the case in situations dealing with rare or novel events, such as in risk assessment.

Risk analysis generally requires quantities as input, and these quantities are usually uncertain. Obtaining estimates of the quantities, and associated uncertainty, from experts is referred to as expert knowledge elicitation (EKE). EKE can be viewed as a piece of empirical research, involving, inter alia, checks of the reliability, validity, and (for applied research) utility of the methods used. As evidenced by risk assessments for climate change mitigation policy, there is a clear need for a systematic approach to EKE and the handling of uncertainty.(1–4) Below we evaluate one increasingly popular approach, known as the "Classical Method" (CM),(5) in terms of its reliability, validity, and utility. To improve accuracy, several experts' opinions are usually sought in EKE, but for policy making a single representation of the uncertain quantity, and related probability, is commonly needed.3 This can be obtained by mathematical aggregation (e.g., averaging the estimates of experts in some manner), or, alternatively, experts can come to some agreement regarding the final values (e.g., in a group that has been tasked with reaching a consensus), known as behavioral aggregation. In both kinds of aggregation, a controversial issue is whether all experts should be treated equally, or whether the judgments of some should be rated more highly than others (i.e., "differential weighting")—and, if so, what criteria should be used to determine the weighting scheme.

1 Department of Management, Durham University Business School, Mill Hill Lane, Durham DH1 3LB, UK
2 Gene Rowe Evaluations, 12 Wellington Road, Norwich NR2 3HT, UK
∗ Address correspondence to Fergus Bolger, Department of Management, Durham University Business School, Mill Hill Lane, Durham DH1 3LB, UK; [email protected].

3 In some situations, it may not be appropriate to combine expert judgments at all, for example, when averaging leads to an average that does not properly represent the views of any expert, or where the judgments form input to a very nonlinear model.





2. THE CLASSICAL METHOD

A 2010 opinion article in Nature(6) recommended that CM—a differential weighting scheme for mathematical aggregation based on the correspondence between subjective and objective probabilities ("calibration")—be used for elicitation of uncertain quantities. Indeed, there is a growing number of applications of CM to risk assessment in areas such as food production, nuclear power, the environment, earthquake vulnerability, and computer-data security. Given the increasing popularity of CM, we consider it a matter of urgency to examine its pros and cons.

The CM involves recruiting a number of experts to answer a particular question, or questions, regarding the value of uncertain quantities, with a view to establishing a probability distribution expressing the uncertainty surrounding each expert's estimate of each quantity. Usually, experts are interviewed individually by an "elicitor" experienced in the method. More specifically, experts make judgments about the range of uncertain quantities for given probabilities—for instance, "what are the highest and lowest values of x such that the true value of x falls within this range on 95% of the occasions it is observed?" Before the elicitations related to the problem of concern, several ranges are elicited for each of a number of uncertain quantities—referred to as "seed variables"—in roughly the same problem domain. The realizations of the seed variables are known to the elicitor but not (it is hoped) to the experts. For a well-calibrated expert, 95% of realizations will fall within the ranges given for the 95% confidence level, and so on. One important facet of the CM procedure, which marks it out from other approaches, is that experts'

judgments are weighted, primarily on the basis of how well calibrated they are on the seed variables. Differential weighting is justified on the assumption that there are individual differences between experts in their ability to express their uncertainty probabilistically. In the rest of this article, it is this weighting aspect of CM that provides the focus of our concerns. We do not argue with the principle of weighting experts by their ability, as this is clearly a sensible strategy. However, we contend that the way in which experts are weighted in CM is not meaningful and, more generally, we suggest that stable individual differences that could be used for weighting are difficult (if not impossible) to measure.
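To make the calibration component concrete, the following is a minimal sketch of a Cooke-style calibration score computed from seed-variable judgments. It assumes the common three-quantile (5th, 50th, 95th) elicitation format described above; the function and variable names are ours, and the code is an illustration rather than a reimplementation of the CM software.

```python
# Minimal sketch of a calibration score from seed-variable judgments,
# assuming three elicited quantiles (5th, 50th, 95th) per seed.
import numpy as np
from scipy.stats import chi2

def calibration_score(quantile_judgments, realizations):
    """quantile_judgments: (n_seeds, 3) array of an expert's 5th, 50th,
    and 95th percentile estimates; realizations: (n_seeds,) true values."""
    # Probability of a realization falling in each of the four
    # inter-quantile intervals for a perfectly calibrated expert.
    p = np.array([0.05, 0.45, 0.45, 0.05])
    n = len(realizations)
    counts = np.zeros(4)
    for (q05, q50, q95), x in zip(quantile_judgments, realizations):
        counts[np.searchsorted([q05, q50, q95], x)] += 1
    s = counts / n                      # empirical interval frequencies
    mask = s > 0
    kl = np.sum(s[mask] * np.log(s[mask] / p[mask]))
    # Under perfect calibration, 2*n*KL is asymptotically chi-square with
    # (number of intervals - 1) degrees of freedom; the score is the
    # corresponding p-value, so better-calibrated experts score higher.
    return chi2.sf(2 * n * kl, df=len(p) - 1)

# Example: an expert whose realizations all pile up in one interval,
# rather than spreading 5/45/45/5, receives a score close to zero.
rng = np.random.default_rng(0)
truth = rng.normal(size=10)
judgments = np.column_stack([truth - 0.2, truth + 0.1, truth + 0.3])
print(calibration_score(judgments, truth))
```

In the full method this score is further combined with an informativeness score, and experts falling below a calibration cutoff effectively receive zero weight (see Note 5 below).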

3. CRITIQUE OF THE CM

3.1. Reliability

The reliability of a test is its ability to consistently measure the construct of interest. There are reasons to believe that CM's test of calibration does not do this. When measuring something there is always some error: CM's test of calibration requires the comparison of subjective and objective probabilities, and error enters the measurement of each of these probabilities, as well as the comparison test itself. CM primarily weights experts on their calibration: reliable weighting that consistently gives better-calibrated experts higher weights depends on having both good data and powerful tests. With regard to the former, there is a large literature on the quality of subjective probability judgment which, although not uncontroversial, makes it clear that, at worst, judgments of this sort are significantly biased (usually overconfident) and, at best, subject to considerable random error. These findings apply to experts and nonexperts alike.(7,8)

We argue that the reasons for the poor quality of probability judgment are twofold. First, in most domains where EKE is needed, the conditions for learning how to formulate reliable cognitive representations of uncertainty are poor. Good conditions include: having considerable experience at judging the same well-defined target; having sufficient good data on which to base judgments; and receiving regular and usable feedback about the accuracy of judgments.(9) Second, we contend that few experts are used to expressing their uncertainty as probability distributions in particular, and many lack experience with probability more generally.

Expressing judgments in a conventional metric such as probability is a skill referred to as "normative expertise" (to be contrasted with "substantive expertise," or domain knowledge). Usually, probabilistic skill is acquired through formal training that few experts receive. CM requires both substantive and normative expertise, and tries to overcome the lack of normative expertise through training. However, given the documented difficulty that most people have with probability concepts,(10) the amount of training that can realistically be given during EKE will almost certainly be inadequate. To put the problem into context, Ericsson and Lehmann(11) state that 10 years of daily deliberate practice are required to attain full mastery in a domain such as chess, although 50 hours can be enough to approach expert levels on certain subskills within the domain (e.g., solving some specific chess problems),(12) whereas a typical training session in probability distributions during an EKE exercise consists of an hour or two at most.

Although experts' difficulties with expressing uncertainty probabilistically are an issue for any EKE method that seeks to elicit probabilities—hence the critical role of training—the problem is particularly acute for CM in that it does not elicit anything else from experts (some methods, for example, ask for qualitative information, such as rationales, in addition to probabilities). Further, CM's weighting system is premised on the notion that there is a high positive correlation between normative and substantive expertise, and this relationship is cast into doubt by the poor probabilistic skills of many experts. If the relationship between normative and substantive expertise is weak, then giving more weight to those experts who are better calibrated could discriminate against the best substantive experts. Although there may be many instances where normative and substantive expertise are well related, we do not believe that this can be taken for granted, being highly dependent on the experts' task and experience (see next paragraph). Meanwhile, there are plenty of examples of poor probability judgment by experts.(7,8,13) Even those experts with a fair amount of statistical training are not immune: a recent study(14) found significant overconfidence in the predictions of experienced economists. The relationship between normative and substantive expertise could even be negative: knowing more can raise confidence, but if the additional knowledge is not useful for the task at hand then performance does not keep pace, resulting in increasing overconfidence with experience.(15)

When learning conditions are favorable, and the judges have significant experience at representing uncertainty probabilistically, good calibration may be observed: weather forecasting is the most cited example.(16,17) Here, calibration is assessed over judgments for a large number of "essentially similar" items;(18) judgments are of the same event based on the same indicators. Further, weather forecasters receive rapid feedback on their forecasts, a day or so later, and such feedback is ubiquitous and characteristic of their job: they receive it every day, are thus able to assess the quality of their judgments accurately and repeatedly, and thereby learn. In contrast, the evidential base for judgments in CM typically varies from one judgment (of seed variables) to another, and from seeds to targets.4 Related to this, calibration skill in one domain of knowledge has not been found to generalize well to other knowledge domains,(19) which is to say, it is not a general cognitive skill like being able to perform mathematics; it is specifically context bound.

Another source of unreliability is that experts may not express their true probability beliefs, either unintentionally, due to misunderstanding the task or poor motivation, or deliberately, so as to maximize their weights. CM uses a "proper scoring rule" to weight experts. Such rules are designed to reward expression of true beliefs and to minimize attempts to "game the system" (e.g., by giving a high level of confidence and very wide intervals to ensure that the true value will almost certainly fall within them). However, Clemen(20) argues that the scoring rule used in CM fails to eliminate gaming; we also believe that it places too much weight on calibration at the expense of informativeness (the narrower an interval, the more "informative" it is about the likely true value of the quantity).5

4 A plausible model of how weather forecasters can be well calibrated is that they learn how well particular cues (e.g., temperature, cloud cover, air pressure) predict criterion events such as precipitation. This being the case, it is clear that the basis for seed selection is that the seeds should share the same cues as the targets. Most problems where EKE is performed are not described by this cue-based model, and it is usually unclear what the underlying psychological model—and thus the basis of seed selection—should be. This issue points to a further criticism of CM: its ad hoc and atheoretical nature. Another issue here is that if the judgment task is one where an expert can learn to be well calibrated, then we normally would not need to perform EKE at all (e.g., we could build a linear prediction model for forecasting rain from cues, as sketched below). There may be cases where EKE is still required, though, such as when there is too little time to develop the statistical model.
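As an illustration of the kind of statistical alternative suggested in Note 4, here is a minimal sketch of a cue-based (logistic, linear-in-the-cues) model predicting rain from weather cues. The cue names, data, and coefficients are entirely synthetic assumptions of ours; the point is only that, where such data exist, a simple fitted model can replace elicitation.

```python
# Minimal sketch: predict rain directly from cues, bypassing EKE.
# All data below are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
# Hypothetical daily cues: temperature (C), cloud cover (0-1), pressure (hPa).
cues = np.column_stack([
    rng.normal(15, 8, n),
    rng.uniform(0, 1, n),
    rng.normal(1013, 10, n),
])
# Synthetic "truth": rain is more likely with high cloud cover and low pressure.
logit = -1.0 + 3.0 * cues[:, 1] - 0.1 * (cues[:, 2] - 1013)
rain = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(cues, rain)
print(model.predict_proba(cues[:5])[:, 1])  # predicted rain probabilities
```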

Further, for proper scoring rules to work they have to be incentivized and understood by the judges; however, experts in the CM are not paid on the basis of their weights, and we contend that—due to their lack of normative expertise—they will rarely understand the rule. We think that deliberate distortion of judgments to maximize weights is unlikely since, in our experience, most experts wish to cooperate fully with the elicitor; however, we contend that CM's proper scoring rule is not well designed to properly incentivize experts, and is also opaque.

The assessment of calibration requires judgments of more than one event—the more events there are, the more reliable the assessment. However, finding seed variables for an EKE is difficult: in the 45 reported applications of CM,(21) the modal number of seed variables used was only 10, which is not sufficient to measure calibration reliably. CM weighting depends on testing the goodness of fit of subjective to objective probability distributions against the null hypothesis of perfect calibration. A sample of 10 seeds, with each expert making three or four judgments per seed variable, will lack sufficient power to detect any but the most miscalibrated experts.6

5 In its most basic form, the proper scoring rule used in CM gives each expert a weight based on the product of that expert's calibration and informativeness scores for each quantity, averaged over the total number of seed variables assessed. Clemen(20) argues that averaging over seed variables in this way means that an expert could strategically state intervals that maximize his or her overall weight, without actually stating a true probability belief for any individual judgment. Clemen acknowledges that, due to the involvement of informativeness in the scoring rule, the final weight attained by an expert behaving strategically in this way might be somewhat less than that of an expert stating his or her true beliefs on every occasion, but that there is no way to determine whether the counterweight of informativeness is a sufficient incentive against such strategic behavior. Further, we observe that, in practice, there is a two-stage process in CM such that experts are first subjected to a test of calibration, and then only those who "pass" are given a weight according to the scoring rule. This procedure thus gives greater emphasis to calibration than to informativeness in the weighting of experts; it also means that potentially valid expert opinions may be lost by virtue of failure to answer seeds.

6 Three quantiles are most commonly used in CM (5th, 50th, and 95th)—resulting in four inter-quantile intervals for the purpose of calculating calibration—in which case the degrees of freedom will equal 3. Since each judge makes four judgments per seed variable they will not be independent, which inflates the Type-I error rate. To compensate for this, we take a stricter level of α than the conventional 0.05, say, 0.01. From Cohen,(22) we find that for a small effect size (w = 0.1) we need a sample of 1,546 to attain an acceptable level of power (probability of a Type-II error = 0.2). For medium (w = 0.3) and large (w = 0.5) effect sizes we need samples of 172 and 62, respectively. A large effect size of 0.5 or more would reflect considerable miscalibration, e.g., 65% of realizations falling in a quantile when only 45% are expected. A rough reconstruction of this power calculation is sketched below.
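The sample sizes cited in Note 6 can be approximately reproduced with a standard power calculation for the chi-square goodness-of-fit test, using Cohen's effect size w and the noncentral chi-square distribution (noncentrality = N × w²). This is our own reconstruction, not part of the CM, and the results may differ slightly from Cohen's tabled values.

```python
# Approximate sample sizes needed to detect miscalibration of effect size w
# in a chi-square goodness-of-fit test with df = 3 (four intervals),
# alpha = 0.01, power = 0.80, as discussed in Note 6.
import numpy as np
from scipy.stats import chi2, ncx2

def required_n(w, alpha=0.01, power=0.80, df=3):
    crit = chi2.ppf(1 - alpha, df)          # critical value of the test
    for n in range(2, 100000):
        # Noncentrality parameter under effect size w with n observations.
        if ncx2.sf(crit, df, n * w**2) >= power:
            return n
    return None

for w in (0.1, 0.3, 0.5):                    # small, medium, large miscalibration
    print(w, required_n(w))
```

With the modal 10 seeds, only gross miscalibration is detectable, which is the point made in the main text.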

3.2. Validity

If a measure is unreliable it cannot be valid, though a reliable measure is not necessarily valid (i.e., reliability is a necessary, though not sufficient, condition for validity).(23) Beyond this, to be a valid test of expertise the measures and test used by CM should be able to distinguish those who have more substantive expertise from those who have less. However, as we have already indicated, there is plenty of empirical evidence that experts are often badly calibrated,(7,8) and someone can be well calibrated yet know rather little, for example, by gaming the system. Further, experts who are normatively sophisticated will be able to manipulate the weights in CM in their favor better than those who are normatively naive, regardless of substantive expertise. Calibration seems, then, to be more a measure of normative than of substantive expertise. In practice, consistency in matching probabilities to outcomes ("resolution") might be more valuable, as one can better act on the estimates of a consistently overconfident judge than on those of an erratic judge who is better calibrated on average.(24) Accuracy of the quantitative estimates may also be a more direct measure of substantive expertise than calibration.

The use of seed variables to measure expertise on a target variable also creates a threat to the predictive validity of CM for weighting experts. Since the seed variables need to predict performance on the target variables, they must be similarly within the experience of the experts. However, by virtue of the fact that we know the answers to the seed variables, but not the target, it is highly likely that experts will have less knowledge relevant to estimating targets than seeds. Related to this, experts may well have access to the same literature as the CM elicitation expert—in which case, "prediction" of seed variables may in fact be a memory test. Cooke and Goossens(25) argue that seed questions requiring predictions are to be preferred over those based on "retrodictions" (i.e., almanac questions); however, in practice the former are difficult to find, and the large majority of applications of CM have used the latter. Further, there is some evidence that the cognitive processes involved in forecasting and "almanac" judgment may be different, with less bias in the former.(26)
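The distinction between calibration and resolution can be made concrete with the standard decomposition of the Brier (mean probability) score into reliability (calibration), resolution, and uncertainty components. This is a generic illustration for binary-event forecasts, with names of our own choosing; it is not the interval format used by CM, nor code drawn from the cited papers.

```python
# Standard decomposition of the Brier score for binary-event forecasts:
# Brier score = reliability - resolution + uncertainty.
import numpy as np

def brier_decomposition(p, y):
    """p: forecast probabilities (rounded to a small set of values);
    y: 0/1 outcomes. Returns (reliability, resolution, uncertainty)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    base_rate = y.mean()
    rel = res = 0.0
    for pk in np.unique(p):
        mask = p == pk
        ok = y[mask].mean()                        # observed frequency at forecast pk
        rel += mask.mean() * (pk - ok) ** 2        # calibration term (smaller is better)
        res += mask.mean() * (ok - base_rate) ** 2 # resolution term (larger is better)
    unc = base_rate * (1 - base_rate)
    return rel, res, unc

# Illustrative forecasts and outcomes.
p = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
y = [1, 1, 1, 0, 0, 0, 0, 1]
print(brier_decomposition(p, y))
```

A judge can score well on the calibration term while contributing little resolution, which is one way of stating the concern in the paragraph above.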


3.3. Utility

EKE involves more than just reliable and valid measurement; it also requires selection of participants and treating them in an ethical manner once recruited.7 However, in CM the search for experts may be guided more by the requirements of the test than by getting the best judgments. For example, the requirement to express uncertainty probabilistically means that academics and scientists might be favored over those with more practical knowledge. Similarly, the reason for polling the opinions of more than one expert is to increase the knowledge base of the research, and this goal will be attained more often if there is heterogeneity of expertise. The requirement of CM that all experts should be able to answer the same seed questions runs counter to this goal of heterogeneity.

Once experts have been found, it may be more difficult to retain them when they have to take a test—they may either sense, or be informed, that they are not "up to standard"—although with good management of the process such perceptions can be largely, but perhaps not totally, dispelled. Further, since CM weighting may not accurately reflect substantive expertise, there is a risk of sending a negative signal to experts who might have something to offer in other ways (e.g., in defining the problem, or to future elicitations). In this manner, negative reputational effects may be created that could affect recruitment and retention of experts in EKE. We have no precise evidence that this has happened in CM studies so far, and it need not happen if the process is properly managed, but the potential for such effects exists, given the increasing popularity of the method and the possibility that future applications of CM may not be so well managed.

Since EKE is applied research, it is important to consider the manner in which its output will be used; producing neat probability distributions from messy data, as is the case with CM, may convey a false air of scientific credibility to the public. We have found that many experts realize this, and have a fundamental concern about being forced to give numbers; that is, they are unhappy about putting fine-grained values to something that cannot be so finely judged.

7 The careful selection, screening, and management of experts is crucial for successful EKE. This is a substantive topic in its own right that is beyond the scope of this article (but see, e.g., Ref. 27). However, it is important to note that expert recruitment and retention issues interact with the EKE method used, so some specific interactions with the CM method are discussed here. For example, the requirement to be able to answer particular seed variables may influence the choice of experts.

Although, if done thoroughly, EKE is never easy, we contend that finding seed variables, recruiting experts, potentially training them, and then assessing them will be even more costly, in terms of both time and money, than equal-weighting methods and/or behavioral aggregation. It is, therefore, questionable whether the CM is worth the extra expense and effort: we argue here that it is not.8

8 One potential advantage of unequal over equal weighting is that it helps overcome the problem of dependencies between experts. For example, if there are two experts who share the same knowledge then, if equal weights are used, they will together be given twice the weight of a third expert who has a different, but perhaps equally valid, knowledge base. Unequal weighting could help resolve this problem, but we are again faced with the question of how to determine the weights (in this case, identifying the degree of overlap between experts' knowledge, which would require the elicitor to be a "super expert"(28)). We propose that a better solution is behavioral aggregation, in which experts in effect weight themselves. In the above example, if experts are allowed to exchange rationales for their positions, it should become clear to everyone that two of the experts share the same position, and the other a different one. Let us suppose that the single expert's position is actually more valid in the current context than that of the twin experts. On the basis of the rationales given, the twin experts move towards the single expert by about the same amount while the single expert retains her position; thus the twin experts weight themselves equally, but less than the single expert. Finally, we wish to reiterate our point regarding the importance of heterogeneity in expert groups: we argued in Section 3.3 that CM's requirement to answer a single set of seed questions tends towards homogeneity; thus the different perspectives in the example above might never have been given an airing in the first place.

4. ALTERNATIVES TO WEIGHTING

The first thing to note is that, where there are sufficient data, purely statistical approaches should be used instead of processes involving human judgment. According to a substantial body of research, statistical approaches provide more reliable, and also more valid, assessments. For example, work on clinical judgment shows that linear models regularly outperform the experts from whom they were derived, largely by virtue of being more consistent,(29) while in forecasting, statistical methods can rapidly extract signal from noise in large data sets, whereas human forecasters have been found to be subject to numerous biases, including adding noise to their forecasts and seeing patterns in random data.(30) In conducting a task such as a risk assessment, the first step should therefore be to review thoroughly all possible sources of data, and only to turn to EKE when necessary.
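The clinical-judgment finding just mentioned, that a linear model derived from an expert's own judgments tends to outperform the expert by being more consistent, can be illustrated with a short simulation. The data and noise levels below are arbitrary assumptions of ours, used only to show the mechanism.

```python
# Minimal sketch of "bootstrapping the judge": fit a linear model to an
# expert's past judgments, then apply it consistently. Synthetic data only.
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
cues = rng.normal(size=(n, k))                  # case features seen by the expert
true_weights = np.array([0.6, 0.3, 0.1])
outcome = cues @ true_weights + rng.normal(0, 0.3, n)
# The expert uses roughly the right cue weights but adds judgmental noise.
expert_judgment = cues @ true_weights + rng.normal(0, 0.8, n)

# Fit a linear model to the expert's judgments (not to the outcomes).
X = np.column_stack([np.ones(n), cues])
coef, *_ = np.linalg.lstsq(X, expert_judgment, rcond=None)
model_prediction = X @ coef

# The noise-free model of the judge typically tracks the outcome better
# than the judge does, because it removes the judgmental noise.
print(np.corrcoef(expert_judgment, outcome)[0, 1])
print(np.corrcoef(model_prediction, outcome)[0, 1])
```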


Having decided to conduct EKE, a crucial initial phase—and one that is arguably underplayed by those responsible for producing uncertainty estimates of risks, etc.—is to consider in depth exactly which experts are needed to answer the question of concern. It is likely that experts from different professions and with different experience will be useful in contributing to a solution, each providing separate quanta of relevant knowledge. Because such experts are heterogeneous, assessing them on one test makes little sense. Allowing the experts to discuss the topic can aid in the resolution of a problem (i.e., behavioral aggregation), although it is important to recognize that factors related to group interaction can lead to biases in group judgment (for example, group polarization, failure to share unique information, groupthink, and individual characteristics such as dominance and dogmatism).(31) It is for this reason that great efforts have been made to develop facilitated group processes, whereby group discussion is aided by a knowledgeable facilitator in order to preempt biases. One such approach, which we will refer to as "the Sheffield method,"(32) does just this, and is also designed specifically to provide probability distributions of use to risk assessors.

Other facilitated-group approaches to knowledge elicitation recognize the difficulty of getting a significant number of relevant experts together in one place, and attempt to combine behavioral and mathematical aggregation components—allowing experts to interact in a limited manner via online or paper questionnaires, and then aggregating judgments using an equal-weighting rule. The Delphi technique(33) is perhaps the best known of such approaches and, although traditionally used to gain answers to qualitative questions (e.g., ordering priorities) rather than to provide aggregate probability distributions, it could fairly easily be adapted to the estimation of uncertain quantities. In short, there is little consensus about the best method for doing EKE: various approaches exist that are perhaps suited to different problem contexts, though our main contention is that the CM approach, because of its flawed differential weighting element, is categorically not the answer.
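For concreteness, here is a minimal sketch of the kind of equal-weighting rule mentioned above: a linear opinion pool over elicited quantiles. Converting each expert's 5th/50th/95th percentiles to a normal distribution is purely an illustrative assumption, and the numbers are hypothetical; neither Delphi nor the Sheffield method prescribes this particular recipe.

```python
# Equal-weight linear opinion pool: each expert's elicited percentiles are
# (as a simplifying assumption) converted to a normal distribution, and the
# pooled distribution is their unweighted mixture.
import numpy as np
from scipy.stats import norm

# Hypothetical elicited percentiles (5th, 50th, 95th) for one uncertain quantity.
experts = np.array([
    [ 8.0, 10.0, 12.0],
    [ 9.0, 12.0, 18.0],   # note: the normal fit ignores this expert's asymmetry
    [ 5.0,  9.0, 11.0],
])

z95 = norm.ppf(0.95)
means = experts[:, 1]
sds = (experts[:, 2] - experts[:, 0]) / (2 * z95)   # match the 90% interval width

def pooled_cdf(x):
    """Equal-weight mixture CDF over experts."""
    return np.mean([norm.cdf(x, m, s) for m, s in zip(means, sds)], axis=0)

# Pooled median found by a simple grid search over a plausible range.
grid = np.linspace(0, 25, 2501)
pooled_median = grid[np.argmin(np.abs(pooled_cdf(grid) - 0.5))]
print(pooled_median)
```

A performance-based scheme would simply replace the equal weights in the mixture, which is precisely the step we argue cannot currently be justified.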

5. SUMMARY AND CONCLUSIONS

The criteria of having enough judgments of sufficiently related items to reliably discriminate between experts using calibration-based measures are unlikely to be fulfilled in anything but highly data-rich domains, where EKE will not be required (e.g., statistical models can be used instead). Also, we have argued that calibration is not the most valid measure of expert performance that could be used, measuring normative rather than substantive aspects of expertise; these criticisms are made on psychological as well as statistical grounds. In general, performance-based approaches to weighting experts require those doing the elicitation to set what is essentially an exam for the experts; exam setters and markers are normally more expert than those taking the exam, but in the CM they are less expert! CM attempts to circumvent this problem by setting substantive experts a normative exam, an exam that the statisticians who set it can pass but that very few domain experts can. Finally, finding and retaining experts for EKE is likely to be compromised by any attempt to assess them, while a weighting system will incur significant extra costs. The arguments we have presented, plus empirical findings,(20,34,35) suggest that, relative to behavioral aggregation with equal weights,9 the costs of any performance-based weighting of experts in EKE in general,10 and CM in particular, outweigh any potential benefits. So, to answer the question posed in the title: in our view, no, good things do not come to those who weight, nor to those who are weighted.

ACKNOWLEDGMENTS

We wish to thank Rob Clemen, Alastair McClelland, Frank Yates, and two anonymous reviewers for their help with the preparation of this article.

9 We wish to reiterate that we are not against differential weighting per se, but are skeptical as to whether a valid and reliable basis for an externally imposed weighting scheme can be found. We emphasize "externally imposed" because, as we argued in Note 8, we believe that experts should be free to weight themselves in behavioral aggregation. After behavioral aggregation we propose equal weighting of experts externally, for example by the elicitor, to provide the final estimate. We concede, however, that there may be occasions when external unequal weighting might be considered, for instance, when consensus cannot be reached and there are homogeneous subgroups: Budnitz et al.(28) give guidance about best practice in such circumstances. A further possibility, particularly if no reasonable basis can be found for unequal weighting, is not to aggregate at all (see Note 3).

10 We have in mind here the use of peer and self-assessments of expertise, publications,(36) and citations(37) as performance weights, which similarly have not been shown to be well related to substantive expertise.

REFERENCES

1. Reilly J, Stone PH, Forest CE, Webster MD, Jacoby HD, Prinn RG. Uncertainty and climate change assessments. Science, 2001; 293:430–433.
2. U.S. EPA. Expert Elicitation Task Force White Paper. Washington, DC: Science and Technology Policy Council, 2011. Available at: http://www.epa.gov/stpc/pdfs/eewhite-paper-final.pdf, Accessed July 29, 2014.
3. Morgan MG, Henrion M. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge: Cambridge University Press, 1990.
4. Clemen RT, Reilly T. Making Hard Decisions with Decision Tools, 3rd ed. Mason, OH: South-Western, 2014.
5. Cooke RM. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford: Oxford University Press, 1991.
6. Aspinall W. A route to more tractable expert advice. Nature, 2010; 463:294–295.
7. Griffin D, Brenner L. Perspectives on probability judgment calibration. Pp. 177–199 in Koehler DJ, Harvey N (eds). Blackwell Handbook of Judgment and Decision Making. Malden, MA: Blackwell, 2004.
8. Lichtenstein S, Fischhoff B, Phillips L. Calibration of probabilities: The state of the art to 1980. Pp. 306–334 in Kahneman D, Slovic P, Tversky A (eds). Judgment Under Uncertainty: Heuristics and Biases. Cambridge, UK: Cambridge University Press, 1982.
9. Bolger F, Wright G. Assessing the quality of expert judgment: Issues and analysis. Decision Support Systems, 1994; 11:1–24.
10. Gigerenzer G. Reckoning with Risk: Learning to Live with Uncertainty. London: Penguin, 2003.
11. Ericsson KA, Lehmann AC. Expert and exceptional performance: Evidence of maximal adaptation to task constraints. Annual Review of Psychology, 1996; 47:273–305.
12. Ericsson KA, Harris MS. Expert chess memory without chess knowledge: A training study. Bulletin of the Psychonomic Society, 1990; 28:518.
13. Gigerenzer G, Gaissmaier W, Kurz-Milcke E, Schwartz LM, Woloshin S. Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 2007; 8:53–96.
14. Soyer E, Hogarth RM. The illusion of predictability: How regression statistics mislead experts. International Journal of Forecasting, 2012; 28:695–711.
15. Oskamp S. Overconfidence in case study judgments. Journal of Consulting Psychology, 1965; 29:261–265.
16. Murphy AH, Winkler RL. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society, Series C, 1977; 26:41–47.
17. Charba JP, Klein WH. Skill in precipitation forecasting in the National Weather Service. Bulletin of the American Meteorological Society, 1980; 61:1546–1555.
18. Keren G. Calibration and probability judgments: Conceptual and methodological issues. Acta Psychologica, 1991; 77:217–273.
19. Solomon I, Ariyo A, Tomasini LA. Contextual effects on the calibration of probabilistic judgments. Journal of Applied Psychology, 1985; 70:528–532.

20. Clemen RT. Comment on Cooke's classical method. Reliability Engineering & System Safety, 2008; 93:760–765.
21. Cooke RM, Goossens LLHJ. TU Delft expert judgment database. Reliability Engineering & System Safety, 2008; 93:657–674.
22. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 3rd ed., pp. 215–271. New York: Academic Press, 1988.
23. Bolger F, Wright G. Reliability and validity in expert judgment. Pp. 47–76 in Wright G, Bolger F (eds). Expertise and Decision Support. New York: Plenum, 1992.
24. Yates JF. External correspondence: Decompositions of the mean probability score. Organizational Behavior and Human Decision Processes, 1982; 30:132–156.
25. Cooke RM, Goossens LHJ. Procedures guide for structured expert judgement in accident consequence modelling. Radiation Protection Dosimetry, 2000; 90:303–309.
26. Wright G, Ayton P. Subjective confidence in forecasts: A response to Fischhoff and MacGregor. Journal of Forecasting, 1986; 5:117–123.
27. Meyer MA, Booker JM. Eliciting and Analyzing Expert Judgment: A Practical Guide. London: Academic Press, 1991.
28. Budnitz RJ, Boore DM, Apostolakis G, Cluff LS, Coppersmith KJ, Cornell CA, Morris PA. Recommendations for Probabilistic Seismic Hazard Analysis: Guidance on Uncertainty and Use of Experts, Vol. 1. Washington, DC: U.S. Nuclear Regulatory Commission, 1995.
29. Grove WM, Zald DH, Lebow BS, Snitz BE, Nelson C. Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 2000; 12:19–30.
30. Goodwin P, Onkal D, Lawrence M. Improving the role of judgment in economic forecasting. Pp. 163–189 in Clements MP, Hendry DF (eds). The Oxford Handbook of Economic Forecasting. Oxford: Oxford University Press, 2011.
31. Hardman D. Decision making in groups and teams. Pp. 146–158 in Hardman D (ed). Judgment and Decision Making: Psychological Perspectives. Chichester: Wiley, 2009.
32. Oakley JE, O'Hagan A. SHELF: The Sheffield Elicitation Framework [Internet]. Version 2.0. Sheffield, UK: University of Sheffield, School of Mathematics and Statistics, 2010 [updated 2013 Mar 11]. Available at: http://tonyohagan.co.uk/shelf, Accessed October 7, 2013.
33. Rowe G, Wright G. The Delphi technique: Past, present, and future prospects—Introduction to the special issue. Technological Forecasting and Social Change, 2011; 78:1487–1490.
34. Clemen RT, Winkler RL. Combining probability distributions from experts in risk analysis. Risk Analysis, 1999; 19:187–203.
35. Lin SW, Cheng CH. The reliability of aggregated probability judgments obtained through Cooke's classical method. Journal of Modeling in Management, 2009; 4:149–161.
36. Burgman MA, McBride M, Ashton R, Speirs-Bridge A, Flander L, et al. Expert status and performance. PLoS ONE, 2011; 6:e22998. doi:10.1371/journal.pone.0022998.
37. Cooke RM, ElSaadany S, Huang X. On the performance of social network and likelihood-based expert weighting schemes. Reliability Engineering and System Safety, 2008; 93:745–756.
