The Reliability of Clinical Diagnoses: State of the Art

Helena Chmura Kraemer

Department of Psychiatry and Behavioral Sciences, Stanford University (Emerita), Palo Alto, California 94301; and Department of Psychiatry, University of Pittsburgh, Pittsburgh, Pennsylvania 15213; email: [email protected]

Annu. Rev. Clin. Psychol. 2014. 10:111–30

First published online as a Review in Advance on January 2, 2014

The Annual Review of Clinical Psychology is online at clinpsy.annualreviews.org

This article's doi: 10.1146/annurev-clinpsy-032813-153739

Copyright © 2014 by Annual Reviews. All rights reserved

Keywords

validity, disorder, design, kappa

Abstract

Reliability of clinical diagnosis is essential for good clinical decision making as well as productive clinical research. The current review emphasizes the distinction between a disorder and a diagnosis and between validity and reliability of diagnoses, and the relationships that exist between them. What is crucial is that reliable diagnoses are essential to establishing valid diagnoses. The present review discusses the theoretical background underlying the evaluation of diagnoses, possible designs of reliability studies, estimation of the reliability coefficient, the standards for assessment of reliability, and strategies for improving reliability without compromising validity.


Contents


INTRODUCTION
THE ELEMENTS OF DIAGNOSIS
QUALITY OF A DIAGNOSIS
RELIABILITY STUDIES
    Sampling and Design: Naturalistic
    Sampling and Design: Two-Stage Design
    Sampling and Design: Stratified
    Sampling and Design: Multiple Raters
    Types of Reliability
OTHER MEASURES OF RELIABILITY
POWER AND PRECISION
HOW RELIABLE IS RELIABLE ENOUGH?
IMPROVING RELIABILITY
    Standardize Conditions
    Train/Qualify Raters
    Use Multiple Ratings for Diagnosis
DISCUSSION

INTRODUCTION

Precise terminology is essential to any scientific discussion (Finney 1994), for imprecise terminology often leads to controversies, misunderstandings, and misleading results. The particular terminology of concern here is first "clinical diagnosis" and "clinical disorder" and then "reliability" and "validity," with the ultimate focus on the reliability of clinical diagnoses. A "disorder" is something wrong within a patient, e.g., an injury, a malfunction, an infection, or something that causes distress, impairs the patient's functional status, and, if not recognized and treated, can worsen, become prolonged, and affect both the quality and quantity of life of that patient. What is wrong is serious, is long-lasting in the absence of intervention, and is not in the control of the patient. If the patient can choose whether or not to exhibit certain characteristics, even if they impair his/her own functional status, that is not a disorder but rather a choice or a lifestyle. If the distress or impairment occurs only because others impose it on the patient, rather than arising within the patient, that too is not a clinical disorder [consider homosexuality in early versions of the Diagnostic and Statistical Manual of Mental Disorders (DSM)]. Patients with the same disorder may be heterogeneous. They may be at different stages of the disorder, have different symptomatic expressions of the disorder, or experience different impairments and different levels of distress. What makes the condition a medical disorder is that there is an underlying, potentially identifiable, cause, course, and cure, even if current knowledge has not yet identified these. The cause may be biological or environmental, is often multifactorial, and often involves complex interactions. Other than infectious diseases or single-gene disorders, physical disorders and mental disorders usually have no single cause but rather a complex web of causes. The course for one patient with the disorder may be rapid and for another slow, and the cure, too, may be multifaceted. A physical disorder is one that manifests in physical problems; a mental disorder is one that manifests in behavioral, emotional, or cognitive problems. Some disorders are both physical and


mental. The primary focus of this discussion is on mental disorders, but to stress the fact that the issues relating to evaluation of diagnoses are fundamentally no different for physical and mental disorders, illustrations will draw from both sources. In contrast to a clinical disorder, a “clinical diagnosis” is an informed opinion from a clinician that a certain disorder exists in the patient. The disorder is what the patient has; the diagnosis is what a clinician gives him/her. That diagnosis may be simply a yes/no answer (categorical diagnosis) or a score on some scale (dimensional diagnosis). The focus here is on a categorical diagnosis (yes/no) because it is the most difficult challenge. However, the same principles apply for dimensional diagnoses (as is discussed in the final section). The correspondence between the diagnosis and a disorder defines the quality of the diagnosis for that disorder. Generally that quality is expressed as its reliability and its validity or as its sensitivity and specificity. Troubles arise when the words “disorder” and “diagnosis” are used as if synonymous. Such use was apparent in the development process for the DSM-5 (Am. Psychiatr. Assoc. 2013) and has become even more evident in the reactions to the recent publication of the DSM-5. The DSM does not define disorders; it defines diagnoses. It may well be that one DSM diagnosis corresponds to several different disorders or that two or more DSM diagnoses actually refer to only one disorder. If clinical and research evidence later warrants, new diagnoses will be added, and some diagnoses will be removed or combined with other diagnoses. The DSM is neither a gold standard nor the “bible” of mental disorders but rather an expression of the current state of knowledge related to identifying who has mental disorders. The DSM-III (Am. Psychiatr. Assoc. 1980), DSM-IV (Am. Psychiatr. Assoc. 1994), and DSM-5 (Am. Psychiatr. Assoc. 2013) are successive approximations; they are guides for knowledgeable clinicians to communicate with their patients, to make the very best evidence-based clinical decisions for their patients, and to serve as the basis of further clinical research into the disorders to which those diagnoses correspond. In contrast to the situation with physical disorders, it appears that many individuals do not believe in the existence of mental disorders, holding the view that those who are given such diagnoses are simply in the extremes of the distributions of normal behaviors, emotions, or cognitions (the abnormal) (Frances 2013) or are simply individuals whom society does not approve of (the rejects). Moreover, suspicion exists that diagnoses of mental disorders are influenced by a conspiracy of drug companies, the American Psychiatric Association, and others to increase profits (Greenberg 2013). If one shares such beliefs, it is hard to generate interest in, or enthusiasm for, evaluating diagnoses. Others may believe that mental disorders exist, but they frown on the attempt to define diagnoses for disorders not yet well understood. Why put out such effort to find a mythical Atlantis? But perhaps the search is not for Atlantis, but for the Northwest Passage. That search took place with no guarantee that such a passage existed, took many years, and experienced many failures before its ultimate success. However, the process of exploration was itself vitally important to science, navigation, and commerce. 
In the same way, the search for better quality diagnoses of disorders serves as the basis of advances in understanding and dealing with disorders. In short, this review is aimed at those at least agnostic in their beliefs about the existence of mental disorders and the wisdom of seeking good quality diagnoses for such disorders. It is based on the premise that disorders (physical or mental) exist, even if we currently don't know their cause, course, or cure, and that many who try to improve diagnosis are doing so with a genuine concern to use the current evidence base as best they can to improve the lot of those with disorders. Moreover, this review is based on the hope that having diagnoses to use in clinical research leads to a greater understanding of the disorder(s) those diagnoses refer to, to an improved evidence base and thus to improvement in diagnosis over time, and to an increasingly close correspondence between diagnosis and disorder.


THE ELEMENTS OF DIAGNOSIS


A diagnosis is defined by three entities: its protocol, its response, and its referent (Kraemer 1992b). The protocol defines the population of patients for whom the diagnostic procedure is used and the circumstances under which the information about the patient being diagnosed is obtained. The quality of a diagnosis for a disorder in a community sample (e.g., in epidemiological studies) may differ from that in a clinical sample (e.g., those referred to a general psychiatric clinic), and it may be very different if those who are uncooperative or have comorbidities are excluded from the population. Indeed, the DSM-5 field trials well illustrate that exactly the same diagnosis may have different quality when it is used in different types of psychiatric clinics. For example, the proposed diagnosis of disruptive mood dysregulation disorder (excluded from DSM-5) was good at one site (k = 0.49) and unacceptable at two others (k = 0.06, 0.11) (Regier et al. 2013). Blood tests underlying diagnoses for physical disorders often require eight hours of fasting prior to sampling as part of the protocol. If that instruction is not followed, there is no guarantee that the diagnosis will be accurate. Similarly, for diagnoses of mental disorders, even if exactly the same criteria are applied in all situations to the same population, issues of significance include whether the protocol for diagnosis requires a clinical interview of the type clinicians normally do and whether that interview is conducted using a structured interview, whether the period of observation extends over days or weeks or is restricted to a single interview, and whether the interview is conducted over the telephone by a nonclinician following an interview script (as often done in epidemiology studies). These represent different protocols. The response is the totality of information gathered that is used to make the diagnosis. In exercise stress testing for the diagnosis of coronary artery disease, this might include the duration of testing, the heart rate at various stages of testing, and the timing and severity of symptoms. For a blood test, the response might be a biochemical assay result. In DSM diagnostic criteria, there is often a listing of the signs and symptoms that need to be considered. Such information may include sociodemographic variables (age, sex, and ethnicity), biological variables (height and weight), or imaging or genetic results, as well as whatever physical, behavioral, emotional, or cognitive information is directly observed in, or elicited from or about, the patient by the clinician. Again, if the list of information elicited by a clinician differs substantially from that specified in the diagnostic criteria, it is a different diagnosis. The referent is the rule that determines which configuration of the response leads to a positive diagnosis (DX+) [the remainder of configurations being negative (DX−)]. For the diagnosis of hypertension, it might mean a specification of the systolic and diastolic blood pressures above which the diagnosis is positive. In using the Mini Mental State Exam (MMSE) to diagnose Alzheimer’s disease, a score below 24 might be considered a positive result. For many DSM diagnoses, the referent is a specification of the duration and number of signs/symptoms on the associated lists. Changing the referent will change the quality of the diagnosis.

QUALITY OF A DIAGNOSIS

The quality of a diagnosis for a disorder depends on how closely related that diagnosis is to the corresponding disorder. Two factors that define the quality of a diagnosis for the disorder are reliability and validity. Here again, as is the case with "disorder" and "diagnosis," there is often a problem with terminology. Many use the term "reliability" to mean "I trust this" and "validity" to mean "I believe this" on the basis of completely subjective judgments. It is important, however, to move past subjective judgments to evidence-based evaluation of diagnoses. In Figure 1, the circle represents the totality of individual differences (variance) in diagnoses among patients in the population sampled. This variance has three nonoverlapping influences.


Figure 1. The distribution of the individual differences (variance) in the diagnosis in the population of interest. (The figure partitions the total variance into three nonoverlapping portions: Disorder, Contaminants, and Random error.)

First and most important, a portion of the variance is due to variance among the patients sampled in characteristics related to the disorder the diagnosis is meant to identify (Disorder). Second, a portion is due to variance among the patients sampled in characteristics unrelated to the disorder but that affect the diagnosis (Contaminants). The final portion of variance is due to factors not characteristic of the patient (Random error) and caused by, for example, random fluctuations within the patient (while the disorder status remains unchanged), errors by the diagnostician, or lack of clarity in the diagnostic criteria. Validity of the diagnosis for the disorder is defined as the proportion of the total variance of the diagnosis due to disorder. Reliability of the diagnosis is defined as the proportion of the total variance due to disorder plus contaminants, i.e., due to characteristics of the patient and not due to random error. From these definitions, we note that: 

• Validity and reliability are numbers between 0 and 1 (or 0% and 100%); they are not "yes" or "no" answers.

• Validity can never be greater than reliability. It is possible to have a diagnosis that is 100% reliable and 0% valid, but it is not possible to have a diagnosis with low reliability and high validity.

• It is possible to increase the reliability of a diagnosis at the cost of its validity.

Validity is measured by correlating the diagnoses with the presence/absence of the disorder in the population of interest. The fundamental problem with evaluating the validity of clinical diagnoses is that, because we have no way of ascertaining the presence/absence of a disorder without recourse to a diagnosis, there is no direct way of measuring validity. In fact, if we had such a gold standard method of identifying those with the disorder, why would we bother to consider diagnoses for that disorder? We would simply use that gold standard as the diagnosis. Consequently, what is done is not to prove validity but rather to challenge validity in a variety of ways (Kraemer 2013). The more challenges it withstands, the more likely the diagnosis is to be valid (Robins & Barrett 1989). However, in the absence of reliability, the diagnosis is unlikely to withstand any challenge at all because unreliability attenuates all correlations. For that reason, the primary focus in evaluating diagnoses is on examining the reliability of a diagnosis and then trusting that clinical decision making and subsequent research using that reliable diagnosis will reveal its shortcomings. Such revelations would serve as a basis for improving the validity of the diagnosis.
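To make the attenuation point concrete, the following is a minimal simulation sketch; the prevalence, error rates, and criterion are invented for illustration and are not taken from the review. As random error is added to a binary diagnosis, its reliability and its observed correlation with an external validity criterion fall together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Illustrative setup: a latent disorder, an external validity criterion that tracks it,
# and a binary diagnosis whose reliability is degraded by random error.
disorder = rng.binomial(1, 0.2, n)              # true disorder status, prevalence 0.20
criterion = disorder + rng.normal(0.0, 1.0, n)  # criterion for convergent validity

def diagnose(error_rate):
    """Return a diagnosis that flips the true status with probability error_rate."""
    flip = rng.binomial(1, error_rate, n)
    return np.where(flip == 1, 1 - disorder, disorder)

for err in (0.0, 0.1, 0.3):
    dx1, dx2 = diagnose(err), diagnose(err)            # two independent ratings
    reliability = np.corrcoef(dx1, dx2)[0, 1]          # correlation of parallel ratings
    validity_corr = np.corrcoef(dx1, criterion)[0, 1]  # attenuated by unreliability
    print(f"error={err:.1f}  reliability~{reliability:.2f}  corr(DX, criterion)~{validity_corr:.2f}")
```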


Figure 2. The geography of the receiver operating characteristic (ROC) plane. (Sensitivity is plotted on the y-axis against 1 − specificity on the x-axis; the plot marks the ideal point at (0, 1), the random ROC diagonal, the π-reference line running from (π, π) to the ideal point, and the location of a diagnosis point relative to the point (P, P).)

The significant role of reliability in diagnosis leads to the question of how it is measured. Measuring reliability requires the correlation of two or more independent diagnoses per subject in the population of interest over a span of time during which each individual patient is unlikely to change disorder status.

Another separate but related approach to assessing the quality of a diagnosis for the disorder is via sensitivity and specificity. The sensitivity of a diagnosis for a disorder is the probability of a positive diagnosis for those with the disorder. The specificity is the probability of a negative diagnosis for those without the disorder. The most informative way to view the quality of a diagnosis is via a receiver operating characteristic (ROC) plane, a graph on which the sensitivity of a diagnosis is plotted on the y-axis and 1 − specificity on the x-axis (Figure 2). Each possible diagnosis of the disorder can then be located as a single point in this plane. Every diagnosis equivalent to random decision making will be located on the upward-slanting diagonal: the random ROC (Sensitivity = 1 − Specificity, where the probability of a positive diagnosis is exactly the same for those with and without the disorder). Ideally, if the diagnosis perfectly predicts the presence/absence of the disorder, that diagnosis point will be at the upper left-hand corner of the ROC plane: the ideal point (Sensitivity = Specificity = 1). As a practical matter, no diagnosis for which there is reasonable rationale and justification will ever lie on or below the random ROC, and no diagnosis will ever lie at the ideal point because of inevitable human error. It is the location between these two extremes that indicates the quality of the diagnosis for that disorder.
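As a purely illustrative sketch (the counts below are hypothetical, and in practice the disorder column is unobservable, as discussed next), the location of a diagnosis in the ROC plane follows directly from its sensitivity and specificity:

```python
# Hypothetical cross-tabulation of a diagnosis against true disorder status.
true_pos, false_neg = 80, 20        # patients with the disorder
false_pos, true_neg = 30, 170       # patients without the disorder

sensitivity = true_pos / (true_pos + false_neg)   # P(DX+ | disorder present) = 0.80
specificity = true_neg / (true_neg + false_pos)   # P(DX- | disorder absent)  = 0.85

x, y = 1 - specificity, sensitivity               # coordinates in the ROC plane
youden = sensitivity + specificity - 1            # 0 on the random ROC, 1 at the ideal point
print(f"ROC point: ({x:.2f}, {y:.2f}); Sensitivity + Specificity - 1 = {youden:.2f}")
```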


If the prevalence of the disorder (π) equals the prevalence (P) of a positive diagnosis, that diagnosis point lies on a line connecting the ideal point (0, 1) with the point on the random ROC determined by the prevalence of the disorder (π, π): the π-reference line. Every other possible diagnosis for the same disorder in the same population lies on a line parallel to the π-reference line that crosses the random ROC at the point (P, P) (a geometric expression of Bayes' theorem). The distance between the diagnosis point (determined by its Sensitivity and 1 − Specificity) and the π-reference line indicates the bias of the diagnosis (P − π). The shortest distance between the diagnosis point and the random ROC indicates the validity of the diagnosis: (Sensitivity + Specificity − 1)². To improve the quality of a diagnosis, one must either reduce the bias, moving P closer to π, or increase the magnitude of Sensitivity + Specificity. Bias, in this case, is not necessarily bad. For a screening diagnosis, i.e., if the diagnosis were only a preliminary to more extensive and accurate diagnostic procedures, one might welcome overdiagnosis to avoid false negative screens. Then a diagnosis point above the π-reference line would be preferred (P > π). At the other extreme, if a positive diagnosis were to immediately lead to highly invasive, costly, and risky intervention, one might prefer underdiagnosis to avoid iatrogenic effects. Then a diagnosis point below the π-reference line might be preferred (P < π).

However, we do not know who does and does not have the disorder; there is no gold standard against which to compare the diagnosis. We cannot estimate π or Sensitivity (Se) or Specificity (Sp) without such a gold standard; all we can estimate is P. Thus, the ROC evaluation of a diagnosis against the criterion of disorder, i.e., its validity, is of theoretical importance only. What we can do, however, is to estimate the reliability in addition to the prevalence of the diagnosis. With a categorical diagnosis, each patient in the population of interest has a certain probability of a positive diagnosis, p(i), which we can crudely estimate by taking M > 1 independent diagnoses for patient i and computing the proportion positive. The reliability of the diagnosis is defined (Lord & Novick 1968) as the ratio of the variance of the patients' true scores, p(i), to the variance of the patients' observed scores, PP′ (where P′ = 1 − P). With a categorical diagnosis, this ratio is the intraclass kappa coefficient (Kraemer 1979):

k = Variance(p(i))/PP′. (1)

We can then parse this out, for

k = B · V + C², (2)

where B = (ππ′)/(PP′) (with π′ = 1 − π) is a measure of the bias of the diagnosis relative to the disorder. When the diagnosis is unbiased, B = 1, although B may also be 1 when there is bias (when π = P′). V is an indicator of validity: V = (Se + Sp − 1)². Finally, C² indicates the degree of influence that contaminants have on the diagnosis:

C² = [π Variance(p(i)|D+) + π′ Variance(p(i)|D−)]/PP′. (3)

If the diagnosis is completely free of contaminants and reflects only the presence/absence of the disorder, then the variance of p(i) among those patients with the disorder (D+) is zero, as is the variance of p(i) among those patients without the disorder (D−). In that case, C² = 0. The greater the influence of contaminants on the diagnosis, the larger is C². Now k = 0 if and only if p(i) is constant for all patients i in the population. In that case, Se = 1 − Sp, and V = 0. Every completely unreliable diagnosis is also completely invalid. At the other extreme, if V = 1 for some diagnosis, then Se = Sp = 1, which means that p(i) for all those with the disorder is 1, and p(i) for all those without the disorder is 0. Then k = 1. Every perfectly valid diagnosis is perfectly reliable.
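The algebra of Equations 1-3 can be checked numerically. The sketch below uses assumed, illustrative values of π, Se, Sp, and the within-group variances of p(i); it is not an analysis of any actual diagnosis.

```python
# Minimal numeric check of k = B*V + C^2 (Equations 1-3) under assumed values.
pi = 0.20                       # prevalence of the disorder (unknowable in practice)
se, sp = 0.75, 0.90             # sensitivity and specificity of the diagnosis
var_pos, var_neg = 0.03, 0.01   # Var(p(i)) within the D+ and D- groups (contaminants)

p = pi * se + (1 - pi) * (1 - sp)      # prevalence P of a positive diagnosis
pq = p * (1 - p)                       # PP', the observed-score variance

# Total variance of p(i): within-group (contaminant) part + between-group (disorder) part
var_p = (pi * var_pos + (1 - pi) * var_neg) + pi * (1 - pi) * (se + sp - 1) ** 2

k = var_p / pq                                    # Equation 1
b = pi * (1 - pi) / pq                            # bias term B
v = (se + sp - 1) ** 2                            # validity term V
c2 = (pi * var_pos + (1 - pi) * var_neg) / pq     # contaminant term C^2 (Equation 3)
print(f"k = {k:.3f};  B*V + C^2 = {b * v + c2:.3f}")   # the two agree (Equation 2)
```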


A diagnosis that is unbiased (B = 1) and free of contaminants (C = 0) has k = V. In the absence of bias and contaminants, reliability directly determines validity. Otherwise, when both bias and contaminants may be present, a high reliability is necessary for good validity, but it is not sufficient. By increasing the influence of reliably measured contaminants on the diagnosis, it is possible to increase reliability while decreasing validity. Clearly the ultimate goal is not a reliable but rather a valid diagnosis. Expending time and effort to assess reliability is a necessary step to a valid diagnosis. To show that a diagnosis is valid, one needs to show that the diagnosis correlates strongly with characteristics known to be associated with the disorder (criteria for convergent or predictive validity) and not to be associated with contaminants, characteristics known not to be related to the disorder (criteria for discriminative validity). However, because unreliability attenuates any correlation, showing that a criterion for convergent or predictive validity is not correlated with the diagnosis does not show absence of validity unless the diagnosis is already known to be reliable. Similarly, showing that a criterion for discriminative validity is correlated with the diagnosis does not show absence of validity unless the diagnosis is already known to be reliable. Thus, it is vital to first have a reliable diagnosis in order to evaluate the evidence either supporting or refuting its validity. Moreover, assessing the validity of the diagnosis against various criteria also generates the information necessary to improve the validity of the diagnosis. Hereafter the focus of this review is on the reliability of clinical diagnoses. The goals are to (a) discuss the sampling and design issues for reliability studies, (b) describe procedures for estimation of reliability, (c) suggest standards of assessment of reliability, and finally (d ) consider strategies to improve reliability without compromising validity.


RELIABILITY STUDIES

Sampling and Design: Naturalistic

Conceptually simplest is the situation in which a representative sample of N patients from the population of interest is drawn. Each patient is independently (blindly) assessed using the diagnosis (i.e., the protocol and referent) by two raters drawn from the population of raters to whom the results are to apply. Note that if the population includes only two raters, the two raters are randomized to the first and second ratings for each individual patient. The interval between the ratings must be long enough to ensure blindness among the ratings but short enough to ensure that the disorder status remains constant. The results from such a study can be summarized in a 2 × 2 table cross-tabulating the first rating (DX1+/−) versus the second (DX2+/−) (see Table 1). The actual numbers of patients in each cell are a, b, c, d, totaling N, and the probabilities of each cell are listed in parentheses. The estimated percentage of agreement between ratings is Â = (a + d)/N. It should be noted that with random assignment of raters to ratings 1 and 2, the probabilities of the two types of disagreement, corresponding to b and c, are the same, and the marginal probabilities are the same.

Table 1. A 2 × 2 table showing the results of a reliability study: a naturalistic sample with N patients and two blinded diagnoses (DX) per patient

                 DX2+                 DX2−                 Row totals
DX1+             a (P² + PP′k)        b (PP′(1 − k))       (P)
DX1−             c (PP′(1 − k))       d (P′² + PP′k)       (P′)
Column totals    (P)                  (P′)                 N


Table 2. A 2 × 2 table showing results with a two-stage sample, N1 and N2 patients from each of the two strata. Asterisks signify that a, b, c, and d in Table 2 differ from a, b, c, and d in Table 1.

        DX2+               DX2−               Row totals
DX1+    a* (P + P′k)       b* (P′(1 − k))     N1
DX1−    c* (P(1 − k))      d* (P′ + Pk)       N2


The sample estimate of the prevalence of a positive diagnosis, P, is the average of the two sample marginal probabilities (a + b)/N and (a + c)/N, i.e., P̂ = (2a + b + c)/(2N). The probability of agreement when diagnosis is random (k = 0) is C = P² + P′² = 1 − 2PP′. Thus the estimated proportion of random agreement is Ĉ = 1 − 2P̂P̂′. Then:

k̂ = (Â − Ĉ)/(1 − Ĉ). (4)

This measure of reliability (the sample intraclass kappa) is often referred to as “agreement corrected for chance,” which, of course, it is. But, more important, it estimates the classic definition of reliability as the ratio of the true variance to the observed variance (Kraemer 1979).
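For concreteness, a minimal sketch of Equation 4 applied to a hypothetical naturalistic 2 × 2 table (the cell counts are invented for illustration):

```python
def intraclass_kappa(a, b, c, d):
    """Sample intraclass kappa (Equation 4) from the naturalistic 2x2 table of Table 1."""
    n = a + b + c + d
    agreement = (a + d) / n               # A-hat, observed proportion of agreement
    p = (2 * a + b + c) / (2 * n)         # P-hat, averaging the two marginal proportions
    chance = 1 - 2 * p * (1 - p)          # C-hat, agreement expected when k = 0
    return (agreement - chance) / (1 - chance)

# Illustrative counts for N = 200 patients, two blinded ratings each.
print(round(intraclass_kappa(a=30, b=15, c=13, d=142), 3))   # about 0.59
```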

Sampling and Design: Two-Stage Design

Another possible design, one that reduces the total number of ratings needed (2N above), is a two-stage design. At the first stage, N subjects are sampled from the population of interest, and each is diagnosed (per protocol) by a randomly selected rater from that population of interest. Then N1 patients are randomly selected from those with a positive first diagnosis and N2 from those with a negative first diagnosis. In the usual low-prevalence situation, this would usually mean selecting all of the rare first positive diagnoses but only a small portion of the common first negative ones. Only the N1 + N2 patients so selected are evaluated by a second blinded rater randomly selected from the remaining pool of raters. The results can also be summarized in a 2 × 2 table (see Table 2), but the table is different from Table 1 because the row marginals are now fixed at N1 and N2. Here, we deal with the conditional probabilities of the second diagnosis given the first, and the estimated kappa is simply:

k̂ = a*/N1 − c*/N2. (5)
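A corresponding sketch of the two-stage estimate in Equation 5, again with invented counts:

```python
def two_stage_kappa(a_star, n1, c_star, n2):
    """Sample kappa from a two-stage design (Equation 5): a*/N1 - c*/N2, where a* and c* count
    second positive diagnoses among the N1 first-positive and N2 first-negative patients."""
    return a_star / n1 - c_star / n2

# Illustrative counts: 50 first-positive and 50 first-negative patients receive a second rating.
print(two_stage_kappa(a_star=32, n1=50, c_star=6, n2=50))   # 0.64 - 0.12 = 0.52
```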

Sampling and Design: Stratified

If there are strong risk factors for a positive diagnosis that can be easily assessed in selecting patients, there is yet a third possibility: a stratified design. In this case, the population is stratified (for simplicity, say) into two strata: those likely to have a positive diagnosis (a proportion Q of the population) and those less likely (a proportion Q′). At the first stage, N patients are sampled from the population, and Q is estimated from this sample, Q̂. Then N1 patients are sampled from the likely stratum, and N2 from the less likely stratum. Each of these N1 + N2 patients undergoes two diagnoses (as in the naturalistic design), and the results in each sample are compiled into a 2 × 2 table (as in Table 1) with frequencies a1, b1, c1, d1 (totaling N1) in the likely stratum and a2, b2, c2, d2 (totaling N2) in the less likely stratum. Then one combines the results from the two strata into a single 2 × 2 table like that in Table 1, with a = Q̂ · a1 + Q̂′ · a2, b = Q̂ · b1 + Q̂′ · b2, etc. The estimation proceeds as for the naturalistic sample (Equation 4), although the standard error (SE) and confidence intervals are now very different.


This was the type of design used in the DSM-5 field trials (Clarke et al. 2013), and it is often useful in minimizing the cost and difficulty in assessing the reliability of relatively rare diagnoses. In the DSM-5 field trials, the strata were defined by the presence/absence of the corresponding DSM-IV diagnoses already available for the clinic samples, with the goal of sampling at least 50 (N1 = N2 = 50) patients in each stratum. Since many of the DSM-5 diagnoses evaluated had a prevalence below 10% (P < 0.10), this design minimized the number of subjects on whom two tests had to be done to achieve the desired precision of the estimated kappa. It should be noted that this is not a case-control design. In case-control designs, a sample of N1 is drawn from a high-risk group and a sample of N2 from a low-risk group. Frequently, the high-risk group is not representative of the likely stratum or the "low-risk" group is not representative of the less likely stratum of the population, a situation producing Berkson's fallacy (Berkson 1946, 1955; Brown 1976). Moreover, an estimate of Q is then not typically available to use as a sampling weight in the estimation.
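A sketch of the stratified combination described above, with hypothetical stratum tables and an assumed Q̂; the combined table is then fed to Equation 4, which depends only on the cell proportions:

```python
def intraclass_kappa(a, b, c, d):
    """Equation 4 applied to a (possibly weighted) 2x2 table."""
    n = a + b + c + d
    agreement = (a + d) / n
    p = (2 * a + b + c) / (2 * n)
    chance = 1 - 2 * p * (1 - p)
    return (agreement - chance) / (1 - chance)

# Illustrative stratified data: Q-hat from the first-stage sample, then one table per stratum.
q_hat = 0.30                                # estimated proportion of the "likely" stratum
likely = dict(a=28, b=9, c=8, d=55)         # N1 = 100 patients from the likely stratum
less_likely = dict(a=4, b=7, c=6, d=83)     # N2 = 100 patients from the less likely stratum

# Weight each cell by the stratum proportions, as described in the text.
combined = {cell: q_hat * likely[cell] + (1 - q_hat) * less_likely[cell] for cell in "abcd"}
print(round(intraclass_kappa(**combined), 3))
```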


Sampling and Design: Multiple Raters

Situations exist in which one might sample N patients from the relevant population and have each blindly diagnosed by M > 2 raters. A form of intraclass kappa is available for this situation, and it estimates the same parameter as does the one with M = 2 (Fleiss 1971, 1981). Indeed, for a fixed sample size of patients, one can get a much more precise estimate of kappa by using a larger number of raters (Donner & Wells 1986). Finally, there is also a form of intraclass kappa available for the situation in which each patient is diagnosed by varying numbers of raters (at least two) (Landis & Koch 1977). However, this form of kappa is valid only if the number of raters per patient is completely uncorrelated with the factors influencing the diagnosis. In many cases, patients with the most ratings would be those who are least impaired, most compliant, etc., which are factors often associated with the diagnosis. I have never seen this form used and would recommend against using it.
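For the equal-raters case (each patient rated by the same number M of raters), one standard form of the multirater intraclass kappa for a binary diagnosis, following Fleiss (1971), can be sketched as below; the data are invented for illustration:

```python
def multirater_kappa(positives_per_patient, m):
    """Intraclass-type kappa for N patients each rated by the same M raters (binary diagnosis),
    in the form described by Fleiss (1971). The argument lists, per patient, how many of the
    M diagnoses were positive."""
    n = len(positives_per_patient)
    p_bar = sum(positives_per_patient) / (n * m)     # overall proportion of positive diagnoses
    # mean within-patient agreement over all pairs of raters
    agree = sum(x * (x - 1) + (m - x) * (m - x - 1)
                for x in positives_per_patient) / (n * m * (m - 1))
    chance = p_bar ** 2 + (1 - p_bar) ** 2           # agreement expected by chance
    return (agree - chance) / (1 - chance)

# Illustrative data: 8 patients, each independently diagnosed by M = 4 raters.
print(round(multirater_kappa([4, 4, 3, 0, 1, 0, 0, 2], m=4), 3))
```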

Types of Reliability

One crucial decision is always that of defining the population of patients to which the results are to be applied. In the DSM-5, the decision was made very early that the results were to be applied to clinic populations likely to be affected by DSM-5 diagnoses. Another crucial decision is that of defining the population of raters (i.e., intrarater, interrater, or test-retest) to whom the results are to apply and thus the type of reliability of concern. If the diagnosis of concern is made at a particular time point by a single rater (i.e., a population of one rater), the focus is on intrarater reliability. This type of reliability is seldom used for diagnoses, either physical or mental, because most diagnoses require interaction between the rater and patient, and it is virtually impossible to ensure independence of ratings when both ratings are done based on the same information by a single rater. More important, particularly with mental disorders, there is often considerable random variability of expression of disorder-related characteristics within patients with the disorder or without the disorder. This variability is ignored in intra- and interrater reliability, and the resulting inflation of reliability may mislead clinical decision making. The following example illustrates intrarater reliability (W. Byron Brown, personal communication): At one time it was proposed that different treatments were required depending on the patient's specific "type" of cancer cell. A classification scheme was proposed by an expert. Cancer tissue slides from 200 cancer patients were obtained and labeled in a way uninformative to the rater. The expert who proposed the scheme was asked to classify each of the 200 slides and then,


six months later, to classify the same slides, randomly reordered and relabeled, once again. (The kappa here came out at about 0.2.) This is intrarater reliability. For diagnosis in general, however, the reliability of a single rater is not of interest. In the above example, it was important only because the rater was the expert who originated the proposed classification system. However, who cares how reliable Mary Smith or Tom Jones is, when neither will be diagnosing the patients at other sites or at other times? Even more important, ensuring the independence of the two ratings per patient by the same rater using the same information is almost impossible, except when (as above) the diagnosis is based on tissue slides or images, which do not require interaction with the patient.

Interrater reliability is more common. In this case, two raters per patient diagnose each patient using the same information (one time point). For example, the 200 slides above were also each evaluated by two different clinicians sampled from a pool of raters, each trained by the expert, each unaware of who the other clinician was, ensuring blindness. (The kappa here also came out at about 0.2.) In some situations, in evaluating psychiatric diagnoses, one rater interviews and rates while the other only rates. However, the blindness of the two raters in such cases is questionable because information is conveyed from one rater to the other via the interview itself. In some cases, two raters observe and rate an interview done by a third clinician. If the two raters occupy the same physical space, "blindness" may not be complete because the reactions of one may convey information to the other. More important, however, error variance due to inconsistency within the patient or inconsistencies between clinicians in conducting the interview is excluded, and these are often major sources of unreliability of diagnoses. There are also situations in which interviews are videotaped, and the videotapes are rated by two blinded raters. In these situations the blindness may be complete because the raters may view the videotapes at different times and in different places. However, a diagnosis of a videotape is not the same as the diagnosis of a live patient, and, once again, error variance due to inconsistencies is excluded. In all such approaches, the reliabilities found may be exaggerated. It should be noted, however, that the latter two approaches, with two raters independently viewing either an interview done live by a third clinician or on a videotape, are methods very useful for training raters to reliably use diagnostic criteria, but such training represents a different situation from trying to document the reliability of the diagnostic criteria in the first place.

Finally, there is test-retest reliability, in which the patient is diagnosed by two raters at two separate times within a time period long enough to ensure independence of the raters but short enough that the disorder status of the patient is unlikely to have changed (i.e., almost no new onsets or cures). In DSM-5 field trials, this period of time was set at no less than four hours and no longer than two weeks, but tests for most patients were conducted about one week apart (Clarke et al. 2013). To patients, and very likely to clinicians, the test-retest reliability of a diagnosis is most important.
If patients get a certain diagnosis from a clinician, how sure can they be that a day or two later, when their clinical condition has not changed, they would get the same diagnosis from another equally expert clinician using the same protocol? If the diagnoses differ, then one or both must be wrong, and depending on any single diagnosis might lead to the wrong treatment. Test-retest reliability indicates the concordance between a first and a second opinion. It should be noted that a study in which the two ratings are separated by six months or one year is unlikely to be a valid reliability study, since one might expect new onsets and cures within any extended period of time. This correlation might better be termed "consistency over time." However, if this correlation is high, that would suggest that the test-retest reliability is even higher, since this correlation would be attenuated by the unreliabilities of the diagnoses at both the initial and the follow-up periods.


On the other hand, if the correlation is low, it is impossible to tell whether this is due to unreliability at one time point or to low consistency over longer periods of time.

OTHER MEASURES OF RELIABILITY


In the discussion so far, it has been assumed that the measure of reliability of a categorical diagnosis is the intraclass kappa. This is because the sample intraclass kappa estimates exactly what the classic definition of reliability requires, the ratio of the true variance to the observed variance. Nevertheless, many question the use of this measure for a variety of reasons. Some of the reservations about kappa result from studies like the one mentioned above on classification of tissue samples: A kappa of 0.2 is too low! Surely that couldn't be right! There must be something wrong with the kappa! The point of evaluation of a diagnosis is to show how good the diagnosis is, if it is good, as well as to show how bad it is, if it is bad, in the population of interest. Arguing from the results is not a viable option.

Some correctly note that it is far more difficult to achieve adequate kappas in homogeneous populations (here, in very low- or high-prevalence populations), an issue often referred to as the base rate (P) problem with kappa (Elwood 1993, Spitznagel & Helzer 1985, Veiel 1988). However, that is a general problem with reliability, not merely with kappa. When the contribution of disorder or contaminants is very small (as in a homogeneous population), a very little bit of error can be overwhelming (consider Figure 1).

Other complaints center on the many other choices of indices available. For example, why not simply report the proportion of agreement between raters? The proportion of agreement between raters using a random number generator is 1 − 2PP′. Thus the minimal value is 0.5 (when P = 0.5). As P becomes very small or very large (PP′ near zero), the percentage of agreement rapidly approaches 100%. Many of the DSM-5 diagnoses evaluated in the clinical trials had estimated P less than 0.1 (Regier et al. 2013), and thus the percentage agreement, even with random decision making, would be greater than 82% [1 − 2(0.1 × 0.9)]. With relatively rare diagnoses, even random decision making looks very good! This, of course, was Cohen's original argument against using percentage agreement and his reason for proposing kappa.

Cohen also introduced other kappa coefficients, including the weighted kappas k(w) (w between 0 and 1) (Berry et al. 2005, Cohen 1968, Fleiss & Cicchetti 1978, Fleiss et al. 1969), where the weight w is determined by the relative importance of the two types of errors (false positive and false negative). In particular, k(1/2), where the emphasis on the two types of errors is equal, is often called Cohen's kappa. If the trial is a well-designed and executed reliability study, so that b/N and c/N in Table 1 estimate the same probability, Cohen's kappa [and every k(w)] estimates the same population value as does the intraclass kappa, but does so slightly less efficiently. However, many trials are not well designed or well executed. If, for example, the more experienced rater always gives the first rating, then b/N and c/N may estimate different probabilities, and the marginal probabilities of the 2 × 2 table may not be the same. In that case, Cohen's kappa estimates a population parameter larger than the population intraclass kappa because Cohen's kappa tends to excuse systematic errors between the first and second ratings (Bloch & Kraemer 1989). Then, too, each of the weighted kappas estimates a different population parameter.

What of the phi coefficient? If a positive diagnosis is coded +1 and a negative diagnosis 0, the phi coefficient is the familiar product moment correlation coefficient between the two raters. If the trial is a well-designed and well-executed reliability study, the phi coefficient estimates the same population parameter as does the intraclass kappa, but again at the cost of precision. If the trial is not well designed or executed, phi will estimate a different parameter from that estimated by the intraclass kappa or Cohen's kappa or the weighted kappas.


What of the odds ratio ad/bc in Table 1? Even in a well-designed and well-executed reliability study, the odds ratio does not clearly indicate reliability. Indeed, in such a study,

Odds Ratio = 1 + k/[(PP′)(1 − k)²]. (6)

This means that as P becomes very large or very small (PP′ near zero), the odds ratio approaches infinity. In short, the odds ratio is even more misleading than the percentage agreement. Other such measures exist: some estimate the same parameter in a well-designed and well-executed study, but less efficiently, and others relate poorly to reliability as classically defined. The preferred choice for a reliability measure remains the intraclass kappa.
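The contrast among these indices is easiest to see on a single hypothetical table. The sketch below (invented counts for a relatively rare diagnosis, P of about 0.06) computes the percentage agreement, the intraclass kappa of Equation 4, Cohen's kappa, phi, and the odds ratio side by side: agreement and the odds ratio look impressive, while the kappas and phi tell a more sober story.

```python
import math

def indices(a, b, c, d):
    """Compare several agreement indices on the same 2x2 reliability table (illustrative only)."""
    n = a + b + c + d
    agreement = (a + d) / n
    p = (2 * a + b + c) / (2 * n)
    chance = 1 - 2 * p * (1 - p)
    intraclass_kappa = (agreement - chance) / (1 - chance)          # Equation 4
    p1, p2 = (a + b) / n, (a + c) / n                               # the two observed margins
    cohen_chance = p1 * p2 + (1 - p1) * (1 - p2)
    cohen_kappa = (agreement - cohen_chance) / (1 - cohen_chance)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    return agreement, intraclass_kappa, cohen_kappa, phi, odds_ratio

print([round(x, 2) for x in indices(a=6, b=7, c=7, d=180)])
# [0.93, 0.42, 0.42, 0.42, 22.04]: 93% agreement and an odds ratio above 22, yet kappa is only 0.42
```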


POWER AND PRECISION

A reliability trial is not done to test the hypothesis that the reliability is above zero. The rationale and justification for a diagnosis should be strong enough to ensure nonzero reliability. The finding of a "statistically significant" result merely means that the sample size was large enough (i.e., power great enough) to detect some deviation from zero, not that the reliability is even minimally acceptable. Instead, the purpose of a reliability study is to estimate the reliability (as above) with some indication of how precisely that reliability is estimated (its SE or a confidence interval).

First, a distinction must be made between a failed diagnosis and a failed study. If the study design is so poor that the reliability is not estimated precisely, that is a failed study. Nothing can then be said about the diagnosis the study evaluates. In the DSM-5 field trials, it was specified a priori that the kappa be estimated with an SE less than 0.1, and the goal sample size was determined accordingly (Clarke et al. 2013). However, a certain number of field trials (i.e., the evaluation of a single diagnosis at a single site) failed, usually because the sample size fell far short of the a priori goal (Regier et al. 2013). These failed trials were reported (as they must be). It should be noted that the estimated kappas in the failed trials ranged across the spectrum from k = 0.28 to 0.77. It is important that the results of a reliability study not be rejected because they fail to support acceptable reliability for the diagnosis nor accepted because they do. The estimate of kappa in a failed field trial cannot be trusted and should be ignored. On the other hand, a failed diagnosis is one where the reliability is measured precisely enough, but its value is too low to be of any potential importance. Several reliabilities reported in the DSM-5 field trials were listed as unacceptable (Regier et al. 2013). None of the diagnoses found to be consistently unacceptable appear in the DSM-5.

To make this distinction between a failed trial and a failed diagnosis, we need a measure of precision of the estimated kappa. How to estimate the SE or confidence interval for kappa is an important and often difficult issue. The approximate SE of an estimated kappa, using naturalistic sampling with N patients and two raters per patient, is

SE² = [(1 − k)/N][(1 − k)(1 − 2k) + k(2 − k)/(2PP′)]. (7)

When k = 0, SE = 1/√N. When k = 1, SE = 0. For extreme P (PP′ near zero), the sample size N needed to have SE < 0.1, for example, will always be very large. An approximate 95% two-tailed confidence interval is k − 1.96SE to k + 1.96SE, where k and P are estimated in the sample. This approximation is reasonably accurate for large sample sizes and is the only known way of determining a priori the sample size necessary to obtain an estimate of k with the required precision in planning a reliability study. Once the study is done, bootstrap methods (Efron 1988, Efron & Gong 1983, Efron & Tibshirani 1995) can also be used both to estimate the SE and to compute the 95% confidence interval. Here that would mean generating a new kappa by sampling N subjects from the observed sample with replacement and computing the estimated kappa.


This is repeated, say, 500 times, to yield 500 bootstrap estimates of kappa. The mean of these 500 estimates is an improved estimate of kappa; the standard deviation gives an estimate of its SE. The span between the 2.5th percentile and the 97.5th percentile of these bootstrap estimates gives a 95% two-tailed confidence interval. For the two-stage sampling, the exact SE of the estimated kappa is

SE² = (1 − k)[(PP′ + P′²k)/N1 + (PP′ + P²k)/N2]. (8)


The approximate 95% confidence interval is, once again, k−1.96SE to k + 1.96SE (using estimated k). Once again bootstrap methods can be used. Now a random sample of N1 is drawn from the likely stratum and N2 from the less likely stratum, both with replacement, and used to estimate each bootstrap kappa. This is repeated, say, 500 times, and the improved estimate of kappa, its SE, and a confidence interval can be obtained, as above. For the stratified sample, even an approximate SE of the estimated kappa is unknown. Bootstrap estimates are now the only available approach. One would draw a sample of N1 from one stratum and N2 from the other, both with replacement, obtain the estimate of kappa, and proceed as above. The estimate would be conditional on the estimate of Q in the original sample. Similarly, for the multirater kappa, bootstrap methods are recommended in the absence of better methods for estimating the SE and confidence intervals.
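A sketch combining both approaches for the naturalistic design, the analytic SE of Equation 7 and a simple percentile bootstrap, follows; all counts and the random seed are arbitrary and chosen only for illustration.

```python
import math
import random

def kappa_hat(pairs):
    """Sample intraclass kappa (Equation 4) from (rating1, rating2) pairs coded 0/1."""
    n = len(pairs)
    agreement = sum(r1 == r2 for r1, r2 in pairs) / n
    p = sum(r1 + r2 for r1, r2 in pairs) / (2 * n)
    chance = 1 - 2 * p * (1 - p)
    return (agreement - chance) / (1 - chance), p

def se_eq7(k, p, n):
    """Approximate SE of the estimated kappa under naturalistic sampling (Equation 7)."""
    return math.sqrt((1 - k) / n * ((1 - k) * (1 - 2 * k) + k * (2 - k) / (2 * p * (1 - p))))

# Illustrative data: 200 patients, two blinded ratings each.
pairs = [(1, 1)] * 30 + [(1, 0)] * 15 + [(0, 1)] * 13 + [(0, 0)] * 142
k, p = kappa_hat(pairs)
print(f"kappa = {k:.3f}, analytic SE (Equation 7) = {se_eq7(k, p, len(pairs)):.3f}")

# Bootstrap: resample patients with replacement, recompute kappa, and repeat 500 times.
random.seed(1)
boots = sorted(kappa_hat(random.choices(pairs, k=len(pairs)))[0] for _ in range(500))
mean_boot = sum(boots) / len(boots)
se_boot = math.sqrt(sum((b - mean_boot) ** 2 for b in boots) / (len(boots) - 1))
print(f"bootstrap SE = {se_boot:.3f}, 95% CI = ({boots[12]:.3f}, {boots[487]:.3f})")
```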

HOW RELIABLE IS RELIABLE ENOUGH?

The most controversial issues in evaluating diagnoses, especially diagnoses of mental disorders, are whether such disorders exist, and even if they do exist, whether it is worth pursuing better quality diagnoses in the absence of specific knowledge about the disorder. Close behind those issues, however, is the question of how reliable is reliable enough. Kappa originated in the process of developing the DSM-III (Am. Psychiatr. Assoc. 1980), when very little knowledge about how kappa "worked" was available. The projected standards for acceptable reliability were then set very high, say k > 0.75, without any empirical justification. Moreover, there was then little knowledge of the SEs of the kappas and experience with using them, and the sample sizes on which the field trials were based were, by present standards, much too small. However, DSM-III field trials (Spitzer & Forman 1979) were designed as well as methodological knowledge permitted at that time, and they were the model for DSM-5 field trials.

The DSM-IV field trials addressed a quite different question. Each field trial was conducted by the work group that originated the diagnoses being evaluated. Thus raters were given a broad hint about which diagnoses to focus on rather than asked to use the DSM (with all its 150 or so diagnoses), as they would be in practice. In addition, there was at least the appearance of conflict of interest. It is difficult for the originators of any proposal to design and execute a trial evaluating their own decisions without some bias. Some DSM-IV field trials excluded patients who were hard to diagnose. In some, there was a level of training and feedback for raters beyond what clinicians are likely to have access to in practice. In some field trials, raters were required to be trained in, and to use, structured interview methods that clinicians ordinarily would not use. As a result, the reliability coefficients reported were very high and for selected populations were appropriate, but specially trained and focused raters are not usually used for clinical DSM diagnoses.

In deciding to do DSM-5 field trials in clinical settings, with representative samples of patients accessing those settings, with representative clinicians as raters, and focusing exclusively on test-retest reliability, it was clear that even if diagnostic quality were improved, DSM-5 reliabilities


would be lower than reported DSM-IV reliabilities. Reasonable standards were required before the DSM-5 field trials began (cf. Landis & Koch 1977). These goals were based on a perusal of the reliabilities reported for various medical diagnoses done by clinicians in clinical populations, including coronary arteriography and X-ray examinations (Detre et al. 1975; Koran 1975a,b). Very seldom is the test-retest kappa of a categorical diagnosis above 0.8. Some highly regarded and trusted medical diagnoses appear to have kappas between 0.6 and 0.8 and are characterized as “very good” (the terms vary). The bulk of medical diagnoses evaluated appear to have kappas between 0.4 and 0.6, which are characterized as “good.” Some have kappas between 0.2 and 0.4, which are characterized as “questionable.” Kappas below 0.2 were characterized as “unacceptable.” Given a choice, everyone would aim to have kappas above 0.75, but is this reasonable in every case? It must be remembered that one source of unreliability is the random variation of expression of signs and symptoms within each individual while disorder status remains unchanged. Thus a diagnosis with no bias and no contaminants, measured as well as can be at one time point, may have an upper limit on reliability considerably less than 0.75. From a methodological point of view at least, it was not surprising that in DSM-5 field trials, major depressive disorder (MDD) and generalized anxiety disorder (GAD) had “questionable” reliabilities, for these are disorders with a great deal of within-patient variability in expression. At the same time, clinicians vary in terms of acumen in interviewing, sensitivity to certain signs and signals, and interpretation of the criteria, and this places an upper limit on how high the interrater or test-retest reliability can be in clinical settings. The more subjective the criteria and the more subject to interpretation, the lower this upper achievable limit might be. In research settings, reliabilities can be made much higher by various strategies. The reliability standards set by DSM-5 are not a gold standard to be used in later reliability studies. It is important that standards be determined by the perusal of reliabilities already achieved in the particular context before designing a reliability study. Thus, if diagnoses for a particular disorder in a particular population already exist, if their validity is not in question, and if they have achieved reliabilities of 0.75 in the population of interest, then any new diagnosis with reliability less than 0.75 would be at least questionable and perhaps unacceptable. On the other hand, if no such diagnoses exist, and reliabilities for diagnoses of comparable disorders in comparable populations tend to be around 0.2, then a new diagnosis with reliability of 0.4 would be very welcome. It is crucial that the standard be set a priori and the results be reported in accord with those standards.

IMPROVING RELIABILITY

To improve the quality of a diagnosis (Figure 1), one might seek to introduce new information specific to the disorder (in psychiatric diagnosis, examples include genetic, biochemical, imaging, and phenomenological information) and to reduce contaminants and error. The first task, introducing new relevant information, is what the Research Domain Criteria Project is pursuing, and both tasks (introducing new information and reducing contaminants and errors) were those assigned to the work groups in the DSM-5 process. Here we focus only on reliability. To improve reliability without changing the information base, one needs to reduce the error of measurement. There are essentially three strategies for doing so: (a) standardize the conditions of diagnosis, (b) train and qualify the raters, and (c) deal with the random within-patient variability.

Standardize Conditions

A major source of intrapatient variability in response (while disorder status remains constant) is environmental. Some responses exhibit diurnal cycles, some weekly cycles, some monthly (perhaps
attuned to the menstrual cycle in women). Some responses are influenced by how recent or how large the patient's last meal was, by whether the patient slept well the night before, or by whether the patient had an argument with a spouse, teacher, or boss. Consequently, many tests used as the basis of medical diagnoses standardize conditions of testing to remove extraneous random variance. The patient may be instructed to fast for eight hours before blood is sampled, to drink a large amount of water before a sonogram, or to drink a specified quantity of a sweet solution before a glucose tolerance test. To get a more reliable blood pressure reading, the patient may be instructed to rest quietly for five minutes before the reading is taken.

The same principle applies to improving the reliability of the protocols for psychiatric diagnosis. If it were known that there are diurnal, weekly, or monthly cycles to behavior and emotional or cognitive responses, the protocol should specify under what conditions diagnosis should be done. In general, an effort might be made to control any strong environmental influences on diagnosis, if such were known. A hypothetical example: It was previously noted that the reliability of one proposed child diagnosis in the DSM-5 field trials was "good" at one site and "unacceptable" at two other sites. The first site had many inpatients, whereas the last two sites had mostly outpatients.

Diagnoses of child/pediatric disorders are handicapped by the fact that the information obtained is frequently from a parent (by testing, observation, or self-report) rather than directly from the patient. Parent report of child behavior may be influenced by parental issues and thus may introduce major contaminants and error. For instance, in a study of low-birthweight premature infants, the mother's report of the child's health status at 3 years of age was found to be more highly correlated with her vocabulary level (Gross et al. 1997, p. 177) than with the child's actual health status (p. 179). Thus, it may be that a reliable (and valid) diagnosis of some childhood mental disorders requires direct observation of the child for a short while in a controlled clinical setting, perhaps as an inpatient.

Train/Qualify Raters

In the DSM-5 field trials, clinicians at the various clinics were given training in the use of DSM-5 comparable to that available to all clinicians once DSM-5 was published. Nevertheless, there are individual differences among clinicians, both in skill of observation and interviewing and in adherence to the DSM-5 criteria. As was recognized in the DSM-IV field trials, reliability can be improved by more intensive training of clinicians, by providing feedback on completed assessments, and by developing and requiring the use of structured interview methods and strict adherence to the protocol and criteria. However, requiring that a diagnosis not be labeled a DSM-5 diagnosis unless it is made by a clinician who has passed a qualification test and uses a structured interview method is not likely to be acceptable to the American Psychiatric Association, which publishes the DSM, or to the psychiatrists, psychologists, and others who deliver psychiatric diagnoses. Nevertheless, some strategy to train or qualify raters, as is often recommended and done in research settings, may be necessary to improve the reliability of diagnosis of mental disorders.

Use Multiple Ratings for Diagnosis

With biochemical assays, it is frequently required that a tissue sample be separated into several (often three) aliquots, that these be independently assayed, and that the results be averaged to obtain a more reliable assay result. In research studies that require a reliability of diagnosis much higher than that typically found in clinical use, each patient might be examined by several (often three) independent clinicians and their consensus used as the diagnosis for the patient. Such strategies are justified by a 1910 finding referred to as the Spearman-Brown projection (Brown 1910, Spearman
1910). If the reliability of a single measurement (here an interval measure) is $r_1$, then the reliability of the average of $m$ such independent measurements, $r_m$, is

$$r_m = \frac{m\,r_1}{(m-1)\,r_1 + 1}. \qquad (9)$$

Thus, if the reliability of a single measure is above 0, one can always step up the reliability by averaging multiple independent observations. To step up a reliability of $r_1 > 0$ to any higher level $r^*$, one would need $m$ raters, where

$$m > \frac{r^*\,(1 - r_1)}{(1 - r^*)\,r_1}. \qquad (10)$$

To move a reliability up one step (from 0.2 to 0.4, 0.4 to 0.6, or 0.6 to 0.8), one would need about 3 raters; to move it up two steps (from 0.2 to 0.6, or from 0.4 to 0.8), 6 raters; to move it up three steps (from 0.2 to 0.8), 16 raters. But the diagnoses here are categorical, not interval measures, and the repeated diagnoses per patient in clinical practice are not likely to be independent. Nevertheless, the principle remains that if the diagnosis is based on several visits, the reliability of the diagnosis, even without any change in criteria, will be improved (Kraemer 1979, 1992a).
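A minimal numeric sketch of Equations 9 and 10 may make these rater counts concrete. The function names are ours, not the article's, and the formulas apply strictly to independent interval measurements.

```python
import math

def reliability_of_mean(r1: float, m: int) -> float:
    """Equation 9: reliability of the average of m independent measurements,
    each with single-measurement reliability r1."""
    return m * r1 / ((m - 1) * r1 + 1)

def raters_needed(r1: float, r_target: float) -> int:
    """Equation 10: smallest m whose averaged rating reaches r_target."""
    return math.ceil(r_target * (1 - r1) / ((1 - r_target) * r1))

print(raters_needed(0.2, 0.4))                 # 3  (one step up)
print(raters_needed(0.2, 0.6))                 # 6  (two steps up)
print(raters_needed(0.2, 0.8))                 # 16 (three steps up)
print(round(reliability_of_mean(0.2, 16), 2))  # 0.8
```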

DISCUSSION

To summarize the key issues in designing a reliability study of a categorical diagnosis, one must:

• Set the standards for reliability in the specific context by perusal of past related experience in the same population.
• Define the protocol, response, and referent of the diagnosis to be evaluated.
• Determine the sample size necessary to achieve adequate precision of estimation with the intended sampling and design.
• Sample patients from the population to which the results are to be applied, according to whatever design was selected.
• Have each sampled patient rated two or more times, with raters randomly sampled from the population of raters to which the results are to be applied.
• Estimate the intraclass kappa and present its SE and 95% confidence interval (a minimal computational sketch follows this list).
• Check that the trial has not failed (i.e., that the SE or confidence interval is not inadequate by the a priori standards).
• If the trial has not failed, report the kappa and its confidence interval and compare the results with the a priori standards of reliability.
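The estimation step in this checklist might look roughly like the following sketch for the simplest design: two blinded ratings per patient and a binary diagnosis. The estimator shown is the usual intraclass kappa for interchangeable ratings, and the bootstrap over patients is one of several ways to obtain an SE and confidence interval; treat these details as illustrative assumptions rather than the field-trial procedure itself.

```python
import numpy as np

def intraclass_kappa(x1, x2):
    """Intraclass kappa for binary (0/1) diagnoses from two interchangeable
    ratings per patient: 1 - observed discordance / chance discordance 2p(1-p)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    p = np.concatenate([x1, x2]).mean()      # pooled prevalence estimate
    if p in (0.0, 1.0):
        raise ValueError("prevalence of 0 or 1: kappa is undefined")
    discordance = np.mean(x1 != x2)          # proportion of disagreeing pairs
    return 1.0 - discordance / (2.0 * p * (1.0 - p))

def bootstrap_se_ci(x1, x2, n_boot=2000, seed=0):
    """Resample patients (keeping each patient's pair of ratings intact) to
    obtain a bootstrap SE and percentile 95% CI for the intraclass kappa."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    rng = np.random.default_rng(seed)
    n = len(x1)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # sample patients with replacement
        boots.append(intraclass_kappa(x1[idx], x2[idx]))
    boots = np.array(boots)
    return boots.std(ddof=1), np.percentile(boots, [2.5, 97.5])
```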

Outside of the choice of intraclass kappa as the measure of reliability, the same principles apply to dimensional diagnoses. For a dimensional diagnosis, the intraclass correlation coefficient is substituted for the intraclass kappa (Algina 1978, Bartko 1976, Donner & Bull 1983, Ramasundarahettige et al. 2009, Rothery 1979, Shrout & Fleiss 1979). Indeed, it has long been known that if one applies the formulas for the intraclass correlation coefficient to binary (coded 1 and 0) data, one ends up with the intraclass kappa. Under parametric assumptions, the SEs and confidence intervals for the intraclass correlation coefficient can easily be obtained (Algina 1978, Bartko 1976, Donner & Wells 1986, Rothery 1979, Shrout & Fleiss 1979); when the parametric assumptions fail, bootstrap methods may still be used. The standards for the reliability of dimensional diagnoses would generally be higher than those for the reliability of a categorical diagnosis, as was true for the standards articulated for dimensional measures in DSM-5 (Kraemer et al. 2012). This is simply because a dimensional diagnosis picks up a great deal more variance than does a categorical diagnosis for the same disorder; but again, the standards depend on context.
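For a dimensional diagnosis, the analogous estimation step is the intraclass correlation coefficient. A minimal sketch under a one-way random-effects model with two interchangeable ratings per patient might look like the following; the parametric SE/CI formulas cited above, or the same patient-level bootstrap shown earlier, could then be applied. The model choice and function name are assumptions for illustration.

```python
import numpy as np

def icc_oneway(y1, y2):
    """ICC(1,1): one-way random-effects intraclass correlation from two
    interchangeable ratings per patient (on 0/1 data this is essentially
    the intraclass kappa, as noted in the text)."""
    y = np.column_stack([np.asarray(y1, float), np.asarray(y2, float)])
    n, k = y.shape
    grand_mean = y.mean()
    ms_between = k * np.sum((y.mean(axis=1) - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```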

127

ARI

11 February 2014

8:24

There are a few problems with the intraclass correlation coefficient not found with the intraclass kappa. Many dimensional diagnoses are based on scoring lists of questions (e.g., scores on multi-item tests or lists of symptoms). In such cases, Cronbach's alpha is often reported as the reliability. However, alpha is not a reliability coefficient but rather a measure of the internal consistency of the responses to the questions on that list. It is easy to show that one can have a Cronbach's alpha of 1 when the reliability is near zero, or a Cronbach's alpha of 0 when the reliability is very high. Similarly, split-half correlations are not reliability coefficients; they too reflect the internal consistency of responses to the questions on the list. Both Cronbach's alpha and split-half correlations are very useful in test construction, but the reliability of the score on a multi-item list is still based on two or more blinded assessments of the total score, preferably test-retest.

There are many unsolved problems in assessing reliability in general. One salient problem is dealing with comorbid diagnoses, which are ubiquitous in the diagnosis of mental disorders as well as of physical disorders in older populations. It is not unusual, in the DSM-5 field trials and in many other studies using DSM diagnoses, for patients to be given two, three, or four separate diagnoses. In the DSM-5 field trials the diagnoses were assessed for quality individually: When one clinician reported both MDD and GAD and another only GAD, they were counted as in agreement on GAD and in disagreement on MDD. However, if the correct treatment of the patient depends on recognizing the comorbid occurrence of MDD and GAD, these clinicians are obviously neither in complete agreement nor in complete disagreement; rather, they are in partial agreement, a fact that is ignored. Comorbidity may be one reason that diagnoses such as MDD and GAD have only "questionable" reliability. No solution yet proposed for this problem seems satisfactory (Kraemer 1980).

In many cases, a useful rule of thumb is that if N subjects are needed for adequate power in testing or adequate precision of estimation with a completely reliable measure, then N/r subjects are needed if the test-retest reliability of the measure is r. Unreliability may account for the inability to document clinically significant effects in randomized clinical trials or risk factors for the onset of disorders, or, in general, to detect any important signals in research. With the unreliability of diagnoses in particular, the situation is of even more concern, since basing clinical decisions in patient care on unreliable diagnoses may result in patients being denied treatment they need or being treated for disorders they do not have. Moreover, since reliability of diagnosis is essential to raising the necessary questions about, and to improving, the validity of diagnoses, progress in both research and clinical decision making is compromised by unreliable diagnoses. In short, reliability of measurement is always essential, but it is particularly so for clinical diagnoses, which affect both patient care and clinical research.
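As a small worked illustration of the N/r rule of thumb just quoted (the function name and the example numbers are ours, not the article's):

```python
import math

def adjusted_sample_size(n_with_perfect_measure: int, reliability: float) -> int:
    """Rule of thumb: inflate the planned N by a factor of 1/r when the
    measure's test-retest reliability is r (r must be greater than 0)."""
    return math.ceil(n_with_perfect_measure / reliability)

# If 200 patients would suffice with a perfectly reliable diagnosis, a
# diagnosis with test-retest reliability 0.6 needs roughly:
print(adjusted_sample_size(200, 0.6))   # 334
```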

DISCLOSURE STATEMENT

The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

LITERATURE CITED

Algina J. 1978. Comment on Bartko's "On various intraclass correlation reliability coefficients." Psychol. Bull. 85:135–38
Am. Psychiatr. Assoc. 1980. Diagnostic and Statistical Manual of Mental Disorders. Washington, DC: Am. Psychiatr. Publ. 3rd ed.
Am. Psychiatr. Assoc. 1994. Diagnostic and Statistical Manual of Mental Disorders. Washington, DC: Am. Psychiatr. Publ. 4th ed.
Am. Psychiatr. Assoc. 2013. Diagnostic and Statistical Manual of Mental Disorders. Washington, DC: Am. Psychiatr. Publ. 5th ed.
Bartko JJ. 1976. On various intraclass correlation reliability coefficients. Psychol. Bull. 83:762–65
Berkson J. 1946. Limitations of the application of fourfold table analysis to hospital data. Biometrics Bull. 2:47–53
Berkson J. 1955. The statistical study of association between smoking and lung cancer. Proc. Staff Meet. Mayo Clin. 30:56–60
Berry KF, Johnston JE, Mielke PW Jr. 2005. Exact and resampling probability values for weighted kappa. Psychol. Rep. 96:243–52
Bloch DA, Kraemer HC. 1989. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 45:269–87
Brown GW. 1976. Berkson fallacy revisited: spurious conclusions from patient surveys. Am. J. Dis. Child. 130:56–60
Brown W. 1910. Some experimental results in the correlation of mental abilities. Br. J. Psychol. 3:296–322
Clarke DE, Narrow WE, Regier DA, Kuramoto SJ, Kupfer DJ, et al. 2013. DSM-5 field trials in the United States and Canada, part I: study design, sampling strategy, implementation, and analytic approaches. Am. J. Psychiatry 170:43–58
Cohen J. 1968. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70:213–29
Detre KM, Wright E, Murphy ML, Takaro T. 1975. Observer agreement in evaluating coronary angiograms. Circulation 52:979–86
Donner A, Bull S. 1983. Inferences concerning a common intraclass correlation. Biometrics 39:771–75
Donner A, Wells G. 1986. A comparison of confidence interval methods for the intraclass correlation coefficient. Biometrics 42:401–12
Efron B. 1988. Bootstrap confidence intervals: good or bad? Psychol. Bull. 104:293–96
Efron B, Gong G. 1983. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am. Stat. 37:36–48
Efron B, Tibshirani R. 1995. Computer-Intensive Statistical Methods. Stanford, CA: Div. Biostat., Stanford Univ.
Elwood RW. 1993. Psychological tests and clinical discriminations: beginning to address the base rate problem. Clin. Psychol. Rev. 13:409–19
Finney DJ. 1994. On biometric language and its abuses. Biom. Bull. 11:2–4
Fleiss JL. 1971. Measuring nominal scale agreement among many raters. Psychol. Bull. 76:378–82
Fleiss JL. 1981. Statistical Methods for Rates and Proportions. New York: Wiley
Fleiss JL, Cicchetti DV. 1978. Inference about weighted kappa in the non-null case. Appl. Psychol. Meas. 2:113–17
Fleiss JL, Cohen J, Everitt BS. 1969. Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72:323–27
Frances A. 2013. Saving Normal: An Insider's Revolt Against Out-of-Control Psychiatric Diagnosis, DSM-5, Big Pharma, and the Medicalization of Ordinary Life. New York: William Morrow
Greenberg G. 2013. The Book of Woe: The DSM and the Unmaking of Psychiatry. New York: Blue Rider Press
Gross RT, Spiker D, Haynes CW. 1997. Helping Low Birth Weight, Premature Babies. Stanford, CA: Stanford Univ. Press
Koran LM. 1975a. The reliability of clinical methods, data and judgments, part 1. N. Engl. J. Med. 293:642–46
Koran LM. 1975b. The reliability of clinical methods, data and judgments, part 2. N. Engl. J. Med. 293:695–701
Kraemer HC. 1979. Ramifications of a population model for k as a coefficient of reliability. Psychometrika 44:461–72
Kraemer HC. 1980. Extensions of the kappa coefficient. Biometrics 36:207–16
Kraemer HC. 1992a. How many raters? Toward the most reliable diagnostic consensus. Stat. Med. 11:317–31
Kraemer HC. 1992b. Evaluating Medical Tests: Objective and Quantitative Guidelines. Newbury Park, CA: Sage
Kraemer HC. 2013. Validity and psychiatric diagnosis. Arch. Gen. Psychiatry 70:138–39
Kraemer HC, Kupfer DJ, Clarke DE, Narrow WE, Regier DA. 2012. DSM-5: How reliable is reliable enough? Am. J. Psychiatry 169:13–15
Landis JR, Koch GG. 1977. The measurement of observer agreement for categorical data. Biometrics 33:159–74
Lord FM, Novick MR. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley
Ramasundarahettige CF, Donner A, Zhou GY. 2009. Confidence interval construction for a difference between two dependent intraclass correlation coefficients. Stat. Med. 28:1041–53
Regier DA, Narrow WE, Clarke DE, Kraemer HC, Kuramoto SJ, et al. 2013. DSM-5 field trials in the United States and Canada, part II: test-retest reliability of selected categorical diagnoses. Am. J. Psychiatry 170:59–70
Robins LN, Barrett JE, eds. 1989. The Validity of Psychiatric Diagnosis. New York: Raven
Rothery P. 1979. A nonparametric measure of intraclass correlation. Biometrika 66:629–39
Shrout PE, Fleiss JL. 1979. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86:420–28
Spearman C. 1910. Correlation calculated from faulty data. Br. J. Psychol. 3:271–95
Spitzer RL, Forman JB, Nee J. 1979. DSM-III field trials: I. Initial interrater diagnostic reliability. Am. J. Psychiatry 136:815–20
Spitznagel EL, Helzer JE. 1985. A proposed solution to the base rate problem in the kappa statistic. Arch. Gen. Psychiatry 42:725–28
Veiel HOF. 1988. Base-rates, cut-points, and interaction effects: the problem with dichotomized continuous variables. Psychol. Med. 18:703–10
