
Measurement of reliability for categorical data in medical research

Helena Chmura Kraemer
Department of Psychiatry and Behavioral Sciences, Stanford University

The problem of measuring reliability of categorical measurements, particularly diagnostic categorizations, is addressed. The approach is based on classical measurement theory and requires interpretability of the reliability coefficients in terms of loss of precision in estimation or power in statistical tests. A general model is proposed, leading to definition of reliability indices. Design and estimation approaches are discussed. Issues and approaches found in the research literature that lead to confusing or misleading results are presented. The signs and symptoms of unreliable diagnoses are identified, and strategies for improving the reliability of such diagnoses are discussed.

1 Introduction

It is said that when Gertrude Stein lay dying, she roused briefly and asked her assembled friends: 'Well, what's the answer?'. They remained uncomfortably silent, at which she sighed: 'In that case, what's the question?'. There is an uncomfortable analogy here with attempting to summarize how to assess reliability of categorical measurements or classifications, particularly those related to diagnostic categorizations for use in medical research and clinical practice. While there is a great deal in the statistical literature related to this issue, 'answers' seem unclear, inconsistent and often controversial. Worse yet, when applied in the 'real world' to make clinical or research decisions about patients, the 'answers' often prove misleading. I propose that this is not for lack of answers, but from poor clarity in defining which questions are being addressed. Thus in this review I will carefully, even tediously, specify which questions are being addressed, and indicate which other questions (as important or interesting as they may be) are not dealt with here.

The goal here is to seek a meaningful and interpretable index (or indices) of how well a categorical measure reflects some characteristic of a subject in a certain population. Such categorical measures include diagnostic categorization, as well as assignments to racial, ethnic, political and religious groups, and many other such characterizations important in research in medicine, education, psychology, sociology or political science. However, here we will tend to focus on medical diagnostic classifications. The focus is on 'categorical' measures, here defined as measures having more than two possible responses that cannot logically be ordered. Thus there are three or more predefined labels (categories), one of which each rater/rating will elect to assign each patient. The act of so assigning labels to patients is 'classification', and the label applied to a particular subject is a categorical measure for that subject. Measures with only two possible responses (Does the patient have schizophrenia? Yes/No) will here be called 'binary' measures. Measures with more than two possible responses that can be logically ordered (3, 4, or 5 point scales, height, weight etc.) will be called 'ordinal' measures.

Address for correspondence: Professor Helena Chmura Kraemer, Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA 94305, USA.

In particular, ordered categories (Patient status: worse, same, better) are here 'ordinal' not 'categorical'. That is important, for many statistical treatments dealing with ordered categories ignore order.

For each type of measure (whether categorical, binary or ordinal), there may be further special assumptions made. In some cases the validity of the results based on those assumptions may be strictly limited to situations where the assumptions hold, and in others the results may be robust to some deviations from those assumptions. Whichever the case, what assumptions underlie the results and the robustness of results to deviations from those assumptions are important issues to consider when applying those results in medical research meant to elucidate 'real world' situations. In what follows, a premium is placed on making as few restrictive assumptions as possible and making them very explicitly.

The process of making assumptions about the nature of the populations of subjects or raters/ratings, or about the nature of the measurement system, its process or its results, is 'modelling', fundamental to the theory and practice of all statistical analysis. However, a prime source of difficulty is that theoreticians tend to make very restrictive assumptions. Practitioners tend to ignore them and thus tacitly to accept them, even where there is major incongruity between assumptions and reality. In turn, this leads to an incongruity between what is predicted from the model and what really happens.


2 Reliability as a special class of agreement: historical perspective

We will be addressing the specific problem of assessing reliability, not the more general problem of assessing agreement. There is good reason for this focus that merits some review.

The history of development of the assessment of agreement between categorical measurements has been largely based on ad hoc approaches. Such approaches address the following problem: In a set of data with N subjects, and two or more raters/ratings per subject, each rater/rating making an assignment to one of C categories, how does one assess agreement? The response is frequently to propose a statistic that reflects some intuitive definition of agreement. Generally this proposal involves 1) defining some measure of pairwise agreement, 2) giving to each subject an agreement score equal to the pairwise agreement measure averaged over all pairs of raters/ratings, 3) averaging the agreement scores over subjects, and 4) assessing how the average agreement score relates to what one would define as 'random' agreement and 'ideal' agreement in that dataset.

The kappa coefficient, as one such example, has several forms, a weighted form in which partial credit is given to some disagreement, and an unweighted one in which credit is given only to full agreement.2 Generally subjects are assumed to be randomly sampled from some population, but raters/ratings might be regarded as random or as fixed, and if fixed, as homogeneous or as heterogeneous. This leads to different definitions of what is 'random' and 'ideal'. While the kappa coefficients have been proposed for unordered categories, one can apply them to ordered categories by incorporating consideration of order into the weighting of a weighted kappa coefficient.

Clearly if an ad hoc measure of agreement is used simply as a descriptive statistic, what matters is whether the communication of the definition is clear and accurate. However, as a basis of inferential statistics, i.e. generalization of results from the sample to a


population, as in parameter estimation or tests of statistical hypotheses, there are major problems with this ad hoc approach. Whether the subjects are a random sample from some population, some type of stratified sample, or a fixed set, makes a difference in how any proposed statistic can be interpreted. Similarly whether the raters/ratings are a random sample from some population, some stratified sample, or a fixed set, also makes a difference. Whether 'agreement' between a pair of raters/ratings is taken to mean identical responses, with any discrepancy considered 'disagreement', or whether partial credit is given for certain types of discrepancies (as in a weighted kappa coefficient) also makes a difference. Added to these problems, moreover, how one defines 'randomness' or 'ideal' can also differ between one formulation of the problem of assessing agreement and another.3 The upshot is that there are many measures proposed for categorical agreement, but no consensus on which is or is not appropriate, on the interpretations of their magnitudes, nor on the statistical inferences that can be based on them.

In contrast, the special problem of assessing reliability of categorical measures is based on the conceptualization and assessment of reliability of the classical approaches to measurement theory.4 Such conceptualization requires specifying a population of subjects, and the assessment is based on drawing a representative sample from that population. Such conceptualization also requires specifying the population of raters/ratings to which a single measurement is to generalize,5 and the assessment is based on drawing a representative sample from that population for each subject. Intraobserver, interobserver, and test-retest reliabilities differ in what population of raters/ratings is specified. Which specification is most relevant in a particular situation depends on whether one is concerned about measuring a state or a trait of a subject, and if a trait, over what span of time. Whatever the conceptualization, the multiple ratings per subject must be independent. The independence of the multiple ratings per subject is not an assumption, but a condition that must be guaranteed in the design of a reliability study, for example, by blinding all classifications to each other.

We will not be addressing the problem of assessing validity, even though the issue of validity is undoubtedly more important to medical research than is reliability. To do so would require an external criterion, a 'gold standard', often lacking in 'real life' situations. For reliability assessment the criterion is an internal one, the consensus of the population of raters/ratings for a subject. We will also not be addressing the question of assessing raters' performance. Generally, if raters perform poorly, the resulting measures will poorly characterize the subjects, i.e. will be unreliable. However, good performance of raters does not guarantee that the classifications raters generate will be reliable. The fault may lie, not in the raters, but in the instruments they use or the materials to which, and the circumstances under which, the raters apply the instruments. There is a strong methodological literature on assessment of rater performance6-9 to which interested readers might refer for further information on this important subject, but that is not the issue here. Furthermore, any assessment approach that 'forgives' errors of measurement, whether those of raters, of instruments, or of situation, is not considered here as a viable approach to assessing reliability, for such errors are the components of unreliability.3 To 'forgive' them by removing them mathematically from consideration produces overly optimistic views of reliability, views that poorly predict what will then happen when such measures are used in medical research or in clinical practice.


In essence then, the question of reliability posed here arises after efforts have been made to clarify definitions, to train raters, to standardize conditions of testing, i.e. after every effort has been made to perfect the system of measurement. In short, we are poised to use this measurement in clinical or research situations and are concerned about the quality of the product (the measures), not that of the process (raters, instruments, situations).

For this reason, any coefficient of reliability (one or several) proposed must be meaningful and interpretable in terms of its effects on the proposed research or clinical uses. An indication of how one may quantitatively predict the effects on bias or precision of estimation, and on power of statistical tests, by the magnitude of the coefficient is necessary to an acceptable definition of a coefficient of reliability. In Section 3 we review briefly the classical approach as developed in classical measurement theory for an ordinal measure under certain assumptions, and in Section 4 show how that same approach can be applied to binary measurements without restrictive assumptions. This sets the stage for applying the same approach to categorical measures. In Section 5, we extend this approach to categorical measures. In Section 6, we discuss some applications of this approach that depend on very restrictive assumptions designed to simplify the mathematics, thus to 'sugar coat' the problem, and show both the value and the limitations of such approaches. In Section 7, finally, we summarize a proposal to assess categorical reliability in medical research, and suggest ways that the information on reliability so obtained might be used to reduce negative consequences of unreliability.

When all is said and done, the situation and prospect is nowhere near as grim as that at Stein's deathbed, but it is a situation that is not as easy or rosy as researchers, both medical and methodological, might hope. However, the answers are optimistic, in that they generate viable strategies to resolve any resulting problems, and may thus generate avenues toward improving the quality of diagnosis for use in medical research and clinical practice.

3 Classical approach to reliability

3.1 The model and definition
The classical approach4 to assessment of reliability assumes that the ordinal measurement for subject i (sampled from a population of subjects), by a rater j (sampled from a population of raters or ratings), designated X_ij, can be expressed as:

X_ij = ξ_i + ε_ij.

The parameter ξ_i is called subject i's 'consensus score', and ε_ij is its error of measurement. (What we here call the 'consensus score' has often been called the 'true score', but that terminology tends to suggest validity rather than reliability.) Here it is assumed that the mean error is zero, and that the errors are independent of the consensus scores. This is a restrictive assumption. In the 'real world', the error variance is often directly related to the consensus score (as we will see below). For a fixed subject i, over the population of raters/ratings:

E(X_ij) = ξ_i,  Var(X_ij) = σ²_ε, the error variance.


Over the population of subjects:

E(X_ij) = E(ξ_i) = μ, the population observed mean,
Var(ξ_i) = σ²_ξ, the consensus score variance,
Var(X_ij) = σ²_ξ + σ²_ε, the observed score variance.

The reliability coefficient in this case is defined as:

ρ = σ²_ξ / (σ²_ξ + σ²_ε),

i.e. the ratio of the consensus score variance to the observed score variance.

3.2 Meaningful and interpretable?
The particular coefficient ρ is meaningful and interpretable in a variety of ways1:

1) Reproducibility: The correlation between a first and second independent rating of subjects is ρ. Thus ρ indicates how well a single rating per subject would predict a second, and is thus a measure of reproducibility of the measure.

2) Association between observed and consensus scores: The correlation between a single measure per subject and that subject's consensus score is ρ^1/2. Thus how well a single rating predicts that subject's consensus score is also indicated by the magnitude of ρ.

3) Variance inflation: If the variance of the consensus scores is σ²_ξ, the variance of the observed scores is σ²_ξ/ρ. Thus the index ρ indicates the inflation of within group variance due to unreliability.

4) Effects on bias, precision, power: The mean of single ratings per subject and the consensus scores is the same. Thus there is no bias introduced in the estimation of the population observed mean. However, an estimate of the correlation between two characteristics, as measured by X above and another ordinal measure Y similarly defined, based on the correlation coefficient between X and Y, would be an attenuated estimator, attenuated by the geometric mean of the two reliability coefficients. Thus if the characteristics were perfectly correlated, but the reliabilities of X and Y were each 0.4, the correlation between their measured values X and Y would be 0.4 and far from perfect. Thus estimators of certain population parameters would be biased due to the unreliability of X, and the magnitude of that bias is indicated by ρ. Because of the inflation of variance due to unreliability, ρ indicates the degree of loss of precision in estimation of population parameters such as the population mean or correlation coefficients. Furthermore, with inflation of within-group variance, it indicates the degree of attenuation of effect size for comparisons of means in hypothesis testing, and thus the attenuation of power. Indeed it can be shown that the power one might expect with a sample size of N in a two sample t-test if the measure were totally reliable (ρ = 1) would require approximately N/ρ subjects if the measure had reliability coefficient ρ. With a measure with ρ = 0.5 one would need twice as many subjects, and with ρ = 0.2, five times as many subjects, to get the same power.
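As a rough numerical illustration of points 3) and 4), the minimal sketch below (not from the paper; numbers are hypothetical) computes the attenuated correlation and the approximate sample size N/ρ needed to preserve power:

```python
# Illustrative sketch (hypothetical numbers): attenuation of a correlation by
# unreliability, and the approximate sample-size inflation factor N/rho.

def attenuated_correlation(true_corr, rho_x, rho_y):
    """Observed correlation = true correlation times the geometric mean of
    the two reliability coefficients (classical model)."""
    return true_corr * (rho_x * rho_y) ** 0.5

def required_n(n_if_perfect, rho):
    """Approximate number of subjects needed for the same power in a
    two-sample t-test when the measure has reliability rho."""
    return n_if_perfect / rho

print(attenuated_correlation(1.0, 0.4, 0.4))   # 0.4, as in the example above
for rho in (1.0, 0.5, 0.2):
    print(rho, required_n(100, rho))           # 100, 200, 500 subjects
```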


3.3 Design, estimation and hypothesis testing in reliability studies
With a sample of N subjects from the appropriate population of subjects, and a sample of m (> 1) ratings from the appropriate population of raters/ratings, one can estimate ρ using the intraclass correlation coefficient.10-14 If we add a further assumption that both the consensus scores and the errors are normally distributed, the sampling distribution of the sample intraclass correlation coefficient is known and easily applied to estimate its standard error, to obtain confidence intervals or to test statistical hypotheses.1,15,16 There are several forms of intraclass correlation coefficient that might be used with data so generated, and this has occasioned some controversy.10,12,13 Furthermore, when m = 2, the product moment correlation coefficient, not the intraclass, is the most commonly used sample estimate of the reliability coefficient. However, when the first, second, third etc. raters for each subject are randomly sampled from the population of raters, all these various forms estimate the same population parameter, namely ρ, differing only slightly in precision (standard error). Finally there is some, but not unlimited, robustness of results to deviations from the various limiting assumptions above. Results are particularly vulnerable when there are too many tied values (e.g. with 3, 4 or 5 point ordinal scales, or ordered categories), or when the within subject variance depends on the consensus score.
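As one concrete possibility (a minimal sketch, and only one of the several intraclass forms mentioned above), the one-way random-effects intraclass correlation can be computed from an N x m matrix of ratings as follows; the simulated check assumes the classical model with equal consensus and error variances:

```python
# Minimal sketch: one-way random-effects intraclass correlation from an
# N x m ratings matrix (subjects in rows), checked against a simulation.
import numpy as np

def intraclass_correlation(x):
    x = np.asarray(x, dtype=float)
    n, m = x.shape
    msb = m * x.mean(axis=1).var(ddof=1)        # between-subject mean square
    msw = x.var(axis=1, ddof=1).mean()          # within-subject mean square
    return (msb - msw) / (msb + (m - 1) * msw)

# Simulated check under X_ij = xi_i + e_ij with Var(xi) = Var(e) = 1, so rho = 0.5.
rng = np.random.default_rng(0)
xi = rng.normal(0.0, 1.0, size=(500, 1))        # consensus scores
x = xi + rng.normal(0.0, 1.0, size=(500, 3))    # three ratings per subject
print(round(intraclass_correlation(x), 2))      # approximately 0.5
```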

4 Binary measures

4.1 The model and definition
One situation in which there are many tied values and when the within subject variance depends strongly on the consensus score is that of binary measures. Suppose now that X_ij were a binary measure, i.e. could only take on the values 1 or 0 (1 represents Yes, 0 No; or 1 represents success, 0 failure etc.). Here for a fixed subject i over the population of raters/ratings1:

Prob(X_ij = 1) = E(X_ij) = p_i (the consensus score),
Var(X_ij) = p_i(1 − p_i).

Thus the within subject variance depends directly on the consensus score, and may differ from subject to subject in the population. Over the subjects in the population:

E(p_i) = P (often called the 'prevalence'),
Var(p_i) = σ²_p, the consensus score variance,
Var(X_ij) = P(1 − P), the observed score variance.

The reliability coefficient proposed here is the intraclass kappa coefficient:

κ = σ²_p / P(1 − P),

which again is the ratio of the consensus score variance to the observed score variance.
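The definition can be illustrated numerically; in the brief sketch below (a hypothetical population, not from the paper) the consensus scores take only two values:

```python
# Illustrative sketch (hypothetical population): the intraclass kappa as the
# ratio of consensus-score variance Var(p_i) to observed-score variance
# P(1 - P) for a binary measure.
import numpy as np

p_values = np.array([0.1, 0.9])      # possible consensus scores p_i
weights = np.array([0.7, 0.3])       # proportion of subjects with each value

P = np.sum(weights * p_values)                         # prevalence E(p_i)
consensus_var = np.sum(weights * (p_values - P) ** 2)  # Var(p_i)
observed_var = P * (1 - P)                             # Var(X_ij) for a 0/1 rating
print(P, round(consensus_var / observed_var, 3))       # kappa is about 0.599
```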


4.2 Meaningful and interpretable?

1) Reproducibility: Since

Prob(X_i2 = 1 | X_i1 = 1) = P + (1 − P)κ,
Prob(X_i2 = 1 | X_i1 = 0) = P(1 − κ),

where the above are the conditional probabilities of a second independent rating given the first, κ indicates the correspondence of a second opinion to the first, and is a measure of reproducibility.1,17

2) Association of a single rating with the consensus rating: Since

Corr(X_ij, p_i) = κ^1/2,

the magnitude of kappa also indicates the correspondence of a single rating to the consensus rating.

Otherwise, the effects of unreliability on bias and precision in estimation of parameters, and on power of statistical tests, are very similar to those for the special ordinal measure considered above.1,17

4.3 Design, estimation and hypothesis testing in reliability studies
With a sample of N subjects from the appropriate population, and a sample of m ratings for each subject from the appropriate population, one can estimate the population kappa using the sample intraclass kappa.2,18,19 Its exact sampling distribution is unknown, but tests and confidence intervals can be derived using Jackknife or Bootstrap methods.3 Once again, there are several forms of kappa coefficients and other coefficients, such as the phi coefficient, that are perhaps more commonly used than is the intraclass kappa.20-23 When the ratings for each subject are randomly sampled, however, once again the parameters these statistics estimate may be the same, but differ slightly in their precision.24

While this discussion leaves unaddressed many issues related to ordinal measure reliability, these constitute the basic essentials necessary to set the stage for consideration of reliability of categorical measures.
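For concreteness, the sketch below uses one moment-style estimator of the intraclass kappa (one possibility among those cited; the estimator, data and function names are illustrative, not taken from the references) together with a bootstrap confidence interval obtained by resampling subjects:

```python
# Illustrative sketch: a moment-style estimate of the intraclass kappa from an
# N x m binary ratings matrix, with a bootstrap confidence interval obtained
# by resampling subjects.
import numpy as np

def intraclass_kappa(x):
    x = np.asarray(x, dtype=float)
    n, m = x.shape
    p_hat = x.mean()                               # estimated prevalence P
    s = x.sum(axis=1)                              # positive ratings per subject
    within = np.mean(s * (m - s) / (m * (m - 1)))  # estimates P(1-P) - Var(p_i)
    return 1.0 - within / (p_hat * (1.0 - p_hat))

def bootstrap_ci(x, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    stats = [intraclass_kappa(x[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Simulated data: consensus scores p_i drawn from a Beta(2, 2) distribution,
# for which the population kappa is Var(p_i) / (P(1-P)) = 0.05 / 0.25 = 0.2.
rng = np.random.default_rng(1)
p_i = rng.beta(2, 2, size=200)
x = (rng.random((200, 3)) < p_i[:, None]).astype(int)
print(round(intraclass_kappa(x), 2), np.round(bootstrap_ci(x), 2))
```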

5 Extension to categorical measures

5.1 The model and definition
Now let X_ij represent an assignment of subject i by rater/rating j to one of C mutually exclusive, exhaustive, nonordered categories. The C categories are labelled 1, 2, ..., C. Since they are nonordered, which category is assigned which label can have no effect on the conclusions that will be drawn. Thus we assign the categories in any order at all to the labels 1, 2, ..., C. If there is any hesitation at so assigning labels to the categories, one should hesitate at using the following methods designed for nonordered categories or nominal measures.


Technically, X_ij is a C-dimensional vector with C entries corresponding to the C categories. If subject i is assigned to category k by rater/rating j, then the kth entry is 1, and all the other entries are 0s. Thus X_ij can take on one of only C possible vector values: V_1, V_2, ..., V_C, where V_k is the vector with 1 as the kth entry and 0s elsewhere.

For a fixed subject i, over the population of raters/ratings: E(X_ij) = p_i, the consensus vector or profile, where p_i now is also a C-dimensional vector, with each entry (p_ik) a value on the unit interval, with the sum of the entries equalling 1. The entries represent the probabilities that subject i will be assigned by a randomly selected rater/rating to each of the C possible categories. While there are only C possible values of X_ij, there may be fewer or more possible values of p_i, perhaps even an infinite number.

There is a major source of semantic confusion in dealing with categorical reliability. In common usage, we 'assign' a subject to a category on the basis of a single rating. For example, if there were four diagnostic categories 1: schizophrenia; 2: depression; 3: other psychiatric disorder; 4: no psychiatric disorder, a clinician or researcher might say that a subject assigned by a single rater to #1 either 'has' schizophrenia, or 'is' schizophrenic, or 'belongs to' the schizophrenic category, ignoring the unreliability of the categorization. This is a semantic shortcut that can prove misleading in scientific or clinical communications unless the subject under discussion has p_i = V_1, i.e. each and every rater/rating will assign him to Category #1. With unreliable categorical measurements, a subject assigned by a single rater to one category may have a greater or equal probability of being assigned to another category by an independent second rater. In what follows, we will say a subject 'belongs to' category k if and only if his consensus profile is V_k, i.e. every rater will assign him to category k. Otherwise we will say that a rater or rating 'diagnoses him' as having k, thus acknowledging the probabilistic nature of diagnoses.

Over the subjects in the population: E(p_i) = E(X_i1) = P = (P_1, P_2, ..., P_C)′ (the prevalence profile), where P is the (column) vector of the observed prevalences of the C categories, and P′ its (row) transpose. We will assume that all C prevalences in the population of subjects (P_1, P_2, ..., P_C) are greater than zero, i.e. that there are no superfluous categories. The observed proportion of a sample of subjects assigned by the first rating to each of the C categories is an unbiased estimator of the prevalence profile P.

Over the population of raters/ratings:

Covariance Matrix(X_i1) = Σ_X,

with diagonal elements

σ_X,kk = P_k(1 − P_k),

and the off-diagonal elements

σ_X,kk* = −P_k P_k*  (k ≠ k*).

This covariance matrix can, of course, be estimated using the sample estimate of P. Also, over the subjects in the population:

Covariance Matrix(p_i) = Σ_p.


To understand Σ_p, as well as to suggest how it is estimated, consider the following: Suppose we were to sample N subjects from the population, obtaining m independent ratings on each. For each subject there are m(m − 1)/2 different pairs of ratings (each pair with entries in random order), a grand total of Nm(m − 1)/2 paired ratings overall. We can then create a C x C matrix with the rows defined by the category of the first classification of each pair, the columns by the category of the second, and enter into each cell the corresponding proportion of the Nm(m − 1)/2 pairs. The proportion of the total pairs on which there is agreement on category k is the kth diagonal entry. The proportion of the total pairs on which the first rater assigns the subject to category k and the second to category k* is in the kth row, k*th column. We will call this the observed Pairwise Classification Matrix, PCM_O. If all the classifications were random, the random Pairwise Classification Matrix, PCM_R, would equal:

PCM_R = P P′.

If we subtract PCM_R from PCM_O, what results is a sample estimate of Σ_p:

Σ_p = Expected Value(PCM_O − PCM_R).

Thus Σ_p reflects how far from random are the probabilities of the various paired classifications. If there were perfect reliability in the categorical measures, Σ_p would equal Σ_X. If there were perfect unreliability, i.e. the categories were randomly assigned to the subjects, Σ_p would be a matrix full of zeros. Assessment of reliability of a categorical measure, consequently, depends on locating Σ_p between these two extremes.

In the classical approach applied either to ordinal or binary measures, it proved useful to assess the covariance ratios (correlations). Here we might consider the matrix Λ, the entries of which are:

λ_kk* = σ_p,kk* / (P_k(1 − P_k) P_k*(1 − P_k*))^1/2.

The kth diagonal entry of Λ, λ_kk, is the individual kappa coefficient obtained if each subject were classified by a binary measurement into category k or 'not-k', i.e. into any category other than k. Thus the diagonal elements of Σ_p or Λ reflect how reliably each one of the C categories can be distinguished by the raters from the conglomeration of the other categories available. Off-diagonal entries too are informative. If one dichotomized paired ratings into categories k and 'not-k' by the first rating, and into categories k* and 'not-k*' by the second independent rating, the phi coefficient (product moment correlation coefficient) between the two binary ratings so constructed would equal λ_kk*. Thus the off-diagonal entries reflect how 'confusable' are the various categories in the system. If one category tends to preclude the other, the entry is negative. Otherwise, if two categories are likely to be confused, the value of λ_kk* may approach zero or become positive.
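To make these definitions concrete, the short sketch below (a hypothetical population of consensus profiles, not taken from the paper) computes Σ_X, Σ_p and Λ directly from a specified distribution of p_i:

```python
# Illustrative sketch (hypothetical population): the matrices Sigma_X,
# Sigma_p and Lambda computed from a discrete distribution of consensus
# profiles p_i, following the definitions above.
import numpy as np

# Hypothetical C = 3 population: three possible consensus profiles.
profiles = np.array([[0.8, 0.1, 0.1],
                     [0.1, 0.6, 0.3],
                     [0.1, 0.3, 0.6]])
weights = np.array([0.5, 0.25, 0.25])     # proportion of subjects with each profile

P = weights @ profiles                                     # prevalence profile
sigma_x = np.diag(P) - np.outer(P, P)                      # Covariance Matrix(X_i1)
second_moment = np.einsum('s,sk,sj->kj', weights, profiles, profiles)
sigma_p = second_moment - np.outer(P, P)                   # Covariance Matrix(p_i)
d = np.sqrt(np.diag(sigma_x))
lam = sigma_p / np.outer(d, d)                             # matrix Lambda of lambda_kk*

print(np.round(P, 3))
print(np.round(lam, 3))   # diagonal: individual kappas; off-diagonal: confusability
```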

5.2 Meaningful and interpretable?
We now propose that to assess the reliability of a categorical measure one needs, at the very least, each diagonal entry of, and perhaps even the entire matrix, Λ or Σ_p. A single univariate coefficient of reliability will not do. To justify this suggestion, let us consider the following:

1) Reproducibility and confusability: Since

Prob(X_i2 = V_k | X_i1 = V_k) = P_k + (1 − P_k)λ_kk,

and for k ≠ k*:

Prob(X_i2 = V_k* | X_i1 = V_k) = P_k* + σ_p,kk*/P_k,

Thus, once again, the diagonal elements of Σ_p or Λ indicate how reproducible a first rating is by an independent second, and in addition, the off-diagonal elements indicate how confusable one category is with another.

2) Association between consensus profile and a single rating: Since

Corr(X_ijk, p_ik) = λ_kk^1/2,

and for k ≠ k*:

Cov(X_ijk, p_ik*) = σ_p,kk*,

Thus the association between a single classification into Category k and the consensus probability of Category k for a subject is indicated by the diagonal entries of Σ_p or Λ, and the association between a single rating as Category k and the consensus probability of another Category k* for a subject is indicated by the off-diagonal entries of Σ_p or Λ.

Little is known of the quantitative effects of the unreliability of categorical measures on estimation and testing. Qualitatively, however, from the above relationships, it is clear that the subjects assigned by a single rater to Category k using an unreliable classification will be a heterogeneous mixture of subjects, many of whom a second or third independent rater/rating would have assigned to another category. Such a situation will confuse research comparing, for example, risk factors, course of illness, or response to treatment of subjects assigned to the different diagnostic categories.

5.3 Design, estimation and hypothesis testing
With a sample of N subjects from the population of subjects, and a sample of m (> 1) ratings from the population of raters/ratings, as indicated above, one can estimate the prevalence vector P by the proportion of total ratings that assign a subject to each of the C categories, and from that, estimate PCM_R. Then we can estimate Σ_X using the estimate of P. Finally we can estimate Σ_p by comparing the observed pairwise classification matrix PCM_O to PCM_R. The matrix Λ then can also be estimated by dividing the entries of the estimated Σ_p by the appropriate estimated variances. The exact sampling distributions of these various matrices are unknown, but Bootstrap or Jackknife estimators can be used to estimate standard errors or confidence intervals, or as a basis of statistical tests, just as they are for binary measures.3
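A minimal sketch of this estimation procedure follows (illustrative only; the function name and the simulated data are not from the paper). It assumes an N x m integer array of blinded ratings with categories coded 0, ..., C − 1:

```python
# Minimal sketch of the estimation procedure described above (illustrative):
# estimate P, PCM_O, PCM_R, Sigma_p and Lambda from an N x m ratings array.
import numpy as np

def categorical_reliability_matrices(ratings, C):
    ratings = np.asarray(ratings)
    n, m = ratings.shape

    # Prevalence profile P: proportion of all ratings falling in each category.
    P = np.bincount(ratings.ravel(), minlength=C) / ratings.size

    # Observed pairwise classification matrix PCM_O, pooled over all ordered
    # pairs of distinct ratings within each subject (this symmetrizes PCM_O,
    # matching the 'random order within pair' description in the text).
    pcm_o = np.zeros((C, C))
    for row in ratings:
        for j in range(m):
            for k in range(m):
                if j != k:
                    pcm_o[row[j], row[k]] += 1.0
    pcm_o /= n * m * (m - 1)

    pcm_r = np.outer(P, P)                       # random pairwise matrix PCM_R
    sigma_p = pcm_o - pcm_r                      # estimate of Sigma_p
    sigma_x = np.diag(P) - pcm_r                 # estimate of Sigma_X
    d = np.sqrt(np.diag(sigma_x))
    lam = sigma_p / np.outer(d, d)               # estimate of Lambda
    return P, sigma_p, lam

# Example with simulated data: C = 3 categories, m = 4 ratings per subject.
rng = np.random.default_rng(2)
profiles = rng.dirichlet([2, 2, 2], size=300)    # consensus profiles p_i
ratings = np.array([rng.choice(3, size=4, p=p) for p in profiles])
P, sigma_p, lam = categorical_reliability_matrices(ratings, 3)
print(np.round(lam, 3))
```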


6 Sugar coating the problem?

6.1 Preliminary considerations
Much of what is described above appears to be virgin territory for statistical research. The situation here is far more complex than that of assessing reliability of ordinal or binary measures, where a univariate coefficient of reliability is quickly identified. Dealing with C-dimensional vectors and C x C matrices is obviously more difficult. As a result, at this juncture, there has generally been a temptation to 'sugar-coat' the problem, either by making ad hoc statistical decisions or by imposing restrictive mathematical assumptions on the nature of the subject population, on the nature of the rater/rating population, or on the patterns of response, all to simplify the mathematics. Actually, such simplifications are not a bad idea, as long as they are only used to enhance understanding of the methodological aspects of the problem, to try out different approaches or ideas, that is, as a tool in methodological research. Difficulties arise when such simplifying assumptions, which correspond poorly to the 'real world' as they often do, are used to generate results that are then applied in 'real world' situations, for example, to make treatment decisions for patients or to draw inferences about patients in clinical research. Then the results based on such assumptions are potentially quite misleading. To demonstrate both the value of what can be learned in methodological research, and what misleading results can be obtained in 'real life' research, let us consider now two salient examples of such simplifications.

6.2 An ad hoc statistical decision: the multicategory kappa
If the categorical measurement were completely reliable, all pairwise classifications on the same subject would agree. If the categorical measurement were perfectly unreliable (random), the pairwise classifications on the same subject would agree at a chance level that can be estimated from the observed prevalences. For this reason, it is tempting to make an ad hoc statistical decision to focus entirely on the univariate statistic of total percentage of agreements and to ignore the complex patterns of agreements and disagreements among the various separate categories in the matrices.

Let A be the probability that two independent ratings of the same subject agree (ignoring the category on which they agree). The value of A is the trace of (Σ_p + P P′). With perfect unreliability the value of A is trace(P P′) = Σ_k P_k²; with perfect reliability, the value of A is 1. Thus one might propose2 to use a kappa coefficient, the multicategory kappa, defined as:

κ_T = (A − Σ_k P_k²) / (1 − Σ_k P_k²),

a univariate index that has value 0 for a perfectly unreliable measurement and 1 for a perfectly reliable measurement. The estimation and testing of this parameter has undergone considerable methodological study.2 Indeed, this is the coefficient of reliability of categorical measurement most frequently seen in the research literature. This kappa can be related to the C individual kappas by the relationship1:

κ_T = Σ_k P_k(1 − P_k) λ_kk / Σ_k P_k(1 − P_k).

Thus the multicategory kappa is a weighted average of the C individual category kappas, with weights determined by the corresponding prevalences. The reliabilities of individual categories with extreme prevalence, near zero or one, are given little weight. However, we need have only one highly reliable category with moderate prevalence among the C categories to generate an optimistic value of κ_T. For example, suppose we had a three category system (C = 3), such as 1: nondepressed; 2: depressed-endogenous; 3: depressed-nonendogenous. Suppose Category 1 were perfectly reliable. Thus every subject belongs either to the depressed or to the nondepressed group. However, suppose further that within the depressed subgroup, the decision to classify as endogenous versus nonendogenous were completely random, equivalent to being decided by the toss of a fair coin. Then, of course, the kappa for Category 1 would be 1, but here the kappas for Categories 2 and 3 can be shown to equal P_1/(1 + P_1). Thus if the prevalence of nondepressed in the population were 0.5, the reliabilities of the two depressed subcategories would each be 0.333, and the multicategory kappa equal to 0.600. If the prevalence of nondepressed were 0.9, the reliabilities of the two depressed subcategories would be 0.474, and the multicategory kappa equal to 0.730.

These values of kappa8 might seem quite acceptable separately for Categories 2 and 3, as well as for the multicategory kappa, despite the fact that here assignment to these two subcategories was done by tossing a fair coin. If we tried to use this categorical measurement to identify biological markers to differentiate two subtypes of depression from each other, for example, that effort would be doomed. For such reasons, use of the multicategory kappa in applications as a univariate index of reliability for categorical measurements may be misleading and should be discouraged.

However, there are at least two valuable methodological insights that examination of this special case develops. First is the observation that when there are one or more very highly reliable individual categories, this will enhance the kappa coefficients for any less reliable categories available. High agreement for the highly reliable categories enhances the agreement that a specific nonreliable category is absent. For this reason, one should be wary when there is a wide spread among the C individual kappas (the diagonal values of Σ_p or Λ). In such a case, there should be close attention to the least reliable of the categories. These may be merely random subcategories of some reliable super-category. Secondly, this is a case in which σ_p,23 or λ_23 is positive, indicating that Categories 2 and 3 tend to be easily confusable. Just as a relatively low individual kappa should alert researchers to possible problems in discriminating one particular category, positive off-diagonal values in Σ_p or Λ should alert researchers to possible problems in distinguishing between those pairs of categories. Such categories might better be pooled.25
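The numbers in the worked example above can be verified directly from the Section 5 definitions; the following brief sketch (the function name is ours, purely for illustration) does so:

```python
# Numerical check of the worked example: Category 1 perfectly reliable,
# Categories 2 and 3 assigned by a fair coin toss within the depressed group.
import numpy as np

def example_kappas(P1):
    profiles = np.array([[1.0, 0.0, 0.0],    # nondepressed: always Category 1
                         [0.0, 0.5, 0.5]])   # depressed: coin toss between 2 and 3
    weights = np.array([P1, 1.0 - P1])
    P = weights @ profiles
    second_moment = np.einsum('s,sk,sj->kj', weights, profiles, profiles)
    sigma_p = second_moment - np.outer(P, P)
    kappas = np.diag(sigma_p) / (P * (1 - P))         # individual category kappas
    w = P * (1 - P)
    kappa_T = np.sum(w * kappas) / np.sum(w)          # multicategory kappa
    return np.round(kappas, 3), round(kappa_T, 3)

for P1 in (0.5, 0.9):
    print(P1, example_kappas(P1))
# P1 = 0.5 -> kappas (1, 0.333, 0.333), kappa_T = 0.600
# P1 = 0.9 -> kappas (1, 0.474, 0.474), kappa_T = 0.730
```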

6.3 Restrictive assumptions: the sensitivity-specificity model
As noted above, there is a natural desire to assign subjects to one of the C categories, i.e. to say that each subject in some sense 'belongs to' one of the categories. With unreliable categorization, to do so with scientific rigour is problematic. Above we proposed to say that a subject i 'belongs to' category k if and only if p_i = V_k, i.e. only if all raters assign that subject to category k. However, it is tempting to soften this stance somewhat, at least to the extent of proposing that the population of subjects might be divided into C 'subtypes', occurring in the population with probabilities Q_1, Q_2, ..., Q_C (the subtype prevalence vector Q). So far, we have made no restrictive assumptions, for one can always hypothesize such subtypes, and do so in a variety of different ways.


Now however, let us assume that all subjects of subtype k have the same p_i, with p_ik the largest probability in that vector. Thus we assume that subjects of each subtype are homogeneous in terms of their probabilities of assignment to each category, and that there is a one to one correspondence between subtypes and categories. That there exist C homogeneous subpopulations of subjects, with a one-to-one correspondence of such subpopulations with the C categories, are both highly restrictive assumptions. The probability that each subject of subtype k is classified into category k is often called the 'sensitivity' of the classification. The probability that a subject not of subtype k, 'not-k', is classified into 'not-k' is often called the 'specificity' of the classification. The hypothesized subtypes themselves are often called 'latent classes'. If we could assume such a situation, it would clearly allow us to say that each subject 'belongs to' one particular category without the rigid requirement of perfect reproducibility for that subject's classification. Furthermore, now there are only a limited number of parameters to be estimated (no more than C²), and this number can be made even smaller by assuming equal sensitivities and equal specificities. For example, suppose there were C homogeneous subtypes occurring with probabilities Q_1, Q_2, ..., Q_C, and that the probability that a subject of subtype k is classified into category k is SE (sensitivity), and that the probability of any other classification, say into k*, is (1 − SE)/(C − 1). In this case:

P_k = Q_k SE + (1 − Q_k)(1 − SE)/(C − 1).

Unless there were perfect sensitivity (SE = 1), the observed prevalence vector P would be a biased estimator of the subtype prevalence vector Q. Completely random decisions (SE = 1/C) would yield P_k = 1/C regardless of what the true subtype prevalences were. However, if we knew or could estimate the sensitivity, we could use that estimate in the equation above to estimate Q_k from the estimated P_k. Under these assumptions, the entries of Σ_p, and hence the individual kappas λ_kk and the confusability terms λ_kk* (k ≠ k*), can also be expressed in closed form in terms of SE and the subtype prevalences. In this case, the total proportion of agreements (A above) is an unbiased estimator of SE². Thus estimation of SE is easy. With a point estimate of SE and the estimates of Q, with C known, the estimation of Σ_X or Σ_p or Λ is a simple matter.

What's wrong with that? To assume that the population comprises C homogeneous subtypes assumes that all the population variance is between-subtype variance. Thus if one proceeds as indicated above to estimate Q and SE, and then to use these estimates to estimate the covariance matrices, any heterogeneity within subtype is simply ignored. Consequently one misestimates all the variances and covariances. How gross the misestimation is depends on how close to the truth the assumption of C homogeneous subtypes is. In many situations, the estimates one obtains from imposing such restrictive assumptions may have nothing to do with reality, i.e. are nonrobust.


Even more problematic, however, is that it is sometimes further assumed that as one moves from one population of subjects to another, what changes is only Q, the prevalences of the subtypes, not the sensitivities and specificities. This further restrictive assumption may exacerbate the problem of misleading results.

Yet this sensitivity/specificity model, flawed as it may be for application in medical research, is perhaps the single most useful tool in methodological studies. For example, suppose the reliabilities of a single measure indicated better than random classification, but at levels not yet satisfactory. We might consider using some consensus rule. For example, we might consider using three raters per subject in a future research project or in decision making in the clinic. If at least two of the three agreed, the consensus category would be what that agreement indicated. If no two agreed, we would assign the subject to an additional category called C + 1: unclear. To do a study using a sample of N subjects and m independent panels of three raters each in order to assess the reliability of this newly proposed consensus categorical measurement would be an enormous task. Furthermore, this might not be the one and only proposal for defining a consensus to be considered. To evaluate each such proposal for a consensus with yet another reliability study would be prohibitive. Instead, what we might do is assume some simple sensitivity/specificity model, preferably one a little less restrictive than the one above, and estimate the parameters of that model from a simple reliability study, say with as many raters per subject as there are parameters to be estimated. Then we might use these estimated parameters to simulate what the results would be with the various proposed consensus rules. If a proposed consensus rule works poorly under these simplified circumstances, one should feel reluctant to assume that it will work any better under the more difficult circumstances of 'real life'. Thus one might sort out those proposed decision rules that work best, for further evaluation. While there is no guarantee of what will work best in 'real life', use of such models can help focus efforts in the most promising areas, thus reducing the time, effort and costs necessary to improve categorical measurements.26
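The kind of simulation proposed above might look like the following sketch (illustrative numbers and rule only, under the simple equal-sensitivity model; none of the code is from the paper or the cited references). It compares the per-category reliability of a single rating with that of a best-two-out-of-three consensus rule, coding 'no majority' as an 'unclear' label:

```python
# Illustrative simulation under the equal-sensitivity model: single rating
# versus a best-two-of-three consensus rule ('unclear' coded as -1).
import numpy as np

rng = np.random.default_rng(0)

def rate_once(subtypes, SE, C):
    """One independent raw rating per subject: P(rating = subtype) = SE,
    every other category gets probability (1 - SE)/(C - 1)."""
    probs = np.full((C, C), (1 - SE) / (C - 1))
    np.fill_diagonal(probs, SE)
    return np.array([rng.choice(C, p=probs[s]) for s in subtypes])

def two_of_three(subtypes, SE, C):
    """Best-two-of-three consensus; -1 ('unclear') if all three raters disagree."""
    r = np.stack([rate_once(subtypes, SE, C) for _ in range(3)], axis=1)
    out = np.full(len(subtypes), -1)
    for k in range(C):
        out[(r == k).sum(axis=1) >= 2] = k
    return out

def individual_kappas(r1, r2, C):
    """Individual (category k versus not-k) kappas from two independent ratings."""
    kappas = []
    for k in range(C):
        a, b = (r1 == k).astype(float), (r2 == k).astype(float)
        p = (a.mean() + b.mean()) / 2
        kappas.append(((a * b).mean() - p * p) / (p * (1 - p)))
    return np.round(kappas, 2)

Q, SE, C, N = [0.5, 0.3, 0.2], 0.7, 3, 5000
subtypes = rng.choice(C, size=N, p=Q)
print("single rating:", individual_kappas(rate_once(subtypes, SE, C),
                                           rate_once(subtypes, SE, C), C))
print("2-of-3 rule  :", individual_kappas(two_of_three(subtypes, SE, C),
                                           two_of_three(subtypes, SE, C), C))
```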

7 Discussion: coping with categorical unreliability

There are three major points to this discussion:

1) The most important point, least welcome to medical researchers, is the suggestion that one needs a matrix such as Σ_p or Λ to describe the reliability of a categorical measure. A single summary index will not do. Moreover, when this matrix is obtained, it is not unusual to find poor reliability for important diagnostic categories.

2) The second, least welcome to methodology researchers, is the suggestion that imposing ad hoc or restrictive assumptions in order to simplify the mathematics may help to increase methodological insights into the problems of categorical reliability, but should be used only with great care in applications in the 'real world'. The results of such applications can be very misleading and may explain some of the difficulties encountered in research projects using such diagnostic categorizations.

3) Perhaps surprising to both medical and methodology researchers may be the suggestion of how little methodological research has been done on the general issue of categorical reliability, that is, research without restrictive assumptions, despite the importance of the issues. Much remains to be done.


We have here demonstrated how to estimate the matrices needed to assess categorical reliability, and have indicated how to approximate their sampling distributions as a basis of estimation and tests. We have begun to identify what constitute signs and symptoms of trouble in categorical measurements: a wide spread in the kappas of the individual categories, or two categories with σ_p,kk* (or λ_kk*) that is near zero or positive. Awareness of such signs and symptoms from reliability studies might lead to suitable combination of categories, and may improve the categorical system for use in medical research and clinical practice without 'going back to the drawing boards' to redesign a categorical measurement system. Such efforts might well avert many problems consequent to use of unreliable categorical measurements.

Any suggestion that categories be combined in 'real world' situations is likely to be met with a protest that, for research or clinical purposes, the distinction between such categories (as in the above example, 'endogenous' and 'nonendogenous') is vital, perhaps even the central point of doing the categorization. We are not questioning the importance of the distinction in theory, only noting when the distinction is not being made in practice. If indeed this distinction is, as claimed, the most important one, there may be no recourse but to return to the 'drawing boards' and to develop a new measurement system that can better make this important distinction. To proceed to use such an unreliable measurement in research or clinic as if it were meaningful is to risk drawing false and misleading conclusions.

But suppose that there are no such indications that any categories are random, redundant or non-distinguishable, but their reliabilities are still not satisfactory. In this case, as we suggested above, we might begin to consider augmenting reliability by using a consensus of raters. For ordinal measures satisfying the classical model, the consensus of m independent raters is their mean, and the reliability of this mean (still an ordinal measure satisfying the classical model) is given by the Spearman-Brown formula:

ρ_m = mρ / (1 + (m − 1)ρ),

where ρ is the reliability coefficient of a single rating (m = 1).

For binary measures, one possible consensus of m independent raters is the proportion positive, an ordinal measure that does not, however, satisfy the classical model. However, with the use of a variance-stabilizing transformation, it can be shown that, approximately, the Spearman-Brown formula holds here as well.1 However, in some cases one needs a binary consensus of binary measures, i.e. a rule that says that if more than j of the m independent raters are positive, the binary consensus is positive. Then it will be necessary to identify the optimal value of j for each possible value of m. It has recently been demonstrated that the Spearman-Brown rule does not hold here, but that a carefully selected binary consensus (setting j for each value of m) will improve reliability, usually quite dramatically, as the number of raters increases.26

A natural extension of these results is to propose that what might be considered for a categorical measure is to take m independent raters per subject, and to take one of two courses to define a consensus. For research use, one could use the consensus vector, i.e. the proportions of the m


raters that assigned the subjects to each of the C categories, as a multivariate response vector. Alternatively one might begin to seek methods of developing optimal decision rules based on the consensus vector to classify each subject into one of C categories, methods similar to those being developed for binary measures.

Before we summarize the situation, one comment should be made. We have dealt here with the special case of reliability of categorical measurement rather than the general problem of agreement. What makes this a special case is that we assume that we have N subjects randomly sampled from a well-specified population, m raters/ratings for each subject randomly sampled from a well-specified population, and C nonordered categories, with agreement meaning identical responses and disagreement meaning nonidentical responses. No partial credit is given for any disagreements. There undoubtedly are more general problems of assessing categorical agreement beyond that of unreliability also defined by these characteristics. The present results apply to them as well.

We have elected here first to answer Stein's plea by defining a well-specified but important question to which we could give a reasonable reply. Then, as with Stein's friends, we could have taken three courses:

1) They chose to remain silent. In doing so, they conveyed the hopelessness of the situation, and a sense of their own and her helplessness to cope with or to avert the unpleasant outcome. In our situation, the outcome is not hopeless and we are not helpless. Silence seems a poor choice of response.

2) Stein's friends might have chosen to 'sugar-coat' the situation, to give her optimistic but empty assurances of a good outcome. This might have relieved Stein and her friends temporarily, but the outcome would not have changed, and they might have deprived her and them of averting that which could be averted, or best coping with that which was inevitable. In our situation, there has already been a great deal of 'sugar-coating'. To what extent are the difficulties and the costs of medical research related to the use of unreliable diagnostic categorizations? To what extent has 'sugar-coating' reliability issues fostered these problems? That may never be known. Such 'sugar-coating' ranges from using diagnostic categorizations without any assessment of reliability at all, to inadequate sampling of subjects or of raters, or the nonblinding of multiple raters in reliability studies. It includes using mathematical models that remove 'rater effects' or 'situation effects', thus mathematically concealing major sources of unreliability. It includes making empirically unsupported assumptions about the nature of the subject or rater population, or their patterns of response, and then trusting the subsequent results to predict the effect on future research and clinical practice. Since any such 'sugar-coating' will tend to preclude taking appropriate action to improve reliability before using the measures, 'sugar-coating' also seems a poor choice of response here.

3) Finally, Stein’s friends might have acknowledged to her and to each other the situation, giving her and them time and opportunity to avert that which could be averted, and to cope with that which was unavoidable. Here that seems the optimal solution, particularly in that guidelines as to how to identify problematic situations suggest strategies for averting or alleviating the potential consequences.


References

1 Kraemer HC. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461-72.
2 Fleiss JL. Statistical methods for rates and proportions. New York: John Wiley & Sons, 1981.
3 Bloch DA, Kraemer HC. 2 x 2 kappa coefficients: Measures of agreement or association. Biometrics 1989; 45: 269-87.
4 Lord FM, Novick MR. Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company, Inc., 1968.
5 Cronbach LJ, Gleser G, Nanda H, Rajaratnam J. The dependability of behavioral measurements. New York: John Wiley & Sons, 1968.
6 Agresti A. A model for agreement between ratings on an ordinal scale. Psychol Rep 1988; 44: 539-48.
7 Becker M. Association models to analyse agreement data: two examples. Statistics in Medicine 1989; 8: 1199-208.
8 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159-74.
9 Tanner MA, Young MA. Modeling ordinal scale disagreement. Psychol Bull 1985; 98: 408-15.
10 Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966; 19: 3-11.
11 Bartko JJ. Corrective note to: The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1974; 34: 418.
12 Bartko JJ. On various intraclass correlation reliability coefficients. Psychol Bull 1976; 83: 762-65.
13 Algina J. Comment on Bartko's "On various intraclass correlation reliability coefficients". Psychol Bull 1978; 85: 135-38.
14 Lahey MA, Downey RG, Saal FE. Intraclass correlations: There's more there than meets the eye. Psychol Bull 1983; 93: 586-95.
15 Cacoullos T. A relation between t and F distributions. Journal of the American Statistical Association 1965; 60: 528-31.
16 Kraemer HC. On estimation and hypothesis testing problems for correlation coefficients. Psychometrika 1975; 40: 473-85.
17 Kraemer HC, Bloch DA. Kappa coefficients in epidemiology: An appraisal of a reappraisal. Journal of Clinical Epidemiology 1988; 41: 959-68.
18 Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull 1971; 76: 378-82.
19 Davies M, Fleiss JL. A measure of agreement for multinomial data arrayed in a two-way layout. 1981.
20 Goodman LA, Kruskal WH. Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association 1963; 58: 310-64.
21 Goodman LA, Kruskal WH. Measures of association for cross classifications, IV: Simplification of asymptotic variances. Journal of the American Statistical Association 1972; 67: 415-24.
22 Goodman LA, Kruskal WH. Measures of association for cross classifications. Journal of the American Statistical Association 1954; 49: 732-64.
23 Goodman LA, Kruskal WH. Measures of association for cross classifications. II: Further discussion and references. Journal of the American Statistical Association 1959; 54: 123-63.
24 Kraemer HC. Assessment of 2 x 2 associations: Generalization of signal detection methodology. The American Statistician 1988; 42: 37-49.
25 Kraemer HC. A study of reliability and its hierarchical structure in observed chimpanzee behavior. Primates 1979; 20: 553-61.
26 Kraemer HC. How many raters? Toward the most reliable diagnostic categorization. Statistics in Medicine 1991 (in press).