Problems and Considerations in the Valid Assessment of Personality Disorders

J. Christopher Perry, M.P.H., M.D.

This article reviews evidence for the diagnostic concordance of structured-interview and self-report questionnaire methods for the diagnosis of personality disorders. The interview and self-report findings of nine studies that compared two or more axis II diagnostic instruments administered to the same groups of subjects are summarized. Across the eight studies with sufficient data, overall diagnostic agreement between any two instruments for making individual personality disorder diagnoses yielded a low chance-corrected reliability (median kappa=0.25); diagnostic concordance between self-report questionnaire and interview methods was lower than that between interview methods. Comparing dimensional scores of different methods did not appreciably improve the level of agreement. The author concludes that current methods for making personality disorder diagnoses have high reliability but yield diagnoses that are not significantly comparable across methods beyond chance, which is not scientifically acceptable. Sources for the disagreement include variance due to different raters, interview occasions, data sources (self-report versus observer report), information obtained, and instrument sensitivity to state effects (e.g., mood). Serious problems in assessment validity may also arise from the yes/no format, which, despite probes for confirmatory examples, may fail to distinguish adequately between sporadic occurrences and longstanding patterns. Efforts should be made to improve and demonstrate the validity of axis II diagnostic methods. One route to increasing validity is to improve the clinical interview, because personality patterns are best revealed by the recurring patterns one finds when taking a systematic history. (Am J Psychiatry 1992; 149:1645-1653)

An earlier version of this paper was presented at the annual meeting of the Society for Psychotherapy Research, Wintergreen, Va., June 28, 1990. Received April 24, 1991; revision received March 24, 1992; accepted April 2, 1992. From The Cambridge Hospital and Harvard Medical School, Boston. Address reprint requests to Dr. Perry, Institute of Community and Family Psychiatry, Sir Mortimer B. Davis-Jewish General Hospital, 4333 Chemin de la Côte Ste-Catherine, Montreal, Que. H3T 1E4, Canada. The author thanks Philip W. Lavori, Ph.D., for advice on statistical issues. Copyright © 1992 American Psychiatric Association.

Prior to the introduction of specific diagnostic criteria, the reliability of personality disorder diagnoses was low. A review of studies from the pre-DSM-III era demonstrated that the mean interrater reliability (kappa) for clinical diagnoses of personality disorders was 0.32 (1). With the introduction of criteria, in the DSM-III field trials the kappa for the presence of any axis II diagnosis was 0.61 in conjoint interviews and 0.54 when independent interviews were conducted (2). However, others (3) reported an overall kappa of 0.41 between independent interviews, with lower concordances for individual disorders (median kappa=0.23). Because low reliability constrains the validity of any assessment, these findings have highlighted the necessity of adding some procedural guidelines to the nonsystematic clinical interview to improve reliability. The success of the Schedule for Affective Disorders and Schizophrenia and the Research Diagnostic Criteria system (4, 5) in yielding reliable diagnoses encouraged the development of

both structured-interview and self-report assessments of personality disorders, focusing on the DSM-III and DSM-III-R criteria. As this scientific endeavor has matured, a number of studies have compared different instruments used with the same groups of subjects. This report reviews how well these instruments agree when the same individuals are being diagnosed and discusses issues of assessment validity that are important for both research and clinical practice.

DESCRIPTION OF DIAGNOSTIC INSTRUMENTS

Following is a brief description of the instruments that have been used in studies comparing the concordance of axis II diagnoses according to DSM-III or DSM-III-R criteria. This list omits instruments for which comparison studies were not found. Reich (6, 7) has reviewed these and other instruments as well.



The National Institute of Mental Health Diagnostic Interview Schedule (DIS) (8), devised for use in the Epidemiologic Catchment Area (ECA) study of DSM-III axis I disorders, assesses one axis II disorder, antisocial personality disorder. Each criterion is assessed by one or more questions without probes for examples. Robins et al. (8) reported that the interrater reliability for the antisocial personality diagnosis in a comparison of lay and psychiatrist interviewers was 0.63. Subsequently, the ECA study reported a median lifetime prevalence across its sites of 2.5% overall and 4.5% among the 24- to 44-year-old age group sampled (9).

The Personality Disorders Examination, devised by Loranger (10, 11), is a structured interview yielding 11 DSM-III-R personality disorder diagnoses; the 1988 version includes the two (self-defeating and sadistic) listed in the DSM-III-R appendix. The questions are organized by topic area rather than by diagnosis, giving the interview a natural flow from the patient’s point of view. The current (1988) version includes a number of questions with probes asking for examples or anecdotes following a positive answer. A report on an earlier version (11) yielded high levels of agreement on all diagnoses and high interrater reliability for five diagnoses with sufficient base rates (median kappa=0.80, range=0.70-0.96). Interrater reliability of the dimensional scores for each disorder was very high (median intraclass correlation=0.97). Short-term retest reliability for four diagnoses with calculable kappa coefficients was moderately good (median kappa=0.49, range=0.37-0.56). Standage and Ladha (12) confirmed that overall interrater reliability was acceptable (median kappa=0.63, range=0.38-0.78). Pilkonis et al. (13) obtained an overall interrater reliability kappa of 0.79 for whether any personality disorder was present, and the 6-month retest stability of kappa was 0.52. Kappas for individual disorders were not reported.

The Structured Clinical Interview for DSM-III-R Personality Disorders (SCID-II) was devised by Spitzer et al. (14) as a companion to the SCID for axis I disorders. It covers the 11 axis II disorders plus self-defeating personality from the appendix to DSM-III-R. The questions are organized by diagnosis, so that all of the criteria for a disorder are assessed together, making it easy for the interviewer to assess one disorder at a time. Interviewers are encouraged to ask additional questions to clarify ambiguous responses, although clarifying questions are not specified. A recent version of the SCID-II uses a series of self-report questions which, if the answers are positive, are then followed by interviewer questions. Reliability findings from this recent version are not yet available.

The Structured Interview for DSM-III Personality Disorders was devised by Pfohl et al. (15). The questions are grouped topically rather than by individual diagnosis. Whenever a knowledgeable informant is available, the interviewer is encouraged to ask the informant some of the questions; if discrepancies occur between the subject’s and the informant’s answers, the criterion is scored on the basis of the more valid answer.


The interrater reliability of the DSM-III version was reported for the five disorders with calculable kappas (median kappa=0.75, range=0.45-0.90) (16). A subsequent study confirmed good interrater reliability: six disorders diagnosed two or more times had an interpolated median kappa of 0.83 (range=0.65-1.00) (17).

Two instruments have been devised to capture an analogue measure of the certainty of a clinician’s judgment about whether a patient meets the criteria for each personality disorder. Hyler et al. (18) devised the Clinical Assessment Form, which was used in a survey study of clinicians contacted by mail. The Clinical Assessment Form includes a 4-point scale for each axis II disorder (0=no traits, 1=mild traits, 2=moderate traits, and 3=fulfills DSM-III criteria). Reliability data are unavailable, and the authors cautioned that it is unclear whether these ratings are comparable to true diagnoses. A second analogue instrument, the Personality Assessment Form, was devised by Shea et al. (19) to assess axis II personality traits in a group of depressed individuals in the Collaborative Study of the Treatment of Depression. The Personality Assessment Form uses a 6-point scale to rate each of the 11 DSM-III-R personality disorder diagnoses and the two provisional diagnoses in the appendix. Each scale is preceded by a definition of the important features of the disorder, and each scale point is anchored by a short statement. A diagnosis is considered present if the patient is given a score of 4 (fits the description “to a considerable extent”). The overall reliability of this categorical cutoff score for the instrument yielded a kappa of 0.48. Subsequently, Pilkonis et al. (13) found that the overall interrater reliability kappa was 0.36 on the basis of intake data, 0.44 when follow-up data were available, and 0.49 when knowledgeable informants were interviewed. The 6-month retest stability of kappa was 0.56 for whether any personality disorder was present.

One self-report instrument, the Personality Diagnostic Questionnaire (20), assesses whether the criteria of the DSM-III personality disorders are met. The self-report questions of the Personality Diagnostic Questionnaire are arranged by disorder, making it readily scorable. The revised version includes 152 items for the 11 DSM-III-R personality disorder types plus self-defeating personality. A yes/no format is used, and the direction of some items is reversed to mitigate problems of response set. The internal consistency of the 11 original Personality Diagnostic Questionnaire scales was reported (median alpha=0.69, range=0.56-0.84) (Kuder-Richardson formula 20) (18).

A second self-report instrument, the Millon Clinical Multiaxial Inventory (21, 22), purports to measure the same construct dimensions represented in the DSM-III personality disorder types, although it does not assess the DSM-III criteria directly. It consists of 175 true/false questions representing 20 scales. Reich (7) reported moderately high 8-week retest reliabilities (median correlation=0.75, range=0.60-0.89). A revised version is available, with closer approximation to the DSM-III-R personality types.

Am J Psychiatry 149:12, December 1992


Concern about the validity of single assessment interviews has persuaded some clinicians and researchers to seek alternative ways to make more valid diagnoses. In the absence of a true validity criterion or “gold standard” for diagnoses, Spitzer (23) suggested what has become known as the LEAD standard, referring to making diagnoses based on “longitudinal expert evaluation using all available data.” This typically involves using intake diagnostic assessments, past records, data from informants, and, most importantly, observational data from a subsequent inpatient stay and response to treatment. LEAD diagnoses are then assigned on the basis of a review of all the data by a conference of experts, as has been reported in several studies reviewed below (13, 24-26). In one study the 6-month retest stability of kappa was 0.84 for the presence of any personality disorder, although agreement on individual types was not calculated (13).

THE KAPPA STATISTIC AND DIAGNOSTIC AGREEMENT

Interpretation of the level of diagnostic agreement between any two methods is influenced by the base rate or frequency with which each diagnosis is made by each method in the group studied. For instance, if two instruments each diagnose the presence of borderline personality disorder in 40% of a group of subjects, then by chance alone they should agree on the presence of borderline personality disorder in 16% of the cases (i.e., the product of their base rates, 40%x40%=16%) and agree on the absence of borderline personality disorder in 36% of the cases (i.e., the product of their base rates for nonborderline personality disorder cases, 60%x60%=36%), for a total chance level of agreement of 52% (i.e., 16%+36%).

The kappa statistic devised by Cohen (27) corrects for chance agreement by taking the base rates into account to calculate what proportion of the maximum possible chance-corrected rate of agreement was obtained. It does this by taking the observed rate of agreement, subtracting the chance rate of agreement, and then dividing that by the maximum possible rate of chance-corrected agreement (i.e., 1 minus the chance rate of agreement). Studies that determine the level of diagnostic agreement between two interviewers or two instruments use the kappa statistic (chance-corrected concordance) rather than simply reporting the percentage of agreement. However, in modest-sized samples, when a diagnosis occurs at a very low base rate, kappa has high variability (28), and so many studies only calculate kappa for diagnoses occurring 5% or more of the time. Shrout et al. (28), after Fleiss (29), offered the following guidelines in interpreting kappa values: “Values greater than approximately .75 are generally taken to indicate excellent agreement beyond chance, values below approximately .40 are generally taken to represent poor agreement beyond chance, and values in between are generally taken to represent fair to good agreement



beyond chance.” A further characteristic of kappa, or weighted kappa, is that in large samples, the weighted kappa is approximately equivalent to the intraclass correlation, interpretable as a proportion of variance (30, 31). Thus, a kappa value of 0.40 suggests that approximately 40% of the variance in the diagnoses made by two instruments is due to true differences in diagnoses among patients, while 60% is due to other things, such as instrument error.
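The chance correction described above can be sketched in a few lines (an illustration of the arithmetic only; the 76% observed-agreement figure is hypothetical, chosen solely to show how the correction works):

```python
def chance_agreement(p1, p2):
    """Agreement expected by chance between two binary raters
    whose base rates for a diagnosis are p1 and p2."""
    return p1 * p2 + (1 - p1) * (1 - p2)

def cohen_kappa(observed, p1, p2):
    """Cohen's kappa: (observed - chance) / (1 - chance)."""
    pe = chance_agreement(p1, p2)
    return (observed - pe) / (1 - pe)

# Both instruments diagnose borderline personality disorder in 40% of
# subjects, as in the text: chance agreement is 0.16 + 0.36 = 0.52.
print(round(chance_agreement(0.40, 0.40), 2))   # 0.52
# A hypothetical observed agreement of 76% then corrects to kappa = 0.50.
print(round(cohen_kappa(0.76, 0.40, 0.40), 2))  # 0.5
```

With equal 40% base rates, chance alone accounts for 52% of the cases, so even 76% raw agreement corrects down to a kappa of only 0.50.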

STUDIES COMPARING DIAGNOSTIC METHODS

The following studies compared two or more diagnostic methods. The emphasis of this review is on the concordance of any two methods for making individual personality disorder diagnoses. Because of the special problems in making clinical diagnoses with the use of self-report instruments only, I have included studies that used at least one observer-rated diagnostic method. The findings for individual diagnoses are displayed in table 1.

Perry et al. (32) compared the DIS with a systematic clinical interview on the DSM-III diagnosis of antisocial personality in a group of 70 subjects with personality and affective disorders. Antisocial personality was diagnosed by the DIS in 34 cases and by the clinical interview in 20 cases, yielding a kappa of 0.54. The subjects were subsequently reinterviewed two to seven times over a median of 1 year of follow-up. Data on antisocial behavior were systematically obtained by two different interview methods for each follow-up interval. Subjects diagnosed as having antisocial personality by both intake diagnostic methods showed significantly more antisocial behavior than those for whom the diagnosis was either 1) not present according to both methods or 2) not present according to the clinical interview but present according to the DIS. The discrepant cases (group 2) showed no more antisocial behavior than the group that did not have the diagnosis according to both methods.

Hyler et al. (18) compared Personality Diagnostic Questionnaire diagnoses with clinicians’ diagnoses. The data on 552 subjects were obtained by a mail survey of psychiatrists. Each participating psychiatrist administered the Personality Diagnostic Questionnaire to two patients and filled out a clinical assessment form, yielding ratings of clinical certainty that the patients met the DSM-III criteria for a given diagnosis, although the exact criteria each patient fulfilled were not noted. The diagnostic concordance between the Personality Diagnostic Questionnaire and clinical diagnosis yielded a median kappa of 0.08 (range=-0.16 to 0.46). When continuous scores were used for both the Personality Diagnostic Questionnaire and the clinical assessment form, the median Pearson’s correlation was 0.31 (range=0.16-0.51).

Zimmerman and Coryell (17) compared diagnoses made by the Structured Interview for DSM-III Personality Disorders and the Personality Diagnostic Questionnaire


TABLE 1. Diagnostic Agreement (kappa) Between Methods of Assessing Personality Disorders in Nine Studies(a)

[Rows: paranoid, schizoid, schizotypal, histrionic, narcissistic, borderline, antisocial, avoidant, dependent, obsessive-compulsive, passive-aggressive, self-defeating, and any personality disorder. Columns: Perry et al. (32), clinical interview vs. DIS (N=70); Hyler et al. (18), clinical interview vs. PDQ (N=552); Zimmerman et al. (17), SIDP vs. PDQ (N=697); Hyler et al. (33), PDQ-R vs. SCID and PDQ-R vs. PDE (N=87); Hogg et al. (34), SIDP vs. MCMI (N=40); O’Boyle et al. (35), SCID vs. PDE (N=20); Pilkonis et al. (13), LEAD vs. PDE and LEAD vs. PAF (N=40); Skodol et al. (26), SCID vs. PDE, LEAD vs. SCID, and LEAD vs. PDE (N=100); Jackson et al. (37), SIDP vs. MCMI (N=82).]

(a) DIS=Diagnostic Interview Schedule; PDQ=Personality Diagnostic Questionnaire; PDQ-R=Personality Diagnostic Questionnaire revised for DSM-III-R criteria; SIDP=Structured Interview for DSM-III Personality Disorders; SCID=SCID-II, Structured Clinical Interview for DSM-III-R Personality Disorders; PDE=Personality Disorders Examination; MCMI=Millon Clinical Multiaxial Inventory; PAF=Personality Assessment Form; LEAD=longitudinal expert evaluation using all available data.

in a group of 697 relatives of psychiatric patients and healthy control subjects. Experienced interviewers who had completed graduate work in the social sciences were used. Significantly more subjects were given at least one personality disorder diagnosis by the Structured Interview for DSM-III Personality Disorders than by the Personality Diagnostic Questionnaire (17.2% versus 10.3%); however, the Personality Diagnostic Questionnaire gave more multiple diagnoses. Diagnostic agreement between the two instruments was generally poor (median kappa=0.13, range=0.00-0.38), although agreement about the presence of any personality disorder was higher (kappa=0.32). Comparing dimensional scores from both instruments led to somewhat higher levels of agreement (median Pearson’s r=0.37, range=0.24-0.55). The largest correlation (r=0.58) was obtained between the total scores of all personality items endorsed on both instruments.

Hyler et al. (33) compared the revised version of the Personality Diagnostic Questionnaire with two structured interviews, the SCID-II and the Personality Disorders Examination, for 87 applicants for inpatient treatment on a personality disorders specialty unit. Interviews were conducted by experienced clinicians on the same day in a balanced order. The revised version of the Personality Diagnostic Questionnaire had relatively low levels of concordance with either structured interview. The median kappa with the SCID-II was 0.43 (range=0.23-0.63), while the median kappa with the Personality Disorders Examination was 0.37 (range=-0.02 to 0.54). The Personality Diagnostic Questionnaire was found to be highly sensitive to identifying positive diagnoses made by either structured interview (sensitivity range=75%-100%), whereas its specificities



were low (specificity range=24%-89%). The authors suggested that these findings indicate that the Personality Diagnostic Questionnaire might be a useful screening instrument when personality disorders are highly likely to be present, although it is not a substitute for a structured interview. Agreement between the Personality Disorders Examination and the SCID-II is described below from the authors’ report on the enlarged sample (26).

Hogg et al. (34) studied 40 hospitalized patients with schizophrenia of recent onset after the patients had recovered from an acute episode. The Structured Interview for DSM-III Personality Disorders and the Millon Clinical Multiaxial Inventory were administered. The Structured Interview for DSM-III Personality Disorders diagnosed personality disorders in 57% of the patients, most commonly yielding antisocial, borderline, and schizotypal types, while the Millon inventory most commonly found dependent, narcissistic, and avoidant types. The level of agreement between the two instruments on eight diagnoses occurring with sufficient frequency was low (interpolated median kappa=0.14, range=0.00-0.34). When dimensional trait ratings were compared, levels of agreement were somewhat higher (median Pearson’s r=0.26, range=-0.03 to 0.60).

O’Boyle and Self (35) interviewed 20 depressed inpatients with the SCID-II (May 1986 version) and the Personality Disorders Examination (May 1985 version) to rate the DSM-III-R criteria. Reliabilities for the diagnoses obtained were acceptable (Personality Disorders Examination, overall kappa=0.63; SCID-II, overall kappa=0.74). Comparing the two instruments, these authors obtained a kappa of 0.38 for the overall presence/absence of any personality disorder. The reliabilities of the three disorders diagnosed five or more times each yielded a median kappa of 0.23, a high of 0.62, and a low of 0.18. The reliabilities did not change appreciably when four subjects who were psychotic at the time of their initial interviews were excluded. The investigators reinterviewed 17 of the patients after they had recovered from their depression to compare effects of the depressed state on the results of the Personality Disorders Examination. The Personality Disorders Examination dimensional scores were lower for all personality disorder types except paranoid when the subjects were in the nondepressed state, and they were significantly lower for borderline and compulsive disorders. The authors concluded that the modest kappas they obtained were of concern.

Pilkonis et al. (13) carried out an elegant study of 40 patients with major depression in which they compared diagnoses obtained from the revised Personality Disorders Examination and the Personality Assessment Form (scored by the same interviewers) and a LEAD consensus method. The diagnosis of mixed personality disorder was included. Because of the small sample size in relation to the number of disorders, they reported only overall levels of diagnostic agreement, rather than agreement for individual disorders. At intake, personality disorders were diagnosed in 80% of the subjects by the Personality Assessment Form, in 70% by LEAD consensus, and in 63% by the Personality Disorders Examination. The overall concordances between methods were 1) for the Personality Disorders Examination and LEAD, kappa=0.28, and 2) for the Personality Assessment Form and LEAD, kappa=0.21. Raising the threshold of the Personality Assessment Form for a positive diagnosis worsened the level of agreement. Personality Disorders Examination data obtained at intake from a significant other as an informant demonstrated higher agreement with the patient’s score on the Personality Disorders Examination repeated at 6-month follow-up (kappa=0.50) and with the LEAD consensus data at 6 months (kappa=0.46). Pilkonis et al.
(13) also examined predictive validity, comparing patients diagnosed with and without personality disorders on the amount of improvement at 6 months on the Global Assessment Scale (GAS), the Beck Depression Inventory, and the SCL-90 total score. They hypothesized that patients with personality disorders should show less improvement. The LEAD consensus diagnoses predicted significant differences in improvement on all three measures, while the Personality Disorders Examination demonstrated improvement on the GAS only, and the Personality Assessment Form demonstrated no significant differences.

Skodol et al. (26) compared the SCID-II and the Personality Disorders Examination data on 100 inpatients. Both interviews were independently administered by professionals in balanced order, generally on the same day. Agreement between the two methods was fair (median kappa=0.50, range=0.14-0.66). Comparing dimensional scores from both interviews yielded better agreement (median Pearson’s r=0.77, range=0.58-0.87).


After at least 6 weeks of inpatient observation, a LEAD diagnosis was made in 34 cases. The SCID-II demonstrated slightly greater agreement with a subsequent LEAD diagnosis (median kappa=0.25, range=0.03-0.60) than did the Personality Disorders Examination (median kappa=0.25, range=-0.01 to 0.41). The investigators suggested that diagnoses made according to one structured interview should not be considered comparable to those according to another. This conclusion was strengthened in a subsequent report (36) of different patterns of comorbidity between pairs of personality disorder types across the two interviews. In the same sample, out of 55 possible unique pairs, the Personality Disorders Examination found significant co-occurrence in 29 pairs of personality disorders, compared to 12 pairs diagnosed by the SCID-II.

Jackson et al. (37) gave the Millon Clinical Multiaxial Inventory to 82 inpatients prior to discharge and then administered the Structured Interview for DSM-III Personality Disorders the following day. DSM-III criteria were used, and the reliability of the Structured Interview for DSM-III Personality Disorders diagnoses was acceptable (median kappa=0.67). The Millon inventory diagnosed more individuals for six of the 11 individual personality disorders. In the comparison of the two instruments, the base rates of the individual disorders varied by a factor between 1.2 and 6.0. The resulting agreement between the two instruments for categorical diagnoses was a median kappa of 0.18 (range=-0.03 to 0.53). When dimensional scores for each diagnosis were examined, the agreement rose slightly (median Pearson’s r=0.26, range=0.02-0.63). The authors concluded that for all but the category of borderline personality disorder, the two instruments demonstrated poor concordance, and for some categories they identified completely different individuals.

SUMMARY OF STUDIES COMPARING AXIS II DIAGNOSTIC METHODS

The level of agreement between instruments across the studies in table 1 can be summarized as good news, bad news, and plain news. The good news is revealed by examining the highest kappa value for agreement on individual diagnoses from each study reporting on more than one disorder. This highest estimate yields a median kappa of 0.54 (range=0.34-0.66) and represents fair to moderate levels of agreement. The bad news is obtained by examining the lowest kappa for agreement on individual disorders from the same studies, which yields a median kappa of 0.00 (range=-0.16 to 0.23), reflecting very poor chance-corrected levels of agreement. Finally, the plain news is summarized by the median value across studies of the median kappa within each study (i.e., the median of the median kappa values): median kappa=0.25 (range=0.08-0.54). This value generalizes best across personality diagnoses and across studies. It reflects that on average, the chance-corrected agreement between diagnostic methods is poor.
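The good news/bad news/plain news figures are simply medians taken over per-study extremes and medians; the sketch below shows the computation using placeholder per-study kappa lists (illustrative values only, not the actual table 1 entries):

```python
from statistics import median

# Placeholder per-study kappas (lowest, median, highest per study);
# illustrative values only, not the figures from table 1.
study_kappas = [
    [0.00, 0.25, 0.54],
    [-0.16, 0.08, 0.34],
    [0.14, 0.50, 0.66],
]

good_news = median(max(ks) for ks in study_kappas)      # median of highest kappas
bad_news = median(min(ks) for ks in study_kappas)       # median of lowest kappas
plain_news = median(median(ks) for ks in study_kappas)  # median of median kappas
print(good_news, bad_news, plain_news)  # 0.54 0.0 0.25
```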



These values change somewhat if comparisons with self-report measures are treated separately. For comparisons of interview methods only, the median highest estimated kappa=0.61 (interpolated), the median lowest estimate=0.09 (interpolated), and the median of median estimates=0.25. The studies comparing self-report and interviewer methods reported lower values: the median highest estimate=0.50 (interpolated), the median lowest estimate=-0.01 (interpolated), and the median of median estimates=0.16.

As I have indicated, in large samples the kappa is essentially equivalent to the intraclass correlation, interpretable as a proportion of variance (30, 31). An average chance-corrected concordance of 0.25 between instruments (the median value for all methods as well as for interview methods only) suggests that 75% of the variance in personality disorder diagnoses in the average study represents variance not attributable to the patients. This puts considerable constraint on comparing the findings from any one study with those from another. It is not a scientifically acceptable state of affairs.

The overall situation is improved only slightly when dimensional scores for different methods are compared. Of the four studies that did this, the interpolated median Pearson’s r values were as follows: highest=0.59, lowest=0.28, and median=0.34. Pearson’s r is used to compare scales with different metrics and is not strictly interpretable in the same way as the intraclass correlation or kappa. Nonetheless, the median figure of 0.34 also reflects an overall poor level of agreement.

The finding of poor diagnostic comparability indicates the necessity of testing and improving the measurement validity of methods for diagnosing axis II disorders, a point made by many of the authors I have cited (13, 17, 26, 32).
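How instrument error alone can depress between-method kappa is easy to illustrate with a toy simulation; the assumptions here are mine, not drawn from the studies reviewed (a 30% true base rate and two instruments that each reproduce a subject's true status 90% of the time):

```python
import random

def kappa_from_ratings(a, b):
    """Cohen's kappa for two parallel lists of 0/1 diagnoses."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                      # base rates
    pe = pa * pb + (1 - pa) * (1 - pb)                   # chance agreement
    return (po - pe) / (1 - pe)

def rate(true_status):
    """One instrument's rating: reproduces the truth 90% of the time."""
    return int(true_status if random.random() < 0.90 else not true_status)

random.seed(0)
n = 2000
truth = [random.random() < 0.30 for _ in range(n)]  # hypothetical 30% base rate
instrument1 = [rate(t) for t in truth]
instrument2 = [rate(t) for t in truth]
k = kappa_from_ratings(instrument1, instrument2)
print(round(k, 2))  # roughly 0.6
```

Even this generous per-instrument accuracy leaves a between-instrument kappa of only about 0.6; the observed median of 0.25 across the studies reviewed implies considerably more non-patient variance than that.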
While some authors have suggested that the use of dimensional scores has advantages over categorical diagnoses (38), the evidence I have described suggests that the use of dimensional scores generally did not raise concordance between methods to acceptable levels. I suggest two different approaches to this problem. The first is to delineate the sources of measurement problems in current instruments in order to improve them. The second is to develop new methods that assess personality disorders in a conceptually different way, with the potential for better measurement validity.

POTENTIAL SOURCES OF MEASUREMENT ERROR IN CURRENT INSTRUMENTS

Early in the development of a field of study, there is a difficulty in disentangling problems due to measurement validity and those involving construct validity (39). Construct validity is demonstrated when findings converge in a coherent way, such as demonstrating Validity in the description, etiology, course, and response to treatment of a given personality disorder and demonstrating that these findings diverge from those for other disorders (40, 41). Construct validity is still at

1650

issue for many of the personality disorders and may therefore constrain any test of the measurement validity of a given instrument. Nonetheless, reviews of the DSM-III-R personality disorders (42, 43) suggest that there is sufficient evidence to warrant further study of the present types. Problems with reliability constrain assessment validity. However, most of the observer-rated instruments for assessing axis II disorders have demonstrated fair to high interrater reliability, thereby minimizing this as a major source of discordance. Dimensional ratings, which have even higher reliabilities, demonstrated only slightly higher correlations ( I 7, 34), except in one study (26). Possible sources of the problem are noted below. The high reliabilities of the instruments suggest that rater variance due to different levels of experience, training, etc. is not a major source of the lack of agreement among instruments. This may be less true for methods that allow for more clinical judgment. However, studies comparing raters with different levels of training or from different sites are largely lacking. Self-report questionnaires solicit subjective data, whereas observer-rated interviews include observational data, allowing clinical judgment a role in interpreting the subject’s responses. The discrepancy between these data sources has been well described (44) and has properly led to caution in interpreting selfreport data on diagnoses (17, 18). This is validated by the findings, shown in table 1, that interview methods demonstrated higher levels of concordance with one another than with self-report instruments. For many criteria (e.g., criterion 7 for narcissistic personality disorder: lack of empathy), a subject might be expected to be a poor judge in the self-report, because the phenomenon requires an external judge. Occasion variance may introduce some disagreement. 
Theoretically, personality disorder diagnoses reflect longstanding characteristics and should have high short-term stability (45). Some studies minimized occasion variance by having the two different assessments on the same day (26, 33), but they still demonstrated problems in diagnostic concordance. Some instruments may be sensitive to state effects or changes. This may be suggested whenever test-retest stability coefficients obtained within instruments are substantially lower than their reliabilities. In addition, certain instruments, such as the Personality Diagnostic Questionnaire, may be sensitive to state effects due to depression, while other instruments are not (17).

Instruments do not yield the same databases, thereby introducing some information variance (25). They often assess the same criterion with different questions (17), thereby producing somewhat different data. At present it is not clear how representative of its respective criterion each question is. This problem may be amplified for self-report instruments, because each patient's interpretation of a question may be somewhat idiosyncratic.

The yes/no answer format in most instruments may produce a serious problem. First, it is unlikely that individuals encode their perceptions and attitudes about themselves and their personalities in the same format as that required by the interview. For example, question 79 from the Diagnostic Interview for Personality Disorders (46) inquires of the subject, "[Have you] often noticed that you don't feel things very deeply?" with a follow-up probe, "Have you ever been told that you seemed like a shallow or superficial kind of person?" This question represents a criterion for histrionic personality. It is readily conceivable that anyone who has low self-esteem or, even worse from a conceptual point of view, a compulsive personality, might respond positively. Asking for examples and then following scoring guidelines should mitigate some scoring errors, but to what extent this is true remains unclear. Even when the interviewer requests an example, the yes/no format simply may not develop sufficient information to ascertain whether a positive answer represents a pervasive, longstanding pattern or a sporadic occurrence. Furthermore, there is a danger that some assessments are overly dependent on the subject's self-report when the criterion reflects an objective phenomenon (e.g., constricted affect, restricted emotionality). The reliance on the yes/no question-and-answer format is a very serious issue, given that all of the instruments use it to a large extent.

The LEAD method of diagnosis may minimize some of these sources of variance. However, its reliability has not yet been established, and the use of data from longitudinal observation during an inpatient hospitalization, while attractive, may not represent the patient's real-life patterns (25, 26). Pilkonis et al. (13) did find good 6-month retest stability of their LEAD diagnosis that at least one personality disorder was present (kappa=0.84), although stability of individual disorders would be expectably lower. Unfortunately, the amount of time and personnel required also detracts from the general applicability of the LEAD method, and standardization of procedures remains to be done.

Am J Psychiatry 149:12, December 1992

IMPROVING THE CLINICAL INTERVIEW

Structured interviews were originally devised to cut down on the problem of unreliability in the clinical interview. However, it is possible that assessment validity has been neglected in favor of readily obtaining reliability. An examination of how the clinical interview might improve validity while preserving reliability is warranted.

A good clinical assessment of personality begins with taking a history. The clinician asks the patient to tell important stories from across the life span, preserving the life context in which these occurred. Memorable events and important vignettes tell the story of the patient's relationships with family, loved ones, friends, authorities, and co-workers at home, at school, at work, and at leisure. Symptoms and the onset of illness are seen as occurring in a context, and life stress, so important to the patient, is given its due, helping to differentiate the reaction types defined by Adolph Meyer


from the longstanding maladaptive traits that DSM-III-R axis II assesses. The database is an aggregate of dramatic or unusual stories balanced by more commonplace vignettes. This results in a more representative sample of behavior and experience than is obtained in response to a format of yes/no questions, even when these are followed by a request for confirmatory examples. Luborsky and colleagues have found that whenever people tell anecdotes about relationships in interviews or psychotherapy sessions, a limited number of patterns of motives, experiences, and interactions can be reliably identified (47). In the clinical interview, the clinician judges that a criterion is met only after reviewing the whole interview and ascertaining that a longstanding pattern is evident in historical context.

There are some challenges that the clinical interview must meet to demonstrate its scientific respectability. First, like the structured interviews, the clinical interview must be replicable and yield highly reliable diagnoses. Both of these aims could be accomplished by the development of guides for conducting the interview and the subsequent rating procedures. This would follow a direction that has proven successful in improving the reliability and comparability of psychodynamic formulations (48), a field which shares much treacherous terrain with axis II. Second, a compendium of good case examples demonstrating both common and unusual patterns that reflect axis II criteria would aid in training clinicians to make comparable ratings. This idea is similar to the idea behind the DSM-III-R Casebook (49). Third, there is a need for renewed interest in developing training procedures for conducting the clinical interview, as well as standardizing the ascertainment of clinical interviewing competence.

The results of this review suggest that demonstrating the high reliability of a diagnostic method is not enough to answer questions about assessment validity.
We now need studies that address validity by comparing the diagnoses made by any two or more methods against some external criterion of validity, such as etiological factors, prediction of course, and treatment response (41). While several studies have begun this work (13, 32, 36), much more evidence is required before we can say that the axis II diagnoses made by any one method are valid. At present, studies using different methods for diagnosing personality disorders on average can be expected to obtain findings that concur at little better than chance levels and reach different conclusions.
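The chance-corrected agreement statistic underlying these comparisons is Cohen's kappa (27). A hypothetical 2x2 illustration (the cell proportions below are invented for exposition, not drawn from the studies reviewed) shows how seemingly substantial raw agreement can still yield a kappa near the 0.25 median reported across the studies:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
% Suppose two methods each diagnose a given disorder in 20% of subjects,
% agreeing on 8% of cases as positive and 68% as negative:
p_o = 0.08 + 0.68 = 0.76
p_e = (0.20)(0.20) + (0.80)(0.80) = 0.68
\kappa = \frac{0.76 - 0.68}{1 - 0.68} = 0.25
```

Thus 76% observed agreement, against base rates of 20% positive and 80% negative, corrects to agreement only one quarter of the way from chance to perfect concordance.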

CONCLUSIONS

The introduction of structured interviews and self-report questionnaires to assess axis II disorders has resulted in improved diagnostic reliability within each method. However, comparisons of any two instruments used with the same subjects reveal more diagnostic disagreement than agreement on average. Using continuous or dimensional scoring improves the situation only marginally. This suggests that studies using different diagnostic instruments can be compared only with great caution. Whether instruments for which no comparison studies are available (46, 50) will demonstrate better diagnostic concordance remains unknown, but their interview formats are by and large similar to those I have reported.

More work on improving the validity of axis II assessments is needed. There may be a variety of reasons for problems with current approaches. Some, such as variance due to different raters, interview occasions, and state changes, are well known. Other reasons may be more specific to the problem of assessing personality features, such as establishing that a specific pattern is pervasive and present over time. Using a guided clinical interview offers one potential solution to this problem. Further study is required to determine the assessment validity of current methods for diagnosing personality disorders.

REFERENCES

1. Spitzer RL, Fleiss JL: A re-analysis of the reliability of psychiatric diagnosis. Br J Psychiatry 1974; 125:341-347
2. Spitzer RL, Forman JBW, Nee J: DSM-III field trials, I: initial interrater diagnostic reliability. Am J Psychiatry 1979; 136:815-817
3. Mellsop G, Varghese F, Joshua S, Hicks A: The reliability of axis II of DSM-III. Am J Psychiatry 1982; 139:1360-1361
4. Spitzer RL, Endicott J: Schedule for Affective Disorders and Schizophrenia (SADS), 3rd ed. New York, New York State Psychiatric Institute, Biometrics Research, 1977
5. Spitzer RL, Endicott J, Robins E: Research Diagnostic Criteria (RDC) for a Selected Group of Functional Disorders, 3rd ed. New York, New York State Psychiatric Institute, Biometrics Research, 1977
6. Reich J: Instruments measuring DSM-III and DSM-III-R personality disorders. J Personality Disorders 1987; 1:220-240
7. Reich J: Update on instruments to measure DSM-III and DSM-III-R personality disorders. J Nerv Ment Dis 1989; 177:366-370
8. Robins LN, Helzer JE, Croughan J, Ratcliff KS: National Institute of Mental Health Diagnostic Interview Schedule: its history, characteristics, and validity. Arch Gen Psychiatry 1981; 38:381-389
9. Robins LN, Helzer JE, Weissman MM, Orvaschel H, Gruenberg E, Burke JD, Regier DA: Lifetime prevalence of specific psychiatric disorders in three sites. Arch Gen Psychiatry 1984; 41:949-958
10. Loranger AW: Personality Disorders Examination (PDE) Manual. Yonkers, NY, DV Communications, 1988
11. Loranger AW, Susman VL, Oldham JM, Russakoff LM: The Personality Disorders Examination (PDE): a preliminary report. J Personality Disorders 1987; 1:1-13
12. Standage K, Ladha N: An examination of the reliability of the Personality Disorders Examination and a comparison with other methods of identifying personality disorders in a clinical sample. J Personality Disorders 1988; 2:267-271
13. Pilkonis PA, Heape CL, Ruddy J, Serrao P: Validity in the diagnosis of personality disorders: the use of the LEAD standard. Psychol Assessment 1991; 3:1-9
14. Spitzer RL, Williams JBW, Gibbon M: Structured Clinical Interview for DSM-III-R Personality Disorders (SCID-II). New York, New York State Psychiatric Institute, Biometrics Research, 1987
15. Pfohl B, Stangl DA, Zimmerman M: Structured Interview for DSM-III Personality Disorders (SIDP). Iowa City, University of Iowa, 1982
16. Stangl D, Pfohl B, Zimmerman M, Bowers W, Corenthal C: A structured interview for the DSM-III personality disorders: a preliminary report. Arch Gen Psychiatry 1985; 42:591-596
17. Hyler SE, Rieder RO, Williams JBW, Spitzer RL, Lyons M, Hendler J: A comparison of clinical and self-report diagnoses of DSM-III personality disorders in 552 patients. Compr Psychiatry 1989; 30:170-178
18. Shea MT, Glass DR, Pilkonis PA, Watkins J, Docherty JP: Frequency and implications of personality disorders in a sample of depressed outpatients. J Personality Disorders 1987; 1:27-42
19. Hyler SE, Rieder RO, Williams JBW, et al: The Personality Diagnostic Questionnaire-Revised (PDQ-R). New York, New York State Psychiatric Institute, Biometrics Research, 1987
20. Millon T: Millon Clinical Multiaxial Inventory Manual, 3rd ed. Minneapolis, National Computer Systems, 1983
21. Millon T: Millon Clinical Multiaxial Inventory Manual Supplement. Minneapolis, National Computer Systems, 1984
22. Spitzer RL: Psychiatric diagnosis: are clinicians still necessary? Compr Psychiatry 1983; 24:399-411
23. Skodol AE, Rosnick L, Kellman D, Oldham JM, Hyler SE: Validating structured DSM-III-R personality disorder assessments with longitudinal data. Am J Psychiatry 1988; 145:1297-1299
24. Skodol AE, Rosnick L, Kellman D, Oldham JM, Hyler S: Development of a procedure for validating structured assessments of axis II, in Personality Disorders: New Perspectives on Diagnostic Validity. Edited by Oldham JM. Washington, DC, American Psychiatric Press, 1991
25. Skodol A, Oldham J, Rosnick L, Kellman HD, Hyler S: Diagnosis of DSM-III-R personality disorders: a comparison of two structured interviews. Int J Methods Psychiatr Res 1991; 1:13-26
26. Zimmerman M, Coryell W: Diagnosing personality disorders in the community: a comparison of self-report and interview measures. Arch Gen Psychiatry 1990; 47:527-531
27. Cohen J: A coefficient of agreement for nominal scales. Educational and Psychol Measurement 1960; 20:37-46
28. Shrout PE, Spitzer RL, Fleiss JL: Quantification of agreement in psychiatric diagnosis revisited. Arch Gen Psychiatry 1987; 44:172-177
29. Fleiss JL: Statistical Methods for Rates and Proportions, 2nd ed. New York, John Wiley & Sons, 1981
30. Shrout PE, Fleiss JL: Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86:420-428
31. Fleiss JL, Cohen J: The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychol Measurement 1973; 33:613-619
32. Perry JC, Lavori PW, Cooper SH, Hoke L, O'Connell ME: The Diagnostic Interview Schedule and DSM-III antisocial personality disorder. J Personality Disorders 1987; 1:121-131
33. Hyler SE, Skodol AE, Kellman HD, Oldham JM, Rosnick L: Validity of the Personality Diagnostic Questionnaire-Revised: comparison with two structured interviews. Am J Psychiatry 1990; 147:1043-1048
34. Hogg B, Jackson HJ, Rudd RP, Edwards J: Diagnosing personality disorders in recent-onset schizophrenia. J Nerv Ment Dis 1990; 178:194-199
35. O'Boyle M, Self D: A comparison of two interviews for DSM-III-R personality disorders. Psychiatry Res 1990; 32:85-92
36. Oldham JM, Skodol AE, Kellman HD, Hyler SE, Rosnick L, Davies M: Diagnosis of DSM-III-R personality disorders by two structured interviews: patterns of comorbidity. Am J Psychiatry 1992; 149:213-220
37. Jackson HJ, Gazis J, Rudd RP, Edwards J: Concordance between two personality disorder instruments with psychiatric inpatients. Compr Psychiatry 1991; 32:252-260
38. Widiger TA: Personality disorder dimensional models proposed for DSM-IV. J Personality Disorders 1991; 5:386-398
39. Cronbach LJ, Meehl PE: Construct validity in psychological tests. Psychol Bull 1955; 52:281-302
40. Blashfield RK: The Classification of Psychopathology: Neo-Kraepelinian and Quantitative Approaches. New York, Plenum Press, 1984
41. Perry JC: Challenges in validating personality disorders: beyond description. J Personality Disorders 1990; 4:273-289
42. Widiger TA, Frances A, Spitzer RL, Williams JBW: The DSM-III-R personality disorders: an overview. Am J Psychiatry 1988; 145:786-795
43. Perry JC, Vaillant GE: Personality disorders, in Comprehensive Textbook of Psychiatry, 5th ed, vol 1. Edited by Kaplan HI, Sadock BJ. Baltimore, Williams & Wilkins, 1989
44. Fiske DW: Measuring the Concepts of Personality. Chicago, Aldine, 1971
45. Perry JC: Use of longitudinal data to validate personality disorders, in Personality Disorders: New Perspectives on Diagnostic Validity. Edited by Oldham JM. Washington, DC, American Psychiatric Press, 1991
46. Zanarini MC, Frankenburg FR, Chauncey DL, Gunderson JG: The Diagnostic Interview for Personality Disorders: interrater and test-retest reliability. Compr Psychiatry 1987; 28:467-480
47. Luborsky L, Crits-Christoph P: Understanding Transference: The CCRT Method. New York, Basic Books, 1990
48. Perry JC: Scientific progress in psychodynamic formulation. Psychiatry 1989; 52:245-249
49. Spitzer RL, Gibbon M, Skodol AE, Williams JBW, First MB: DSM-III-R Casebook. Washington, DC, American Psychiatric Press, 1988
50. Widiger TA, Trull TJ, Hurt SW, Clarkin JF, Frances A: A multidimensional scaling of the DSM-III personality disorders. Arch Gen Psychiatry 1987; 44:557-563
