COMMENTS

A Reply to Beutler et al.'s Study: Some Sources of Variance in Accurate Empathy Ratings

In a recent article, Beutler, Johnson, Neville, and Workman presented data in defense of using the number of patients rather than the number of therapists as the unit of analysis for assessing the reliability of process psychotherapy variables. This reply presents data suggesting that the Beutler et al. interpretation of their research findings is not valid.

In recent years, the unit of analysis for assessing the reliability of process psychotherapy variables has come under increased scrutiny. The issue is of crucial importance, because it calls into serious question the vast majority of process psychotherapy variables investigated over the past two decades.

Basically, there are two opposing schools of thought. One approach holds that the unit of analysis should be the number of therapist-patient interaction units, irrespective of the number of patients or the number of therapists used in a given study. In practice, this has been the usual unit of assessment, and it accounts for the bulk of published reliability studies of process psychotherapy variables. Some of the more recent proponents of this viewpoint include Beutler, Johnson, Neville, and Workman (1973) and Truax and Carkhuff (1967).

Others hold that the number of therapists, not the number of therapist-patient interaction units, is the appropriate unit of analysis. In this regard, Chinsky and Rappaport (1970) presented data indicating that reliability estimates are in fact increased by using few therapists and many patient responses. Correspondingly, these estimates are decreased by using many therapists, each of whom has been evaluated only once by a given set of raters. Chinsky and Rappaport presented some of Truax and Carkhuff's (1967) data on accurate empathy to support this point.
Specifically, Chinsky and Rappaport presented the following:

Examination of 28 AE [accurate empathy] reliability coefficients reported by Truax and Carkhuff (1967) indicates that 15 of the 16 highest reliabilities (r > .70) were obtained when the number of therapists was 15 or less. In only one of the five ratings using more than 15 therapists did the reliability exceed .70. (italics added, Chinsky & Rappaport, 1970, pp. 380-381)

This study was supported by the West Haven Veterans Administration Hospital (MRIS 1416).

This issue can be labeled as one of inflated reliability coefficients. The primary source of the problem appears to be that variables such as accurate empathy have been assessed by listening to taped recordings of many therapist-patient interactions drawn from a very small number of therapists. This design makes it very easy for raters to recognize or differentiate among the voices of the therapists under study and to continue to rate a given therapist consistently on the basis of this fact alone. In one such study, four therapists treated 10 patients (Truax, 1970). Each patient was rated six times so that each therapist was heard by the raters 60 times. Rappaport and Chinsky (1972) concluded that

It is reasonable to assume that any college student who is capable of discriminating therapist empathy ought to be astute enough to recognize the voices of four therapists repeated 60 times each, and to continue to rate those therapists consistently on the basis of recognition alone. (Rappaport & Chinsky, 1972, p. 404)

With respect to their own experiences with the same phenomenon, Rappaport and Chinsky (1972) stated that

In our own attempts to train raters in the use of the AE [Accurate Empathy] scale, the experience that stimulated our earlier critique, we found that when we had raters rate the same therapists several times they reported to us that they could not help but remember how they rated each therapist previously. (Rappaport & Chinsky, 1972, p. 404)

It should be noted that these empirical arguments are also consistent with the views of prominent statisticians such as Maxwell (1968).

Truax (1972), in a reply to these criticisms, argued that statistics such as the Ebel (1951) intraclass r control for the so-called inflated reliability problem. This is, in fact, not true. Moreover, no available statistical tests can control for the recognition problem that is assumed to cause the inflated reliability coefficients. Bartko (1966, pp. 5-6), in fact, has shown that Ebel's statistic is not valid because the assumptions underlying the analysis of variance model, on which it is based, were violated by the author in the development of his statistical approach.

More recently, Beutler et al. (1973) tested further the contrary arguments presented by Truax (1972) on the one hand and Chinsky and Rappaport (1970) on the other. Beutler et al. presented data to indicate that the reliability coefficients for therapists' accurate empathy behavior, based on the number of therapists, were actually higher than those based on the number of patients, and thereby claimed support for the position of Truax as opposed to that of Chinsky and Rappaport. However, their argument appears suspect for the following reasons:

1. Beutler et al. used transcripts rather than tapes, thereby eliminating entirely the design factor suspected of artificially inflating rater reliability coefficients, namely, familiarity with the sound of the therapists' voices.

2. Beutler et al. used as their unit of analysis the number of patients rather than the usual number of therapeutic interactions, which must, perforce, always be much larger. For example, in 24 studies reviewed by Truax and Carkhuff (1967) the number of patients varied between 3 and 160, whereas the number of therapeutic interactions varied between 28 and 698. This procedure would also militate against inflation of interrater reliability coefficients.

3. Finally, the two rater correlations (those based on the number of therapists and those based on the number of patients) were not significantly different from each other.

In summary, the data presented by Beutler et al. simply do not support the arguments of Truax.
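The recognition argument discussed above can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not a model proposed by any of the cited authors: it assumes that voice recognition adds a therapist-level impression shared by both raters (called `halo` here), on top of a weak shared segment-level empathy signal and independent rater noise. The function names, weights, and replication count are hypothetical choices made for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_therapists, n_segments, recognition, rng,
             signal_weight=0.5, halo_weight=1.0):
    """Pooled inter-rater r over all segments of one simulated study.

    ASSUMPTION: when `recognition` is True, both raters share one
    therapist-level impression (`halo`) that is unrelated to the
    empathy shown in any particular segment.
    """
    rater_a, rater_b = [], []
    for _ in range(n_therapists):
        halo = rng.gauss(0, 1) if recognition else 0.0
        for _ in range(n_segments):
            empathy = rng.gauss(0, 1)  # true segment-level empathy
            shared = signal_weight * empathy + halo_weight * halo
            # Each rater adds independent judgment error.
            rater_a.append(shared + rng.gauss(0, 1))
            rater_b.append(shared + rng.gauss(0, 1))
    return pearson(rater_a, rater_b)


def mean_r(n_therapists, n_segments, recognition, reps=200, seed=0):
    """Average pooled r over `reps` simulated studies."""
    rng = random.Random(seed)
    return mean(simulate(n_therapists, n_segments, recognition, rng)
                for _ in range(reps))


if __name__ == "__main__":
    # Four therapists heard 60 times each, voices recognizable (tapes).
    print("tapes, 4 x 60:       ", round(mean_r(4, 60, True), 2))
    # Same design, recognition removed (e.g., transcripts).
    print("transcripts, 4 x 60: ", round(mean_r(4, 60, False), 2))
    # Many therapists, each rated once, so recognition cannot operate.
    print("240 therapists x 1:  ", round(mean_r(240, 1, False), 2))
```

Under these assumptions, the pooled coefficient for the taped design is higher than for the other two even though the raters agree no better about segment-level empathy. The sketch shows only the direction of the effect. Its size depends entirely on the hypothetical weights.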
In fact, for the reasons cited above, the study was not appropriately designed to test the opposing positions of Truax and of Chinsky and Rappaport.

One cannot help but realize the immense importance of this issue. Accordingly, we are undertaking a more definitive study than that of Beutler et al., in which we will directly compare the results of assessing rater reliability by systematically varying (a) the number of therapists; (b) the number of patients; and (c) the number of therapist-patient interaction units. The data, based on over 100 hours of psychotherapy interviews, will also assess the effects of the following variables on the reliability of psychotherapy process variables: (a) the sex of the therapist; (b) the sex of the patient; (c) the time of the interview (initial, second, third); and (d) the size of the sampled tape segment (2 minutes, 4 minutes, 8 minutes, 16 minutes, or the entire hour). Here we will apply criteria specified by Kiesler (1966, 1973), Kiesler, Klein, and Mathieu (1965), and Kiesler, Mathieu, and Klein (1964).

REFERENCES

Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 1966, 19, 3-11.

Beutler, L. E., Johnson, D. T., Neville, C. W., & Workman, S. N. Some sources of variance in "accurate empathy" ratings. Journal of Consulting and Clinical Psychology, 1973, 40, 167-169.

Chinsky, J. M., & Rappaport, J. Brief critique of the meaning and reliability of "accurate empathy" ratings. Psychological Bulletin, 1970, 73, 379-382.

Ebel, R. L. Estimation of the reliability of ratings. Psychometrika, 1951, 16, 407-424.

Kiesler, D. J. Basic methodologic issues implicit in psychotherapy process research. American Journal of Psychotherapy, 1966, 20, 135-155.

Kiesler, D. J. The process of psychotherapy: Empirical foundations and systems of analysis. Chicago: Aldine, 1973.

Kiesler, D. J., Klein, M. H., & Mathieu, P. L. Sampling from the recorded therapy interview: The problem of segment location. Journal of Consulting Psychology, 1965, 29, 337-344.

Kiesler, D. J., Mathieu, P. L., & Klein, M. H. Sampling from the recorded therapy interview: A comparative study of different segment lengths. Journal of Consulting Psychology, 1964, 28, 349-357.

Maxwell, A. E. The effect of correlated errors on estimates of reliability coefficients. Educational and Psychological Measurement, 1968, 28, 803-811.

Rappaport, J., & Chinsky, J. M. Accurate empathy: Confusion of a construct. Psychological Bulletin, 1972, 77, 400-404.

Truax, C. B. Length of therapist response, accurate empathy, and patient improvement. Journal of Clinical Psychology, 1970, 26, 539-541.

Truax, C. B. The meaning and reliability of accurate empathy ratings. Psychological Bulletin, 1972, 77, 397-399.

Truax, C. B., & Carkhuff, R. R. Toward effective counseling and psychotherapy: Training and practice. Chicago: Aldine, 1967.

(Received June 27, 1975)

Domenic V. Cicchetti and Edward R. Ryan Veterans Administration Hospital West Haven, Connecticut 06516 and Yale University


