Perceptual and Motor Skills, 1975, 40, 415-423. © Perceptual and Motor Skills 1975

JUDGMENT ERROR IN CATEGORY VS MAGNITUDE SCALES

RICHARD K. EYMAN, Pacific-Neuropsychiatric Institute Research Group, Pomona, California
P. JOHN KIM, University of Oregon Medical School
AND TOM CALL, Pacific-Neuropsychiatric Institute Research Group, Pomona, California

Summary.-The effects of a variety of experimental conditions on the judgments (length of lines) of 16 normal and 16 mentally retarded observers were examined using category and magnitude scaling techniques. Using error and variability of judgment as criteria for measuring response bias, for normal subjects knowledge about the stimulus range, whether learned or provided, had as much to do with resulting judgments as the type of scale used. Judgment error of the retarded group was significantly greater than that of the normal group and appeared to be related to their limited ability to assign categories or proportions to the stimuli used.


Based on the recent literature (Ekman & Sjoberg, 1965; Eyman & Kim, 1970; Luce, 1972; Mashhour & Hosman, 1968; Poulton, 1968; Ross & DiLollo, 1971; Savage, 1970; Stevens, 1971; Teghtsoonian, 1971), considerable controversy still surrounds psychophysical measurement. Aside from the theoretical and philosophical issues pertaining to "a psychophysical law" and "fundamental measurement" (Luce, 1972; Savage, 1970), there are also practical considerations summarized by Poulton (1968) regarding response learning and response bias inherent in the estimation of sensory magnitude. On this latter problem, Stevens (1971, p. 448) stated, "In my own view, the problems of response bias rate no better than a nuisance, an interesting nuisance perhaps, . . . but nevertheless a diversion from the basic business of sorting out the fundamental principles." At the root of these contentions is the merit of alternative sensory scales, e.g., discrimination, category, and magnitude scales (Ekman & Sjoberg, 1965). Helson (1964) has argued that no one scale can be considered better than other scales for measuring sensation. Stevens (1971) has provided insight into this classical argument, when he stated:

"But opinions differ about problems and solutions [re selection of the best sensory scale]. How else can we understand the decision of an experimenter to limit the observer's responses to a finite set of numbers, such as 1 to 7 or 1 to 20? That curious maneuver of constraining the observer's responses is a tactic that seems somewhat mulish to those who have allowed the observer a full range of responses and have witnessed the greater usefulness of the resulting ratio scales. It may be true that in the long run, superior procedures tend to replace inferior procedures, but in almost two decades of practice, magnitude estimation has not displaced category estimation-nor does it seem likely to do so any time soon" (p. 448).

R. K. EYMAN, ET AL.


In contrast to Stevens' view, Poulton has concluded: "There does not appear to be a stable relationship between an observer's use of numbers or similar measuring techniques and his perception of physical magnitudes." Other investigators share Poulton's belief (Mashhour & Hosman, 1968; Eyman & Kim, 1970; Ross & DiLollo, 1971; Parducci, 1963; Ekman & Sjoberg, 1965; Cliff, 1973) and conclude that any scale is no better than the experimental conditions used to obtain it. Most studies examining the relation between category and magnitude scales did not control for the relative effects of contextual variables such as placement of standard stimuli, the order of presentation of category vs magnitude scales, etc. (Eisler, 1963; Ekman & Künnapas, 1961, 1963; Eisler, 1962; Koh & Shears, 1970; Galanter & Messick, 1961; Schneider & Lane, 1963; Stevens & Guirao, 1962, 1963; Stevens & Stone, 1959). It was generally assumed that all subjects were equally competent and that practice sessions prior to the experiment would nullify adaptation or learning effects. In contrast to these studies, Poulton (1968), Poulton and Simmonds (1963), and Pradhan and Hoffman (1963), among other studies reviewed by Cliff (1973), found significant context effects on category and magnitude scale values.

The purpose of this paper is to examine judgment error made by normal and mentally retarded observers under a variety of experimental conditions using category and magnitude scaling techniques. Hopefully, a better understanding will be obtained of the degree to which contextual variables affect response bias (measured as judgment error) of relatively sophisticated normal observers in contrast to unsophisticated retarded observers. Although numerous studies have investigated context effects on psychophysical functions, little is known about intelligence effects on judgment error (Eyman, 1967; Eyman & Kim, 1970).
METHOD

Judgment Error

Judgment error is defined as the difference between the subject's response and the "correct response" in terms of the stimulus presented, which must be adjusted to the rating scale being used. For example, in magnitude estimation, if a line 20 cm. long is compared to a standard 10 cm. long which is called "50," the correct response would be "100." Similarly, in category scaling, the smallest stimulus belongs in the first category and so on. Eyman and Kim (1970) discussed a number of "standardized" measures of judgment error, two of which are appropriate to this study:

(1) Judgment Error = $(\hat{\mu}_{i.k} - S_k)/S_k$

(2) Judgment Variability = $\hat{\sigma}_{i.k}/S_k$

where $\hat{\mu}_{i.k}$ is the average response of individual i over J occasions for stimulus level $S_k$, and $\hat{\sigma}_{i.k}$ is the standard deviation of the responses of individual i over J occasions for stimulus level $S_k$. The equation for judgment error counts any deviation from the ratio of systematic error to stimulus level as error. Variability of judgment was also divided by stimulus level, providing a coefficient of variation.
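The two standardized measures can be illustrated with a minimal sketch. It assumes that the stimulus level is expressed in response-scale units (that is, as the "correct response"), and that the standard deviation is the sample (n - 1) form; the paper does not state which form was used. The function names and the four example responses are hypothetical.

```python
from statistics import mean, stdev


def judgment_error(responses, correct):
    """(mean response - correct response) / correct response.

    `correct` is the stimulus level on the response scale (assumption).
    """
    return (mean(responses) - correct) / correct


def judgment_variability(responses, correct):
    """Standard deviation of the responses divided by the correct response.

    Uses the sample (n - 1) standard deviation; this choice is an assumption.
    """
    return stdev(responses) / correct


# Hypothetical example: four repeated magnitude estimates of a line whose
# correct response is 100 (a 20-cm. line against a 10-cm. standard called 50).
responses = [95, 100, 110, 105]
print(judgment_error(responses, 100))        # 0.025
print(judgment_variability(responses, 100))  # about 0.0645
```

A positive judgment error indicates overestimation relative to the correct response, and a negative value indicates underestimation.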

A

Standardizing ( w . r- Sk) and a i , ~ by stimulus level Sk was intended to adjust error and variability for more meaningful interpretation by providing a single point of departure from which to make comparisons between rating scales and other experimental conditions. Preference for such standardized measures is dictated by various statistical advantages: statistical inferences, identification of source and size of variacion, ease of interpretation in terms of linear models, e.g., analysis of variance, etc. (Eyman & Kim, 1970). Judgments of length of lines were made by normal and mentally retarded observers. Line lengths were selecced as stimuli based on Scevens' (1966b, p. 399) contention that they represent one of the more stable perceptual tasks. Thirty-two subjects (16 normal and 16 retarded) made four repetitious category and magnitude estimates of the Length of 11 projected lines (7.5 cm., 15 cm., 22.5cm., 30cm., 37.5 cm.,45 cm.,52.5 cm.,60cm.,67.5 cm.,75 cm., 82.5 cm.). The order of the scaling method and placement of the standard stimuli were varied according ro Table 1. Four subjects were randomly assigned to each of the four experimental conditions specified in Table 1. Hence, a tocal of 16 subjects from each intelligence group were examined to accommodate the four experimental conditions. All subjects were instructed to make both category and magnitude estimates because of the possible effects of making category ratings after magnitude estimates or vice versa. TABLE 1

DESIGN USED FOR ORDER OF SCALING METHOD AND PLACEMENT OF STANDARD STIMULI FOR CATEGORY AND MAGNITUDE ESTIMATES

| Placement of standard stimuli | Category estimates presented first | Magnitude estimates presented first |
|---|---|---|
| Identical standard used from middle of stimulus range for both methods | Condition A | Condition B |
| Shortest and longest lines are standards for category estimates; shortest line is standard for magnitude estimates | Condition C | Condition D |
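The design can be enumerated in a short sketch. It builds the three-letter condition codes used in the Results (e.g., NCA for Normal, Category, Condition A) and shows one way four subjects per group could be randomly assigned to each of Conditions A to D. The assignment routine, its seed, and the subject identifiers are hypothetical; the paper does not describe the randomization procedure.

```python
import random
from itertools import product

GROUPS = "NP"          # N = normal, P = patient (retarded group)
SCALES = "CM"          # C = category, M = magnitude
CONDITIONS = "ABCD"    # cells of Table 1
N_STIMULI = 11         # line lengths from 7.5 cm. to 82.5 cm.


def condition_codes():
    """Return the 16 codes NCA ... PMD (group x scale x condition)."""
    return [g + s + c for g, s, c in product(GROUPS, SCALES, CONDITIONS)]


def assign_subjects(subject_ids, seed=0):
    """Randomly split 16 subjects of one group into four per condition."""
    shuffled = list(subject_ids)
    random.Random(seed).shuffle(shuffled)
    return {c: shuffled[4 * i:4 * (i + 1)] for i, c in enumerate(CONDITIONS)}


codes = condition_codes()
print(len(codes), len(codes) * N_STIMULI)  # 16 176
print(assign_subjects(range(1, 17)))
```

Because every subject gave both category and magnitude estimates, the 16 codes describe 16 sets of judgments from 8 groups of four subjects, not 16 independent samples of subjects.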

The mentally retarded subjects were resident patients committed to Pacific State Hospital and had IQs between 50 and 70 (Stanford-Binet, Form L, M, or L-M). The normal subjects were volunteers from the research staff, most of whom were graduate students in psychology or statistics at one of the nearby universities. All subjects had 20-20 correctable vision and were checked for prescribed medications. Ages of all subjects varied between 21 and 45 yr. In an attempt to provide and standardize motivation for the task, $20.00 in cash


was offered to the individual in each group whose judgments were the most accurate.

As Table 1 indicates, half the subjects were presented with the 45-cm. projected line as a standard for both category and magnitude estimates, and the remaining subjects were presented with 7.5-cm. and 82.5-cm. lines as standards for the category judgments and the 7.5-cm. line as a standard for the magnitude estimations. After every sixth trial, these standards were repeated for all subjects. Order of presentation of category vs magnitude scales was similarly varied. Instructions followed the usual format (Stevens, 1966a). In the category situation, observers were directed to assign categories representing equal spacing from one or two standards (Table 1). In the magnitude situation, the observer judged the projected variable lines proportional to a standard of 60 (45 cm.) or 10 (7.5 cm.) as indicated in Table 1. Admittedly, producing a category scale based on a single standard is a difficult task but a necessary one if a comparison of magnitude and category judgments with identical standards was to be made. Other than this variation, the procedures used are quite common in psychophysics (Ekman & Sjoberg, 1965).

All subjects were given practice periods preparatory to making their judgments. Pretest slides were different from those used in the test trials. All retarded subjects were screened for their ability to make rational estimates using either scale. A practice session involved a total of 104 category and magnitude judgments.

Table 2 provides an enumeration of experimental conditions to be compared regarding degree of judgment error and variability. The design illustrated is a one-way layout analysis of variance with 11 "observations" (stimulus levels) per cell.¹ The intent of this design is to characterize the relative amount of judgment error and judgment variability using the experimental conditions listed.
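The "correct response" implied by these instructions can be sketched as follows. For magnitude estimation the correct response is taken to be proportional to line length relative to the standard. For category scaling the sketch assumes one category per stimulus (11 equally spaced categories), which the paper does not state explicitly. All function and variable names are hypothetical.

```python
# The 11 projected line lengths in cm.: 7.5, 15, ..., 82.5
STIMULI_CM = [7.5 * k for k in range(1, 12)]


def correct_magnitude(length_cm, standard_cm, standard_value):
    """Correct magnitude estimate: proportional to the standard."""
    return standard_value * length_cm / standard_cm


def correct_category(length_cm, stimuli=STIMULI_CM):
    """Correct category, assuming one category per stimulus (assumption)."""
    return stimuli.index(length_cm) + 1


# Worked example from the Method section: a 20-cm. line against a
# 10-cm. standard called "50" should be called "100".
print(correct_magnitude(20, 10, 50))  # 100.0

# The two standards used here (60 at 45 cm., 10 at 7.5 cm.) imply the
# same correct values, running from 10 to 110 in steps of 10.
middle = [correct_magnitude(s, 45, 60) for s in STIMULI_CM]
low = [correct_magnitude(s, 7.5, 10) for s in STIMULI_CM]
print(middle == low)
```

Since both standards define the same ratio scale, differences in magnitude-estimation error between conditions reflect where the standard sits in the stimulus range, not a change in the numerical scale itself.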
Intuitively, the replacement of the customary "average response" by judgment error will provide a more sensitive measure of the differential effect of contextual variables on the judgment process.

RESULTS

A preliminary examination of the error distributions revealed varying degrees of non-homogeneity of variance and non-normality among the 16 experimental conditions, particularly in connection with the retarded group. Hence, a rank transformation was employed to resolve these complications. Essentially, each stimulus level within each of 16 experimental conditions was given a rank with respect to judgment error and variability. Of primary concern was the number of ties which occurred for the two statistics studied. Since there were

¹The distribution of judgment error and variability over the 11 stimulus levels was investigated separately for normal subjects by Eyman and Kim (1970). Although placement of the standard stimulus affected these distributions (minimum judgment error and variability around the standard stimuli), greater differences were noted in over-all level of error across the 11 stimuli in terms of intelligence and the contextual variables studied.


a limited number of ties involved in these data, the rank transformation was considered adequate for the analysis. For ties, average ranks were assigned.

Table 2 presents the rank means for each of the 16 experimental conditions regarding judgment error and judgment variability. Since ranks were assigned in accordance with the magnitude of this error and variability, the higher the rank value, the more error or variability present. In this regard it is apparent that the retarded group was far less accurate in its judgments than the normal group. Also, magnitude estimates evinced more error than category judgments. The Kruskal-Wallis one-way analysis of variance (Siegel, 1956, p. 188) provided significant H values (corrected for ties) of 95.70 and 82.41 (df = 15, P < .01) for judgment error and judgment variability respectively. Separate Kruskal-Wallis analysis of variance tests were done on recomputed rank means within the normal and retarded groups. For the normal group, H values of 37.15 and 26.30 (df = 7) for judgment error and variability were also significant (P < .01), as were H values of 35.13 and 23.84 for the retarded group.

Fig. 1 shows a plot of the rank means of judgment error by judgment variability for the total sample. As would be expected, variability generally increased with error. Since the relative position of the rank means for the total sample was

TABLE 2

DESIGN LAYOUT AND RANK MEANS FOR MULTIPLE COMPARISON OF 16 EXPERIMENTAL CONDITIONS

| Type S | Scale | Exp. Condition (see Table 1) |
|---|---|---|
| 1. Normal | Category | (NCA) |
| 2. Normal | Category | (NCB) |
| 3. Normal | Category | (NCC) |
| 4. Normal | Category | (NCD) |
| 5. Normal | Magnitude | (NMA) |
| 6. Normal | Magnitude | (NMB) |
| 7. Normal | Magnitude | (NMC) |
| 8. Normal | Magnitude | (NMD) |
| 9. Patient | Category | (PCA) |
| 10. Patient | Category | (PCB) |
| 11. Patient | Category | (PCC) |
| 12. Patient | Category | (PCD) |
| 13. Patient | Magnitude | (PMA) |
| 14. Patient | Magnitude | (PMB) |
| 15. Patient | Magnitude | (PMC) |
| 16. Patient | Magnitude | (PMD) |

[The Rank M* columns for Judgment Error and Judgment Variability are not legible in this copy.]

*Ranks were assigned to judgment error $(\hat{\mu}_{..k} - S_k)/S_k$ and judgment variability $\hat{\sigma}_{..k}/S_k$ for each of 11 stimulus levels within each of 16 experimental conditions, producing 176 ranks. The responses of the four Ss to four repetitions of a specified stimulus within each experimental condition were pooled to achieve more stable estimates of $\hat{\mu}_{..k}$ and $\hat{\sigma}_{..k}$ (Scheffé, 1959).


FIG. 1. Rank means of judgment error $(\hat{\mu}_{..k} - S_k)/S_k$ plotted by judgment variability $\hat{\sigma}_{..k}/S_k$ for 16 experimental groups (see Table 2 and Results for code)

identical to that for the separate normal and retarded samples, data on the latter groups were omitted for the sake of brevity.

For the normal subjects, three experimental conditions accounted for the least judgment error and variability, namely, category ratings with standard stimuli at either end of the stimulus range (NCD, NCC) and magnitude estimates given after the category ratings with a standard stimulus at the low end of the stimulus range (NMC). In the case of the category ratings (NCD, NCC), two standard stimuli at the high and low end of the stimulus range were likely responsible for the accuracy noted. In the case of the magnitude estimates, the experimental condition employed (NMC) used only one standard stimulus at the low end of the stimulus range, but subjects had had the opportunity to make these judgments after making category estimates with the two standard stimuli. Hence, it appeared that familiarity with stimulus range in terms of the category ratings increased the accuracy of the magnitude estimates made later. In contrast, when magnitude estimates were presented before category ratings using the shortest line as a standard (NMD), error and variability were highest for the normal group. Category ratings made with a single standard from the middle of the stimulus range were also subject to higher error than


with two standards. Magnitude estimates tended to be less accurate over-all than category estimates among the normal subjects and in all judgments an order effect can be observed. The judgments of the retarded subjects were quite different from those of the normal group. For example, the magnitude estimates of the retarded were extremely poor and indicative of their limited ability to use numbers as ratios of a standard stimulus. Category ratings for the retarded group were much better although not as accurate as the normal group. Also, these findings are similar to those of Eyman (1967), Hawkins, et al. (1966), and Holowinsky (1964) in which visual and kinesthetic judgments of retarded subjects were less accurate than those of normal subjects. The order effect noted for the normal subjects was absent among the retarded group.
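The rank-based analysis described in the Results can be sketched in plain Python. This is an illustrative implementation of the textbook Kruskal-Wallis H statistic with average ranks for ties and the usual tie correction, not the authors' own computation; the function names and the small data sets are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H corrected for ties, degrees of freedom)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

With 16 experimental conditions of 11 stimulus levels each, the same function would rank 176 values and return 15 degrees of freedom, matching the df reported for the total sample.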

DISCUSSION

An attempt has been made to examine suggested measures of "judgment error" and "judgment variability" relative to the sophistication of the subjects, type of rating scale, placement of the standard stimulus, and an order or transfer effect. As Cliff (1973) has noted, all judgments are subject to biasing influences so that one really never knows when he has the "right" set of judgments. Current thinking on these issues favors shifting attention away from "psychophysical laws" based on the behavior of the sense organs, over to the mechanisms of response bias (Poulton, 1968; Cliff, 1973). Intuitively, characterizing judgment error under specified experimental conditions seems to be a practical way to examine response bias.

There are a number of points to be made from the results of this study as well as other studies. Magnitude estimates are difficult to make and highly sensitive to familiarity with the stimulus range, which can be affected through prior experience viewing the stimuli (order effect). Although this result is not new (Poulton, 1968), it is interesting to note that the size of error in magnitude estimates approached the lower level found in category ratings when two standard stimuli were used. This result could be attributed to the fact that the subjects learned to identify the limits of the stimulus range inasmuch as these particular magnitude estimates were made after the category ratings with standard stimuli at the upper and lower ends of the stimulus range. Conjointly, Gregson, et al. (1969) found that two anchor stimuli used with ratio judgments altered the resulting scale to take the form of a category scale. However, error associated with category ratings, e.g., NCA, where one standard stimulus was used, approached the higher level of error generally found for the magnitude estimates.
Clearly, for competent observers, knowledge about the stimulus range whether learned or provided, has as much to do with the resulting judgments as the instructions given regarding the way in which the judgments are to be made. This conclusion is consistent with other studies reported by Cliff (1973, pp. 492, 495).


In the case of the retarded subjects, magnitude estimates were possible to make but extremely biased by the limited ability of these observers to use ratios or proportions. Although a lack of sophistication among the retarded subjects had less effect on the category ratings, they contained more error and variability than was found for the normal subjects. It was also evident that an order effect could not be detected for the retarded observers. It is probably questionable whether either scale could be recommended for use with such individuals. Although it is generally conceded that visual acuity of the mentally retarded is less than that found for normal observers, it is still plausible that some of the problem can be attributed to the more limited ability of the retarded to assign categories to their perceptions (Eyman, 1967; Holowinsky, 1964).

From the evidence presented here and elsewhere, it appears that there is good reason for the continued use of category and pair comparison discrimination scales. Furthermore, the latter scale seems more appropriate to use in connection with retarded subjects insofar as pair comparison ratings are easier to make than either category or magnitude estimates (Eyman, 1967).

REFERENCES

CLIFF, N. Scaling. In P. H. Mussen & M. R. Rosenzweig (Eds.), Annual review of psychology. Vol. 24. Palo Alto, Calif.: Annual Reviews Inc., 1973. Pp. 473-507.

EISLER, H. Empirical test of a model relating magnitude and category scales. Scandinavian Journal of Psychology, 1962, 3, 88-96.

EISLER, H. Magnitude scales, category scales and Fechnerian integration. Psychological Review, 1963, 70, 243-253.

EKMAN, G., & KÜNNAPAS, T. Measurement of aesthetic value by "direct" and "indirect" methods. Reports from the Psychological Laboratory, The University of Stockholm, Sweden, 1961, No. 93.

EKMAN, G., & KÜNNAPAS, T. A further study of direct and indirect scaling methods. Scandinavian Journal of Psychology, 1963, 4, 77-80.

EKMAN, G., & SJOBERG, L. Scaling. In P. R. Farnsworth, O. McNemar, & Q. McNemar (Eds.), Annual review of psychology. Vol. 16. Palo Alto, Calif.: Annual Reviews Inc., 1965. Pp. 451-474.

EYMAN, R. K. The effect of sophistication on ratio- and discriminative scales. American Journal of Psychology, 1967, 80, 520-540.

EYMAN, R. K., & KIM, P. J. A model for partitioning judgment error in psychophysics. Psychological Bulletin, 1970, 74, 35-46.

GALANTER, E., & MESSICK, S. The relation between category and magnitude scales of loudness. Psychological Review, 1961, 68, 363-372.

GREGSON, R. A. M., MITCHELL, M. J., SIMMONDS, M. B., & WELLS, J. E. Relative olfactory intensity perception as mediated by ratio-range category scale responses. Perception and Psychophysics, 1969, 6, 133-138.

HAWKINS, W. F., BAUMEISTER, A. A., & FRIEDRICH, D. Weight judgments of normals and retardates: a note. American Journal of Mental Deficiency, 1966, 71, 393-395.

HELSON, H. Adaptation-level theory: an experimental and systematic approach to behavior. New York: Harper, 1964.

HOLOWINSKY, I. Z. Length discrimination as a function of IQ and motivation. Training School Bulletin, 1964, 61, 116-119.

KOH, S. D., & SHEARS, G. Psychophysical scaling by schizophrenics and normals. Archives of General Psychiatry, 1970, 23, 249-259.

LUCE, R. D. What sort of measurement is psychophysical measurement? American Psychologist, 1972, 27, 96-106.

MASHHOUR, M., & HOSMAN, J. On the new "psychophysical law": a validation study. Perception and Psychophysics, 1968, 3, 367-375.

PARDUCCI, A. Range-frequency compromise in judgment. Psychological Monographs, 1963, 77, 44-45.

POULTON, E. C. The new psychophysics: six models for magnitude estimation. Psychological Bulletin, 1968, 69, 1-19.

POULTON, E. C., & SIMMONDS, D. C. V. Value of standard and very first variable in judgments of reflectance of grays with various ranges of available numbers. Journal of Experimental Psychology, 1963, 65, 297-304.

PRADHAN, P. L., & HOFFMAN, P. J. Effect of spacing and range of stimuli on magnitude estimation judgments. Journal of Experimental Psychology, 1963, 66, 533-541.

ROSS, J., & DILOLLO, V. Judgment and response in magnitude estimation. Psychological Review, 1971, 78, 515-527.

SAVAGE, C. W. The measurement of sensation. Berkeley, Calif.: Univer. of California Press, 1970.

SCHEFFÉ, H. The analysis of variance. New York: Wiley, 1959.

SCHNEIDER, B., & LANE, H. Ratio scales, category scales, and variability in the production of loudness and softness. Journal of the Acoustical Society of America, 1963, 35, 1953-1961.

SIEGEL, S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956.

STEVENS, S. S. A metric for the social consensus. Science, 1966, 151, 530-541. (a)

STEVENS, S. S. On the operation known as judgment. American Scientist, 1966, 54, 385-401. (b)

STEVENS, S. S. Issues in psychophysical measurement. Psychological Review, 1971, 78, 426-450.

STEVENS, S. S., & GUIRAO, M. Loudness, reciprocality, and partition scales. Journal of the Acoustical Society of America, 1962, 34, 1466-1471.

STEVENS, S. S., & GUIRAO, M. Subjective scaling of length and area and the matching of length to loudness and brightness. Journal of Experimental Psychology, 1963, 66, 177-186.

STEVENS, S. S., & STONE, G. Finger span: ratio scale, category scale, and JND scale. Journal of Experimental Psychology, 1959, 57, 91-95.

TEGHTSOONIAN, R. On the exponents in Stevens' law and the constant in Ekman's law. Psychological Review, 1971, 78, 71-80.

Accepted December 10, 1974.
