This article was downloaded by: [Memorial University of Newfoundland] On: 04 October 2014, At: 17:47 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

The Journal of General Psychology Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/vgen20

Does Memory Contaminate Test-Retest Reliability?
Stuart J. McKelvie, Department of Psychology, Bishop's University
Published online: 06 Jul 2010.

To cite this article: Stuart J. McKelvie (1992) Does Memory Contaminate Test-Retest Reliability?, The Journal of General Psychology, 119:1, 59-72, DOI: 10.1080/00221309.1992.9921158
To link to this article: http://dx.doi.org/10.1080/00221309.1992.9921158


Downloaded by [Memorial University of Newfoundland] at 17:47 04 October 2014

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

The Journal of General Psychology, 119(1), 59-72


Does Memory Contaminate Test-Retest Reliability?
STUART J. McKELVIE
Department of Psychology
Bishop's University

ABSTRACT. The Wonderlic Personnel Test (1983) was administered twice over a 3-week period under conditions in which the activity of the second test was experimentally manipulated. Data from 302 undergraduates were analyzed. The standard test-retest reliability coefficient, .872, was not significantly different from the coefficients obtained from three other groups that, on the second test, were each given specific instructions: (a) to reason out the answers (pure reassess condition); (b) to use reasoning, memory of their initial responses, or both (reassess and memory); or (c) to take an alternate form of the test (parallel). However, the standard test-retest reliability coefficient was higher, p < .10, than the coefficient obtained from a condition (pure memory) in which subjects were instructed to duplicate their previous responses, using only memory. Although the subjects in the test-retest and combined reassess and memory conditions reported recalling previous answers for 20-25% of the items on the second test, it was concluded that conscious repetition of specific responses did not seriously inflate the estimate of test-retest reliability.

I would like to acknowledge the help of the students who participated either as subjects or as experimenters in this research project and to thank Patricia Monfette, who assisted me in analyzing the data; Howard Lucia, for his helpful comments; and Dale Stout, for many useful discussions. Address correspondence to Stuart J. McKelvie, Department of Psychology, Bishop's University, Lennoxville, QC, J1M 1Z7, Canada.

THE MAJOR SOURCES OF ERROR VARIANCE affecting the reliability of standardized psychological tests are content sampling and time sampling (Anastasi, 1988). That is, errors occur either because of the particular set of items selected for the test or the particular time at which the test was administered. If the latter is of primary concern, the same test can be administered twice to obtain the test-retest reliability coefficient, but this measure may be

contaminated by conscious carry-over and practice effects, particularly if the interval between tests is short (Anastasi; Kaplan & Saccuzzo, 1989; Walsh & Betz, 1985). Although Anastasi (1988) has argued that these spurious effects of readministration apply to many tests, the test-retest estimate is frequently reported (Schuerger & Witt, 1989; Schuerger, Zarrella, & Hotz, 1989), possibly because of difficulties in the construction of parallel forms. Moreover, most of the time intervals between the first and the second testing are weeks or months, rather than years, because the researcher is usually concerned with the reliability of the test rather than the long-term stability of the trait being measured (Schuerger et al., 1989; Walsh & Betz, 1985). Schuerger and his colleagues (Schuerger & Witt, 1989; Schuerger et al., 1989) have systematically documented and quantified the frequently reported claim that, as the time interval between the administrations of a test lengthens, the test-retest correlation coefficients decline. Schuerger and his colleagues also found that the rate was exponential, leveling off after a rapid decline over the first year, and that the coefficients were consistently higher for tests of ability (Schuerger & Witt, 1989) than for personality tests (Schuerger, Tait, & Tavernelli, 1982; Schuerger et al., 1989).

Most researchers attribute the early fall in these values to declining carryover effects (e.g., Anastasi, 1988; Ghiselli, Campbell, & Zedeck, 1981; Nunnally, 1978). That is, if a test is readministered within a short time interval, subjects may recall their initial test responses and simply repeat them. To the extent that the subjects do so successfully, there will be a high degree of similarity between the scores on the two administrations of the test, leading to an estimate of test-retest reliability that is spuriously high because it reflects memory rather than reassessment (Anastasi). Although this argument is made frequently, the rare attempts to test it have provided only limited support (Freeman, 1962; McKelvie, 1986). In the present experiment, I examined the role of memory in test-retest reliability by administering a general intelligence test on two occasions separated by a 3-week interval. The test activity (reassess, memory) on the second test was manipulated. Assuming that memory would inflate test-retest reliability somewhat, I hypothesized that the reliability coefficients would be higher in the two conditions in which both reasoning and recall could occur (test-retest, reassess and memory) than in the two others in which the effects of memory for specific responses would be minimal (pure reassess and parallel). It was not clear what would happen in two other conditions in which the subjects were instructed to use memory throughout, because the coefficients in these conditions would be dependent on overall recall accuracy, about which little is known.


Method


Subjects

A total of 374 undergraduate students initially completed the test. The students were recruited over a 3-year period (1987-1989), and the sample in each year (n = 109, 130, 135, respectively) was selected to be representative of the university population, stratified by division (natural science, social science, humanities, business), year (1, 2, 3), and sex. The sample was reduced to 318 because some students could not be recontacted for the second testing, and the final number of subjects used in the data analyses was 302.

Materials

Because Schuerger and Witt (1989) obtained the highest test-retest reliability coefficients over short intervals for individual intelligence tests, I decided to employ a maximum rather than a typical performance test. However, it was too time-consuming to administer individual tests for a project of this kind. I used the Wonderlic Personnel Test (1983) because it is highly correlated (.93) with the individually administered Wechsler Adult Intelligence Scale (WAIS; Dodrill, 1981) for normal adults and perhaps for other populations also (Dodrill & Warner, 1988; Edinger, Shipley, Watkins, & Hammett, 1985). Although the author of the Wonderlic refers to that test as a test of “problem-solving ability,” the Wonderlic can also be characterized as a test of general intelligence. The items on the Wonderlic are derived from the Otis Test of Mental Ability (Wonderlic, 1983), and the Wonderlic has correlations of .56 to .80 with aptitude G (general learning ability) of the General Aptitude Test Battery (Wonderlic). The Wonderlic’s test-retest reliability over short intervals is .82 to .94 (Wonderlic), consistent with Schuerger and Witt’s (1989) function. Moreover, the Wonderlic requires little specialized examiner training and can be given to groups of subjects in 12 min. The test contains 50 items in a spiral omnibus format, is available in 16 alternative forms, and is widely used as a screening device in business and industry (Murphy, 1984).

Although the Wonderlic’s (1983) manual does not systematically describe the content of the test, my examination of the items on the Wonderlic indicated that they fall into six general categories, spread throughout the test. I judged the item type and the number of items of that type on the first and second halves of the test, respectively, to be as follows: numerical reasoning (6, 12); verbal reasoning (8, 6); synonym-antonym (7, 2); nonverbal reasoning (1, 3); information (2, 1); and attention to detail (1, 1).

Procedure

The experimenters were the author and undergraduate students who had taken a course in psychological testing. Each experimenter recruited 12 to 15 subjects spanning the three stratification criteria and tested them individually or in small groups of two to five. Before volunteering, the subjects were told that they would be asked to complete a test of problem-solving ability and that there would be another (unspecified) test in 3 weeks’ time. We administered the initial Wonderlic Personnel Test, Form A, with the usual 12-min time limit. Approximately 3 weeks (17 to 25 days) later, the subjects in the parallel condition were given Form D of the Wonderlic, and the subjects in all the other conditions were given Form A. Each experimenter randomly assigned his or her subjects to one of the six conditions, with the following exceptions: (a) all test-retest subjects were tested during 1987, (b) all subjects in the reassess and memory condition were tested in 1989, (c) no subjects in the parallel condition were tested during 1987, and (d) constraints were placed on the numbers of subjects in the other conditions during 1988 and 1989 in an attempt to obtain about 50 subjects in each condition.

The subjects were given specific instructions about how they should take the second test. Those in the test-retest condition were simply told to complete the test again (standard instructions). Those in the memory condition were asked to answer each question by recalling what they had written before, that is, to duplicate their previous responses, using memory. Subjects in the pure reassess condition were asked to reason out the answer for each item on the test, even if they thought they could recall what they had previously written. Those in the pure memory condition were asked to duplicate their responses on the initial test, using recall, but they were explicitly instructed not to rework the answers. Subjects in the reassess and memory condition were told that the test could be answered using either memory or reasoning, or both, and that they were free to adopt whatever strategy they wished for each item.
Subjects in the parallel condition were given Form D of the Wonderlic and standard instructions for completing the test.

When the 12-min time limit for the second test had been reached, the subjects were given a questionnaire and were asked to estimate how many of the 50 items on the original test they had thought about or remembered during the 3-week interval. Those in the test-retest condition also estimated the number of items for which they had used memory the second time they took the test. The subjects in the reassess condition indicated whether each answer on the second test was identical to the answer they had given on the previous test. In both memory conditions, the subjects estimated how many answers they had successfully duplicated; in the reassess and memory condition, the subjects indicated which strategy (reassess, memory, both, or cannot say) they had used for each answer. About 2 weeks after the second test session, those subjects who so desired were given feedback about their score on the first test.

Although a more powerful test of overall memory accuracy might have been obtained if the subjects had been told at the time of the first test that they would take the same test in 3 weeks' time, I used an incidental procedure in this study to simulate the usual test-retest method, in which subjects are not initially informed that the same test will be readministered. Under these circumstances, the subjects would not invoke a memory strategy until the second session.

Results

Of the 318 subjects tested, the number in each condition was 51, test-retest; 47, memory; 60, pure memory; 51, pure reassess; 54, reassess and memory; and 55, parallel. Subjects were randomly discarded in each condition to equate the sample size at 51, except in the memory condition, in which the sample size remained 47.

Test-Retest Reliability Coefficients

The correlation (test-retest reliability) coefficients between raw scores (the number correct) in each session are shown in Table 1. The two highest coefficients (.872, .860) were obtained in the test-retest and in the reassess and memory conditions, respectively, and the other coefficients ranged from .80 to .74. Table 1 also shows the results of z tests comparing the reliability coefficient in the test-retest condition to the reliability coefficient in each of the other conditions. One-tailed tests were conducted because the purpose of this research was to investigate the possibility that memory artificially increases the reliability estimate. The results of two of these comparisons were significant; the values in the pure memory and the parallel conditions were lower than the value in the test-retest condition.

TABLE 1
Test-Retest Reliability Coefficients for Each Condition

                                 Raw scores       Equated variances    IQ scores
Condition               n        Rel     z        Rel     z            Rel
Test-retest             51       .872    -        .872    -            .875
Reassess and memory     51       .860    0.24     .886    -0.30        .854
Pure reassess           51       .797    1.22     .837    0.64         .787
Parallel                51       .755    1.75*    .820    0.90         .740
Memory                  47       .780    1.42     .856    0.30         .773
Pure memory             51       .744    1.87*    .759    1.70*        .739

Note. Rel = reliability coefficient. For raw scores, the coefficients were calculated from the original raw scores; for equated variances, the raw score coefficients were corrected for unequal variances; for IQ scores, the coefficients were calculated from derived scores. The z values refer to comparisons of coefficients in each condition with test-retest. *p < .10.

For the comparison of correlation coefficients, however, the corresponding variances must be similar because the correlation can be reduced by a restriction in range (Anastasi, 1988). The standard deviations in each condition are shown in Table 2, along with the results of F tests comparing the test-retest variance with the other variances. Although the results of only one of these comparisons were significant at the .10 level (memory was lower), each reliability coefficient was corrected, using Ghiselli et al.'s (1981) formula (which can be used when both variances are known), so that it would be comparable to the test-retest value. The resultant reliabilities are also shown in Table 1. The two highest remained test-retest and reassess and memory (.872 and .886, respectively), but, with the exception of pure memory at .759, the others were .820 or greater. The results of only one z test (test-retest vs. pure memory) were significant.

A 6 x 2 (Condition x Time) analysis of variance (ANOVA) was also conducted on the raw scores themselves. Both the effect of time, F(1, 296) = 10.44, p < .01, and the effect of the Condition x Time interaction, F(1, 10.44), p < .01, were significant. Inspection of the means in Table 2 suggests that practice effects occurred in all the conditions except pure memory and parallel. This was confirmed by post-hoc t tests (see Table 2).

Three-Week Recall

TABLE 2
Means and Standard Deviations of Raw Scores in Each Condition

                                 First test        Second test
Condition               n        M       SD        M       SD       t         F
Test-retest             51       27.16   6.75      29.69   6.48     3.37**    -
Reassess and memory     51       26.59   6.09      29.76   5.96     4.23**    1.23
Pure reassess           51       27.49   6.05      31.39   5.64     5.20**    1.25
Parallel                51       27.92   5.79      28.67   5.95     1.00      1.36
Memory                  47       25.74   5.46      29.74   5.43     5.12**    1.53*
Pure memory             51       26.82   6.55      26.63   5.99     0.25      1.06

Note. t values refer to comparisons between mean scores on the first and second test; F values refer to comparisons between the test-retest variance and each of the variances on the first test. *p < .10. **p < .01.

I planned to conduct a one-way ANOVA of the self-reported recall during the 3-week interval, to investigate whether there was differential rehearsal of the items among the conditions. However, after testing a number of subjects, I noticed that some had given surprisingly high estimates. It appears that they had misinterpreted the question to include the rehearsal of their previous responses during the test itself, rather than simply during the 3-week interval. Consequently, for subsequent subjects the wording of the question was altered so that it stated explicitly that the 3-week interval was involved and not the time during the second test.

Therefore, I analyzed the data from this question in three ways (see Table 3). First I conducted a 3 x 2 (Condition x Question) ANOVA. The three conditions were pure reassess, pure memory, and parallel (in which an early and a late version of the same question were asked). Significant effects were found for condition, F(2, 146) = 5.60, p < .01, and question, F(1, 146) = 13.78, p < .01. There were more reports of 3-week recall in the pure reassess and pure memory conditions than in the parallel condition, ps < .01, by the Newman-Keuls test, and from the early rather than the late version of the question. Although the Condition x Question interaction was significant only at the .10 level, F(2, 146) = 2.45, inspection of the means suggested that the effect of question was confined largely to the first two conditions.

The second analysis was conducted on all the scores obtained from the early question. Five conditions were involved (see Table 3). The overall effect of condition was significant, F(4, 172) = 5.51, p < .01, and, again, recall in the parallel condition was lower than in all the other conditions, ps < .01, Newman-Keuls. A similar analysis was conducted on all the scores obtained from the revised question.
The effect of condition was not significant (see Table 3), showing that 3-week recall in the pure reassess, pure memory, and reassess and memory conditions was similar to that in the parallel condition. Considered as a whole, these results indicate that the initial version of the question produced reports of higher 3-week recall than the revised version did, and that the level of recall on the latter was similarly low in the various conditions.

TABLE 3
Means and Standard Deviations for the Number of Items Reported Recalled During the 3-Week Interval

                                     Original wording         Revised wording
Condition               Total n      n     M       SD         n     M      SD
Test-retest             37           37    7.84    9.18       -     -      -
Reassess and memory     51           -     -       -          51    3.78   6.58
Pure reassess           51           36    12.47   12.52      15    3.40   8.87
Parallel                51           38    1.82    2.99       13    1.00   2.12
Memory                  40           40    6.97    9.39       -     -      -
Pure memory             50           26    9.58    13.00      24    2.17   4.26

Note. Total n was sometimes lower than the n in Tables 1 and 2 because some subjects did not answer all the questions.


Recall Accuracy

For the two memory conditions, the number of duplicated responses on the two tests (recall accuracy) was compared with the memory performance estimated from the questionnaire (see Table 4). The results of a 2 x 2 (Condition x Measure) ANOVA indicated a significant effect of measure, F(1, 93) = 37.45, p < .01, and estimated recall accuracy was generally lower than actual recall accuracy. A correlation between estimated and actual duplicates was also calculated for each condition. For memory, r = .065, ns, and for pure memory, r = .349, p < .01. However, with the z test, these correlations did not differ significantly, z = 1.39.

I calculated the percentage accuracy for each memory condition by comparing the actual number of duplicate answers to the number of attempts at duplication that were made on the second test. For memory, the value was 65.69%, and for pure memory it was 68.74%. Similarly, for the reassess condition, I compared the accuracy of the subjects’ judgments regarding whether their responses on the second test were duplicates of their responses on the first with the number of times each subject responded on the second test. Mean recognition accuracy was 77.2%.

The number of duplicated responses also differed across the five conditions in which Form A was used twice. The results of the ANOVA were significant, F(4, 244) = 2.72, p < .05, and the results of a Newman-Keuls test indicated that there were more duplicate responses in the pure reassess condition than in the memory, pure memory, and reassess and memory conditions, ps < .05, and than in the test-retest condition, but only with p < .10 (see Table 4).

TABLE 4
Means and Standard Deviations for the Number of Duplicated Responses in Each Condition

                                 Actual             Estimated
Condition               n        M       SD         M       SD
Memory                  47       26.75   7.09       17.16   12.59
Pure memory             51       27.00   6.76       21.37   10.03
Test-retest             51       28.33   6.32       -       -
Reassess and memory     51       26.55   8.06       -       -
Pure reassess           51       30.51   6.19       -       -

Finally, I considered self-reported recall from the test-retest condition and reassess and memory condition. For the test-retest condition, the mean number of items on which subjects reported using memory was 10.04 or 20.22% (SD = 10.11). For the reassess and memory condition, the number of items reported as being remembered, reassessed, and both remembered and reassessed was 4.59 (SD = 4.46), 25.37 (SD = 10.03), and 7.88 (SD = 5.65), respectively. The results of the ANOVA were significant, F(2, 100) = 93.23, p < .01, and the results of Newman-Keuls tests indicated that more responses were reported as being reassessments than as recalls or reassessments and recalls, ps < .01. Notably, the number of items on which recall was at least partly used (4.59 + 7.88 = 12.47, 24.94%) did not differ from the number of items (10.04) in the test-retest condition that were reported as having been recalled, t(94) = 1.34, p > .10.
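The z tests used in this section to compare correlation coefficients can be sketched in code. The article does not print the formula, so the version below assumes the usual Fisher r-to-z comparison for two independent samples; the function names are illustrative and do not come from the article. Under that assumption, comparing the raw-score coefficients for test-retest (.872, n = 51) and pure memory (.744, n = 51) gives a z of about 1.87, in line with Table 1.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Assumed method: Fisher's r-to-z transformation (atanh), with standard
    error sqrt(1/(n1 - 3) + 1/(n2 - 3)). Hypothetical helper, not from
    the article.
    """
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Test-retest vs. pure memory, raw-score coefficients from Table 1.
    z = compare_independent_correlations(0.872, 51, 0.744, 51)
    print(round(z, 2), round(one_tailed_p(z), 3))
```

The one-tailed probability is included because the article reports one-tailed tests; whether the article's significance markers follow exactly this cutoff is not stated in the text.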

Discussion

The purpose of this study was to investigate whether recall of initial responses would contaminate the standard estimate of test-retest reliability for the Wonderlic Personnel Test over a 3-week time interval. The first part of the discussion will focus on the reliability coefficients in the different conditions, and the second part of the discussion will focus on the extent and accuracy of memory for the original responses.

Reliability Coefficients

As previously noted, reliability coefficients can be meaningfully compared only if the corresponding variances are similar. Only one of the F tests comparing the variance in the test-retest condition with those in the other conditions approached significance (see Table 2). This condition, memory, was probably the least important condition because the subjects may have reasoned out the answers to some items (see below). In addition, revised values for the initial coefficients were obtained by equating variances to the variance in the test-retest condition.

Also, it has been observed (Kaplan & Saccuzzo, 1989; Walsh & Betz, 1985) that test-retest reliability coefficients are likely to be artificially inflated because of rehearsal or other activities that occurred in the interval between the two tests. To investigate whether this activity occurred differentially across the six conditions, I asked the subjects to estimate how many of the items they had rehearsed. With the original wording of the question, the amount of reported rehearsal was lower in the parallel condition than in the other conditions, whereas with the revised wording, the amount of reported rehearsal was generally lower than with the original wording and similar to the amount of reported rehearsal in the parallel condition. Therefore, the amount of reported rehearsal activity during the 3-week interval between tests was similar across the experimental conditions.

The original reliability coefficients in the test-retest condition and the reassess and memory condition were very similar to each other and higher than the reliability coefficients in the other conditions. Assuming that the pure reassess and parallel conditions eliminated the use of conscious recall of the original responses on each item and that the subjects in the reassess and memory condition used both strategies, this ordering provides some support for the contention that the standard test-retest reliability coefficient is inflated by a memory factor. On the other hand, the difference between the original reliability coefficient in the test-retest condition and the other original reliability coefficients was significant only for the parallel and pure memory conditions, and the revised values for the parallel (.820) and the pure reassess conditions (.837) were closer than the original values to the test-retest value (.872) and to the revised reassess and memory value (.886). Considered as a whole, these data suggest that the test-retest coefficient was not artificially inflated by the memory factor to any serious degree.

It is difficult to state the precise value of the true test-retest reliability coefficient based only on a process of reassessment, but this value is probably higher than the value in the parallel condition, .820, because the set of items in the parallel test was not perfectly equivalent with that in the original test.
If it is problematic to assume that the pure reassess condition did not completely eliminate the use of conscious memory for previous answers, the true value may be slightly lower than .837, yielding an overall estimate of .83. Although this number may represent the test-retest reliability coefficient uncontaminated by specific recall of the original responses on the test, other memory effects, such as implicit rather than explicit memory (Schachter, 1987), or memory for strategies as distinct from particular response alternatives, are not ruled out. In fact, an analysis of the raw scores indicated that there was a practice effect in all the conditions except two (pure memory and parallel). Presumably, memory for procedures on different items may have allowed subjects to progress further on the second test than on the first. Notably, the results of an ANOVA conducted on the attempts on the two tests in all conditions indicated a significant statistical interaction, and the results of post-hoc t tests indicated that the practice effect disappeared only in the pure memory condition. Even the subjects in the parallel condition seemed to benefit from experience. The lowest reliability coefficient (.744, revised to .759) was obtained in the pure memory condition, indicating that the exclusive use of memory was not advantageous. Notably, the coefficient for the memory condition (.780,


revised to .856) was higher than that for the pure memory condition. Because the latter instructions clearly prohibited subjects from using reason to answer the test questions, it is possible that the subjects in the memory condition may have reassessed some of the items. The results of the previous analysis of raw scores and attempts also support the distinction between the two memory conditions because the subjects’ performance in the memory condition, but not in the pure memory condition, indicated a practice effect. In addition, the correlation between estimated and actual duplication performance was significant in the pure memory condition, but not in the memory condition, suggesting the influence of a factor other than recall in the memory condition.

Dodrill (1983) reported that the test-retest reliability of the Wonderlic over a period of 5 years was .94. Assuming that the effect of memory would be minimal after this period of time (Dodrill also found no practice effect), Dodrill’s estimate is higher than the .83 derived in this study. However, Dodrill’s value is not strictly comparable because his sample consisted of 30 normal adults. My study was conducted with students, who represent a restricted range of ability. In fact, when I converted the raw scores from the first session in the test-retest condition to IQ equivalents, using Dodrill’s (1981) table, the mean and standard deviation of the IQ scores were 113.84 and 12.79, respectively, compared with Dodrill’s values of 103.40 and 19.87. For the pure reassess condition, the converted means and standard deviations were 114.69 and 11.34; for the reassess and memory condition, they were 112.92 and 11.57. The reliability coefficients for these three conditions, based on the IQ equivalent scores, were .875, .787, and .854, respectively, only slightly different from the reliability coefficients that were calculated from raw scores (see Table 1).
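The variance-equated ("revised") coefficients quoted in this discussion can be reproduced with a short sketch. The article cites Ghiselli et al. (1981) without printing the formula, so the form used here is an assumption: it treats error variance as constant across groups, which gives r_new = 1 - (SD_old^2 / SD_new^2)(1 - r_old). With the first-test standard deviations from Table 2, this assumed form returns the equated values listed in Table 1, for example .780 becoming about .856 for the memory condition. The function and variable names are illustrative only.

```python
def equate_reliability(r_old, sd_old, sd_new):
    """Re-express a reliability coefficient for a group with a different SD.

    Assumed formula (constant error variance across groups):
        r_new = 1 - (sd_old**2 / sd_new**2) * (1 - r_old)
    Hypothetical helper; the article does not print the formula it used.
    """
    if sd_old <= 0 or sd_new <= 0:
        raise ValueError("standard deviations must be positive")
    return 1.0 - (sd_old ** 2 / sd_new ** 2) * (1.0 - r_old)


# First-test SD of the test-retest condition (Table 2), used as the target.
TEST_RETEST_SD = 6.75

# Raw-score coefficient (Table 1) and first-test SD (Table 2) per condition.
CONDITIONS = {
    "reassess and memory": (0.860, 6.09),
    "pure reassess": (0.797, 6.05),
    "parallel": (0.755, 5.79),
    "memory": (0.780, 5.46),
    "pure memory": (0.744, 6.55),
}

if __name__ == "__main__":
    for name, (r_raw, sd) in CONDITIONS.items():
        equated = equate_reliability(r_raw, sd, TEST_RETEST_SD)
        print(f"{name}: {equated:.3f}")
```

Applied to Dodrill's .94 with his IQ standard deviation of 19.87 and the test-retest IQ standard deviation of 12.79, the same function gives about .855.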
To ascertain whether the difference between these values (particularly for the pure reassess condition) and Dodrill’s .94 could be accounted for by a restriction in the range of the samples, I applied to Dodrill’s value the formula to convert one reliability to another (Ghiselli et al., 1981). Dodrill’s revised value was .855 for the test-retest variance, .823 for the pure reassess variance, and .816 for the reassess and memory variance. The latter value was slightly lower than the converted value (.886) in this study, but the other two values were very similar to their counterparts. In fact, Dodrill’s revised value of .823 was almost identical to the previously suggested value (.83) for true test-retest reliability. Therefore, initial differences between Dodrill’s result and the results I found can be attributed to the restricted range of subjects in my study.

The pattern of the reliability coefficients across the various conditions does not support the hypothesis that conscious recall of initial responses inflates the standard estimate of test-retest reliability. However, this conclusion rests on the assumption that the subjects in the various experimental conditions followed the instructions that were given before the second test. No

Downloaded by [Memorial University of Newfoundland] at 17:47 04 October 2014

70

The Journal of General Psychology

postexperimental check was made of this assumption because interpreting the data from such a questionnaire would be difficult because of demand characteristics. However, as noted, there is independent evidence that the subjects in the memory and the pure memory conditions processed the second test differently. In addition, the subjects in the pure reassess condition duplicated more of their responses than the subjects in the four other conditions that involved a readministration of Form A, suggesting that the subjects in the pure reassess condition adopted a different strategy on the second test. Thus, it is not reasonable to argue that the subjects in the different experimental conditions answered the questions on the second test in the same way. Because the experimental conditions were run in different years (with potential variations in experimenters and populations), they may not have been strictly comparable, but two considerationsweaken this objection. First, the composition of the university population was similar from year to year, and stratified sampling occurred within each year. In fact, the sample means and standard deviations were comparable: 26.1 and 6.6 (1987), 27.0 and 6.4 (1988), and 26.8 and 5.8 (1989), F(2, 371) = 0.67. More important, the means and standard deviations of the initial scores were similar across the different experimental conditions (see Table 2). The Condition x Time interaction was significant, but all of the Newman-Keuls comparisons at the initial testing were nonsignificant,p s > .05. Accuracy and Extent of Memory Use Memory must be accurate if it is to contribute to an inflated estimate of testretest reliability. However, although subjects in the pure reassess condition correctly recognized that 77.2% of their second responses were duplicates of their first responses, the subjects’ duplication of responses via conscious recall in the pure memory condition was only 68.7% accurate. 
In fact, the accuracy of duplication (30.51, or 73.22% of attempts) was highest in the pure reassess condition and significantly higher than the accuracy of duplication in the pure memory condition. This demonstrates (again) that the exclusive use of memory did not lead to a high duplication rate; in fact, the duplication rate was lower than that obtained when the subjects reworked the answers. These data indicate that memory is not always accurate and, if used exclusively, may lower the duplication rate and test-retest reliability.

Memory may have a different effect if used selectively, however. Subjects in the test-retest condition and the reassess and memory condition reported that they used recall for 20-25% of the items. If memory is used only when it can be successful, the reliability coefficient will be increased. However, although the highest coefficients were obtained in the test-retest and the reassess and memory conditions, the highest coefficients were not significantly greater than those from the pure reassess or the parallel conditions (with variance equated). Notably, reported recall was higher in this study (20-25%) than in a previous study (11%) using the Vividness of Visual Imagery Questionnaire (McKelvie, 1986), suggesting that the use of memory may vary with different kinds of tests.


Conclusion

Although the subjects reported answering 20 to 25% of the items on the second administration of the Wonderlic Personnel Test at least in part by recalling and repeating previous responses, the results of this study indicate that explicit memory does not seriously inflate test-retest reliability estimated at a 3-week interval. The true test-retest reliability in this sample of undergraduates was estimated to be .83, commensurate with Dodrill’s (1983) .94, which was obtained from a general adult sample after 5 years. Clearly, the test-retest reliability of the Wonderlic is high and may not decline over long-term intervals, as one would expect, given the function documented by Schuerger and Witt (1989). However, the results of my study, if they can be generalized to other tests and to other subject populations, imply that conscious specific response memory may not seriously contaminate test-retest reliability at short intervals. Such a view was espoused by Freeman (1962). Although this conclusion does not rule out other carry-over effects, such as that of practice, it may reassure those researchers who are concerned that explicit response recall artificially inflates test-retest reliability.

REFERENCES

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Dodrill, C. B. (1981). An economical method for the evaluation of general intelligence in adults. Journal of Consulting and Clinical Psychology, 49, 668-673.

Dodrill, C. B. (1983). Long-term reliability of the Wonderlic Personnel Test. Journal of Consulting and Clinical Psychology, 51, 316-317.

Dodrill, C. B., & Warner, M. H. (1988). Further studies of the Wonderlic Personnel Test as a brief measure of intelligence. Journal of Consulting and Clinical Psychology, 56, 145-147.

Edinger, J. D., Shipley, R. H., Watkins, C. E., Jr., & Hammett, E. B. (1985). Validity of the Wonderlic Personnel Test as a brief IQ measure in psychiatric patients. Journal of Consulting and Clinical Psychology, 53, 937-939.

Freeman, F. S. (1962). Theory and practice of psychological testing. New York: Holt, Rinehart and Winston.

Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: Freeman.

Kaplan, R. M., & Saccuzzo, D. P. (1989). Psychological testing (2nd ed.). Pacific Grove, CA: Brooks/Cole.

McKelvie, S. J. (1986). Effects of format of the Vividness of Visual Imagery Questionnaire on content validity, split-half reliability, and the role of memory in test-retest reliability. British Journal of Psychology, 77, 229-236.

Murphy, K. R. (1984). The Wonderlic Personnel Test. In D. J. Keyser & R. C. Sweetland (Eds.), Test critiques: Vol. 1 (pp. 769-775). Kansas City: Test Corporation of America.

Nunnally, J. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Schachter, D. L. (1987). Implicit memory: History and current status. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 501-518.

Schuerger, J. M., Tait, E., & Tavernelli, M. (1982). Temporal stability of personality by questionnaire. Journal of Personality and Social Psychology, 43, 176-182.

Schuerger, J. M., & Witt, A. C. (1989). The temporal stability of individually tested intelligence. Journal of Clinical Psychology, 45, 294-302.

Schuerger, J. M., Zarrella, K. L., & Hotz, A. S. (1989). Factors that influence the temporal stability of personality by questionnaire. Journal of Personality and Social Psychology, 56, 777-783.

Walsh, W. B., & Betz, N. E. (1985). Tests and assessment. Englewood Cliffs, NJ: Prentice Hall.

Wonderlic, E. F. (1983). Wonderlic Personnel Test manual. Northfield, IL: E. F. Wonderlic & Associates.

Received August 5, 1991
