Statistical and Non-Statistical Interpretation of Test Results.

By Samuel W. Fernberger, Ph.D., Assistant Professor of Psychology, University of Pennsylvania.

Some time ago, a survey was made to determine standards for children of the fifteen-year-old level at the University of Pennsylvania.* This age is a critical one for vocational guidance and the study is thus relatively important. From a preliminary study, it was decided to test three groups of children as follows: Group I, children applying to a Junior Employment Bureau for assistance in obtaining jobs; Group II, children who were working and also attending Continuation School; Group III, children in High School who were not working. A battery of tests was given each child which included: shortened Binet, Witmer Cylinders, Dearborn Formboard, Monroe Reading, Courtis Arithmetic, and Woodworth and Wells' Hard Directions Test. One hundred boys and one hundred girls were each tested in Groups II and III. One hundred and thirty boys and seventy girls were tested in Group I.

One of the important results of this study was the discovery that the performances of the High School children (Group III) were considerably better than those of either of the other two groups. The performances of Group II were slightly better than those of Group I, but these two groups may be considered as parts of a larger group which contrasts sharply with Group III. In all three groups, the performances of the boys were slightly better than those of the girls. The High School children give higher intellectual and more intelligent performances than those in the two working groups.

It was thought advisable to determine the degree of statistical validity of these differences. For this purpose, the Binet I. Q. and the times of the first trial of the Dearborn Formboard were selected. The probable errors of the averages were calculated separately for boys and girls for all three groups. These results are given in Table I for the Binet I. Q. and in Table II for the Dearborn Formboard. It will be noted that the variations in the size of the averages are relatively great in both cases when one compares the values for Group III with those for Groups I and II. The probable errors, however, are quite large, indicating great variability within each group. The probable errors for the Dearborn Formboard results are relatively very much larger than those for the Binet I. Q.

* R. E. Learning, "Vocational Guidance at the Fifteen-Year-Old Level."


Table I. I. Q.

Group.     Average.
I B         87.6
I G         83.8
II B        92.6
II G        92.2
III B      117.0
III G      107.2

[The P. E. values for Table I are missing from the source text.]

Table II. Dearborn First Trial.

Group.     Average.    P. E.
I B        194          72.75
I G        227         110.68
II B       178          93.75
II G       197          75.96
III B      150          65.14
III G      170          63.42

Then the probable errors of the differences were calculated by the formula, $P.E._d = \sqrt{P.E._a^2 + P.E._b^2}$. Then a value z was calculated by the formula, $z = D / P.E._d$, i. e., the difference divided by the probable error of the difference. One then looks up in a special table* a value Pz corresponding to the calculated z. The Pz is the probability that the difference will not vary by an amount greater than itself. It is, therefore, a reliable "index of significance." For mathematical certainty that the difference is significant, the value Pz must be 0.9999 or unity.
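The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that the Pz read from the special table is the normal-law probability of a deviation falling within z probable errors, which is erf(0.476936 z). The article itself does not spell out the table's construction. The inputs are the Table II probable errors for Groups I and III boys (72.75 and 65.14) and their difference of 44 seconds.

```python
import math

# Constant relating the probable error to the normal law (assumed here;
# the article only refers to a printed table).
RHO = 0.476936


def pe_of_difference(pe_a, pe_b):
    """Probable error of a difference: sqrt(pe_a^2 + pe_b^2)."""
    return math.hypot(pe_a, pe_b)


def index_of_significance(difference, pe_d):
    """Return (z, Pz), with z = D / P.E.d and Pz = erf(RHO * z)."""
    z = difference / pe_d
    return z, math.erf(RHO * z)


# Dearborn Formboard, boys, Group I versus Group III (values from Table II).
pe_d = pe_of_difference(72.75, 65.14)
z, pz = index_of_significance(44, pe_d)
print(f"P.E.d = {pe_d:.2f}, z = {z:.3f}, Pz = {pz:.4f}")
```

Under these assumptions the computed P.E.d and Pz come out close to the tabulated 98 and 0.2385 for this comparison.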

Table III. I. Q. Boys-Girls.

Group.      D       P. E.d    Pz
I-I          3.8    16.8      0.1286
II-II        0.2    14.2      0.0161
III-III      9.8    12.0      0.4198

Table IV. I. Q. Boys.

Group.      D       P. E.d    Pz
I-II         5.0    13.8      0.1919
I-III       29.4    12.7      0.8824
II-III      24.4    12.5      0.8116

Table V. I. Q. Girls.

Group.      D       P. E.d
I-II         8.4    16.4
I-III       23.4    15.5
II-III      15.0    13.6

[The Pz values for Table V are missing from the source text.]

Table VI. Dearborn. Boys-Girls.

Group.      D       P. E.d    Pz
I-I         33      133       0.1339
II-II       19      121       0.0859
III-III     20       91       0.1180

Table VII. Dearborn. Boys.

Group.      D       P. E.d    Pz
I-II        16      119       0.0699
I-III       44       98       0.2385
II-III      28      114       0.1301

Table VIII. Dearborn. Girls.

Group.      D       P. E.d    Pz
I-II        30      135       0.1180
I-III       57      128       0.2385
II-III      27       99       0.1445

* W. W. Johnson, The Theory of Errors and Method of Least Squares, N. Y., 1892, p. 154.


THE PSYCHOLOGICAL CLINIC.

These values are to be found in Tables III-VIII. Tables III-V contain the values for the Binet I. Q., while Tables VI-VIII contain those for the Dearborn Formboard. Tables III and VI compare the boys and girls in each group. Tables IV and VII compare the different groups for boys and Tables V and VIII compare the different groups for girls. Each table is similarly constructed. In the first columns are indicated the groups compared. In the second columns are given the differences, with the probable errors of the differences in the third columns. In the last columns are given the values of Pz, the index of significance.

In the consideration of the I. Q. for boys and girls (Table III) none of the differences are at all significant. That for Group III is the largest but this indicates less than a fifty per cent probability that the difference is not due to chance factors. Comparing the I. Q. for boys (Table IV), the differences between Groups I and II are very insignificant. Group III shows a much greater significance; indeed, comparing Groups I and III there is better than a four to five chance that the difference is not due to chance. Comparing the groups of girls (Table V) none of the differences show a great degree of significance. That for Groups I and II is the lowest, however, and that for Groups I and III is the highest.

In the case of the Dearborn Formboard (Tables VI, VII and VIII), none of the differences are at all statistically significant; none in fact show a difference as significant as a one to four chance. The obvious reason for the low degree of significance of the differences is to be found in the size of the probable errors rather than in the size of the differences themselves. Take, for example, Groups I and III Boys for the Dearborn Formboard. The difference is 44 seconds, which is nearly one-third as large as the lower average, a relatively large difference indeed. But there is so much variability within each group that the probable errors are so large that the value of z (= D/P. E.d) is small.

It was therefore thought advisable to try and correct the values so as to obtain greater statistical significance. The obvious method is to lower the time limits beyond which the performance is considered a failure. In the present experiment, a period of 10 minutes was allowed for the Dearborn Formboard and, if the task was not completed, it was scored a failure. Five failures were recorded for Group I; one failure for Group III. It is obviously the extreme cases which give the large variations and which give extremely large values when the variations are squared. Hence we arbitrarily cut off the "tails" of the distribution curves for the upper limits by arbitrarily calling all performances over 400 seconds failures. In this new procedure, we record 17 failures for Group I and six failures for Group III. The averages and probable errors for these new "corrected" values are to be found in Table IX. We have reprinted in this table the original "uncorrected" values for comparison.
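The effect of shortening the time limit can be sketched with a small example. Everything here is illustrative: the completion times are hypothetical and are not data from the study, the function name is invented, and the probable error of the mean is taken as 0.6745 s / sqrt(n), the usual normal-law formula, since the article does not state which formula was used. Only the 400-second limit comes from the text.

```python
import math
import statistics

CUTOFF_SECONDS = 400  # the shortened limit named in the text

# Hypothetical completion times in seconds; NOT data from the study.
hypothetical_times = [95, 110, 120, 135, 150, 160, 175, 190, 240, 420, 560]


def mean_and_probable_error(times):
    """Return (mean, probable error of the mean).

    Assumes P.E. = 0.6745 * s / sqrt(n); the article does not give
    the formula it used.
    """
    s = statistics.stdev(times)
    return statistics.mean(times), 0.6745 * s / math.sqrt(len(times))


# "Corrected" series: performances over the cutoff are scored as failures
# and therefore dropped from the average.
corrected_times = [t for t in hypothetical_times if t <= CUTOFF_SECONDS]

mean_all, pe_all = mean_and_probable_error(hypothetical_times)
mean_cut, pe_cut = mean_and_probable_error(corrected_times)
print(f"uncorrected: mean = {mean_all:.1f}, P.E. = {pe_all:.1f}")
print(f"corrected:   mean = {mean_cut:.1f}, P.E. = {pe_cut:.1f}")
```

In this made-up series the two longest times dominate the squared deviations, so dropping them lowers the probable error sharply, and it lowers the average as well.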

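Table IX.

Group.                        Average.    P. E.
I                             194         72.75
I - 12 cases (corrected)      162
III                           150         65.14
III - 5 cases (corrected)     135

[The P. E. values for the corrected rows are missing from the source text; the uncorrected values are those of Table II.]

Table X.

Group.                D      P. E.d    Pz
I-III                 44     98        0.2385
I-III (corrected)     27     74        0.1971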
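The two rows of Table X can be re-derived as a rough check. As before, this is a sketch under an assumption: Pz is taken to be erf(0.476936 z), the normal-law probability of a deviation within z probable errors, and the helper name is hypothetical. The D and P.E.d values are those printed in Table X.

```python
import math

RHO = 0.476936  # assumed constant relating the probable error to the normal law


def pz(difference, pe_d):
    """Index of significance for a difference and its probable error."""
    return math.erf(RHO * difference / pe_d)


# (D, P.E.d) as printed in Table X.
rows = {"I-III": (44, 98), "I-III (corrected)": (27, 74)}
results = {label: (d / pe, pz(d, pe)) for label, (d, pe) in rows.items()}

for label, (z, p) in results.items():
    print(f"{label}: z = {z:.3f}, Pz = {p:.4f}")

# Relative reductions produced by the shorter time limit.
drop_in_d = 1 - 27 / 44    # about 39 per cent
drop_in_pe = 1 - 74 / 98   # about 24 per cent
print(f"D falls by {drop_in_d:.0%}, P.E.d by {drop_in_pe:.0%}")
```

The difference shrinks proportionally more than its probable error, so z and with it Pz go down rather than up.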

The size of the probable errors is considerably reduced by the assumption of a shorter time limit for failure. But it will also be noticed that the elimination of the few longer values from the calculation of the averages also considerably reduces the size of the averages themselves. The subsequent calculations, described above, were carried out for the "corrected" values and the results are given in Table X. It will be noted that the probable error of the differences is considerably smaller for the corrected values. But the size of the difference is relatively even more decreased. Hence the index of significance of the difference is not so large as for the uncorrected results, being reduced from 0.2385 to 0.1971.

Such a procedure as we have applied to these results can only be applied to performance tests where a time record is kept. In the case of the Binet I. Q. the final result is a ratio and no extreme values can be properly eliminated.

All of this means that the differences calculated do not seem to be great enough to have mathematical significance. The two tests calculated were chosen because they seemed to show the greatest and most clean-cut differences.

And, indeed, when one considers only the differences and disregards the probable errors these differences are very marked. But the extreme variability within each group increases the size of the probable errors so that, mathematically, the significance of the difference is obscured. These results also indicate the extreme relative importance of the "tails" of the curves of distribution both for the size of the average and of the probable errors.

In the non-mathematical interpretation of the results, one fact gave added conviction to the significance of the differences; namely, that when one compared two groups, the differences for all of the tests in the battery were invariably in the same direction. Hence, for both boys and girls, the averages for Group II showed greater competency than for Group I and the averages for Group III showed much greater competency than for Group II in every test of the battery. And, in comparing the sex results, the boys were almost invariably, although only slightly, better in all the tests than the girls of the same group.

Furthermore we do not see how one can expect mathematical significance in a mental test which is going to be of any value for purposes of diagnosis. It is upon the variability within the group that the examiner is able to make his diagnosis and prognosis. The greater the variability within the group the more chance the examiner has to form the basis of his diagnosis, and hence, the better the test.

Hence the ideal for mental tests is great variability within the group, which is utterly incompatible with mathematical significance when one compares the results of one group with another. It seems possible to draw several conclusions from these results:

1. Mental tests, as they are now developed, show such a degree of variability within a relatively homogeneous group that the differences between two groups do not have statistical significance, even though the differences may be great. In this connection, the "tails" of the curves of distribution assume a relatively great importance both with regard to the size of the averages and of the probable errors. Such a degree of variability is the thing desired in a mental test; with the reduction of the variability one has a correspondingly great reduction in the value of the test for the purposes of differentiating the members of the group.

2. From this it would seem that, if tests are to have a diagnostic value, which means great variability within the group, we can never hope to obtain differences between groups which will have statistical significance.

3. The modern tendency to over-statisticize test results seems, therefore, to be erroneous. It would seem better to treat the raw material with as little statisticizing as possible. In so doing one is, of course, forced to a non-statistical interpretation of the differences.

4. This emphasizes the point of view that less weight is to be put on final test scores. From this viewpoint, mental tests become merely a standardized means of having the subject do something so that the trained examiner may observe his behavior and thus may arrive at a qualitative analytic diagnosis of the individual case.
