Scandinavian Journal of Clinical and Laboratory Investigation

ISSN: 0036-5513 (Print) 1502-7686 (Online) Journal homepage: http://www.tandfonline.com/loi/iclb20

Discriminant Analysis in Clinical Chemistry

Helge Erik Solberg

To cite this article: Helge Erik Solberg (1975) Discriminant Analysis in Clinical Chemistry, Scandinavian Journal of Clinical and Laboratory Investigation, 35:8, 705-712.

To link to this article: http://dx.doi.org/10.3109/00365517509095801


Scand. J. clin. Lab. Invest. 35, 705-712, 1975.

EDITORIAL


Discriminant Analysis in Clinical Chemistry

The results of several different laboratory tests are usually necessary for the biochemical or physiological evaluation and the diagnostic classification of a particular patient. It is a common practice in medicine to evaluate such sets of results by multiple univariate comparisons, e.g. with reference values for each of the tests. A more appropriate approach would be the application of multivariate statistical methods, as recently pointed out in a review by Winkel (27). Examples of such multivariate methods are estimation of multivariate reference regions (12, 13), investigation of functional relationships by multiple regression or correlation analysis (10), exploratory classification of patients by cluster analysis (3, 26, 29), and assignment of diagnoses to patients by pattern recognition (21) or discriminant analysis and similar methods based on Bayes’ theorem (6, 25). In practical diagnostic judgements a clinician ‘intuitively’ uses methods that are formalized by pattern recognition or discriminant analysis. The present review will be confined to the application of discriminant analysis in clinical chemistry. After a short exposition of the theory of discriminant analysis, two problems will be discussed in more detail: (i) Do shortcomings in the empirical materials severely affect its use in clinical chemistry? (ii) How can the method be used for definition of rational combinations of laboratory tests for various diagnostic purposes?

Theoretical background

The theory of discriminant analysis has been reviewed by several authors (1, 5, 17, 23). Here only a sketchy review of the method will be given. Suppose we have the numerical values x1, ..., xk of k quantities (e.g. results of k laboratory tests) from each of N1, ..., Nd patients classified into one of d diagnostic groups (diseases). Discriminant analysis may then give the answer to two questions: (i) To what extent are the d groups of patients separated from each other by these k tests? This is the problem of distance. (ii) How can the results of these k tests be used to classify another patient into one of these d groups? This is the problem of classification. An attempt to visualize the principle of discriminant analysis is shown in Fig. 1. The ellipses represent the bivariate distributions of the two variables x1 and x2 in two groups. Projections of these two distributions onto planes along each of the two axes show that the univariate distributions (of x1 and x2) in the two groups overlap considerably.


Fig. 1. Discriminant analysis with two bivariate distributions.

If, therefore, individuals are classified according to only one of the two variables, a large number will be misclassified. The best separation of the two bivariate distributions is obtained when viewed along the line A-A. Their projections onto a plane along the perpendicular line B-B are nearly completely separated. Therefore, if the two variables x1 and x2 are combined in a function such that the resulting values L(x) are distributed along the axis B-B, very few misclassifications will occur. This may be obtained by discriminant analysis. This geometric interpretation may be extended to more than two variables. We can think of an observation as a point in a k-dimensional space. By discriminant analysis we divide this space into two regions, R1 and R2, by a hyperplane (analogous to the line A-A in Fig. 1). If an observation falls in R1, we classify it as belonging to the first group. Otherwise it belongs to the second group. By discriminant analysis one determines the coefficients a1, ..., ak of the linear discriminant function

L(x) = a0 + a1x1 + ... + akxk

in such a way (1, 5, 17, 23) that the univariate distributions of L(x) in the two groups are maximally separated. The correlations existing between the variables are disentangled so that each variable in the discriminant function carries only that amount of information which is unique to it and is not shared by the others. The comparison of two k-variate distributions is in this way simplified to the comparison of the two univariate distributions of L(x). The constant a0 is given such a value that an observation lies in R1 if L(x) > 0. Otherwise it


belongs to R2. The discriminant function may thus solve our classification problem. The problem of distance is handled by calculating D², the Mahalanobis squared distance (5, 19, 23). This squared distance is equal to the difference between the mean values of L(x) of the two groups (Fig. 1). This discriminant analysis may be generalized to d groups (d ≥ 2). Then the k-dimensional space is divided into d mutually exclusive regions. An observation is classified into the group corresponding to the region in which it is located (1, 17). The Mahalanobis D² may also be generalized (V²) to cover the case with d groups (23). The linear discriminant analysis presupposes (i) that the distributions are k-variate normal (Gaussian) distributions, and (ii) that the dispersion matrices of the groups are equal (1). (The dispersion matrix is the matrix of variances and covariances of the variables.) Since the a priori probability that an observation belongs to a particular group often is unknown and since risk functions are difficult to define, it is also commonly assumed that the groups are equally likely and that the misclassifications to either of the groups are equally deleterious (1).
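To make the two-group case concrete, the sketch below estimates the coefficients of L(x) from the two group means and a pooled dispersion matrix, and returns the Mahalanobis squared distance D². It is a minimal illustration only, written under the assumptions stated above (equal dispersion matrices, equal a priori probabilities); the function names and data layout are not from the original paper.

```python
# A minimal sketch of two-group linear discriminant analysis under the
# assumptions stated above (equal dispersion matrices, equal a priori
# probabilities). Rows are patients, columns are laboratory tests.
import numpy as np

def linear_discriminant(group1, group2):
    """Return (a0, a, D2) for L(x) = a0 + a.x and the Mahalanobis
    squared distance between the two groups."""
    m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
    n1, n2 = len(group1), len(group2)
    # Pooled (common) dispersion matrix of variances and covariances.
    pooled = ((n1 - 1) * np.cov(group1, rowvar=False) +
              (n2 - 1) * np.cov(group2, rowvar=False)) / (n1 + n2 - 2)
    # Coefficients a = S^-1 (m1 - m2); a0 centres L(x) so that the
    # boundary L(x) = 0 lies midway between the two group means.
    a = np.linalg.solve(pooled, m1 - m2)
    a0 = -0.5 * float(a @ (m1 + m2))
    d2 = float(a @ (m1 - m2))      # Mahalanobis squared distance D^2
    return a0, a, d2

def classify(x, a0, a):
    """Assign x to region R1 if L(x) > 0, otherwise to R2."""
    return 1 if a0 + float(a @ x) > 0 else 2
```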

Non-normal distributions

It is probably a common occurrence in clinical chemistry that laboratory results do not show normal (Gaussian) distributions (14), which is usually held to be a prerequisite for discriminant analysis (1). The problem may be tackled in four ways. (i) Cooper (8) has shown that it is possible to design other discriminant functions that are optimal for several large classes of non-normal distributions. This approach has to be dismissed as unrealistic in clinical chemistry, both because the types of distributions seldom are known a priori and because it is not likely that all the tests to be included in an analysis show the same type of distribution. (ii) More promising are the non-parametric methods for discriminant analysis that do not presuppose any special distribution at all. This was the method adopted by Mantel & Valand (20) for the study of six enzymes related to cancer. (iii) The data may be ‘normalized’ by an appropriate transformation function (4, 14). The problem with this approach is that the same variable may show different types of distributions in the groups under study. Then the same transformation function is not appropriate, and the use of different functions for the same variable will probably have unpredictable effects on the analysis. (iv) If the previous three alternatives are inapplicable, the traditional type of discriminant analysis can be run with untransformed data. The results will probably not be severely affected by non-normality of the distributions because discriminant analysis has proved to be rather robust against such deviations (5).
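As a small illustration of approach (iii), the sketch below applies one candidate ‘normalizing’ transformation (a logarithm, suitable for positively skewed, strictly positive results) and reports the skewness in each group before and after, so that one can see whether a single transformation is adequate for all groups. The data layout, the choice of transformation and the use of scipy.stats.skew are assumptions for illustration only.

```python
# A small sketch of approach (iii): apply one candidate normalizing
# transformation to every group and inspect the skewness before and
# after. If the groups respond very differently, the caveat in the
# text applies and the transformation is better avoided.
import numpy as np
from scipy.stats import skew

def skewness_before_after(groups, transform=np.log):
    """groups: dict of group name -> 1-D array of strictly positive results."""
    report = {}
    for name, values in groups.items():
        values = np.asarray(values, dtype=float)
        report[name] = (skew(values), skew(transform(values)))
    return report

# Example (hypothetical values):
# skewness_before_after({"healthy": [12, 15, 14, 40], "disease": [55, 60, 300, 90]})
```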


Discrete variables

Clinical chemical materials often have discrete data such as dichotomous variables (present/absent) or scored qualitative or semi-quantitative data (e.g. 0, +, ++, ...). These data may be analyzed by non-parametric methods, as described by Cochran & Hopkins (7) and Kurczynski (18). Another approach is to take advantage of the apparent robustness of the discriminant analysis against deviations from normality (see above) and treat the discrete data as quasi-continuous variables, as was recommended by Kendall & Stuart (17).
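As an illustration of the quasi-continuous treatment, the small sketch below simply recodes such scored results as integers before they enter the ordinary linear discriminant analysis. The coding tables are an assumption for illustration, not a recommendation from the references.

```python
# A small sketch of treating scored, semi-quantitative results as
# quasi-continuous variables: each score is recoded as an integer and
# then used like any other laboratory test. The coding tables are
# illustrative assumptions only.
SCORE_CODES = {"0": 0, "+": 1, "++": 2, "+++": 3}    # semi-quantitative scores
DICHOTOMOUS_CODES = {"absent": 0, "present": 1}      # dichotomous findings

def recode(values, codes):
    """Map qualitative scores to integers, leaving unknown entries as None."""
    return [codes.get(v) for v in values]

# Example: recode(["0", "++", "+"], SCORE_CODES) -> [0, 2, 1]
```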

Non-equal dispersion matrices

When using the standard Student’s t test for univariate testing of the difference between two mean values, it is assumed that the dispersion in the two samples, i.e. their variances, is equal. Similarly, it is presupposed in multivariate discriminant analysis that the dispersions, i.e. the variances of each variable and their covariances (the dispersion matrices), should be equal in order to estimate linear functions (1). This assumption is commonly not fulfilled with clinical chemical data (28). In such cases one may either use quadratic discriminant functions (11, 17, 24, 25) or sophisticated statistical methods for estimating linear functions (2, 5). There are, however, indications that linear discriminant analysis is rather robust even against quite considerable non-equality of dispersion matrices (5, 11). If this is generally true, little harm is done in performing the calculations in the usual way, provided the heterogeneity is not grotesquely manifested.
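The sketch below is a minimal illustration of a quadratic discriminant rule for the case of unequal dispersion matrices: each group keeps its own mean vector and dispersion matrix, and an observation is assigned to the group with the highest log-density score (equal a priori probabilities assumed). The names are illustrative and not taken from the cited references.

```python
# A minimal sketch of a quadratic discriminant rule for unequal
# dispersion matrices, assuming equal a priori probabilities. Each
# group keeps its own mean vector and dispersion matrix; an
# observation is assigned to the group with the highest score.
import numpy as np

def quadratic_scores(x, group_means, group_dispersions):
    """Log-density of x under each group's own k-variate normal model,
    up to a constant that is common to all groups."""
    scores = []
    for mean, disp in zip(group_means, group_dispersions):
        diff = np.asarray(x) - mean
        mahalanobis = diff @ np.linalg.solve(disp, diff)
        _, logdet = np.linalg.slogdet(disp)
        scores.append(-0.5 * (logdet + mahalanobis))
    return scores

def classify_quadratic(x, group_means, group_dispersions):
    """Return the 1-based number of the group with the highest score."""
    return int(np.argmax(quadratic_scores(x, group_means, group_dispersions))) + 1
```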

Missing data

It is a common deficiency of empirical materials to have ‘holes’ in the data matrix, i.e. some observations do not contain the results of all variables. This problem cannot be solved by estimating the missing data by multivariate regression methods, since the discriminant analysis is based on the same statistical principles. Therefore no improvement in reliability would be obtained. It has been proposed that the problem may be solved by iterative techniques founded on the maximum likelihood principle (15), but the reliability of this method does not seem to have been tested in relation to discriminant analysis. It is also claimed that mean values may be used for missing data (16). Our experience with the latter method is that it is acceptable provided the ‘holes’ for each variable are not too numerous. The presence of extreme results (‘outliers’) will, however, make this method unreliable. The only alternative left is therefore often to squeeze the maximum amount of information out of the available data without estimating the missing results. One then has to screen the data matrix and extract a subset matrix containing only complete observations with the actually used variables. It is then necessary


to define a lower limit of remaining observations in each group in order to ensure reliable estimations of squared distances and discriminant functions. This limit will depend on the nature of the empirical material, and often it has to be rather arbitrarily set after preliminary tests guided by comparison with the results obtained when using mean values (see above) or by comparing the results of reclassification tests.
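The two practical options discussed above, replacing missing results by mean values and extracting a complete-case subset for the variables actually used, can be sketched as follows. It is a rough illustration only; missing results are assumed to be coded as NaN in a patients-by-tests matrix, and the minimum-count warning threshold is arbitrary.

```python
# A rough sketch of the two practical options for 'holes' in the data
# matrix: mean-value substitution within a group, and extraction of a
# complete-case subset for the variables actually used. Missing
# results are assumed to be coded as NaN.
import numpy as np

def fill_with_means(group_data):
    """Replace each missing result with the mean of that variable within
    the group; acceptable only if the holes are few and no extreme
    outliers distort the means."""
    filled = np.array(group_data, dtype=float)
    column_means = np.nanmean(filled, axis=0)
    holes = np.isnan(filled)
    filled[holes] = column_means[np.nonzero(holes)[1]]
    return filled

def complete_cases(group_data, used_columns, minimum=20):
    """Keep only observations complete for the variables actually used;
    warn when fewer than a (rather arbitrary) minimum remain."""
    subset = np.array(group_data, dtype=float)[:, used_columns]
    keep = ~np.isnan(subset).any(axis=1)
    if keep.sum() < minimum:
        print("Warning: only", int(keep.sum()), "complete observations remain")
    return subset[keep]
```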


Searching for the ‘best’ test combination

The clinical chemical laboratory usually has a greater repertoire of tests than required for the biochemical or physiological evaluation and diagnostic classification of a particular patient. For some commonly encountered diseases or groups of diseases it is therefore desirable to have a rational ‘test battery’ containing only the most informative laboratory tests. The analyses should be included either because they measure indispensable biochemical or physiological parameters, or because they permit discrimination among different diseases (or between health and disease). Only the latter aspect will be discussed here. It has been a common practice to select the ‘best’ laboratory tests on the basis of multiple univariate comparisons. Student’s t test has thus been used for identifying the variables showing the greatest separation between the groups. This approach has to be supplemented by correlation analysis to avoid the inclusion of highly intercorrelated variables only (25). The better way of identifying the optimum test combinations is, however, to use multivariate techniques like discriminant analysis. Preliminary examinations of the empirical material by this method will usually show that out of a set of k available variables (laboratory tests) a subset containing m of the tests carries the greater part of the information (and accordingly the ability to discriminate among the groups). The inclusion of more tests only results in redundancy (22). The problem is then to select the best subset with m variables out of the set of k (m < k). The best combination of m variables is that which gives the greatest squared distance (D² or V²). The ability to discriminate among the groups may be checked by reclassifying the original observations by the discriminant function with these m tests. Draper & Smith (10) discuss the relative merits of different procedures for the related problem of selecting the ‘best’ regression equation. The application of these methods to discriminant analysis will here be briefly reviewed and discussed.

(i) Backward elimination. One possibility is to start with all k variables and eliminate the variable that contributes least to the discrimination among the groups. This procedure is then continued until only m variables remain. The disadvantage of this otherwise excellent method is that the necessary manipulations of large dispersion matrices may give disturbing rounding-off errors when many variables are included.


(ii) Forward selection of the most discriminative variables, starting with the best single variable and adding one variable at each step, is another possibility. The variable that discriminates best in combination with the variables already included is chosen. This is an economical and fast method. It has, however, two drawbacks. Powerful combinations of the remaining variables, which only show a moderate increase in discrimination when added singly, may be missed. Furthermore, the effect that the introduction of a new variable may have on the roles played by variables that entered at earlier stages is not tested.

(iii) The last disadvantage is overcome by the stepwise procedure, which is the forward selection and backward elimination combined in a zigzag manner. After each forward step the previously selected variables are re-evaluated by backward elimination. Sometimes previously entered variables are found superfluous by this testing. This is the most commonly used method for automatic search routines (e.g. the BMD07M program (9)). It will usually identify an acceptable combination of m variables among the k variables available, even though the ‘best’ combination may not necessarily be found (10). We have used the stepwise procedure extensively to analyze a liver disease material (25). According to our experience this method should not be used for automatic searching. The ‘best’ combination of laboratory tests identified by the automatic routine is not necessarily the ‘best’ for practical purposes. The method selects (or eliminates) at each step the variable showing the absolutely greatest (or least) discriminatory power. We usually found that several alternatives existed at each step because the differences in discriminatory power between the alternative sets of variables were statistically insignificant. In such situations the clinical chemist may decide to choose another variable for inclusion or elimination than the automatic routine, e.g. because the variable is a commonly used test or a test that also may give valuable pathophysiological information. We therefore prefer to use the method in an interactive manner. At each step intermediary results are printed out, permitting the clinical chemist to decide which variable to accept or reject. When in doubt, the consequences of including one particular analysis may be tested by running the stepwise procedure a couple of steps ahead and comparing the results with those obtained with alternative variables tested in the same way. The advantage of this interactive stepwise procedure is that it allows for clinical chemical judgement, ensuring that the results obtained are of practical value. It is, however, a clumsy and very slow method, requiring ample amounts of the clinical chemist’s time.

(iv) A final possibility is to test all possible combinations of m out of k variables. The program can then list a suitable number of the ‘best’ combinations in decreasing order of discriminatory power, still leaving the final choice open to the clinical chemist. This method usually requires a large amount of computer time since the number of combinations is equal to


k! / (m! (k-m)!). This may restrict the use of the best-of-all-possible-combinations method.
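The two procedures that bracket this list, backward elimination (i) and the all-combinations search (iv), can be sketched for the two-group case as below. This is a minimal illustration only: it assumes the Mahalanobis D² of the previous section as the measure of discriminatory power, and the helper and function names are hypothetical, not those of the BMD07M program or the cited references.

```python
# A minimal sketch, for two groups, of procedures (i) and (iv) above:
# subsets of laboratory tests are compared by their Mahalanobis squared
# distance D^2 (pooled dispersion matrix). Names are illustrative only.
from itertools import combinations
from math import comb
import numpy as np

def d_squared(g1, g2):
    """Mahalanobis D^2 between two groups for the given columns."""
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * np.cov(g1, rowvar=False) +
              (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
    return float((m1 - m2) @ np.linalg.solve(np.atleast_2d(pooled), m1 - m2))

def backward_elimination(group1, group2, m):
    """Procedure (i): repeatedly drop the variable whose removal
    reduces D^2 the least, until m variables remain."""
    remaining = list(range(group1.shape[1]))
    while len(remaining) > m:
        best_trial, best_d2 = None, -1.0
        for var in remaining:
            trial = [v for v in remaining if v != var]
            d2 = d_squared(group1[:, trial], group2[:, trial])
            if d2 > best_d2:
                best_trial, best_d2 = trial, d2
        remaining = best_trial
    return remaining

def all_combinations(group1, group2, m, top=10):
    """Procedure (iv): rank every subset of m out of k tests by D^2 and
    list the 'best' ones, leaving the final choice to the clinical
    chemist. The number of subsets is k!/(m!(k-m)!)."""
    k = group1.shape[1]
    print("subsets to evaluate:", comb(k, m))
    ranked = sorted(
        ((d_squared(group1[:, list(s)], group2[:, list(s)]), s)
         for s in combinations(range(k), m)),
        reverse=True)
    return ranked[:top]
```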


Conclusions

Discriminant analysis is a method that may find wide application in clinical chemistry. Deviations from normality of the distributions and non-equality of the dispersion matrices do not seem to significantly affect its use if the deficiencies are not grotesquely manifested. The method is suitable for identifying the optimal composition of clinical chemical ‘test batteries’ for various purposes, provided acceptable empirical materials are available. The optimal search procedure is the best-of-all-possible-combinations method. If the number of possible combinations is too large to be computed in a reasonable time, the interactive stepwise procedure is preferable.

Helge Erik Solberg
Dept. of Clinical Chemistry
Rikshospitalet
Oslo 1, Norway

REFERENCES

1. Anderson, T. W. pp. 126-153 in An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.
2. Anderson, T. W. & Bahadur, R. R. Classification into two multivariate normal distributions with different covariance matrices. Ann. Math. Statist. 33, 420, 1962.
3. Baron, D. N. & Fraser, P. M. Medical applications of taxonomic methods. Brit. med. Bull. 24, 236, 1968.
4. Bartlett, M. S. The use of transformations. Biometrics 3, 39, 1947.
5. Blackith, R. E. & Reyment, R. A. pp. 46-87 in Multivariate Morphometrics. Academic Press, London, 1971.
6. Burbank, F. A computer diagnostic system for the diagnosis of prolonged undifferentiating liver disease. Amer. J. Med. 46, 401, 1969.
7. Cochran, W. G. & Hopkins, C. E. Some classification problems with multivariate qualitative data. Biometrics 17, 10, 1961.
8. Cooper, P. W. Statistical classification with quadratic forms. Biometrika 50, 439, 1963.
9. Dixon, W. J. (ed.) pp. 214a-214t in Biomedical Computer Programs. University of California Press, Berkeley, 1971.
10. Draper, N. R. & Smith, H. pp. 163-177 in Applied Regression Analysis. Wiley, New York, 1966.
11. Gilbert, E. S. The effect of unequal variance-covariance matrices on Fisher’s linear discriminant function. Biometrics 25, 505, 1969.
12. Grams, R. R., Johnson, E. A. & Benson, E. S. Laboratory data analysis system. III. Multivariate normality. Amer. J. clin. Path. 58, 188, 1972.
13. Harris, E. K. Effects of intra- and interindividual variation on the appropriate use of normal ranges. Clin. Chem. 20, 1535, 1974.
14. Harris, E. K. & DeMets, D. L. Estimation of normal ranges and cumulative proportions by transforming observed distributions to Gaussian form. Clin. Chem. 18, 605, 1972.
15. Hartley, H. O. & Hocking, R. R. The analysis of incomplete data. Biometrics 27, 783, 1971.
16. Jackson, E. C. Missing values in linear multiple discriminant analysis. Biometrics 24, 835, 1968.
17. Kendall, M. G. & Stuart, A. pp. 314-331 in The Advanced Theory of Statistics, Vol. 3. Griffin, London, 1968.


18. Kurczynski, T. W. Generalized distance and discrete variables. Biometrics 26, 525, 1970.
19. Mahalanobis, P. C. On the generalized distance in statistics. Proc. nat. Inst. Sci. India 2, 49, 1936.
20. Mantel, N. & Valand, R. S. A technique of nonparametric multivariate analysis. Biometrics 26, 547, 1970.
21. Nilsson, N. J. Survey of pattern recognition. Ann. N.Y. Acad. Sci. 161, 380, 1969.
22. Ramsoe, K., Tygstrup, N. & Winkel, P. The redundancy of liver tests in the diagnosis of cirrhosis estimated by multivariate statistics. Scand. J. clin. Lab. Invest. 26, 307, 1970.
23. Rao, C. R. pp. 246-258 in Advanced Statistical Methods in Biometric Research. Hafner, Darien, 1970.
24. Smith, J. E. & Klem, L. Vowel recognition using a multiple discriminant function. J. acoust. Soc. Amer. 33, 358, 1961.
25. Solberg, H. E., Skrede, S. & Blomhoff, J. P. Diagnosis of liver diseases by laboratory results and discriminant analysis. Identification of best combinations of laboratory tests. Scand. J. clin. Lab. Invest. 35, 713, 1975.
26. Solberg, H. E., Skrede, S., Elgjo, K., Blomhoff, J. P. & Gjone, E. Classification of liver diseases by clinical chemical laboratory results and cluster analysis. Scand. J. clin. Lab. Invest. 36. In press, 1976.
27. Winkel, P. Patterns and clusters - multivariate approach for interpreting clinical chemistry results. Clin. Chem. 19, 1329, 1973.
28. Winkel, P. & Juhl, E. Assumptions in linear discriminant analysis. Lancet 2, 435, 1971.
29. Winkel, P. & The Copenhagen Study Group for Liver Diseases. Numerical taxonomic analysis of cirrhosis. I. The effect of varying the number and type of variables used. Comput. Biomed. Res. 7, 100, 1974.
