Computer Programs in Biomedicine 10 (1979) 55-60 © Elsevier/North-Holland Biomedical Press

DENSITY ESTIMATION APPLICATIONS FOR OUTLIER DETECTION

M.E. TARTER

Department of Biomedical and Environmental Health Sciences, University of California, Berkeley, CA 94720 and Department of Health Services, Resource of Cancer Epidemiology, Berkeley Way, Berkeley, CA 94704, USA

Nonparametric estimates of joint, conditional and marginal probability densities can be used to estimate the relative probability of a data point's recurrence. Outlier, unusual or abnormal values of a random variate tend to be those which are unlikely to recur. As part of an interactive graphical system, a procedure has been implemented which enables a biomedical researcher to view both the estimated probability and the numerical value of a data point's coordinates. This display circumvents the problem of interpreting a normal range in two or more dimensions and can thus be more easily generalized than most alternative outlier detection procedures.

Outliers; Probability-density-transformation; Graphical; Normal-range; Nonparametric

1. Introduction

In a recent paper [1] a procedure for isolating homogeneous data subgroups was described which relied primarily on the assumption that subgroups were either normally or lognormally distributed. A transformation $Z = C(Y \mid X)$ was introduced which tended to compact a cluster of bivariate-normally distributed data. Ideally, if any value $(X, Y) = (x, y)$ is associated with a nonpathological data component then the function $C$ will map $(x, y)$ onto or near the constant value $c$. The transformation $C$ is a functional, one of whose components is a nonparametric estimator $\hat{f}$ of conditional probability $f(y \mid x)$, where as usual:

$f(y \mid x) = f(x, y) / f(x)$    (1)

with $f(x) \neq 0$ and $(X, Y)$ having joint density $f(x, y)$.

It seems to have gone unnoticed by the author of [1] and other researchers that nonparametric estimators of $f(y \mid x)$ and $f(x, y)$ can themselves be used as data transformations. A definite integral of $f$, $F$, is frequently estimated by the sample cumulative $F^*$. Plots of $F^*(X_j)$ against $X_j$, where $X_j$ represents the $j$-th member of an i.i.d. sample, are the basis of many graphical procedures (see [2] ch. 6 or [3] ch. 17). It therefore seems reasonable to use nonparametric estimates $\hat{f}$ of $f$ as alternatives to $F^*$ as data transformations. By plotting points $(\hat{f}(Y_j \mid X_j), Y_j)$, $(\hat{f}(X_j, Y_j), Y_j)$ or other combinations of $X_j$, $Y_j$ and their estimated conditional, joint and marginal probability of recurrence, one can utilize the two dimensions available on conventional graphical display units for a variety of data exploration purposes. For example, it will be shown in section 4 that densely clustered data points can be removed, i.e., masked, in order to allow an investigator to more clearly discern outlier candidates.

2. Data generation preliminaries

To demonstrate the use of these new procedures a sample of 200 data point pairs was generated. The population underlying this data generation process contained a mixture of two bivariate normally distributed components, the first of which had a mean vector $(-2, -2)$ and variance-covariance matrix:

[2 × 2 variance-covariance matrix; entries not recoverable from the source]

The second component had a mean vector $(2, 2)$ and the same variance-covariance matrix. Although the mixing parameter was set equal to one-half, as could
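The sample described in section 2 and the transformation $(\hat{f}(Y_j \mid X_j), Y_j)$ from the introduction can be illustrated with a minimal sketch. Several choices here are assumptions, not taken from the paper: the variance-covariance matrix is illegible in the source, so unit variances and a correlation of 0.5 are used as placeholders; a Gaussian product-kernel estimator with bandwidth `h = 0.6` stands in for whatever nonparametric estimator the author actually used; and all function names are hypothetical.

```python
import math
import random


def sample_mixture(n=200, seed=0, rho=0.5):
    """Draw n (x, y) pairs from an equal mixture of two bivariate normals.

    Means (-2, -2) and (2, 2) and the mixing parameter 0.5 follow the text.
    ASSUMPTION: unit variances and correlation rho, because the paper's
    variance-covariance matrix cannot be read from the source.
    """
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        m = -2.0 if rng.random() < 0.5 else 2.0
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = m + z1
        y = m + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        pts.append((x, y))
    return pts


def gauss_kernel(u):
    """Standard normal density, used as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def joint_density(x, y, pts, h):
    """Product-kernel estimate of the joint density f(x, y)."""
    total = sum(gauss_kernel((x - xi) / h) * gauss_kernel((y - yi) / h)
                for xi, yi in pts)
    return total / (len(pts) * h * h)


def marginal_density(x, pts, h):
    """Kernel estimate of the marginal density f(x)."""
    total = sum(gauss_kernel((x - xi) / h) for xi, _ in pts)
    return total / (len(pts) * h)


def conditional_density(x, y, pts, h):
    """Estimate f(y | x) = f(x, y) / f(x), defined only where f(x) != 0."""
    fx = marginal_density(x, pts, h)
    return joint_density(x, y, pts, h) / fx if fx > 0.0 else 0.0


def transformed_points(pts, h=0.6):
    """Return the plotting pairs (f_hat(Y_j | X_j), Y_j) for every data point."""
    return [(conditional_density(x, y, pts, h), y) for x, y in pts]


if __name__ == "__main__":
    data = sample_mixture()
    pairs = transformed_points(data)
    # Points with the smallest estimated conditional density are the
    # outlier candidates, i.e. the values least likely to recur.
    for f_hat, y in sorted(pairs)[:5]:
        print(f"f_hat(y|x) = {f_hat:.4f}   y = {y:.3f}")
```

Sorting the pairs by their first coordinate lists the points judged least likely to recur. Plotting the first coordinate against the second gives the kind of two-dimensional display described in the introduction. The bandwidth strongly affects which points look unusual, so the value used here should be treated as a starting point only.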
