NORMALITY TRANSFORMATIONS FOR ENVIRONMENTAL DATA FROM COMPOUND NORMAL-LOGNORMAL DISTRIBUTIONS

LARRY G. BLACKWOOD*
Idaho National Engineering Laboratory, P.O. Box 1625, Idaho Falls, ID 83415, U.S.A.

(Received: September 1994; revised: November 1994)

Abstract. The combination of lognormally distributed quantities of interest with normally distributed random measurement error produces data that follow a compound normal-lognormal (NLN) distribution. When the measurement error is large enough, such data do not approximate normality, even after a logarithmic transformation. This paper reports the results of a search for a transformation method for NLN data that is not only technically appropriate, but easy to implement as well. Three transformation families were found to work relatively well. These families are compared in terms of success in achieving normality and robustness, using simulated NLN data and actual environmental data believed to follow a NLN distribution. The exponential family of transformations was found to give the best overall results.

1. Introduction

Many quantities of interest in the environmental sciences are known to follow two parameter lognormal distributions while the instruments with which they are measured produce normally distributed random measurement error. This produces observed data that follow a compound normal-lognormal (NLN) distribution. When measurement error is small relative to the variability of the lognormal quantities being studied, it can be effectively ignored. A simple logarithmic transformation of such data yields a close approximation to a normal distribution, allowing standard analysis methods to be applied. Traditionally this approach has worked quite well. However, increasing emphasis on measuring quantities occurring at trace levels (e.g. picograms or parts per trillion), along with recommendations by statisticians that so called 'less-than-detectable' values be included in analysis whenever possible (Gilbert, 1987; ASTM, 1983; Porter et al., 1988), has led to the situation where measurement error can be a dominant component in the overall variability of observed values. In these instances simply taking logarithms does not produce an adequate transformation. In fact, the common occurrence of negative data values with such distributions precludes the use of a simple logarithmic transformation altogether. Because the NLN distributional form does not suggest a simple transformation capable of yielding an exact normal distribution, a search for an analytically tractable transformation that approximates normality is required. This paper reports

* This work was supported by the U.S. Department of Energy, Office of Environmental Restoration and Waste Management, under DOE Idaho Field Office Contract DE-AC07-76ID01570.

Environmental Monitoring and Assessment 35: 55-75, 1995.
© 1995 Kluwer Academic Publishers. Printed in the Netherlands.


the results of an investigation to find a transformation method that is not only technically appropriate, but easy to implement as well. Findings are presented by way of example, focusing on specific data sets that are illustrative of typical NLN data and the corresponding issues involved in the transformation process. Following a brief summary of details of the NLN measurement model, three transformation families are compared using simulated NLN data and actual environmental data believed to follow such a distribution. The results given deal with single sample data only, but can be generalized to apply to other modeling situations as well (e.g. the analysis of residuals from ANOVA and regression models).

2. The Normal-Lognormal Measurement Model

The NLN distribution results from the simple measurement model

    Y = X + e                                                        (1)

where Y is the observed measurement, X is the true lognormally distributed quantity of interest, e is the normally distributed random measurement error term, and X and e are assumed independent. If

    ln(X) ~ N(μx, σx²)                                               (2)

and

    e ~ N(0, σe²)                                                    (3)

then Y has the same expected value as X,

    E(Y) = E(X) = exp(μx + σx²/2)                                    (4)

and has variance equal to the sum of the variances of X and e,

    Var(Y) = σe² + exp(2μx + σx²)(exp(σx²) − 1).                     (5)

The pdf and cdf of Y, obtained by the procedure of compounding, are

    f(y) = ∫₀^∞ f(y|x) f(x) dx
         = ∫₀^∞ [1/(2π σx σe x)] exp{−(ln x − μx)²/(2σx²) − (y − x)²/(2σe²)} dx   (6)

and

    F(y) = ∫₀^∞ Pr(Y ≤ y | x) f(x) dx.                               (7)

3. Transformation Families

Three transformation families were considered for NLN data. The first is the shifted logarithmic family,

    z = ln(y + c),    y + c > 0,                                     (8)

and the second is the shifted Box-Cox family (Box and Cox, 1964),

    z = ((y + c)^λ − 1)/λ,    λ ≠ 0, y + c > 0.                      (9)

In both cases the shift constant c must be large enough that y + c > 0. The third family considered is the exponential family of transformations,

    z = e^(λy)/λ,    λ ≠ 0.                                          (10)

While λ ≠ 0 is technically the only restriction, λ < 0 covers the appropriate values for NLN data. This family has received less attention in the literature than the


Box-Cox family but is quite applicable to use with negative numbers and has been mentioned in that context (e.g. Mosteller and Tukey, 1977).
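For concreteness, the three families can be written as short Python functions; this is a sketch assuming NumPy, and the function names are mine, not the paper's:

```python
import numpy as np

def log_shift(y, c):
    """Shifted logarithmic family, equation (8): z = ln(y + c); requires y + c > 0."""
    return np.log(y + c)

def box_cox_shift(y, c, lam):
    """Shifted Box-Cox family, equation (9): z = ((y + c)^lam - 1)/lam; lam != 0."""
    return ((y + c) ** lam - 1.0) / lam

def exponential(y, lam):
    """Exponential family, equation (10): z = e^(lam*y)/lam; lam < 0 for NLN data."""
    return np.exp(lam * y) / lam

# The exponential family is defined for any y, including negative values:
y = np.array([-0.60, 0.86, 4.02, 14.03, 50.49])
z = exponential(y, -0.24)
```

Note that only the exponential family needs no shift constant to accommodate negative observations.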

4. Evaluation Criteria

The effectiveness of the transformation families is evaluated qualitatively by visual examination of normal probability plots and quantitatively by use of the Shapiro-Wilk, skewness, and kurtosis statistics. The Shapiro-Wilk statistic was chosen because of its relationship to linearity in the normal probability plot. Skewness and kurtosis provide additional criteria for distinguishing between competing transformations: many statistical tests are sensitive to skewness and kurtosis, and they serve as the basis for one of the methods of parameter estimation discussed below. To ensure acceptance for use in day-to-day statistical analysis, particularly by data analysts in applied fields, ease of use is also important. Ease of use is evaluated by considering the level of understanding required to implement parameter estimation methods, the amount of effort required to obtain parameter estimates for a particular family, and the availability of computer routines required to carry out the calculations.

5. Parameter Estimation

To compare the relative effectiveness of the transformation families, it is necessary to pick a member of each family that is optimal in some sense (i.e. the parameter values must be chosen according to some consistent criterion). Various procedures are available for this purpose. Three are used in this study.

Given an emphasis on the Shapiro-Wilk statistic for evaluation, a simple approach is to perform a grid search over a suitable range of parameter values, creating a response surface by calculating the Shapiro-Wilk statistic at each grid point. The optimal value is the one that maximizes the test statistic. Assuming the right grid of values is chosen, it is a very effective method. This approach is quite easy to understand and can be easy to implement with a minimal amount of programming (assuming the availability of a statistical package such as SAS® which calculates the statistic). Its main drawback is that, when used with the two-parameter Box-Cox family, a great many grid points may need to be evaluated. This can be very time consuming, even on a computer.

The method of Berry (1987) is also based on a grid search, but the evaluation is based on minimizing the sum of the absolute values of the skewness and kurtosis statistics rather than maximizing the Shapiro-Wilk statistic. Berry refers to this sum as go, and that notation is continued here. This method is easy to understand and is perhaps even more easily implemented than a Shapiro-Wilk grid search, since Shapiro-Wilk is not as readily available in statistics packages as are skewness and kurtosis measures.
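Both grid searches can be sketched in a few lines for the single-parameter exponential family; this assumes SciPy, and `kurtosis` here returns the excess kurtosis, matching the statistics used in this paper:

```python
import numpy as np
from scipy.stats import shapiro, skew, kurtosis

def grid_search_exponential(y, lambdas):
    """Return (lambda maximizing the Shapiro-Wilk W,
               lambda minimizing Berry's go = |skewness| + |kurtosis|)."""
    best_sw = best_go = None
    for lam in lambdas:
        z = np.exp(lam * y) / lam             # exponential family, equation (10)
        w = shapiro(z)[0]
        go = abs(skew(z)) + abs(kurtosis(z))  # kurtosis() gives excess kurtosis
        if best_sw is None or w > best_sw[1]:
            best_sw = (lam, w)
        if best_go is None or go < best_go[1]:
            best_go = (lam, go)
    return best_sw[0], best_go[0]

# Example on simulated NLN data (underlying normal parameters are my
# back-calculation from the lognormal mean 2.4 and variance 10 of Section 6):
rng = np.random.default_rng(0)
y = np.exp(rng.normal(0.372, 1.003, size=100)) + rng.normal(0.0, 1.0, size=100)
lam_sw, lam_go = grid_search_exponential(y, np.arange(-1.0, -0.01, 0.01))
```

For a two-parameter family the same loop simply runs over a grid of (c, λ) pairs, which is where the computational burden noted below arises.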


The third method used is the empirical nonlinear data-fitting approach of Lin and Vonesh (1989). Like the Shapiro-Wilk grid search, this method is based on selecting parameter values that provide the most linear results in the normal probability plot. But instead of using the Shapiro-Wilk statistic, linearity is achieved by finding the best nonlinear least squares fit of the model

    N(i) = a + b·fθ(y(i))                                            (11)

where N(i) is the expected value of the ith order statistic from a normal distribution, y(i) is the ith ordered y value, fθ is the transformation equation of interest with parameter vector θ (in this case equation (8), (9), or (10), with parameters c and/or λ), and a and b are the intercept and slope of the best fitting line in the normal probability plot. This method is consistent with evaluating fit with the Shapiro-Wilk statistic (i.e. it generally yields approximately the same parameter values as searching a grid for the best Shapiro-Wilk statistic), and, assuming the availability of a nonlinear data-fitting routine, is generally quicker and easier to implement than even simple grid searches. However, the method can be difficult to understand for those not experienced in the analysis of nonlinear models and is not infallible (as is shown below).

Other potentially useful estimation methods reported in the literature include the maximum likelihood methods of Carroll and Ruppert (1984) and Box and Cox (1964). These methods are not considered here because they can be difficult to understand and implement for some users, and because of technical problems noted by Berry (1987).
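For the single-parameter exponential family, the Lin and Vonesh fit of equation (11) can be sketched with SciPy's `curve_fit`; the Blom approximation for the expected normal order statistics is my assumption, since the paper does not state how N(i) was computed:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def lin_vonesh_exponential(y):
    """Fit N(i) = a + b * exp(lam * y(i)) / lam by nonlinear least squares."""
    y_sorted = np.sort(y)
    n = len(y_sorted)
    # Blom approximation to the expected normal order statistics N(i)
    N = norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))

    def model(yi, a, b, lam):
        return a + b * np.exp(lam * yi) / lam

    # Restrict lam to negative values, the appropriate range for NLN data
    (a, b, lam), _ = curve_fit(model, y_sorted, N, p0=[0.0, 1.0, -0.2],
                               bounds=([-np.inf, -np.inf, -5.0],
                                       [np.inf, np.inf, -1e-3]))
    return a, b, lam

rng = np.random.default_rng(1)
y = np.exp(rng.normal(0.372, 1.003, size=100)) + rng.normal(0.0, 1.0, size=100)
a, b, lam = lin_vonesh_exponential(y)
```

Bounding λ away from zero guards against the division blow-up at λ = 0; for the two-parameter Box-Cox family the same fit is prone to the convergence failures described below.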

6. Simulated NLN Data

In preparing for this paper, NLN distributions with various parameter configurations were examined by means of simulation. The results presented here illustrate the main trends observed in that investigation. To begin, it is instructive to illustrate how the introduction of normal measurement error affects the distribution of lognormal data. This is achieved by examining normal probability plots of log-transformed NLN data with varying normal variance parameters. The data are then used to illustrate the performance of the various transformation families.

6.1. VARIANCE RATIOS AND DEVIATION FROM LOGNORMALITY

The ratio of lognormal to normal components of the NLN variance in equation (5) is a good indicator of the extent of the deviation from a pure lognormal distribution. This ratio affects the number of negative values likely to be observed, as well as the suitability of the log transformation on the positive valued results. Figure 1 shows the normal plot of simulated log transformed NLN data with lognormal to normal variance ratios of 100, 20, and 10 along with a pure lognormal for comparison

purposes. The linear fit to the pure lognormal data in Figure 1(a) is repeated in Figures 1(b), (c), and (d) to illustrate the deviations caused by the introduction of normal error.

[Fig. 1. Normal probability plots of log-transformed NLN data: (a) pure lognormal; (b), (c), and (d) lognormal to normal variance ratios of 100, 20, and 10.]

The data were formed by drawing a pseudo-random sample of size n = 100 from a lognormal distribution with mean 2.4 and variance 10, and also independently from a standard normal distribution. The standard normal values were multiplied by the appropriate factor to obtain the desired variance ratio and added to the lognormal values. For comparison purposes, the same base lognormal and normal samples are used in all the graphs. The lognormal mean and variance values for the simulated data were chosen to produce a coefficient of variation of approximately 1.3. (The values were obtained by starting with a standard normal distribution and adjusting the mean until the lognormal distribution obtained after exponentiation produced the desired coefficient of variation value.) Matched with the normally generated error terms, this parameter configuration produces negative observed values with a frequency matching that of many real data sets. The actual number of negative values produced in each set of simulated data is indicated in the graphs. Because the logarithm of a negative number is undefined, these values cannot be plotted, so the data are in some sense censored. To maintain the proper relationship for assessing normality of the logs of the positive valued results, expected normal values in each graph are based on order statistics for the original 100 data points rather than just the positive valued points.

The graphs in Figure 1 show that, for smaller values of y, the log transform is too severe; that is, it over-corrects for skewness in the data. The NLN data were skewed right prior to transformation but skewed left after taking logs. The amount of overcorrection depends on the relative size of the normal variance component.
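The construction just described can be reproduced along the following lines; the underlying normal parameters are my back-calculation from the stated lognormal mean 2.4 and variance 10, and the random seed is arbitrary, so exact counts of negative values will differ from those in Figure 1:

```python
import numpy as np

# A lognormal mean of 2.4 and variance of 10 imply, for the underlying normal,
# sigma_x^2 = ln(1 + 10/2.4^2) and mu_x = ln(2.4) - sigma_x^2/2.
sigma_x2 = np.log(1.0 + 10.0 / 2.4**2)
mu_x = np.log(2.4) - sigma_x2 / 2.0

rng = np.random.default_rng(42)
x = np.exp(rng.normal(mu_x, np.sqrt(sigma_x2), size=100))  # lognormal component
e = rng.normal(0.0, 1.0, size=100)                         # standard normal errors

negatives = {}
for ratio in (100, 20, 10):
    scale = np.sqrt(10.0 / ratio)  # normal variance = lognormal variance / ratio
    y = x + scale * e              # same base samples in every panel, as in Figure 1
    negatives[ratio] = int(np.sum(y < 0))
```

Because the same base samples are reused, lowering the variance ratio can only add negative values, never remove them.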
As the lognormal to normal variance ratio increases, the NLN data behave more and more like simple lognormal data. However, a surprising finding indicated by the data is how little normal measurement error is required to produce significant departures from lognormality. Even with a variance ratio of 100, there is still marked deviation from lognormality, including the occurrence of 5 negative values in the data. For the variance ratios shown, the log transform works well for larger values of y, where the normal component is small relative to the lognormal component. However, with even smaller ratios than those shown, the larger values of y become noticeably affected as well. If the normal variance is large enough, say on the same order of magnitude as the lognormal variance, a point is reached where the data begin to approximate a normal distribution quite well without any transformation at all. Consideration of such data is excluded here because a measurement system in which the measurement error is on the same order of magnitude as (or larger than) the variability in the quantities of interest would generally not be considered a useful measurement system. In each graph in Figure 1, there is a point at which the NLN data begin to closely follow the lognormal fit line. The larger the lognormal to normal variance ratio


the sooner the linearity begins, and the closer the fit to the pure lognormal line. It has been suggested (Hawkins, 1991) that the slope and intercept of this upper line segment can provide quick graphical estimates of the lognormal mean and standard deviation. For the graphs in Figures 1(b), (c), and (d), such a procedure would in fact yield results close to those for the pure lognormal data, although both the mean and standard deviation estimates would tend to be biased.

6.2. APPLYING TRANSFORMATIONS TO NLN DATA

To illustrate the relative performance of the three transformation families on NLN data, consider the data from Figure 1(d). Both the process of obtaining parameter values for these data and the degree to which the transformations using those parameters achieve normality are discussed below. While these results are for a specific data set only, the general findings can be expected to apply to a broad class of similar data.

6.2.1. Parameter estimation

A Shapiro-Wilk grid search quickly leads to parameter estimates of λ = -0.24 for the exponential family and c = 2.5 for the ln(y + c) family. Repeating the grid using Berry's go criterion yielded the same results to two significant digits. Applying the method of Lin and Vonesh resulted in estimates of λ = -0.24 and c = 2.7 after only a few iterations. Problems with both the grid search and the Lin and Vonesh methods occurred when applied to the Box-Cox family. These problems stem primarily from the two-parameter nature of the family, in particular from the interrelationships of the effects of the two parameters. At least in regard to treating skewness, there are many combinations of c and λ that will yield nearly the same results (Hines and Hines, 1987; Tukey, 1957; Atkinson, 1985). Increasing the constant c just decreases the value of λ required to achieve the best results, without substantially changing the adequacy of the fit to normality. Since all three evaluation criteria are affected at least to some extent by skewness, one might expect a very flat response surface. This is in fact the case for the simulated data, as is illustrated in Figure 2, which shows a contour plot of Shapiro-Wilk test statistic values. (The contour lines in the graph are spaced at unequal intervals to enhance the detail of the surface near its maximum.)
There is a response surface ridge that is very nearly flat, indicating a set of pairs of parameter values which yield nearly the same degree of linearity in the normal probability plot. The existence of the flat response surface and the occurrence of a number of apparent local maxima required the use of a more extensive grid in the search methods. (The local maxima are artifacts of the grid spacing used; changing to a finer grid tends to eliminate them.) For the Shapiro-Wilk grid search, results were first calculated for a rectangular grid covering values of λ from -5.4 to 0.6 and c from 1.7 to 22.0. Grid points were spaced using increments of 0.2 for λ.

[Fig. 2. Contour plot of Shapiro-Wilk test statistic values for the Box-Cox family as a function of c and λ.]

Increments of 0.3 and 1.0 were used for c in the ranges from 1.7 to 5.0 and 5.0 to 22.0 respectively. After the initial grid search narrowed down the region of interest, the grid -0.6 to -2.5 for λ by 4.7 to 10 for c was searched in increments of 0.1 to get the final estimates. A similar approach was used to obtain the results for go. The Box-Cox grid search process took several hours on a personal computer. A smarter grid search would have reduced the required computational time somewhat, but not to any really practical level. The Shapiro-Wilk grid search yielded values of c = 6.1 and λ = -1.0. The go grid search gave quite different results of c = 20.4, λ = -4.5. The Lin and Vonesh method produced no reliable parameter estimates for the Box-Cox family of transformations. The method converged to many local optimal values (depending on the starting values used) and often converged very slowly (e.g. no convergence even after 1000 iterations).

6.2.2. Transformation results

Results of applying the exponential, ln(y + c), and Box-Cox transformations using the parameter estimates indicated above are given in Table I. When different optimal parameter estimates were obtained by the different estimation methods, results with each estimate are given for comparison purposes. Corresponding normal probability plots are graphed in Figure 3, except that graphs for ln(y + 2.7) and the Box-Cox transformation with c = 20.4 and λ = -4.5 are not shown, as they are visually nearly indistinguishable from the other graph of the same family. All three graphs shown indicate quite good normal fits, perhaps the only notable difference being slightly greater deviations from linearity in the tails of the ln(y + 2.5) graph, due to its higher kurtosis value. The results in Table I confirm the findings indicated in the graphs that all three families were quite successful in achieving normality.
The Box-Cox family produced the best results in terms of the Shapiro-Wilk test statistic, but the other two families exhibit very good results on that criterion also, as indicated by the associated p-values. The results are a clear improvement over the Shapiro-Wilk value for the untransformed data. (The p-values for the Shapiro-Wilk statistics are included because they provide a clearer indication of the adequacy of the fit to the data than does the statistic value alone. However, because the parameter estimates depend on the data, the p-values should perhaps not be interpreted too literally. Berry (1987) and Hinkley and Runger (1984), for example, discuss this issue in more detail.) The results in terms of skewness were uniformly good as well, showing marked improvement over the untransformed data. Only minor differences distinguish among the method/family combinations. For a sample size of 100 from a normal population, the 95% points of the skewness statistic are ±0.389, so this can serve as a comparison point for the results shown. Differences in kurtosis values are the most notable in the table, with the exponential family results being the best. The ln(y + c) results exceed, and the Box-Cox results with c = 6.1 and λ = -1.0

[Fig. 3 panels: (a) ln(y + 2.5); (b) e^(-0.24y)/-0.24; (c) ((y + 6.1)^(-1.0) - 1)/(-1.0).]

Fig. 3. Normal probability plots of transformations from Table I.


[Table I. Shapiro-Wilk, skewness, and kurtosis results for the transformations applied to the simulated NLN data; the values are discussed in the text.]


came close to, the upper 95% point of the distribution of kurtosis values, which is approximately 0.7. The statistics in Table I reflect the results for one data set only. However, analysis of various other data sets indicated similar patterns of results. In particular, the exponential and Box-Cox families tended to give better results than did ln(y + c). For the Box-Cox family, optimizing on go rather than the Shapiro-Wilk statistic often produced quite different parameter estimates, as it did for the data in Table I. In such cases, a better overall fit was obtained by using the estimates based on go (which generally resulted in larger c and smaller λ values). That is, the Shapiro-Wilk value tended to be about the same regardless of the method, but kurtosis was noticeably smaller when go was used as the optimization criterion.

6.3. ROBUSTNESS

The above findings show that, while there are some differences in results, all three families of transformations examined can be used to obtain reasonable approximations to normality. Consideration of robustness provides additional criteria on which to compare the three families. The data in Table I and the flat response surfaces discussed above indicate that quite acceptable fits can be obtained with many suboptimal parameter values. So, in that sense, all three families are relatively robust in terms of sensitivity to the parameter estimates used for a particular data set. In this section, several additional aspects of robustness are considered by applying three of the transformations in Table I to perturbations of the data or additional sampling from the same NLN distribution. The three transformations considered are the exponential transform, the logarithmic transform with c = 2.5, and the Box-Cox transform with c = 20.4 and λ = -4.5. The transformation ln(y + 2.7) is not considered further because its results are generally very close to those for c = 2.5. The Box-Cox transformation with c = 6.1 and λ = -1.0 is not discussed further either, because its Shapiro-Wilk values are always comparable to those of the Box-Cox transform with c = 20.4 and λ = -4.5 while its go value is almost always worse.

6.3.1. Repeated sampling

For many uses of transformations, it is desirable to be able to apply the same transformation with the same parameter estimate to numerous samples. For example, when more data are to be collected at a later date, applying the same transformation to the future samples ensures comparability. In experimental settings involving ANOVA or related techniques, the same transformation should be applied to all treatment groups. To test for this type of robustness with simulated data, the three transformations were applied to 1000 repeated samples from the same NLN distribution used to generate the data originally analyzed in Table I.
That is, 1000 new data sets were generated from the same NLN distribution, but instead of optimizing the parameter estimate for each sample, the estimates in Table I were used on all 1000 samples.


TABLE II
Transformation results from application to 1000 repeated samples from a lognormal (2.4, 10) + normal (0, 1) NLN distribution

                                  Shapiro-Wilk statistic       go
Transformation                    25%      median   75%       25%    median   75%
ln(y + 2.5)                       0.9600   0.9726   0.9806    0.99   1.84     3.24
e^(-0.24y)/-0.24                  0.9701   0.9751   0.9794    0.35   0.52     0.74
((y + 20.4)^(-4.5) - 1)/-4.5      0.9723   0.9783   0.9826    0.31   0.51     0.82

Both the ln(y + c) and Box-Cox transformations have the potential to be less robust in regard to repeated sampling simply because of the requirement that y + c > 0. A value of c chosen as optimal for a current sample will be inappropriate when a value less than -c is observed in future sampling efforts. This in fact happened when ln(y + 2.5) was applied to the repeated simulations. Approximately 6% of the 1000 samples contained at least one data value less than -2.5, so that logarithms could not be calculated. (The smallest number generated was about -3.16, so there were no problems with applying the Box-Cox transformation.) The results of this simulation are summarized in Table II, which gives the median value and the 25% and 75% points of the empirical distributions for the Shapiro-Wilk and go statistics. In addition to the problem with negative values, the ln(y + 2.5) transformation showed the poorest performance at all levels of both statistics except for the 75% point of the Shapiro-Wilk results. The Shapiro-Wilk 25% point for ln(y + 2.5) is the only Shapiro-Wilk value in the table with an associated p-value less than 0.1. Also, the 25% point of the ln(y + 2.5) go value was greater than the 75% point values for the other transformations, indicating considerably poorer performance. The Box-Cox transformation showed the best median go value while the exponential transformation showed the best results at the 75% point. As with the data in Table I, kurtosis was the biggest contributor to go.
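A condensed version of this repeated-sampling check, shown here for the exponential transform only, can be sketched as follows (the distribution parameters are my back-calculation from the lognormal mean 2.4 and variance 10, and the fixed λ = -0.24 is the Table I estimate):

```python
import numpy as np
from scipy.stats import skew, kurtosis

sigma_x2 = np.log(1.0 + 10.0 / 2.4**2)
mu_x = np.log(2.4) - sigma_x2 / 2.0
rng = np.random.default_rng(7)

log_failures, go_exp = 0, []
for _ in range(1000):
    y = np.exp(rng.normal(mu_x, np.sqrt(sigma_x2), 100)) + rng.normal(0.0, 1.0, 100)
    if y.min() <= -2.5:
        log_failures += 1              # ln(y + 2.5) is undefined for this sample
    z = np.exp(-0.24 * y) / -0.24      # fixed exponential transform from Table I
    go_exp.append(abs(skew(z)) + abs(kurtosis(z)))

median_go = float(np.median(go_exp))
```

The exponential transform never fails on a new sample, whereas the `log_failures` counter records how often a fixed shift constant c is overrun by a new negative value.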

6.3.2. Outliers/influential data points

Samples from lognormal distributions with relatively large variances are notorious for producing apparent outlier data points. These data points can have extreme effects on parameter estimates calculated with and without the influential points (for an example of the extent of such effects, see Blackwood, 1991), thus leading to uncertainty as to how such influential points should be treated. To see how influential observations might affect the NLN transformation process, the analysis of the data from Figure 1(d) and Table I was repeated after doubling the largest of the 100 data values. Results are presented in Table III. Shapiro-Wilk statistic values in Table III are all quite acceptable, actually increasing

TABLE III
Transformation results for data with a single outlier

Transformation                    Shapiro-Wilk   Shapiro-Wilk   Skewness   Kurtosis   go
                                  statistic      P-value
ln(y + 2.5)                       0.9759         0.33           0.39       2.32       2.70
e^(-0.24y)/-0.24                  0.9821         0.64           0.04       0.21       0.25
((y + 20.4)^(-4.5) - 1)/-4.5      0.9852         0.79           0.06       0.42       0.48

for the exponential and Box-Cox transformations compared to the values reported in Table I. More noticeable effects were evident in the skewness and kurtosis statistics. The exponential transformation remained the most stable in the presence of the outlier, increasing its skewness and kurtosis values only slightly. Kurtosis for the ln(y + 2.5) transform increased considerably, to a point where there is clearly a problem with the results.

6.3.3. Varying lognormal/normal parameters

Another way in which NLN distributions can change is for the lognormal or normal component parameters to shift. To test for robustness of transformation under these conditions, variations in the parameters making up the data in Figure 1(d) were implemented (holding the core data the same), and the transformations reapplied. First, the variance of the normal error component [i.e. σe² in equation (3)] was eliminated, so the transformations are being applied to the pure lognormal data of Figure 1(a). Then the normal error variance was doubled and halved. Finally, both the mean and variance of the normal distribution underlying the lognormal component [i.e. μx and σx² in equation (2)] were doubled and halved. These results are given in Table IV. When applied to the pure lognormal data, none of the transformations worked well at all, indicating there are limits to their robustness. When the normal variance component was halved, the Shapiro-Wilk statistic suffered for all transformations, but remained reasonable except perhaps for ln(y + 2.5). Ln(y + 2.5) also showed the poorest results in regard to skewness and kurtosis. The exponential transform was noticeably better than the others on go. When the normal variance component was doubled, the exponential transformation showed the best results on all measures, while ln(y + 2.5) showed consistently poor results. The exponential transform also produced the best results on all measures when σx² was halved. Doubling that variance produced mixed results, with the exponential and Box-Cox transformations holding up the best. Doubling or halving μx had similar effects: Shapiro-Wilk values remained quite acceptable, while kurtosis values tended higher for ln(y + 2.5).


TABLE IV
Transformation results after changing parameters of the underlying normal and lognormal distributions

Transformation                  Parameter      Shapiro-Wilk   Shapiro-Wilk   Skewness   Kurtosis   go
                                change         statistic      P-value
ln(y + 2.5)                     σe² = 0        0.8688         0.00            1.32       1.56      2.88
                                σe² halved     0.9667         0.08            0.58       0.94      1.53
                                σe² doubled    0.9137         0.00           -1.60       7.04      8.64
                                σx² halved     0.9790         0.47           -0.43       1.06      1.48
                                σx² doubled    0.9550         0.01            0.67       1.50      2.17
                                μx halved      0.9805         0.55           -0.09       1.11      1.20
                                μx doubled     0.9795         0.50            0.20       0.92      1.12
e^(-0.24y)/-0.24                σe² = 0        0.8824         0.00            1.02       0.29      1.31
                                σe² halved     0.9712         0.17            0.39       0.10      0.48
                                σe² doubled    0.9722         0.20           -0.45       0.50      0.94
                                σx² halved     0.9810         0.58           -0.24       0.27      0.50
                                σx² doubled    0.9658         0.07            0.17      -0.21      0.39
                                μx halved      0.9801         0.53            0.04       0.21      0.25
                                μx doubled     0.9776         0.40           -0.12      -0.05      0.17
((y + 20.4)^(-4.5) - 1)/-4.5    σe² = 0        0.8833         0.00            1.06       0.46      1.52
                                σe² halved     0.9727         0.21            0.40       0.22      0.61
                                σe² doubled    0.9721         0.19           -0.50       0.72      1.22
                                σx² halved     0.9807         0.57           -0.28       0.39      0.67
                                σx² doubled    0.9698         0.13            0.23      -0.03      0.25
                                μx halved      0.9814         0.60            0.01       0.33      0.33
                                μx doubled     0.9810         0.58           -0.08       0.10      0.18

6.3.4. Method robustness

In regard to the effect of the choice of parameter estimation method, the exponential and ln(y + c) families are more robust than is the Box-Cox family in the sense that their parameter estimates seem to be less sensitive to the estimation procedure used. The differences between the grid search parameter estimate results for the Box-Cox family in Table I are somewhat disconcerting.

7. Example: Trace Level Radionuclides

The previous section illustrated the hypothetical effects of contaminating a lognormal distribution with normally distributed measurement error, and how transformation can be used to obtain near normality of such data. To illustrate the type


TABLE V
Gross beta radioactivity measurements in air (μCi/cc × 10^15)

-0.60    10.11    17.53
 0.86    10.96    17.58
 4.02    11.77    18.28
 4.93    12.40    19.07
 5.12    12.58    20.46
 5.14    12.76    21.66
 5.72    13.57    22.44
 6.08    13.63    23.38
 6.76    14.03    24.51
 7.08    14.12    28.97
 7.52    14.28    30.32
 8.36    14.35    35.36
 9.48    14.45    44.37
 9.74    15.76    44.37
 9.98    16.70    50.49

of situation in which NLN data are encountered in analyzing real data, consider the following case of measuring trace level radionuclides. Table V presents forty-five measurements of gross beta radiation obtained from an air monitor near a radioactive waste storage compound. All the measurements showed gross beta radiation at trace levels only, including one measurement less than zero. It is well known that pollutant concentrations, including radioactive constituents, in air and other media tend to follow lognormal distributions (e.g. Crow and Shimizu, 1988; Eberhardt and Gilbert, 1980). However, when only trace levels are available for measurement, the observed data often fail to show a close fit to lognormality. This lack of fit is indeed the case with the data in Table V, as indicated by the occurrence of the negative value and by Figures 4(a) and 4(b), which show the normal probability plots for the raw data and for the logarithms of the data. Because these data indicate only trace level quantities of radioactivity, the observed measured values were subject to relatively large sampling and measurement error effects. These errors were found to be approximately normally distributed in a previous analysis of replicate measurements. Thus, the NLN measurement model seems appropriate for the data. Results of the transformation analysis on these data are presented in Table VI. For comparison purposes, the results from the simple two-parameter lognormal assumption (i.e. ln(y)) are also given, with the single negative value eliminated. As before, some differences in parameter estimates were obtained depending on

[Table VI: results of the transformation analysis of the gross beta data; the table values are not recoverable from the source.]


the method used. Generally, the disagreement in parameter values was smaller than that observed for the simulated data and produced inconsequential differences in the performance results. Also as before, convergence problems occurred when applying the Lin and Vonesh method to the Box-Cox family of transformations, to the extent that no useful results were obtained, and the grid search for the Box-Cox parameter estimates consumed considerable time.

All the transformations showed marked improvement over the simple lognormal results, particularly in terms of the Shapiro-Wilk test statistic. Skewness and kurtosis values were quite acceptable as well, with the possible exception of the ln(y + c) transformation, which had kurtosis values noticeably higher than those for the other transformations (although not near significant levels). The normal probability plot of the exponentially transformed data given in Figure 4(c) shows graphically how well the transformation worked.

8. Summary and Discussion

All three families of transformations for NLN data produced reasonable approximations to normality under at least some conditions. Since most statistical methods based on the normal distribution are to some extent robust to departures from normality, any of the transformation family and estimation method combinations discussed will likely produce satisfactory results. For practical application, then, choosing an approach with ease of use as a major criterion seems appropriate.

Balancing ease of use against performance on the various statistical evaluation criteria, the exponential family has much to recommend it as the preferred choice for the type of NLN data illustrated here. The exponential transformation is a single-parameter family (making parameter estimation simple and straightforward) and deals with negative numbers in a natural way, in that it is not necessary to place arbitrary limits on the parameter to ensure the transformation's applicability to negative values. In terms of achieving a successful transformation to normality, the exponential family often showed noticeably better results for kurtosis and Berry's g₀ criterion than did the other two families, and very good results for the Shapiro-Wilk test statistic. It also has good robustness to changes in the data.

While the exponential family works quite well under a variety of conditions, use of the ln(y + c) and Box-Cox families should not be ruled out entirely. The emphasis in this report was on NLN distributions with high probabilities of producing negative values, because they are the most difficult to deal with. Many NLN distributions do not possess this characteristic. Such cases were not studied extensively in preparing this report, but there was some indication that the performance of ln(y + c) improves considerably for primarily positive-valued NLN data.

[Figure 4: three normal probability plots of expected normal value against (a) Gross Beta, (b) ln(Gross Beta) with the one negative value omitted, and (c) e^(-0.05·Gross Beta)/(-0.05); the plots themselves are not recoverable from the source.]

Fig. 4. Normal probability plots of transformed gross beta data.


(The Box-Cox family will, of course, always perform as well as or better than ln(y + c), but at a higher computational cost.) Perhaps the simplest, most robust general approach to obtaining good transformation results under a wide range of possible conditions would be to consider both the exponential and ln(y + c) transformation families, with parameter estimation based on minimizing g₀ and carried out by means of a grid search. Parameter estimation for both families can often be done with the same grid search, so little effort is required beyond that of considering a single family. The more difficult Box-Cox family need only be considered if the exponential and ln(y + c) families fail to yield acceptable results. The use of a grid search and Berry's g₀ criterion ensures that the analysis can be done with even the simplest computer programs. While at least checking the Shapiro-Wilk results when optimizing on g₀ would be helpful, the data have shown that good Shapiro-Wilk results tend to accompany good g₀ results.
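A minimal sketch of this combined grid-search strategy follows, applied to the Table V data. Absolute sample skewness is again used as a stand-in for Berry's g₀; the grids, step sizes, and the function name `best_transform` are assumptions made here for illustration, not the paper's implementation.

```python
import math

# Gross beta measurements from Table V.
data = [-0.60, 0.86, 4.02, 4.93, 5.12, 5.14, 5.72, 6.08, 6.76, 7.08,
        7.52, 8.36, 9.48, 9.74, 9.98, 10.11, 10.96, 11.77, 12.40, 12.58,
        12.76, 13.57, 13.63, 14.03, 14.12, 14.28, 14.35, 14.45, 15.76, 16.70,
        17.53, 17.58, 18.28, 19.07, 20.46, 21.66, 22.44, 23.38, 24.51, 28.97,
        30.32, 35.36, 44.37, 44.37, 50.49]

def skewness(xs):
    # Moment estimator of sample skewness; stands in for Berry's g0.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def best_transform(ys):
    """One grid search over both the exponential and ln(y + c) families.

    Returns (family, parameter, transformed values) minimizing |skewness|."""
    candidates = []
    # Exponential family e^(lam*y)/lam: defined for negative data with no
    # extra constraint on lam (other than lam != 0).
    for k in range(1, 41):
        lam = -k / 200  # lam in [-0.005, -0.20] (grid is an assumption)
        candidates.append(("exp", lam, [math.exp(lam * y) / lam for y in ys]))
    # ln(y + c) family: c must exceed -min(ys) so every logarithm argument
    # is positive; this is the limit on the parameter the text refers to.
    for k in range(1, 41):
        c = -min(ys) + k  # grid step of 1 (an assumption)
        candidates.append(("log", c, [math.log(y + c) for y in ys]))
    return min(candidates, key=lambda t: abs(skewness(t[2])))

family, param, transformed = best_transform(data)
print(family, round(param, 3), round(skewness(transformed), 3))
```

The Box-Cox family could be appended to the candidate list in the same way, should neither of the first two families prove acceptable.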

References

American Society for Testing and Materials: 1983, Annual Book of ASTM Standards, Vol. 11.01, Designation D4210-83.
Atkinson, A. C.: 1985, Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford, U.K.: Clarendon Press.
Berry, D. A.: 1987, 'Logarithmic Transformations in ANOVA,' Biometrics 43, 439-456.
Blackwood, L. G.: 1991, 'The Quality of Mean and Variance Estimates for Normal and Lognormal Data When the Underlying Distribution is Misspecified,' Journal of Chemometrics 5, 263-271.
Box, G. E. P. and Cox, D. R.: 1964, 'An Analysis of Transformations (with Discussion),' Journal of the Royal Statistical Society, Series B 26, 211-252.
Carroll, R. J. and Ruppert, D.: 1984, 'Power Transformations When Fitting Theoretical Models to Data,' Journal of the American Statistical Association 79, 321-328.
Crow, E. L. and Shimizu, K. (eds.): 1988, Lognormal Distributions, Theory and Applications. New York: Marcel Dekker.
Eberhardt, L. L. and Gilbert, R. O.: 1980, 'Statistics and Sampling in Transuranic Studies,' in Transuranic Elements in the Environment (W. C. Hansen, ed.). DOE/TIC-22800, NTIS, pp. 173-186.
Gilbert, R. O.: 1987, Statistical Methods for Environmental Pollution Monitoring. New York: Van Nostrand Reinhold.
Hawkins, D. M.: 1991, 'The Convolution of the Normal and Lognormal Distributions,' South African Statistical Journal 25, 99-128.
Hines, W. G. S. and Hines, R. J. O.: 1987, 'Quick Graphical Power-Law Transformation Selection,' The American Statistician 41, 21-24.
Hinkley, D. V. and Runger, G.: 1984, 'The Analysis of Transformed Data,' Journal of the American Statistical Association 79, 302-309.
Lin, L. I. and Vonesh, E. F.: 1989, 'An Empirical Nonlinear Data-Fitting Approach for Transforming Data to Normality,' The American Statistician 43, 237-243.
Mosteller, F. and Tukey, J. W.: 1977, Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
Porter, P. S., Ward, R. C. and Bell, H. F.: 1988, 'The Detection Limit,' Environmental Science and Technology 22, 856.
Tukey, J. W.: 1957, 'On the Comparative Anatomy of Transformations,' Annals of Mathematical Statistics 28, 602-632.
