STATISTICAL MODELING OF RESTRICTED POLLUTANT DATA SETS TO ASSESS COMPLIANCE WITH AIR QUALITY CRITERIA

J. A. TAYLOR, R. W. SIMPSON, and A. J. JAKEMAN

Centre for Resource and Environmental Studies, Australian National University

(Received November 15, 1985)

Abstract. Three statistical models are used to predict the upper percentiles of the distribution of air pollutant concentrations from restricted data sets recorded over yearly time intervals. The first is an empirical quantile-quantile model. It requires firstly that a more complete data set be available from a base site within the same airshed, and secondly that the base and restricted data sets are drawn from the same distributional form. A two-sided Kolmogorov-Smirnov two-sample test is applied to test the validity of the latter assumption; this test does not require the assumption of a particular distributional form. The second model represents the a priori selection of a distributional model for the air quality data. To demonstrate this approach the two-parameter lognormal, gamma and Weibull models and the one-parameter exponential model were separately applied to all the restricted data sets. The third model employs a model identification procedure on each data set and selects the 'best fit' model.

1. Introduction

To obtain a clear picture of air pollution in an area of concern it is necessary to have as many air pollution monitors operating as possible. Due to economic constraints the number operating must of necessity be limited. The problem then becomes one of making the network as 'representative' as possible in space and time. Therefore all 'hot spots' and all land use types such as urban, residential and industrial should be covered, and the sampling frequency at each site should be high enough to assess compliance with chosen air quality criteria reliably. Clearly, continuous monitoring achieves the latter criterion. Usually most monitors are fixed in position and record continuously or intermittently (e.g. TSP hi-volume samples collected every six days). A more cost-effective method may be to have mobile monitors which can be used to cover more sites at the same expense as that for a few continuous monitors. The problem then becomes one of obtaining a representative sample. It is reasonably straightforward to design a random sampling strategy to obtain reasonable estimates of annual mean concentrations (e.g. see Ott and Mage, 1981; Simpson, 1984). However, it is a different problem to obtain reasonable estimates of a cumulative frequency distribution from a limited sample so that exceedances of 98-percentile or maximum air quality standards can be estimated. In this paper statistical models are constructed to fulfil that need, and therefore allow the use of a mixture of fixed continuous and intermittent monitors, mobile or fixed, which enable the most effective spatial representation of pollutant levels to be obtained within the economic constraints prevailing.

Environmental Monitoring and Assessment 9 (1987) 29-46. © 1987 by D. Reidel Publishing Company.

The first model, the simplest and most successful, is an empirical quantile-quantile model. The model predicts the upper percentiles of a restricted data set, given the presence of a more complete data set from another site in the same airshed and collected over the same period of time. Clearly, any modeling of a restricted data set from a base data set will only be as good as the information in the latter. Ideally, the base data set should be able to provide a reliable sample of air quality observations for comparison with chosen air quality criteria. Continuous recordings will obviously provide this. However, sufficiently regular recordings, or a sufficient number of random recordings, can also achieve this goal. We do not investigate here the problem of what constitutes a representative base data set; we use a base data set which records 24-hr acid gas levels 5 days/week, the most complete data available. An empirical quantile-quantile model is derived from quantiles of both the more complete and restricted data sets, and can then be used to predict the upper percentiles of the restricted data set (not observed because of the limited sampling). The model assumes both the restricted and more complete data sets are of the same distributional form, an assumption that can be tested by the two-sided Kolmogorov-Smirnov two-sample test.

The second model assumes a distributional form for the data, the parameters of which can be estimated from the restricted data set; the fitted distribution is then used to estimate the upper percentiles. Four standard distributions are used here for that purpose: the two-parameter gamma, lognormal, and Weibull and the one-parameter exponential. The third model is an improvement on the second one in that, for each data set, the 'best' distribution is selected from amongst the four chosen here using a goodness-of-fit test.
Here the test of Taylor and Jakeman (1985) is used. Clearly this 'best fit' method should be an improvement on the second model, and the best test of the empirical quantile-quantile model is to compare its results with those of the third model.

For hourly carbon monoxide observations, Ott and Mage (1981) were able to demonstrate that accurate estimates of the arithmetic mean, and associated 95% confidence limits, could be obtained by random sampling. Their approach was based on the application of the central limit theorem and thus remains independent of the nature of the original distribution. However, maximum or near maximum concentrations, in terms of which many air quality standards are written, may not be estimated in this way. Ott and Mage (1981) originally proposed that 'random' sampling could take place between the hours of 9 a.m. and 5 p.m. However, Simpson (1984) identified the importance of both the diurnal and seasonal variations observed in ambient air quality and found that sampling of hourly pollutant data should be carried out as a completely random process to obtain best estimates of the mean concentration, especially for ozone. Simpson (1984) also found that continuous recordings one week out of four yielded good results.

In this paper the various model performances are examined where the restricted data sets have been generated using sampling strategies of one day in four, one in six, one in eight, and one in twelve. Such strategies should satisfactorily account for seasonal variations. The models are developed with 24-hr average acid gas concentrations and, since we are modeling 24-hr average data, problems associated with the diurnal variation of pollutant concentrations will not be present.

2. The Data Set

The acid gas data set used consists of daily levels collected by the Health Division of the Newcastle City Council at their Watt Street, City and Mounter Street monitoring stations in Newcastle, Australia, over a ten year period from January 1972 to December 1981; at the Turton Road monitoring station over a nine year period from January 1973 to December 1981; and at their Seaview and Elder Street monitoring stations over an eight year period from January 1974 to December 1981. This provides a total of 55 yr of data. The acid gas levels were determined by scrubbing ambient air at a constant rate through a dilute solution of hydrogen peroxide and sulphuric acid. Thus any sulphur dioxide present in the sample is converted to sulphuric acid. The resulting increase in acidity due to this or other acid gases is determined by titration. The method is based on the British Standard method No. 1747 Part 3. The 24-hr average readings are taken daily commencing at 9 a.m. for five days per week.
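The one-day-in-k strategies used to build the restricted data sets can be illustrated with a short sketch. It assumes systematic sampling from a time-ordered list of recorded daily values with a fixed starting offset; the function name `one_day_in_k` and its `start` parameter are hypothetical, and the selection procedure actually used for the results in this paper may differ.

```python
def one_day_in_k(daily, k, start=0):
    """Return every k-th recorded value, beginning at index `start`.

    Assumption: `daily` holds one 24-hr average per recorded day, in
    time order, and the sampling is systematic rather than random.
    """
    if k < 1 or not 0 <= start < k:
        raise ValueError("need k >= 1 and 0 <= start < k")
    return list(daily[start::k])


if __name__ == "__main__":
    # Stand-in for one year of readings at 5 days/week (about 260 values).
    year = list(range(260))
    for k in (4, 6, 8, 12):
        print(k, len(one_day_in_k(year, k)))
```

With about 260 readings in a year and no missing days, these strategies leave 65, 44, 33 and 22 observations respectively, which illustrates how little of the upper tail a restricted sample observes directly.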

[Figure 1: map of the monitoring sites. Labels recoverable from the original include Mayfield East, Turton Rd., Waratah, Elder St., Lambton, Seaview St. and Kotara, with a 2 km scale bar.]

Fig. 1. Newcastle, Australia, acid gas monitoring network.


The location of the monitoring sites relative to the industrial area emitting acid gases is illustrated in Figure 1. It should be noted that the Mounter Street monitor lies in closest proximity to the industrial area. The Watt Street and City monitors are located within the central business district. The Seaview, Turton and Elder Street monitors are situated in the surrounding urban area. In this paper we draw no distinction between different land use categories. This does lead to problems and we indicate how future work may improve on the results here.

3. The Empirical Quantile-Quantile Model

Let us denote two data sets as $x_i$ ($i = 1$ to $n$) and $y_j$ ($j = 1$ to $m$) and their empirical quantile functions (the inverses of the empirical cumulative distribution functions) as $Q_x$ and $Q_y$, respectively. Then an empirical quantile-quantile plot is a plot of $Q_y(p)$ against $Q_x(p)$ for a range of probability values $p$, where $p$ may vary from 0 to 1 (Wilk and Gnanadesikan, 1968). (Note that quantile refers to a fraction, $p$, of the set of data and percentile to a percent of the set of data.) If the two distributions were identical all the points would fall on the line $y = x$. This simple model relating the quantiles of the two data sets may be applied only where both sets have been drawn from the same distribution. Chambers et al. (1983) have extended this simple model to allow the quantiles to differ by both an additive and a multiplicative constant. Thus the two data sets would have the approximate relationship

$$Q_y(p_i) = \alpha\, Q_x(p_i) + \beta \qquad (1)$$

where the parameters $\alpha$ and $\beta$ must be estimated from the data. In this paper we refer to Equation (1) as the empirical quantile-quantile model, following the terminology of Chambers et al. (1983). Where we have two data sets of equal size, an empirical quantile-quantile plot consists of a plot of the sorted data values paired from the lowest to the highest values for each data set. However, in our case we have one data set larger than the other. Here the usual practice is to employ all the sorted values in the smaller data set and to interpolate a corresponding set of quantiles from the larger set (Chambers et al., 1983). In order to determine the corresponding quantile in the larger data set we must estimate the percentile at which each of the smaller data set values falls. Many empirical distribution functions are available for this purpose (Looney and Gulledge, 1985). We apply the form of the empirical distribution function suggested by Chambers et al. (1983) which, for a sample of size $n$, is

$$p_i = (i - 0.5)/n. \qquad (2)$$

Now suppose that $y_j$ is the smaller data set and $x_i$ is the larger. Then $y_j$, which is the $(j - 0.5)/m$ quantile of the $y$ data, is matched with the interpolated $(j - 0.5)/m$ quantile of the $x$ data set. Thus the required order statistic in the larger data set is determined from

$$v = \frac{n}{m}\,(j - 0.5) + 0.5. \qquad (3)$$

If $v$ is not an integer we separate this value into an integer component $i$ and a fractional component $\theta$. With the $x$ values sorted in ascending order, the interpolated quantile is evaluated as

$$Q_x\big((j - 0.5)/m\big) = (1 - \theta)\,x_i + \theta\,x_{i+1}. \qquad (4)$$
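The matching and fitting steps of Equations (1) to (4) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the fit is plain ordinary least squares, and quantiles requested beyond the largest plotting position are clamped to the sample maximum, which is an assumption of this sketch.

```python
import math


def quantile(sorted_x, p):
    """Interpolated p-quantile of an ascending list, with p_i = (i - 0.5)/n."""
    n = len(sorted_x)
    v = n * p + 0.5                    # Equation (3) when p = (j - 0.5)/m
    v = min(max(v, 1.0), float(n))     # assumption: clamp to the observed range
    i = int(math.floor(v))             # integer component (1-based)
    theta = v - i                      # fractional component
    upper = sorted_x[min(i, n - 1)]    # x_{i+1}; carries zero weight when i = n
    return (1.0 - theta) * sorted_x[i - 1] + theta * upper   # Equation (4)


def matched_quantiles(x_large, y_small):
    """Pair each sorted y value with the interpolated x quantile at the same p."""
    xs, ys = sorted(x_large), sorted(y_small)
    m = len(ys)
    qx = [quantile(xs, (j - 0.5) / m) for j in range(1, m + 1)]
    return qx, ys


def fit_line(qx, qy):
    """Ordinary least squares estimates (alpha, beta) for Equation (1)."""
    k = len(qx)
    mean_x = sum(qx) / k
    mean_y = sum(qy) / k
    sxx = sum((a - mean_x) ** 2 for a in qx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(qx, qy))
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x


def predict_quantile(x_large, y_small, p):
    """Predict the p-quantile of the restricted site from the base site."""
    qx, qy = matched_quantiles(x_large, y_small)
    alpha, beta = fit_line(qx, qy)
    return alpha * quantile(sorted(x_large), p) + beta
```

For example, `predict_quantile(base, restricted, 0.98)` would estimate the 98-percentile of the restricted site, where `base` and `restricted` are placeholder lists of concentrations from the two monitors.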

An example quantile-quantile plot using the acid gas data recorded at the Watt Street and Mounter Street monitors is presented as Figure 2. The least squares fit of Equation (1) to these data yielded parameter estimates and associated standard errors of $\alpha = 0.946 \pm 0.015$ and $\beta = 2.602 \pm 0.750$, with a correlation coefficient of 0.994.

[Figure 2: Mounter monitor quantiles (vertical axis, 0 to 100) plotted against Watt monitor quantiles (horizontal axis, 0 to 100), with the fitted line.]

Fig. 2. Acid gas concentrations (μg m⁻³) selected from the Watt Street and Mounter Street monitor data sets with the fit according to Equation (1).

Clearly, the method of ordinary least squares provides a uniform weighting of all percentiles, and methods which attach decreasing importance to the more extreme percentiles would be preferable in practice. One of the referees has suggested an iterative weighted least squares technique, with the weighting determined analogously to that in probit analysis (see Finney, 1971). This involves inverse normal transformation of the response (our empirical cumulative distribution function) to obtain the probit, which becomes a simple regression on the logarithm of dose (our concentrations). Thus the weighting method becomes maximum likelihood if the distribution of concentrations is lognormal, although other transformations can be applied to obtain maximum likelihood if the distribution is known. The accent in our development of the quantile-quantile model has been to obtain a distribution-free method, the same distribution being appropriate at the two sites being correlated. However, if in addition a common distribution were identified for both sites then the appropriate transformation could be made. Alternatively, the approximation to lognormality (or some other skewed distributional form) could be estimated and this would lead to a weighting improved over ordinary least squares. However, for our data the improvement is minor.

4. Kolmogorov-Smirnov Two-Sample Test

Given independent random samples of sizes $n$ and $m$ respectively from continuous distribution functions $F_n(x)$ and $F_m(x)$, we wish to test the hypothesis

$$H_0: F_n(x) = F_m(x), \quad \text{all } x. \qquad (5)$$

This class of non-parametric problem can be solved by distribution-free methods which do not depend on the form of the underlying distributions at all, provided that they are continuous (Kendall and Stuart, 1973). The two-sided Kolmogorov-Smirnov two-sample test criterion, denoted by $D_{n,m}$, is the maximum absolute difference between the two empirical distribution functions, $S_n(x)$ and $S_m(x)$, of the samples from $F_n(x)$ and $F_m(x)$ respectively:

$$D_{n,m} = \max_x \left| S_n(x) - S_m(x) \right|. \qquad (6)$$
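Equation (6) can be computed directly from the two sorted samples. The sketch below is illustrative only and its function names are hypothetical. The critical value shown uses the standard large-sample approximation $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$, which is an assumption here and should not replace exact tables for small samples.

```python
import math


def ks_two_sample_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs, Equation (6)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        t = min(xs[i], ys[j])          # next jump point of either step function
        while i < n and xs[i] <= t:    # advance past all x values <= t (handles ties)
            i += 1
        while j < m and ys[j] <= t:    # advance past all y values <= t
            j += 1
        d = max(d, abs(i / n - j / m))
    # Once one sample is exhausted its CDF equals 1 and the gap can only shrink.
    return d


def ks_large_sample_critical_value(n, m, alpha=0.05):
    """Asymptotic two-sided critical value (assumed approximation, not exact tables)."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))
```

Under this approximation the hypothesis of a common distribution would be rejected at level `alpha` when `ks_two_sample_statistic(x, y)` exceeds `ks_large_sample_critical_value(len(x), len(y), alpha)`.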

Tables of the quantiles of this test statistic are readily available for small samples (n~

[Figure 6: concentration plotted against sample number (0 to 45).]

Fig. 6. The maximum acid gas concentrations (μg m⁻³) with the empirical quantile-quantile model and identified model predictions with the Watt Street monitor data as the complete data set.

[Figure 7: concentration (0 to 300) plotted against sample number (0 to 45); legend: observations, quantile-quantile model, identified model.]

Fig. 7. The maximum acid gas concentrations (μg m⁻³) with the empirical quantile-quantile model and identified model predictions with the Mounter Street monitor data as the complete data set.

[Figure 8: concentration (0 to 300) plotted against sample number (0 to 45); legend: observations, quantile-quantile model, identified model.]

Fig. 8. The maximum acid gas concentrations (μg m⁻³) with the empirical quantile-quantile model and identified model predictions with the City monitor data as the complete data set.


TABLE III
Average relative root mean square errors for the empirical quantile-quantile and best fit models where the Watt Street, Mounter Street and City monitors are the complete data sets, for three monitoring strategies: one day in six, one day in eight and one day in twelve.

Complete data set        Model                Xmax     Xmax-1   Xmax-2   X98

Sampling strategy of one day in six
Watt Street monitor      quantile-quantile    0.229    0.193    0.161    0.146
                         best fit             0.390    0.308    0.218    0.181
Mounter Street monitor   quantile-quantile    0.411    0.208    0.146    0.151
                         best fit             0.336    0.272    0.204    0.184
City monitor             quantile-quantile    0.357    0.207    0.181    0.150
                         best fit             0.391    0.331    0.226    0.183

Sampling strategy of one day in eight
Watt Street monitor      quantile-quantile    0.244    0.233    0.205    0.177
                         best fit             0.468    0.447    0.324    0.250
Mounter Street monitor   quantile-quantile    0.457    0.264    0.216    0.185
                         best fit             0.384    0.396    0.311    0.252
City monitor             quantile-quantile    0.370    0.232    0.199    0.170
                         best fit             0.458    0.441    0.317    0.243

Sampling strategy of one day in twelve
Watt Street monitor      quantile-quantile    0.276    0.250    0.205    0.198
                         best fit             0.510    0.430    0.310    0.249
Mounter Street monitor   quantile-quantile    0.485    0.276    0.202    0.196
                         best fit             0.428    0.378    0.286    0.243
City monitor             quantile-quantile    0.420    0.249    0.205    0.163
                         best fit             0.499    0.431    0.301    0.233

TABLE IV
Average relative bias for the empirical quantile-quantile and best fit models where the Watt Street, Mounter Street and City monitors are the complete data sets, for three monitoring strategies: one day in six, one day in eight and one day in twelve.

Complete data set        Model                Xmax     Xmax-1   Xmax-2   X98

Sampling strategy of one day in six
Watt Street monitor      quantile-quantile    0.078    0.033    0.056    0.045
                         best fit            -0.077   -0.068   -0.047   -0.022
Mounter Street monitor   quantile-quantile   -0.211   -0.043    0.012    0.047
                         best fit            -0.008   -0.036   -0.025   -0.012
City monitor             quantile-quantile   -0.064   -0.019   -0.029   -0.023
                         best fit            -0.096   -0.097   -0.057   -0.029

Sampling strategy of one day in eight
Watt Street monitor      quantile-quantile    0.079    0.027    0.050    0.041
                         best fit            -0.152   -0.135   -0.096   -0.055
Mounter Street monitor   quantile-quantile   -0.229   -0.057   -0.004    0.035
                         best fit            -0.059   -0.084   -0.063   -0.036
City monitor             quantile-quantile   -0.039   -0.007   -0.001   -0.003
                         best fit            -0.162   -0.158   -0.107   -0.066

Sampling strategy of one day in twelve
Watt Street monitor      quantile-quantile    0.049    0.003    0.030    0.020
                         best fit            -0.071   -0.060   -0.035   -0.011
Mounter Street monitor   quantile-quantile   -0.270   -0.091   -0.030    0.010
                         best fit             0.001   -0.026   -0.015   -0.004
City monitor             quantile-quantile   -0.090   -0.034   -0.039   -0.031
                         best fit            -0.106   -0.104   -0.061   -0.032

the 'best fit' model. Once again these results suggest that pairing monitored data with those from a similar land use category may improve these results.

The cause of the decreased performance of the 'best fit' model with the reduction in the sample size of the restricted data set can be attributed to the increased difficulty in identifying the best distributional model from the smaller data sets. There is also the increased uncertainty in estimating the distributional parameters from reduced data sets. The results in Tables I and III also indicate the problems with intermittent sampling. The rmse values generally increase as the data set decreases, especially for the 'best fit' case. However, Tables II and IV indicate no such trend for the relative bias. It would be expected that as the data sample decreases the model fit to observation would worsen. The rmse values show this behaviour but the relative bias does not, so it would appear that the rmse factor is the more sensitive goodness-of-fit indicator. It would also appear that the relative accuracy of the quantile-quantile model compared to the best fit model increases as sample size decreases, indicating that it is a more robust model.

Clearly it is a matter of subjective judgement to decide when the intermittent sample size has become too small. For instance one may decide on various values of rmse above which the fit is not acceptable (e.g. 0.3, 0.4 or 0.5). It is probably best, however, to make such a decision based on a consideration of both the graphs and the rmse factor, with the relative bias value being given less significance.

In summary then, the quantile-quantile model would seem to be a very useful tool in air pollution monitoring. It does not require the assumption or identification of a statistical distribution for any of the data sets, it is simple to use as it requires only simple mathematical techniques, and it would seem to be robust over a variety of sample sizes. In practical terms it requires a base data set from a site in the same airshed as the intermittent monitoring site, and its only theoretical assumption is that the distributional form, which need not be specified, is the same for both the restricted and complete data sets, an assumption which may be simply tested. It would also appear that the land use zones for the base and restricted monitoring sites should not differ too greatly.

7. Conclusions

Three statistical models have been considered for use in estimating upper percentiles of a restricted data set. The first is an empirical model linking the quantiles of this data set to those of a more complete data set at another site in the same airshed. The only assumption required is that the distributions of both data sets are the same, but the parametric form need not be specified. The second model assumes a distributional form for the restricted data set, estimates the distribution parameters and thereby estimates the upper percentiles. The third is the same as the second except that the distributional form is first identified using a goodness-of-fit test.

It has been found that the first model (the quantile-quantile model) and the third (the best fit model) yield the best results when the three highest values and the 98-percentile values of 45 yr of 24-hr acid gas data are examined. In general the quantile-quantile model yields better results, although it is possible that the model results are affected when the restricted and base data set sites correspond to very different land use zones. Four types of intermittent monitoring were considered: one day in four, one in six, one in eight, and one in twelve. As expected the results worsened as the data sets became smaller, although the performance of the quantile-quantile model improved in comparison to the best fit model in general. Because of these results and its simplicity, the quantile-quantile model would appear to be a very useful tool in air pollution monitoring.

The results in this paper are indicative of the utility of purely statistical models for air quality management applications. Jakeman et al. (1981) provide a comprehensive list of such applications, with functions ranging from use as data summaries to augmentation with deterministic models for predicting the probability distribution of annual concentrations under changing meteorological and emission inputs. Taylor et al. (1986) and Jakeman et al. (1986) demonstrate the performance of methodological tools needed to develop statistical models for air quality management purposes.

Acknowledgements

The authors would like to thank the referees for comments upon an earlier version of this work.

References

Birnbaum, Z. W. and Hall, R. A.: 1960, 'Small Sample Distributions for Multisample Statistics of the Smirnov Type', The Annals of Mathematical Statistics 31, 710-720.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A.: 1983, Graphical Methods for Data Analysis, Duxbury, Boston.
Conover, W. J.: 1980, Practical Nonparametric Statistics, 2nd ed., John Wiley and Sons, New York.
Finney, D. J.: 1971, Probit Analysis, 3rd ed., Cambridge University Press.
Fox, D. G.: 1981, 'Judging Air Quality Model Performance', Bulletin of the American Meteorological Society 62, 599-609.
Gibbons, J. D.: 1971, Nonparametric Statistical Inference, McGraw-Hill, New York.
Jakeman, A. J. and Taylor, J. A.: 1985, 'A Hybrid ATDL-Gamma Distribution Model for Predicting Area Source Acid Gas Concentrations', Atmospheric Environment 19, 1959-1967.
Jakeman, A. J., Simpson, R. W., and Taylor, J. A.: 1987, 'Modeling Distributions of Air Pollutant Concentrations - III. The Hybrid Deterministic-Statistical Distribution Approach', Atmospheric Environment (to appear).
Jakeman, A. J., Taylor, J. A., and Simpson, R. W.: 1986, 'Modeling Distributions of Air Pollutant Concentrations - II. Estimation of One and Two Parameter Statistical Distributions', Atmospheric Environment 20, 2435-2447.
Kendall, M. G. and Stuart, A.: 1973, The Advanced Theory of Statistics, 3rd ed., Vol. 2, Griffin, London, pp. 436-450.
Looney, S. W. and Gulledge, T. R.: 1985, 'Use of the Correlation Coefficient with Normal Probability Plots', The American Statistician 39, 75-79.
Ott, W. R. and Mage, D. T.: 1981, 'Measuring Air Quality Levels Inexpensively at Multiple Locations by Random Sampling', Journal of the Air Pollution Control Association 31, 365-369.
Simpson, R. W.: 1984, 'Predicting Frequency Distributions for Ozone, NO2 and TSP from Restricted Data Sets', Atmospheric Environment 18, 353-360.
Taylor, J. A. and Jakeman, A. J.: 1985, 'Identification of a Distributional Model', Communications in Statistics B14, 497-508.
Taylor, J. A., Jakeman, A. J., and Simpson, R. W.: 1986, 'Modeling Distributions of Air Pollutant Concentrations - I. Identification of Statistical Models', Atmospheric Environment 20, 1781-1789.
Wilk, M. B. and Gnanadesikan, R.: 1968, 'Probability Plotting Methods for the Analysis of Data', Biometrika 55, 1-17.
