Vol 8, No 2

International Journal of Epidemiology © Oxford University Press 1979

Printed in Great Britain

A Methodology for Determining High Risk Components in Urban Environments

W E BERTRAND¹, P L BROCKETT² and A LEVINE³

Bertrand W E (Assistant Professor of Health Measurement Sciences and Sociology, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA), Brockett P L and Levine A. A methodology for determining high risk components in urban environments. International Journal of Epidemiology 1979 8: 161-166. A method of data collection and analysis is proposed for identifying environmental indicators significant to community health. The methods were applied to a study of child mortality in the city of Cali, Colombia. Pridit analysis (principal component analysis of ridits) is described. The methods have potential usefulness for monitoring and health care planning.

INTRODUCTION

It has been noted for some time (1, 2) that local environments within urban areas may be associated with higher risks of certain disorders in their populations. The problem is identifying which components of these local environments are prime contributors to the increased risk. Their determination is essential for chronic disease studies where environmental factors play an important role. Myers and Manton (3) in their introduction propose the use of geographically defined urban units for studying the differential distribution of disease and related factors. We have followed their lead. However, this research, although having the same origins as that of Myers and Manton (4), is orientated towards the resolution of different problems. They concentrated on the role population density plays with respect to pathological states contributing to mortality. Our aim was to try to determine environmental indicators (which may, in fact, be largely associated with density) which can be simply, quickly and cheaply monitored as precursors of community health. We present a method for collecting the basic data and a new methodology for analysing these data in the context of an important practical problem.

To appreciate the advantages of these methods, they must be viewed in the light of the most popular current methodology for obtaining indicators, 'factorial ecology' (an extension of the social area analysis of Shevky and Bell (5)). Utilizing a wide range of data such as those derived from census tabulations, observational units (usually census tracts) are assigned 'factor scores' using factor analysis. In the context of health disorders, attempts are made to link basic dimensions of health measures with major dimensions of the variation in population and environmental characteristics (6). This methodology depends to a great extent on the collection of quantitative, as opposed to qualitative, information. The collection of such data is usually difficult and expensive. Unless secondary data (e.g. census) are already available, their collection for the express purpose of factorial ecology requires an extended time period. Clearly, for the purposes of systematic monitoring of an area over time, such surveys are not appropriate.

There are other problems associated with these data and with factor analysis. The data are often inaccurate, or simply do not exist, as in developing countries. With respect to the analysis, qualitative variables are collected which often play as strong a role as quantitative variables (e.g. some important variables, such as those relating to the existence of private and public health facilities, can only be measured on a presence/absence basis). Moreover, the assumption of a linear model which underlies the factor analysis is often not applicable to the quantitative data collected. Hence, even after a difficult and expensive survey, serious problems of analysis and interpretation remain.

The methodology prescribed here is free of the objections stated above. It permits the establishment of a set of indicators which are easy and inexpensive to collect, and which even the most modest city planning or public health office should be able to collect and analyse. We build here upon the base laid by such individuals as Stamp (7, 8) and Anderson (9). The paper is divided into 3 parts: the survey technique, the theory underlying the analysis, and a brief statement of results obtained.

¹ Associate Professor of Health Measurement Sciences and Sociology, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA.
² Professor of Mathematics, University of Texas at Austin, Austin, Texas, USA.
³ Professor of Mathematics and Statistics, Tulane University, New Orleans, Louisiana, USA.

Downloaded from http://ije.oxfordjournals.org/ at University of Manitoba on June 24, 2015

THE RESEARCH SITE: DESCRIPTION OF THE QUESTIONNAIRE AND SURVEY

At the outset, an attempt was made to simplify the difficult conceptual and practical problems encountered in this early stage of research into 'urban ecological health indicators' by focusing attention on a population sub-group and a particular city, both of which have special interest for health planners. The city was Cali, Colombia and the subpopulation was that of children below 5 years of age. The particular problem studied was child mortality in this group. Three classes of risk characteristics were considered: (a) those associated directly with the individual: age, sex, race, income, family status; (b) those associated with the dwelling in which he lives: size, condition, availability of utilities, etc.; (c) those relating to the broader physical and social environment: neighbourhood quality, pressure on amenities, accessibility to a variety of services, contact with other social groups, etc. The characteristics (a) are usually not available on a small area basis and are difficult to collect for purposes of short term monitoring. (The Cali 1973 census has so far only been completely processed for a 4% sample and at the small area level may never be completely tabulated.)

The standard unit of data collection for the purposes of this study is the 'barrio' or neighbourhood. It is important to note that the neighbourhood (barrio) in Cali is a relatively homogeneous unit with respect to socio-economic characteristics. It is used as a planning unit and inhabitants will identify a barrio or one of its subunits as their place of residence. Often the barrio will have its own political structure and fulfils many of the requisites of a community in the sociological sense. This pattern is repeated in urban sprawls throughout much of the third world.

Wishing to enter into a research programme correlating urban neighbourhood characteristics with infant and child mortality as global indicators of health status, yet not having neighbourhood data available, we developed a rapid, inexpensive survey technique to collect characteristics (b) and (c) noted above. The indicators developed for this initial set of questions included the presence or absence of park areas, churches, community leadership groups, firefighting services and telephone service, SES classifications of the neighbourhood, and the availability of health services, as well as environmental indicators. Each 'question' was intended to be a measure of a dimension of the latent variable 'quality of life' (10). The response categories for each question were ordered from 'low quality' to 'high quality'. The actual questionnaire often had several indicators for a given variable. For example, the question on water availability was presented as follows:

TABLE 1

Sample question from urban ecology instrument

Choose the description that best fits the neighbourhood as a whole.
A. 1) House connected with town water supply.
   2) Fountain: public faucet connected to the distribution system of town.
   3) Cistern: deep well dug near a natural spring or over natural currents.
   4) No water supply at all.
B. Indicate the frequency of cases 1, 2, 3, 4, with estimated percentage for each.

There were a total of 58 indicators representing 25 variables.* The original list of variables included in the instrument is shown in Table 2. Data collection involved 2 college students who had been trained in survey techniques. Their instructions were to go to the barrio and, by asking questions of a number of individuals selected at random, fill in as many of the survey items as possible. Where contradictory opinions made selecting which data to record difficult, the field worker was expected to make a subjective choice based upon his own evaluation of the informant. The field supervisor at the same time gathered whatever secondary information was available from official and semi-official sources. These included the city planning office, national census bureau, taxi and bus companies, the utility companies, city and national health departments and similar entities. The field work portion of the study took an average of three hours per neighbourhood, almost independent of the barrio size. In total, the collection of data using this method for a city of nearly one million individuals took about 3 months.

* Copies of the original questionnaire, in English or Spanish, can be obtained from any of the authors.

TABLE 2

Variables included in the original Cali 'barrio' survey instrument

1. Locational information (name, boundaries, etc.)
2. Population
3. Family income
4. Per capita income
5. Land value
6. No. of persons per housing unit
7. Principal physical characteristics
8. Natural obstacles present in the neighbourhood
9. Land use
10. Industrial presence
11. Water availability and type
12. Sewage system
13. Garbage collection
14. Electrical service
15. % of area covered by electrical service
16. Street lights
17. Quality of roads
18. Type of housing construction
19. Bus transportation
20. Taxi service
21. Proximity to main transportation arteries
22. Number of kinds of schools
23. Proximity of health clinics
24. Proximity of family planning centres
25. Distance in minutes from centre of 'barrio' to health centre

PRIDITS

In this section we develop the theory underlying the data analysis. This theory is applicable to the analysis of ordered categorical questionnaire data in general and our presentation is designed to reflect this generality. We deal with 2 general problems:

1. The Classification Problem (see 11, page 314): given a sample of individuals, or the whole population, the problem is to determine groups of respondents which are as distinct as possible with respect to 'response patterns'.
2. Question Discriminating Power: the problem is to determine which questions are most important in distinguishing those groups formed in the process of resolving the classification problem.

The literature has generally followed the lead of the classical authors, Thurstone and Chave (12), Murphy and Likert (13), in treating these problems separately. We show their interconnection. Ridit transformation (14, 15) is used to assign values to question responses and principal component analysis to obtain question weights and respondents' scores. This method, which we refer to as PRIDIT analysis (principal component analysis of ridits), provides measures which have useful statistical properties and which permit an interpretation in terms of the probability distributions of the groups.

To begin, let $k_t$ represent the number of 'permitted responses' or categories for question $t$ and let $p_t = (p_{t1}, \ldots, p_{tk_t})$ represent the empirical distribution of the respondents among the categories: $\sum_i p_{ti} = 1$. The ridit value assigned to category $i$ of question $t$ is defined to be

$$R_{ti} = \sum_{j<i} p_{tj} - \sum_{j>i} p_{tj}.$$

Consider the value which is assigned to the $i$-th response of a ranked categorical question. According to Kendall and Stuart (11, paragraph 44.21), since the responses are ranked the statistician should be able to utilize information in other than just the $i$-th response category in determining an assigned value for category $i$. One benefit of the assignment proposed here is that it does utilize this information.

Let $N$ represent the number of respondents. Assume that the population can be divided into two groups in the sense that all members of each group have the same joint distribution over the question responses. Let $p_{qj}$, $q = 1, 2$, represent the common probability that a member of group $q$ will give response $j$ to question $t$, and let the random variable $R_t^{(q)}$ represent the values which a member of the $q$-th population can obtain for question (variable) $t$; i.e., $R_t^{(q)} = R_{ti}$ if a person in population $q$ chooses the $i$-th response to question $t$. $E$ as usual represents the expected value operator. The following lemma allows us to calculate the expected ridit values for question $t$ for each of the 2 populations. (See Appendix for proof.)
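The ridit assignment defined above can be illustrated with a minimal sketch. The function name `ridit_scores` and the example proportions are hypothetical and are not taken from the Cali survey; the code only applies the definition $R_{ti} = \sum_{j<i} p_{tj} - \sum_{j>i} p_{tj}$ to one ordered question.

```python
from typing import List, Sequence


def ridit_scores(proportions: Sequence[float]) -> List[float]:
    """Ridit value for each ordered category of a single question.

    proportions[i] is the empirical share of respondents in category i,
    with categories listed in rank order. The value for category i is
    (share strictly below i) minus (share strictly above i).
    """
    total = sum(proportions)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    scores = []
    below = 0.0  # cumulative share of categories ranked below the current one
    for p in proportions:
        above = total - below - p  # share of categories ranked above
        scores.append(below - above)
        below += p
    return scores


# Hypothetical four-category question (illustrative proportions only).
example_proportions = [0.4, 0.3, 0.2, 0.1]
example_scores = ridit_scores(example_proportions)
print(example_scores)
```

Under this definition the scores lie between -1 and 1, and their average over the respondents (weighting each score by its category proportion) is zero, so a category's score reflects where it sits relative to the whole response distribution rather than only its own frequency.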


LEMMA 1. Let $T_q$, $q = 1, 2$, represent the number of respondents in population $q$, $T_1 + T_2 = N$. Assume that there are $k$ categories (permitted responses) to question $t$. Then

$$E\big(R_t^{(1)}\big) = \frac{-T_2 A_t}{N}, \qquad E\big(R_t^{(2)}\big) = \frac{T_1 A_t}{N},$$

where

$$A_t = \sum_{i=1}^{k-1} \sum_{j>i} \big(p_{1i}\,p_{2j} - p_{1j}\,p_{2i}\big).$$
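Lemma 1 can be checked numerically with a small sketch. The group sizes and response probabilities below are hypothetical, as are the function names; the sketch assumes that the empirical distribution equals its expected value, $(T_1 p_{1j} + T_2 p_{2j})/N$, and compares the expected ridit of each group with the closed forms in the lemma.

```python
from typing import List, Sequence, Tuple


def ridit_scores(proportions: Sequence[float]) -> List[float]:
    """Ridit value per ordered category: share below minus share above."""
    scores, below, total = [], 0.0, sum(proportions)
    for p in proportions:
        scores.append(below - (total - below - p))
        below += p
    return scores


def discriminating_power(p1: Sequence[float], p2: Sequence[float]) -> float:
    """A_t = sum over i < j of (p1[i] * p2[j] - p1[j] * p2[i])."""
    k = len(p1)
    return sum(
        p1[i] * p2[j] - p1[j] * p2[i]
        for i in range(k - 1)
        for j in range(i + 1, k)
    )


def expected_ridits(
    p1: Sequence[float], p2: Sequence[float], t1: int, t2: int
) -> Tuple[float, float]:
    """Expected ridit value of a member of each group.

    Assumption: the pooled empirical distribution is replaced by its
    expectation (t1 * p1 + t2 * p2) / (t1 + t2).
    """
    n = t1 + t2
    pooled = [(t1 * a + t2 * b) / n for a, b in zip(p1, p2)]
    scores = ridit_scores(pooled)
    e1 = sum(a * r for a, r in zip(p1, scores))
    e2 = sum(b * r for b, r in zip(p2, scores))
    return e1, e2


# Hypothetical question with three ordered categories.
p1 = [0.5, 0.3, 0.2]  # group 1 leans towards the lower categories
p2 = [0.1, 0.3, 0.6]  # group 2 leans towards the higher categories
t1, t2 = 30, 20

a_t = discriminating_power(p1, p2)
e1, e2 = expected_ridits(p1, p2, t1, t2)
print(a_t, e1, -t2 * a_t / (t1 + t2))  # e1 should match -T2 * A_t / N
print(a_t, e2, t1 * a_t / (t1 + t2))   # e2 should match  T1 * A_t / N
```

In this example the two groups receive expected ridit values of opposite sign, and the gap between them is exactly $A_t$, which is why $A_t$ can serve as a measure of how well the question separates the groups.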

If, for some $j$, $\sum_{i=1}^{j-1} p_{2i} = 1$ and $\sum_{i=j}^{k} p_{1i} = 1$, then the question again discriminates perfectly but in an opposite way, so that $|A_t|$ is again 1 but $A_t$ is negative. If the question does not discriminate, in the sense that $p_{1j} = p_{2j}$ for each category $j$, then $A_t$ is zero. However, it will also be zero or near zero if the populations are inconsistently mixed among the ranked responses; for example, if the question has 4 parts and $p_{11} = p_{14} = .5$, $p_{22} = p_{23} = .5$. Thus, a small value of $A_t$ may indicate poor discrimination or that the question responses are not properly ranked. In any case $A_t$ near zero indicates a poor question.

Now define the matrix $F = (f_{it})$, $i = 1, 2, \ldots, N$, $t = 1, 2, \ldots, m$, where $m$ is the number of questions and where $f_{it}$ is the ridit value obtained by the $i$-th respondent to the $t$-th question, i.e., $f_{it} = R_{tj}$ if the $i$-th respondent gives response $j$ to question $t$. Let $W^{(0)} = (1, 1, \ldots, 1)^T$, where $T$ represents the transpose. Then the $i$-th component of $S^{(0)} = FW^{(0)}$ represents the score of the $i$-th respondent obtained by adding together his ridit values for the questionnaire. A 'weight vector' for the questions is defined by

$$W^{(1)} = \frac{F^T S^{(0)}}{\|F^T S^{(0)}\|} = \frac{BW^{(0)}}{\|BW^{(0)}\|},$$

where $\|\cdot\|$ is the usual Euclidean norm and $B = F^T F$. The $t$-th component of $W^{(1)}$ is the weight given to question $t$. Repeating the process gives $W^{(n)} = BW^{(n-1)}/\|BW^{(n-1)}\|$.

The sequence $W^{(n)}$ converges and in the limit each component $w_t$ is proportional to $A_t$. As shown above, $A_t$ is an indicator of question 'discriminating power'. Consequently, the question weights $w_t$ enable a determination of the discriminatory power of each question. (This need not be the case if raw scoring is used instead of 'ridit' scoring.) Note that the weights are derived from the population of barrios under the assumption that this population can be roughly divided into 2 groups, the 'good' barrios and the 'bad' ones. We have used the weights to determine which factors (questions) discriminate between these 2 types. The results are reported in the next section.
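The weighting scheme described above amounts to a power iteration on $B = F^T F$. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the stopping tolerance and the toy response data are all assumptions made for illustration. Responses are coded as 0-based category indices in rank order.

```python
import math
from typing import List, Sequence, Tuple


def ridit_scores(proportions: Sequence[float]) -> List[float]:
    """Ridit value per ordered category: share below minus share above."""
    scores, below, total = [], 0.0, sum(proportions)
    for p in proportions:
        scores.append(below - (total - below - p))
        below += p
    return scores


def ridit_matrix(
    responses: Sequence[Sequence[int]], n_categories: Sequence[int]
) -> List[List[float]]:
    """Build F, where F[i][t] is the ridit value of respondent i on question t."""
    n, m = len(responses), len(n_categories)
    f = [[0.0] * m for _ in range(n)]
    for t in range(m):
        counts = [0] * n_categories[t]
        for row in responses:
            counts[row[t]] += 1
        scores = ridit_scores([c / n for c in counts])
        for i, row in enumerate(responses):
            f[i][t] = scores[row[t]]
    return f


def pridit_weights(
    f: Sequence[Sequence[float]], tol: float = 1e-12, max_iter: int = 1000
) -> Tuple[List[float], List[float]]:
    """Iterate W <- F^T (F W) / ||F^T (F W)|| starting from W = (1, ..., 1)."""
    n, m = len(f), len(f[0])
    w = [1.0] * m
    for _ in range(max_iter):
        s = [sum(f[i][t] * w[t] for t in range(m)) for i in range(n)]
        v = [sum(f[i][t] * s[i] for i in range(n)) for t in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("ridit matrix carries no information")
        new_w = [x / norm for x in v]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    scores = [sum(f[i][t] * w[t] for t in range(m)) for i in range(n)]
    return w, scores


# Toy data (hypothetical): 8 barrios, 3 two-category questions.
# Questions 0 and 1 separate the first four barrios from the last four;
# question 2 is answered the same way in both groups.
toy_responses = [
    [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1],
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],
]
toy_f = ridit_matrix(toy_responses, [2, 2, 2])
toy_weights, toy_scores = pridit_weights(toy_f)
print(toy_weights)  # weight of question 2 shrinks towards zero
print(toy_scores)   # negative for the first four barrios, positive for the rest
```

In this toy case the two discriminating questions end with equal weights and the uninformative one with a weight near zero, which matches the statement that the limiting weights are proportional to $A_t$. For larger questionnaires a library eigenvector routine would normally replace the hand-written loop.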


TABLE 4

Group correlations of child mortality with 'good' and 'bad' question sets as defined by PRIDIT scores*

Year                                        1970    1972    1973    1970    1972    1973
Age (years)                                 0-4     0-4     0-4     0-11    0-11    0-11

Good questions (high weight)                -.21    -.22    -.32    -.21    -.21    -.32
Bad questions (low weight)                   .08     .02     .013    .09     .03     .01
Overall (entire questionnaire, 58 items)    -.21    -.21    -.29    -.21    -.20    -.30

*1971 was left out of calculations due to inadequacies in the data.

The 20 'high weight' questions have generally as strong a correlation, in the expected direction, as all of the questions together and, when compared with the low weight items, are clearly the major components of the questionnaire. 1973, the base year for denominators used in rate calculations, is the best example of this difference, in that 'good' questions correlated with mortality in both age groups at -.32 or better and the bad questions at approximately .01. This highly significant pattern is repeated for all the years, indicating that the utility of our new question set in representing 'quality of life' as measured by infant mortality is apparently high. We do not present a more detailed analysis of results since our purpose here is to present new investigative tools in the context of an important practical problem. Although we have oversimplified the results here for purposes of concise presentation, it is clear that the indicators in Table 3 are easily monitored to obtain a continuing gross indication of each barrio's risk of child mortality. The same strategy could be appropriate where other dependent variables are concerned.

ACKNOWLEDGEMENTS

This research was begun while the senior author was with the International Center for Medical Research in Cali, Colombia and while another (Arnold Levine) held a position with the Division of Research in Epidemiology and Communication Science, WHO, Geneva. This research was encouraged in its early stages by Dr N T J Bailey, who offered many useful suggestions. The authors acknowledge the work of Dr Yuri Medvedkov, Dr Allesandro Rossi-Espagnet and Dr George Meyers, who instituted this work at the World Health Organization, Geneva. Rachel Wyon, Eduardo Rivera and Leopoldo Munoz were responsible for data collection in Cali. The authors also wish to thank Dr Alex Cobo, Director of La Fundacion Para la Educacion Superior (FES), who provided logistic assistance for the duration of the project.

APPENDIX

Let $R_t^{(q)}$ denote the RIDIT value obtained for question $t$ by a randomly selected individual from population $q$. We are assuming $E\big(R_t^{(q)} R_s^{(q)}\big) = E\big(R_t^{(q)}\big)\,E\big(R_s^{(q)}\big)$.

Lemma 1. Let $T_q$, $q = 1, 2$, represent the number of respondents (barrios) in population $q$, $T_1 + T_2 = N$. Assume $k$ categories to question (variable) $t$. Then

$$E\big(R_t^{(1)}\big) = \frac{-T_2 A_t}{N}, \qquad E\big(R_t^{(2)}\big) = \frac{T_1 A_t}{N}, \qquad A_t = \sum_{i=1}^{k-1} \sum_{j>i} \big(p_{1i}\,p_{2j} - p_{1j}\,p_{2i}\big).$$

Proof. We prove only the statement concerning $R_t^{(1)}$ since $R_t^{(2)}$ may be treated similarly. Let $I_i$ equal 0 if the $i$-th response is not given and equal 1 if the $i$-th response is given, so that

$$R_t^{(1)} = \sum_{i=1}^{k} R_{ti}\, I_i.$$

Thus


The 18 questions so isolated represent an interesting mix of items, some expected and some not. As a matter of course it became necessary to validate the method and establish the true utility of this reduced question set as an indicator of quality of life. Accordingly, mortality data based on death registration were acquired for the years 1970 to 1974 and, using small area population estimates for 1973, rates were calculated for infant and child mortality by barrio. Although it is assumed that this rather gross method of measuring infant mortality could be improved, we expected it to be sufficient to validate the questionnaire reduction technique. Table 4 illustrates group correlations of child mortality with the good versus the bad questions as identified by the PRIDIT technique for 3 consecutive years.


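$$E\big(R_t^{(1)}\big) = \sum_{i=1}^{k} p_{1i}\, E\big(R_{ti} \mid I_i = 1\big) = \sum_{i=1}^{k} p_{1i} \left[ \sum_{j<i} \frac{(T_1 - 1)\,p_{1j} + T_2\,p_{2j}}{N} - \sum_{j>i} \frac{(T_1 - 1)\,p_{1j} + T_2\,p_{2j}}{N} \right].$$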
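The text breaks off at this point. One way the argument could be completed, assuming the conditional expectation is the difference between the expected shares of respondents below and above category $i$, with each share of the form $\big((T_1 - 1)p_{1j} + T_2 p_{2j}\big)/N$, is sketched below. This is a reconstruction under that assumption, not the authors' own wording. The terms involving only population 1 cancel by symmetry, and the cross terms give the stated result.

```latex
\begin{align*}
\sum_{i=1}^{k} p_{1i}\Big[\sum_{j<i} p_{1j} - \sum_{j>i} p_{1j}\Big]
  &= \sum_{j<i} p_{1i}\,p_{1j} - \sum_{j>i} p_{1i}\,p_{1j} = 0, \\
E\big(R_t^{(1)}\big)
  &= \frac{T_2}{N} \sum_{i=1}^{k} p_{1i}\Big[\sum_{j<i} p_{2j} - \sum_{j>i} p_{2j}\Big] \\
  &= \frac{T_2}{N} \sum_{i=1}^{k-1} \sum_{j>i} \big(p_{1j}\,p_{2i} - p_{1i}\,p_{2j}\big)
   = \frac{-T_2 A_t}{N}.
\end{align*}
```

The second equality in the last line relabels the indices in the first double sum so that both sums run over pairs with $i < j$. The statement for $R_t^{(2)}$ follows in the same way with the roles of the two populations exchanged.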
