Models for Small Area Estimation for Census Tracts John R. Logan1, Cici Bauer2, Jun Ke1, Hongwei Xu3, Fan Li4 1

Department of Sociology, Brown University, Providence, RI USA, 2 Department of Biostatistics and Data Science, University of Texas Health Science Center in Houston, Houston, TX USA, 3 Department of Sociology, Queens College, CUNY, Flushing, NY USA, 4 Department of Biostatistics, Yale University, New Haven, CT USA

This study examines issues of Small Area Estimation that are raised by reliance on the American Community Survey (ACS), which reports tract-level data based on much smaller samples than the decennial census long-form that it replaced. We demonstrate the problem using a 100% transcription of microdata from the 1940 census. By drawing many samples from two major cities, we confirm a known pattern: random samples yield unbiased point estimates of means or proportions, but estimates based on smaller samples have larger average errors in measurement and greater risk of large error. Sampling variability also inflates estimates of measures of variation across areas (reflecting segregation or spatial inequality). This variation is at the heart of much contemporary spatial analysis. We then evaluate possible solutions. For point estimates, we examine three Bayesian models, all of which reduce sampling variation, and we encourage use of such models to correct ACS small area estimates. However, the corrected estimates cannot be used to calculate estimates of variation, because smoothing toward local or grand means artificially reduces variation. We note that there are potential Bayesian approaches to this problem, and we demonstrate an efficacious alternative that uses the original sample data.

Introduction Researchers often wish to characterize small areas based on data from a sample of the local population. While direct estimators (e.g., average values based on random samples) are known to provide unbiased estimates, the variance of estimates can be quite large when samples are small. The consequences for statistical analysis are well known, but it is only recently that many users of census data—urban geographers, sociologists, and demographers—are becoming aware of this problem. In the past, most researchers analyzed the Census Bureau’s official counts as though they had no sampling error. Now scholars are taking notice of the diminished sample sizes in the ACS, which replaced the long-form decennial census after 2000 as the primary source of most tract-level population data. Starsinic (2005) estimated that the standard errors of ACS at the Correspondence: John R. Logan, Department of Sociology, Box 1916, Brown University, Providence RI 02912 e-mail: [email protected]

Submitted: December 26, 2016; Revised version accepted: June 10, 2019 doi: 10.1111/gean.12215 © 2019 The Ohio State University

1

Geographical Analysis

tract level, even when samples are pooled across five years of data collection, will be about 50% higher than in Census 2000 long-form data (see also National Research Council 2015, pp. 24–40, Sommers and Hefter 2014). This study uses historical census data to illustrate concretely the level of risk that researchers take when relying on direct estimates of population characteristics at the tract level from the ACS. We take advantage of the microdata that are now readily available from the 1940 census, when most data were recorded for 100% of the population. We draw samples from it to reveal how estimates vary across samples and how much this is dependent on the sample size. The simplest ways to deal with the problem require scholars to make compromises on the scale at which phenomena are studied or on the specificity of indicators. One approach is to aggregate census tracts to larger areal units for which the larger sample sizes would produce estimates that are more reliable. For example, researchers in the New York City Planning Department, routinely combine tracts into Neighborhood Tabulation Areas of at least 15,000 residents (Alvarez and Salvo 2014), which increases sample size by threefold or more. Another suggested procedure is to rely on multiple indicators rather than a single measure. Spielman and Singleton (2015) argue that estimation errors in each indicator are likely to be independent of one another, so that random errors from multiple indicators will average to zero. The alternatives examined here avoid these compromises. We deal with two kinds of estimates. The more familiar is a point estimate for a mean or proportion to represent a population characteristic for a single small area (small area estimation or SAE). Of the many possible approaches to SAE, we apply and evaluate the performance of three versions of Bayesian models, all of which borrow information from other areas or related data sets (smoothing) to improve the estimate in each individual area. Bayesian model-based estimation is expected to offer a large reduction in the variance of estimates compared to alternative design-based approaches, at the cost of introduction of some bias (Pfefferman 2013). Less familiar is estimation of the variation across areas, which represents spatial inequality or residential segregation, a phenomenon that is at the heart of much contemporary spatial analysis (Sampson 2012). The problem here, well known to specialists, is that estimates of variation based on the original sample data—because of the unreliability of point estimates—are inflated. We illustrate this inflation and show that it cannot be resolved by basing measures on the Bayesian-corrected SAE values. When point estimates are improved by smoothing, that procedure artificially reduces the variation among them. We present a simple and effective alternative non-Bayesian method to generate more reliable and less-biased estimates of variation from the original sample data. The analyses presented here should encourage social scientists to use both Bayesian and non-Bayesian methods as standard correction procedures when relying on the ACS to measure local area population characteristics. The statistical literature offers a growing list of models from which to choose. It is reassuring that the three main Bayesian model types that we consider perform nearly equally. Their performance is not much affected by introduction of covariates, and very similar results are found using the standard MCMC or more computationally efficient INLA estimation techniques. Our non-Bayesian approach to estimating variance across small areas is easy to implement from the original sample data and greatly reduces both bias and unreliability.

2

John R. Logan et al.

Small Area Estimation for Census Tracts

Overview of SAE models Small area estimation (SAE) is regularly employed in conjunction with standard data sources for health research. Despite very large national samples, major health surveys have insufficient samples for reliable estimates of characteristics of smaller geographical areas such as states or counties. Statisticians define a “small area” as one where “the domain-specific sample is not large enough to support direct estimates of adequate precision” (Rao 2003, p. 1). Hence, depending on the data source, a county or even a state may be “small.” And yet for public policy reasons it is often desirable to develop estimates for all states and counties (even for areas with no sampled respondents). Various approaches have been applied to this purpose over the years. The National Center for Health Statistics (NCHS) has used synthetic estimation based on implicit linking models to produce state estimates of health characteristics for different groups from the National Health Interview Survey (NHIS). Malec et al. (1999) used NHIS data to estimate state-level physician visits using a combination of individual-level auxiliary variables and area-level covariates. NCHS also provides annual state-level estimates of the substitution of wireless phones for land lines, based on small-area statistical models (Blumberg et al. 2012). The National Cancer Institute (n.d.) uses Bayesian estimation procedures to produce and disseminate county-level estimates of cancer from both the Behavioral Risk Factor Surveillance System (BRFSS) and NHIS (see also Ha, Lahiri, and Parsons 2011). Malec et al. (1997) have used data from NHANES III to estimate overweight prevalence by states. The National Household Surveys on Drug Abuse have been used as a basis for Hierarchical Bayesian estimates of small area prevalence of drug use (Folsom, Shah, and Vaish 1999; Wright, Sathe, and Spagnola 2007). Beresovsky and Malek (2011) applied these methods to the National Ambulatory Medical Care Survey. Other specialized surveys are being used in similar ways. Data from the National Assessment of Adult Literacy (NAAL) have been used to estimate literacy rates for states and counties (Mohadjer et al. 2012). Savitz and Raudenbush (2009) show that a spatial Bayesian model improves the fit to estimates of collective efficacy in Chicago’s community areas. The Bureau of the Census routinely relies on the American Community Survey (ACS), supplemented by other administrative records, for its Small Area Income and Poverty Estimates (SAIPE) program, which provides annual estimates of income and poverty statistics for all school districts, counties, and states (http://www.census.gov/did/www/saipe/about/index.html; see Citro and Kalton 2000, Maples and Bell 2005). Many different SAE models have been proposed. Here, we evaluate the performance of several variations of such models and compare them to direct estimates for samples ranging from 1% (where sampling variation is extreme) to 16% (where all estimators tend to converge on the true value). We consider three kinds of area level models (where the inputs are aggregate statistics from area samples of the sort provided for census tracts from the ACS). These include hierarchical models (HB) and two versions of spatial Bayesian models (SB and Leroux). HB models trace back to the Fay-Herriot model (Fay 1979) which extended the James-Stein shrinkage estimator. Implementing the HB models requires specification of a prior distribution for all model parameters and a hyperprior for the variance parameter of the independent random effects. Spatial models allow for spatial dependence by correlating the random effects, borrowing information from neighboring small areas to improve estimation (see Banerjee et al. [2014] for a complete survey of spatial models applicable to areal data). These models provide a flexible and unified framework for small area estimation by introducing different ways to borrow information from other areas. They have in common an

3

Geographical Analysis

exogenously determined spatial connection structure (or neighborhood matrix). For example, the HB model dictates conditional independence across spatial units and provides point estimates that are shrunk to the grand mean, while the SB and Leroux models assume a fixed geographical structure and provide estimates shrunk to neighborhood means. There are alternatives that are worth future consideration. For example, rather than fixing the spatial connection structure, Lee and Mitchell (2013) and Dong et al. (2019) describe a localized spatial smoothing approach that estimates the connection structure without reference to geographical proximity. The main idea is to include an additional updating step to infer the unknown connection structure in a dataadaptive fashion, thus allowing for both spatial dependence (macro-level smoothing) and implicit step changes (micro-level discontinuities). However, the locally spatial smoothing necessarily introduces more unknown parameters and is computationally more intensive, beyond the scope of this study. Here, we offer an introductory overview of each type of model. Our applications are to a binary variable (proportion of owner occupied homes and proportion of European-born residents). To describe estimates of binary variables, however, we begin with the case of a continuous variable such as an average. Let i index areas, i = 1, … , I . For the i th area, let Ni be the total population size and mi be the sample size. At the individual level, let yik denote the outcome of interest for respondent k from area, either continuous or binary. Then the direct estimate of the average of interest in area i is 1 ∑ y . Ȳ i = mi k ik

(1)

( ) When yik is a continuous variable, an unbiased estimate of var Ȳ i is ( ) ( ) 1 mi ̄ � Yi = 1− 𝜎̂ i2 var mi Ni

(2)

m

where 𝜎̂ i2 is the empirical sample variance for area i , and 1 − Ni is the finite population correction i factor. When yik is a binary variable, that is, yik = 0 or 1, the estimated average becomes a proportion p̂ i whose unbiased variance estimator is ) ( ) ( ( ) mi p̂ i 1 − p̂ i � p̂ i = 1 − var Ni mi − 1

(3)

) ( where p̂ i 1 − p̂ i can be regarded as the estimated Bernoulli variance of area i (Rice 2007, p. 212). When sample size mi is small, then the sample variance may not be an accurate estimate for the true population variance, resulting in an unreliable variance estimate in (2) or (3). The problem becomes exacerbated for binary outcomes. Specifically, for areas with very small sample size, the estimated proportion p̂ i can be either 0 or 1, which leads to a variance estimate of 0. Now we consider three types of models that are intended to overcome some of the problems in the direct estimates.1 These models borrow information from other areas, and are all fitted using Bayesian inference. For each model, we first introduce the case when the outcome variable is continuous, then the counterpart when the outcome variable is binary.

4

John R. Logan et al.

Small Area Estimation for Census Tracts

Model 1: Bayesian hierarchical model with independent Gaussian random effects (HB model) When the outcome is continuous, we may assume the outcome data (or some transformation of the outcome data) follow a normal distribution: ( ) yik |𝜇i , 𝜎i2 ∼ N 𝜇i , 𝜎 2 , (4) where 𝜇i is the true mean of the quantity of interest in area i and 𝜎 2 the variance. This is the first stage in a Bayesian hierarchical model. In the second stage, we model the deviation of each 𝜇i from the overall mean 𝜇 by a random effect vi, which we assume to have an independent and identical (i.i.d.) Gaussian distribution with mean 0 and variance 𝜎v2: 𝜇i = 𝜇 + vi

(5)

Here, the random effect term vi captures the between-area variability. By assuming a common distribution of the area random effect vi, we borrow information from other areas to assist estimation for area i . At the third stage, we need to decide prior distributions for the remaining parameters, which in this case are the intercept and the variance parameters. Gelman (2006) provided a comprehensive review of the prior specifications for variance parameters in Gaussian hierarchical models, and pointed out the challenge in choosing appropriate priors. We assessed the posterior distribution of the variance parameter, which has little mass near zero. Accordingly, we used a non-informative normal distribution (i.e., normal distribution with large pre-specified variance such as 1,000), and assigned non-informative Gamma priors for the precision parameters (i.e., 𝜎v−2) as suggested by Gelman.2 ) ( When the outcome is binary, we assume yi |pi ∼iid Binomial mi ,pi , where yi is the number of individuals with outcome 1, and pi is the true percentage of interest. A generalized linear random-effects model is: ) ( pi = 𝛽0 + vi log 1 − pi ( ) vi |𝜎v2 ∼iid N 0, 𝜎v2 , i = 1, … , I

(6)

where 𝛽0 is the overall intercept and vi the random effect. This is essentially a logistic regression model, therefore vi can also be interpreted as the between-area variability on the log odds ( )� scale. When area-level covariates are available, denoted by xi = x1 , … , xp , model (6) can be extended to: ) ( pi = 𝛽0 + x�i 𝛽 + vi log (7) 1 − pi where 𝛽 represents a vector of area-level fixed effects associated with the covariates xi in area i . Covariates can also be introduced in models with continuous outcomes.

5

Geographical Analysis

Model 2: Bayesian hierarchical model with spatial ICAR random effects (SB model) To borrow information from the neighboring areas, we make the assumption that areal units that are close to each other more similar than units that are far away. This is the motivation of the spatial model introduced by Besag et al. (1991). In this model, we supplement the independent normal random effects with spatial random effects: 𝜇i = 𝜇 + vi + ui ( ui |uj , j ≠ i, 𝜎u2 ∼ N

2 1 ∑ 𝜎u uj , si j∈𝛿 si

)

i

(

) 2

vi |𝜎v2 ∼iid N 0, 𝜎v , i = 1, … , I ,

(8)

where 𝛿i denotes the set of neighbors of area i and si the number of its neighbors (often referred as cardinality). This model, therefore, includes a non-spatial normal random effect component vi, and a spatial random effect component ui. The spatial random effect ui is assumed to have a Gaussian distribution with conditional mean given by the average of its neighbors and conditional variance inversely proportional to the number of its neighbors. This model is usually called the intrinsic conditional autoregressive (ICAR) model and has been popular in the disease mapping literature (Banerjee et al. 2014). The parameter 𝜎u2 is a conditional variance that is conditional on the neighbors and it determines the contribution of spatial variation. The variance parameters 𝜎v2 and 𝜎u2 are on different scales (i.e., the former is marginal and the latter is conditional), and therefore cannot be directly compared. However, the amount of variation that can be ̂ ui ). explained by the spatial component can be estimated empirically by the sample variance var(̂ Similarly, area-level covariates can be added to the model in a straightforward manner. In this spatial model, the spatial dependency is defined by the neighborhood structure. For example, a common approach defines areas i and j to be neighbors if they share a common boundary. Other neighborhood schemes are possible. For example, Cressie and Chan (1989) defined the neighborhood structure as a function of the distance between area centroids. For a binary outcome variable, we have ) ( pi = 𝛽0 + vi + ui log (9) 1 − pi where ui and vi are the spatial and non-spatial random effects as defined in (8). The priors are specified similarly as the HB model, with a non-informative Gaussian prior for the intercept 𝛽0 and non-informative Gamma priors for the random effects precision parameters. Model 3: Bayesian hierarchical model with spatial CAR random effects (Leroux model) Instead of leveraging an additive spatial effect as in the ICAR model, we could borrow information from the neighboring areas by directly introducing associations among the random effects in the HB model. Leroux (1999) and MacNab (2003) studied such a conditionally autoregressive structure for log-linear models. We further explore this spatial structure (which we will refer to as the Leroux CAR model) for continuous and binary outcomes. We have 𝜇i = 𝜇 + w i

6

John R. Logan et al.

Small Area Estimation for Census Tracts

( wi |wj , j ≠ i, 𝜎w2 , 𝜌 ∼ N

∑ 𝜎w2 𝜌 wj , 1 − 𝜌 + si 𝜌 j∈𝛿 1 − 𝜌 + si 𝜌

) ,

(10)

i

where wi is the spatial random effect for area i , 𝜎w2 is the conditional variance that is conditional on the neighboring random effects, 𝜌 is a spatial dependency parameter restricted to the interval [0,1], and 𝛿i, si again represent the set of neighbors of area i and its cardinality, as previously introduced. The neighborhood structure can be defined similarly as before based on geographical proximity. In model (10), the spatial dependency parameter 𝜌 quantifies the degree of spatial association among the random effects and will be estimated from the data. In the absence of spatial correlation, that is, 𝜌 = 0, the wi’s form a set of independently and identically distributed Gaussian random effects with zero mean and variance 𝜎w2 . Note that in this case the Leroux CAR model reduces to the HB model and 𝜎w2 becomes the marginal variance due to independence. In the presence of strong spatial dependency, that is, 𝜌 = 1, the conditional prior distribution for each wi is ICAR given in (8), and model (10) reduces to a variant of spatial ICAR model without vi. In the presence of area-level covariates, we will use 𝜇i = 𝜇 + x�i 𝛽 + wi for the mean structure. For binary outcome data, we have ) ( pi = 𝛽0 + wi log (11) 1 − pi where the wi’s are modeled as in (10). To estimate model parameters, we assign non-informative Gaussian priors for 𝛽0 and non-informative Gamma prior for 𝜎w2 . There are two strategies for assigning a hyperprior for the spatial dependency parameter 𝜌. We could examine the total possibilities within the complete parameter space by assuming 𝜌 ∼ Unif (0,1). However, doing so may be computationally intensive when the number of total areas I is moderate to large (Lee, 2011). A more computationally efficient approach identifies a set of r equally spaced values within [0,1) representing from weak to strong spatial dependency and assigns a discrete uniform prior. In this case, some sensitivity analyses may be necessary with respect to the values of r. Variations in model implementation These three kinds of models are well known in the statistical literature. Our contribution is to compare their performance in a real-world setting in order to assist population researchers in deciding which to consider using. Our findings are also relevant to making comparisons between studies from 2000 and earlier, when sample sizes were larger, and later ACS data based on much smaller samples. In our application, we have access to 100% unit-level data from a population census and we estimate small area characteristics and variation across areas from samples of different sizes. We also explore the introduction of a covariate that is expected to be highly correlated with the characteristic to be estimated. This can be done under the assumption that the covariate is measured without sampling error. This is the approach in Ghosh and Meeden (1996) and we implement it by measuring the covariate from the 100% population data. Or it can use data from the same sample draw that is used to estimate the outcome variable, so that it, too, is affected by measurement error. This approach is considered, for example, in Ghosh, Sinha, and Kim (2006). We compare performance of the model when the covariate is added with or without this measurement error. Hence in the following analyses, all estimates are repeated a total of 10 times (each based on 200 sample draws). In addition to the direct (naïve) estimate, estimates based on the Hierarchical 7

Geographical Analysis

Bayesian (HB), Spatial Bayesian (SB), and Leroux models are presented with no covariate, with a covariate measured from the sample, and with a covariate measured from the total population.

Implementation and research design We use a novel methodology to illustrate the seriousness of the problem of sampling variability in local population data and to demonstrate the effectiveness and limitations of Bayesian modeling. Most similar studies have been based on simulated data, which offers great flexibility but requires the investigator to make many distributional assumptions on population characteristics (Gewerke 1999). These simulations may not capture all of the intricacies of real-life population distributions. The current study is based instead on sampling from a known and complete population rather than simulation derived from theoretical assumptions. We employ a newly available source of 100% census microdata, the fully digitized version of the 1940 U.S. Census of Population. These data are no longer confidential, and a 100% transcription of records is available for public use in digital form through the North Atlantic Population Project (NAPP, http://www.nappdata.org/napp, Ruggles et al. 2010). From this full population, we draw 200 samples at each of many different sampling rates in order to show how results depend on the sampling proportion, and from every sample we derive local population estimates in a variety of ways. We then compare these estimates to the actual full-count values in order to assess the utility of alternative approaches. Results do not apply fully to applications to ACS data, because the Census Bureau employs a complex scheme of sampling and weighting that is not transparent and that we cannot replicate. Instead we draw simple random samples. The study uses the cases of Chicago, IL, and New Haven, CT. For examination of Bayesian models, we aggregated data to the scale of enumeration districts (EDs), the standard administrative unit for the 1940 census. In order to carry out spatial analyses, we make use of ED-level maps created by the Urban Transition Historical GIS Project (Logan and Zhang 2017). The full 1940 data set includes many variables that are commonly used in contemporary demographic research. We focus on measures of home ownership and ethnicity (the percentage of residents born in Europe), and we introduce the average occupational status of employed persons (the Socioeconomic Index of occupations, or SEI) as a covariate in some models. Home ownership is a characteristic of households rather than persons, so the sample size available for estimating homeowner share is considerably smaller than samples used to estimate European ethnicity. Consequently, we expect estimates for ethnicity to be closer to the true values and have less sampling variation than estimates for home ownership. We selected these two cities in order to represent variations that could affect the performance of different estimators. One difference is related to size. The larger city, Chicago, had 2,986 populated EDs with a median population of 1,129 (ranging from 2 to 4,503). New Haven had only 197 populated EDs and their median size was considerably smaller (766, ranging from 140 to 1,874). Therefore, we expect to find better estimates in Chicago than New Haven. Another consideration is the average values and variation across EDs for each outcome variable, and the degree of spatial clustering: •

8

The mean European-born level in Chicago EDs was 18.8% (standard deviation of 9.1%). Europeans were highly clustered spatially, with a Moran’s I of 0.737 (P