
Pairwise Cluster Randomization: An Exposition

Evaluation Review 2014, Vol. 38(3) 217-250. © The Author(s) 2014. Reprints and permission: sagepub.com/journalsPermissions.nav. DOI: 10.1177/0193841X14540654. erx.sagepub.com

William Rhodes1

Abstract

Background: Cluster randomization (CR) is often used for program evaluation when simple random assignment is inappropriate or infeasible. Pairwise cluster random (PCR) assignment is a more efficient alternative, but evaluators seem to be deterred from PCR because of bias and identification problems. This article explains the problems, argues that they can be mitigated through design choices, and demonstrates that the suitability of PCR can be tested using Monte Carlo procedures.

Research Design: The article presents simple formulas showing how the PCR estimator is biased and explains why its standard error is not identified. Formal derivations appear in a longer companion article. Using those formulas, this article discusses how good design can mitigate the problems with bias and identification. Using Monte Carlo simulation, this article also shows how to choose between CR and PCR at the design stage.

Conclusions: This article advocates for wider use of the PCR design. PCR loses its appeal when the investigator lacks baseline data for matching the clusters. Its use is less compelling when there are a large number of clusters. But when the evaluator is working with a fairly small number of clusters—26 in the running example used in this article—PCR is an attractive alternative to CR.

1 Abt Associates Inc., Cambridge, MA, USA

Corresponding Author: William Rhodes, Abt Associates Inc., 55 Wheeler Street, Cambridge, MA 02138, USA. Email: [email protected]


Keywords

outcome evaluation (other than economic evaluation), design and evaluation of programs and policies, methodological development, content area

This article reviews the statistical analysis of data collected using pairwise cluster randomization (PCR), a design being applied with increasing frequency in evaluations of health care and social interventions.1 Although the topic is covered in several articles and books, explanations are often cryptic and obscured by technical language; furthermore, recommended approaches are varied and authorities frequently disagree. This article integrates extant arguments, explains complications when using the PCR design to estimate treatment effects, and recommends design and estimation procedures for program evaluation.

The argumentative method used in this exposition employs Neyman's randomized treatment assignment mechanism, sometimes called the Neyman-Rubin model of causal inference. This method has been used by others to study the properties of the cluster randomization (CR) design (Middleton & Aronow, 2012; Schochet, 2012) and the PCR design (Imai, 2008; Imai, King, & Nall, 2009) because it provides useful insight into estimation problems and solutions. The estimators discussed in this exposition are based on cluster-level summaries, so readers who use individual-level analysis techniques (random effects models, generalized estimating equations, mixed models, and so on) will not find them discussed here.2 This limitation is a concession to the restrictions of a short paper, the fact that important estimation issues discussed within the context of cluster-level analyses translate into individual-level analyses, and recommendations by some authorities against using individual-level analysis techniques for PCR (Hayes & Moulton, 2009, p. 241). Estimation is conceptualized at the cluster level, but the use of individual-level covariates will enter into the discussion. Although this exposition recasts arguments made by others, it argues points from an intuitive perspective and also raises issues not considered elsewhere.

This exposition begins by explaining that the PCR estimator produces biased estimates of treatment effectiveness and that the sampling variance for the test statistic is not identified. This assertion may surprise readers who perceive PCR as simply a way to avoid "bad draws." Although bias and identification are serious problems, this article shows that the problems are ameliorated by careful matching, prestratification to adjust for cluster characteristics, and poststratification to adjust for unit characteristics. This exposition explains that remedial steps are justified because the PCR design can produce estimates with better mean-squared error properties than those of the CR design. However, this improvement is not guaranteed because of degrees of freedom adjustments, so this topic will receive consideration. After dealing with the theoretical aspects of PCR, the exposition turns to some parametric and nonparametric estimators that offer practical choices when applying the PCR design. The exposition ends with a Monte Carlo procedure recommended for making design decisions. There exist other, nonstatistical considerations for adopting or rejecting the PCR design, and while this exposition touches on a few of those considerations, it refers readers to other sources for brief (Donner & Klar, 2004; Imbens, 2011) and extended discussions (Donner & Klar, 2000; Hayes & Moulton, 2009; Murray, 1998).

PCR in Context

Program evaluators recognize randomization as a device for inferring causal effects of program interventions (Orr, 1999). In a simple random design experiment, the evaluator randomly assigns study subjects to either a treatment group or a control group, and absent spillover from the treatment to the control group, the treatment effect—defined as the mean difference in outcomes for members of the treatment group and the control group—is identified without additional assumptions. Simple random design experiments are not without critics (Berk, 2005; Heckman & Vytlacil, 2007a, 2007b). One practical criticism is that experiments can be prohibitively expensive; another is that treatment effects often spill over into the outcomes for the control group.

Cluster randomization (CR) is sometimes offered as a solution. In a CR design, study subjects (hereafter units) are associated with J clusters, the clusters are randomized into treatment and control conditions, and every unit within the treatment clusters is treated and no unit within the control clusters is treated. (There are variations in this design, but this simple version is common and suitable for present purposes.) As an illustration, the author and colleagues are currently evaluating an intervention to reduce resident falls within nursing homes. The nursing homes are clusters, nursing home residents are units, and nursing homes are randomly assigned to treatment or control. The treatment (program intervention) is being introduced at the nursing home level and affects all residents within the treatment nursing homes and none of the residents within the control nursing homes.


The nursing home evaluation illustrates some advantages of the CR design: It is less expensive to train nursing home staff within a random sample of clusters than to train all nursing home staff and then restrict the intervention to a random sample of residents. It is more practical to prevent spillover by introducing the intervention in some nursing homes and to thereby isolate units in control nursing homes from spillover contamination.3 Cluster randomization offers other advantages, but these two will suffice for the present discussion.

A disadvantage of cluster randomization is that the design is less powerful than unit-level random assignment, for reasons that are recognized by survey sampling statisticians (Kish, 1995) and described in almost all treatises on cluster randomization: intracluster correlation inflates standard errors and reduces power (Cornfield, 1978; Schochet, 2008). Designing experiments requires estimating statistical power, surely an important topic, but not the principal focus of this article, which instead is concerned with the basic inferential properties of the PCR design. Of course, power depends on inferential properties, so in that sense power is an implicit topic and receives attention.

In many but not all cases, an evaluator can improve power by matching K pairs of clusters and hence converting a CR design into a PCR design. For example, in the nursing home illustration, there are 26 participating nursing homes, paired into 13 strata using preintervention measures of the rate of resident falls. One member of each pair is randomly assigned to treatment. Pair matching can improve power for the same reasons that stratification can improve accuracy in surveys, although the improvement depends importantly on the quality of the match, as poor matches reduce power. This assertion, which is the most important statistical criterion for choosing between PCR and CR, is justified later. Some additional technical reasons for preferring the PCR to the CR design are discussed elsewhere (Gail, Mark, Carroll, Green, & Pee, 1996).

Despite the advantages of a PCR design, statistical inference is complicated. Even when an evaluator accounts for the clustering, conventional estimators of the average treatment effect may be biased and standard errors may not be identified. The nature of the problem depends on the purpose behind estimation. A two-by-two classification table (Table 1) provides a useful way to think about the objectives of estimation (Imai, King, & Nall, 2009) and leads to a basic understanding of the problem. One dimension of this table indicates whether the evaluator seeks to draw inferences about the J clusters or else about a somewhat nebulous (Schochet, 2012) population of M clusters from which the J clusters are sampled.


Table 1. A Two-Way Classification Table of Research Objectives Identified as A Through D.

Unit and Cluster Choices    J Clusters    M > J Clusters
Σn_j units                  A             B
ΣN_j units                  C             D

The other dimension of the table indicates whether the evaluator seeks to draw inferences about the Σn_j units comprising a sample from the observed clusters or about the ΣN_j units from which the sample was drawn. There are four possible objectives represented by cells A through D.

This article assumes that the evaluator seeks to draw statistical inferences about questions that are consistent with cell C. That is, the evaluator seeks to estimate the average treatment effect for the ΣN_j units that could conceptually appear within the fixed J clusters. The author is aware that this is not always the objective. If the J clusters are a probability sample of the M clusters, then the evaluator might make an inference about the M clusters, so the focus would be on cell D. The conceptual shift from C to D makes a difference for the estimation problem, and this article will discuss that difference. It is less likely that the evaluator would seek to draw statistical inferences about the Σn_j units rather than the ΣN_j units. Still, cell A is useful for conceptualizing the bias and identification problems posed by CR and PCR designs because important cluster parameters (μ_j or μ_j + δ_j) are observed4 rather than estimated given cell A. Nevertheless, even when cluster parameters are observed, estimates of treatment effectiveness are biased and standard errors are not identified, a point made in this article by presenting the simple variance equation for case A.

The immediate problem considered in this article is (1) why do the bias and identification problems occur and (2) what can an evaluator do to counter them? As noted, this problem is discussed elsewhere, but those discussions provide limited insight into the source of the problem, and they may provide ambiguous—and potentially incorrect—advice about estimation. This article provides an intuitive explanation for the problem and, based on that intuition (and some representative mathematics), discusses design and estimation. An expanded companion paper, available from the author, provides mathematical derivations for motivated readers.


Notation

Developing the argument requires Neyman-Rubin's potential outcome notation. Define Y_ijkt(TR_j) as the outcome for the ith unit in the jth cluster for the kth pair during period t. The term (TR_j) indicates that the observations are taken when the cluster is assigned to treatment (TR_j = 1) or when the cluster is assigned to control (TR_j = 0), and only one of these two conditions is observable for a given cluster. Y_ijkt(TR_j) may be sampled from a larger population within the jth cluster; the cluster population means are μ_jt + TR_j δ_j and the variances are σ²_jt, where δ_j is the average treatment effect for the jth cluster. When covariates enter the analysis, they are denoted as X_ijkt; a necessary assumption is that the covariates are independent of treatment assignment, that is, covariates are exogenous. In simple terms, the X_ijkt cannot be determined by treatment assignment. The notation is redundant because cluster membership determines pair membership; however, at some times the discussion will use index j and at other times it will use index k. The periods are preintervention when t = 1 and postintervention when t = 2. The discussion does not necessarily assume repeated measures (i.e., the same study subjects are observed at preintervention and postintervention), although if repeated measures were available, that availability might be exploited in the estimation, presumably by using change scores or other approaches.

There are J clusters, and assuming one-to-one matching, there are K = J/2 pairs. To create the pairs, sort the file based on the average preintervention outcomes using cluster-average measures Y_ijk1(0) and match so that k = 1 when j = 1 or 2, k = 2 when j = 3 or 4, and so on. (There are other ways to match; see Guo & Fraser, 2010; Zhang & Small, 2009; so this approach is illustrative.) Having performed the matching, drop the t subscript because all subsequent calculations will use postintervention data. At some points in the discussion, the notation is simplified. When using the k subscript without the j subscript, we add a second subscript m, where m = 1 for the first cluster of the kth pair and m = 2 for the second cluster in the pair. This notation will only apply after we have made the matches, so we can drop the t subscript and simply write Y_ikm. For simplicity, following randomization within the matched pairs, we will always rearrange the order so that m = 1 for the treatment cluster and m = 2 for the control cluster.

Finally, regarding notation, n_j denotes the number of sampled units in the jth cluster, and N_j denotes the population cluster size. Throughout, units comprise a simple random sample from the population, although other probability-based sampling schemes are readily accommodated with suitable notation. The J clusters might be a probability-based sample from a larger population, but assume initially that inferences pertain to treatment effects across the J clusters, not a larger population of clusters, so the sampling of the J clusters is irrelevant. Implications of treating the J clusters as a sample from a larger population are discussed later.
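To make the matching step concrete, the following sketch pairs clusters by sorting on preintervention cluster means and then randomizes treatment within each pair, as described above. It is a minimal illustration: the baseline values are simulated, and all names (for example, baseline_means) are hypothetical rather than taken from the article or its companion paper.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

# Illustrative baseline data: preintervention mean outcomes for J = 26 clusters
# (e.g., preintervention fall rates for 26 nursing homes).
J = 26
baseline_means = rng.gamma(shape=4.0, scale=0.5, size=J)  # hypothetical values

# Sort clusters by their preintervention means and join adjacent clusters
# into K = J/2 pairs, as in the matching scheme described in the text.
order = np.argsort(baseline_means)
pairs = order.reshape(-1, 2)          # row k holds the cluster indices of pair k

# Within each pair, randomly assign one cluster to treatment and one to control.
treatment = np.zeros(J, dtype=int)
for k, (a, b) in enumerate(pairs):
    treated = rng.choice([a, b])
    treatment[treated] = 1

for k, (a, b) in enumerate(pairs):
    print(f"pair {k + 1:2d}: clusters {a:2d}, {b:2d}  "
          f"baselines {baseline_means[a]:.2f}, {baseline_means[b]:.2f}  "
          f"treated cluster {a if treatment[a] else b}")
```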

Treating the K Pairs as the Units of Observation

It is possible to think of the data as comprising Σn_j observations, J observations, or K observations. All three characterizations are valid, but it will be useful to discuss estimation from the perspective of K observations, each of which provides an estimate of the average treatment effect for clusters within a pair (δ_k), namely the mean difference between the outcome for the treatment cluster and the control cluster. The estimated treatment effect across all clusters is then the weighted average of the K pairwise estimates:

$$\hat{\delta} = \sum_{k} \omega_k \hat{\delta}_k = \sum_{k} \omega_k \left( \frac{\sum_{i=1}^{n_{k1}} Y_{ik1}}{n_{k1}} - \frac{\sum_{i=1}^{n_{k2}} Y_{ik2}}{n_{k2}} \right). \quad (1)$$

The weights ω_k sum to one, but beyond that constraint, the evaluator has latitude. The choice of weights is obviously important, given potentially heterogeneous treatment effects. Different estimators discussed in this article and elsewhere apply different weighting schemes and, as a consequence, they may estimate different parameters. Some frequently used estimators estimate parameters that are not easily interpreted.

An evaluator might seek to estimate the average treatment effect when every unit in a population is assigned a weight of one, so the average treatment effect for a cluster is given a weight proportional to N_j, and the pair is given a weight proportional to N_k1 + N_k2. Call this the population average weight. An evaluator might seek to estimate the average treatment effect when every unit in a sample is given a weight of one, so the average treatment effect in a cluster is given a weight proportional to n_j, and the pair is given a weight proportional to n_k1 + n_k2. Call this the sample average weight. The population average weight and the sample average weight are the same, given that the sampling is proportional to size. Alternatively, an investigator might seek to estimate the simple average treatment effect across the J clusters. In that case, each cluster receives the same weight as every other cluster. Call this the cluster average weight. Other weights are frequently used including, especially, weighting used to produce efficient estimates of average treatment effects.

Equation 1 implies that an evaluator first estimates the pairwise treatment effect and then averages across the pairs. We prefer this formulation to a regression-based formulation because it is explicit about the weighting and hence about what parameter is being estimated. In contrast, it is often difficult to tell what a regression-based estimator is identifying because the underlying weighting is implicit, a point to which this article turns next.
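As a small sketch of Equation 1 and the weighting choices just described, the following code computes pairwise differences from hypothetical cluster-level summaries and averages them under population average, sample average, and cluster average weights. All numbers and variable names are illustrative assumptions, not data from the article.

```python
import numpy as np

# Hypothetical cluster-level summaries for K matched pairs.
# Index m = 1 is the treatment cluster, m = 2 the control cluster of each pair.
ybar_t = np.array([2.1, 3.4, 1.8, 2.9])   # sample mean outcomes, treatment clusters
ybar_c = np.array([2.6, 3.9, 2.0, 3.5])   # sample mean outcomes, control clusters
n_t = np.array([40, 55, 30, 60])          # sampled units per treatment cluster
n_c = np.array([45, 50, 35, 58])          # sampled units per control cluster
N_t = np.array([400, 560, 300, 610])      # population sizes, treatment clusters
N_c = np.array([430, 520, 340, 590])      # population sizes, control clusters

d_k = ybar_t - ybar_c                     # pairwise treatment-effect estimates

def weighted_effect(d_k, raw_weights):
    """Equation 1: weighted average of pairwise effects, weights summing to one."""
    w = raw_weights / raw_weights.sum()
    return np.sum(w * d_k)

print("population average weight:", weighted_effect(d_k, (N_t + N_c).astype(float)))
print("sample average weight:    ", weighted_effect(d_k, (n_t + n_c).astype(float)))
print("cluster average weight:   ", weighted_effect(d_k, np.ones_like(d_k)))
```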

Regression-Based Estimators

As a contrast to the estimates represented by Equation 1, consider estimators that are based on a regression, such as random effects models and hierarchical models, which use individual-level data. These models implicitly weight the cluster-level treatment effects proportional to the sample size (adjusted for any differences in the unit-level variance within each cluster) plus a constant that appears because of variation in effects across the clusters (i.e., intracluster correlation). Given heterogeneous treatment effects, random effects models and hierarchical models in general do not estimate the same thing as the models discussed in this article. From the perspective of this article, regressions employ peculiar weighting schemes, driven largely by a quest for efficiency and often (but not necessarily) by an assumption that treatment effects are homogeneous. The estimating equation approach differs because efficiency is seen as a secondary consideration, parameters are typically estimated using the sample size to weight the cluster-level effects, and standard errors are estimated using robust estimation procedures.

In summary, regressions based on individual-level data do not necessarily estimate the same parameters as do the cluster-based estimators discussed in this article. We do not mean this as any implied criticism of using regression models to estimate treatment effects; surely this is an established and venerable approach. Rather, the point is that the cluster-level approach clarifies what is being estimated, a consideration sometimes obscured by regression-based estimates. Another observation is that regression-based approaches appear to be based on a large number of observations: Σn_j. This sample size is misleading, a point emphasized by a sizable literature warning how intracluster correlations inflate standard errors (Cornfield, 1978), but the point is clearer when considering estimation from the cluster-level perspective. A third observation is that regression-based procedures may encourage thoughtless use of an apparently more sophisticated but perhaps misleading regression in place of a simpler, readily interpretable cluster-based estimator. Finally, many of the regression-based procedures require some distributional assumptions, and those assumptions may be false.

Schochet (2012) suggests that evaluators test the sensitivity of their estimates using different estimators, although his approach raises the threat of accepting the estimate that appears best, given the evaluator's prior opinions. An alternative is to test estimators using baseline data, perhaps by using the approaches suggested in the Monte Carlo section of this article. Overall, this article promotes cluster-level estimates on a conceptual level because the cluster-level estimates are transparent about what is being estimated, but it also favors cluster-level estimation on an applied level because of simplicity.

Estimates of Treatment Effectiveness Are Biased But Consistent

Starting with the CR design first, a companion paper and the professional literature show that the CR estimator is biased but consistent as J goes to infinity (Schochet, 2012), provided the size of the treatment effect does not depend on the size of the cluster.5 Regardless, the CR estimator will be unbiased if the treatment effects are homogeneous because then the weighting scheme does not matter from a bias perspective. Moving from Case A to Case C in Table 1, presuming that the sample within each cluster is a simple random sample proportional to size, the story about bias and asymptotic properties does not change. Moving from A to C adds variance to the estimator because the previously known μ_j or μ_j + δ_j must be estimated, but otherwise nothing has changed. Moving from C to D alters the orientation, causing us to think of the data as being a realization of the sample of J clusters from M clusters, and the estimate is unbiased presuming that M is large. (Of course, the sampling variance increases when J is treated as a sample from M, so just pretending that J is a sample in order to identify the treatment effect is a Pyrrhic victory.) We conclude that the CR estimator is biased but consistent in Case C, the case of greatest interest to this article.

Similar storytelling surrounds the PCR design. When J goes to infinity and hence K goes to infinity, the PCR estimator is consistent provided the treatment effect is independent of the sample size (Imai, King, & Nall, 2009) within pairs.6 Moving from C to D, the PCR estimate is unbiased provided M is large, but in Case C, the case of greatest interest in this article, the PCR estimator is biased. But PCR is most useful when there are few clusters, so asymptotic properties are not compelling, and we must consider bias. Following Imai, King, and Nall—or the proof given in the companion paper—the bias for pairwise matching is given by Equation 2:

$$\operatorname{bias}\!\left(\hat{\delta}\right) = \sum_{j=1}^{J} \frac{N_j \delta_j}{\sum_{j=1}^{J} N_j} - \sum_{k=1}^{K} \frac{N_{k1}+N_{k2}}{\sum_{k=1}^{K}\left(N_{k1}+N_{k2}\right)} \left( \frac{\delta_{k1}+\delta_{k2}}{2} \right) = \frac{\sum_{k=1}^{K}\left(N_{k1}-N_{k2}\right)\left(\delta_{k1}-\delta_{k2}\right)}{2\sum_{k=1}^{K}\left(N_{k1}+N_{k2}\right)}. \quad (2)$$

In general, the bias represented by Equation 2 will be nonzero, but there are three conditions under which bias disappears, and these conditions have design implications. The first condition is that the intended weighting scheme differs from the one proposed here so that the treatment effect in every cluster receives the same weight, that is, the intention is to estimate a simple cluster-specific average treatment effect. This is seldom the purpose of estimation, so this observation provides little comfort. The second condition is that the treatment effect is the same in both clusters of a pair, in which case δ_k = δ_k1 = δ_k2. (The treatment effect can vary across clusters as long as it is constant within pairs.) The third is that the clusters are of equal size, so N_k1 = N_k2 within each pair, although cluster size can otherwise vary across clusters. Notice that the conditions for unbiased estimation are more lenient in the PCR design than in the CR design because good matching will reduce and potentially eliminate bias. Unexpectedly, perhaps, random assignment has not guaranteed an unbiased estimate of the treatment effect.

Returning to Equation 2, if the problem is that δ and the cluster sizes differ between the clusters within a pair, the bias might still disappear asymptotically, where asymptotic properties depend on the number of pairs increasing to infinity. The intuition is that over a large number of clusters, (δ_k1 + δ_k2)/2 will average to (N_k1 δ_k1 + N_k2 δ_k2)/(N_k1 + N_k2). This asymptotic property will hold provided the treatment effects are not a function of the cluster size, an assumption that may be reasonable in some settings but not in others. Furthermore, the number of clusters is often a limiting factor, so asymptotic properties provide modest comfort. Equation 2 offers some design implications. Before turning to those implications, consider a formula for the sampling variance of the estimated treatment effect.


Standard Errors Are Not Identified

Using Neyman's mechanism and applying variance decomposition, a companion paper shows that for cell A in Table 1 the variance of the estimated treatment effect is:

$$\mathrm{VAR}\!\left(\hat{\delta}\right) = \sum_{k=1}^{K} \left\{ \frac{N_{k1}+N_{k2}}{\sum_{k=1}^{K}\left(N_{k1}+N_{k2}\right)} \right\}^{2} \left( \left[ \frac{\sigma^{2}_{k1}}{n_{k1}} + \frac{\sigma^{2}_{k2}}{n_{k2}} \right] + \left(\mu_{k1}-\mu_{k2}\right)^{2} \right). \quad (3)$$

The problem is that while we can estimate the term in brackets, absent making some assumptions, we lack an estimator for the term in parentheses. That is, we can estimate μ_k1 or μ_k2 but not both. (Moving from Case A to C requires the addition of another variance term but does not otherwise resolve the identification problems.) This article will make repeated backward references to Equations 2 and 3 because they are the crux of the problems with bias and identification when dealing with Case C, the case motivating most of the discussion in this article.

Design Considerations

These results are disconcerting. The estimate for the treatment effect is in general biased, and even when it is not, no estimator identifies the sampling variance. These observations have caused some authorities to advise against using the PCR design, but others (Gail et al., 1996) have argued that the biases are typically small, and from a mean-squared error perspective, there is a strong argument for the PCR over the CR design (Imai, King, & Nall, 2009). Nevertheless, making this argument requires some estimator for the variance, considered later when discussing estimators. Prior to that consideration, pause to reflect on some practical aspects of design.

According to Equations 2 and 3, the quality of the match is a key element of estimator performance. According to Equation 2, the bias will be zero when the treatment effects for clusters within a pair are equal: δ_k1 = δ_k2. We cannot observe whether this condition holds, but we expect that the condition would hold approximately when the clusters joined in a pair are alike. According to Equation 3, the variance will be identified when, absent treatment, the outcomes for clusters within a pair are equal: μ_k1 = μ_k2. We cannot observe whether this condition holds, but again, we expect that it will hold approximately when the clusters joined in a pair are alike. Good matching is essential.
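To illustrate why matching quality matters, the following sketch evaluates the bias term of Equation 2 and the unidentified (μ_k1 − μ_k2)² component of Equation 3, as reconstructed above, for a hypothetical well-matched pairing and a hypothetical poorly matched one. All values are invented for illustration.

```python
import numpy as np

def eq2_bias(N1, N2, d1, d2):
    """Bias term of Equation 2 for K pairs with population sizes N and effects delta."""
    return np.sum((N1 - N2) * (d1 - d2)) / (2.0 * np.sum(N1 + N2))

def eq3_unidentified(mu1, mu2, N1, N2):
    """Sum of the unidentified (mu_k1 - mu_k2)^2 terms, weighted as in Equation 3."""
    w = (N1 + N2) / np.sum(N1 + N2)
    return np.sum(w**2 * (mu1 - mu2) ** 2)

# Hypothetical well-matched pairs: similar sizes, similar effects, similar means.
N1, N2 = np.array([100, 120, 90]), np.array([105, 115, 95])
d1, d2 = np.array([0.50, 0.40, 0.60]), np.array([0.52, 0.42, 0.58])
mu1, mu2 = np.array([2.0, 3.1, 1.5]), np.array([2.1, 3.0, 1.6])
print("good match: bias =", eq2_bias(N1, N2, d1, d2),
      " unidentified variance term =", eq3_unidentified(mu1, mu2, N1, N2))

# Hypothetical poorly matched pairs: sizes, effects, and means differ within pairs.
N1b, N2b = np.array([100, 200, 60]), np.array([180, 80, 150])
d1b, d2b = np.array([0.50, 0.20, 0.90]), np.array([0.10, 0.70, 0.30])
mu1b, mu2b = np.array([2.0, 4.5, 1.0]), np.array([3.5, 2.0, 2.8])
print("poor match: bias =", eq2_bias(N1b, N2b, d1b, d2b),
      " unidentified variance term =", eq3_unidentified(mu1b, mu2b, N1b, N2b))
```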


A second observation follows from the first. A prudent investigator will think carefully about the nature of the treatment effect and, without making assumptions that violate the spirit of random assignment, perform data transformations that will more nearly result in additive, homogeneous treatment effects. For example, in the nursing home illustration, it seems sensible to assume that the treatment effect is multiplicative, so that the treatment effect varies across units and, specifically, is proportional to the underlying rate of falling for each resident absent the intervention. This suggests estimating the logarithm of the treatment effect for a pair by substituting the logarithm of the mean observed rate of falls for the mean rate of falls in earlier equations. This transformation leads to a model where the logarithm of the treatment effect is plausibly additive and homogeneous within pairs, and consequently, the estimator for the average treatment effect is unbiased. The logarithmic transformation does not guarantee unbiased estimates because the effect can be multiplicative and heterogeneous, but an improved specification is welcome, especially when the transformation only has to work for members of a pair.

A third observation extends the second. In the nursing home illustration, the intervention is partly universal within a nursing home (e.g., removing obstacles that might cause falls for every resident) and partly directed at residents who are predictably at high risk of falling (e.g., case planning for residents based on their fall histories). This suggests that within a nursing home, there are two populations—low-risk and high-risk residents—and that the treatment effect will differ between them. The problem is that the mix of low-risk and high-risk residents may vary across nursing homes, leading to heterogeneity in the treatment effects between nursing homes within a pair. However, if high- and low-risk populations can be distinguished using statistical procedures, the bias might be reduced or removed by poststratification and estimation of the two treatment effects, one for low-risk residents and one for high-risk residents. Because randomization was not predicated on risk, poststratification is a valid estimation strategy.7 Continuing to extend this reasoning, the bias might be further reduced by introducing covariates into the analysis instead of or in addition to using poststratification, but discussion of that special topic is reserved because introducing covariates requires special (but simple) manipulation of the data.

Taken together, these observations imply strategies for reducing bias in the estimated treatment effect and identifying the standard error. With regard to the standard error, looking at Equation 3, a close match will drive the term in parentheses toward zero, and using prestratification to adjust for cluster characteristics and poststratification to adjust for unit characteristics may additionally improve the nearness of the match. The variance is then identified or nearly identified. With regard to bias, looking at Equation 2, matching is important to assure that the cluster members of each pair have the same average treatment effects, and beyond pairing, homogeneity of the treatment effect within pairs can be promoted by data transformations, prestratification based on cluster characteristics, and poststratification based on unit characteristics. None of this is doable without substantive knowledge of the programs being evaluated, so evaluation is not the exclusive province of statisticians; still, good evaluation requires a basic understanding of design considerations.

The discussion assumed matching on preintervention outcome measures, and while this is an attractive strategy for matching, other strategies must be employed when preintervention outcome measures are unavailable (Hayes & Moulton, 2009, p. 77). The absence of good matching variables is one of the reasons why the Medical Research Council (2002) cautioned against using pairwise matching. Furthermore, the use of an alternative matching strategy is required when there are multiple outcome measures.8 The relative preference for PCR and CR, and for the intermediate position of stratification into blocks (Hayes & Moulton, 2009; Imbens, 2011),9 is likely to differ across disciplines and evaluation topics. In an age of information technology, preintervention outcome measures are often accessible through information systems, increasing the quality of the matches and strengthening the justification for using PCR. However, if an investigator's concern is with public health interventions in a developing nation, for example, preintervention outcome measures may be unavailable, and stratification based on location or other observable criteria may be best. The pairwise match might be seen as good in the former case, leading to the PCR design, and questionable in the latter case, leading to a CR design.

Regardless of how he or she matches, an investigator must be mindful of why he or she is matching, and that is the principal subject of this section. Most discussions of this topic concern variance reduction, a topic to which this article turns in the following sections, but from Equations 2 and 3, we see that variance reduction is not the only important consideration when matching. From Equation 2, we see that estimation will be least biased if matched clusters have about the same population sizes and if the size of the treatment effect is about the same within pairs, or at least not widely different. From Equation 3, we see that the estimate of the variance will be identified if the mean outcomes in the counterfactual state are the same in the postintervention period. These observations require that the matching criteria be prospective; perhaps the past is prologue, but this need not be true.


We are using the past to anticipate the future, but a mechanical application of past commonalities among clusters should not substitute for a consideration of future commonalities among clusters, including anticipation of the capacity and willingness of clusters to implement the treatment. The nursing home evaluation is using a mixed strategy of pair matching (based on retrospective data) and stratification (based on the anticipated willingness and ability of homes to implement the intervention). These are forward-looking criteria, which presume that the past is indicative of the future but which also anticipate, based partly on subjective assessment, the ability and willingness of nursing homes to successfully implement the treatment.

In sum, knowledge of the substantive problem and thoughtful design can overcome or at least minimize the bias in the estimated treatment effect, a useful observation from a mean-squared error perspective, given that the PCR design can be much more efficient than the CR design. This article has not presented an estimator for the variance, but the following two sections on nonparametric and parametric estimation discuss that issue, after which the article turns to the question of whether and when the PCR design is more efficient than the CR design.

Traditional Nonparametric Solutions

There are two generic forms of estimators for δ, one nonparametric and the second parametric or semiparametric. This section considers the nonparametric estimator. The discussion is introductory and elementary; readers who seek to go beyond this rudimentary summary might consult Sprent and Smeeton (2007). Throughout this section, assume one-to-one matching, and assume that the clusters are about the same size or, if the clusters are not about the same size, that the error distributions are not grossly asymmetric. These are innocuous assumptions (Murray, 1998, p. 116), and interested readers can learn about consequences in rare cases where they fail (Gail et al., 1996).

The Wilcoxon signed rank test makes no distributional assumptions other than an innocuous assumption regarding symmetry. The associated Hodges–Lehmann estimator leads to point estimates and confidence intervals. The signed rank test (1) starts with the absolute values of the K estimated treatment effects, (2) ranks them, assigning 1 to the smallest absolute value and K to the largest, and (3) then sums the ranks of the pairs whose estimated treatment effects are positive. The sum of the ranks is the test statistic. Under the null, the sums of the ranks for the negative and positive outcomes should be about equal, so in expectation the sum S will equal Σ_{k=1}^{K} k / 2 = K(K + 1)/4. There is a small probability, (1/2)^K, that the sum could be as large as Σ_{k=1}^{K} k, that is, that the rank score is positive for all K pairs; a somewhat smaller probability that it is positive for K − 1 of the pairs; and eventually a probability of .05 or less that it is positive for K, K − 1, . . ., K − C pairs. For a one-tailed test, K − C identifies the critical value of the test statistic.

The Hodges–Lehmann estimator lacks the intuitive appeal of the signed rank test itself, and we will not provide an explanation for deriving a point estimate and confidence intervals. Interested readers might consult Sprent and Smeeton (2007, p. 53) for one explanation and Rosenbaum (2009, p. 43) for an explanation from an alternative perspective. The important point is that we can derive a nonparametric test statistic for the null of no treatment effect, a point estimate for the median treatment effect, and confidence intervals for the median treatment effect without making distributional assumptions—a good thing, given that the variance discussed earlier is not identified. However, the Hodges–Lehmann estimator leads to an estimated median across the clusters using what was previously called a cluster average weight. Because of symmetry, the median should be close to the mean based on the population or sample average weights, provided the clusters are about the same size. If the clusters have very different sizes, however, the Hodges–Lehmann estimator may be much different from the parametric estimators.

The nonparametric estimator has desirable features and is more consistent with the spirit of random assignment than are parametric estimators. The estimated treatment effect is no longer biased, although this quality is purchased at the price of using cluster average weights to identify the median effect across the clusters. There is no need to estimate a sampling variance (i.e., Equation 3) because the permutation test depends exclusively on the design. There is a large sample variance approximation, but this too comes from the design, not the data, and specifically does not depend on knowing μ_k1 − μ_k2. In the spirit of random assignment, the nonparametric estimator has much to recommend it, although its utility is limited when K is small, and it does come at the cost of reduced efficiency compared with the parametric estimator, discussed next.
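A minimal sketch of this procedure, applied to hypothetical pairwise differences for K = 13 pairs: scipy's wilcoxon routine computes the signed rank statistic and a p-value (using an exact or normal-approximation null distribution rather than the tabulated critical values described above), and the Hodges–Lehmann point estimate is the median of the Walsh averages of the differences.

```python
import numpy as np
from scipy import stats
from itertools import combinations_with_replacement

# Hypothetical pairwise differences (treatment cluster mean minus control cluster mean)
# for K = 13 matched pairs.
d = np.array([-0.40, -0.15, -0.55, 0.10, -0.30, -0.25, -0.05,
              -0.45, -0.20, 0.05, -0.35, -0.50, -0.10])

# Wilcoxon signed rank test of the null of no treatment effect.
res = stats.wilcoxon(d, alternative="less")   # one-sided: treatment reduces the outcome
print("signed rank statistic:", res.statistic, " p-value:", res.pvalue)

# Hodges-Lehmann point estimate: median of the Walsh averages (d_i + d_j)/2.
walsh = [(d[i] + d[j]) / 2.0
         for i, j in combinations_with_replacement(range(len(d)), 2)]
print("Hodges-Lehmann estimate of the median treatment effect:", np.median(walsh))
```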

Parametric, Partially Parametric, and Other Nonparametric Solutions

Nonparametric tests are in the spirit of random assignment because no distributional assumptions are required to interpret test statistics or to derive point estimates and confidence intervals. But by making distributional assumptions, the same sort of assumptions seen as largely innocuous in other settings, we can increase power. So what should be said about substituting parametric estimators for nonparametric estimators? Whatever can be said in general about the substitution, the PCR design poses special problems that stem from random sampling within pairs. The nature of the problem has already been discussed: With the data at hand, we cannot estimate the variance represented in Equation 3 without making additional assumptions.

To illustrate the difficulty, and to introduce a regression-based estimator, consider a fixed-effect regression estimator (Bloom, 2005). Bloom's (2005) specification is a simple linear model where K dummy variables represent the matched pairs and an additional dummy variable represents the receipt of treatment. The companion paper shows that this estimator reduces to the estimator shown by Equation 2 except for the weighting scheme. In Equation 2, the weights were proportional to N_k1 + N_k2, so the bias was with respect to the population average weights. Now the weights are proportional to n_k1 + n_k2. Substituting this new weighting scheme into Equation 2 leads to the conclusion that the estimated treatment effect is biased. The earlier discussion of using design to reduce bias continues to apply, and, moreover, there is really no need for this distinct regression-based estimator. Estimation based on cluster-level summaries will work just as well, and it will not obscure the weighting scheme.

The estimated variance is again problematic. Based on results shown in the companion paper, Equation 3 still represents the problem, provided we substitute the sample-specific weights for the population-specific weights. As before, the problem is that we cannot estimate the term in parentheses, Δ = μ_k1 − μ_k2. In practice, an evaluator would likely not estimate the regression with the above-mentioned specification. He or she might instead apply a more efficient generalized least squares estimator. The Monte Carlo simulation in fact adjusts the fixed-effect estimator to account for heteroscedasticity, although this adjustment is only partial because it does not adjust for the unknown Δ. Adjusting for heteroscedasticity is likely to improve efficiency but, as already shown, does not eliminate the problem with bias. Furthermore, when the treatment effect is heterogeneous across clusters, we cannot be sure what the regression is actually estimating.

We have purposefully shown that the regression estimator can be expressed as a cluster-level estimator that adopts a specific weighting scheme, thereby demonstrating that in general the fixed-effect least squares estimator will lead to a biased standard error for the estimated treatment effect. Imai, King, and Nall (2009, p. 36) suggest a nonparametric variance estimator that provides an upper limit on the variance regardless of the weighting scheme. Using the notation from the present article, their estimator is:

" (

VARð^ dÞPCR;IKN ¼

0P nk1

Yik1

K Bi¼1 K X wk TRk B @ nk1 ðK  1Þ k¼1

0P nk2

nk1 P

nk2 P

Yik2

1

C C  i¼1 nk2 A

1

Y Y Bi¼1 ik2 i¼1 ik1 C B C þ ð1  TRk Þ@  nk2 nk1 A

#2

) 

ð4Þ

1^ d ; K

where wk is one of several weighting schemes that sum to 1 and ^d is the estimated average treatment effect across the paired clusters, written as: ^d ¼

K X k¼1

¼

K X

dk wk ^ 8 > >
> k¼1 :



Yik2

i¼1

nk2

1

0P nk2

Yik2

C B C þ ð1  TRk ÞBi¼1 A @ nk2

19 > Yik1 > C= i¼1 C :  nk1 A> > ; nk1 P

ð5Þ Both the treatment effect and the upper limit for its sampling variance are easily estimated by combining cluster-level estimates. The sample statistic is treated as being distributed as Student’s t with K  1 degrees of freedom. Improvements are possible; interested readers should consult Imai, King and Nall. It is disconcerting that we cannot estimate the sampling variance, although estimating an upper limit, and designing the analysis to cause the upper limit to be near the true variance, is comforting. There is another way to proceed, which rests on some assumptions and the availability of preintervention data. The necessary assumption is that Dk ¼ mk1  mk2 does not change from the preintervention period to the postintervention period. This is equivalent to assuming that the mean outcomes under the counterfactual states do not change from the preintervention period to the postintervention period, or if they change, the change is the same for each cluster in every pair of clusters. Then we can estimate Dk using preintervention data, and from Equation 3 we can estimate the sampling variance. (An alternative is to assume that Dk is the same for all k.) This is admittedly an ad hoc justification.10
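As a sketch of how Equations 4 and 5 (as reconstructed above) combine cluster-level estimates, the following code uses hypothetical cluster means, equal weights w_k = 1/K, and the convention that the treatment cluster is listed first in each pair (so TR_k = 1 throughout), then refers the test statistic to a t distribution with K − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical cluster-level sample means for K matched pairs; within each pair the
# treatment cluster is listed first, so TR_k = 1 for every pair.
ybar_t = np.array([2.10, 3.40, 1.80, 2.90, 2.50, 3.10, 1.95])
ybar_c = np.array([2.60, 3.90, 2.00, 3.50, 2.85, 3.45, 2.25])
K = len(ybar_t)
w = np.full(K, 1.0 / K)                 # one of several admissible weighting schemes

d_k = ybar_t - ybar_c                   # pairwise differences
delta_hat = np.sum(w * d_k)             # Equation 5

# Equation 4: Imai-King-Nall upper bound on the sampling variance.
var_ub = (K / (K - 1.0)) * np.sum((w * d_k - delta_hat / K) ** 2)

t_stat = delta_hat / np.sqrt(var_ub)
p_value = 2.0 * stats.t.sf(np.abs(t_stat), df=K - 1)
print(f"estimated effect {delta_hat:.3f}, variance bound {var_ub:.4f}, "
      f"t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
```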


Some generalizations follow from these results. Equation 3 shows that we can reduce variance, to a point, by leaving K fixed and increasing n, but we cannot force the variance to zero without increasing the number of pairs. Apparently, we can reduce the variance toward zero by increasing the number of pairs toward infinity (Schochet, 2012), although this sanguine implication is misleading. The availability of clusters is typically fixed, relatively small, and, where large, may be inaccessible because of budget constraints.

This subsection has illustrated a fundamental problem with the PCR design, namely, the lack of a straightforward method for estimating Δ_k. Here, the Wilcoxon signed rank test and the Hodges–Lehmann estimator have advantages, partly offset, of course, by the fact that the nonparametric estimator is less efficient than its parametric counterpart and that the nonparametric approach estimates the median across the clusters; but in the spirit of random assignment—where avoiding assumptions is desirable—the nonparametric estimator has a clear advantage. Perhaps we should not worry too much about the problem and instead concentrate on making good matches so that the unknown variance component becomes negligible.

Readers of other articles regarding PCR will note the absence of the term intercluster correlation coefficient, often expressed as ρ_A and written as ρ_A = Δ²/(σ² + Δ²). Most treatments of PCR begin with ρ_A, so where has it gone? Assuming that Δ is the same across clusters, applying the sample weights, and solving for Δ², the companion paper shows that the variance term can be expressed using the intercluster correlation coefficient:

$$\mathrm{VAR}\!\left(\hat{\delta}\right) = \frac{\sigma^{2}}{K}\left(\frac{2}{n} + \frac{\rho_A}{1-\rho_A}\right). \quad (6)$$

The rhetoric is that we cannot estimate the variance because we cannot estimate ρ_A, but obviously this is the same as not being able to estimate Δ. Nevertheless, this observation ties the discussion in this article into discussions presented in standard sources. Although comparable derivations appear elsewhere, the current derivation—which uses the Neyman mechanism to treat the variance as arising from a sampling problem—clarifies the source and importance of the intercluster correlation coefficient.

PCR or Cluster CR?

Is PCR always better than a simpler CR design? To answer that question, consider an intuitive argument before turning to a more formal argument.


Across the population of clusters, the investigator can observe J mean outcomes. Using a CR design, after random assignment an investigator will be able to estimate μ_j + δ_j for 1, . . ., K of these clusters and μ_j for K + 1, . . ., J of these clusters. By chance the investigator might have randomly sampled for treatment the K clusters that had the lowest values of μ_j + δ_j, in which case the estimated treatment effect would be much too low; alternatively, by chance the investigator might have randomly sampled for treatment the K clusters that had the highest values of μ_j + δ_j, in which case the estimated treatment effect would be much too high. Given random assignment, the extreme highs and lows will balance, but the sampling variance will be high. The relative advantage of the PCR design is that good matching will reduce the probability that these extreme outcomes will occur, so on this ground the PCR estimator is more efficient than the CR estimator.

Somewhat offsetting this advantage is that the CR design provides J data points and J − 2 degrees of freedom for the test statistic based on an unmatched t-test, while the PCR design provides K data points and K − 1 degrees of freedom based on a matched t-test. This reduction in degrees of freedom affects the test statistic and the confidence interval (Klar & Donner, 1997; Martin, Diehr, Perrin, & Koepsell, 1993) because the shape of the t-distribution depends on the degrees of freedom. For example, to reject the null hypothesis using a one-tailed test requires a test statistic of 2.02 when the degrees of freedom equal 5, a test statistic of 1.82 when the degrees of freedom equal 10, a test statistic of 1.75 when the degrees of freedom equal 15, and the familiar 1.65 when the degrees of freedom are large. This degrees-of-freedom disparity has led to an often repeated rule of thumb that the PCR design should be considered only when there are 10 or more pairs (Diehr, Martin, Koepsell, & Cheadle, 1995; Feng, Diehr, Peterson, & McLerran, 2001), although this rule of thumb is qualified by its proposers, and it is disputed by others (Imai, King, & Nall, 2009). This article recommends avoiding the rule of thumb in favor of an approach explained in the Monte Carlo section of this article.

The companion paper shows that the efficiency of the PCR design depends on two factors. The first is the proportion of variance contributed by intracluster correlations. If it is negligible—that is, if there is little variation in μ_j across the clusters—then neither design is superior in terms of efficiency. (We would be concerned with lost degrees of freedom from the PCR design, so the CR design would be preferred.) Intuitively, the evaluator had no chance to make "bad draws" when assigning clusters to treatment and control. Otherwise, for the PCR design to be more efficient, the companion paper shows that the following inequality must hold:

$$\sum_{k=1}^{K}\left(\mu_{k1}-\mu_{k2}\right)^{2} \; < \; \cdots$$
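The comparison just described can be explored numerically. The article's own Monte Carlo procedure appears in its Monte Carlo section; the sketch below is only an illustration of the general idea, under invented parameter values: simulate outcomes from baseline-like cluster means, analyze each replication once as a CR design (unmatched t test on J cluster means, J − 2 degrees of freedom) and once as a PCR design (matched t test on K pair differences, K − 1 degrees of freedom), and compare rejection rates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2024)

J, K = 26, 13                 # clusters and matched pairs, as in the running example
n = 50                        # hypothetical units sampled per cluster
delta = -0.30                 # hypothetical treatment effect
sigma_between, sigma_within = 0.60, 1.00
n_reps, alpha = 2000, 0.05

# Hypothetical fixed cluster means (stand-ins for baseline data used for matching).
cluster_means = np.sort(rng.normal(3.0, sigma_between, size=J))
pairs = cluster_means.reshape(K, 2)          # adjacent clusters form the matched pairs

reject_cr = reject_pcr = 0
for _ in range(n_reps):
    # CR design: randomize K of the J clusters to treatment, ignoring the pairing.
    treat_cr = rng.permutation(np.repeat([1, 0], K))
    ybar_cr = cluster_means + delta * treat_cr + rng.normal(0, sigma_within / np.sqrt(n), J)
    t_cr, p_cr = stats.ttest_ind(ybar_cr[treat_cr == 1], ybar_cr[treat_cr == 0])
    reject_cr += p_cr < alpha

    # PCR design: randomize within each matched pair, then test the K pair differences.
    flip = rng.integers(0, 2, size=K)
    mu_t = np.where(flip == 0, pairs[:, 0], pairs[:, 1])
    mu_c = np.where(flip == 0, pairs[:, 1], pairs[:, 0])
    d_k = (mu_t + delta - mu_c) + rng.normal(0, sigma_within * np.sqrt(2.0 / n), K)
    t_pcr, p_pcr = stats.ttest_1samp(d_k, 0.0)
    reject_pcr += p_pcr < alpha

print(f"estimated power, CR design:  {reject_cr / n_reps:.2f}")
print(f"estimated power, PCR design: {reject_pcr / n_reps:.2f}")
```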
