Evaluation and Program Planning 48 (2015) 149–159


Hierarchy of evidence and appraisal of limitations (HEAL) grading system

P. Cristian Gugiu

Quantitative Research, Evaluation, and Measurement, Department of Educational Studies, The Ohio State University, Columbus, OH, United States

A R T I C L E  I N F O

Article history: Available online 19 August 2014

A B S T R A C T

Despite more than 30 years of effort dedicated to the improvement of grading systems for evaluating the quality of research study designs, considerable shortcomings remain. These shortcomings include the failure to define key terms, provide a comprehensive list of design flaws, demonstrate the reliability of such grading systems, properly value non-randomized controlled trials, and develop theoretically derived systems for penalizing and promoting the evidence generated by a study. Consequently, in light of the importance of grading guidelines in evidence-based medicine, steps must be taken to remedy these deficiencies. This article presents two methods – a grading system and a measure of methodological bias – for evaluating the quality of evidence produced by an efficacy study.

© 2014 Elsevier Ltd. All rights reserved.

Historical antecedents of modern evidence-based medicine

Although modern notions of evidence-based medicine surfaced in the early 1970s, its origins may be traced back much further (Claridge & Fabian, 2005; Doherty, 2005). One of the earliest recorded examples of a randomized controlled trial (RCT) occurred in 1662, when Jan Baptista van Helmont proposed that 200 or 500 people with fever or pleuritis be divided by casting lots (i.e., a random process) into two groups. He would then treat one group by sensible evaluation while another doctor would treat the other group by bloodletting. The better method would be judged by the number of funerals held. While it is unclear whether this study was ever conducted, other studies that challenged the efficacy of bloodletting were conducted. In 1816, Alexander L. Hamilton reported an experiment in which 366 sick soldiers were assigned to one of three surgeons, based on alternate allocation, and attended with the same care and comforts, to the extent possible. Hamilton reported that the surgeon who used bloodletting had 10 times the number of deaths as the other two surgeons. In another study, a French physician, Pierre Charles-Alexandre Louis, published a monograph in 1835 detailing a retrospective numerical analysis of case series to prove bloodletting was an ineffective treatment for fevers. Unfortunately, the practice of bloodletting did not end until the late nineteenth century.


Scurvy is another disease that illustrates the evolution of evidence-based medicine. While the disease can be traced at least as far back as Vasco da Gama's expedition to the Cape of Good Hope (1497–1499), a remedy was not known until 1617, when John Woodall wrote that citrus products helped treat the condition. Unfortunately, this observation went largely unheeded for over a century. However, in a landmark study in 1747, James Lind conducted an experiment wherein 12 sailors with symptoms of scurvy were selected on the basis of the similarity of their cases. The sailors were then divided into six pairs and each pair received a different diet supplement. After 14 days, Lind concluded that only the pair that received the citrus supplement showed a dramatic recovery. Regrettably, over 40 years passed before the British Navy approved the use of lemon juice as a preventative dietary regimen for sailors.

Notably, most historical antecedents of evidence-based medicine do not trace back to RCTs. Rather, the roots of evidence-based medicine appear to be firmly planted in non-randomized controlled trials (non-RCTs). Modern science would classify Woodall's observations as a before-after (i.e., pretest–posttest) study, Hamilton's study as a pseudo-RCT, Lind's study as a paired comparison, and Louis' monograph as the beginning of medical statistics. These historical examples also illustrate that once a belief or practice becomes widely accepted as evidence or scientific, it is difficult to dislodge, regardless of whether or not it is empirically valid. Thus, history teaches us to be careful about what we accept into our knowledge base.

The purpose of this article is to present two methods – a grading system and a measure of methodological bias – for determining the


strength of evidence produced by an individual study. These methods were developed by compiling study design limitations found in a review of the literature on the Chronic Care Model (CCM). However, the two methods can be used to evaluate the quality of evidence generated by any efficacy study. Although alternative definitions exist, the term "evidence" will be defined herein as the body of information produced by an empirical research study in support of or refuting the efficacy of a treatment. A grading guideline (system), then, is a qualitative rubric employed by researchers and clinicians for the purpose of developing an empirical base of evidence in support (or opposition) of a particular course of treatment. Hence, grading systems have come to define what researchers view to be credible evidence. Moreover, the study designs promoted by these guidelines are deemed by many researchers, including government agencies, as more worthy of receiving research funding than other study designs (Kessler & Glasgow, 2011). Nevertheless, the use of current grading guidelines for these purposes is not without criticism.

Limitations of existing guidelines for grading evidence

While the transformation of evidence into practice may have once been stymied by communication barriers, today medical practitioners and researchers are faced with the opposite problem. More than 10,000 new RCTs are added to Medline each year (Chassin, 1998). Additionally, the requisite knowledge to evaluate the quality of these studies has grown unabated. To fully gauge the strength of evidence described within a study, at minimum, users must be thoroughly educated in research theory and applied statistics. As a result, the immense volume of scientific literature and the requisite scientific knowledge for understanding it have created an insurmountable barrier for most physicians. Yet, physicians are expected to integrate best practices into medical decisions. Therefore, guidelines have been developed to aid medical practitioners, researchers, and policy-makers in determining the degree to which a study's findings are trustworthy. Essentially, these guidelines evaluate the strength of evidence generated by a study based upon the quality of the research design, although some effort has gone into extending the focus to other factors (e.g., directness of evidence, precision of results) (Gugiu & Gugiu, 2010). However, despite more than 30 years of effort dedicated to the improvement of grading systems, considerable shortcomings subsist (Berger & Alperson, 2009; Gugiu & Gugiu, 2010; Alperson & Berger, 2011). Existing medical research grading systems, like the Canadian Task Force on the Periodic Health Examination (CTFPHC) (1998), U.S. Preventive Services Task Force (USPSTF) (1996), Australian National Health and Medical Research Council (ANHMRC) (1998), Oxford Centre for Evidence-based Medicine (OCEBM) (2009), and Grades of Recommendation, Assessment, Development, and Evaluation (GRADE) (Atkins et al., 2005), for instance, neglect to define key terms (e.g., 'well-designed', 'serious limitation', 'poor quality'), which, in turn, limits their usefulness. Moreover, they only provide prospective users (e.g., medical researchers, grant peer reviewers, journal editors) with a short list of design limitations upon which to evaluate the strength of evidence produced by a study. To date, the only published reliability analysis of a grading system (Atkins et al., 2005) found a Kappa statistic¹ of 0.27. Given the critical role reliability serves in scientific research, it is surprising that so many government and private organizations that champion evidence-based medicine have opted to adopt systems of unknown or poor reliability.

¹ In fact, the Kappa statistic is a measure of (inter-rater) agreement, not (inter-rater) reliability. Agreement estimators are absolute measures of consistency whereas reliabilities are relative measures. Reliability estimators focus on the relative rank-order of measures. Hence, they can be used to measure the consistency (replicability) of a variable (e.g., rating), including the average of two or more variables, and to construct confidence intervals on such variables. In contrast, agreement estimators focus on whether two or more raters coded a variable in identical (or nearly identical) fashion. Therefore, agreement estimators are subject to how "agreement" is defined (i.e., exact agreement versus agreement within a tolerance region). Additionally, such estimators are affected by the number of categories in the variable of interest and whether the categories are equiprobable. As a general rule, agreement estimators are appropriate for nominal data, whereas reliability should be used for all other levels of measurement. In the case of grading guidelines, reliability estimators are more appropriate than agreement estimators because grades (i.e., the product of a grading system) can be rank-ordered.
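To make the distinction in the footnote above concrete, the sketch below contrasts an agreement estimator (Cohen's kappa, computed directly) with a simple rank-order estimate (a Spearman correlation, used here as a rough stand-in for a reliability coefficient such as an intraclass correlation). The ratings are hypothetical and are not data from any grading study.

```python
# Illustrative only: hypothetical grades from two raters, not data from the article.
# Contrasts an agreement estimator (Cohen's kappa) with a rank-order estimate
# (Spearman correlation) of the kind a reliability analysis would favor.
from collections import Counter

import numpy as np
from scipy.stats import spearmanr

# Two raters grade ten studies on an ordinal A-D scale coded A=4 ... D=1.
rater_1 = np.array([4, 4, 3, 3, 2, 2, 2, 1, 1, 1])
rater_2 = np.array([4, 3, 3, 2, 2, 1, 2, 1, 1, 2])  # mostly off by one category

def cohens_kappa(a, b):
    """Exact-agreement kappa for two raters grading the same items."""
    categories = sorted(set(a) | set(b))
    n = len(a)
    p_observed = np.mean(a == b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohens_kappa(rater_1, rater_2)
rho, _ = spearmanr(rater_1, rater_2)

# Kappa penalizes every non-identical pair of grades, whereas the rank-order
# estimate credits raters who order the studies consistently.
print(f"Cohen's kappa (exact agreement): {kappa:.2f}")
print(f"Spearman rank correlation:       {rho:.2f}")
```

With ratings like these, exact agreement is modest even though the two raters rank the studies in nearly the same order, which is the point the footnote makes about choosing an estimator that respects the ordinal nature of grades.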

Another shortcoming of existing guidelines is that they incorrectly assume RCTs are substantially better than other study designs. Thus, poorly conducted RCTs are typically rated at the same level of evidence or higher than well-conducted non-RCTs (Gugiu & Gugiu, 2010, 2011). Hypothetically, if a RCT failed to administer the treatment to a substantial proportion of the intervention group, should it not receive a lower grade than a non-RCT that administered the treatment? Under most current grading systems, one must conclude there is moderate evidence to support the finding that the intervention does not work. This is the same level of evidence that would be assigned to a non-RCT that might find the opposite result. Until recently, a non-RCT could never be rated as high as a RCT. The GRADE system, however, has provided a mechanism for upgrading evidence produced by a non-RCT, provided that the study exhibits a large effect size. However, meta-epidemiological studies that have attempted to limit the discrepancies between study designs of RCTs and non-RCTs found similar effect sizes (MacLehose et al., 2000; Benson & Hartz, 2000; Concato, Shah, & Horwitz, 2000; Wilson & Lipsey, 2001). Until further evidence becomes available to justify requiring large effect sizes from non-RCTs in order for them to be accorded higher grades, this practice may be an arbitrarily high standard that sets a precedent in which the results of RCTs with serious limitations are given the same credence (or greater) as those of well-conducted non-RCTs. Therefore, a new grading system is needed that reflects the strengths and weaknesses of a study's design and the quality with which it was implemented.

² Some grading systems (e.g., ANHMRC, OCEBM, GRADE) place meta-analytic studies of RCTs atop the evidentiary pyramid. There can be no doubt that a meta-analysis of properly conducted RCTs yields evidence of higher caliber than a single RCT, which when properly conducted yields more trustworthy evidence than non-RCTs. However, inclusion of meta-analyses within a grading system shifts the focus from the individual study to the body of literature. Moreover, it is not clear how one would grade the quality of evidence of a meta-analysis of RCTs where some of the studies had severe limitations. Would it be reasonable to say that the level of evidence produced by such a meta-analysis was greater than that of a single properly conducted RCT? Probably not. Furthermore, it is not entirely clear how a meta-analysis of non-RCTs or one that included both RCTs and non-RCTs should be weighted in an evidentiary scheme. If one believed that the biases were random, then they, most likely, would cancel each other out. On the other hand, if the biases systematically increased (or decreased) their associated effect size, then the effect size produced by the meta-analysis would also be systematically biased. That is, the results of meta-analyses of non-RCTs can be stronger or weaker than those of a single RCT. Hence, the grading system proposed herein avoids this conceptual quagmire by excluding meta-analysis.

Framework of the 'Hierarchy of Evidence and Appraisal of Limitations' (HEAL) grading system

Despite the aforementioned limitations, prior guideline developers have greatly advanced the science of grading evidence, with new guidelines improving upon the accomplishments of prior grading systems. To begin, there is near universal agreement regarding the hierarchy of evidence (Dupont, 1991; Barker, 2009). RCTs are considered the 'gold standard' due to their ability to eliminate rival hypotheses². Non-RCTs, such as cohort and case-control studies, are regarded as the next best alternative due to their ability to make comparative decisions, whereas uncontrolled


trials (UTs), like before-after or case-series studies, are not highly regarded since they are unable to reach comparative conclusions or demonstrate what would have occurred in the absence of the intervention (i.e., the counterfactual) (CTFPHC, 1998; USPSTF, 1996; ANHMRC, 1998; OCEBM, 2009; Atkins et al., 2005). Lastly, expert opinion and descriptive studies are given no weight because they do not test efficacy. While research theory firmly supports this hierarchy (Kazdin, 1992; Shadish, Cook, & Campbell, 2002), it also suggests that the classification of all non-RCT studies into the same category obscures the distinction between equivalent controlled trials (ECTs) and nonequivalent controlled trials (NECTs).

The principal advantage of RCTs over all other designs is the ability to generate equivalent treatment groups via random allocation of subjects (Shadish et al., 2002). Therefore, differences between treatment groups following administration of an intervention are the product of the intervention. While no other study design can ensure equivalence (within chance variation) between treatment groups at baseline, several designs are capable of approximating equivalence. Pseudo-RCTs can closely approximate equivalence (Lewsey, Murray, Leyland, & Boddy, 1999), provided the allocation method is free of systematic bias. Matched comparisons can statistically remove the influence of measured confounding variables whereas stratified random sampling designs mitigate their effect through the sampling scheme (D'Agostino & D'Agostino, 2007). While these designs control only for measured variables, paired comparisons enable one to extend equivalence to some, albeit not all, unmeasured variables. Pairing relies on the use of dependent statistical tests (e.g., dependent t-test) on treatment and control patients matched on the similarity of their responses to key variables (Stevens, 2009). Regression discontinuity designs, on the other hand, seek to establish an equivalent before-after relationship (i.e., regression model) by allocating patients to treatment groups based on a cut-score on a measured covariate (Shadish et al., 2002). Thus, changes in the model (i.e., intercept and slope) are attributed to the intervention effect.

A critical weakness of most existing guidelines is the lack of definitions of key terms (Gugiu & Gugiu, 2010). The best method for defining these terms, such as 'well-conducted' or 'serious limitations', is to incorporate a comprehensive list of plausible design limitations into the grading system. A 'well-conducted' study then may be defined as one that does not contain any design limitations or one for which researchers mitigate the seriousness of the limitations, perhaps through the use of statistical analysis. While existing guideline developers turned to expert panels to generate lists of design limitations, these lists tended to be short and focused on RCTs.

Generation of a list of study design limitations

To generate a comprehensive list of design limitations, Gugiu, Westine, Coryn, and Hobson (2013) conducted a systematic review of the CCM literature. These studies were uncovered by a search of MEDLINE and several social science databases.
Although their search yielded a total of 159 articles, the list was reduced to 28 articles following exclusion of articles that only referenced the CCM model, provided qualitative results or descriptions of the research design, detailed how the intervention could be implemented, investigated statistical models, or reviewed the literature but did not statistically test the efficacy of the intervention. The final pool of studies consisted of 10 RCTs, 5 ECTs, 6 NECTs, and 7 UTs. The studies were clinically heterogeneous, measured numerous clinical outcomes, and reflected the efforts of over 100 medical researchers. The present paper incorporated examples of these CCM studies for the sole purpose of illustrating points made herein. However, hypothetical examples were also used to illustrate points


for which a CCM example could not be found. A systematic review of the 28 CCM studies, including a table listing each study's methodological flaws according to the HEAL grading system, can be found in Gugiu et al. (2013). This analysis yielded a total of 21 flaws or design limitations, which can be utilized to create a grading hierarchy (see Table 1) based upon the amount of defendable evidence produced by a study. At the top of the hierarchy are well-conducted RCTs; generally, these consist of studies that correctly executed the randomization process so as to ensure statistical equivalence between the treatment and usual care groups. The second class in the hierarchy consists of studies that attempted to establish statistical equivalence between the treatment and control group at baseline, followed by the third class in the hierarchy, which consists of studies that failed to establish a reasonable amount of statistical equivalence at baseline. Rounding out the HEAL classification system are studies that lack a true control group (i.e., usual care or placebo).

It is important to note that the application of the HEAL system requires an extensive amount of adjudication on the part of the reviewer who utilizes the system. One of the chief tasks is the determination of whether a potential threat to validity is in fact likely to introduce bias. For example, consistent with the consensus in the field (see Gugiu & Gugiu, 2010, and citations therein), it shall be argued that a RCT should be downgraded if a significant crossover existed in the study. However, an implicit underlying premise is that the limitation had the potential to affect the estimates of treatment efficacy. Naturally, if a reviewer utilizing the HEAL can reasonably conclude that the impact of this – or any other limitation discussed below – is negligible because the impact was small (e.g., the crossover effect accounted for a very small percentage of study participants) or the study's author(s) mitigated the magnitude of the effect of the limitation (perhaps via quantitative analyses), then it stands to reason that the study should not be penalized for the sake of abiding by the recommendations of the HEAL grading system. However, these "exceptions" in grading should always be noted. In a similar fashion, if the limitation had a potentially negative effect on efficacy yet the study found a positive effect, it is unreasonable to downgrade the study. That said, the limitation may have resulted in a negative bias, which should be considered if the purpose of the study was to estimate the magnitude of the efficacy of the treatment rather than simply establishing whether it worked or not. Therefore, the quality of evidence produced by a study depends upon the conclusions to be drawn. The remainder of this paper will present a detailed discussion of the 21 limitations considered by the HEAL system, followed by a discussion of how these limitations can be used to assess methodological bias, the latter of which can be employed to gauge the degree of bias introduced in the effect sizes reported in a study.

Common design limitations

The weight of design limitations on the validity of evidence produced by a study varies according to the number and severity of the limitations and whether adequate steps were taken to mitigate their impact. These limitations constitute either design or implementation flaws that can be classified in one of three categories. At the extreme end of the spectrum are 'serious limitations', which denote a collapse of the controlled comparison; while they predominantly occur in RCT, ECT, and NECT designs, some of these limitations can also occur in UT designs. In contrast, 'design-specific limitations' entail problems specific to a particular study design. Unlike serious limitations, these problems only operate on the assumption of equivalent treatment groups. Thus, their presence in a RCT or ECT design signals that the evidence produced may not be stronger than that produced by a NECT design. Finally, at the lowest end of the spectrum are 'minor limitations', which are comprised of issues that restrict the extent to which, or manner in which, evidence may be interpreted.


Table 1
Hierarchy of evidence and appraisal of limitations (HEAL) grading system.

Randomized control trial
  Grade A: A well-conducted RCT free of bias or confounding factors in which the randomization process is able to ensure the equivalence of the treatment and control groups beyond acceptable statistical chance.
  Grade B: A RCT with a flawed randomization process but which properly employed an ECT method for ensuring the equivalence of the treatment and control groups beyond a reasonable doubt.
  Grade C: A RCT with a flawed randomization process, which cannot ensure the equivalence of the treatment and control groups beyond a reasonable doubt. Alternatively, a cohort RCT that failed to show the population from which the samples were drawn remained stable over time.
  Grade D: A RCT with one or more of the following fatal flaws: treatment group received a very low dosage or poor treatment fidelity; significant cross-over between treatment and control subjects; follow-up period was insufficient to detect change; follow-up period (either timing or length of time) was significantly different between the treatment and control arms; sample size was significantly lower than the recommendations from an a priori power analysis; significant selection bias or differential attrition rates occurred between the treatment and control groups that produced baseline differences; or significant contamination of baseline outcome measures occurred.

Pseudo-RCT, matched comparison, stratified random sampling, paired comparison, regression discontinuity
  Grade B: A well-conducted ECT that demonstrated the statistical equivalence of the treatment and control groups on key baseline variables or the pre-post regression model.
  Grade C: An ECT with demonstrated baseline differences on key variables, or that did not employ an adequate number of covariates, pairing, or strata variables to sufficiently remove reasonable doubt regarding the statistical equivalence of the treatment and control groups on key baseline variables, or a regression discontinuity design that failed to demonstrate the equivalence of the pre-post regression model. Alternatively, a cohort ECT that did not demonstrate the population from which the samples were drawn remained stable over time.
  Grade D: An ECT with one or more of the fatal flaws listed under RCT Grade D.

Cohort study, case-control
  Grade C: A controlled study that did not adequately establish the equivalence of the treatment and control groups beyond a reasonable doubt.
  Grade D: A NECT with one or more of the fatal flaws listed under RCT Grade D.

Before-after, case series
  Grade D: Any study that did not employ a controlled comparison between two or more groups, including RCT, ECT, and NECT studies that did not employ a true comparison group (i.e., compared two treatments against each other but not against a treatment as usual group).

* Lower the overall grade by half a grade, if greater than D, for each of the following limitations: the length of the data collection period was extensive enough to significantly increase within-group variance on primary outcomes; potential cross-over of patients or treatment occurred; the active components of the intervention were difficult to identify because the intervention changed throughout the follow-up period; post-baseline covariates that are not expected to remain constant over time were used to establish equivalence at baseline; or the generalizability of the results was compromised due to differences between the sample and population.
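The sketch below illustrates one way the half-grade penalty in the table note might be operationalized. The numeric mapping of letter grades (A = 4.0 through D = 1.0) and the labels for half-steps (e.g., "B/C") are assumptions introduced for this example; they are not specified by the HEAL system itself.

```python
# Illustrative sketch of applying the Table 1 note: lower the overall grade by
# half a grade (if greater than D) for each listed limitation. The numeric scale
# and half-step labels are assumptions made for this example, not HEAL definitions.

GRADE_TO_SCORE = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
SCORE_TO_LABEL = {4.0: "A", 3.5: "A/B", 3.0: "B", 2.5: "B/C",
                  2.0: "C", 1.5: "C/D", 1.0: "D"}

def apply_half_grade_penalties(base_grade: str, n_limitations: int) -> str:
    """Lower a base HEAL grade by 0.5 per noted limitation, never below D."""
    score = GRADE_TO_SCORE[base_grade]
    for _ in range(n_limitations):
        if score <= GRADE_TO_SCORE["D"]:   # grades already at D are not lowered further
            break
        score -= 0.5
    return SCORE_TO_LABEL[max(score, 1.0)]

# Example: a well-conducted RCT (grade A) with two of the note's limitations,
# e.g., potential cross-over and an extensive data collection period.
print(apply_half_grade_penalties("A", 2))   # lowered from A to B
print(apply_half_grade_penalties("C", 3))   # floors at D
```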

Serious limitations

Numerous serious limitations can exist irrespective of the study design implemented. To begin, efficacy implies a relative comparison in producing a desired outcome and, therefore, can only be established by comparing the outcomes of either (1) two or more treatment groups (e.g., treatment versus a control group) or (2) a single group measured at multiple periods (e.g., time-series designs). With respect to the former method, however, researchers may also compare the treatment of interest to an alternative treatment. Unfortunately, this practice can only produce meaningful results when the efficacy of the alternative treatment has been established. Moreover, the results from such studies are difficult to integrate with studies that compare a treatment to a true control group. The absence of significant results also creates an interpretive quandary. Take, for example, a RCT designed to improve processes and outcomes for diabetics that randomized patients to either a standard or enhanced CCM group (Chin et al., 2007). Unfortunately, after controlling for multiplicity (familywise Type I error) via a Bonferroni adjustment (Feise, 2002; Holland & Copenhaver, 1988), the present author found only one process to be statistically significant. Similarly, in another RCT study (Otero-Sabogal, Owens, Canchola, & Tabnak, 2006), which was designed to increase breast cancer screenings and randomized patients to either a standard or enhanced CCM group, the standard CCM improved rescreening rates whereas the enhanced CCM did not. However, a number of other factors may explain these results. For example, the enhanced intervention took place in clinics located in

Southern California urban areas close to a metropolitan area whereas the standard treatment took place in clinics located in Northern California urban areas. Hence, geographic differences may have accounted for the lower than expected efficacy of the enhanced treatment. What is noteworthy is that neither of the prior studies demonstrated the efficacy of the CCM relative to a control group (i.e., standard care or a placebo³). Because it is conceivable that regular treatment may outperform the CCM, the absence of a true control group signifies that efficacy can only be established in relation to a baseline period. Thus, these studies, and others like them, are, in fact, UTs. No doubt for some researchers this statement will be objectionable. They will argue, and to some extent correctly so, that the comparison of two treatment groups is not synonymous with conducting an uncontrolled trial. However, the failure to include a treatment as usual group precludes one's ability to establish the efficacy of the treatment with respect to usual care, which is the standard comparison group across the vast majority of efficacy studies. Since no analytical method exists for correcting for this post hoc, such studies should be treated as UTs, unless the efficacy of the comparison group is well-known and widely accepted. This is not the case in the CCM literature at the present time.

³ Note, in many instances a placebo group is neither practical nor ethical. Hence, herein, the term control group is used to denote treatment as usual.

In pharmaceutical research, the minimum amount or intensity of a drug required to produce a therapeutic response in half the patients is known as the median effective dose. Expanding upon this concept, all interventions require a certain amount of "dosage" to yield a treatment effect – a clinically significant change on a desired outcome. Moreover, the absence of a treatment effect


whenever a study implements an intervention of exceedingly low dosage provides low evidence against the effectiveness of the intervention (European Medicines Agency, 1994). For example, a RCT study designed to prevent injury in children under five years old expected to observe a 40% reduction in the percentage of parents that were noncompliant with child safety practices 6 months following an intervention, which consisted of two brief counseling sessions, one with a physician and the other with a research health assistant (lasting between 10 and 15 minutes), and the provision of free safety equipment (Sangvai, Cipriani, Colborn, & Wald, 2007). Not surprisingly, the intervention did not result in a 40% reduction in noncompliant parents. However, despite the absence of a clinically significant result, it would be unwise to conclude that CCM cannot improve injury prevention in children based on a study that implemented such a weak intervention. Treatments that fail to implement the intervention as intended (i.e., lack fidelity) are also more likely to attain false negatives (i.e., fail to detect the treatment is effective, when in fact it is effective). For example, a RCT study (Krein et al., 2004) designed to improve clinical outcomes for diabetics reported having minimal or no contact with 40% of their patients and subsequently, failed to observe any statistically significant results. Hence, based on existing grading systems, the only inference one can draw is that there is reasonably strong evidence that CCM does not improve clinical outcomes for diabetics. Coincidentally, an ECT study (Dorr, Wilcox, Donnelly, Burns, & Clayton, 2005) found significant improvements in clinical outcomes for diabetics. So, which of these two studies produced the strongest evidence: the RCT study that did not fully implement the intervention and found no results, or the ECT study that implemented the intervention and found significant results? Another problem that may suppress the observation of significant effects is treatment crossover—the administration of key elements of the intervention to a significant portion of the control group. For example, a NECT study (Stroebel, Broers, Houle, Scott, & Naessens, 2000) designed to improve blood pressure control in patients with hypertension allowed 10 out of 30 clinical teams to switch to the treatment group nine months into the study. Of the eight primary hypotheses tested, only three were significant after controlling for multiplicity. However, it is conceivable the lack of significance for five of the primary hypotheses may have been due to the fact that a significant portion of the treatment patients received nine fewer months of the intervention. Another serious problem occurs when the intent-to-treat principle is violated. This principle states that all eligible patients should be included in an analysis – regardless of whether or not they received the intervention – to avoid producing biased results (Montori & Guyatt, 2001). The per-protocol principle, in contrast, states that patients should be analyzed based on whether or not they received the treatment (i.e., treatment-on-the-treated). Hence, control patients who received treatment are considered part of the treatment group and vice versa. A serious problem occurs when analyses based on these two principles produce different conclusions. 
For example, a UT study designed to improve the quality of chronic illness care for medically uninsured patients removed 11.4% of the sample because they acquired health insurance (Stroebel et al., 2005). In contrast to the result of the intent-to-treat analysis, the result of the per-protocol analysis revealed a clinically significant improvement in at least one chronic illness. However, a reasonable alternative explanation for this finding might be that healthier patients were able to find jobs that provided them with health insurance. As a result, the per-protocol analysis may have been conducted on an extreme subsample of patients, whose health improved on its own (via regression to the mean) rather than due to the treatment.
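The sketch below illustrates, with invented data, how an intent-to-treat analysis and a per-protocol analysis of the same trial can diverge when patients leave their assigned arm for reasons related to their health; none of the numbers come from the studies discussed above.

```python
# Hypothetical illustration of why intent-to-treat (ITT) and per-protocol analyses
# can reach different conclusions. All data are invented for this sketch.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

n = 200
assigned = np.repeat(["treatment", "control"], n // 2)   # randomized assignment
outcome = rng.normal(loc=0.0, scale=1.0, size=n)          # no true treatment effect simulated

# Suppose the healthiest 15% of control patients end up receiving the treatment,
# analogous to patients who leave the protocol for non-random reasons.
control_idx = np.where(assigned == "control")[0]
n_switch = int(0.15 * len(control_idx))
switchers = control_idx[np.argsort(outcome[control_idx])[-n_switch:]]
received = assigned.copy()
received[switchers] = "treatment"

# Intent-to-treat: analyze by randomized assignment, everyone included.
itt = ttest_ind(outcome[assigned == "treatment"], outcome[assigned == "control"])

# Per-protocol (treatment-on-the-treated): analyze by treatment actually received.
pp = ttest_ind(outcome[received == "treatment"], outcome[received == "control"])

print(f"ITT p-value:          {itt.pvalue:.3f}")  # analyzed as randomized; no effect was simulated
print(f"Per-protocol p-value: {pp.pvalue:.3f}")   # tends to suggest a spurious effect
```

Because the switchers were selected on the very outcome being analyzed, the per-protocol comparison is biased even though no treatment effect exists in the simulated data, which is the logic behind the regression-to-the-mean concern raised above.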


Two other problems that occurred in the CCM studies were insufficient follow-up time and contamination of baseline measures. In the former case, a RCT study (Ely et al., 2008) that administered self-management support to obese patients expected to observe a 5-kg weight loss at 90 days (primary outcome) despite the fact that a systematic review (McTigue et al., 2003) cited by the authors found results only after 12 months of behavioral counseling. Not surprisingly, the study failed to show a weight loss at 90 days but did find a significant weight loss at 180 days (secondary outcome). In the latter case, a UT study (Kimura, DaSilva, & Marshall, 2008) designed to improve the quality of care for diabetic patients collected baseline measures on 14 clinical practices, including three practices that pilot tested the program during the baseline period. Although the authors were able to find significant results, in part due to their large sample size (11,896), had the sample size been smaller or the number of patients with contaminated baseline measures been greater, it is conceivable that no significant differences would have been found.

Due to the difficulty and cost of following patients over extended periods of time, an increasingly popular strategy has been to utilize a cohort design (Glenn, 2004). A cohort consists of a group of patients that experience a particular event (e.g., the intervention) or its absence during a specific point in time (e.g., baseline, 1-year follow-up). However, such studies always result in the confounding of effects (e.g., time and participants). Therefore, one can never be certain whether an observed effect is a function of the intervention, the period measured, or the participants. At least with respect to participants, one can demonstrate that they were not responsible for the observed effect if one can show that the underlying population sampled at each period is equivalent. For example, Chin et al. (2007) demonstrated – using a range of demographic and clinical outcome variables – that the underlying population served by 34 health centers did not change over a 4-year period. Had a significant difference been observed on these variables, one would not be able to determine whether the significant differences in process and clinical outcomes were produced by the CCM or differences in the populations sampled.

One of the most frequent problems encountered in this review was underpowered studies – studies that lack the sample size to detect a desired effect size based on an a priori statistical power analysis. For example, a RCT (Piatt et al., 2006) designed to improve clinical and behavioral outcomes for diabetics enrolled 119 patients even though an a priori power analysis indicated 210 patients were needed. Not surprisingly, only one of the four primary hypotheses was statistically significant after adjusting for multiplicity. While a few of the existing grading guidelines penalize evidence to protect against false negatives whenever a study lacks statistical power, no guideline downgrades the level of evidence produced by an underpowered RCT below the evidence produced by a well-conducted non-RCT study. Thus, by not providing a mechanism by which the evidence generated by a weak RCT can be downgraded below that of a well-conducted non-RCT, the majority of existing guidelines inadvertently accept as true that a RCT with zero statistical power produces evidence at least as strong as that produced by a properly powered non-RCT.
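For readers unfamiliar with a priori power analysis, the sketch below shows the kind of calculation referred to above, using statsmodels. The effect size, power target, and the Bonferroni-style alpha for four primary hypotheses are hypothetical values chosen for illustration, not figures from Piatt et al. (2006) or any other study cited here.

```python
# Sketch of an a priori power analysis for a two-group comparison. All inputs
# (effect size, alpha adjustment, power target) are hypothetical illustrations.
import math

from statsmodels.stats.power import TTestIndPower

effect_size = 0.40          # hypothesized standardized mean difference (Cohen's d)
alpha_per_test = 0.05 / 4   # Bonferroni-style adjustment for four primary hypotheses
target_power = 0.80

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=alpha_per_test,
                                          power=target_power,
                                          alternative="two-sided")

print(f"Required sample size per group: {math.ceil(n_per_group)}")
print(f"Required total sample size:     {2 * math.ceil(n_per_group)}")
# Enrolling substantially fewer patients than this would leave the trial underpowered
# for its primary hypotheses, which is the situation described in the paragraph above.
```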
While the previous limitations increase the likelihood of attaining false negatives, some limitations can produce uncertain outcomes. Selection bias can generate false positives when the inclusion and exclusion criteria create a difference in the samples served by the treatment groups that can account for the intervention effect (Boehmke, 2004). In the Stroebel et al. (2000) study, for instance, there is no way to determine, based on the information reported, whether clinicians with healthy or unhealthy patients elected to switch from the control to the treatment group. Assuming the treatment was in fact ineffective, if healthy patients were removed from the control group, the


disparity in health outcomes would be enhanced (false positive). In contrast, if the treatment was effective, moving unhealthy patients to the treatment group would suppress the disparity in health outcomes (false negative). Another serious problem that can generate either false positive or false negative results is differential attrition between treatment groups (Miller & Hollist, 2007). Two of the primary hypotheses tested in the Sangvai et al. (2007) study, for example, required in-home observations to determine the degree to which parents were compliant with safety practices. However, few parents agreed to a home visit (10.6% CCM and 6.3% control). While a 4.3% difference between treatment groups may not normally ruin a study, the potential for bias was high given the low participation rate and the tendency among nonparticipants to own guns and not abide by proper safety practices. Therefore, the statistically (but not clinically) significant improvement reported by the authors in the use of smoke detectors and proper storage of hazardous substances among parents enrolled in the treatment group may have been an artifact of parents who abided by proper safety practices being more likely to attrite from the control group than from the treatment group. In contrast, had parents who abided by proper safety practices attrited from the treatment group in greater numbers than from the control group, then the likelihood of attaining statistically significant results for these two hypotheses would have been diminished. Unfortunately, Sangvai and her colleagues did not test whether the treatment groups were equivalent following the differential attrition.

Finally, differential follow-up periods between the treatment and control groups can produce false positives (or false negatives) depending upon whether differences in the length of the follow-up period improve the likelihood of observing significant outcomes for the intervention group, versus the control group, when such differences really do not exist (or do exist). For example, the follow-up period for a NECT study designed to improve intermediate outcomes for diabetics with high cardiovascular risk was three months longer for control than treatment patients (Kirsh et al., 2007). Consequently, it is possible the reason why only one of the six primary hypotheses was statistically significant, following multiplicity adjustments, was because the length of time for which control patients received medical care was slightly longer than that of treatment patients. In contrast, had the length of follow-up favored treatment patients, one would have needed to consider whether the attained significant result had been a function of this disparity rather than the intervention.

Design-specific limitations

RCT limitations

The distinguishing feature of RCTs that enables the design to establish causal claims is randomization. Thus, RCTs are prone to problems that threaten this process. Specifically, randomization problems can occur whenever the allocation process cannot ensure equivalence between the treatment groups on baseline measures (Boruch, 1998). This has the potential to arise whenever the sample size of the randomized strata (level at which random allocation occurred) is small due to the increased risk that the subjects nested within the grouping variable will not be equivalent on baseline measures (Kazdin, 1992; Higgins & Altman, 2008). For example, the Otero-Sabogal et al.
(2006) study conducted a clustered RCT in which three clinics were randomized to either the treatment or control group. Considering the small number of clinics that participated and that the two treatment group clinics were located in Northern California while the control group clinic was located in Southern California, the demographic differences between the treatment groups may be due to geographic differences. Consequently, there is no way to determine whether follow-up significant differences were a function of the intervention or confounding factors.
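A small simulation can illustrate why randomizing only a handful of clusters, as in the three-clinic example above, cannot be relied upon to balance baseline characteristics. The between-clinic variation and the 2:1 allocation used below are invented for illustration.

```python
# Simulation of baseline imbalance when very few clusters are randomized
# (e.g., three clinics, two assigned to treatment and one to control).
# Cluster-level covariate values are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_simulations = 10_000
n_clinics = 3
clinic_sd = 0.5   # assumed between-clinic variation in a baseline covariate

imbalances = []
for _ in range(n_simulations):
    clinic_means = rng.normal(0.0, clinic_sd, size=n_clinics)
    treatment_clinics = rng.choice(n_clinics, size=2, replace=False)
    control_clinic = np.setdiff1d(np.arange(n_clinics), treatment_clinics)
    diff = clinic_means[treatment_clinics].mean() - clinic_means[control_clinic].mean()
    imbalances.append(abs(diff))

imbalances = np.array(imbalances)

# With so few units of randomization, sizable baseline differences between arms
# are common rather than exceptional, which is why small-cluster RCTs cannot be
# assumed to have achieved equivalence at baseline.
print(f"Median absolute baseline imbalance: {np.median(imbalances):.2f}")
print(f"Share of trials with imbalance greater than 0.5: {np.mean(imbalances > 0.5):.0%}")
```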

The limitations of randomization are not specifically discussed in any existing grading guidelines, although several guidelines discuss the importance of sequence generation, allocation concealment, and blinding (Gugiu & Gugiu, 2011). Sequence generation refers to the method used for randomly allocating patients to treatment groups. An inadequately concealed allocation occurs whenever clinicians have access to the allocation schedule. While access to the allocation schedule is not a flaw per se, it can produce biases if the schedule is manipulated in a way that undermines the randomization process (Higgins & Altman, 2008). However, systematic biases that undermine equivalence can easily be detected using statistical tests of equivalence on key baseline variables. Blinding, on the other hand, is a procedure designed to keep a study participant's assignment condition hidden from the participant, experimenters, and/or data analysts (Higgins & Altman, 2008). However, it is important to note that while blinding generally strengthens the evidence produced by a study, many studies cannot be engineered to hide treatment allocation from participants. Furthermore, blinding can reduce bias in subjective outcomes (e.g., amount of pain experienced) but is unlikely to impact the measure of objective outcomes (e.g., HbA1c, blood pressure) (Wood et al., 2008).

ECT limitations

Well-conducted ECTs are considerably more capable of reducing the impact of confounding factors than NECTs. However, factors that threaten the establishment of equivalence can diminish the strength of ECTs. Pseudo-RCTs allocate participants to treatment groups using a nonrandom procedure (e.g., alternate allocation, even and odd birth dates). Such studies, therefore, are prone to the introduction of a systematically biased allocation method that can create disparity between treatment groups. For example, if potential patients found out that only people born on an odd date would receive the treatment (i.e., the study lacked allocation concealment), patients with even birth dates might be more likely to refuse to participate in the study (i.e., selection bias). Therefore, the control group may not accurately represent the population sampled (i.e., the sample of control patients that agreed to participate may represent a subpopulation). However, if a pseudo-RCT is properly conducted (e.g., the allocation mechanism was concealed from potential participants), it has the potential to produce results comparable to those of a RCT. Specifically, if the allocation mechanism is independent of both participants and the outcome measures, then for all practical purposes it is acting in the same fashion as the random variable used to allocate participants in a RCT.

Paired comparison designs are also a good alternative to RCTs. Several variations of paired comparisons exist, including matched-pairs (one-to-one matching), matched-sets (one-to-many or many-to-many matching), and exact matches (matching based on the same value) (Shadish & Clark, 2004). However, each of these methods requires the creation of sets of treatment and control patients based on their similarity on one or more key covariates (e.g., demographic variables, baseline outcomes of interest). Thus, these designs are subject to two questions. First, were matches found for a majority of patients? For example, an ECT study (Scanlon et al., 2008) designed to reduce medical payments and improve quality for diabetics matched 193 out of 199 patients.
Hence, it is highly unlikely a selection bias occurred. However, the more covariates used to pair treatment and control patients, the more likely it is that a large proportion of patients will not be matched. The second question that must be addressed is, how effective was the matching process in producing comparable treatment groups with respect to baseline measures? Addressing this question requires that researchers dispel incorrect notions of how to establish statistical equivalence. Despite conventional practice,


equivalence is not synonymous with the absence of statistically significant differences. The absence of significant differences occurs, even if actual differences exist, whenever a study is underpowered. Likewise, when a study is overpowered, the presence of statistical significance does not necessarily imply that a difference is clinically meaningful. Given a large enough sample size, any difference can attain statistical significance. Instead, establishing equivalence requires showing the difference between two variables is less than a tolerance criterion (e.g.,
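One standard way to operationalize the tolerance-criterion idea described here is the two one-sided tests (TOST) procedure. The sketch below applies it to hypothetical baseline data with an assumed tolerance bound; neither the data nor the bound are drawn from the article.

```python
# Sketch of equivalence testing: rather than inferring equivalence from a
# non-significant difference test, show that the group difference lies within a
# pre-specified tolerance region. Uses two one-sided tests (TOST); the data and
# the +/-0.3 tolerance bound are hypothetical.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(7)

treatment_baseline = rng.normal(loc=0.00, scale=1.0, size=150)   # e.g., a standardized baseline outcome
control_baseline = rng.normal(loc=0.05, scale=1.0, size=150)

tolerance = 0.3   # largest baseline difference still treated as negligible in this example

p_value, lower_test, upper_test = ttost_ind(treatment_baseline, control_baseline,
                                            low=-tolerance, upp=tolerance)

# A small TOST p-value supports equivalence within the tolerance region, which is
# a much stronger statement than "the ordinary t-test was not significant."
print(f"TOST p-value for equivalence within +/-{tolerance}: {p_value:.3f}")
```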
