The good, the bad and the ugly: meta-analyses.

Human Reproduction, Vol.29, No.8 pp. 1622–1626, 2014. Advanced Access publication on June 4, 2014. doi:10.1093/humrep/deu127

INVITED COMMENTARY

Madelon van Wely*
Center for Reproductive Medicine, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
*Correspondence address. E-mail: [email protected]

There seems to be growing negativity toward meta-analyses. Two years ago systematic reviews and meta-analyses, and more specifically Cochrane reviews, were criticized (Humaidan et al., 2012; Humaidan and Polyzos, 2012). The present issue of Human Reproduction contains another example of negative publicity toward meta-analyses, in the form of an Opinion paper (Simón and Bellver, 2014). The authors use the meta-analyses that have been published on the value of endometrial scratching in IVF as an example. Authors who attack meta-analyses argue, in essence, that meta-analyses should be faultless because they are placed at the top of the evidence-based pyramid. But, as the critics rightfully point out, meta-analyses and the studies they include are often not without biases. What is going on? Are meta-analyses not as useful as we thought they would be? Are the included studies not good enough?

In this Editorial, we will discuss the use and pitfalls of aggregate meta-analyses. Furthermore, the main points raised by the authors of the Opinion paper (Simón and Bellver, 2014) will be addressed. What should be done if only small trials of possibly low quality are available? When should we perform a meta-analysis? Should we accept a combination of observational and randomized studies in meta-analyses? What can we do when multiple meta-analyses are published within a short time period?

A Short History

Meta-analysis has developed over time as a way to deal quantitatively with varying study results (O’Rourke, 2007). Questions on how to summarize results from different studies were tackled in the 18th and 19th centuries by astronomers and mathematicians such as Gauss and Laplace (Laplace, 1820). In 1904 the British statistician Karl Pearson was the first to apply methods to combine observations from different clinical studies. He was asked to analyze data comparing infection and mortality among soldiers who had volunteered for inoculation against typhoid fever in various places across the British Empire with those of soldiers who had not volunteered (Pearson, 1904). In the decades that followed, the British statistician Ronald Fisher and colleagues worked on the appropriate analysis of multiple studies in agriculture (Fisher, 1935). Their methods formed a basis for the meta-analytical methods as we know them.

In medicine, the first publication on the aggregation of findings from different studies appeared in 1955, when Beecher combined studies that compared a placebo with a treatment (Beecher, 1955). In 1976 Gene Glass used the term ‘meta-analysis’ to refer to ‘the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings’ (Glass, 1976; O’Rourke, 2007). Subsequently, many statisticians worked on further improvements of the statistical methods behind meta-analysis.

What Is Meta-Analysis?

Conceptually, a meta-analysis uses a statistical approach to combine the results from multiple studies. In practice the analysis has to be preceded by a systematic review that starts with a clearly formulated clinical question. A systematic review means that the available literature is evaluated in a systematic and reproducible way, so that author-induced selection bias in the inclusion of studies is prevented. After selecting and describing the studies, meta-analysis can be used to summarize the predefined outcome. The major advantage of meta-analysis is that accumulation of evidence can improve the precision and accuracy of effect estimates and increase the statistical power to detect an effect. A further advantage is that it facilitates the generalization of results to a larger population.

With the help of cumulative meta-analysis a shift over time can be visualized. In cumulative meta-analysis studies are added one at a time, usually according to date of publication, and the results are summarized as each new study is added. In a forest plot of a cumulative meta-analysis only the first horizontal line represents the results of a single study. Each following horizontal line summarizes the results after inclusion of the next study. An example is provided for all studies that compared urinary-derived gonadotrophins and recombinant FSH (Fig. 1). After the first six studies the difference in live birth was in favor of recombinant FSH, but after including many more trials there was no longer any evidence of a difference (van Wely and van der Veen, 2011).
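The cumulative procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five studies and their numbers are hypothetical (they are not the gonadotrophin trials of Fig. 1), effects are taken to be log risk ratios with known standard errors, and a simple inverse-variance fixed-effect model is assumed for pooling.

```python
import math

# Hypothetical studies: (label, year, log risk ratio, standard error).
# These numbers are illustrative only and do not come from any real trial.
studies = [
    ("Study A", 1995, 0.40, 0.30),
    ("Study B", 1997, 0.25, 0.25),
    ("Study C", 1999, -0.05, 0.20),
    ("Study D", 2003, 0.02, 0.10),
    ("Study E", 2008, -0.01, 0.08),
]


def pool_fixed(effects, ses):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def cumulative_meta_analysis(rows):
    """Re-pool after each study is added, in order of publication year."""
    ordered = sorted(rows, key=lambda r: r[1])
    results = []
    for k in range(1, len(ordered) + 1):
        subset = ordered[:k]
        est, se = pool_fixed([r[2] for r in subset], [r[3] for r in subset])
        # Store the label of the study just added and the 95% CI (log scale).
        results.append((ordered[k - 1][0], est, est - 1.96 * se, est + 1.96 * se))
    return results


cumulative = cumulative_meta_analysis(studies)
for label, est, lo, hi in cumulative:
    print(f"+ {label}: RR {math.exp(est):.2f} "
          f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

Each printed line corresponds to one horizontal line of a cumulative forest plot: the first is the first study on its own, and every later line is the pooled result after one more study has been added.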

© The Author 2014. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please email: [email protected]


Submitted on April 24, 2014; resubmitted on April 24, 2014; accepted on May 1, 2014



Figure 1 An example of a cumulative meta-analysis. Adapted from van Wely and van der Veen (2011).

Within a meta-analysis, inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error alone, or are study results influenced by between-study heterogeneity? Meta-analysis can be extended by meta-regression. Meta-regression allows the evaluation of the effects of continuous and categorical variables and is in essence a more advanced way to do subgroup analysis. Though meta-regression is a valuable method to assess differential effects in subgroups, many meta-analyses will lack the power to find a difference. As a rule of thumb, about 10 studies are required to evaluate differential effects for one variable.
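As a rough illustration of how inconsistency across studies can be quantified, the sketch below computes Cochran's Q and the I² statistic for two sets of four studies. The effect sizes and standard errors are hypothetical and chosen only to contrast a homogeneous with a heterogeneous set. The function name `cochran_q_and_i2` is ours, not that of any statistical package.

```python
def cochran_q_and_i2(effects, ses):
    """Return Cochran's Q and I-squared (in percent) for study effect sizes.

    effects: study effect estimates (for example log risk ratios)
    ses:     their standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the fixed-effect pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of the variation in excess of what chance alone
    # would be expected to produce, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical data: four studies with similar results ...
q_homog, i2_homog = cochran_q_and_i2([0.20, 0.22, 0.18, 0.21], [0.15] * 4)
# ... and four studies with widely scattered results.
q_heter, i2_heter = cochran_q_and_i2([-0.60, 0.10, 0.80, 1.40], [0.15] * 4)

print(f"homogeneous set:   Q = {q_homog:.2f}, I2 = {i2_homog:.0f}%")
print(f"heterogeneous set: Q = {q_heter:.2f}, I2 = {i2_heter:.0f}%")
```

An I² near 0% suggests that the spread between studies is compatible with sampling error alone, whereas a high I² points to between-study heterogeneity that deserves exploration before a pooled estimate is interpreted.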

Popularity of Meta-Analyses

Today, meta-analysis has become a key component of evidence-based medicine. Clearly there has been a tremendous rise in meta-analyses over the last decade. Looking at the number of clinical meta-analyses in PubMed it seems, however, that the peak has been reached (Fig. 2). The rise in meta-analyses over the last decade is the result of the growth in the number of clinical trials and of the desire to use accruing evidence as early as possible to improve health care decisions. Moher and Olkin (1995) described the ‘dramatic increase’ in the number of published meta-analyses. They suggested that several meta-analyses have improved medicine. As an example the authors use the meta-analysis that described the efficacy of corticosteroids given to mothers who were expected to deliver prematurely (Crowley et al., 1990). The results of that meta-analysis not only indicated that corticosteroids significantly reduced morbidity and mortality of these infants but also showed that such evidence had been available at least a decade earlier. The authors stated that, had a meta-analysis been conducted when the evidence became available, much unnecessary suffering might have been avoided.

It is therefore understandable that policy makers use systematic reviews and meta-analyses, in addition to randomized controlled trials (RCTs), in their decision-making. Moreover, it has become standard practice in grant applications to ask for a systematic review and meta-analysis of what is known on a certain subject. Indeed, an evidence-based overview is always helpful as long as the quality of the evidence is acceptable.

Figure 2 Number of meta-analyses in PubMed after entering the term ‘meta-analysis’ and limiting to ‘clinical’ and ‘human’.

Problems That May Arise in Meta-Analyses

No computer or statistical method can solve the problem that if the data are poor, the product of the analysis will be poor as well. This is also known as GIGO, or the ‘garbage in, garbage out’ principle. The best way to deal with this is to assess the quality of the studies. In epidemiological evaluations we are more or less looking for trends. Accumulating evidence from large cohorts is required to investigate relatively rare safety issues like neurological sequelae in preterm born children. In clinical meta-analyses on the effectiveness of interventions only methodologically sound studies should be included, a practice called ‘best evidence synthesis’. This means the inclusion of RCTs and the exclusion of non-randomized studies, as the latter are more likely to find large effects due to their non-random nature. Furthermore, underpowered studies with large effects in assisted reproductive technology (ART) should be considered with care. At the same time, we must ensure that the GIGO saying is not abused. It cannot be that every time somebody dislikes a result it is stated that the included studies had invalid data. Including only randomized trials will help.

A further helpful tool is to look at the statistical heterogeneity. Figure 3 shows two forest plots. The first one is a forest plot of four studies that have quite similar, or homogeneous, results, also expressed as an inconsistency measure or I² of 0%. The second one is a forest plot of four studies with heterogeneous results; the corresponding I² here was 90%. With such large heterogeneity between studies the pooled estimate does not represent a single true difference. Furthermore, it should be realized that a meta-analysis of several small studies does not predict the results of a single large study. A well-powered RCT is what we really need (Lelorier et al., 1997).

Figure 3 Two forest plots of four fictitious studies. The left graph shows that the effect measures of the individual studies all lie within each other's boundaries, i.e. the data are homogeneous. The right graph shows large differences in effect measures between the individual studies, i.e. the data are heterogeneous.

Only small trials could be included in the meta-analyses on endometrial scratching. The Cochrane Review on the subject could include four trials that evaluated endometrial injury in the previous cycle in terms of pregnancy outcomes (Nastri et al., 2012). A quick update of that review resulted in the inclusion of six trials, after including three other trials (Baum et al., 2012; Shohayeb et al., 2012; Nastri et al., 2013) and removing the interim analysis of the Nastri trial. Publication of interim analyses is not advisable. Such an interim analysis can affect the future conduct of a trial and make interpretation of the final results difficult. Looking at the forest plot of the updated meta-analysis there was no statistical heterogeneity between the studies, as can be seen in Fig. 4. A differential effect was seen only in the smallest study with 36 women (Baum et al., 2012). Still, in view of the concept that several small studies may not predict the results of a single large study, we cannot yet be sure whether endometrial scratching leads to better results. The good news is that larger trials are ongoing in different parts of the world (see http://www.clinicaltrials.gov/). These studies should take into account safety issues as well as patient burden, as the procedure has been reported to be painful (Nastri et al., 2013).

Unit of randomization error is a commonly seen problem in meta-analyses. When women or couples are randomized to the interventions of interest, all outcomes in the meta-analyses should be expressed per woman. Expressing the outcome per embryo would artificially inflate the amount of evidence. Mind you, it was not the embryo that was randomized. As a result, implantation rate following IVF is an inappropriate outcome in meta-analyses, unless all women had a single embryo transferred.
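The effect of analysing per embryo rather than per woman can be illustrated with a small calculation. The sketch below is a hedged example with made-up numbers: a hypothetical trial with 100 women per arm and two embryos transferred per woman, in which the proportions of successes are kept identical in both analyses. The helper `risk_ratio_ci` is ours and uses the usual Wald interval on the log scale.

```python
import math


def risk_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio with a Wald 95% confidence interval computed on the log scale."""
    rr = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical trial: 100 women randomized per arm (the true unit of randomization).
per_woman = risk_ratio_ci(35, 100, 25, 100)

# Same proportions, but with the 200 embryos per arm wrongly treated as
# independent units, although embryos were never randomized.
per_embryo = risk_ratio_ci(70, 200, 50, 200)

for name, (rr, lo, hi) in (("per woman", per_woman), ("per embryo", per_embryo)):
    print(f"{name}: RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The point estimate is the same in both analyses, but the per-embryo interval is narrower only because the denominator was doubled. In this made-up example the per-woman interval includes 1 while the per-embryo interval does not, which shows how the wrong unit of analysis can manufacture apparent significance.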

Changing primary outcomes or basing the conclusions on secondary outcomes is another major problem, one that can partly be prevented by registering the meta-analysis in PROSPERO (http://www.crd.york.ac.uk/PROSPERO/). Even better would be the registration of a protocol, which is mandatory for Cochrane reviews.

Another problem in meta-analysis is publication bias. Usually this is due to the underreporting of trials that did not find a difference. Nowadays this form of bias is easier to detect, as journals request that randomized trials be registered at one of the trial registries (http://www.controlled-trials.com/isrctn/search.html, http://www.clinicaltrials.gov/ct2/search/index). Checking trial registries for relevant trials should therefore be part of the literature search.

Other publication issues are the existence of more meta-analyses than actual trials in the literature and the publication of multiple meta-analyses on the same comparison within a short time period. It is the responsibility of journal editors to check what has recently been published in the field. Editors and reviewers should together aim to prevent not only double publication but also over-publication. For authors it can be wise to check the PROSPERO website to see whether another group is already working on the same question, as more and more non-Cochrane reviews will be registered there (http://www.crd.york.ac.uk/PROSPERO/).


Figure 4 The pooled and study-specific risk ratio for clinical pregnancy following endometrial injury versus no endometrial injury in the previous cycle of couples that underwent IVF.


When to Meta-Analyse and Update

In a critique of systematic reviews and meta-analyses it was recommended that systematic reviews should include at least three to four trials with a total sample size of at least 1000 patients (Humaidan and Polyzos, 2012). There is no evidence at all for such a policy. We need to know all the evidence to evaluate what is done in daily practice and whether this can be improved in a safe and effective way. In response to the critique, well-known methodologists wrote that all clinical decisions should be based on good-quality systematic reviews that provide a synthesis of the current best evidence, no matter how shaky or sparse the evidence might be. When it is demonstrated that evidence is weak or inadequate this still adds value by revealing to knowledge users the true nature of the evidence informing their decisions (Ansari and Moher, 2013).

Cochrane reviews have their own dynamics concerning when to do a review and when to update. As written previously, ‘Cochrane reviews should not be postponed, waiting for more evidence. On the contrary, they should be undertaken and published when an important clinical question has been addressed by clinical trials’ (Hughes et al., 2012). Cochrane reviews are updated on a regular basis. The increase in information may narrow the boundaries of the effect estimate. Sometimes the update actually changes the conclusions. This does not imply that previous reviews were wrong, but it does show that the evidence up to that point had been inadequate and that the update was a necessity.

In the Opinion paper in the present issue of Human Reproduction the following statement was made: ‘The weakness of published meta-analyses is so evident that some societies such as the Royal College of Obstetricians and Gynaecologists have created guidelines subdividing the level of evidence 1...’ (Simón and Bellver, 2014). Subdividing evidence does not relate to a weakness. Evidence reflects what we know at this moment. Due to changing policies, protocols, patient populations, concomitant diseases, etc., we can never be 100% sure that our effect measure reflects the truth. However, solid evidence indicates with reasonable certainty that we do know the truth. Level 1 evidence stands for likely reliable evidence for conclusions about interventions and reflects the presence of RCTs, with or without meta-analyses that pooled the available evidence. The RCOG uses a subdivision of level 1 evidence that is actually really helpful in that respect. Evidence 1++ can be interpreted as it being highly likely that the evidence reflects the truth. Evidence 1+ implies it is most likely that the observed effects are true effects. With evidence 1- it seems that the effect is as observed, but we do need more evidence to be sure. Evidence at level 2 and beyond is not based upon truly randomized trials and is therefore more prone to bias.

Summary

The intention of clinical systematic reviews and meta-analyses is to summarize all available good-quality evidence. Meta-analysis should be seen as a helpful tool. It is not the tool that should be criticized but the people using it when the analysis was not done appropriately. It is impossible to prevent all pitfalls in systematic reviews and meta-analyses. But what we can do is to register all potential problems and not jump to conclusions too quickly. Be careful with meta-analyses that only include small underpowered trials and watch out for heterogeneous results. Good meta-analyses are objective, take into account both effectiveness and safety, and do not base their conclusions on a bunch of secondary outcomes. When used wisely, meta-analyses will remain a helpful friend.

References

Ansari MT, Moher D. Systematic reviews deserve more credit than they get. Nat Med 2013;19:395–396.
Baum M, Yerushalmi GM, Maman E, Kedem A, Machtinger R, Hourvitz A, Dor J. Does local injury to the endometrium before IVF cycle really affect treatment outcome? Results of a randomized placebo controlled trial. Gynecol Endocrinol 2012;28:933–936.
Beecher HK. The powerful placebo. JAMA 1955;159:1602–1606.
Crowley P, Chalmers I, Keirse MJ. The effects of corticosteroid administration before preterm delivery: an overview of the evidence from controlled trials. Br J Obstet Gynaecol 1990;97:11–25.
Fisher RA. The Design of Experiments. Edinburgh: Oliver and Boyd, 1935.
Glass GV. Primary, secondary and meta-analysis of research. Educ Res 1976;10:3–8.
Hughes EG, van Wely M, Farquhar CM. Cochrane reviews in perspective: the importance of appropriate conclusions and timing of publication. Hum Reprod 2012;27:3–5.
Humaidan P, Polyzos NP. (Meta)analyze this: systematic reviews might lose credibility. Nat Med 2012;18:1321.
Humaidan P, Kol S, Engmann L, Benadiva C, Papanikolaou EG, Andersen CY; Copenhagen GnRH Agonist Triggering Workshop Group. Should Cochrane reviews be performed during the development of new concepts? Hum Reprod 2012;27:6–8.
Laplace P-S. Théorie Analytique des Probabilités. Oeuvres Complètes 7, 3rd edn. Paris: Courcier, 1820: lxxvii.
Lelorier J, Grégoire GV, Benhaddad A, Lapierre J, Derderian FO. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997;337:536–542.
Moher D, Olkin I. Meta-analysis of randomized controlled trials. A concern for standards. JAMA 1995;274:1962–1964.
Nastri CO, Gibreel A, Raine-Fenning N, Maheshwari A, Ferriani RA, Bhattacharya S, Martins WP. Endometrial injury in women undergoing assisted reproductive techniques. Cochrane Database Syst Rev 2012;7:CD009517.
Nastri CO, Ferriani RA, Raine-Fenning N, Martins WP. Endometrial scratching performed in the non-transfer cycle and outcome of assisted reproduction: a randomized controlled trial. Ultrasound Obstet Gynecol 2013;42:375–382.
O’Rourke K. J R Soc Med 2007;100:579–582.
Pearson K. Report on certain enteric fever inoculation statistics. BMJ 1904;3:1243–1246.
Shohayeb A, El-Khayat W. Does a single endometrial biopsy regimen (S-EBR) improve ICSI outcome in patients with repeated implantation failure? A randomised controlled trial. Eur J Obstet Gynecol Reprod Biol 2012;164:176–179.
Simón C, Bellver J. Scratching beneath ‘The Scratching Case’: systematic reviews and meta-analyses, the back door for evidence-based medicine. Hum Reprod 2014;29:1618–1621.
van Wely M, van der Veen F. To assist or not to assist embryo hatching. Hum Reprod Update 2011;17:436–437.
