J Clin Epidemiol Vol. 44, No. 1, pp. 91-98, 1991
0895-4356/91 $3.00 + 0.00
Copyright © 1991 Pergamon Press plc
Printed in Great Britain. All rights reserved

AGREEMENT AMONG REVIEWERS OF REVIEW ARTICLES

ANDREW D. OXMAN, GORDON H. GUYATT, JOEL SINGER, CHARLIE H. GOLDSMITH, BRIAN G. HUTCHEON, RUTH A. MILNER and DAVID L. STREINER

Departments of Clinical Epidemiology & Biostatistics, Medicine, Family Medicine, Pediatrics and Psychiatry, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada

(Received in revised form 26 February 1990)

Abstract--Objective. To assess the consistency of an index of the scientific quality of research overviews. Design. Agreement was measured among nine judges, each of whom assessed the scientific quality of 36 published review articles. Item selection. An iterative process was used to select ten criteria relative to five key tasks entailed in conducting a research overview. Sample. The review articles were drawn from three sampling frames: articles highly rated by criteria external to the study; meta-analyses; and a broad spectrum of medical journals. Judges. Three categories of judges were used: research assistants; clinicians with research training; and experts in research methodology; with three judges in each category. Results. The level of agreement within the three groups of judges was similar for their overall assessment of scientific quality and for six of the nine other items. With four exceptions, agreement among judges within each group and across groups, as measured by the intraclass correlation coefficient (ICC), was >0.5, and 60% (24/40) of the ICCs were >0.7. Conclusions. It was possible to achieve reasonable to excellent agreement for all of the items in the index, including the overall assessment of scientific quality. The implications of these results for practising clinicians and the peer review system are discussed.

Meta-analysis, Peer review, Reproducibility of results, Information dissemination, Research design, Publishing standards

*All correspondence should be addressed to: Andrew Oxman, McMaster University Medical Centre, 1200 Main Street West, Room 2V10, Hamilton, Ontario, Canada L8N 3Z5.

INTRODUCTION

Before new evidence can be integrated into clinical practice it must be accessed, its applicability and validity must be assessed and it must be synthesized. Part of the reason for the gaps that exist between evidence and practice [1] is the vastness of the literature with which clinicians are inundated. Processing this information is time consuming and many clinicians lack the skills, as well as the time [2].

The publication of review articles is one means of addressing this problem. Review articles can greatly improve efficiency by condensing the amount of material that must be read. In addition, they can assist clinicians with


validating the information that is summarized [3]. However, just as primary research must be critically appraised before its results can be appropriately applied [4], the applicability and validity of a review article must also be assessed before applying its conclusions to clinical practice. Although criteria for assessing the quality of review articles [5, 6], as well as criteria for assessing meta-analyses [7] and other research reports [4, 8-10], have been published, little effort has gone into evaluating the consistency and validity of these criteria [3, 11-13]. When consistency has been measured and reported, the results have often been disappointing [11-13].

In this paper we report the results of a study to assess the consistency of a set of criteria that was developed to measure the scientific quality of published review articles (research overviews) and discuss the implications of our findings for the peer review system, as well as for practising clinicians.

METHODS

The development of criteria for evaluating the scientific quality of research overviews is conceptually analogous to the development of a measurement instrument such as a quality of life measure [14]. This process includes the following tasks: conceptualization; item selection; consistency testing; and validity testing. The first three of these are described in this report.

Conceptualization

The intended purpose of the criteria (Table 1) is to assess certain aspects of the scientific quality of research overviews. We are attempting to measure the extent to which a review is likely to be valid; i.e. the extent to which an overview guards against bias (by which we mean systematic deviation from the truth). The assessment of these aspects of scientific quality is primarily a measure of process rather than of outcome. It does not include a number of features that might be considered part of scientific quality. These include the importance of the question addressed, the degree of innovation in the approach or the impact on future scientific and technical developments.

The process of conducting a research overview is complex, and entails a number of tasks. The criteria were developed to assess threats to validity relative to each of these tasks (dimensions); specifically: problem formulation; study identification; study selection; validation of studies; data extraction; data synthesis; and inference. These tasks, and the threats to the validity of an overview posed by each one, have been discussed in detail elsewhere [5-7, 15-19].

Table 1. Criteria for assessing the scientific quality of research overviews*
1. Were the search methods reported?
2. Was the search comprehensive?
3. Were the inclusion criteria reported?
4. Was selection bias avoided?
5. Were the validity criteria reported?
6. Was validity assessed appropriately?
7. Were the methods used to combine studies reported?
8. Were the findings combined appropriately?
9. Were the conclusions supported by the reported data?
10. What was the overall scientific quality of the overview?
*The ten items referred to in the text are briefly summarized in this table. The complete questionnaire that was used is available from the authors.

Item selection

A preliminary set of criteria for evaluating research overviews was developed based on a review of the literature. The inclusion criteria used to select items were that they should measure “scientific quality”, as defined above, and they should be applicable to overviews of practical questions in the health sciences; i.e. questions regarding causation, prognosis, diagnosis, therapy, prevention or policy. Items were excluded if they were redundant, irrelevant to scientific quality or were not generalizable to both quantitative and qualitative overviews (meta-analyses and traditional narrative review articles) of clinically relevant topics. Items were initially selected based on the subjective assessment of one of the authors (A.D.O.), and were subsequently refined through an iterative process of discussions, pretesting and revision. In addition, much helpful advice was received from numerous investigators who had published relevant material. A mailed survey of editors and additional methodological experts known to be engaged in meta-analytic research did not generate any additional items or general concepts. All items were presented as closed-ended questions with 7-point scale response options. The measurement properties of the 7-point scale and the rationale for its use have been described elsewhere [20]. Four anchors were given for the scoring of each item, allowing for “fence hangers” to score between the anchors.


In a pilot study nine overviews were each evaluated by nine judges. In addition to identifying any remaining ambiguities in the evaluation instrument and providing a basis for further revisions of the form, the pilot test was an important component of the training that the judges received. Twenty-five items were included in the instrument that was used in the consistency study. These were subsequently reduced to the ten items summarized in Table 1 by eliminating items that did not discriminate between overviews of high and low scientific quality.

Consistency testing

The type of consistency that was tested was interjudge agreement; i.e. the extent to which different judges assessing the same overview agreed about its scientific quality relative to each item and its overall quality. Three categories of judges were used: research assistants, clinicians with research training (G.H.G., B.G.H., A.D.O.) and experts in research methodology (C.H.G., R.A.M., D.L.S.), with three judges in each category. With the exception of the research assistants, who were randomly selected from the support staff in the Department of Clinical Epidemiology & Biostatistics, the judges were selected on the basis of an interest in overview methodology. Seven of the nine judges, including all of the research assistants and clinicians, participated in the pilot test noted above. In addition, all nine judges participated in a 1 hr training session which utilized the review articles from the pilot test and the consensus which was reached over the assessments of these articles.

For a review article to be included in the consistency study, the primary focus had to be practical; dealing with a question regarding causation, prognosis, diagnosis, therapy, prevention or policy. Reviews focusing primarily on basic science issues, including pathophysiology, on a methodological issue, or on a theoretical issue were excluded. Editorials and articles that cited fewer than ten studies were also excluded.

Because a broad spectrum of quality rather than a representative sample of published review articles was desired, the overviews were selected from several different sampling frames. Experience from developing the criteria suggested that a representative sample would be heavily skewed towards articles of lower scientific quality. Indeed, despite a number of efforts


directed at identifying scientifically rigorous overviews, only a small number of exemplary overviews were located. They were identified primarily through personal contact with investigators with a known interest in research synthesis. Six reviews published in 1985 or 1986 which were highly rated by criteria external to the study were included (a list of reviews used can be obtained by writing to the authors).

Because of the quantitative nature of meta-analyses, they tend to be more scientifically rigorous than traditional narrative review articles [7, 19, 27-30]. Therefore, published meta-analyses were used as a second source of more rigorous overviews. These were identified through a textword search of the MEDLINE database (January 1985 to March 1987) using "meta-analysis" as the search term. Four meta-analyses [31-34] were selected sequentially from this listing, beginning with the most recent posting.

The rest of the articles were selected from journals with a methodological focus [35, 36], annual reviews [37-40], major U.S. internal medicine journals [41-48], major general medical journals [49-56] and other journals [57-60]. Journals within each of these categories were identified on an ad hoc basis by one of the authors (A.D.O.). Articles were selected systematically from each of the sampling frames by a research assistant. The first two articles published in 1986 were selected sequentially from each journal. When two articles meeting the inclusion criteria could not be found, additional articles from another journal were selected. A total of 36 review articles was selected. The authors' names, dates and the names of the journals were deleted from photocopies of all of the articles that were evaluated.

All nine of the judges evaluated all of the overviews included in the consistency study. The sample size was based on obtaining a confidence interval of 0.1 on the intraclass correlation coefficient (ICC) which was used to measure agreement among judges [21]. The ICCs and their 95% confidence intervals were calculated according to Shrout and Fleiss' guidelines [22]. The ANOVA model which was used assumed non-random selection of judges within groups. The mean of each judge for the 36 articles and the mean across judges within each group were compared to identify systematic differences between judges and between categories of judges. All of the analyses were done using BMDP [23].
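As an illustration of the agreement statistic (not a reproduction of the authors' BMDP analysis), the sketch below computes a single-rater ICC from an articles-by-judges score matrix using the two-way ANOVA mean squares described by Shrout and Fleiss [22]. The function name, the flag and the example data are ours, and the scores are invented; confidence intervals, which in the paper follow Shrout and Fleiss' F-distribution bounds, are not shown.

import numpy as np

def single_rater_icc(ratings, include_judge_variance=True):
    # ratings: (n articles) x (k judges) matrix of scores.
    # Two-way ANOVA decomposition in the spirit of Shrout and Fleiss (1979).
    # include_judge_variance=True keeps systematic between-judge differences in the
    # denominator (ICC(2,1), absolute agreement); False drops them (ICC(3,1), consistency).
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    article_means = x.mean(axis=1)
    judge_means = x.mean(axis=0)
    bms = k * np.sum((article_means - grand) ** 2) / (n - 1)   # between-article mean square
    jms = n * np.sum((judge_means - grand) ** 2) / (k - 1)     # between-judge mean square
    resid = x - article_means[:, None] - judge_means[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))             # residual mean square
    if include_judge_variance:
        return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    return (bms - ems) / (bms + (k - 1) * ems)

# Invented example: 36 articles scored on the 7-point scale by the 3 judges in one group.
rng = np.random.default_rng(0)
quality = rng.integers(1, 8, size=36).astype(float)
scores = np.clip(quality[:, None] + rng.normal(0.0, 1.0, size=(36, 3)), 1, 7)
print(single_rater_icc(scores), single_rater_icc(scores, include_judge_variance=False))

The include_judge_variance flag mirrors the comparison made later in the Discussion, where the ICCs are recomputed with the between-judge variation excluded from the denominator.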


It was decided a priori that the ICC for interjudge agreement should be >0.5 for a criterion to be considered sufficiently reliable. This is noticeably greater than values of 0.3 and less than have been reported for peer review of primary research using global assessments [11, 12]. Furthermore, an ICC for a single judge that is >0.5 results in an ICC of >0.67 for the mean scores of two judges [24]. This is generally considered to represent a high level of agreement [21, 25], although any cut-off value is somewhat arbitrary.
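The step from a single-judge ICC of 0.5 to roughly 0.67 for the mean of two judges follows what is presumably the Spearman-Brown relation cited from [24]; in our notation, with ICC_1 the single-judge coefficient and k the number of judges averaged:

\mathrm{ICC}_k = \frac{k\,\mathrm{ICC}_1}{1 + (k-1)\,\mathrm{ICC}_1}, \qquad \mathrm{ICC}_2 = \frac{2(0.5)}{1 + 0.5} \approx 0.67.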

RESULTS

The mean score and standard deviation for each of the 36 articles, based on the ratings of all nine judges, are shown for the overall assessment of scientific quality in Fig. 1. It can be seen that there was substantial variability in the sample of review articles that was selected, although without the articles and meta-analyses which were highly rated by criteria external to the study there would have been far less variability.

The mean of each judge and the mean across judges within each group are displayed in Table 2 for the overall assessments of scientific quality. Although the two highest means are both among the research assistants, and although the research assistants consistently had higher means than the other two groups across reviews, overall there was no statistically significant difference between groups [quasi-F(2,8) = 2.11, p = 0.18].

[Fig. 1. The scientific quality of 36 review articles. The figure plots the mean (and standard deviation) of the nine judges' scores on Item 10 (overall scientific quality) for each article, on a scale from 1 (extensive flaws, very poor) through 3 (major flaws, poor) and 5 (minor flaws, good) to 7 (minimal flaws, exemplary). Articles are grouped by sampling frame: external high rating (Arch Intern Med, BMJ, Chest, Eur Heart J, JAMA, Psychol Bull); meta-analysis/MEDLINE (Am J Psychiatry, Lancet, Nursing Res, Spine); methodological journals (J Chronic Dis); annual reviews (Ann Rev Med, Ann Rev Public Health); major U.S. internal medicine journals (Am J Med, Ann Intern Med, Arch Intern Med); major general medical journals (BMJ, CMAJ, JAMA, NEJM); and other journals (Hosp Practice, J Roy Coll Gen Pract, Medicine).]


Table 2. Means of individual judges across reviews for overall scientific quality (Item 10)

                                    Judge 1   Judge 2   Judge 3   Mean
Experts in research methodology       2.80      3.23      3.40    3.14
MDs with research training            3.20      3.46      3.00    3.22
Research assistants                   3.20      3.80      4.29    3.76

Table 3. Agreement within groups of judges regarding assessments of overall scientific quality (Item 10)

                                     ICC     95% CI
Experts in research methodology      0.77    0.69-0.87
MDs with research training           0.74    0.51-0.79
Research assistants                  0.62    0.38-0.78
All nine judges                      0.71    0.59-0.81

When each review was considered independently, the difference in mean scores was statistically significant for only 2 of the 36 articles. In these cases the means for the research assistants and for the other two groups were 3.3 vs 1.0 (p = 0.027) and 3.7 vs 2.2 (p = 0.016) (uncorrected for multiple comparisons).

The level of agreement within groups was similar for the clinicians, the methodologists and the research assistants for their overall assessment of scientific quality (Table 3) and for most of the other items, as illustrated in Fig. 2. With four exceptions, the ICCs were >0.5, and 60% of the ICCs (24/40) were >0.7. The research assistants had ICCs that were <0.5 for Items 7, 8 and 9 (see Table 1). Examining the rank order of the ICCs (Table 4), agreement appeared somewhat poorer among the research assistants than within the other two groups. However, for seven of the ten items (Items 1-6 and 10) the differences in the ICCs were small (Fig. 2).

[Table 4. Rank order for agreement (ICC) within groups of judges, for Items 1-10 (search methods reported; comprehensive search; inclusion criteria reported; selection bias avoided; validity criteria reported; validity assessed appropriately; methods for combining reported; findings combined appropriately; conclusions supported by data; overall scientific quality). Groups: 1 = experts in research methodology; 2 = MDs with research training; 3 = research assistants.]

[Fig. 2. Agreement within groups of judges. For each of the ten items, the figure plots the intraclass correlation coefficient (and 95% confidence interval) for group 1 (experts in research methodology), group 2 (MDs with research training), group 3 (research assistants) and all nine judges combined, on an axis running from -0.1 to 0.9.]

DISCUSSION

Agreement about scientific quality

The results demonstrate that excellent agreement for a complex concept like scientific quality can be achieved among judges with relatively little knowledge of the content area. This stands in contrast to the generally poor agreement that has been reported among referees for global ratings of "publishability" [11, 12, 26-32] and assessments of the "execution" [26], the "research design" [27] and the "research quality" [11] of primary research reports.

The results of this study are not directly comparable to assessments of scientific quality using different criteria and different judges, and there are a number of factors that might have contributed to the high ICCs that were observed. However, two strategies that are perhaps essential to improving interjudge agreement warrant comment.

First, agreement regarding assessments of complex concepts can be improved by assessing the different dimensions of the construct, stating them explicitly on the scoring form and minimizing the need for inference by the judges. In the present case, agreement was generally better for items which required less inference on the part of the judges (Items 1, 3, 5 and 7). These items simply assessed the clarity with which the methods that were used were reported. The lower ICC for agreement among research assistants for Item 7 (0.31) might reflect a general uncertainty regarding how the results of research are qualitatively combined in narrative review articles.

Second, in addition to the use of explicit criteria, training is necessary to achieve high levels of agreement. The use of explicit criteria has been found by others to have little impact on agreement among referees without training [11]. Validating research requires some degree of


judgement, even when explicit criteria are used. Nonetheless, we found, for the most part, only minor differences in scoring by three groups of judges with substantial differences in background experience (research assistants, clinicians with research training and experts in research methodology). The fact that all raters were part of the same university department may have contributed to our success in achieving agreement. Achieving agreement using raters from very different environments may prove more difficult.

For seven of the ten items there were only minor differences in the level of agreement within each of the three groups, and the difference in mean scores among the three groups was statistically significant for only 2 of the 36 review articles. However, the research assistants tended to have a higher mean score than the other judges. The difference between the mean for the research assistants and the mean for the other two groups was >1 in 14 cases, and the mean for the research assistants was higher in all of those instances. Whether those differences are



practically important is a matter of judgement, but to the extent that the observed trend is practically important, it can likely be attributed, at least in part, to inadequate training of the one research assistant with the highest mean (Table 2), who also used a contracted scale (3-6). The poor agreement among the research assistants regarding the reporting of the methods used to combine studies (Item 7), the appropriateness of those methods (Item 8), and the extent to which conclusions were supported by data (Item 9) can also be attributed, to some extent, to inadequate training of the one research assistant. Despite this, the level of agreement among research assistants was acceptable on overall scientific quality (Item 10).

It is not possible to predict how much improvement might be obtained with additional training. However, the between-judge variation, which is a source of bias that can be corrected or ignored if the same judges are always used, was relatively large for these items. Because of this, when the between-judge variation is not included in the denominator, the ICCs for the research assistants are better, but still below those of the other two groups on Items 7 (0.51) and 8 (0.48). Since the between-judge variation was relatively small within the other two groups of judges on nearly all of the items, the corresponding ICCs for those groups change only slightly, if at all, when between-judge variation is excluded from the denominator.

The scientific quality of review articles

Although the sample of articles included in the study is not representative, the scientific quality of these articles is consistent with what has been reported by others [5, 7]. With the exception of the articles highly rated by external criteria and the meta-analyses, all of the remaining articles had minor flaws, and most had major flaws.

The practical importance of the scientific flaws observed among the overviews included in this study is difficult to ascertain. Although the conclusions of less rigorous review articles are not necessarily less valid, in general they warrant less confidence because they are more prone to both random and systematic errors [33]; e.g. through selecting a non-representative sample of the relevant research [16, 34-36], through misrepresenting the validity of the research that is reviewed [3], or through employing inappropriate methods to combine the results of the research that is


reviewed [37, 38]. The fact that a review article is published in a peer reviewed journal, even a prestigious one, is no guarantee of scientific quality. Thus, in the same way that readers must be prepared to assess the quality of articles reporting primary research [4], and for the same reasons, readers must be prepared to assess the quality of research overviews. Clinicians can expect to find few review articles that provide even moderately strong support for their conclusions. The criteria used in this study can be used to assess the scientific quality of published reviews [6], and they can be applied reliably after relatively little training. Clinicians and other decision-makers should be encouraged to rely on scientifically sound research overviews as an effective and efficient means of processing large amounts of information. At the same time, they should be wary of review articles that simply summarize information without assuring its representativeness, its validity or the appropriateness of the methods used to synthesize it.

In narrative review articles, studies tend to be considered one at a time. Most often the process is informal, and decisions about how to aggregate findings and draw overall conclusions are idiosyncratic. Thus, it is not surprising that different review articles often draw very different conclusions from the same studies. For the reader, it is difficult to know what was done, let alone evaluate it, when decision rules are not specified. This does not imply that narrative review articles cannot be scientifically rigorous. A reviewer might, for example, choose not to use quantitative techniques to combine study findings for sound reasons, and still be explicit about what was done. Nor does it imply that meta-analyses are always scientifically sound [7]. However, if an overview does not report how something was done, it is reasonable to assume that it was not done properly. Thus, by applying the criteria that were used in this study clinicians can quickly screen out the majority of review articles that do not report the methods that were used, and improve the efficiency, as well as the effectiveness, with which review articles are used to guide clinical practice.

The criteria might also be used by journals as an aid to the peer review process [39]. The consistency of assessments of scientific quality, the general quality of what is published and, perhaps, the efficiency of the process, might further be improved by relying on trained


research assistants to screen manuscripts for their scientific soundness. Scientific quality is, of course, only one attribute of a review article. Other factors such as literary quality and clinical relevance are also important determinants of what is published. Nonetheless, there is a striking need for the publication of more scientifically rigorous research overviews [3, 5]. It has been suggested that the quality of peer review of manuscripts should not be judged solely by the extent of agreement among referees [40], and we would agree with this. However, with respect to assessments of scientific quality, we would argue that explicitness regarding what is meant and the reproducibility of assessments are not only desirable, but essential if the general quality of what is published is to be improved.

Acknowledgement--This work was supported by the Ontario Ministry of Health, Grant No. 01969. Dr Guyatt is a Career Scientist of the Ontario Ministry of Health.

REFERENCES

1. Fineberg HV. Effects of clinical evaluation on the diffusion of medical technology. In: Mosteller F, Ed. Assessing Medical Technology. Washington, D.C.: National Academy of Science; 1989: 176-205.
2. Williamson JW, German PS, Weiss R, Skinner EA, Bowes F. Health science information management and continuing education of physicians: a survey of U.S. primary care practitioners and their opinion leaders. Ann Intern Med 1989; 110: 151-160.
3. Williamson JW, Goldschmidt PG, Colton T. The quality of medical literature: an analysis of validation assessments. In: Bailar JC, Mosteller F, Eds. Medical Uses of Statistics. Waltham, Mass.: NEJM Books; 1986: 370-391.
4. Sackett DL, Haynes RB, Tugwell P. Clinical Epidemiology: a Basic Science for Clinical Medicine. Boston, Mass.: Little, Brown; 1985.
5. Mulrow CD. The medical review article: state of the science. Ann Intern Med 1987; 106: 485-488.
6. Oxman AD, Guyatt GH. Guidelines for reading literature reviews. Can Med Assoc J 1988; 138: 697-703.
7. Sacks HS, Berrier J, Reitman D, Ancona-Berk VA. Meta-analyses of randomized controlled trials. N Engl J Med 1987; 316: 450-455.
8. Chalmers TC. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981; 2: 31-49.
9. Feinstein AR, Horwitz RI. Double standards, scientific methods, and epidemiologic research. N Engl J Med 1982; 307: 1611-1617.
10. Horwitz RI, Feinstein AR. Methodologic standards and contradictory results in case-control research. Am J Med 1979; 66: 556-564.
11. Marsh HW, Ball S. Interjudgmental reliability of reviews for the Journal of Educational Psychology. J Educ Psychol 1981; 73: 872-880.
12. Ingelfinger FJ. Peer review in biomedical publication. Am J Med 1974; 56: 686-692.
13. Stock WA. Rigor in data synthesis: a case study of reliability in meta-analysis. Educ Res 1982; 11: 10-14.
14. Guyatt GH, Bombardier C, Tugwell PX. Measuring disease-specific quality of life in clinical trials. Can Med Assoc J 1986; 134: 889-892.
15. Chalmers TC, Berrier J, Sacks HS, Levin H, Reitman D, Nagalingam R. Meta-analysis of clinical trials as a scientific discipline. II: Replicate variability and comparison of studies that agree and disagree. Stat Med 1987; 6: 733-744.
16. Chalmers TC, Levin H, Sacks HS, Reitman D, Berrier J, Nagalingam R. Meta-analysis of clinical trials as a scientific discipline. I: Control of bias and comparison with large co-operative trials. Stat Med 1987; 6: 315-325.
17. Cooper HM. The Integrative Research Review: a Systematic Approach. Beverly Hills, Calif.: Sage; 1984.
18. Light RJ, Pillemer DB. Summing Up: the Science of Reviewing Research. Cambridge, Mass.: Harvard Univ. Press; 1984.
19. Glass GV. Meta-analysis in Social Research. Beverly Hills, Calif.: Sage; 1981.
20. Guyatt GH, Townsend M, Berman L. A comparison of Likert and visual analogue scales for measuring change in function. J Chron Dis 1987; 40: 1129-1133.
21. Streiner DL, Norman GR. Health Measurement Scales: a Practical Guide to Their Development and Use. Oxford: Oxford Univ. Press; 1989.
22. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420-428.
23. Dixon WJ, Ed. BMDP Statistical Software. Berkeley, Calif.: Univ. of California Press; 1983.
24. Rosenthal R. Meta-analytic Procedures for Social Research. Beverly Hills, Calif.: Sage; 1984: 55-58.
25. Fleiss JL. The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 2nd edn. New York: Wiley; 1981: 212-237.
26. Cicchetti DV, Eron LD. The reliability of manuscript reviewing for the Journal of Abnormal Psychology. J Abnorm Psychol 1979; 22: 596-600.
27. Cicchetti DV, Conn H. A statistical analysis of reviewer agreement and bias in evaluating medical abstracts. Yale J Biol Med 1976; 49: 373-383.
28. Gottfredson SD. Evaluating psychological research reports: dimensions, reliability, and correlates of quality judgments. Am Psychol 1978; 33: 920-934.
29. Hendrick C. Editorial comment. Pers Soc Psychol Bull 1976; 2: 207-208.
30. Linder DE. Evaluation of the Personality and Social Psychology Bulletin by its readers and authors. Pers Soc Psychol Bull 1977; 3: 583-591.
31. Scarr S, Weber BLR. The reliability of reviews for the American Psychologist. Am Psychol 1978; 33: 935.
32. Scott WA. Interreferee agreement on some characteristics of manuscripts. Am Psychol 1974; 29: 698-702.
33. Cooper HM, Rosenthal R. Statistical versus traditional procedures for summarizing research findings. Psychol Bull 1980; 87: 442-449.
34. Begg CB, Berlin JA. Publication bias: a problem in interpreting medical data. J R Stat Soc A 1988; 151: 419-463.
35. Gotzsche PC. Reference bias in reports of drug trials. Br Med J 1987; 295: 654-656.
36. Simes RJ. Confronting publication bias: a cohort design for meta-analysis. Stat Med 1987; 6: 11-29.
37. Hedges LV. Issues in meta-analysis. Rev Res Educ 1987; 13: 353-398.
38. Hedges LV, Olkin I. Statistical Methods for Meta-analysis. Orlando, Fla: Academic Press; 1985.
39. Squires BP. Biomedical review articles: what editors want from authors and peer reviewers. Can Med Assoc J 1989; 141: 195-197.
40. Bailar JC, Patterson K. Journal peer review: the need for a research agenda. In: Bailar JC, Mosteller F, Eds. Medical Uses of Statistics. Waltham, Mass.: NEJM Books; 1986: 349-369.
