Injury, Int. J. Care Injured 45S (2014) S83–S88


Strategies for comparative analyses of registry data

Rolf Lefering *
Institute for Research in Operative Medicine (IFOM), University of Witten/Herdecke, Ostmerheimer Str. 200 (Building 38), 51109 Cologne, Germany

Abstract

Keywords: Registry data; Statistical methods; Outcome adjustment; Propensity score; Scoring systems; Confounder

The present paper describes and summarises methods used for non-randomised cohort data where the comparability of the study groups is usually not guaranteed. Such study groups are formed by a diagnostic or therapeutic intervention, or by other characteristics of the patient or the treatment environment. This is a typical situation in the analysis of registry data. The methods are presented together with an illustrative example: whole-body computed tomography in the early phase of treatment of severe trauma cases. The following approaches are considered: (i) unadjusted direct comparison; (ii) parallelisation; (iii) subgroup analysis; (iv) matched-pairs analysis; (v) outcome adjustment; and (vi) propensity score analysis. All these approaches have in common that they try to separate, or limit, the influence of confounding variables, which are unevenly distributed among the study groups but also influence the outcome of interest. They differ in the number of confounders considered, as well as in the number of patients included. The more sophisticated the approach, the more effectively such confounding factors can be reduced. However, any method used for the reduction of bias depends on the quality and completeness of the recorded confounders. Factors which are difficult or even impossible to measure can thus not be adjusted for. This is a general limitation of retrospective analyses of cohort data.
© 2014 Elsevier Ltd. All rights reserved.

Introduction

In recent years, data from registries have become increasingly important for health services research. This is especially true in health care areas where the conduct of classical randomised trials is very difficult, or even impossible. The emergency treatment of severely injured patients is such an area, because informed consent is difficult to obtain from non-responsive patients. However, the evidence level of registry studies ranges somewhere between prospective and retrospective observational studies. The main problem with registry studies is not the sample size: there are usually many more patients documented in registries than in clinical trials. It is also a positive aspect that registries include a larger variety of patients with a certain condition, while clinical trials usually consider a selected subgroup of cases only. Therefore, registry studies are most appropriate for analysing the effectiveness of routine care. The main problem of registry studies, however, is data completeness and data correctness, which tend to be lower

* Tel.: +49 221 98957 16; fax: +49 221 98957-30. E-mail address: [email protected]
http://dx.doi.org/10.1016/j.injury.2014.08.023
0020-1383/© 2014 Elsevier Ltd. All rights reserved.

than in clinical trials. There are usually limited resources for monitoring in registries, and source data verification is not frequently performed. Registries also document considerably less data per case than clinical trials. But there is a further methodological aspect of registry data analyses which should be considered more closely here. While descriptive data (like prevalence or incidence rates) profit a lot from a large and representative sample, problems arise with the comparability of subgroups. If a certain intervention, therapeutic or diagnostic, is analysed in registry data, then the direct comparison of cases with and without that intervention would nearly always give biased results. True comparability would only result from randomising a sufficiently large number of patients. Registries are comparable to observational studies where treatment decisions are not influenced by an experimental design. However, there are some analytic strategies which allow one to reach a certain degree of comparability, sometimes coming close to that of controlled trials. The present paper intends to present and describe six of these strategies, together with their advantages and disadvantages. A summary of these strategies can be found in Table 1. In the first part, some general comments on descriptive analyses, especially on the use of confidence intervals, are given.


Table 1
Summary of six analytical strategies with the potential to reach a certain degree of comparability which sometimes approximates that of controlled trials.

Direct comparison
Description: No adjustments performed; the groups are compared as they are.
Advantages: Easy to perform.
Disadvantages: No comparability of groups; observed differences in outcome could have many reasons.

Parallelisation
Description: Comparison is performed in a subgroup of patients defined by some inclusion and exclusion criteria.
Advantages: Comparability somewhat improved; extremes are excluded; easy to perform.
Disadvantages: Number of cases reduced; imbalances usually remain.

Subgroup analysis
Description: Patients are split into subsets based on a number of criteria; comparisons are then performed in each subgroup.
Advantages: Relatively good comparability within a subgroup; easy to perform.
Disadvantages: Multiple results; only a few criteria can be considered, otherwise the number of subgroups would dramatically increase.

Matched pairs
Description: Based on a number of predefined criteria, pairs of patients are selected and compared who differ only in the intervention (performed, or not).
Advantages: Very good comparability; intuitively understandable; equal sample size.
Disadvantages: Good comparability only with multiple criteria, but the more criteria, the fewer pairs; many cases remain unconsidered.

Outcome adjustment
Description: Factors that influence the outcome of interest are combined, and the predicted outcome is calculated for each case. Predicted and observed outcome are then compared between those with and without intervention.
Advantages: All cases are included; established tools for outcome adjustment can be used.
Disadvantages: Separate adjustment required for each outcome of interest; depends on the quality of the adjustment tool; sophisticated multivariate analysis.

Propensity score
Description: In a first step, the probability of the intervention is calculated (the propensity score). Then, in a second step, comparisons are made among patients with a similar propensity score.
Advantages: Similarity of cases defined by the probability of receiving the intervention; nearly all cases can be included.
Disadvantages: Sophisticated multivariate analysis; sufficient discrimination of the propensity score required.

Descriptive statistics

Descriptive data analysis and presentation follow the same rules as in clinical trials. Frequencies are presented as number of subjects and percentage, and continuous variables are presented with a measure of location and a measure of variation, like mean and standard deviation (SD). In case of considerably skewed data, for example length of stay (LOS) in hospital, it is recommended to provide the median with interquartile range, or at least to add the median to the mean/SD. The use of mean/SD is by no means limited to normally distributed data, which is a common misunderstanding: they can be calculated from any type of data. But in some cases the median gives important additional information. If mean and

median have about the same value, then it can be assumed that the data are distributed approximately symmetrically. The range of the observed data (minimum and maximum) can also be helpful; however, these values tend to be biased by outliers and extreme values.

Confidence intervals

For some key results it is further recommended to provide 95% confidence intervals (CI). Such an interval describes very clearly the degree of uncertainty contained in the data. The range of a confidence interval decreases when the sample size increases, reflecting the increasing statistical certainty with which the results


are associated. A confidence interval for the mean value of a metric variable can easily be calculated as

mean ± 1.96 × SE

where SE is the standard error of the mean. The formula for the 95% confidence interval of a percentage P, based on n observations, is as follows (sqrt = square root):

P ± 1.96 × sqrt(P × (100 − P) / n)

The latter formula provides a symmetric confidence interval, which may not be appropriate in case of very small or very high values of P (the lower or upper bound may fall below 0 or exceed 100, respectively). In these instances it is recommended to calculate the 95% confidence interval based on the Poisson distribution. If P is based on x events in n observations, then appropriate tables provide a lower bound and an upper bound for x (see for example http://faculty.washington.edu/heagerty/Books/Biostatistics/TABLES/Poisson). In the same way as P is calculated from x and n (P = 100 × x/n), you can then derive the lower and upper bound of the 95% confidence interval for P. In the above formulas, the value 1.96, which is based on the normal distribution, is used to calculate a 95% confidence interval. Other confidence intervals, like the 90% CI or 99% CI, can easily be calculated in the same way by using different factors (e.g., 1.645 for the narrower 90% CI, or 2.58 for the wider 99% CI). It should also be mentioned that these formulas should be applied only in a sufficiently large sample (about n > 50). In smaller samples the adequate factors are somewhat larger and can be derived from respective tables in statistical textbooks. A special situation is a confidence interval for a zero rate, i.e. if no event was observed (0%). In this case the 95% confidence interval is not symmetrical, since the lower bound is the same as the value itself, namely zero. For the upper bound of the confidence interval there fortunately exists a very simple rule of thumb which gives adequate results for n > 30: Hanley's rule of three [1], which gives an upper bound of 3/n. This rule of thumb gives a proportion (range 0–1); for a percentage, multiply the result by 100.

Comparisons

For the rest of the paper we will consider a constellation where two groups of patients are compared which cannot be assumed to be comparable.
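The interval formulas above can be sketched in a few lines of Python; note that the rule of three for a zero event rate gives an upper bound of 3/n.

```python
import math

def ci_mean(mean, se, z=1.96):
    """Confidence interval for a mean: mean +/- z * SE."""
    return (mean - z * se, mean + z * se)

def ci_percentage(p, n, z=1.96):
    """Symmetric (Wald) interval for a percentage P (0-100) from n
    observations: P +/- z * sqrt(P * (100 - P) / n)."""
    half = z * math.sqrt(p * (100.0 - p) / n)
    return (p - half, p + half)

def rule_of_three_upper(n):
    """Hanley's rule of three: upper bound of the 95% CI, as a percentage,
    when zero events were observed in n cases (adequate for n > 30)."""
    return 100.0 * 3.0 / n

# Example: 20% mortality observed in 400 patients
low, high = ci_percentage(20.0, 400)   # (16.08, 23.92)
```

Changing `z` to 1.645 or 2.58 yields the 90% and 99% intervals mentioned in the text.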
The patient groups could be based on whether a certain intervention was performed or not, whether a diagnostic procedure was applied or not, whether patients were treated in different types of facilities (e.g., large trauma centre or local hospital), or whether patients had a certain characteristic or not (e.g., gender). Such types of comparisons are frequently performed in the analysis of registry data. The problem is always that the two


patient groups usually differ in one or more important aspects, which makes the interpretation of the results difficult. If these aspects influence the outcome of interest (e.g., mortality), they are called confounders. In order to illustrate such a comparison, we will use the example of whole-body computed tomography (WBCT). Based on registry data, this diagnostic procedure has recently been shown to be associated with a significant survival advantage [2,3]. There is also an additional paper in this issue investigating the location of the CT device. The outcome of interest here is hospital mortality. Patients who received a WBCT during the initial treatment phase until admission to the intensive care unit (ICU) will be denoted as WBCT(+), while those who did not will form the WBCT(−) group. The data presented in this article should NOT be used for drawing conclusions about the use of WBCT. The results presented here were selected exclusively for illustrative purposes. They mostly refer to real data from the TraumaRegister DGU® (TR-DGU) and the respective publication of Huber-Wagner et al., but were sometimes modified to support the key messages.

Unadjusted comparison

The direct comparison of patients with and without a certain intervention usually shows that there are some aspects in which the two study groups differ. Such differences not only apply to the outcome of interest but also to patient characteristics and other circumstances (like emergency versus elective situations). Whether an observed imbalance between the study groups is relevant for interpreting the results depends on its influence on outcome. If a variable like the mechanism of injury has only limited impact on mortality, which is the outcome of interest here, then an imbalance also has only limited importance. However, if the unevenly distributed variable is known to have prognostic relevance, then the observed outcome is probably biased. Such a factor is called a confounder, or a confounding variable.
An observed difference in outcome may then be due to this confounding factor, and not to the intervention of interest. In most cases, however, it is influenced by both, and it is the challenge of the analysis to separate the two effects from each other. Table 2 shows some aspects of cases with and without WBCT from Huber-Wagner et al. [2]. It turned out that patients in both groups were quite comparable regarding age and sex; however, patients who received a WBCT tended to have more severe injuries, especially thoracic injuries, than those in the WBCT(−) group. In the WBCT(−) group, isolated head injury was about twice as frequent. The WBCT was also more often used in supraregional level one trauma centres. Outcome, however, was very similar in both groups (Table 2). Nevertheless, such an unadjusted presentation of group differences is helpful in research papers in order to demonstrate that certain differences between the groups exist, and that more sophisticated approaches for reaching comparability are justified.

Table 2
Comparability of patients with (+) and without (−) whole-body computed tomography (WBCT) during the initial treatment phase in hospital. Data were taken from the TR-DGU (2005; ISS ≥ 9; primary admissions).

                                          WBCT(+)        WBCT(−)
Number of cases                           1494           3127
Age, years (mean, SD)                     42.5 (20.3)    42.7 (20.8)
Male patients (n, %)                      1098 (74%)     2267 (73%)
Injury Severity Score (mean, SD)          32.4 (13.6)    28.4 (12.4)
Relevant head injury (n, %)               884 (59%)      1882 (60%)
Isolated head injury (n, %)               143 (10%)      605 (19%)
Relevant thoracic injury (n, %)           1035 (69%)     1589 (51%)
Hospital mortality (n, %)                 306 (21%)      691 (22%)
Treated in a level one hospital (n, %)    1295 (87%)     2425 (78%)
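The unadjusted mortality comparison in Table 2 can be reproduced directly from the counts; a minimal sketch using the Table 2 numbers, including the (unadjusted) odds ratio for death:

```python
# Hospital mortality counts from Table 2 (TR-DGU example data)
deaths_wbct, n_wbct = 306, 1494    # WBCT(+)
deaths_ctrl, n_ctrl = 691, 3127    # WBCT(-)

rate_wbct = 100.0 * deaths_wbct / n_wbct   # about 20.5%
rate_ctrl = 100.0 * deaths_ctrl / n_ctrl   # about 22.1%

# Unadjusted odds ratio for death, WBCT(+) vs WBCT(-)
odds_wbct = deaths_wbct / (n_wbct - deaths_wbct)
odds_ctrl = deaths_ctrl / (n_ctrl - deaths_ctrl)
odds_ratio = odds_wbct / odds_ctrl         # close to 1, i.e. very similar
```

As the text stresses, such a direct comparison says nothing about causality; it only documents that the crude outcome is similar despite the imbalances in injury severity.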


Parallelisation

This first simple approach to reach better comparability defines inclusion and exclusion criteria in order to reduce observed imbalances. For example, if patients with isolated traumatic brain injury (TBI) receive a WBCT much less frequently than those with multiple injuries, then the exclusion of patients with isolated TBI would increase comparability in the remaining groups. Another example is patients who died within a few minutes after hospital admission. These extremely injured patients usually do not receive a WBCT after admission due to their unstable situation. Leaving these patients in the dataset would wrongly increase the mortality in the WBCT(−) group. This kind of bias is called 'immortal time bias', a special kind of selection bias. If the availability of WBCT differs between large and small hospitals, this may also lead to an imbalance. The WBCT(+) group would contain more patients treated in university or level one hospitals (87% vs. 78%, Table 2), while patients treated in level three hospitals were found more frequently in the WBCT(−) group (2.0%, compared to 0.5% in the WBCT(+) group). Excluding patients treated in level three hospitals would therefore increase comparability. The advantage of this approach is its ease of performance and interpretation. Imbalances are reduced compared to the unadjusted comparison. However, in most cases it is not possible to reach satisfactory comparability. If the number of criteria used to define the study groups is increased, then comparability improves, but at the cost of a reduced sample size and limited representativeness. For example, if one attempts to reduce the imbalance in thoracic injuries by excluding patients with severe thoracic trauma, then the study group is reduced by more than half, and very interesting patients are excluded.
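In code, parallelisation amounts to simple filtering. A minimal sketch, with illustrative (hypothetical) field names, using the exclusion criteria discussed above:

```python
# Hypothetical registry extract; field names are illustrative only
patients = [
    {"id": 1, "isolated_tbi": True,  "hospital_level": 1, "died_in_er": False},
    {"id": 2, "isolated_tbi": False, "hospital_level": 1, "died_in_er": False},
    {"id": 3, "isolated_tbi": False, "hospital_level": 3, "died_in_er": False},
    {"id": 4, "isolated_tbi": False, "hospital_level": 2, "died_in_er": True},
]

def parallelise(cases):
    """Apply inclusion/exclusion criteria: drop isolated TBI,
    level-three hospitals, and deaths within minutes of admission."""
    return [c for c in cases
            if not c["isolated_tbi"]
            and c["hospital_level"] < 3
            and not c["died_in_er"]]

remaining = parallelise(patients)   # only patient 2 survives the criteria
```

Each additional criterion shrinks the sample further, which is exactly the trade-off described in the text.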
Subgroup analysis

It is a common and easily understandable approach that, in case of an observed imbalance, for example in injury severity, comparisons are repeated in subgroups. The Injury Severity Score (ISS) could be used to generate subgroups, say 16–24, 25–34, 35–49, and 50–75. Within each subgroup the injury severity is then very similar, so that comparisons of outcome are no longer biased by different levels of injury severity. However, the imbalance regarding the hospital level of care still remains, since more severely injured patients were more frequently admitted to level one hospitals. Thus a simple subgroup analysis is able to adjust for one confounding variable only. However, the concept of subgroup analysis could also be applied repeatedly, i.e. each ISS subgroup could further be subdivided according to the level of care (one, two, or three). This would result in twelve different subgroups in which the effect of WBCT could be evaluated. Within each subgroup, injury severity and level of care are well comparable. However, the sample size for some subgroups decreases dramatically. While there are 1,426 cases with ISS 16–24 treated in level one trauma centres, there are only two cases with ISS 50–75 treated in a level three hospital. The sample size problem could be reduced by defining less specific categories, e.g. only two or three ISS categories instead of four. But this negatively affects the comparability regarding injury severity. The general advantage of subgroup analysis is its ease of performance and its intuitive comprehensibility. If only one major confounding factor exists between the study groups, then subgroup analysis may be considered an appropriate approach. However, this implies that all relevant confounders are measured. The clear limitation of this approach is its reduced

ability to adjust for multiple confounders. The number of potential outcome comparisons increases dramatically with multiple factors, while the sample size (and thus the power) within each subgroup decreases. And if finally a subgroup with a 'significant' effect is identified, it will be difficult to defend this finding against the criticism of a chance finding due to multiple testing.

Matched pairs

The so-called matched-pairs analysis is also a very intuitive approach in case of non-comparable overall groups. The idea is to select 'similar' pairs of patients where one case has received the intervention of interest while the other has not. In our example, one case should have received a WBCT on admission, while the other should not. Similarity in this context is defined by a number of criteria which should be identical (or at least within a small range of values) within each pair of cases. For example, sex should be identical, age should not differ by more than 5 years, the Injury Severity Score should be about the same (e.g., ±4 points), and the level of the treating hospital should be the same. Thus comparability is guaranteed by definition for all criteria used to define the pairs. It could happen that for some patients from the WBCT(+) group no adequate partner from the WBCT(−) group is available. These cases could then not be included in the paired comparison. It could also happen that for a specific case with WBCT several appropriate patients exist in the WBCT(−) group. Then only one case is included as a partner in the paired comparison. The selection of the respective partner must be made without knowledge of the outcome, since otherwise a bias could be introduced. In order to increase comparability, further criteria could be used, like 'isolated head injury', 'thoracic trauma', 'shock on admission', 'unconsciousness', etc.
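A greedy one-to-one selection of such pairs can be sketched as follows; the field names and matching tolerances are illustrative assumptions, and the outcome is deliberately never inspected during matching:

```python
def match_pairs(treated, controls, max_age_diff=5, max_iss_diff=4):
    """Greedy 1:1 matching: identical sex and hospital level, age within
    5 years, ISS within 4 points. Outcome is never looked at here."""
    pairs, used = [], set()
    for t in treated:
        for i, c in enumerate(controls):
            if i in used:
                continue
            if (c["sex"] == t["sex"]
                    and c["level"] == t["level"]
                    and abs(c["age"] - t["age"]) <= max_age_diff
                    and abs(c["iss"] - t["iss"]) <= max_iss_diff):
                pairs.append((t, c))
                used.add(i)     # each control serves as partner only once
                break           # one partner per treated case
    return pairs

# Tiny illustrative dataset: one matchable pair, one unmatchable case
treated = [{"sex": "m", "level": 1, "age": 40, "iss": 30},
           {"sex": "f", "level": 1, "age": 70, "iss": 50}]
controls = [{"sex": "m", "level": 2, "age": 40, "iss": 30},
            {"sex": "m", "level": 1, "age": 43, "iss": 28}]
pairs = match_pairs(treated, controls)   # only the first treated case finds a partner
```

Tightening the tolerances or adding criteria reduces the number of pairs, which mirrors the trade-off discussed below.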
Each additional criterion is able to adjust for the respective characteristic; however, the chance of finding an appropriate partner decreases with each additional criterion. This may end up in a very small but highly comparable selection of cases. Outcome analysis within the selected pairs could be performed using test statistics for dependent values, like the paired t-test for continuous measures, Wilcoxon's signed-rank test for ordered categories, or McNemar's test for dichotomous outcomes. These test statistics have a higher power than those used for comparing independent groups. The intuitively understandable approach as well as the high comparability within the (paired) subgroups could be mentioned as advantages of this approach. The major limitation is the trade-off between good comparability (i.e. many criteria used for defining the pairs) and sample size. The more criteria are used, the lower the number of potential pairs. Thus finally only a small number of the available cases are used for drawing conclusions. As for all other approaches, it is a further limitation that only those variables which are measured and available in the database can be used for matching. For example, in a matched-pairs analysis of pre-hospital intubation, Ruchholtz et al. could not match for blood loss (because this could not be measured) and thus found a difference in hospital mortality of 4.5% versus 13.6% in favour of non-intubated patients [4]. This is a difference which is still biased by blood loss.

Outcome adjustment

This approach does not require comparable subgroups, nor does it attempt to generate comparability. It either uses (i) existing tools for outcome adjustment, or (ii) directly applies statistical tools to adjust for confounders.


Existing tools for outcome adjustment in trauma cases are, for example, prognostic score systems. Such systems provide an estimated risk of death based on known risk factors (confounders). They are developed in a specific patient group which serves as a standard. For example, the Trauma and Injury Severity Score (TRISS) used the data of the Major Trauma Outcome Study [5], while the Revised Injury Severity Classification (RISC) score was developed and validated with data from the TR-DGU collected 1993–2000 in Germany [6]. Thus the RISC prognosis describes the expected outcome in German trauma centres about 15 years ago. It is obvious that a prognostic scoring system requires a regular update. A new version of the RISC score is going to be published this year, where the reference standard will be the trauma cases documented in the TR-DGU in 2010–2011. The general idea of outcome adjustment is not to compare the two groups directly, but to compare the observed outcome within each group with the expected outcome, i.e. with an external standard. The observed (and expected) outcome may then differ between the two study groups, but only the extent to which the two values differ is finally used to compare the groups. In the paper of Huber-Wagner et al., the observed mortality rate in the WBCT(+) group was 3.1% lower than the expected rate (19.9% versus 23.0%) [2]. In the WBCT(−) group, however, observed and expected mortality were almost equal (21.3% vs. 20.6%). Thus the relative advantage was higher in the WBCT(+) group. If no adequate tool for outcome adjustment exists, or if an existing tool does not consider all relevant confounders (here: the level of hospital, or the year of trauma), then an appropriate adjustment has to be calculated. In case of an event as primary outcome measure (like hospital mortality), this is usually done by a multivariate logistic regression analysis.
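One common way to condense the observed-versus-expected comparison above into a single figure per group is the ratio of observed to expected mortality (a sketch using the rates quoted from Huber-Wagner et al. [2]; the ratio framing is added here for illustration):

```python
# Observed vs RISC-expected hospital mortality (percent), as quoted
# in the text from Huber-Wagner et al. [2]
groups = {
    "WBCT(+)": {"observed": 19.9, "expected": 23.0},
    "WBCT(-)": {"observed": 21.3, "expected": 20.6},
}

def observed_expected_ratio(g):
    """Observed / expected mortality. Values below 1 mean fewer
    deaths than the adjustment tool predicts for such patients."""
    return g["observed"] / g["expected"]

ratio_plus = observed_expected_ratio(groups["WBCT(+)"])    # below 1
ratio_minus = observed_expected_ratio(groups["WBCT(-)"])   # about 1
```

The WBCT(+) group performs clearly better than its RISC prognosis, while the WBCT(−) group performs as expected, which is the "relative advantage" described in the text.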
Logistic regression is able to simultaneously analyse the influence of several independent predictor variables (the confounders) on a dichotomous dependent variable (here: mortality). This analysis is performed on all cases (i.e. not separately within each subgroup), and the intervention of interest (here: WBCT performed or not) is included as one of the independent predictor variables in the model. The results of such a multivariate analysis are odds ratios (OR) describing the effect of each individual predictor variable on the (dependent) outcome. This OR is said to be adjusted for all other confounders included in the model. In the paper of Huber-Wagner et al., the OR for WBCT(+) is 0.69 (95% confidence interval 0.55–0.85), which means that the mortality of patients who received a WBCT was significantly lower than that of those who did not receive this diagnostic procedure [2]. This effect was adjusted for the RISC prognosis, the level of care of the treating hospital, and the year of trauma (see Table 5 in [2]). The most important advantage of this approach is that it is able to consider multiple confounding variables without reducing the number of cases considered. Although sample size is not a relevant problem in most registry analyses, it should be mentioned that the number of predictor variables used in a model is limited. It is recommended to have at least 5–10 'events' (here: patients who died) for each predictor analysed in the model [7]. Finally, it should again be mentioned that adjustment can only be performed for those variables contained in the dataset.

Propensity score

In observational studies, like registries, the decision whether an intervention is performed or not depends on various aspects.
Such aspects are the patient himself (age, weight, pre-existing diseases), the injuries (mechanism, pattern and severity), the actual condition of the patient (physiology, consciousness), but also the experience of the treating physician or the availability of resources (level of trauma centre). Usually, some of these aspects have an effect on the outcome measure, i.e. they are confounders.


For example, the decision to intubate a patient depends on findings like unconsciousness, shock, or respiratory rate. Thus intubated patients have a much higher mortality rate in the database, not because intubation is a life-threatening intervention, but because of the actual situation which required an intubation. The basic idea of the propensity score approach is to replace the collection of confounders with one function of these confounders, called the propensity score [8–10]. It is determined by a logistic regression analysis where the intervention of interest is the dependent variable, and the confounders are the independent variables. The propensity score is thus the conditional probability for a certain intervention given the set of observed covariates. This is different from the approach above, where the outcome (mortality) is the dependent variable. Calculation of the propensity score does not consider the outcome at all. Patients with a high propensity score nearly always received the intervention of interest, while nearly all patients with a low propensity score did not. These groups of patients with a very high or low propensity score are thus less appropriate for performing comparisons. The most interesting subgroup for comparative analyses are thus patients with a medium-range propensity score, i.e. where some physicians applied the intervention while others did not. Just for illustration, in a randomised clinical trial the propensity score for receiving the study drug would be 0.5 for all patients. Now, if a propensity score has successfully been derived, the evaluation of the effectiveness of an intervention could be done according to one of the above-mentioned procedures:

- Parallelisation: patients with a very high or low propensity score (e.g. >0.8 and
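The first step, fitting the intervention on the confounders, can be sketched from scratch; this is a minimal gradient-descent logistic regression on synthetic data, purely illustrative (a real analysis would use an established statistics package):

```python
import math

def fit_propensity(X, treated, lr=0.1, epochs=2000):
    """Logistic regression of the intervention flag on the confounders,
    via plain stochastic gradient descent -- a sketch, not a production fitter."""
    w = [0.0] * (len(X[0]) + 1)              # intercept + one weight per confounder
    for _ in range(epochs):
        for x, y in zip(X, treated):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))   # current P(intervention | x)
            err = y - p
            w[0] += lr * err
            for j, xj in enumerate(x):
                w[j + 1] += lr * err * xj
    return w

def propensity(w, x):
    """Conditional probability of receiving the intervention given x.
    The outcome (e.g. mortality) plays no role in this calculation."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic example with one confounder (say, injury severity):
# the intervention becomes more likely as severity increases
X = [[0.0], [1.0], [0.0], [1.0], [2.0], [2.0]]
got_intervention = [0, 0, 0, 1, 1, 1]
w = fit_propensity(X, got_intervention)
```

In a second step, patients would be stratified, matched, or filtered on this score, as listed above; in a randomised trial the fitted score would hover around 0.5 for everyone.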
