Dealing with deficient and missing data

Ian R. Dohoo
Department of Health Management, Atlantic Veterinary College, University of Prince Edward Island, Charlottetown, PEI C1A 4P3, Canada

Article history: Received 27 January 2015; received in revised form 25 March 2015; accepted 8 April 2015

Keywords: Descriptive data; Analytical data; Bias; New disease; Missing data; Multiple imputation

Abstract

Disease control decisions require two types of data: data describing the disease frequency (incidence and prevalence) along with characteristics of the population and environment in which the disease occurs (hereafter called "descriptive data"); and data for analytical studies (hereafter called "analytical data") documenting the effects of risk factors for the disease. Both may be either deficient or missing. Descriptive data may be completely missing if the disease is a new and unknown entity with no diagnostic procedures or if there has been no surveillance activity in the population of interest. Methods for dealing with this complete absence of data are limited, but the possible use of surrogate measures of disease will be discussed. More often, data are deficient because of limitations in diagnostic capabilities (imperfect sensitivity and specificity). Developments in methods for dealing with this form of information bias make this a more tractable problem. Deficiencies in analytical data leading to biased estimates of effects of risk factors are a common problem, and one which is increasingly being recognized, but options for correction of known or suspected biases are still limited. Data about risk factors may be completely missing if studies of risk factors have not been carried out. Alternatively, data for evaluation of risk factors may be available but have "item missingness" where some (or many) observations have some pieces of information missing. There has been tremendous development in the methods to deal with this problem of "item missingness" over the past decade, with multiple imputation being the most prominent method. The use of multiple imputation to deal with the problem of item missing data will be compared to the use of complete-case analysis, and limitations to the applicability of imputation will be presented. © 2015 Elsevier B.V. All rights reserved.

1. Introduction

Making valid animal health decisions requires that reliable data be available to the decision-making process. These data are of two types: descriptive and analytical. Descriptive data will include:

• the frequency of disease – preferably incidence data, but prevalence data may be all that are available, and
• data about the host population and its environment – numbers at risk, characteristics of the population, and data about the environment in which the susceptible animals are kept (e.g. husbandry characteristics of domestic animals, environment of wildlife reservoirs).

Analytical data will include information about factors that influence the occurrence, severity or prognosis of the disease in question. Data may be deficient because they are incorrect (imprecise or inaccurate) or missing (either totally or partially).

The objectives of this manuscript are to describe each combination of type of data and type of deficiency, and to provide examples of each combination along with some suggested mechanisms for dealing with the problem. However, with multiple data types and four possible deficiencies, most will only be touched on very briefly. Emphasis will be placed on dealing with partially missing data in analytical studies because it is a common problem (virtually all analytical studies have this problem to some extent) and substantial progress has been made over the last few years in the development of methods to deal with this issue.

2. Incorrect descriptive data

Fig. 1. Spike plots of daily mortalities from a sample of 4 net pens (cages) from 2 sites from a study into a new disease (subsequently shown to be infectious salmon anemia) in the Bay of Fundy, New Brunswick, Canada. Spike plots were used to determine which cages had an outbreak, when it started and when it ended. Reprinted with permission from J Fish Diseases, 2005. 28: p. 646.

Incorrect disease data may suffer from imprecision (confidence around the estimate may be very wide) or inaccuracy (the estimate is wrong even if it has a very narrow confidence interval). Imprecision (in disease, population or environmental data) is a function of random error and can be dealt with by increasing the sample size from which the estimate is derived. This is only possible if the investigator has some control of the data generation process.

Inaccurate disease data can be compensated for by adjusting the disease frequency estimate using the sensitivity (Se) and specificity (Sp) of the test procedure used to generate the estimate – provided that reasonable estimates of the Se and Sp of the test are available (a small numerical sketch of this adjustment is given at the end of this section). For many test procedures (e.g. on-farm recording of clinical disease), the operating characteristics of the "test" are not known and validation studies may be required. Procedures for validation of data were covered in depth at the 2012 Schwabe Symposium (Emanuelson and Egenvall, 2014) and will not be discussed further. Given that there are often many steps involved prior to arriving at a recorded diagnosis of a disease event, scenario trees have been developed to account for this complexity (Christensen et al., 2014, 2011). Bayesian methods have been applied to allow for integration of previous knowledge of a disease situation in the estimation of the current situation (especially for the demonstration of disease freedom) (Gustafson et al., 2010; Heisey et al., 2014).

Incorrect information about the source population may be a function of incomplete or inaccurate registry data, and validation procedures are required to deal with this issue (Emanuelson and Egenvall, 2014). Environmental data may suffer from limitations in the methods used to collect the data and will often require validation through the use of more intensively collected data (e.g. "ground truthing" of normalized difference vegetation index (NDVI) data (Baghzouz et al., 2010) or satellite-derived temperature data (White-Newsome et al., 2013)).
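As an illustration of the Se/Sp adjustment referred to above, the following is a minimal Stata sketch using the familiar Rogan–Gladen estimator. All numbers are hypothetical and are not taken from any study cited in this paper.

```
* Minimal sketch: adjusting an apparent prevalence for imperfect Se and Sp
* (Rogan-Gladen estimator). All values below are hypothetical.
scalar ap = 0.12              // apparent (test-based) prevalence
scalar se = 0.90              // assumed test sensitivity
scalar sp = 0.98              // assumed test specificity
scalar tp = (ap + sp - 1) / (se + sp - 1)
display "estimated true prevalence = " %5.3f tp
```

With these assumed values the apparent prevalence of 12% corresponds to an estimated true prevalence of about 11.4%; the adjustment becomes much larger as Sp falls or as the apparent prevalence approaches (1 − Sp).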

Please cite this article in press as: Dohoo, http://dx.doi.org/10.1016/j.prevetmed.2015.04.006

I.R.,

3. Missing descriptive data

Disease frequency data may be missing because either the disease is unknown or it is a known disease but no surveillance programs are in place for the condition. If the disease is unknown, some surrogate measure of the disease (often based on clinical presentation) needs to be used. If the disease is known, it may be possible to conduct targeted surveys to determine the disease frequency.

In 1996/1997 an unknown disease hit the salmon aquaculture industry on the east coast of Canada. The pathology exhibited (primarily hemorrhagic lesions in the kidney) did not resemble that of any known infectious diseases and the possibility that the disease was non-infectious (e.g. environmental toxin) could not be ruled out. The disease affected some production sites (farms) and, within a site, some net-pens (cages of fish) were affected while others were not. Across net pens, tremendous differences in the mortality patterns were observed (Hammell and Dohoo, 2005a). In order to start investigating factors that might be related to the disease, it was necessary to classify net pens as having, or not having, an outbreak. In the absence of any diagnostic procedures for this unknown disease, mortality data were compiled and spike plots of mortality data (Fig. 1) for all net pens in the affected region were generated and shown to fish health specialists working in the region. Each specialist was asked to indicate if they felt the net pen had been
affected and when the start and end dates of the outbreak occurred. From these observations, a set of consensus rules for classifying pens as outbreaks or not was developed. Even once the disease was confirmed to be infectious salmon anemia (O'Halloran et al., 1999), the net pen classification was still used to investigate risk factors because laboratory diagnostic data were very limited (Hammell and Dohoo, 2005b).

More recently, an outbreak of an unknown disease affecting dairy farms in The Netherlands was investigated using milk production data to identify farms with a sudden and unexpected drop in milk production. The disease was subsequently identified as being associated with the Schmallenberg virus through the use of metagenomic analyses (Roberts et al., 2014; van den Brom et al., 2012).

If a disease is known but exists in a jurisdiction with no surveillance in place, targeted surveillance activities will be required to determine the incidence/prevalence of the condition. For example, over the period of 2012–2013 a one-health project coordinated by Massey University looked into both the animal and human prevalences of a number of zoonotic diseases in south-Asian countries. One of these studies evaluated risk factors for brucellosis in both animals and humans in Herat Province of Afghanistan. Years of civil strife had left the region with little organized surveillance. Despite the difficulties of collecting data and samples under such hostile conditions, the study team was able to assemble a very complete dataset which established that the seroprevalence of brucellosis in the region was approximately 5% in humans and 1.5% in animals. They also established a clear link between a household drinking raw milk and the risk of one of the family members being seropositive for Brucella (manuscript in preparation). A similar approach (targeted surveys) will be required in the case where key pieces of population or environmental data are missing.

4. Incorrect analytical data

Knowledge about risk factors for disease conditions is crucial for implementing effective disease control strategies. However, in many cases, the information available will be derived from observational studies and it is well known that such studies may be subject to a range of biases. The topic of bias in observational research was discussed at the 2012 Schwabe Symposium (Dohoo, 2014) and only a brief summary of some of that information will be repeated here.

As epidemiologists know, all biases can be classified as arising due to confounding, information bias or selection bias (although it is acknowledged that in some cases the distinction between the categories becomes blurred – such as misclassification of disease status in a case–control study being a source of selection bias). Bias arises when there is systematic (as opposed to random) error in a study, and increasing the size of the study does nothing to reduce the systematic bias. While tremendous progress has been made over the last 15 years in dealing more effectively with random error (primarily through the development of multilevel models to account for the clustering of data that is commonly present in animal health studies), much less attention has been paid to dealing with systematic error (bias).

There are three general approaches to dealing with bias: qualitative bias analysis, quantitative bias analysis and adjustment for bias during the analysis. The first two procedures are "post-hoc" in that they are applied to adjust an observed result after it has been computed.

Qualitative bias analysis seeks to identify the direction and approximate magnitude of any bias that may have affected the study. While it is very helpful, it does suffer from some limitations. It is invariably somewhat subjective in nature, so attempts to include a qualitative bias analysis in published studies may encounter editorial resistance. It is also very difficult to consider more than a single key bias in any study (Lash et al., 2009); the effects of multiple biases may be counteracting and it can be difficult to determine which effect will predominate.

Quantitative bias analysis seeks to adjust an observed association for known or suspected biases to produce an estimate that is free from systematic error (Lash et al., 2009). The application of quantitative bias analysis is limited by two key factors. First, the analysis requires quantitative estimates of the "bias parameters". For example, to adjust for misclassification of the exposure in a study, the sensitivity (Se) and specificity (Sp) of the method used to measure the exposure must be known. Unless good estimates are available, investigators are often reluctant to proceed. However, "best guesses" are invariably preferable to ignoring the bias (which is equivalent to assuming both the Se and Sp are 100%) (Lash et al., 2009). The second limitation is that software for carrying out these analyses (e.g. -episens- in Stata and spreadsheets accompanying the text by Lash et al.) has only been developed for categorical exposures and dichotomous outcomes. A simple example of the calculations involved is sketched below.

Adjustment for systematic error during the analysis process is, perhaps, an ideal solution and has been applied in some situations. Many of these applications will be carried out in a Bayesian framework. For example, Dufour et al. (2012) used a Bayesian model to adjust for outcome misclassification in a study of risk factors for coagulase-negative intra-mammary infections in dairy cattle. Selection bias can be accounted for using survey methods with the appropriate sampling weights. Structural equation models may be used to adjust for measurement error and, although these procedures are most developed for linear models, they have recently been extended to include non-linear models. Needless to say, these methods also require the specification of the relevant bias parameters.
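To make the idea of quantitative bias analysis concrete, the following is a minimal, purely illustrative Stata sketch of a simple (nondifferential) correction of a 2 × 2 table for exposure misclassification. The counts and the assumed Se and Sp of the exposure measurement are hypothetical and are not drawn from any study cited here; in practice a dedicated tool such as -episens- or the spreadsheets of Lash et al. would be used, and probabilistic versions of the analysis draw the bias parameters from distributions rather than fixing them at single values.

```
* Hypothetical 2x2 table: a = exposed cases,    b = unexposed cases,
*                         c = exposed controls, d = unexposed controls
scalar a = 120
scalar b = 180
scalar c = 150
scalar d = 350
* Assumed (hypothetical) Se and Sp of the exposure measurement
scalar se = 0.85
scalar sp = 0.95
* Back-calculate the "true" exposed counts within each outcome group
scalar A = (a - (1 - sp)*(a + b)) / (se + sp - 1)
scalar C = (c - (1 - sp)*(c + d)) / (se + sp - 1)
scalar B = (a + b) - A
scalar D = (c + d) - C
display "observed OR  = " %5.2f (a*d)/(b*c)
display "corrected OR = " %5.2f (A*D)/(B*C)
```

With these assumed bias parameters the corrected odds ratio (about 1.7) is further from the null than the observed one (about 1.6), illustrating the usual attenuating effect of nondifferential misclassification.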

5. Missing analytical data

5.1. Completely missing analytical data

In situations where a new disease is being investigated, no information about risk factors will be available. In some situations, knowledge of diseases with similar pathology, but in different species or populations, may be available to guide investigations. This was certainly the case with bovine spongiform encephalopathy (BSE), with its similarity to scrapie in sheep. However, with new or unknown diseases, broad-spectrum investigations will be required to identify the most important risk factors. While these investigations are sometimes derogatorily referred to as "fishing trips", they may be required when risk factor information is completely absent. Three examples will highlight this need.

When BSE first appeared in the UK, Wilesmith and co-workers cast the net widely in terms of evaluating chemical exposures, genetic factors and feedstuffs as potentially being associated with this new disease (Wilesmith et al., 1988). This was followed by more targeted studies which clearly identified feeding of ruminant by-products to dairy cattle as the most important risk factor for BSE.

In the investigation of the disease originally known as HKS (later confirmed as ISA) referred to above, longitudinal studies of risk factors for outbreaks at the pen level considered a wide range of factors that might have been associated with dissemination of an infectious agent or modification of the host's resistance (Hammell and Dohoo, 2005b; McClure et al., 2005). Proximity to other infected pens was confirmed to be a key risk factor (McClure et al., 2005).


Finally, in 1988, veterinary diagnostic laboratories in Canada reported a much higher proportional prevalence of Nocardia mastitis (i.e. an increase in the proportion of milk samples yielding a positive culture for Nocardia spp.). Two case–control studies of farms that had experienced one or more cases were carried out in two provinces (Ontario and Nova Scotia) and each looked into a wide range of management and hygienic factors that might have been associated with the disease (Ferns et al., 1991; Stark and Anderson, 1990). These investigations clearly identified the use of neomycin-based dry-cow intra-mammary infusions as the key risk factor – although Nocardia was never isolated from any of the suspect products. Clearly, wide-ranging investigations can produce very useful results in the face of new disease challenges.

5.2. Partially missing analytical data (item missing)

Anyone who has carried out observational studies of risk factors for disease has encountered the problem of missing data in some of the variables of interest. The usual approach to dealing with this problem is to carry out a complete-case analysis (CCA) (also known as listwise deletion) in which only observations for which all variables are complete (no missing values) are used in the analysis. The obvious problem with this is that it throws out useful data that have been recorded, and if the missing values are scattered among many variables in a multivariable analysis, the proportion of records discarded may be quite high (a short sketch of this is shown after the list below). The general perception is that the complete-case analysis will provide unbiased estimates of the parameters of interest (although with reduced precision), but it is not clear that this is, in fact, the case.

An alternative approach to a complete-case analysis using all variables is to delete the variables with more than a specified proportion of missing values (followed by a complete-case analysis of the remaining variables). If the deleted variable(s) is unimportant as a risk factor, this has the beneficial effect of avoiding the deletion of a lot of observations. However, if the variable is important as a risk factor, as an interacting variable or as a confounder, deletion of the variable is not appropriate. This approach should only be considered if analysis of the available data strongly supports the notion that the variable(s) to be deleted is unimportant.

Recently, alternative methods for dealing with missing data have become more widely available. These newer approaches include maximum likelihood estimation of a model which accounts for missing values, and multiple imputation (MI). The latter is generally more accessible and will be the focus of this section. The topics touched on briefly below include:

• missing data patterns
• missing data mechanisms
• an overview of options for dealing with missing data
• a very brief overview of multiple imputation, and
• results from a small simulation study comparing CCA and MI
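Before turning to those topics, the scale of the complete-case problem is easy to check in practice. The short Stata sketch below counts how many records would survive listwise deletion; the variable names (x1–x10) are hypothetical placeholders, not variables from any study described in this paper.

```
* Sketch: how many records would a complete-case analysis retain?
* x1-x10 are hypothetical predictor variables in the data set in memory.
misstable summarize x1-x10      // extent of missingness per variable
misstable patterns x1-x10       // patterns of missingness across variables
egen nmiss = rowmiss(x1-x10)    // number of missing values per record
count if nmiss == 0             // records surviving listwise deletion
```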

Discussion of these topics is, of necessity, brief and the reader is referred to one of several texts on the subject for more details (Carpenter and Kenward, 2013; Graham, 2012; Heeringa et al., 2010; Little and Rubin, 2002; StataCorp, 2013; van Buuren, 2012).

5.2.1. Patterns of missing data

The first step in deciding on a strategy to deal with missing data is to determine the pattern of missingness (Fig. 2).

Fig. 2. Patterns of missing data. The generalized pattern of missingness is most commonly encountered.

The most commonly encountered pattern is one of generalized missingness in which there is no obvious pattern to the missing values. In some situations, data may be missing by design, in which a response to one question precludes responses to following questions (e.g. answering "no" to "Do you use teat dip?" precludes answers to questions about product used and method of application). If the variables with

missing values can be ordered in such a way that once a missing value appears, all subsequent variables also have missing values, the data are called "monotone missing". This has implications for the types of analyses possible but is not a common occurrence and will not be discussed further. One type of missing pattern not shown in Fig. 2 is specific to longitudinal data with repeated measures. If one of the sampling times is missed, all data from that particular time point will be missing. This is called "wave missing" and, in general, wave missing data can be more easily imputed because observations of the same variables from the preceding and following observation times may be present.

5.2.2. Missing data mechanisms

The second step in deciding on a strategy to deal with missing data is to consider "why are the data missing?" – the missing data mechanism. There are three mechanisms by which data may be missing: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR – also called MNAR).

MCAR means, as its name implies, that the probability of a value being missing in a variable is a completely random event. An example of this might be a blood sample being dropped in the laboratory and not being available for analysis. The probability of being missing is the same for all observations in the dataset and is not related to any other data recorded in the study. Asymptotically, CCA of data with some MCAR observations will produce unbiased estimates of all parameters; however, in any particular dataset, this property may not hold.

MAR means that the probability of being missing depends only on the observed data (i.e. the probability of being missing can be fully explained by variables recorded in the dataset). For example, if, in a study of dog behavior, you ask the question "Is your dog allowed to run free?" and rural residents are more likely to answer the question (whatever their answer is) than urban residents, the data become MAR if the residence (rural vs urban) is recorded with no missing values in the dataset. These data are called missing at random because once the residence of the respondent is accounted for, it is a random process that determines whether the values are missing or not.

Data are NMAR if the probability of being missing depends on unobserved data. Using the example above, data would be NMAR if owners who do let their dog run free are less likely to respond to the question. In this case, the probability of being missing is related to a value which is itself missing. The probability of being missing
may also be related to some completely unknown (and unmeasured) factor, which will produce NMAR data if this unknown factor is related to the exposure and/or outcome.

There is no statistical way to determine if the data are MCAR, MAR or NMAR, so the question bears careful reflection. It is quite conceivable that different variables have different missingness mechanisms, and the mechanism determines the expected impact of using MI instead of CCA.

5.2.3. Options for dealing with item missing data

There are multiple options for dealing with missing data and these will just be reviewed briefly here.

• Avoid – Obviously this is the best approach. While not always possible, the study of brucellosis in Afghanistan referred to above generated a dataset with only 0.1% of missing values in key variables. This was a remarkable achievement given the extremely difficult circumstances under which the study was carried out.
• Complete-case analysis (listwise deletion). This will be discussed further in the presentation of the simulation results shown below.
• Pairwise deletion (also called available-case deletion). The means of all variables and covariances of all pairs of variables are computed based on all available data. These values are subsequently used as the basis of all analyses. This method is difficult in practice and not widely used.
• Missing value indicator. It has become a relatively common practice to replace missing values of a categorical variable with an indicator and then estimate the effect of this indicator on the outcome of interest. For example, following assignment of a missing value indicator, breeds considered in a study may be: Holstein, Jersey, Ayrshire or missing. While this method is convenient and may prevent exclusion of a lot of observations, it has some serious drawbacks. First, if the indicator is statistically significant, the interpretation of the effect becomes difficult. More seriously, it has been shown that this approach may produce seriously biased estimates, even if the data are MCAR. This method should not be used.
• LOCF/BOCF. Last observation carried forward (LOCF) or baseline observation carried forward (BOCF) are methods that may be used in longitudinal (repeated measures) data. However, they have generally been supplanted by superior imputation methods.
• Maximum likelihood estimation. This approach has already been mentioned and it involves building a model which accounts for missing values. Asymptotically, it produces the same results as MI but has a couple of drawbacks. First, software for these methods is not as generally accessible (these models can be fit using structural equation modeling software) and they are not as conducive to including "auxiliary variables" (described below). The reader is referred to Chapters 3–5 of Enders (2010) for further information on this topic.
• Mean imputation. This involves replacing the missing values with the mean of the variable. While fast and simple, this results in a variable with less variance than it should have and frequently produces biased estimates of parameters. It is no longer recommended.
• Single imputation. In this method, all of the missing values are replaced with values predicted from either an explicit or implicit model. With an explicit model, a distribution for the missing values is specified and predicted values are generated from some sort of regression model (e.g. linear regression for a normally distributed variable). With an implicit model, each missing observation is replaced with a value for that variable which exists in the data set. For example, if all variables are categorical, "hot deck imputation" can be used by randomly choosing a non-missing value from the observations that have the same combination of other variables.
(If one or more variables are continuous, then it is possible to use close matches to define the set of observations to choose from.) Single imputation is a well-established practice but has two substantial drawbacks. The first is that, being a stochastic process, repeated analyses of the data may produce divergent results. The second is that it does not take into account the uncertainty induced by imputation when computing the standard errors (SEs) of the parameters of interest.
• Multiple imputation. This has become the most widely available and easily implemented approach to dealing with missing data. It is the subject of the rest of this paper.

5.3. Multiple imputation of item missing data

Conceptually, multiple imputation is a very straightforward procedure with three distinct steps.

• Generate multiple complete datasets with all missing values filled in by some form of "imputation model". This imputation model may be something as simple as a linear regression or somewhat more complicated (e.g. predictive mean matching). A key feature of this step is that the variables used to predict the missing values should include all of the variables to be subsequently used in the "analysis" model, but may also include a number of additional variables (referred to as "auxiliary variables").
• For each complete data set, estimate the parameter(s) of interest based on an "analysis" model. This will be the same model as would be used in a CCA (e.g. odds ratios from a logistic regression model).
• Combine these multiple sets of estimates into a single summary set of estimates using a procedure known as Rubin's rules. The overall parameter estimates will be simple averages of all of the individual estimates, but the variance of those estimates will reflect both the within-data set variance and the between-data set variance – thus accounting for the additional uncertainty induced by the imputation process.

Of course, in practice, there are many details to be considered, but a complete discussion of imputation procedures is a subject for a text (see texts cited above), not for this manuscript. A short Stata sketch of these three steps, applied to the example data introduced in the next section, is shown below.

5.4. Multiple imputation – simulation example

Rather than going through a detailed discussion of methods of imputing data, I will present the results of a small simulation exercise that demonstrates the relative abilities of CCA and MI. More extensive simulations have been published (for example, Groenwold et al., 2012) for interested readers.

The data are derived from a survey of infectious diseases of dairy cattle in eastern Canada (VanLeeuwen et al., 2001). A total of 2400 complete observations (no missing values) were extracted and a logistic model fit to evaluate how lactation number (-lct-), leukosis status (-leu-), and days in milk (-dim-) affected the probability of a cow testing positive for Neospora caninum antibodies. (To simplify things, clustering of observations within herds was ignored, with some more discussion of this aspect of the data considered below.) Results from the model were assumed to be "the truth". Of the 2400 cows, 1940 were negative for -neo- and 460 were positive; -lct- varied from 1 to 13; -dim- was rescaled by dividing by 100 (average = 1.91); and 521 cows were positive for -leu- while 1879 were negative. The logistic regression model that shows the "true" results (complete data) is presented in Table 1.

Missing data were then created in -lct- and -leu- by each of the three missing data mechanisms (MCAR, MAR, NMAR). The number of missing values in each of the variables ranged from 1% (n = 24 missing) up to 20% (n = 480 missing) (8 different values selected).
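The following is a minimal Stata sketch of the three-step MI workflow described in Section 5.3, applied to a dataset containing the variables named above (neo, lct, leu, dim). It is purely illustrative and is not the code used for the simulations reported here; the choice of imputation methods (logistic regression for the binary -leu-, predictive mean matching for -lct-), the number of imputations and the seed are all assumptions of the sketch.

```
* Step 0 (optional): inspect the extent and pattern of missingness
misstable summarize lct leu dim
misstable patterns lct leu dim

* Step 1: generate multiple completed datasets (here, 20 imputations)
mi set wide
mi register imputed lct leu
mi impute chained (logit) leu (pmm, knn(5)) lct = neo dim, add(20) rseed(12345)

* Steps 2 and 3: fit the analysis model in each completed dataset and
* pool the results with Rubin's rules (both handled by -mi estimate-)
mi estimate: logit neo lct leu dim
```

The pooled output reports the averaged coefficients together with standard errors that incorporate both the within-imputation and between-imputation variance, which is exactly the accounting for imputation uncertainty described in the third step above.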

Table 1
Logistic regression of neosporosis (neo) on three predictors. Based on the complete data set (n = 2400) used in the simulation studies (i.e. the "truth").

Variable                 Coefficient   Standard error   P      95% conf. interval
Lactation number (lct)     −0.08           0.03         0.01     −0.14, −0.02
Leukosis (leu)              0.48           0.12         0.00      0.25,  0.71
Days in milk (dim)          0.08           0.05         0.14     −0.02,  0.18
Intercept                  −1.49           0.15         0.00     −1.79, −1.20

Fig. 3. Average percent bias in the estimate of the coefficients for the effects of lactation number (lct), leukosis (leu) and days in milk (dim) on the logit of the probability of being seropositive for Neospora caninum. Derived from simulations of data that were missing completely at random (MCAR), at random (MAR), or not at random (NMAR). (Dashed lines from CCA and solid lines from MI analyses.).

Fig. 4. Standard deviation of estimates of the coefficients for the effects of lactation number (lct), leukosis (leu) and days in milk (dim) on the logit of the probability of being seropositive for Neospora caninum across the 1000 simulations. Derived from simulations of data that were missing completely at random (MCAR), at random (MAR), or not at random (NMAR). (Dashed lines from CCA and solid lines from MI analyses.).

Fig. 5. Proportion of estimates of the coefficients for the effects of lactation number (lct), leukosis (leu) and days in milk (dim) on the logit of the probability of being seropositive for Neospora caninum that were within 20% of the “truth” (based on complete data). Derived from simulations of data that were missing completely at random (MCAR), at random (MAR), or not at random (NMAR). (Dashed lines from CCA and solid lines from MI analyses.).

Because missing values were created in two variables, this meant that, at most, about 35% of observations had one or two missing values. For MCAR missing values, records to be made missing in each variable were selected by a random number generator. For MAR data, the probability of being missing was made a function of the variable -neo- (the outcome) and -dim-. For NMAR data, the probability of being missing was made a function of -neo-, -leu- and -lct-. These last two variables had missing values created in themselves, so missingness could no longer be fully explained by observed data.

For each combination of missingness mechanism and number of missing values, 1000 simulations were run (a sketch of one such simulation loop is shown below). In each simulation, missing values were created and then the data analyzed using both CCA and MI. Results from the 1000 simulations were compiled to estimate:

• average % bias for each of the three regression coefficients
• variability in individual coefficient estimates (computed as the SD of the estimates divided by the absolute value of the estimate)
• % of individual estimates that fell within ±20% of "the truth"

Fig. 3 shows the average % bias for MCAR, MAR and NMAR data for both CCA and MI. For MCAR data, there was, on average, no bias in the estimates of any of the coefficients. For both MAR and NMAR missing data, there was evidence of bias which increased as the percent of missing observations went up. For the MAR results the biases are all positive while for the NMAR data they are negative. However, this is a function of how the missing values were created, and different coding for generating the missing values could reverse the directions. In all cases, the bias was smaller for the MI results than the CCA results. In the case of the coefficient for -dim-, the bias appears to become very large (up to approximately 600% for CCA of MAR data). However, this is a function of the fact that this coefficient was very small and statistically non-significant, meaning that small changes in the absolute value of the estimate would have resulted in very large % changes.
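As an illustration only, the sketch below shows one way such a comparison could be set up in Stata for the MCAR mechanism; the MAR and NMAR mechanisms would replace the two -runiform()- lines with missingness models based on -neo- and -dim-, or on -leu- and -lct-, respectively. This is not the code used to generate the results reported here: the 10% missingness level, the number of imputations and the seed are arbitrary, and only the -leu- coefficient is collected.

```
* Sketch: compare CCA and MI over repeated replicates of MCAR missingness.
* Assumes the complete data set (neo, lct, leu, dim) is in memory.
set seed 20150408
tempname sim
postfile `sim' b_cca b_mi using simresults, replace
forvalues i = 1/1000 {
    preserve
    * create roughly 10% MCAR missingness in lct and leu
    quietly replace lct = . if runiform() < 0.10
    quietly replace leu = . if runiform() < 0.10
    * complete-case analysis (Stata drops incomplete records by default)
    quietly logit neo lct leu dim
    scalar bcca = _b[leu]
    * multiple imputation followed by pooling with Rubin's rules
    quietly mi set wide
    quietly mi register imputed lct leu
    quietly mi impute chained (logit) leu (pmm, knn(5)) lct = neo dim, add(10)
    quietly mi estimate, post: logit neo lct leu dim
    post `sim' (bcca) (_b[leu])
    restore
}
postclose `sim'
use simresults, clear
summarize b_cca b_mi
```

Summary measures such as the average % bias and the proportion of estimates falling within 20% of the "true" coefficient can then be computed from the saved replicates.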

While Fig. 3 shows the average % bias (averaged over all 1000 simulations) for each missing data mechanism – coefficient combination, it is instructive to consider how variable the estimates from individual simulations were. Fig. 4 shows the standard deviation (SD) of the 1000 estimates expressed as a percent of the mean estimate for each coefficient (for MCAR, MAR and NMAR data for both CCA and MI). The SDs increased with the percent missing and ranged as high as approximately 100% of the mean coefficient estimate. In all cases, the SD of the MI estimates was lower than that of the CCA estimates, indicating that for any individual study, an MI analysis would produce a result closer to its long-term average. For -dim- the difference was dramatic, reflecting the fact that the CCA analyses were unable to produce consistent results across simulations for this non-significant coefficient.

Ultimately, we would like to know, "Is a CCA or an MI analysis more likely to give me a result close to the 'truth'?" Fig. 5 shows the % of individual estimates that were within 20% of the known true value for MCAR, MAR and NMAR data for both CCA and MI. In all cases, the MI analyses performed better than the CCA. For MCAR data, the differences were small except for -dim-. For MAR data, the differences were more substantial. When 10% of observations had one or more missing values, the estimate of the coefficient for -leu- was "correct" (i.e. within 20% of the true value) approximately 85% of the time in the MI analyses but only about 50% of the time for CCA. Neither method was good at generating "correct" estimates for NMAR data except when the % missing was quite low, but even then the performance of MI exceeded that of CCA.

The actual values for each of the parameters monitored (e.g. % bias) would depend on exactly how the missing data were generated. Nevertheless, the pattern is clear: MI analyses always outperformed CCA.

Despite the above conclusions, there are some important limitations to the use of MI. First, the above analyses focused solely on missing values for predictor variables. It has been reported that MI does not work well for imputing dependent variables except
in the case of repeated measures data, where preceding and subsequent measures may make imputation of the outcome feasible (von Hippel, 2007). Secondly, these analyses all ignored the fact that the data were clustered (cows within herds). Methods for imputing multilevel data are not nearly as well developed as methods for analysing multilevel data, and the software is not straightforward to use. Nevertheless, it seems likely that the use of imputation will become a regular tool in the epidemiologist's toolbox.

6. Conclusions

Problems with data in epidemiological investigations are inevitable. Probably the most serious is when an investigator is faced with an unknown disease for which no diagnostic methods are available. By way of several examples, it has been shown how surrogate measures may be used successfully to proceed with an investigation until such time as the nature of the disease is determined.

Partially missing data occur in virtually all epidemiological investigations. Recent developments in statistical software have made methods for dealing with this problem much more broadly available to researchers. In particular, multiple imputation is now very feasible for many veterinary epidemiological studies. By way of a small simulation study, it was shown that multiple imputation may provide much more reliable estimates of parameter effects than the traditional complete-case analysis.

References

Baghzouz, M., Devitt, D.A., Fenstermaker, L.F., Young, M.H., 2010. Monitoring vegetation phenological cycles in two different semi-arid environmental settings using a ground-based NDVI system: a potential approach to improve satellite data interpretation. Remote Sens. 2010.
Carpenter, J.R., Kenward, M.G., 2013. Multiple Imputation and its Applications. Wiley, Chichester, UK, 345 pp.
Christensen, J., Stryhn, H., Vallieres, A., El Allaki, F., 2011. A scenario tree model for the Canadian notifiable avian influenza surveillance system and its application to estimation of probability of freedom and sample size determination. Prev. Vet. Med. 99 (2–4), 161–175.
Christensen, J., El Allaki, F., Vallieres, A., 2014. Adapting a scenario tree model for freedom from disease as surveillance progresses: the Canadian notifiable avian influenza model. Prev. Vet. Med. 114 (2), 132–144.
Dohoo, I.R., 2014. Bias – is it a problem, and what should we do? Prev. Vet. Med. 113 (3), 331–337.
Dufour, S., Dohoo, I.R., Barkema, H.W., Descoteaux, L., Devries, T.J., Reyher, K.K., et al., 2012. Epidemiology of coagulase-negative staphylococci intramammary infection in dairy cattle and the effect of bacteriological culture misclassification. J. Dairy Sci. 95 (6), 3110–3124.

Emanuelson, U., Egenvall, A., 2014. The data – sources and validation. Prev. Vet. Med. 113 (3), 298–303.
Enders, C.K., 2010. Applied Missing Data Analysis. Guilford Press, New York, NY.
Ferns, L., Dohoo, I., Donald, A., 1991. A case–control study of Nocardia mastitis in Nova Scotia dairy herds. Can. Vet. J. 32, 673–677.
Graham, J.W., 2012. Missing Data. Springer, New York, NY, 323 pp.
Groenwold, R.H., Donders, A.R., Roes, K.C., Harrell Jr., F.E., Moons, K.G., 2012. Dealing with missing outcome data in randomized trials and observational studies. Am. J. Epidemiol. 175 (3), 210–217.
Gustafson, L., Klotins, K., Tomlinson, S., Karreman, G., Cameron, A., Wagner, B., et al., 2010. Combining surveillance and expert evidence of viral hemorrhagic septicemia freedom: a decision science approach. Prev. Vet. Med. 94 (1–2), 140–153.
Hammell, K.L., Dohoo, I.R., 2005a. Mortality patterns in infectious salmon anaemia virus outbreaks in New Brunswick, Canada. J. Fish Dis. 28 (11), 639–650.
Hammell, K.L., Dohoo, I.R., 2005b. Risk factors associated with mortalities attributed to infectious salmon anaemia virus in New Brunswick, Canada. J. Fish Dis. 28 (11), 651–661.
Heeringa, S.G., West, B.T., Berglund, P.A., 2010. Applied Survey Data Analysis. Chapman & Hall/CRC, Boca Raton, FL, 462 pp.
Heisey, D.M., Jennelle, C.S., Russell, R.E., Walsh, D.P., 2014. Using auxiliary information to improve wildlife disease surveillance when infected animals are not detected: a Bayesian approach. PLOS ONE 9 (3), e89843.
Lash, T., Fox, M., Fink, A., 2009. Applying Quantitative Bias Analysis to Epidemiologic Data. Springer, New York.
Little, R.J.A., Rubin, D.B., 2002. Statistical Analysis with Missing Data. Wiley, New York.
McClure, C.A., Hammell, K.L., Dohoo, I.R., 2005. Risk factors for outbreaks of infectious salmon anemia in farmed Atlantic salmon, Salmo salar. Prev. Vet. Med. 72 (3–4), 263–280.
O'Halloran, J.L., L'Aventure, J.P., Groman, D.B., Reid, A.M., 1999. Infectious salmon anemia in Atlantic salmon. Can. Vet. J. 40 (5), 351–352.
Roberts, H.C., Elbers, A.R., Conraths, F.J., Holsteg, M., Hoereth-Boentgen, D., Gethmann, J., et al., 2014. Response to an emerging vector-borne disease: surveillance and preparedness for Schmallenberg virus. Prev. Vet. Med. 116 (4), 341–349.
Stark, D.A., Anderson, N.G., 1990. A case–control study of Nocardia mastitis in Ontario dairy herds. Can. Vet. J.
StataCorp, 2013. Stata Multiple Imputation Reference Manual. Stata Press, College Station, TX, 373 pp.
van Buuren, S., 2012. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL, 316 pp.
van den Brom, R., Luttikholt, S.J., Lievaart-Peterson, K., Peperkamp, N.H., Mars, M.H., van der Poel, W.H., et al., 2012. Epizootic of ovine congenital malformations associated with Schmallenberg virus infection. Tijdschr. Diergeneeskd. 137 (2), 106–111.
VanLeeuwen, J.A., Keefe, G.P., Tremblay, R., Power, C., Wichtel, J.J., 2001. Seroprevalence of infection with Mycobacterium avium subspecies paratuberculosis, bovine leukemia virus and bovine viral diarrhea virus in Maritime Canada dairy cattle. Can. Vet. J. 42, 193–198.
von Hippel, P.T., 2007. Regression with missing Ys: an improved strategy for analyzing multiply imputed data. Sociol. Methodol. 37, 83–117.
White-Newsome, J.L., Brines, S.J., Brown, D.G., Dvonch, J.T., Gronlund, C.J., Zhang, K., et al., 2013. Validating satellite-derived land surface temperature with in situ measurements: a public health perspective. Environ. Health Perspect. 121 (8), 925–931.
Wilesmith, J.W., Wells, G.A., Cranwell, M.P., Ryan, J.B., 1988. Bovine spongiform encephalopathy: epidemiological studies. Vet. Rec. 123 (25), 638–644.
