JDR Clinical Research Supplement

vol. XX • issue X • suppl no. X

Perspective

Promises and Pitfalls in the Use of “Big Data” for Clinical Research

T.A. DeRouen1*

Key Words: biostatistics, clinical outcomes, clinical studies/trials, comparative effectiveness research, epidemiology, data mining

In the past, a frequent criticism of clinical dental research has been that many studies lack credibility because of their small sample sizes. One of the hot topics in biomedical and other kinds of research these days is the use of “big data.” Is this an approach that promises to overcome the sample size issue in clinical dental research?

The term big data is not well defined; it generally refers to a situation where the amount of data available to address an issue exceeds what has traditionally been available by several orders of magnitude. The increasing availability and decreasing expense of computing technology to capture and manage extremely large data sets have led to a situation where the goal seems to be to accumulate as much data on as many variables as possible. The problem then becomes one of making sense of all those data.

Interestingly, although statisticians spend careers trying to untangle the intricacies, interrelationships, and interpretation of data sets large and small, the current debate over what to do with big data does not often take the statistical point of view into account. Computer scientists and engineers, as well as professionals with expertise in how to capture and manage humongous amounts of data, are the ones leading the discussion. That can lead to repetition of the same kinds of statistical pitfalls that were discovered in the past with smaller data sets, only now with bigger data.

The December 2014 issue of Significance—a translational magazine jointly published by the American Statistical Association and the Royal Statistical Society—republished an article from the Financial Times by economist, journalist, and broadcaster Tim Harford titled “Big Data: Are We Making a Big Mistake?,” based on a lecture that he gave at the Royal Statistical Society International Conference. In the article, which I recommend, he warns us “not to forget the statistical lessons of the past as we rush to embrace the big data future.” What I hope to do here is point out some of the pitfalls and misunderstandings that can occur as we embrace the opportunities for the use of big data in clinical research.

Generally, big data in clinical research come in 2 forms: data on a huge number of variables per person or data on a huge number of persons. Sometimes there are both, but the 2 forms typically introduce different kinds of statistical problems.

A recent and familiar example of the first situation—the availability of data on a huge number of variables per person—comes from studies of the human genome. Genomewide association studies look for associations of human conditions or diseases with characteristics of the human genome. Although some studies target preidentified genomic characteristics with prespecified hypotheses, other studies basically go on “fishing expeditions” to see what associations pop up in the analysis. That can lead to an enormous number of tests of associations (often called the multiple comparisons problem) and, therefore, to the opportunity for multiple false-positive findings due to chance alone. Attempts to use large numbers of covariates from the data sets in regression analyses can lead to a situation where the number of covariates approaches the number of patients, which causes problems in estimating the parameters of the regression model. Anyone embarking on these kinds of studies should include a collaborating statistician with expertise in these issues, to avoid obtaining “findings” that are flawed for these statistical reasons.

The second kind of big data in clinical research comes from collections of data on a large number of people, often collected for purposes other than answering the research question of interest but which may contain data relevant to that question.
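The multiple comparisons pitfall described above can be made concrete with a short simulation (illustrative only, not from the article): when many truly null associations are each tested at a conventional significance level, false positives are guaranteed by chance alone, and a family-wise correction such as Bonferroni is one standard safeguard.

```python
import random

# Illustrative simulation (not from the article): 1,000 genomic markers
# with NO true association to disease, each tested at alpha = 0.05.
# Under the null hypothesis, p-values are uniform on [0, 1], so about
# 5% of tests come out "significant" by chance alone.
random.seed(1)

n_tests = 1000
alpha = 0.05

p_values = [random.random() for _ in range(n_tests)]

false_positives = sum(p < alpha for p in p_values)
print(f"Uncorrected: {false_positives} of {n_tests} null tests flagged")

# Bonferroni correction: divide alpha by the number of tests, which
# controls the chance of even one false positive across the family.
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
print(f"Bonferroni-corrected: {bonferroni_hits} flagged")
```

Roughly 5% of the null tests are flagged without correction; almost none survive the corrected threshold.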

DOI: 10.1177/0022034515587863
1Center for Global Oral Health, School of Dentistry, University of Washington, Seattle, WA, USA; *corresponding author, [email protected]
© International & American Associations for Dental Research
Downloaded from jdr.sagepub.com by guest on September 18, 2015. For personal use only. No other uses without permission.


Sometimes the data arise from what Harford called “found data,” the “digital exhaust” from such things as web searches, credit card payments, and mobile phone tracking. He cites the example of Google Flu Trends, an exercise published by Google in the scientific journal Nature (Ginsberg et al. 2009), which demonstrated that flu trends in the United States could be tracked more quickly than by the methods of the Centers for Disease Control and Prevention, by utilizing information from Google searches about what people searched for and whether they had flu symptoms. The process was not based on any modeling of specific search terms and their relationship to specific flu symptoms but merely on a gross observed association that became a “black box” for predictions. The result was that the process worked for a couple of years, but over time things changed: people did searches for different reasons; the association used in the algorithm no longer held; and the Google Flu Trends predictions became inaccurate (Butler 2013). It is an example of how data analysis with no underlying theoretical model can end up being misleading. In a similar manner, using observed associations within data sets to estimate missing data points, if not based on theoretical models with biological plausibility, may exacerbate bias already present in the data set.

A very valuable source of high-quality big data that has been available for some time is the NHANES (National Health and Nutrition Examination Survey), whose data sets are obtained from national surveys of representative samples of the U.S. population and include a variety of health measures (including oral health). In some cases, the surveys are cross-sectional, providing a one-point-in-time assessment, so they cannot be used to establish causality but can be used to examine associations.
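The contrast between a representative survey sample and “found data” can be sketched with a small simulation (all numbers invented for illustration): if the data source over-represents people with a condition, collecting more records shrinks sampling variability around the wrong answer but never removes the bias.

```python
import random

# Hypothetical sketch (numbers invented): the true population
# prevalence of a condition is 10%, but a biased "found data" source
# over-represents affected people, so its prevalence is 20%. Larger
# samples converge ever more tightly on the WRONG value.
random.seed(0)

TRUE_PREVALENCE = 0.10   # in the whole population
FRAME_PREVALENCE = 0.20  # in the biased data source

for n in (100, 10_000, 1_000_000):
    cases = sum(random.random() < FRAME_PREVALENCE for _ in range(n))
    estimate = cases / n
    print(f"n = {n:>9,}: estimate = {estimate:.3f} "
          f"(true value {TRUE_PREVALENCE:.3f})")
```

At a million records the estimate is extremely precise, yet it is precisely biased: more data of the same kind cannot fix a nonrepresentative source.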
For example, when it was of interest to examine the association of bisphenol A with the number of dental sealants and composite restorations in children, the 2003–2004 NHANES was consulted because it contained measures of urinary bisphenol A, as well as oral examinations indicating the number of restorations and occlusal sealants, among 1,001 children. One shortcoming was that the number of restorations was not differentiated by type (amalgam, composite, etc.), so the associations were somewhat tenuous but still informative (McKinney et al. 2014). Longitudinal data are also available, as in the First NHANES Epidemiologic Follow-up Study, which was used to examine the associations between periodontal disease and cardiovascular disease in a series of papers (Hujoel et al. 2000, 2001, 2002).

Other examples of big data sets involving large numbers of people are insurance claims databases, hospital system discharge databases, and, perhaps in the future, integrated clinical record databases for large health care systems. Currently available databases, such as those for insurance claims and hospital discharges, are created for other purposes and pose many problems, often hidden, in their use for clinical research. For example, when planning a study to compare the efficacy of different materials used for direct pulp caps, we obtained access to a longitudinal dental insurance database that we could query to estimate the general success and failure rates of direct pulp caps. As we began mining the data, conversations with those who managed the data set revealed that in the past, both direct and indirect pulp caps had been covered by insurance, but for the period that we were examining, indirect pulp caps had ceased being covered. There was concern that some of the claims for direct pulp caps in the data set were really for indirect ones.
The insurer eventually resolved the issue by requiring additional documentation for claims, but the potential contamination of data that we thought were based purely on direct pulp caps, as well as the presumed higher success rates for indirect pulp caps, led us to conclude that the data were not reliable enough to use for our purposes. The point is that the problem was not evident in the raw data; it surfaced only through discussions with those familiar with the nature of the data.

Another example of the use of insurance claims data is a recently published study on the association between treatment for periodontitis and medical costs for systemic diseases, such as coronary artery disease, type 2 diabetes, cerebral vascular disease, and rheumatoid arthritis, as well as pregnancy (Jeffcoat et al. 2014; Jeffcoat 2015). In an examination of combined medical and dental insurance claims databases, 338,891 individuals were identified who had at least 1 dental claim for periodontitis treatment in a baseline year and who also had medical claims in the same year indicating an underlying diagnosis of 1 of the 5 systemic medical conditions. The patients were grouped into 2 categories according to periodontitis treatment: those designated as “treated” (≥4 treatments in the baseline year) and those designated as “untreated” (1, 2, or 3 treatments in the baseline year; i.e., the comparison group). The treated group consisted of approximately 1% of the patients. The treated and untreated groups were then compared with respect to subsequent medical insurance claim costs and hospitalizations for each of the 5 systemic conditions.

The findings, generally, were that the patients with ≥4 treatments for periodontitis in the baseline year had significantly fewer hospitalizations and lower medical costs for most of the systemic conditions. That observation is not disputable, and the authors do not directly claim that the lower hospitalizations and medical costs are attributable to the additional periodontal treatment. But some people have already concluded, on the basis of this study, that periodontal treatment saves medical costs. To reach any such conclusion from these data, one has to make significant assumptions.
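One way to see why such assumptions matter is a hypothetical simulation (invented numbers, not the Jeffcoat data): if an unmeasured trait such as health-consciousness drives both the number of dental visits and medical costs, the high-visit group will show lower costs even when the treatment itself has zero effect.

```python
import random

# Hypothetical sketch (all numbers invented): an unmeasured trait
# drives BOTH the number of periodontal visits and medical costs, so a
# cost gap appears between ">=4 visits" and "<4 visits" groups even
# though treatment has no effect on costs in this simulation.
random.seed(42)

patients = []
for _ in range(100_000):
    health_conscious = random.random() < 0.3  # unmeasured confounder
    # Health-conscious patients complete more visits...
    visits = random.choice([4, 5, 6]) if health_conscious else random.choice([1, 2, 3])
    # ...and independently incur lower medical costs (no treatment effect).
    base_cost = 8_000 if health_conscious else 12_000
    cost = random.gauss(base_cost, 2_000)
    patients.append((visits, cost))

def mean(xs):
    return sum(xs) / len(xs)

treated = [c for v, c in patients if v >= 4]
untreated = [c for v, c in patients if v < 4]
print(f"'treated' mean cost:   {mean(treated):,.0f}")
print(f"'untreated' mean cost: {mean(untreated):,.0f}")
```

The simulated “treated” group shows markedly lower costs purely through the confounder, which is exactly the kind of hidden bias a claims database cannot rule out on its own.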
To conclude that fully treating periodontitis saves medical costs, one has to assume that the 1% who had ≥4 treatments in a year were the only ones who were adequately treated and that the 99% who had 1, 2, or 3 treatments in a year were inadequately treated. But if the assumption is that the amount of


treatment given in a year was indicative of the severity of the disease—such that those who received <4 treatments had less severe disease and thus required less treatment—then both groups may have been treated adequately for their level of disease, and the lower medical costs in the “treated” group cannot be attributed to more complete periodontal care. The databases contain no clinical measures of periodontal disease severity with which to distinguish these possibilities. A study of >300,000 patients is provocative, but it cannot answer the most important questions when crucial data are not part of the data set.

These examples illustrate some key points about the use of big data in clinical research. One point is that, while small sample size is often a problem in clinical research, huge sample sizes in big data are not necessarily the solution. A large sample size addresses the issue of sampling variability, wherein one has a chance of getting a nonrepresentative

sample because of greater sampling variability with small sample sizes. The opportunities of big data offered by insurance claims databases and the like overcome the sampling variability problem, but they may be fraught with what is called sampling bias—that is, the database represents a segment of the population that may not be representative of the rest of the population. Within that big data set, the quality of the data may be suspect, in that the data may represent not actual diagnoses but what is paid as an insurance claim. The data set also may not contain information that is needed to interpret what is actually happening, such as measures of health and disease. Such big data will offer opportunities to generate hypotheses, which should be pursued, but care should be taken in drawing conclusions from them.

In the future, if large clinical record databases become available, they will be more useful. However, querying such a database to assess whether treatment A or treatment B shows the better success rate, and then directing future treatment to the one showing the greater success in the name of quality improvement, involves drawing conclusions from observational data. While observational studies are very helpful, they can contain hidden biases in how treatments are assigned, and there are many examples of conclusions drawn from well-done observational studies that were contradicted when subjected to the gold standard of a randomized trial, where the key ingredient is the randomization. Those who argue that we cannot wait for randomized trials to decide on comparative treatment efficacy are reminded that the Food and Drug Administration requires randomized clinical trials, for good reason, before approving new drugs; the same principles should apply here.

Easily accessible big data will offer new opportunities in clinical research, but unless such resources contain high-quality relevant

data and are approached with a healthy dose of skepticism, they are likely to offer false promises of definitive answers.

Acknowledgments

The author received no financial support and declares no potential conflicts of interest with respect to the authorship and/or publication of this article.

References

Butler D. 2013. When Google got flu wrong. Nat News. 494(7436):155–156.

DeRouen TA. 2015. Effect of periodontal therapy on systemic diseases. Am J Prev Med. 48(3):e4.

Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. 2009. Detecting influenza epidemics using search engine query data. Nature. 457(7232):1012–1014.

Harford T. 2014. Big data: are we making a big mistake? [accessed 2015 Apr 28]. http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a00144feabdc0.html.

Hujoel PP, Drangsholt M, Spiekerman C, DeRouen TA. 2000. Periodontal disease and coronary heart disease risk. J Am Med Assoc. 284(11):1406–1410.

Hujoel PP, Drangsholt M, Spiekerman C, DeRouen TA. 2001. Examining the link between coronary heart disease and the elimination of chronic dental infections. J Am Dent Assoc. 132(7):883–889.

Hujoel PP, Drangsholt M, Spiekerman C, DeRouen TA. 2002. Pre-existing cardiovascular disease and periodontitis: a follow-up study. J Dent Res. 81(3):186–191.

Jeffcoat M. 2015. Response to a letter from Dr. Timothy A. DeRouen. Am J Prev Med. 48(3):e5.

Jeffcoat MK, Jeffcoat RL, Gladowski PA, Bramson JB, Blum JJ. 2014. Impact of periodontal therapy on general health: evidence from insurance data for five systemic conditions. Am J Prev Med. 47(2):166–174.

McKinney C, Rue T, Sathyanarayana S, Martin M, Seminario AL, DeRouen T. 2014. Dental sealants and restorations and urinary bisphenol A concentrations in children in the 2003–2004 National Health and Nutrition Examination Survey. J Am Dent Assoc. 145(7):745–750.

