PERSPECTIVES OPINION

Big data in gastroenterology research

Robert M. Genta and Amnon Sonnenberg

Abstract | In epidemiological research, large datasets are essential to reliably capture small variations among comparative groups or detect new unsuspected associations. Although large databases of web-search information, social media, airline traffic and telephone records are already widely used to capture social trends, large databases in medical research are just emerging. With the universal use of electronic medical records underway, vast amounts of health-related information will become available for biomedical research. Accepting such new research tools—based on the analysis of large pre-existing datasets rather than hypothesis-driven, in-depth prospective study—will require a new mindset in clinical research, as data might be ‘messy’ and only associations, but not causality, can be detected. In spite of such limitations, the utilization of these new resources for medical research harbours great potential for advancing knowledge about digestive diseases.

Genta, R. M. & Sonnenberg, A. Nat. Rev. Gastroenterol. Hepatol. 11, 386–390 (2014); published online 4 March 2014; doi:10.1038/nrgastro.2014.18

Introduction

Until now, clinical research in gastroenterology has been based mostly on data generated through randomized clinical trials or accumulated through case series collected at individual medical centres. The accumulation of vast amounts of health-care data for administrative purposes by governmental agencies, health insurers, pharmaceutical companies, central laboratories and radiological institutes has opened up the possibility to utilize such large national, and even international, data repositories for epidemiological research. Our own experience relates to that gained through multiple studies based on the analysis of data deposited in a national database of gastrointestinal biopsies.1–3 The specimens were collected from patients who had endoscopic procedures performed at outpatient endoscopy or surgery centres throughout the USA and represented close to 15% of all gastrointestinal biopsy specimens collected in non-hospital-based procedures in the USA. At the time of writing, the database contains findings from 1 million oesophagogastroduodenoscopies and 1.5 million colonoscopies, with a total of about 5 million individual biopsy specimens logged in the database. Other similar examples of large national databases utilized for gastroenterology research are the Clinical Outcomes Research Initiative, a database of endoscopic procedures from endoscopy centres distributed throughout the USA, and the Hospital Episode Statistics from the UK, which contains >125 million admitted-patient, outpatient, and accident and emergency records each year.4,5

Three recurrent issues have dominated the concerns and types of critiques levelled against the use of large databases for epidemiological research—accuracy of the data, selection bias and the retrospective nature of the data collections. These issues are at the centre of the conceptual shifts required to understand and appreciate studies based on the analysis of large amounts of data collected from different sources and for different purposes.6 In this Perspectives, we discuss each of these potential problems, as well as the disadvantages and advantages of database analysis and how big data can be utilized in gastroenterology research (Box 1).

Competing interests
In addition to his academic post, R.M.G. is an employee of Miraca Life Sciences, serving as Chief of Academic Affairs. Miraca Life Sciences possesses a private database of pathology records, which R.M.G. manages. A.S. declares no competing interests.

386  |  JUNE 2014  |  VOLUME 11

Need for large amounts of data

With few notable exceptions, such as functional bowel disorders and GERD, most digestive diseases are fairly rare in that their prevalence rates rarely exceed a fraction of 1% of the population.7,8 As such, a lengthy, time-consuming and costly endeavour would be needed for a gastroenterology division or a pathology department to accumulate a large cohort of patients with a specific diagnosis of interest, let alone accumulate multiple such cohorts devoted to a variety of different diagnoses. Few of the existing hospital-based repositories, which draw mostly or exclusively from the local population, include more than a few hundred patients.9–12 Most repositories also lack a representative control group, and comparisons are frequently carried out among various groups of patients with the same disease rather than between patients with and without a given illness. Any risk factor automatically splits the patient population into two subgroups, that is, those with and without the risk factor. The presence of multiple risk factors divides the population into ever smaller subgroups of patients with a given set of characteristic risk exposures. In the absence of a very large patient population, many potential combinations of risk factors are represented by an insufficient number of case and control participants, and a statistically reliable analysis of the joint influence of multiple risk factors becomes impossible.

Epidemiological research focuses on disease variations as related to geographical, temporal and patient characteristics.13 Geographical variation refers to the different incidence, prevalence and other morbidity parameters around the globe, among different countries, states and counties, or even among different localities within a small neighbourhood.
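The fragmentation described above compounds quickly. A short sketch (with a hypothetical cohort size and set of risk factors) shows how fast a fixed cohort is spread across exposure strata:

```python
# Sketch: how binary risk factors fragment a fixed cohort into exposure strata.
# Cohort size and risk-factor count are hypothetical, chosen only to illustrate
# the argument in the text.
cohort_size = 800  # a sizeable single-centre case series

for n_factors in range(1, 7):
    n_strata = 2 ** n_factors           # each factor splits every stratum in two
    per_stratum = cohort_size / n_strata
    print(f"{n_factors} risk factors -> {n_strata:2d} strata, "
          f"~{per_stratum:5.1f} patients per stratum on average")
```

With five binary exposures, even 800 patients leave an average of only 25 per stratum, and rare combinations will be far emptier than the average suggests.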
Temporal variations include differences that occur during historical time periods and among consecutive birth cohorts, the seasonal or monthly fluctuations in the frequency of onset or flares of a condition, and the clustering of the condition among individuals born at different times of the year. In addition to demographics (age, sex, race, marital status, occupation, income, education level), patient characteristics also include comorbid conditions, dietary habits and exposure to a variety of potential risk factors (such as smoking, UV radiation, and so on). The goal of epidemiological research is to describe particular patterns in

© 2014 Macmillan Publishers Limited. All rights reserved

disease variation across geography, time and demographics, which might in turn provide clues about similar underlying variations of causative risk factors and exposure to such risks. Although an individual risk can exert a strong influence on disease occurrence, its variation within a given population might still be small: when Helicobacter pylori infection was rampant, 80% of the adult population in some areas harboured the infection, whereas in other localities 60% were infected.14 This relatively small difference in prevalence of H. pylori infection markedly affected the risk of gastric cancer among different populations.15

To reliably capture small variations, large datasets are needed. For instance, seasonal variations in the incidence of gastrointestinal disease rarely amount to more than 2–3% above and below baseline.16,17 Trying to substantiate such variations in populations of less than several thousand patients is a fruitless endeavour. However, small studies claiming to have detected seasonal variations continue to be published and tend to receive attention disproportionate to the validity of the findings.18–20 As the large Danish study that confuted the sensationalized claims of a relationship between use of mobile phones and brain tumours demonstrated,21 large, better-designed studies that impugn the validity of small, ‘sensational’ findings are often published in second-tier journals and rarely receive equal attention in the media.

Although the population of the USA is a composite of people of different ethnicities and cultural heritage, its society functions as a large ‘melting pot’ driven by a pervasive pressure to acquire and adopt common customs and habits. Similar phenomena also apply to the populations of the European Union.
Clearly, therefore, large segments of the population need to be studied to be able to reliably detect small variations across different locations or communities with increasingly homogenous behaviour and risk exposure. Traditional study designs based on the evaluation of small groups of carefully selected ‘representative’ patients will eventually give way to the analysis of megadata spanning entire populations.
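A standard power calculation makes the ‘several thousand patients’ floor explicit. The numbers below are assumptions for illustration: a seasonal swing that puts 51.5% of admissions into the winter half-year rather than an even 50%, tested two-sided at the 5% significance level with 80% power:

```python
from math import sqrt
from statistics import NormalDist

# One-sample binomial sample size: how many admissions are needed to show that
# the winter share is 51.5% rather than 50% (a ~3% peak-to-trough seasonal
# swing)? Alpha and power are conventional assumptions, not values from the text.
p0, p1 = 0.50, 0.515
z_a = NormalDist().inv_cdf(0.975)   # two-sided critical value, ~1.96
z_b = NormalDist().inv_cdf(0.80)    # power term, ~0.84

n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(f"~{n:.0f} admissions are needed")
```

Under these assumptions, close to nine thousand admissions are needed, well beyond what a single centre accrues for most diagnoses.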

Constraints in data analysis

Apart from the size of the study population, the outcome of any epidemiological analysis also depends on the magnitude of the variation between the case and control group, the quality of the comparison group and the precision in the ascertainment of disease status. Studies comprising large patient populations can help in illuminating phenomena associated with small variations, although not all such small, but statistically significant, variations are also clinically and physiologically relevant. As with other epidemiological studies, large database analyses can become compromised by the selection of inappropriate or nonrepresentative control groups. Most currently available databases still do not capture the entire population. For instance, Medicare data in the USA are limited to patients over the age of 65 years. The dataset of the US Department of Veterans Affairs mostly comprises a population of men with high rates of exposure to nicotine and alcohol.22,23 Furthermore, a database of individuals taking prescribed drugs only captures patients on medication, and a database of information on endoscopic procedures only includes a select group of patients with digestive diseases who undergo endoscopy. Our own studies of pathoepidemiology are focused only on patients in whom gastrointestinal tissue has been sampled.1–3 Within all such databases, no control group exists of disease-free individuals or of individuals who are free of the underlying primary selection criterion for inclusion into the database. As such, most comparisons relate a subgroup of patients characterized by a specific disease to the entirety of the remaining individuals in the database. No a priori assurance exists that the selection criterion does not affect the comparison, and this potential bias needs to be assessed anew with each study. Once such concerns can be alleviated, large databases do offer the advantage of access to an almost unlimited number of control individuals.
In a large database that captures a substantial segment of the population, participants with the disease of interest comprise only a small subfraction, and all other individuals’ records inside the database are essentially eligible to serve as controls. Because large databases of health-care information are primarily accumulated for administrative purposes, the measurement of individual disease states is generally less precise than information collected through prospectively designed research studies. In Crohn’s disease, for instance, none of the aforementioned databases would include information about the age at disease onset, the disease activity index, or the number of previous surgeries and hospital admissions. Obviously, databases would be less suited to pursue specific hypotheses that

NATURE REVIEWS | GASTROENTEROLOGY & HEPATOLOGY © 2014 Macmillan Publishers Limited. All rights reserved

Box 1 | Pros and cons of database analysis

Cons
■ Criteria for inclusion into the database can bias study outcome
■ False, missing, or imprecise information
■ Data contained as ‘free text’ are difficult to access
■ Loss of clinical or demographic details
■ No control over type and quality of data
■ Limited by types of research questions that can be pursued
■ No substitute for randomized clinical trial or experiment

Pros
■ More representative data with less selection bias
■ Large case numbers even in rare diagnoses
■ Large number of individuals to act as controls
■ Ability to adjust outcome to a multitude of risk factors
■ Reliable detection of small variations
■ Conservative estimates of true associations
■ Coverage of lengthy periods and large areas inaccessible otherwise
■ Analyses reasonably fast, inexpensive and easy

depend on such detailed and precise information. Database analyses are more suited for hypothesis-generating and ‘let’s look’ endeavours in which investigators search for new and hitherto untested associations. These efforts are designed to find new patterns, test seemingly outlandish hypotheses, and make ground-breaking discoveries, but are also fraught with a large failure rate and the risk of discovering nothing or disproving the original hypothesis. To overcome this inherent deficiency, such exploratory studies need to be performed easily, fast and cheaply. Database analyses generally fulfil such criteria; they are frequently exempt from institutional review board approval or only undergo an expedited review. As all the data have already been collected, the investigation can be focused on the analysis and interpretation of existing ‘old’ data rather than on the de novo creation or collection of ‘new’ data. The study can, therefore, be completed much faster than most laboratory work or any prospective randomized clinical trial. In these research endeavours, database analyses expand our existing scientific toolbox, but are not meant to replace prospectively designed clinical trials or case–control studies.
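A ‘let’s look’ query of this kind is short enough to sketch in full. Everything below is invented for illustration: a simulated database with a built-in exposure–disease association, cases defined by the diagnosis of interest, and every remaining record serving as a control:

```python
import random

random.seed(1)

# Hypothetical database in which exposure raises the probability of
# diagnosis 'X' from 2% to 6% (true odds ratio ~3.1).
def make_record(i):
    exposed = random.random() < 0.30
    p_disease = 0.06 if exposed else 0.02   # built-in association
    return {"id": i, "exposed": exposed, "has_x": random.random() < p_disease}

db = [make_record(i) for i in range(50_000)]

cases    = [r for r in db if r["has_x"]]
controls = [r for r in db if not r["has_x"]]   # everyone else in the database

a = sum(r["exposed"] for r in cases)        # exposed cases
b = len(cases) - a                          # unexposed cases
c = sum(r["exposed"] for r in controls)     # exposed controls
d = len(controls) - c                       # unexposed controls
print(f"odds ratio ~ {(a * d) / (b * c):.2f}")   # should sit near 3.1
```

The whole analysis is a handful of counting operations over pre-existing records, which is why such exploratory studies can be run quickly and cheaply.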

Small–accurate or large–messy?

In databases of thousands or millions of patients, it is impossible to verify individual

diagnoses. But are there good reasons in the first place to assume that substantial portions of the diagnoses are actually erroneous? For instance, it has been repeatedly shown, using samples from a variety of different databases, that the International Classification of Disease codes for IBD generally represent true instances of such diagnoses in the vast majority of patients.24,25 All clinicians attend medical school and undergo similar speciality and subspeciality training. What is a common language and standardized nomenclature good for otherwise? Do we really need to continually verify that every pathologist and gastroenterologist truly understands what is meant by the terms Crohn’s disease, ulcerative colitis or Barrett oesophagus? Furthermore, the frequent request that all cases in a reported series be re-evaluated by the investigators of the study rests on the arbitrary assumption that these investigators would make better, more correct diagnoses than the original evaluators. In fact, in the absence of detailed objective information about the competence of all involved, the new evaluators are equally likely to be more, less, or equally competent as those who made the original diagnoses.

The use of large datasets is generally associated with less-detailed information about the disease state than is available from prospectively designed clinical trials. Varying disease characteristics might be collapsed into the mean of the case population, and subsets of disease types are lost owing to increased heterogeneity within this population. Moreover, large datasets contain a portion of imprecise, false, or missing information (so-called messy data). These messy data might shift the outcome of a study towards the null hypothesis, but do not give rise to false patterns. With the rare exception of systematic bias, bad data obliterate rather than generate epidemiological patterns.
This argument pertains especially to large datasets in which the mere number of thousands of individual data points provides a safeguard against a misleading picture created by random fluctuations. For example, if in a large dataset 30% of cases actually belong in the control group, or if 10% of controls represent missed cases of disease, the calculated odds ratio will be too small, but will provide a conservative estimate still pointing in the right direction. In a two-by-two table of risk (yes–no) plotted versus disease (yes–no), a strong association between risk and disease emerges if a large fraction of the population assembles along the northwest–southeast diagonal of the table (with strong

concordant presence or absence of risk and disease). Inaccurate assignment of risk or disease status can only blur such a picture by distributing patients falsely into cells outside the diagonal. A pattern that emerges from seemingly poor data would only grow stronger with better data.

Except for a few very strong and obvious associations, large numbers of individual data points are needed to discern a pattern. This principle applies to data of pathoepidemiology, as well as to most other types of biomedical research. In essence, many pixels are needed to draw a crisp picture. Unless it refers to some disastrous systematic error in the initial data collection, the old adage of ‘garbage in, garbage out’ does not pertain to database research. A few detailed case histories generally tell the investigator much less than many less-detailed case histories, and in epidemiological research a large number of cases supersedes a large number of details in few patients.

A widely reported case in point is the successful attempt by Google engineers to detect influenza epidemics using search engine query data.26 By applying a set of analytic algorithms to >3 billion searches stored daily in their database, Google engineers discovered that searches for terms such as “runny nose”, “cough”, or “expectorant” predicted the imminent arrival of the epidemic better and faster than the case reporting by physicians and health centres to the US Centers for Disease Control and Prevention. These data were certainly messy: plenty of people, possibly even most of them, who searched for “runny nose” neither had nor would ever develop influenza. But the analysis of immense numbers, even those that were contaminated, enabled the recognition of trends that proved to be consistently correct.26
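The 30% and 10% contamination figures quoted above can be checked with a small expected-value calculation. The exposure probabilities below are hypothetical; only the contamination fractions come from the text:

```python
# Expected-value sketch of the misclassification argument: assumed exposure
# probabilities of 75% in true cases and 33.3% in true controls give a true
# odds ratio of 6.
p_case, p_ctrl = 0.75, 1 / 3

def odds(p):            # odds corresponding to a probability p
    return p / (1 - p)

true_or = odds(p_case) / odds(p_ctrl)

# Contaminate: 30% of labelled cases are really controls,
# 10% of labelled controls are really missed cases.
p_obs_case = 0.7 * p_case + 0.3 * p_ctrl
p_obs_ctrl = 0.9 * p_ctrl + 0.1 * p_case
obs_or = odds(p_obs_case) / odds(p_obs_ctrl)

print(f"true OR = {true_or:.1f}, observed OR = {obs_or:.2f}")
# true OR = 6.0, observed OR = 2.78
```

The misclassified estimate is attenuated towards 1 but remains well above it: the association is understated, not invented.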

If n = all, where is the bias?

A tendency exists among reviewers and readers to discredit otherwise good data from large datasets for reasons of concern about an underlying bias. This tendency constitutes a serious bias in its own right: the bias of imagined bias. A reviewer might reject data solely for the theoretical possibility of an underlying bias without describing in greater detail how such bias actually occurred, let alone providing any convincing evidence for its true existence. The larger and more encompassing the study population becomes, the lesser the potential for selection bias.



The relevance of confounding factors and the magnitude of their effect on the overall study outcome tend to be overestimated. Although race and income can affect the prevalence of H. pylori gastritis, for instance, their overall influence on the association between H. pylori gastritis and colon polyps or Barrett oesophagus is unlikely to be large.1,2 Adjustment for confounding factors rarely changes the odds ratios by more than a few tenths of a decimal point. Why would IBD, peptic ulcer or haemorrhoids be different diseases among Medicare patients or US military veterans than in other populations? H. pylori gastritis might be less common in affluent patients (who can more readily afford endoscopy), but its natural history is unlikely to run a different course in people of high compared with low income.
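The claim that adjustment moves odds ratios by only a few tenths can be illustrated with a Mantel–Haenszel calculation over two invented income strata (all counts are hypothetical):

```python
# Mantel-Haenszel sketch: stratifying on a hypothetical confounder (income)
# shifts the odds ratio only slightly. Each stratum holds
# (exposed cases, unexposed cases, exposed controls, unexposed controls).
strata = {
    "low income":  (90, 60, 300, 400),
    "high income": (30, 40, 150, 400),
}

# Crude OR: collapse the strata first.
A, B, C, D = (sum(s[i] for s in strata.values()) for i in range(4))
crude_or = (A * D) / (B * C)

# Mantel-Haenszel OR: weight each stratum by its size.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mh_or = num / den

print(f"crude OR = {crude_or:.2f}, income-adjusted OR = {mh_or:.2f}")
# crude OR = 2.13, income-adjusted OR = 2.00
```

The adjusted estimate moves by roughly a tenth, not a reversal: the kind of shift the text describes.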

The ‘R’ word

A retrospective design is often perceived as a major shortcoming of database analysis. Indeed, the authors of such a study might have to include a generic statement, such as: ‘a shortcoming of our database study is its retrospective design’, although this assessment is often incorrect, as nothing is inherently wrong with the study of information collected before somebody decided to analyse it. On the contrary, a retrospective study provides valuable information on the way things work outside the artificial constraints of a prospective design. In the case of a study that uses a database, criticism about the retrospective nature of the data is as appropriate as disparaging all archaeology or astronomy on the grounds that their analyses deal exclusively with pre-existing data. Certain types of clinical data are almost impossible to generate outside the realm of database analysis. Even over a prolonged time period, an individual clinician or an entire medical centre cannot accumulate a sufficient number of patients with a rare gastrointestinal condition. For monetary and ethical reasons, it can be impossible to follow prospectively, and without medical intervention, the natural history of many chronic digestive diseases as it unfolds over decades before developing a complication. Moreover, the length of necessary follow-up can exceed the productive life span of the individual investigator.

Big data will become universal

Currently, our ability to analyse big data is still somewhat limited by the availability of databases. A gastrointestinal researcher can tap into, for instance, one of the major


public databases of the Department of Veterans Affairs and Centers for Medicare and Medicaid Services in the USA,27,28 the General Practice Research Datalink and Hospital Episode Statistics in the UK,5,17,29 the statistics of the Allgemeine Ortskrankenkassen and Verband Deutscher Rentenversicherungsträger in Germany,30,31 or the national patient registries that exist in Denmark and Sweden.32 Other similar databases abound.33 Yet another mode of access is provided by permission to use one of the many private databases that exist, such as the database of Kaiser Permanente of California, the Clinical Outcome Research Initiative, or the database of Miraca Life Sciences.1–4,34 However, it is worth bearing in mind that we are moving towards a world in which the word ‘big’ in the expression ‘big data’ will be an understatement: ‘universal’ will be a more appropriate descriptor.

US physicians have been encouraged to use electronic records systems since 1996, when Congress passed legislation (HIPAA, or Health Insurance Portability and Accountability Act) intended to help detect and combat insurance fraud.35 For a variety of reasons—including resistance to innovation, cost, software problems, and lack of uniformity of the available systems—the vast majority of physicians and hospitals in the USA ignored this recommendation. A notable exception was the Veterans Administration hospital system, with its electronic health records (EHRs) and centralized data repository of all patient encounters since 1970. In 2010, the PPACA (the Patient Protection and Affordable Care Act, widely referred to, initially only by its detractors and later even by President Obama, as ‘Obamacare’) was passed.
Amidst the numerous provisions of this Act,36 which is best known for its mandate that all Americans obtain health insurance coverage, is the Physician Quality Reporting System, under which physicians are required to convert to EHRs by the year 2014.37 In the UK, the National Health Service began deployment of EHR systems in 2005, with the goal of having a centralized EHR for all patients by 2010.38 The programme failed and was dismantled in 2010, but new programmes with similar goals are being developed. Other European countries are at different stages of establishing centralized EHR systems, with the Netherlands and Estonia having almost reached the goal.39

When all the health records of a country’s population are available for searches, the statistician’s dream of n = all will be an unqualified reality. EHRs contain not only

a selection of data entered for a specific purpose (for example, a brief outline of the relevant clinical history and endoscopic findings to accompany a colonic biopsy specimen) but represent the convergence of everything primary physicians and specialists have gathered about a patient. Thus, even if the information was collected at different times and locations, a researcher will be able to access all health-related data about all individuals who have seen a physician. Epidemiology will undergo a revolutionary shift and the only limit will be of a technical nature. Search engines capable of sifting through billions of numerical data (laboratory values) and images (for example, radiological and endoscopic findings) are already available and will only need incremental improvements. The challenge comes from the billions of virtual pages of nonstandardized text that will have to be searched. As there are infinite ways of saying the same thing, sophisticated automatic readers will need to discern between conflicting statements. For example, the software of a natural language reader can be easily instructed to read a clinical summary, extract patients with a diagnosis of a specific condition (for example, eosinophilic oesophagitis), and place them into the correct category. The reader can also learn to discard certain patient groups, for instance those whose records state: “There is no evidence of eosinophilic oesophagitis” or “no eosinophilic oesophagitis is found”. However, what will the electronic reader do with a patient whose note states that “although 18 eosinophils per high-power field were detected, the diagnosis of eosinophilic oesophagitis is not favoured”? These problems are solvable.
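A minimal, NegEx-style sketch of such a reader is shown below; the target term and the negation cues are assumptions for illustration, not a validated clinical NLP system:

```python
import re

TERM = r"eosinophilic oesophagitis"
# Negation cue before the term ("no evidence of ... oesophagitis").
PRE = re.compile(rf"\b(no evidence of|no|without)\b[^.;]*{TERM}", re.IGNORECASE)
# Negation cue after the term ("... oesophagitis is not favoured").
POST = re.compile(rf"{TERM}[^.;]*\b(is not favoured|ruled out)\b", re.IGNORECASE)
MENTION = re.compile(TERM, re.IGNORECASE)

def flags_eoe(report: str) -> bool:
    """True when the report asserts, rather than denies, the diagnosis."""
    return bool(MENTION.search(report)) and not (
        PRE.search(report) or POST.search(report)
    )

examples = [
    "Findings consistent with eosinophilic oesophagitis.",
    "There is no evidence of eosinophilic oesophagitis.",
    "No eosinophilic oesophagitis is found.",
    "Although 18 eosinophils per high-power field were detected, "
    "the diagnosis of eosinophilic oesophagitis is not favoured.",
]
for text in examples:
    print(flags_eoe(text), "-", text)
```

Only the first report is flagged as positive; the other three, including the ‘disfavoured despite eosinophils’ note, are discarded by the cue patterns. Real pathology prose is far more varied, which is why production systems learn such cues statistically rather than enumerating them by hand.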
Google Translate,40 a highly efficient and exceptionally accurate automatic translation tool, was built almost entirely by statisticians who, with little or no help from linguists, determined the most likely meaning of each word based on the analysis of context in a corpus of billions of texts in >50 languages. Within gastroenterology, such systems that process natural language have been successfully used, for instance, to identify surveillance colonoscopy for IBD in the Veterans Affairs database41 and to distinguish surveillance from nonsurveillance pathology reports in the Kaiser Permanente database.42

The use of big data for epidemiological research will raise new concerns about privacy. Although patients’ identifying information is removed when data are analysed (and, therefore, institutional review boards waive the informed consent requirement), the theoretical possibility


of recognizing individuals still exists. For example, if a very rare disease is mapped with sufficient geographical detail, a determined intruder could discover who the patients are. However, such potential breaches of privacy are not unique to database-stored information and are unlikely to halt the spread of research based on big data.
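In essence, this is a k-anonymity question: a combination of quasi-identifiers shared by only one record (k = 1) is potentially re-identifiable. A minimal counting sketch with invented records:

```python
from collections import Counter

# Count how many de-identified records share each combination of
# quasi-identifiers. Records, fields and values are invented for illustration.
records = [
    {"zip3": "750", "age_band": "60-69", "dx": "achalasia"},
    {"zip3": "750", "age_band": "60-69", "dx": "achalasia"},
    {"zip3": "973", "age_band": "30-39", "dx": "achalasia"},  # unique combination
]

groups = Counter((r["zip3"], r["age_band"], r["dx"]) for r in records)
at_risk = [combo for combo, k in groups.items() if k == 1]
print(f"{len(at_risk)} unique (k=1) combination(s):", at_risk)
```

Coarsening the quasi-identifiers (wider regions, broader age bands) raises k and is one standard defence against exactly the geographical mapping scenario described above.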

Conclusions

Epidemiological research has come to rely increasingly on large and very large databases. Over the past three decades, the numbers of individuals included in epidemiological studies have risen exponentially from thousands to millions, and this trend is still continuing. These developments must also be accompanied by a change in the appreciation of what the analyses of such large datasets entail and what they can achieve. On one hand, the change is associated with a growing loss of control over the actual appearance of the data; on the other hand, the data themselves become more representative of the underlying reality. Database analysis can therefore afford to sacrifice some accuracy with respect to detail for the benefit of being able to include thousands or even millions of individual data points and ‘paint’ a more reliable picture of reality. As the size of databases approaches the entire population, with n becoming equal to all, reality no longer needs to be inferred by extrapolation from the study results, because the results themselves are the reality. It also stands to reason that when the data show the way reality is, issues of confounding and statistics all become secondary. Investigators and reviewers seem to still be struggling with these changing concepts, at times trying to gauge new types of research by inadequate criteria derived from previous experience with clinical trials or even laboratory experiments. Hopefully, as the utilization of large databases advances, familiarity with database analysis will rapidly expand and gastroenterologists will soon become better equipped to understand its tremendous potential in advancing general knowledge about digestive diseases.

University of Texas Southwestern Medical Centre, 5323 Harry Hines Boulevard, Dallas, TX 75390, USA; Miraca Life Sciences Research Institute, Miraca Life Sciences, 6655 North MacArthur Boulevard, Irving, TX 75039, USA (R.M.G.). Portland VA Medical Centre, Oregon Health & Science University, 3710 SW US Veterans Hospital Road, Portland, OR 97239, USA (A.S.).

Correspondence to: R.M.G.
[email protected]


PERSPECTIVES 1.

2.

3.

4.

5.

6.

7.

8. 9.

10.

11.

12.

13.

14.

15.

Sonnenberg, A. & Genta, R. M. Helicobacter pylori is a risk factor for colonic neoplasms. Am. J. Gastroenterol. 108, 208–215 (2013). Sonnenberg, A., Lash, R. H. & Genta, R. M. A national study of Helicobacter pylori infection in gastric biopsy specimens. Gastroenterology 139, 1894–1901 (2010). Dellon, E. S. et al. Inverse association of esophageal eosinophilia with Helicobacter pylori based on analysis of a US pathology database. Gastroenterology 141, 1586–1592 (2011). Sonnenberg, A. et al. Patterns of endoscopy in the United States—analysis of data from the Centers for Medicare and Medicaid Services and the National Endoscopic Database. Gastrointest. Endosc. 67, 489–496 (2008). Crooks, C., Card, T. & West, J. Reductions in 28‑day mortality following hospital admission for upper gastrointestinal hemorrhage. Gastroenterology 141, 62–70 (2011). Meyer-Schonberger, V. & Cukier, K. Big Data: a Revolution That Will Transform How We Live, Work, and Think (Houghton Mifflin Harcourt, 2013). Everhart, J. E. (Ed.) Digestive Diseases in the United States: Epidemiology and Impact. (US Department of Health and Human Services, NIH publication no. 94–1447, US Government Printing Office, 1994). Talley, N, J., Locke, G. R. III & Saito, Y. A. (Eds) GI Epidemiology (Blackwell Publishing, 2007). Loftus, E. V. Jr et al. Ulcerative colitis in Olmsted County, Minnesota, 1940–1993: incidence, prevalence, and survival. Gut 46, 336–343 (2000). Eckardt, V. F., Gockel, I. & Bernhard, G. Pneumatic dilation for achalasia: late results of a prospective follow up investigation. Gut 53, 629–633 (2004). Gupta, N. et al. Adequacy of esophageal squamous mucosa specimens obtained during endoscopy: are standard biopsies sufficient for postablation surveillance in Barrett’s esophagus? Gastrointest. Endosc. 75, 11–18 (2012). Ludvigsson, J. F. et al. Increasing incidence of celiac disease in a North American population. Am. J. Gastroenterol. 108, 818–824 (2013). Hennekens, C. H. & Buring, J. E. 
Epidemiology in Medicine (Ed. Mayrent, S. L.) (Lippincott Williams & Wilkins, 1987). Pounder, R. E. & Ng, D. The prevalence of Helicobacter pylori infection in different countries. Aliment. Pharmacol. Ther. 9 (Suppl. 2), 33–39 (1995). Sonnenberg, A. Differences in the birth-cohort patterns of gastric cancer and peptic ulcer. Gut 59, 736–743 (2010).

390  |  JUNE 2014  |  VOLUME 11

Author contributions
Both authors contributed equally to all aspects of this manuscript.

www.nature.com/nrgastro © 2014 Macmillan Publishers Limited. All rights reserved
