Water Research 53 (2014) 26–34

Available online at www.sciencedirect.com

ScienceDirect journal homepage: www.elsevier.com/locate/watres

Public health and pipe breaks in water distribution systems: Analysis with internet search volume as a proxy

Julie E. Shortridge*, Seth D. Guikema

Department of Geography & Environmental Engineering, Johns Hopkins University, USA

article info

Article history:
Received 19 September 2013
Received in revised form 1 December 2013
Accepted 9 January 2014
Available online 21 January 2014

Keywords:
Distribution network
Pipe breaks
Gastrointestinal illness
Non-linear regression

abstract

Drinking water distribution infrastructure has been identified as a factor in waterborne disease outbreaks and improved understanding of the public health risks associated with distribution system failures has been identified as a priority area for research. Pipe breaks may pose a risk, as their occurrence and repair can result in low or negative pressure, potentially allowing contamination of drinking water from adjacent soils. However, measuring this phenomenon is challenging because the most likely health impact is mild gastrointestinal (GI) illness, which is unlikely to result in a doctor or hospital visit. Here we present a novel method that uses data mining techniques and internet search volume to assess the relationship between pipe breaks and symptoms of GI illness in two U.S. cities. Weekly search volume for the terms diarrhea and vomiting was used as the response variable with the number of pipe breaks in each city as a covariate as well as additional covariates to control for seasonal patterns, search volume persistence, and other sources of GI illness. The fit and predictive accuracy of multiple regression and data mining techniques were compared, with the best performance obtained using random forest and bagged regression tree models. Pipe breaks were found to be an important and positively correlated predictor of internet search volume in multiple models in both cities, supporting previous investigations that indicated an increased risk of GI illness from distribution system disturbances. © 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +1 2026796535. E-mail address: [email protected] (J.E. Shortridge).
0043-1354/$ – see front matter © 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.watres.2014.01.013

1. Introduction

While drinking water in developed countries is consistently treated to be compliant with health guidelines, the aging condition of drinking water distribution system infrastructure presents a risk of contaminant intrusion and negative impacts on public health. Breaks and leaks in distribution pipelines can allow pathogens present in surrounding soil or water to enter the distribution system during low or negative pressure events. It is estimated that anywhere from 10 to 50% of waterborne disease outbreaks associated with treated drinking water are attributable to distribution system deficiencies (CDC, 2006; CDC, 2008; CDC, 2011). It is reasonable to assume that the outbreaks analyzed in CDC (2006, 2008, 2011) represent only a small percentage of the overall disease attributable to drinking water, as they require that multiple cases of illness be reported and linked to drinking water exposure (CDC, 2011),
and few people seek medical care for mild to moderate gastrointestinal (GI) illness (Wheeler et al., 1999). Messner et al. (2006) estimate that community water systems are responsible for 16.4 million cases of acute GI illness per year in the United States. While most of these are mild to moderate cases that do not require a doctor or hospital visit, they can still result in high societal costs; for example, in 1988 it was estimated that mild GI illness resulted in $19.5 billion in lost productivity annually (Garthright et al., 1988). Because of these issues, improved understanding of the incidence and severity of health impacts from water distribution systems has been identified as a high priority research area (USEPA and Water Research Foundation, 2010). Despite the need for additional research on the public health impacts of distribution system deficiencies, a number of issues make detection and measurement of these impacts a challenge. Exposure to pathogens through water distribution systems requires a complex chain of events to occur. While a conceptual model to support microbial risk assessment for low-pressure events is presented by Besner et al. (2011), existing data to support such an analysis is limited and subject to numerous uncertainties and assumptions. While sustained low-pressure events caused by main breaks and maintenance activities are relatively easy to identify, short term pressure transients caused by changes in pump operation, power outages, and sudden changes in demand are unlikely to be identified without high-speed pressure monitoring that is generally not in use in existing systems (Friedman et al., 2004). Population exposure depends on the quantity of pathogens able to enter the distribution system, their transport and dilution throughout the system, and the number of users who eventually consume the contaminated water. 
Individuals may be exposed to contaminated water at many places other than their homes, such as offices, schools, or restaurants, confounding efforts to monitor illness at fine spatial scales. Furthermore, only a small percentage of GI illness results in a doctor or hospital visit, making health outcomes difficult to track.

Because of these challenges, much existing research on this topic has focused either on the occurrence of pressure transients and external contamination in water distribution systems or on survey-based monitoring and intervention trials that aim to estimate the incidence of GI illness attributable to treated drinking water. Sampling studies have indicated that pathogenic microorganisms are frequently present in soil and water adjacent to drinking water pipes (Karim et al., 2003), while low- and negative-pressure transients have been documented in multiple systems (LeChevallier et al., 2003; Karim et al., 2003). Water sampling studies have indicated that distribution systems can allow introduction of viruses into non-disinfected systems (Lambertini et al., 2012). Evaluations of whether low-pressure events lead to measurable increases in GI illness have yielded mixed results. Rates of self-reported GI illness were found to increase following distribution system disruptions in a study conducted in Norway (Nygård et al., 2007) and following self-reported losses in water pressure in the UK (Hunter et al., 2005). However, Malm et al. (2013) monitored calls to a health care hotline system in Sweden and found no statistically significant change in call volume related to GI illness following distribution system disruptions. While not specifically tied to low-pressure events, decreased incidence of GI illness has also been observed in water systems with less pipe length per person (Nygård et al., 2004; Tinker et al., 2009) and amongst study participants who drank water that had been bottled at a treatment plant rather than untreated tap water (Payment et al., 1997), indicating that increased rates of illness could be a result of distribution system deficiencies more generally.

The intervention trials and survey-based monitoring evaluations above can provide important insights into health risks associated with the studied distribution systems. However, extrapolating these insights to distribution systems more generally is difficult due to the tremendous variability in water system characteristics. The likelihood of contamination in a given system is likely to depend heavily on factors such as water source, treatment procedures, and distribution system condition and characteristics. For example, the distribution system evaluated by Payment et al. (1991, 1997) was found to be highly susceptible to negative pressure events (LeChevallier et al., 2003). Furthermore, the chance that a contamination event results in observable illness depends on the population served by the system, as certain demographic groups, such as children, the elderly, and immunocompromised individuals, are more likely to become sick after a given exposure. Differences in system characteristics, as well as differences in research design, could partly explain the higher rates of illness attributed to drinking water in those studies. While conducting a similar monitoring effort on a larger scale could provide valuable insights into how risks differ amongst different water systems, survey-based monitoring and intervention trials tend to be very resource-intensive and practical only over relatively small scales.
Therefore, new methods are needed to support broader studies that can evaluate wide-scale risks, as well as relative risks in different types of systems.

The use of internet search query data has the potential to prove useful in this regard. It is estimated that 37–52% of Americans search for health information on the internet (Brownstein et al., 2009). Internet search volume has already been shown to be strongly correlated with traditional disease monitoring data in a number of cases. Search volume for influenza-related search terms is capable of providing early detection of influenza epidemics (Ginsberg et al., 2008; Polgreen et al., 2008). This ability has also been demonstrated for a number of GI illnesses, with strong correlations between search volume and confirmed infections of rotavirus (Desai et al., 2012), salmonella (Brownstein et al., 2009), and gastroenteritis (Pelat et al., 2009). These results show that internet search volume has the potential to be an easily and rapidly accessible source of information regarding disease incidence over large areas and long time scales where traditional monitoring may be infeasible. Surveillance data can also easily be collected through time to support longitudinal evaluations, and thus avoid the difficulties associated with cross-sectional comparisons between or within water service areas.

The objective of this paper is to assess whether a statistical relationship exists between pipe breaks in municipal drinking water distribution systems and GI illness at the metropolitan scale as estimated by internet search volume. We use a novel approach where weekly internet search volume for terms related to GI illness is the response variable and we test the ability of various parametric models and non-parametric data-mining techniques to model the relationship between search volume, pipe breaks and other environmental factors. Results from two cities are compared to assess whether observed relationships are consistent.

2. Materials and methods

Two cities, referred to in the following sections as City A and City B, were used as study areas. Both cities are located in the mid-Atlantic region of the United States and have metro area populations ranging from 2 to 5 million. The water systems in each city were established over 200 years ago, and like many cities in the Eastern United States, many components of the water systems are reaching or have surpassed their planned lifespans. The two cities have temperate climates with average summer highs approaching 90 °F and average winter lows of approximately 30 °F.

In each city, metro-area weekly internet search volume for the term “diarrhea vomiting-dog” was obtained through the Google Trends website and used as a response variable. This term captures the volume of searches for the words diarrhea or vomiting, while removing those that also contained the word “dog,” as this was identified by Google as a common related search.

We obtained pipe break and leak data directly from the two cities. For City A, we receive information directly from their work order system; each time they investigate a possible pipe problem or fix a pipe break, we receive a notification directly from a senior engineer in the organization responsible for the water system performance. This data captures all pipe breaks, leaks, and repairs that the City A system managers are aware of, meaning that we capture a range of break sizes for City A, from small leaks to catastrophic failures of large pipes. For City B we received a database of historic breaks from the 1970s through 2006 directly from the City’s water department. This database included all known breaks and leaks for the most recent years but was incomplete for the earlier years. As with City A, this database contains information about all pipe breaks and pipe leaks that the system managers were aware of. There is not a clear distinction made between a pipe break and a pipe leak in either of the databases used. Our analysis thus includes information on both full breaks of pipes and small leaks. In reality, there is a continuum of problems, from a pipe being completely severed down to small, slow leaks. Of course, neither database contains information about undetected leaks or breaks.

Due to availability of pipe break data, weekly data from January 2011 to February 2013 (109 weeks of observations with a total of 1970 pipe breaks) was used in City A, and from January 2005 to March 2006 in City B (62 weeks of observations with a total of 908 breaks). The City B data was constrained by both the completeness of the record and the availability of Google search volume data for the earlier dates. The temporal resolution of available search volume data depends on the volume of searches for the term of interest. For our search term, the maximum temporal resolution was weekly for both cities. The search data available from Google does not present a total number of searches, but instead measures the number of searches for the term of interest relative to the total number of searches in that week on a scale of 0–100. This is done to account for times with higher or lower internet activity generally. Other search volume terms, including diarrhea and vomiting as separate searches, and searches controlling for the term “pregnant” (a term that was commonly combined with diarrhea and vomiting) were also evaluated. These terms were found to be highly correlated with our final search term and did not lead to significantly different results.

Covariates included:
- Season, average daily temperature, and average daily precipitation to control for seasonal and climatic variations in GI incidence. Climate data was taken from NOAA National Weather Service Monthly Weather Summaries.
- Counts of pipe breaks from the week of interest and the prior week (lagged pipe breaks) to account for disease incubation periods.
- Sewer overflow events (including both combined sewer overflows and sanitary sewer overflows) from the week of interest and the prior week to control for potential illness from sewer overflows (City A only due to data availability), obtained from the state environmental agency.

Table 1 – Summary of response variable and covariate data. Search volume numbers refer to the relative number of searches in each week for the term of interest relative to the total number of searches in that week, not the actual number of searches.

| Variable | Description | City A Min | City A Mean | City A Max | City B Min | City B Mean | City B Max |
|---|---|---|---|---|---|---|---|
| Response variable | | | | | | | |
| Search volume | Weekly search volume for “diarrhea vomiting-dog” in metro area of interest | 42 | 65.4 | 95 | 40 | 66.7 | 99 |
| Covariates | | | | | | | |
| Lagged search | Search volume from previous week | 42 | 65.3 | 95 | 40 | 66.7 | 99 |
| Season | Categorical variable on season | NA | NA | NA | NA | NA | NA |
| Temperature (avg) | Average daily temperature for the week | 26.5 | 57.2 | 85.6 | 19.6 | 53.6 | 83.1 |
| Precipitation (avg) | Average daily precipitation for the week | 0 | 0.13 | 1.21 | 0 | 0.11 | 0.86 |
| Pipe breaks | Number of pipe breaks in each city per week | 3 | 18.1 | 79 | 3 | 14.6 | 56 |
| Pipe breaks lagged | Number of pipe breaks from previous week | 3 | 18.3 | 79 | 5 | 15.0 | 56 |
| SSO | Number of sewer overflow events in City A per week | 3 | 14.2 | 47 | NA | NA | NA |
| SSO lagged | Number of sewer overflow events from previous week | 1 | 14.1 | 47 | NA | NA | NA |

- Internet search volume from the prior week, to control for potential persistence in internet search volume.

A summary of internet search volume and covariate data in each city is provided in Table 1.

The relationship between environmental exposure, disease incidence, and internet search volume is complex and has the potential to exhibit non-linearity and interactions between covariates. To find models that could capture these complexities, we compared the fit and predictive accuracy of multiple regression and data mining techniques. Each model was fit to the full data set to evaluate goodness of fit based on mean absolute error between actual and modeled search volume. Additionally, holdout cross validation was used to compare the models’ out-of-sample predictive accuracy using a 50-fold holdout analysis. In each iteration, 90% of observations were randomly selected and used to fit the models, which were then used to predict search volume in the remaining 10% of observations. Predictive accuracy was measured using mean absolute error between actual and predicted search volume in these held-out samples.

Six models were tested to compare their in-sample and out-of-sample accuracy in each city. Because these models use different functional forms and mathematical algorithms to fit and predict data, comparing multiple models is a way to identify which methods can best capture relationships that exist between the response variable and covariates. The six models included:
1. A Poisson-transformed generalized linear model (GLM) with variable removal based on minimization of the Akaike information criterion (AIC) (Cameron and Trivedi, 1998).
2. A Poisson-transformed generalized additive model (GAM) with individual cubic regression splines applied over each covariate. GAMs use smoothing functions applied over covariates, allowing them to capture non-linear relationships between covariates and the response variable. Smoothing functions are fit using penalized likelihood maximization to prevent overfitting of the model, and can be penalized to zero for covariates that don’t improve model fit (Hastie and Tibshirani, 1990).
3. Multivariate adaptive regression splines (MARS): Data is represented using a non-linear, multivariate function estimated by multivariate spline basis functions fit to recursively partitioned segments of the data (Friedman, 1991).
4. Classification and Regression Tree (CART): A single regression tree was fit to the data and then pruned to the optimal size using cross validation (Breiman et al., 1984).
5. Bagged CART (BC): 50 regression trees are each trained on a separate bootstrapped subset of the data, and the final model prediction is the average of each individual tree prediction (Hastie et al., 2009).
6. Random Forest (RF): 500 regression trees are each trained on a separate bootstrapped subset of the data, and correlation between trees is reduced through random splitting of nodes (Breiman, 2001).

A null model was also included for comparative purposes, in which search volume was simply estimated as equal to the mean search volume in all observations used to fit the model. For example, for the entire dataset, the null model would predict a search volume of 65.4 in City A and 66.7 in City B (the mean search volume in each city). For the holdout analysis, the null model predicts search volume by calculating the mean search volume for the 90% of weeks selected to train the models. Models were also evaluated against a persistence model (where search volume was assumed to equal search volume from the previous week); however, because this model resulted in higher errors than all other models, the null model was used as a comparison to evaluate model performance.
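The repeated 90/10 holdout comparison described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the dictionary keys `search` and `search_lag`, and the synthetic weekly series are all assumptions made for the example, and only the two baseline models (null and persistence) are shown. Any fitted model exposing the same `fit(train) -> predict(row)` shape could be passed in the same way.

```python
import random
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def null_model(train_rows):
    """Null model: always predict the mean response of the training rows."""
    train_mean = mean(row["search"] for row in train_rows)
    return lambda row: train_mean


def persistence_model(train_rows):
    """Persistence model: predict the previous week's value (nothing to fit)."""
    return lambda row: row["search_lag"]


def repeated_holdout(rows, fit, n_iter=50, train_frac=0.9, seed=0):
    """Return the out-of-sample MAE from each random train/test split."""
    rng = random.Random(seed)  # same seed gives every model the same splits
    n_train = int(round(train_frac * len(rows)))
    errors = []
    for _ in range(n_iter):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        predict = fit(train)
        errors.append(mean_absolute_error(
            [r["search"] for r in test], [predict(r) for r in test]))
    return errors


# Synthetic weekly series on a 0-100 scale (hypothetical, NOT the study data).
data_rng = random.Random(42)
series = [min(100, max(0, round(65 + data_rng.gauss(0, 10)))) for _ in range(110)]
rows = [{"search": series[t], "search_lag": series[t - 1]} for t in range(1, 110)]

null_errors = repeated_holdout(rows, null_model)
persistence_errors = repeated_holdout(rows, persistence_model)
print(f"null model mean out-of-sample MAE:        {mean(null_errors):.2f}")
print(f"persistence model mean out-of-sample MAE: {mean(persistence_errors):.2f}")
```

Reusing one seed for every model means each model is scored on identical held-out weeks, so the per-iteration errors are paired and can be compared directly.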

3. Results

Table 2 presents the mean absolute error for each model in each city. The random forest model resulted in the lowest in-sample and out-of-sample errors in both cities, and was able to achieve statistically significant reduction in error when compared to the null model based on Bonferroni-corrected Wilcoxon rank-sum tests. The bagged CART model also resulted in a statistically significant reduction in both in-sample and out-of-sample error in both cities compared to the null model.

Table 2 – Mean absolute in-sample and out-of-sample errors for each model. Bold-italic values indicate statistically significant improvement over null model (Bonferroni-corrected p-value < 0.05).

| Model name | City A in-sample | City A out-of-sample | City B in-sample | City B out-of-sample |
|---|---|---|---|---|
| GLM | 7.59 | 7.83 | 8.96 | 10.99 |
| GAM | 7.33 | 8.21 | 8.61 | 12.03 |
| MARS | 7.58 | 8.97 | 8.58 | 11.55 |
| CART | 6.28 | 9.64 | 7.14 | 11.64 |
| Random forest | 3.81 | 7.87 | 4.51 | 10.25 |
| Bagged CART | 5.18 | 7.90 | 6.14 | 10.28 |
| Null model | 8.65 | 8.44 | 12.43 | 12.47 |

Fig. 1 – Actual and predicted search volume for the random forest and bagged CART models.

Fig. 1 shows time series of actual and predicted search volume in both cities for the random forest and bagged CART models. Predicted search volume for each week was estimated by fitting the models to the whole data set with the week in question removed, and then generating a prediction for that week. That is, each point in the time series is a holdout estimate for that week. These time series indicate that the models are capable of capturing relatively slow trends in search volume, but are unable to capture some week-to-week variability, as well as some extreme values. While the effects of a given pipe break would generally be realized within a 2–3 day period, there can be a significant delay between occurrence of a pipe break and detection and repair, particularly for smaller pipe breaks. Because of this, the pipe break data available would not be sufficient for modeling short-term trends and health impacts on, for example, a daily basis. The ability of the model to capture the longer-term, weekly trends suggests that it is appropriate for gaining insights into the relationship
between pipe breaks and health impacts at these longer time scales. However, the model is not appropriate for real-time monitoring of pipe breaks.

Because they significantly outperformed the null model in terms of fit and predictive accuracy in both cities, the random forest and bagged CART models were used to evaluate covariate influence in each city. Because typical measures of influence that are used in parametric models (such as regression coefficients and p-values) do not exist for tree-based models, partial dependence plots were developed to assess covariate importance and influence in each model. Partial dependence plots measure the marginal influence that changing each covariate of interest, while keeping all other covariates equal, has on model predictions. A relatively flat partial dependence plot indicates that the covariate of interest has little influence on the model’s predictions, while a large change in response variable values indicates that the covariate has a large degree of influence in model predictions. This variation was measured for each model by estimating the relative “swing” attributable to each covariate n, which consisted of the range of partial dependence values associated with the covariate of interest, divided by the total swing over all covariates k in that model (Equation (1)). A relative swing of 0 would indicate that the model did not use that covariate in its predictions at all, while a relative swing of 1 would indicate that the model relied entirely on one covariate.

$$\text{Relative Swing}_n = \frac{\max(PD_n) - \min(PD_n)}{\sum_{k} \text{Swing}_k}, \qquad \text{Swing}_k = \max(PD_k) - \min(PD_k) \tag{1}$$
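The relative swing calculation in Equation (1) can be illustrated with a short sketch. The code below is a simplified stand-in under stated assumptions, not the authors' code: `toy_predict` is a hypothetical function playing the role of a fitted tree ensemble, the covariate names `breaks_lag` and `temp` and the grid values are invented for the example, and partial dependence is computed in the usual way by averaging predictions over the data with one covariate forced to each grid value.

```python
def partial_dependence(predict, rows, covariate, grid):
    """Average prediction over all rows with `covariate` forced to each grid value."""
    curve = []
    for value in grid:
        modified = [{**row, covariate: value} for row in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve


def relative_swing(pd_curves):
    """Map covariate -> swing / total swing, where swing = max(PD) - min(PD)."""
    swings = {name: max(curve) - min(curve) for name, curve in pd_curves.items()}
    total = sum(swings.values())
    if total == 0:  # every curve is flat: no covariate influences predictions
        return {name: 0.0 for name in swings}
    return {name: swing / total for name, swing in swings.items()}


# Hypothetical model: the effect of lagged breaks levels off above 30 breaks.
def toy_predict(row):
    return 60 + 0.3 * min(row["breaks_lag"], 30) + 0.1 * row["temp"]


# Hypothetical data set covering a small grid of covariate combinations.
rows = [{"breaks_lag": b, "temp": t} for b in (0, 10, 20, 40) for t in (30, 60, 90)]

curves = {
    "breaks_lag": partial_dependence(toy_predict, rows, "breaks_lag", [0, 10, 20, 30, 40]),
    "temp": partial_dependence(toy_predict, rows, "temp", [30, 60, 90]),
}
shares = relative_swing(curves)
for name, share in shares.items():
    print(f"{name}: relative swing = {share:.2f}")
```

By construction the relative swings sum to 1 across covariates, which is what allows them to be read as shares of the total variation in partial dependence.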

Table 3 shows the relative swing associated with each covariate compared to all other covariates, with a higher swing associated with greater influence over model predictions. In City A, the most influential covariate in both models is the season (which is responsible for 29% of bagged CART variability and 18% of random forest variability), while the most influential covariate in City B is lagged search volume (responsible for 42% of bagged CART variability and 32% of random forest variability). The influence of lagged pipe breaks ranges from 4% in the City B bagged CART model to 17% in the City A random forest model. Covariate influence varies somewhat between the two cities, although a large degree of agreement in the variable importance estimated from each of the models is observed in City A (Kendall’s W equal to 0.94) and a moderate degree of agreement is observed in City B (Kendall’s W equal to 0.57). Seasonal variables (month and temperature) and search lag show a high degree of influence overall. Environmental indicators (SSO and lagged breaks) account for a large portion of variance in City A, but appear less influential in City B.

Table 3 – Relative swing in partial dependence plots for the random forest and bagged CART models in each city. A greater value of relative swing indicates that the model relies more heavily on that covariate, while a low value indicates that a covariate is not very important in generating model predictions. A value of 0 would indicate that the covariate is not used in the model at all.

| City | Bagged CART covariate | Relative swing | Random forest covariate | Relative swing |
|---|---|---|---|---|
| City A | Season | 0.29 | Season | 0.18 |
| City A | SSO | 0.17 | BreaksLag | 0.17 |
| City A | BreaksLag | 0.14 | SSO | 0.17 |
| City A | SearchLag | 0.14 | SearchLag | 0.14 |
| City A | Temp | 0.08 | Temp | 0.12 |
| City A | SSOlag | 0.07 | Prec | 0.10 |
| City A | Breaks | 0.07 | Breaks | 0.07 |
| City A | Prec | 0.05 | SSOlag | 0.06 |
| City B | SearchLag | 0.42 | SearchLag | 0.32 |
| City B | Temp | 0.24 | Temp | 0.20 |
| City B | Season | 0.15 | Season | 0.19 |
| City B | Prec | 0.13 | BreakLag | 0.11 |
| City B | BreakLag | 0.04 | Breaks | 0.10 |
| City B | Breaks | 0.02 | Prec | 0.08 |

The partial dependence plots shown in Figs. 2 and 3 also show some consistent patterns across models and cities. In all instances, there is a clear positive relationship between search volume and lagged search volume, indicating that there may be some persistence in search volume that lasts longer than the one-week periods evaluated in this study. In both cities, there is a positive relationship between lagged pipe breaks and search volume in both models. In City A, an increase from 0 to 20 pipe breaks results in an approximately 6% increase in search volume in the bagged CART model, and an 8% increase in the random forest model. However, increasing the number of pipe breaks beyond 30 has no additional impact on search volume. In City B, an increase from 0 to 55 breaks results in an approximately 6% increase in search volume in the random forest model, but only a 2% increase in the bagged CART model. For current week pipe breaks, a negative relationship is evident in City A, while no clear relationship is observed in City B. This is as expected, assuming a required incubation period between exposure and effect. Both cities also exhibit evidence of a seasonal pattern, with higher search volume in the summer relative to the winter and in weeks with high temperatures. In City A, a positive relationship also exists between search volume and sewer overflows for the current week, but no such relationship is observed for lagged sewer overflows.
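For readers unfamiliar with Kendall's W, the agreement statistic quoted above, the following sketch shows one common way to compute it, treating each model as a "rater" that ranks the covariates by relative swing. It is an illustration only: the text does not say how ties were handled, so this version uses average ranks without a tie correction, and the swing values in the usage example are hypothetical rather than taken from Table 3.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = largest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def kendalls_w(score_lists):
    """Kendall's W for m raters scoring the same n items (no tie correction)."""
    m, n = len(score_lists), len(score_lists[0])
    rank_lists = [average_ranks(scores) for scores in score_lists]
    rank_sums = [sum(ranks[i] for ranks in rank_lists) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rank_sum - mean_sum) ** 2 for rank_sum in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical relative swings for four covariates from two models.
model_one = [0.40, 0.30, 0.20, 0.10]
model_two = [0.35, 0.25, 0.30, 0.10]
print(f"Kendall's W = {kendalls_w([model_one, model_two]):.2f}")
```

W ranges from 0 (no agreement between the rankings) to 1 (identical rankings), so values near 1 indicate that the two models order the covariates almost identically.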

4. Discussion

All models tested were able to provide an improved fit compared to the null model, but only the bagged CART and random forest models were able to provide statistically significant improvements in both fit and predictive accuracy in both cities. Time series plots of predicted and actual search volume in the highest performing models indicate that the models were capable of capturing trends at a monthly time scale, but not shorter-term variability, particularly extreme values. This unexplained variance is not surprising considering (1) the low number of covariates explored here when compared to the numerous factors that could contribute to illness rates and online activity related to disease symptoms and (2) the possible lags between a break or leak and detection of that event. It could also be the result of minor discrepancies in the geographic regions covered by the different data sets; in particular, the metro areas represented in the Google search data include some suburbs not included in the pipe break service areas. However, population density is highest in the areas covered by the pipe break data, which is also where many people commute for work during the day. Therefore, the disparity in populations covered by the two data sets is expected to be minimal, and in any case would result in underreporting of correlation between pipe breaks and search volume and decreased model performance, rather than the opposite. While this unexplained variance would make the models unsuitable for activities requiring refined predictions (such as real-time monitoring), the statistically significant improvements demonstrated by the bagged CART and random forest models make them suitable for generating insights on covariate importance and influence.

Fig. 2 – Normalized partial dependence plots for City A. Internal tick marks show 10% sample quantiles for each covariate. Plots show the marginal influence of changing the covariate of interest while all other covariates are held constant.

Fig. 3 – Normalized partial dependence plots for City B. Internal tick marks show 10% sample quantiles for each covariate.

In terms of covariate influence, the models in City B appear to be largely informed by search volume persistence and seasonal characteristics, with lagged search volume, season, and temperature accounting for a total of 71–79% of the variation in partial dependence values for these models. In City A, the models were informed by a larger number of covariates, with the top three covariates in each model
accounting for a total of 52–60% of partial dependence swing. Partial dependence variability was largely determined by season and environmental factors (lagged pipe breaks and SSOs) in City A, with a moderate degree of variability attributed to lagged search volume. The direction of influence showed consistent results between cities and models for lagged search volume, temperature, season, and lagged pipe breaks. Counts of lagged pipe breaks show a consistently positive relationship with search volume in each city and evaluated model. While this result cannot provide any proof of a causal relationship between the two, the presence of a similar relationship across two cities and multiple models is consistent with previous studies that identified water distribution system inadequacies as a contributor to GI illness.

The use of internet search volume as a proxy for subclinical GI illness and counts of all pipe breaks in a water system are both novel methods that present an interesting comparison to the existing literature on this topic. One important distinction between this work and that of Malm et al. (2013) and Nygård et al. (2007) is the magnitude of the events analyzed. Those analyses evaluated the results of large disturbances that affected hundreds to thousands of people, whereas our evaluation makes no distinction based on pipe size or number of affected customers. It is reasonable to assume that the majority of pipe breaks in our dataset, as in most cities, are rather small and did not result in total pressure loss. Therefore, our results point more towards a health impact associated with small-scale disturbances that occur relatively frequently in water systems, rather than large disruptions that may affect many customers but only occur rarely. Additionally, it means that our results may be more comparable to the work of Malm et al. (2013), as their dataset includes breaks that resulted in varying degrees of pressure loss, rather than only focusing on disturbances that resulted in total pressure loss as in Nygård et al. (2007). It is also worth noting that Sweden experiences an estimated 5000 pipe repairs per year, which includes pipe breaks and leakage repairs (Malm et al., 2013), compared to 240,000 main breaks alone per year in the US (ASCE, 2013). This results in an annual per-capita repair rate of 0.0005 in Sweden, compared to 0.0008 for breaks alone in the US. While it is impossible to say whether this is the case for the specific cities evaluated, it seems reasonable to assume that systems more prone to pipe failure would result in greater health impacts.

Aside from Malm et al. (2013), the majority of studies on drinking water distribution systems and GI illness have relied on self-reporting during intervention trials or interviews with researchers following system disruptions, as in Nygård et al. (2007). One primary difference between these studies and ours is the spatial and temporal scale of evaluation. While Nygård et al. (2004) evaluate Campylobacter infections across Sweden in a cross-sectional comparison between municipalities, they do not evaluate any temporal changes in infection rates. Similarly, intervention trials such as those presented by Payment et al. (1997) and post-disturbance monitoring (Nygård et al., 2007) compare health outcomes in a limited number of households in a small geographic area. Scaling these evaluations up for wide-scale or long-term monitoring would be very resource intensive, whereas internet search data to support a longitudinal metropolitan-level evaluation is
freely available. Monitoring illness at the city level also reduces the confounding issue of mobility, where participants may be exposed to an illness in a place other than their residence. Therefore, internet search data has the potential to be a tool in instances where traditional monitoring is infeasible, allowing evaluations to proceed over greater temporal and spatial scales.

Nevertheless, there are some important limitations that should be considered when using internet search data to estimate public health impacts. Naturally, it is unlikely that a proxy measure such as internet search volume could provide as accurate an estimate of illness rates as direct questioning. Therefore, quantitative estimates of disease risk based on internet search data would require additional analysis relating search volume to monitoring data collected via traditional means. Furthermore, direct questioning has the advantage of allowing researchers to control for other exposure factors that cannot be monitored by proxy (such as contact with sensitive subpopulations or international travel). Internet data might also be non-representative of the population as a whole, particularly with regard to age and computer access and literacy, and may be subject to bias when the disease of interest is featured on the news (Lee, 2010). Because of these issues, it is important to consider internet surveillance data as a potential tool to be used in tandem with traditional survey methods.

Despite these limitations, there are a number of ways in which these methods could support further research. One simplification in our models was that all pipe breaks were treated equally, regardless of the size of the pipe, the occurrence and magnitude of pressure transients, the duration of the leak and repair, and the number of customers affected. This was a necessary simplification in our work because the pipe break data we used lacked information on the size of pipe or population affected.
More explicit modeling of these factors could be useful in developing more accurate models and understanding which types of breaks are most likely to result in health impacts. Our evaluation could also be scaled up to include additional cities and water systems where break information was available. This would not only provide more statistical power to support evidence of a relationship between pipe breaks and public health outcomes, but could also allow insights into the systems and conditions where pipe breaks have the greatest impact on illness. In addition, if internet search volume data were made available at a more geographically detailed scale such as the zip code level, this could substantially enhance the ability to examine the relationship between internet search volume and pipe breaks, which are often located at the scale of individual street segments in utility-provided data.
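The partial dependence swing percentages discussed in this section can be illustrated with a minimal sketch. It assumes that "swing" means the range (maximum minus minimum) of a covariate's partial dependence curve, expressed as a share of the summed ranges of all covariates. The function names, the toy model, and its coefficients are hypothetical illustrations, not the fitted models or data from this study.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average model prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite one covariate in every observation, keep the others as observed.
        forced = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(r) for r in forced))
    return curve


def swing_shares(predict, rows, grids):
    """Share of total partial dependence swing (max - min) for each covariate."""
    swings = {}
    for feature, grid in grids.items():
        curve = partial_dependence(predict, rows, feature, grid)
        swings[feature] = max(curve) - min(curve)
    total = sum(swings.values())
    return {feature: swing / total for feature, swing in swings.items()}


def toy_predict(row):
    # Hypothetical stand-in for a fitted model; coefficients are made up.
    return 50.0 + 0.5 * row["lag_search"] + 2.0 * row["lag_breaks"]


# Hypothetical weekly observations (not data from the study).
rows = [
    {"lag_search": 40.0, "lag_breaks": 3.0},
    {"lag_search": 55.0, "lag_breaks": 7.0},
    {"lag_search": 60.0, "lag_breaks": 1.0},
]
grids = {
    "lag_search": [40.0, 50.0, 60.0],
    "lag_breaks": [0.0, 5.0, 10.0],
}
shares = swing_shares(toy_predict, rows, grids)
```

In this toy case the lagged break count accounts for two thirds of the total swing and lagged search volume for one third. With a fitted random forest or bagged tree model, `predict` would be replaced by that model's prediction function.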

5. Conclusion

Disruptions and inadequacies in drinking water distribution systems are recognized as an issue with potential public health impacts and an important area for research. However, measuring the health impacts associated with distribution system disturbances, such as pipe failures, presents a number of challenges. The physical mechanisms by which a pipe failure could lead to pathogen exposure (low or negative pressure events, contaminant intrusion, and transport to water users) are difficult to monitor and model. Furthermore, gastrointestinal health outcomes are likely to be sub-clinical, meaning traditional monitoring of doctor and hospital visits will capture only a small percentage of cases. We present a novel method that compares weekly internet search volume for symptoms of GI illness with pipe break counts, while controlling for seasonal patterns, climatic fluctuations, and other possible environmental factors. We observed a positive relationship between search volume and counts of pipe breaks from the previous week in both cities using multiple models. These results support previous investigations indicating that drinking water distribution system disruptions contributed to higher rates of GI illness, and point towards the potential importance of frequent, relatively small pipe breaks. Our results also indicate that internet search data is very promising in that it can be easily scaled up to conduct long-term or wide-scale evaluations that are infeasible using traditional monitoring.
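The pairing of each week's search volume with the previous week's pipe break count can be sketched as follows. This is a minimal illustration under assumed inputs: the function name, the dictionary keys, and the example numbers are hypothetical, and the sketch omits the seasonal, climatic, and other environmental covariates used in the study.

```python
def build_lagged_rows(search_volume, pipe_breaks, lag=1):
    """Pair each week's search volume with covariates from `lag` weeks earlier."""
    if len(search_volume) != len(pipe_breaks):
        raise ValueError("weekly series must have the same length")
    if lag < 1:
        raise ValueError("lag must be at least one week")
    rows = []
    for t in range(lag, len(search_volume)):
        rows.append({
            "search": search_volume[t],            # response: this week's search volume
            "lag_search": search_volume[t - lag],  # search volume persistence term
            "lag_breaks": pipe_breaks[t - lag],    # pipe breaks from the earlier week
        })
    return rows


# Hypothetical weekly series (not data from the study).
weekly_search = [42, 45, 51, 48, 60]
weekly_breaks = [3, 8, 2, 9, 4]
rows = build_lagged_rows(weekly_search, weekly_breaks)
```

The first `lag` weeks are dropped because they have no earlier observation to pair with, so five weeks of data yield four rows for model fitting.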

Acknowledgments

This work was partially funded by NSF grants 1031046 (CMMI) and 1069213 (IGERT). This support is gratefully acknowledged. We also acknowledge and thank the two utilities that provided the pipe break data used in this research. All opinions are those of the authors and do not necessarily reflect the positions of the NSF or the participating utilities.

References

American Society of Civil Engineers (ASCE), 2013. 2013 Report Card for America’s Infrastructure. Retrieved August 9, 2013, from http://www.infrastructurereportcard.org/.
Besner, M., Prévost, M., Regli, S., 2011. Assessing the public health risk of microbial intrusion events in distribution systems: conceptual model, available data, and challenges. Water Res. 45 (3), 961–979.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth & Brooks, Monterey, CA.
Brownstein, J.S., Freifeld, C.C., Madoff, L.C., 2009. Digital disease detection - harnessing the web for public health surveillance. N. Engl. J. Med. 360 (21), 2153–2157.
Cameron, A.C., Trivedi, P., 1998. Regression Analysis of Count Data. Cambridge University Press.
Centers for Disease Control and Prevention (CDC), 2006. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2003–2004. In: Surveillance Summaries, December 22, 2006, vol. 55. MMWR (No. SS-12).
CDC, 2008. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2005–2006. In: Surveillance Summaries, September 12, 2008, vol. 57. MMWR (No. SS-9).
CDC, 2011. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2007–2008. In: Surveillance Summaries, September 23, 2011, vol. 60. MMWR (No. RR-12).
Desai, R., Lopman, B.A., Shimshoni, Y., Harris, J.P., Patel, M.M., Parashar, U.D., 2012. Use of internet search data to monitor impact of rotavirus vaccination in the United States. Clin. Infect. Dis. 54 (9), e115–e118.
Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Stat., 1–67.
Friedman, M., Radder, L., Harrison, S., Howie, D., Britton, M., Boyd, G., Wood, D., 2004. Verification and Control of Pressure Transients and Intrusion in Distribution Systems. AWWA Research Foundation and US Environmental Protection Agency.
Garthright, W.E., Archer, D.L., Kvenberg, J.E., 1988. Estimates of incidence and costs of intestinal infectious diseases in the United States. Public Health Rep. 103 (2), 107.
Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L., 2008. Detecting influenza epidemics using search engine query data. Nature 457 (7232), 1012–1014.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall, London.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, second ed. Springer, New York.
Hunter, P.R., Chalmers, R.M., Hughes, S., Syed, Q., 2005. Self-reported diarrhea in a control group: a strong association with reporting of low-pressure events in tap water. Clin. Infect. Dis. 40 (4), e32–e34.
Karim, M.R., Abbaszadegan, M., LeChevallier, M., 2003. Potential for pathogen intrusion during pressure transients. J.-Am. Water Works Assoc. 95 (5).
Lambertini, E., Borchardt, M.A., Kieke Jr., B.A., Spencer, S.K., Loge, F.J., 2012. Risk of viral acute gastrointestinal illness from nondisinfected drinking water distribution systems. Environ. Sci. Technol. 46 (17), 9299–9307.
LeChevallier, M., Gullick, R., Karim, M., Friedman, M., Funk, J., 2003. The potential for health risks from intrusion of contaminants into the distribution system from pressure transients. J. Water Health 1, 3–14.
Lee, B.K., 2010. Epidemiologic research and web 2.0 - the user-driven web. Epidemiology 21 (6), 760–763.
Malm, A., Axelsson, G., Barregard, L., Ljungqvist, J., Forsberg, B., Bergstedt, O., Pettersson, T.J., 2013. The association of drinking water treatment and distribution network disturbances with health call centre contacts for gastrointestinal illness symptoms. Water Res. 47 (13).
Messner, M., Shaw, S., Regli, S., Rotert, K., Blank, V., Soller, J., 2006. An approach for developing a national estimate of waterborne disease due to drinking water and a national estimate model application. J. Water Health 4 (Suppl. 2), 201–240.
Nygård, K., Andersson, Y., Røttingen, J., Svensson, Å., Lindbäck, J., Kistemann, T., Giesecke, J., 2004. Association between environmental risk factors and campylobacter infections in Sweden. Epidemiol. Infect. 132 (02), 317–325.
Nygård, K., Wahl, E., Krogh, T., Tveit, O.A., Bøhleng, E., Tverdal, A., Aavitsland, P., 2007. Breaks and maintenance work in the water distribution systems and gastrointestinal illness: a cohort study. Int. J. Epidemiol. 36 (4), 873–880.
Payment, P., Siemiatycki, J., Richardson, L., Renaud, G., Franco, E., Prevost, M., 1997. A prospective epidemiological study of gastrointestinal health effects due to the consumption of drinking water. Int. J. Environ. Health Res. 7 (1), 5–31.
Payment, P., Richardson, L., Siemiatycki, J., Dewar, R., Edwardes, M., Franco, E., 1991. A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting current microbiological standards. Am. J. Public Health 81 (6), 703–708.
Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., Valleron, A., 2009. More diseases tracked by using Google Trends. Emerg. Infect. Dis. 15 (8), 1327.
Polgreen, P.M., Chen, Y., Pennock, D.M., Nelson, F.D., Weinstein, R.A., 2008. Using internet searches for influenza surveillance. Clin. Infect. Dis. 47 (11), 1443–1448.
Tinker, S., Moe, C., Klein, M., Flanders, W., Uber, J., Amirtharajah, A., Tolbert, P., 2009. Drinking water residence time in distribution networks and emergency department visits for gastrointestinal illness in metro Atlanta, Georgia. J. Water Health 7 (2), 332–343.
United States Environmental Protection Agency (USEPA) and Water Research Foundation, 2010. Final Priorities of the Distribution System Research and Information Collection Partnership. April. http://www.epa.gov/safewater/disinfection/tcr/pdfs/tcrdsac/finpridsricp051010.pdf (accessed 17.07.12.).
Wheeler, J.G., Sethi, D., Cowden, J.M., Wall, P.G., Rodrigues, L.C., Tompkins, D.S., Roderick, P.J., 1999. Study of infectious intestinal disease in England: rates in the community, presenting to general practice, and reported to national surveillance. Br. Med. J. 318 (7190), 1046–1050.
