Inference of seasonal and pandemic influenza transmission dynamics Wan Yanga,1, Marc Lipsitchb, and Jeffrey Shamana a Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032; and bCenter for Communicable Disease Dynamics, Harvard School of Public Health, Harvard University, Boston, MA 02115
Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved January 22, 2015 (received for review August 5, 2014)


influenza transmission dynamics data assimilation asymptomatic infections spatial patterns

I

nfluenza remains a significant public health burden. Worldwide, it causes 3–5 million cases of severe illness, and 250–500 thousand deaths annually (1). Several times per century, a novel antigenic subtype appears and is able to spread efficiently from person to person, causing a pandemic. Influenza epidemics recur annually between pandemics because influenza A viruses, especially, continually evolve to escape human immunity, leaving a proportion of the population newly susceptible despite prior infections. These epidemics occur generally in the winter in temperate regions and show more temporal variability in the tropics. In many countries, traditional sentinel surveillance systems track the incidence of influenzalike illness (ILI) and the prevalence of PCRconfirmed influenza among specimens sent for testing. The product of these quantities has been proposed as a proxy for the incidence of influenza, up to an unknown multiplicative constant, under certain assumptions (2). The Centers for Disease Control and Prevention (CDC) has collected weekly ILI surveillance data routinely for over 3 decades. Additionally, algorithms have been developed to estimate ILI based on online search queries (3); these Google Flu Trends (GFT) ILI data are available for 29 countries around the world in real time. While surveillance data are valuable for assessing influenza activity in the general population, such measures typically do not depict true influenza infection rates. For instance, ILI is recorded as the number of patients diagnosed with ILI per patient−doctor visit in the United States rather than per capita incidence. Further, underreporting is common due to asymptomatic infections and those symptomatic but unattended. These deficiencies, together with observational error, create challenges for delineating a complete picture of influenza epidemiology using these data; however, Bayesian inference methods exist that, when coupled with dynamical models, are equipped to handle such imperfect observational data and partially observed dynamical systems (4−8). Our previous work applied such Bayesian inference frameworks (also known as data assimilation methods) to simulate www.pnas.org/cgi/doi/10.1073/pnas.1415012112
Results Several epidemiological parameters are key for understanding the transmission dynamics of an infectious disease. The basic reproductive number, R0, represents the average number of secondary infections generated from a primary case in an entirely susceptible population. Characterization of R0 is consequently critical for assessing the rate of disease propagation through a population and the possibility of outbreak containment. For instance, studies have suggested that pandemic influenza spread can be contained if R0 < ∼2 (9, 10). When analyzed retrospectively, R0 facilitates quantification of the variability of influenza transmission potential among different outbreaks. Epidemiological surveys and mathematical modeling approaches have been used to estimate R0. These estimates often rely on surveillance data during the first few weeks of an outbreak, i.e., the exponential growth period. For influenza, R0 has been estimated at values between 1.3 and 3 (11), depending on the specific assumptions made with respect to initial susceptibility and either serial interval or infectious period. While some estimates of these two Significance Infectious disease surveillance systems are powerful tools for monitoring and understanding infectious disease dynamics; however, underreporting (due to both unreported and asymptomatic infections) and observation errors in these systems create challenges for delineating a complete picture of infectious disease epidemiology. This issue is true for influenza, an infectious disease of pandemic potential. Here we develop and present influenza inference systems capable of compensating for observational biases and underreporting. Using both Google Flu Trends and Centers for Disease Control and Prevention data in conjunction with Bayesian model inference methods, we are able to infer the evolving epidemiological features of influenza and its impacts among the large population during 2003−2013, including the 2009 pandemic. In addition, differences among regions within the United States are identified. Author contributions: W.Y., M.L., and J.S. designed research; W.Y. performed research; J.S. compiled data; W.Y., M.L., and J.S. analyzed data; and W.Y., M.L., and J.S. wrote the paper. Conflict of interest statement: M.L. discloses consulting or honorarium income from the Avian/Pandemic Flu Registry (Outcome Sciences, funded in part by Roche), AIR Worldwide, Pfizer, and Novartis. This article is a PNAS Direct Submission. 1
To whom correspondence should be addressed. Email:
[email protected] This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1415012112//DCSupplemental.
PNAS Early Edition  1 of 6
BIOPHYSICS AND COMPUTATIONAL BIOLOGY
seasonal outbreaks and demonstrated that reliable forecast of the peak timing of seasonal influenza can be made using GFT ILI data combined with regional viral isolation data (termed ILI+) (4, 5). Here, we use these same inference methods to estimate key epidemiological parameters for both pandemic and seasonal outbreaks, using the entirety of outbreaks in 115 US cities during the 2003–2004 through 2012–2013 influenza seasons, and demonstrate how bigdatadriven surveillance can be used to reveal the transmission dynamics of influenza among the general population.
APPLIED MATHEMATICS
The inference of key infectious disease epidemiological parameters is critical for characterizing disease spread and devising prevention and containment measures. The recent emergence of surveillance records mined from big data such as healthrelated online queries and social media, as well as model inference methods, permits the development of new methodologies for more comprehensive estimation of these parameters. We use such data in conjunction with Bayesian inference methods to study the transmission dynamics of influenza. We simultaneously estimate key epidemiological parameters, including population susceptibility, the basic reproductive number, attack rate, and infectious period, for 115 cities during the 2003–2004 through 2012– 2013 seasons, including the 2009 pandemic. These estimates discriminate key differences in the epidemiological characteristics of these outbreaks across 10 y, as well as spatial variations of influenza transmission dynamics among subpopulations in the United States. In addition, the inference methods appear to compensate for observational biases and underreporting inherent in the surveillance data.
parameters have been derived from clinical and epidemiological data (12, 13), they are usually somewhat arbitrarily assumed in modeling approaches, which adds further uncertainty to any estimation of R0. Particularly, the initial susceptibility, often assumed close to 100% for a pandemic strain, is less certain for seasonal strains due to difficulty constraining factors such as rates of vaccination coverage and efficacy, which can affect the population susceptibility during a given season. As a result, R0 is usually estimated for pandemic outbreaks but only rarely for seasonal outbreaks. In this study, we applied a humiditydriven Susceptible− Infected−Recovered−Susceptible (SIRS) model jointly with either the ensemble adjustment Kalman filter (EAKF) or a particle filter (PF) to simulate influenza outbreaks for 115 US cities (4, 14). Our SIRS model includes two variables—numbers of susceptible (S) and infected (I) persons—and five parameters— immunity period (L), infectious period (D), maximum and minimum basic reproductive numbers, R0max and R0min, and γ, a scaling factor used to map the SIRS model simulated incidence rate to the observation metric ILI+ (see Materials and Methods). In this study, the initial susceptibility for a given local outbreak is defined as the maximal estimate of susceptibility, as made during simulation of that outbreak with the model inference system; the basic reproductive number R0, effective reproductive number Re (defined as R0S/N, where N = 100,000 is the population), and infectious period D are estimated at the week of maximal Re. Both filter methods provide joint estimates of these epidemiological variables and parameters without specific assumptions (e.g., the level of initial susceptibility). Results with the EAKF and PF were found to be consistent (see SI Appendix, Fig. S1). Before applying this inference framework to observed US epidemics, we tested both data assimilation methods—the EAKF and the PF—using four alternate compartment model forms: (i) a SIRS model, (ii) a Susceptible−Infected−Recovered (SIR) model, (iii) a Susceptible−Exposed−Infectious−Recovered (SEIR) model, and (iv) a SEIRtype model with two substages for both the exposed and infectious compartments (referred to as SE2I2R). These combined model inference frameworks were tested using modelgenerated “synthetic” data for which target parameter estimates were known. Our tests showed that inferences made using simpler models (SIRS and SIR) outperformed those using more complicated models (SI Appendix). In addition, we found that the singleage class SIRS model inference system is able to reliably infer the leading eigenvalue of the effective reproductive number Re from aggregated time series synthesized from multiage
class compartmental models (SI Appendix). Consequently, we here present results using a singleage class SIRS model. Our estimates of key epidemiological characteristics in the United States during 2003–2013 using this SIRS model (Table 1) are generally in line with findings from past studies (10, 11, 15−22); however, by using data across 115 cities, we are able to provide a much more comprehensive assessment of these characteristics over 10 y. In doing so, we are also able to discriminate and quantify variations in these epidemiological parameters across years including both seasonal and pandemic influenza outbreaks (Fig. 1). As would be expected, population susceptibility is highest at the beginning of the spring and fall waves of 2009 pandemic, with a mean of 76.5% and an interquartile range (IQR) of 72.9– 79.6%, and is significantly higher than any of the nine epidemic seasons (P < 2.2e16). Seasons with moderately severe outbreaks, i.e., the 2003–2004, 2007–2008, and 2012–2013 (23, 24), had the next highest initial susceptibility estimates. In contrast, seasons right after those profound outbreaks (e.g., 2003–2004, 2007–2008, and the pandemic) had the lowest initial susceptibilities (e.g., 2010–2011, 2004–2005, and 2008–2009) (Fig. 1A). Past studies estimate the basic reproductive number, R0, for the 2009 pandemic in the range from 1.2 to 2.3 with a median of 1.5 and tend to report larger values at the beginning of the pandemic (18). We estimate R0 for the spring and fall 2009 pandemic waves as 1.63 (1.41–1.79; mean and IQR, same elsewhere unless stated otherwise). Interestingly, the 2009 pandemic strain had an R0 lower than the 2003–2013 interpandemic strains (Fig. 1C). The 2003 A/H3N2Fujian epidemic had the highest R0 estimates. Our modeling framework assumes that R0 varies as a function of ambient absolute humidity conditions (see Materials and Methods). This effect implies that for a given outbreak, R0 will typically maximize in deep winter when absolute humidity levels are lowest; however, due to higher population susceptibility, the 2009 pandemic outbreaks took place out of season when humidity conditions were less favorable for high R0. Unlike R0, the effective reproductive number, Re, provides a measure of transmission force during an outbreak; it is calculated as the product of R0 and population susceptibility. Due to higher population susceptibility to the novel strain, the 2009 pandemic had a moderate maximum Re of 1.24 (1.17–1.29) for the fall wave. Epidemic influenza seasons with the highest maximum Re are 2003–2004 (1.40, 1.30–1.45), 2007–2008 (1.33, 1.22–1.39), and 2012–2013 (1.30, 1.25–1.35); all three of these seasons experienced moderately severe outbreaks (2, 23, 25).
Table 1. Parameter estimates by the EAKF and those reported in the literature Season
S maximum, %
2003–2004 2004–2005 2005–2006 2006–2007 2007–2008 2008–2009 2010–2011 2011–2012 2012–2013 2012–2013* pdm0910† pdm09w1‡ pdm09w2§
70.6 64.6 64.9 64.6 66.8 64.7 64.5 66.3 67.7 68.9 76.3 75.6 77.4
(65.4, (60.6, (60.7, (60.2, (62.5, (60.7, (60.6, (62.8, (63.5, (65.0, (73.2, (72.7, (73.1,
74.1) 67.1) 68.3) 68.2) 69.7) 68.0) 67.2) 69.2) 70.2) 71.6) 79.1) 78.8) 81.2)
R0 2.04 1.98 1.94 1.88 2.03 1.91 1.97 1.86 1.97 1.96 1.91 1.57 1.68
(1.84, (1.86, (1.86, (1.78, (1.85, (1.78, (1.85, (1.74, (1.84, (1.84, (1.66, (1.35, (1.49,
Re maximum 2.21) 2.11) 2.04) 2.01) 2.20) 2.06) 2.11) 1.97) 2.13) 2.10) 2.09) 1.72) 1.83)
1.40 1.25 1.24 1.19 1.33 1.21 1.22 1.18 1.28 1.30 1.33 1.15 1.24
(1.30, (1.17, (1.15, (1.12, (1.22, (1.16, (1.16, (1.15, (1.21, (1.25, (1.21, (1.05, (1.17,
1.45) 1.30) 1.30) 1.24) 1.39) 1.25) 1.27) 1.22) 1.32) 1.35) 1.39) 1.20) 1.29)
D at maximal Re
γ at maximal Re
Attack rate, %
4.20 5.00 4.93 5.02 4.97 5.04 4.71 5.18 4.80 4.76 3.66 3.90 3.69
1.53 1.66 1.62 1.62 1.57 1.61 1.65 1.58 1.58 0.62 1.33 1.54 1.43
24.1 13.3 8.2 9.8 18.7 9.6 17.9 7.9 34.4 17.4 33.3 5.3 27.0
(3.77, (4.64, (4.35, (4.47, (4.43, (4.68, (4.41, (4.87, (4.13, (4.25, (2.86, (3.50, (3.19,
4.50) 5.23) 5.32) 5.39) 5.36) 5.30) 4.93) 5.43) 5.19) 5.22) 4.35) 4.22) 4.11)
(1.43, (1.53, (1.54, (1.52, (1.46, (1.52, (1.56, (1.51, (1.43, (0.58, (1.14, (1.44, (1.32,
1.63) 1.77) 1.71) 1.73) 1.67) 1.68) 1.76) 1.65) 1.71) 0.67) 1.46) 1.62) 1.52)
(17.6, 30.0) (8.5, 16.5) (4.2, 9.5) (5.5, 13.6) (13.4, 23.9) (6.3, 12.6) (12.2, 23.2) (4.4, 10.61) (26.2, 42.2) (12.6, 22.5) (26.3, 38.8) (2.0, 7.4) (21.7, 32.5)
Estimates in this study are segregated by season/pandemic wave over all US cities and shown as mean and interquartile range (in parentheses). *Corrected estimates for 2012–2013 season, using lower prior values of γ [uniform distributions U(0.33, 2) vs. U(1,10) for other seasons] to account for the elevated flu related search intensity due to media coverage. † The entire pandemic (pdm0910) is defined as the period between weeks ending 11 April 2009 and 2 October 2010. ‡ The first 2009 pandemic wave (pdm09w1) is defined as period between weeks ending 11 April 2009 and 22 August 2009. § The second 2009 pandemic wave (pdm09w2) is defined as the period between weeks ending 8 August 2009 and 2 October 2010.
2 of 6  www.pnas.org/cgi/doi/10.1073/pnas.1415012112
Yang et al.
In comparison, the mild 2011–2012 season (26), along with the spring 2009 pandemic wave, had the lowest Re (Fig. 1D). The infectious period represents the duration of host infectiousness. It is usually estimated from clinical studies (15) or assumed to follow a certain distribution (e.g., a Gamma distribution) in modeling work (27). Using the ILI+ data and our inference approach, estimates of D, the mean infectious period, are substantially lower for the two 2009 pandemic waves (3.79, 3.34–4.17 d) than for the epidemic seasons (4.90, 4.43–5.26 d). This is in agreement with previous studies suggesting a shorter serial interval for the 2009 pandemic strain (15, 20, 28). Note that estimates of infectious period are larger than past estimates for serial interval as our model does not include a latent period. Our inference framework includes a parameter, γ, that maps modelsimulated incidence to observed ILI+. This mapping represents the weekly ratio of the probability a person seeks medical attention for any reason to the probability a person seeks medical attention due to influenza (Eq. 4; see Materials and Methods and ref. 5). For the interpandemic seasons during 2003– 2012, γ was estimated as 1.84 (1.57–1.93) over the flu season. A recent community cohort study (29) suggested a 4.3% probability of seeking medical attention among influenzainfected persons (including those with asymptomatic infection). The CDC estimated 1.25 billion doctor visits per year during 2009–2010 in the United States (30), which converts to a weekly doctor visitation rate of 7.8% of the US population. Combining these two estimates suggests a γ of 1.81 (7.8%/4.3%), very close to our model inference estimates. Because the probability a person seeks medical attention for any reason (i.e., the numerator of γ) is relatively stationary, the fluctuation in γ arises mainly from variation in the probability of seeking medical attention for influenza (i.e., the denominator of γ). A lower γ estimate thus reflects a higher tendency to seek medical help for influenza. The GFT ILI records during the 2012–2013 season are over 2 times CDC estimates, partly due to the intense media coverage accorded influenza that year (23, 31). To account Yang et al.
for this issue, we use a broader and lower range of values for the prior of γ for the 2012–2013 season than for the other seasons; this choice results in a more reasonable estimate of the attack rate for that season, while estimates for other parameters are similar to those made using the same prior for other seasons (Table 1). Estimated at the time with the maximum epidemic forcing (i.e., maximum Re), the 2012–2013 season had the lowest γ; the two 2009 pandemic waves and the 2003–2004 season had the next lowest γ estimates (Fig. 1F). These findings are consistent with observations that more people sought medical attention for influenza during the pandemic due to awareness of the ongoing pandemic and during the 2003–2004 and 2012–2013 seasons due to the greater virulence of the strains circulating at the time (24). Using our estimates of γ, we are able to convert simulated influenza incidence rates over each flu season into an attack rate. The fall 2009 pandemic wave struck an estimated 27% (22–33%) of the population; the 2003–2004, 2007–2008, 2009–2010, and 2012–2013 seasons had the next highest attack rates (19%, 14– 24%), and other epidemic seasons had much lower attack rates (10%, 5–13%) (Fig. 1B). These model estimates are based on the final outcomes of infection within the general population, unlike estimates of secondary attack rate that are conditioned on the presence of a primary case within a household (17, 19, 20). The latter method depends on the composition of survey households [e.g., whether a child is present in a household (17)] and thus may not be as representative of the general population. Our observational measure of influenza incidence, ILI+, is derived in part from municipalscale GFT ILI estimates. Such municipalscale ILI+ data are attractive due to their spatial granularity. These GFT estimates, however, have documented problems, including likely statistical overfitting to their CDC ILI target and a tendency to overestimate ILI rates (32). Particularly, the GFT ILI data for the 2012–2013 were up to 2–3 times CDC reported ILI (23). While combining GFT ILI with regional viral isolation data to create ILI+ helps better reflect influenza activity PNAS Early Edition  3 of 6
BIOPHYSICS AND COMPUTATIONAL BIOLOGY
APPLIED MATHEMATICS
Fig. 1. Comparison of key epidemiological variables/parameters over 2003–2004 through 2012–2013 flu seasons: (A) susceptibility at the onset, (B) attack rate over the entire season/pandemic wave, (C) the basic reproductive number R0, (D) the maximum effective reproductive number Re, (E) infectious period D estimated at the maximal epidemic forcing, and (F) γ at the maximal epidemic forcing. Each subplot shows the pattern of differences over the 10 y computed by Tukey’s honest significant difference test. The three segments on each horizontal line denote the mean and the 95% confidence intervals of difference between that season and season 2003/2004 (the first season in this study), determined by an ANOVA test. A horizontal line segment intersecting the vertical dash line (x = 0) indicates the parameter estimate for that season is not significantly different from season 2003/2004 and vice versa; similarly, overlapping between two horizontal line segments indicates estimates for those two seasons are not significantly different and vice versa. The mean estimate (absolute value) is listed next to each season in red. pdm denotes pandemic.
in a local city, systematic observational biases in the ILI+ metric likely still persist. To examine potential biases introduced by the uncertainty in GFT ILI data, we applied the same model filter frameworks to regional CDC ILI+ data (5) over the same 2003–2013 seasons. Estimates of γ are higher for the simulations using the CDC ILI+ data [e.g., 2.05 (1.69–2.16) vs. 1.84 (1.57–1.93) when using the GFT ILI+ data for the 2003–2012 epidemic seasons]. However, estimates of initial susceptibility, R0, maximal Re, and attack rate are similar using the two data types (see SI Appendix, Table S1). These findings suggest that the scaling factor, γ, in our inference frameworks is compensating for the biases in GFT ILI estimates. The municipal GFT ILI+ data also allow investigation of spatial variations of influenza transmission dynamics among subpopulations within the United States. Of the 115 cities with GFT data, 62 have complete records for all seasons/pandemic waves during 2003–2013. For these 62 cities, we used R0 at the time of maximum epidemic forcing (maximum Re) as estimated for each of the nine interpandemic outbreaks and two pandemic waves to calculate the correlation of R0 between pairs of cities (1,891 pairs in total). This correlation quantifies the chronological covariability of R0 during 2003–2013 across cities. In general, cities within the same Health and Human Services (HHS) region tend to have more positively correlated R0 estimates (Fig. 2 and SI Appendix, Table S2). Cities along the east coast (i.e., regions 1–4) have highly positively correlated R0 (r = 0.85 ± 0.10). Cities within regions 5 and 7 are also highly correlated with cities within regions 1–4 (r = 0.83 ± 0.08 among cities within either of these two clusters). In contrast, cities along the west coast (regions 9 and 10) have more distinct R0 estimates (r = 0.43 ±
0.35 among cities within these two regions; and r = 0.32 ± 0.26 among cities within these two regions vs. all others). Four cities—Las Vegas, NV, Phoenix, AZ, Tempe, AZ, and Tucson, AZ—stand out (the blue strip in Fig. 2); R0 estimates for these cities tend to vary out of phase with the majority of other cities (r = 0.04 ± 0.27). These cities, particularly Las Vegas and Phoenix, share a similar desert climate, which may contribute to these differences. Discussion Our findings indicate that online influenza surveillance data can be used in conjunction with a simple dynamical model and data assimilation methods to infer many key epidemiological parameters for both seasonal and pandemic influenza. Moreover, this approach provides joint estimates of these key epidemiological variables and parameters that best match influenza activity over the entire season. This model inference framework is able to depict the evolving epidemiological features of influenza across seasons and over the course of each outbreak. The inference approach presented here used a simple wellmixed, singleage class SIRS model. The model includes simple distributions for sojourn times in compartments, in particular an exponential distribution for the infectious period, with fixed infectiousness until recovery, two assumptions that are probably not accurate and will bias estimates of the reproductive number (15, 27, 33). Such model misspecification is a potential source of error that suggests caution when interpreting these findings. However, the filters, through their continued adjustment of the model state, are able to partially compensate for
Fig. 2. Correlation of R0 over the 2003–2013 period between cities. A total of 62 cities have complete ILI+ records during 2003–2013; correlations of R0 over the nine interpandemic seasons and two pandemic waves between every pair of cities were computed. The heat map shows the correlation of chronological R0 between each city pair (names of cities are shown on the x and y axes). Black lines divide the cities by HHS regions.
4 of 6  www.pnas.org/cgi/doi/10.1073/pnas.1415012112
Yang et al.
Yang et al.
Materials and Methods Data. Weekly ILI+ data are compiled by multiplying weekly municipal GFT ILI, regional GFT ILI, or regional CDC ILI estimates by their corresponding regional influenza viral isolation rate, as reported by the World Health Organization and National Respiratory and Enteric Virus Surveillance System (5). From 2003−2004 to 2012–2013, municipal GFT ILI+ data are available for up to 115 US cities (the number of cities for each year ranges from 66 to 115). SIRS Model. The SIRS model is a wellmixed humidityforced model that tracks the flow of population in each disease stage by the following equations: dS N − S − I βðtÞIS = − −α dt L N
[1]
dI βðtÞIS I = − +α dt N D
[2]
where S is the number of susceptible persons in the population, t is time, N is the population size, I is the number of infectious persons, N – S – I is the number of resistant individuals, β(t) is the transmission rate at time t, L is the average duration of immunity, D is the mean infectious period, and α is the rate of travelrelated import of influenza virus into the model domain. The basic reproductive number at time t is related to the transmission rate through the expression R0(t) = β(t)D, and determined by a function with humidity forcing (4): R0 ðtÞ = expð−180qðtÞ + lnðR0max − R0min ÞÞ + R0min
[3]
where R0max and R0min are, respectively, the maximum and minimum daily basic reproductive number, and q(t) is the specific humidity at time t. Mapping the SIRS Model to Observations. Let p(i) be the probability of any person contracting influenza during a given week. The ILI+ observation is an estimate of the percentage of influenza visits among all patient visits, or the probability that a person seeking medical attention, m, has influenza, i.e., p(ijm). By Bayes’ rule, p(i) is then pðiÞ =
pðmÞ pðmÞ pðijmÞ ≈ ðILI+ Þ: pðmjiÞ pðmjiÞ
[4]
On the other hand, the SIRS model simulates the spread of influenza within a perfectly mixed population. For a population of N people, influenza incidence roughly follows a binomial distribution, i.e., B(N, p(i)), with an expectance of p(i)N. That is, in the SIRS model the mean incidence rate, ξ, is d+ , using Eq. 4 simply p(i). Accordingly, we derive a model estimate of ILI+, ILI and the SIRSsimulated incidence rate, ξ,
PNAS Early Edition  5 of 6
BIOPHYSICS AND COMPUTATIONAL BIOLOGY
We did not discriminate strains in the SIRS model. As such, our estimates may confound or blend outcomes due to overlapping outbreaks should there be multiple strains cocirculating. Future studies could address strainspecific inference as strainspecific data become available at the municipal level. Our inference system was run discontinuously for each season; consequently, the immunity period, L, in the SIRS model was less constrained and thus not analyzed here. Constraining an increased number of state variables/parameters introduced by more comprehensive models will require data streams with additional information and finer resolution (e.g., serosurvey on population susceptibility and agestructured surveillance records). Such indepth inference could be achieved in the future, as data of better quality become available to address these issues. For instance, data with finer age structure may allow more detailed inference on the transmission dynamics of pandemic versus epidemic influenza. In summary, we have shown that the transmission dynamics of influenza among the general population can be inferred using data assimilation methods and big data estimates of incidence. As more people have access to and increasingly rely on online systems worldwide, mining of similar big data from online social networks may provide valuable information on the early spread of diseases (e.g., the early wave of a pandemic) as well as transmission dynamics for a number of other diseases. Such inference will rely heavily on the quality and reliability of these big data observational estimates.
APPLIED MATHEMATICS
model misspecification (14). Tests using synthetic data indicate the inference framework is able to accurately estimate key epidemiological parameters despite its simplicity (SI Appendix). The modelfilter framework also accommodates signals of second peaks in incidence, which would not be possible in an unforced SIRS model. Such simulation occurs, in principle, in two different ways: either by attributing the second peak to a humiditydriven increase in R0 large enough to compensate for depletion of susceptibles or by revising its estimates of state variables (such as the proportion susceptible) upward in light of data suggesting increasing incidence after a first peak. We summarize our parameter estimates by their values at the week of maximum epidemic forcing, but the actual values of these parameters vary during the simulation, due to the action of the filtering methods. Therefore, as expected, the shapes of the epidemic curves (SI Appendix, Figs. S15 and S16) are more faithful to the data than would be the predictions of an equivalent model with fixed parameters. The ILI metric in the United States is recorded as the ratio of ILIrelated patient visits to total doctor−patient visits; the denominator of this ratio can be affected by circulating virus severity (e.g., a more virulent strain, as during the 2012–2013 season) and novelty (e.g., the 2009 pandemic). Year to year, these issues, as well as changes in the number of participating clinics, can introduce biases in the CDC ILI record. These biases are in part handled by the mapping parameter γ, which provides an estimate of the difference in medical attentionseeking behavior over different seasons. For the GFT data, which are based on online search behavior, γ also reflects the attention accorded an influenza outbreak in the general population. This attention likely varies with influenza virulence or confounding events such as more intensive media coverage that changes online search behavior. By estimating the γ parameter, we are able to compensate for unusual increases in ILI observations such as those seen during the 2012–2013 season due to intense media coverage of influenza. More importantly, the estimates of γ, using either the municipal GFT or regional CDC ILI+ data, are consistent with observationally derived estimates of asymptomatic infection rates (29, 34). The parameter γ therefore also appears to account for asymptomatic infections, although we did not explicitly model this phenomenon. Our study provides estimates of initial susceptibility for both epidemic and pandemic influenza outbreaks. These estimates provide some interesting insights into the dynamics of influenza transmission over a large population. Susceptibility at the beginning of the spring 2009 pandemic wave, although significantly higher than any of the epidemic seasons, is only 75.6% (72.7– 78.8%). It is not surprising to find susceptibility lower than 100%, as the elderly are often less susceptible to a pandemic strain due to prior exposure to structurally similar strains (19, 35, 36). However, initial susceptibility, commonly assumed to decrease over time because a portion of the population would have been infected during the herald wave, is even higher at the beginning of the fall 2009 wave (Fig. 1A). Likewise, we find higher R0/Re at the beginning of the fall wave than the spring wave. In a previous study (37), higher R0 was also estimated for the second 1918 pandemic wave in New York City. One hypothesized explanation is that crossimmunity conferred by recent winter infection with seasonal influenza strains provides partial protection against the initial spring emergence of a pandemic strain. The lower initial population susceptibility and R0 for the first wave of the 2009 pandemic might suggest similar crossprotection from seasonal influenza for the general population. Pairwise analysis of R0 estimates at the time of maximum epidemic forcing during 2003–2013 reveals a positive correlation among most cities in the United States. Cities in the eastern United States exhibit a greater positive correlation of R0 than those located in the western United States. R0 estimates for Las Vegas and cities in Arizona were negatively correlated with many cities outside this region, suggesting differing transmission dynamics for the desert southwest of the United States.
c+ = ILI
ξ ξ = pðmÞ γ pðmjiÞ
[5]
pðmÞ where γ = pðmjiÞ is a scaling factor mapping simulated influenza incidence rate, ξ, to the ILI+ observation (5).
Model−Data Assimilation Methods. We applied either a particle filter with resampling and regularization (PF) (38) or the ensemble adjustment Kalman filter (EAKF) (39) jointly with the SIRS model and ILI+ records. Both model−data assimilation methods are done sequentially by repeated prediction−update cycles. In each cycle, a prediction is made by integrating the SIRS model up to the next observation, and an update is triggered by the arrival of new ILI+ data. The PF simulates the dynamical system with 10,000 randomly generated system replicas, termed particles. The filter then assimilates weekly ILI+ records to evaluate the likelihood of each particle. Particles are selected based on their likelihoods and eventually converge to those with the greatest likelihoods. In comparison, the EAKF generates 300 random system replicas, termed ensemble members. The posterior (i.e., update) of the ensemble mean is weighted based on the prediction (i.e., the prior), the observation, and their respective variances. The EAKF adjusts each ensemble member toward the ensemble mean such that the posterior variance is identical to what is predicted by Bayes’ theorem (4, 5).
ensemble member includes a set of all state variables and model parameters, which are selected or adjusted at each data assimilation checkpoint; these ensembles provide distributions of each state variable/parameter at each time point. ILI+ data were assimilated from the week ending 11 April 2009 to the week ending 22 August 2009 for the first wave of 2009 pandemic, and from the week ending 8 August 2009 to the week ending 2 October 2010 for the second wave. Simulations for the entire pandemic were made with ILI+ records between weeks ending 11 April 2009 and 2 October 2010. For a given seasonal outbreak, simulations were done from Week 40 of a given year to Week 39 of the next year. We define the week with the maximal and minimal susceptible level, S, as the onset and ending, respectively, of the epidemic/ pandemic outbreak, and a flu season as the period between onset and ending. The attack rate is calculated as the sum of simulated incidence from onset to ending. The timing of maximum epidemic forcing is defined as the week with the highest effective reproductive number. The basic reproductive number R0, effective reproductive number Re, and infectious period D are estimated at the week with the maximum epidemic forcing.
Estimation of Key Epidemiological Parameters. We estimate key epidemiology parameters for 2003–2004 through 2012–2013 season for all US cities with ILI+ records using either the PF or the EAKF. Within both filters, each particle/
ACKNOWLEDGMENTS. Funding was provided by US National Institutes of Health (NIH) Grants GM100467 (to W.Y., M.L., and J.S.) and GM110748 (to J.S.) and the NIH Models of Infectious Disease Agent Study program through Cooperative Agreement 1U54GM088558 (to J.S. and M.L.), as well as National Institute of Environmental Health Sciences (NIEHS) Center Grant ES009089 (to J.S.) and the Research and Policy for Infectious Disease Dynamics (RAPIDD) program of the Science and Technology Directorate, US Department of Homeland Security (J.S.).
1. World Health Organization (2009) Influenza (Seasonal) Fact Sheet 211 (World Health Organization, Geneva). 2. Goldstein E, Cobey S, Takahashi S, Miller JC, Lipsitch M (2011) Predicting the epidemic sizes of influenza A/H1N1, A/H3N2, and B: A statistical method. PLoS Med 8(7):e1001051. 3. Ginsberg J, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014. 4. Shaman J, Karspeck A (2012) Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci USA 109(50):20425–20430. 5. Shaman J, Karspeck A, Yang W, Tamerius J, Lipsitch M (2013) Realtime influenza forecasts during the 20122013 season. Nat Commun 4:2837. 6. King AA, Ionides EL, Pascual M, Bouma MJ (2008) Inapparent infections and cholera dynamics. Nature 454(7206):877–880. 7. Ong JB, et al. (2010) Realtime epidemic monitoring and forecasting of H1N12009 using influenzalike illness from general practice and family doctor clinics in Singapore. PLoS ONE 5(4):e10036. 8. Ionides EL, Bretó C, King AA (2006) Inference for nonlinear dynamical systems. Proc Natl Acad Sci USA 103(49):18438–18443. 9. Longini IM, Jr, et al. (2005) Containing pandemic influenza at the source. Science 309(5737):1083–1087. 10. Ferguson NM, et al. (2006) Strategies for mitigating an influenza pandemic. Nature 442(7101):448–452. 11. Coburn BJ, Wagner BG, Blower S (2009) Modeling influenza epidemics and pandemics: Insights into the future of swine flu (H1N1). BMC Med 7:30. 12. Cowling BJ, Fang VJ, Riley S, Malik Peiris JS, Leung GM (2009) Estimation of the serial interval of influenza. Epidemiology 20(3):344–347. 13. Lessler J, et al.; New York City Department of Health and Mental Hygiene Swine Influenza Investigation Team (2009) Outbreak of 2009 pandemic influenza A (H1N1) at a New York City school. N Engl J Med 361(27):2628–2636. 14. Yang W, Karspeck A, Shaman J (2014) Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLOS Comput Biol 10(4):e1003583. 15. Carrat F, et al. (2008) Time lines of infection and disease in human influenza: A review of volunteer challenge studies. Am J Epidemiol 167(7):775–785. 16. Carrat F, et al. (2002) Influenza burden of illness: Estimates from a national prospective survey of household contacts in France. Arch Intern Med 162(16):1842–1848. 17. Carcione D, et al. (2011) Secondary attack rate of pandemic influenza A(H1N1) 2009 in Western Australian households, 29 May7 August 2009. Euro Surveill 16(3):19765. 18. Boëlle PY, Ansart S, Cori A, Valleron AJ (2011) Transmission parameters of the A/H1N1 (2009) influenza virus pandemic: A review. Influenza Other Respi Viruses 5(5):306–316. 19. Riley S, et al. (2011) Epidemiological characteristics of 2009 (H1N1) pandemic influenza based on paired sera from a longitudinal community cohort study. PLoS Med 8(6):e1000442. 20. Fraser C, et al.; WHO Rapid Pandemic Assessment Collaboration (2009) Pandemic potential of a strain of influenza A (H1N1): Early findings. Science 324(5934):1557–1561. 21. Cannell JJ, Zasloff M, Garland CF, Scragg R, Giovannucci E (2008) On the epidemiology of influenza. Virol J 5:29.
22. Yang Y, et al. (2009) The transmissibility and control of pandemic influenza A (H1N1) virus. Science 326(5953):729–733. 23. Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L (2013) Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: A comparative epidemiological study at three geographic scales. PLOS Comput Biol 9(10):e1003256. 24. Centers for Disease Control and Prevention (2013) Flu Activity Picks Up Nationwide (Centers Dis Control Prevention, Atlanta). Available at cdc.gov/flu/spotlights/fluactivitypicksup.htm. Accessed November 15, 2013. 25. Centers for Disease Control and Prevention (2013) 20122013 Flu Season Drawing to a Close (Centers Dis Control Prevention, Atlanta). Available at cdc.gov/flu/spotlights/ 20122013fluseasonwrapup.htm. Accessed November 15, 2013. 26. Centers for Disease Control and Prevention (2012) 20112012 Flu Season Draws to a Close (Centers Dis Control Prevention, Atlanta). Available at cdc.gov/flu/spotlights/ 20112012fluseasonwrapup.htm. Accessed November 15, 2013. 27. Wallinga J, Lipsitch M (2007) How generation intervals shape the relationship between growth rates and reproductive numbers. Proc R Soc B 274(1609):599–604. 28. White LF, et al. (2009) Estimation of the reproductive number and the serial interval in early phase of the 2009 influenza A/H1N1 pandemic in the USA. Influenza Other Respi Viruses 3(6):267–276. 29. Hayward AC, et al.; Flu Watch Group (2014) Comparative community burden and severity of seasonal and pandemic influenza: Results of the Flu Watch cohort study. Lancet Respir Med 2(6):445–454. 30. Centers for Disease Control and Prevention (2014) Annual Number and Percent Distribution of Ambulatory Care Visits by Setting Type According to Diagnosis Group, United States 2009–2010 (Centers Dis Control Prevention, Atlanta). 31. Butler D (2013) When Google got flu wrong. Nature 494(7436):155–156. 32. Lazer D, Kennedy R, King G, Vespignani A (2014) Big data. The parable of Google Flu: Traps in big data analysis. Science 343(6176):1203–1205. 33. Lau LLH, et al. (2010) Viral shedding and clinical illness in naturally acquired influenza virus infections. J Infect Dis 201(10):1509–1516. 34. Centers for Disease Control and Prevention (2014) Ambulatory Care Use and Physician Visits (Centers Dis Control Prevention, Atlanta). 35. Karageorgopoulos DE, Vouloumanou EK, Korbila IP, Kapaskelis A, Falagas ME (2011) Age distribution of cases of 2009 (H1N1) pandemic influenza in comparison with seasonal influenza. PLoS ONE 6(7):e21690. 36. Taubenberger JK, Morens DM (2006) 1918 Influenza: The mother of all pandemics. Emerg Infect Dis 12(1):15–22. 37. Yang W, Petkova E, Shaman J (2014) The 1918 influenza pandemic in New York City: Agespecific timing, mortality, and transmission dynamics. Influenza Other Respi Viruses 8(2):177–188. 38. Arulampalam MS, Maskell S, Gordon N, Clapp T (2002) A tutorial on particle filters for online nonlinear/nonGaussian Bayesian tracking. IEEE Trans Signal Process 50(2):174–188. 39. Anderson JL (2001) An ensemble adjustment Kalman filter for data assimilation. Mon Weather Rev 129(12):2884–2903.
6 of 6  www.pnas.org/cgi/doi/10.1073/pnas.1415012112
Yang et al.