A hybrid evolutionary data driven model for river water quality early warning.

Journal of Environmental Management 143 (2014) 8–16


Alejandra Burchard-Levine a,*, Shuming Liu b, Francois Vince c,1, Mingming Li d, Avi Ostfeld e,2

a Tsinghua University – Veolia Environnement Joint Research Center for Advanced Environmental Technology, School of Environment, Tsinghua University, Beijing, China
b School of Environment, Tsinghua University, Beijing, China
c Veolia Environnement, 8 North Dongsanhuan Road, Beijing 100004, China
d School of Environment, Tsinghua University, Beijing, China
e Faculty of Civil and Environmental Engineering, Technion – Israel Institute of Technology, Haifa 32000, Israel

Article info

Article history: Received 1 July 2013; Received in revised form 20 January 2014; Accepted 21 April 2014; Available online

Abstract

China’s fast-paced industrialization and growing population have led to several accidental surface water pollution events in recent decades. After the 2005 Songhua River incident, the government of China pushed for the development of early warning systems (EWS) for drinking water source protection. However, EWS in China still have many weaknesses, such as the lack of pollution monitoring and of advanced water quality prediction models. Data Driven Models (DDM) such as Artificial Neural Networks (ANN) have recently attracted attention as an alternative to physical models. For a case study in an industrial city in southern China, a DDM based on a genetic algorithm (GA) and ANN was tested to increase the response time of the city’s EWS. The GA-ANN model was used to predict NH3-N, CODmn and TOC at Station B 2 h ahead of time while identifying the most sensitive input variables available at Station A, 12 km upstream. For NH3-N, the most sensitive input variables were TOC, CODmn, TP, NH3-N and Turbidity, with model performance giving a mean square error (MSE) of 0.0033, a mean percent error (MPE) of 6% and a regression (R) of 0.92. For CODmn, the most sensitive input variables were Turbidity and CODmn, with model performance giving an MSE of 0.201, an MPE of 5% and an R of 0.87. For TOC, the most sensitive input variables were Turbidity and CODmn, with model performance giving an MSE of 0.101, an MPE of 2% and an R of 0.94. In addition, the GA-ANN model performed better for 8 h ahead of time. For future studies, the GA-ANN modelling technique can be very useful for water quality prediction at Chinese monitoring stations which already measure water quality and have the data immediately available. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Genetic algorithm; Artificial Neural Networks; Water quality; Early warning system; China

* Corresponding author (A. Burchard-Levine). 1 Tel.: +86 10 5953 2700. 2 Tel.: +972 4 8292782; fax: +972 4 8228898.
http://dx.doi.org/10.1016/j.jenvman.2014.04.017
0301-4797/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

China’s fast-paced industrialization, rapid population growth and increasing urbanization are some of the main reasons for the continued occurrence of river water contamination events. Cities situated downstream of vulnerable rivers and dependent on these for drinking water supply need to have appropriate early warning systems (EWS) in order to protect citizens from receiving contaminated drinking water. Although the Chinese government has emphasized the development and installation of early warning systems, accidents have continued to occur at high cost, affecting the health of large populations. Recent incidents include the blast at the Jihua petroleum factory upstream of the Songhua River in Jilin Province in November 2005 (Hu, 2009), and the cadmium spill which contaminated over 100 km of the Longjiang River in southwestern China’s Guangxi region in January 2012 (Cadmium spill threatens water supply in Liuzhou, China, 2012). The major issue is that EWS for drinking water protection are still incomplete in many areas of China. A complete EWS namely


contains detection, confirmation and characterization components. Of these, the characterization component is the most challenging. It usually consists of advanced mathematical models which can predict the fate of pollution events. This component would allow water utilities to better understand the characteristics of a possible pollution event, gain more time to deal with it and use appropriate measures. In the context of drinking water source protection and EWS, the specific type of modelling of interest is the modelling of transient water quality conditions associated with contamination events and their potential impact on drinking water intakes (Grayman, 2001). Throughout the literature, many hydrological models, whether public or licensed tools, can be used to model an accidental event and characterize its behaviour. There are many ways of classifying these hydrological models. For simplicity, they were divided into two categories (Burchard-Levine et al., 2012):

(i) Physically-based “white-box” models that calculate and integrate all the physico-chemical mechanisms occurring in the river to predict pollutant fate (Hao, 2006).
(ii) Empirically-based or data-driven “black-box” models that aim at predicting pollutant fate using a purely numeric approach and without any detailed understanding of physico-chemical mechanisms.


Physical models for water contamination prediction and forecasting are very popular and useful when all the information needed is available. Meanwhile, data-driven models (DDM) for surface water quality prediction have gained a lot of attention as a way to overcome challenges such as non-linearity, which is not taken into account in physical models. In addition, in the case of China where data availability is a constraint, DDM such as Artificial Neural Networks (ANN) can overcome this issue through their flexibility, which allows them to predict and forecast using whatever data is available (Burchard-Levine et al., 2012). Burchard-Levine et al. (2012) provide a more detailed review of the state of EWS and water quality modelling in China. In a review by Maier et al. (2010), it is reported that ANN were used to predict water quantity variables in more than 90% of the papers, of which flow was by far the most popular. Water quality variables, on the other hand, were predicted in fewer than 10% of the papers. Hence, the application of ANN to water quality prediction is relatively recent. This is reflected in China, where many applications of ANN-based models have been studied for forecasting and prediction of river flows and watershed rainfall-runoff, while fewer can be found for water quality studies. Zhao et al. (2007) studied a forecast model for water quality (COD and DO) in the Yuqiao reservoir in Tianjin.

In most water resources ANN applications, very little attention has been given to the task of selecting appropriate model inputs (Maier and Dandy, 2000; Bowden et al., 2005). Bowden et al. (2005) presented a review of the current state of input determination methods in water resources applications of ANN modelling. To address the lack of a robust procedure for selecting important inputs, Bowden et al. (2005) presented and recommended two input determination methodologies, one of which was a Genetic Algorithm coupled with ANN for the selection of the final subset of important model inputs. Genetic Algorithms are well suited to the task of selecting an appropriate combination of inputs to a model, as they have the ability to search through large numbers of combinations where interdependencies between variables may exist (Bowden et al., 2005). Kuo et al. (2006) used a hybrid neural GA for water quality management of the Feitsui Reservoir in Taiwan. Abrahart et al. (1999) used a genetic algorithm to optimise the inputs to an ANN model used to forecast runoff from a small catchment.

The objective of this research was to develop a water quality prediction model which would enhance the characterization component of China’s EWS while taking into account the data availability constraints of the country. The use of a DDM based on GA and ANN, referred to as the GA-ANN model, instead of a classical physical model, was explored and tested in order to increase the capacity of water utilities to characterize a possible contamination event and to increase the time available to respond to it. The GA-ANN model was used to predict downstream physicochemical water quality ahead of time using available upstream physicochemical water quality variables. This could help to determine whether the downstream water quality is good enough for a drinking water source. It should be borne in mind that this method can only be used for accidents occurring upstream. A case study is presented for a large industrial city X in the south of China which depends on the River Z as its drinking water source. The GA-ANN model was first tested for the prediction of NH3-N, CODmn and TOC at Station B, at minimum 2 h ahead of time, using past measured water quality variables at Station A, located 12 km upstream, and at Station B. Finally, the model was tested for larger time delays to assess its capacity to predict further ahead of time.

2. Materials and methods

ANN are powerful adaptive tools which imitate the central nervous system by recognizing patterns in data and reproducing those patterns. ANN are able to forecast data without the need to formalize how these patterns work and what inner mechanisms or formulae are involved (Adeli, 2001; Maier et al., 2010). One of the most important steps in the ANN development process is the determination of significant input variables. Usually, not all of the potential input variables will be equally informative, since some may be correlated, noisy or have no significant relationship with the output variable being modelled. Selecting the appropriate input variables is therefore always a challenge in the application of ANN, and over-parameterization can degrade the model outputs. In this paper, GA is coupled to ANN to optimize the water quality input variables and prevent over-parameterization in model training. Hence, the technique is referred to as the application of GA for optimizing the selection of input variables for an ANN-based water quality prediction model. The following sections describe the methodology used to configure and test an ANN model which uses GA for input variable optimization: raw data pre-processing, ANN model, GA input variable selection and final ANN model.

2.1. Raw data organization and pre-processing

Raw water quality data was received from the local environmental monitoring center of the southern Chinese City X in question. First, the raw dataset needs to be sorted and the dates of measurements need to be matched. If one dataset measures a parameter every half hour while another measures it every 2 h, the average of four data points is taken in order to match it with the corresponding 2 h measurement. In addition, raw water quality data quite commonly contain many false measurements and a lot of noise. Many methods and tools exist to assess whether a measurement should be discarded, i.e. to pre-process the data. As a preliminary pre-processing step, all measurements more than two standard deviations above or below the mean can be eliminated. In order to smooth the data, all missing and eliminated points can be replaced by the average of the adjacent measurements.
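The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names are hypothetical, the series is assumed to be a plain list of readings in time order, and "adjacent measurements" is interpreted here as the nearest valid reading on each side of a gap.

```python
from statistics import mean, pstdev


def aggregate_to_two_hours(half_hourly):
    """Average each complete block of four half-hourly readings into one 2 h value."""
    return [mean(half_hourly[i:i + 4]) for i in range(0, len(half_hourly) - 3, 4)]


def remove_outliers(values, n_sigma=2.0):
    """Mark readings more than n_sigma standard deviations from the mean as None."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    return [v if v is not None and abs(v - mu) <= n_sigma * sigma else None
            for v in values]


def fill_gaps(values):
    """Replace each None with the average of the nearest valid neighbours."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            # Assumption: search outwards for the closest valid reading on each side.
            left = next((values[j] for j in range(i - 1, -1, -1)
                         if values[j] is not None), None)
            right = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            neighbours = [n for n in (left, right) if n is not None]
            filled[i] = mean(neighbours) if neighbours else None
    return filled
```

For example, `fill_gaps(remove_outliers(series))` would first discard a spike and then bridge the resulting gap; a gap at the very start or end of the series is filled from its single available neighbour.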



2.2. ANN Model development

The main idea in building an ANN model is to use the available data to train the ANN so that it can predict relevant water quality information. Many different architectures exist for ANN. Architectures that feature prominently within ANN water quality modelling and analysis applications are the multilayer perceptron (MLP) and the generalised regression neural network (GRNN) (May et al., 2009). The MLP is the quintessential ANN architecture. It is by far the most popular of all ANN architectures, and its use is reported in approximately 90% of applications using ANN (May et al., 2009). The classic three-layered structure comprises three layers of neurons: the input layer, the output layer and a single hidden layer. In practice, many functions are difficult to approximate with only one hidden layer, and the use of more than one hidden layer provides greater flexibility and enables approximation of complex functions with fewer connection weights in many situations. Some studies such as Flood and Kartam (1994) suggest using two hidden layers as a starting point. However, it must be stressed that optimal network geometry is highly problem dependent (Maier and Dandy, 2000). For this study, three hidden layers were chosen and tested. The nonlinear autoregressive with exogenous input (NARX) ANN architecture for time series, based on the MLP, was selected and tested for this study. The NARX model represents a dynamic system which can predict future values of a time series y(t) from past values of that time series and past values of a second time series x(t) (Hagan et al., 2010). The NARX model has a limited feedback architecture where the feedback comes only from the output neuron instead of the hidden neurons (Flasch et al., 2010). The NARX can be written as

$$y(t) = f\bigl(y(t-1), \ldots, y(t-d), \ldots, x(t-d)\bigr) \qquad (1)$$

where x(t) is the input of the model and y(t) is the output of the model at discrete time t, and d is the input and output time delay. The function f is a nonlinear function which can be approximated by an MLP. Fig. 1 shows the NARX architecture schematic. A time delay of 1:1, where d is 1, means that a measurement at time t at Station A is used to predict the measurement at the same time t at Station B. A time delay of 1:2, where d is 2, means that a measurement at time t at Station A is used to predict the next measurement, at time t + 1, at Station B. If the increment of time between measurements is 2 h, a time delay of 1:2 means a prediction 2 h ahead of time, a time delay of 1:3 means a prediction 4 h ahead of time, and so on.

Fig. 1. NARX schematic.

Fig. 2. Steps in NSGA-II.

The MATLAB Neural Network Toolbox 7.0 implementations of NARX neural networks and Nonlinear Input-Output were used in this study (Hagan et al., 2010). The available data must first be divided into three sets: training, validation and testing. Subsequently, the number of hidden layers and the time delay range are set. During training, the Levenberg–Marquardt algorithm was employed. An ANN run means feeding input and output data into the model, after which the neural network is trained, validated and tested. The Neural Network Toolbox then calculates the Mean Squared Error (MSE) and the Regression R between the model outputs (predicted values) and the targets (actual measured data). The performance of the model is measured by the MSE, which is the average squared difference between outputs and targets. This result can also be expressed as a mean percentage error (MPE), shown below as

$$\mathrm{MPE}(\%) = \frac{\operatorname{average}\left[(\mathrm{output} - \mathrm{target})^2\right]}{\operatorname{average}\left[\mathrm{target}^2\right]} = \frac{\mathrm{MSE}}{\operatorname{average}\left[\mathrm{target}^2\right]} \qquad (2)$$

The regression R measures the correlation between outputs and targets. It is the linear regression between outputs and targets. An R value of 1 means a close relationship, 0 a random relationship.

2.3. GA for input variable selection

Genetic Algorithms (GA) (Holland, 1975; Goldberg, 1989) are powerful optimization tools based upon the underlying principles of natural evolution and selection. GA have the ability to search through large numbers of combinations, where interdependencies between variables may exist (Bowden et al., 2005). Three genetic operators, selection, crossover (mating) and mutation, determine which subset of all possible combinations of decision variables is evaluated during the search procedure (Bowden et al., 2005; Liu et al., 2012). The possible solutions are coded as binary strings, which are equivalent to biological chromosomes. Each bit of the binary string, or chromosome, is a gene. Initially, the search starts with a random population of chromosomes. Each chromosome is then decoded into a solution and its fitness is evaluated using an objective function. According to the fitness of each chromosome, the selection procedure determines which chromosomes will cross over or mutate. The GA runs for a specific number of generations (Bowden et al., 2005).

2.3.1. Non-dominated Sorting Genetic Algorithm-II (NSGA-II)

The Non-dominated Sorting Genetic Algorithm-II (NSGA-II) is an extension of the GA for multiple objective function optimization. The objective of the NSGA-II algorithm is to improve the adaptive fit of a population of candidate solutions to a Pareto front constrained by a set of objective functions. The algorithm uses an evolutionary process with surrogates for evolutionary operators including selection, genetic crossover, and genetic mutation. The population is sorted into a hierarchy of sub-populations based on the ordering of Pareto dominance. Similarity between members of each sub-group is evaluated on the Pareto front, and the resulting groups and similarity measures are used to promote a diverse front of non-dominated solutions (Brownlee, 2011). NSGA-II has demonstrated an ability to provide accurate and feasible multiple Pareto-optimal solutions in one single run and also to maintain population diversity in the set of non-dominated solutions. The following steps show how NSGA-II works and are illustrated in Fig. 2:



2.3.1.1. Initialization. A random parent population of size P is created. This is a sub-population of input water quality variables and selected time delays for the ANN model.

2.4.1.1. Objective function. The objective function is determined by the model performance. It is given by comparing the model outputs and targets and calculating the MSE and R values.

2.3.1.2. Calculate fitness. The fitness is calculated by the ANN model performance values of MSE and R.

2.4.1.2. Constraints. The MSE should be as close as possible to 0. This means that the model with best performance has the least difference between the outputs and the targets. At the same time, the R value should be as close as possible to 1. This means that the model with best performance has targets and outputs with as much as possible a linear correlation.
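As a minimal sketch of the two performance measures used here, together with the ratio in Eq. (2), the following Python functions could be used. The function names are hypothetical, and R is computed as the Pearson correlation coefficient, which is an assumption about how the toolbox value is obtained. The function for Eq. (2) returns the plain ratio; expressing it as a percentage would mean multiplying by 100, which the source equation does not state explicitly.

```python
from math import sqrt


def mse(outputs, targets):
    """Mean squared error: average squared difference between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def mpe_ratio(outputs, targets):
    """Ratio of Eq. (2): MSE divided by the average of the squared targets."""
    mean_sq_target = sum(t ** 2 for t in targets) / len(targets)
    return mse(outputs, targets) / mean_sq_target


def regression_r(outputs, targets):
    """Pearson correlation between outputs and targets (assumed definition of R)."""
    n = len(targets)
    mean_o, mean_t = sum(outputs) / n, sum(targets) / n
    cov = sum((o - mean_o) * (t - mean_t) for o, t in zip(outputs, targets))
    spread_o = sqrt(sum((o - mean_o) ** 2 for o in outputs))
    spread_t = sqrt(sum((t - mean_t) ** 2 for t in targets))
    return cov / (spread_o * spread_t)
```

A model whose outputs track the targets closely gives an MSE near 0 and an R near 1, which are the two conditions stated above.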

2.3.1.3. Non-dominated sorting of parent population. Each member of the population, which is composed of the selected input water quality variables and time delays with the corresponding MSE and R values, is assigned a rank equal to its non-domination level or front number (1 is the best level, 2 is the next best level, etc.). The crowding distance is calculated within each non-domination level and the population is sorted in descending order of crowding distance.

2.3.1.4. Tournament selection. Two individuals are selected at random and compared on their front number and crowding distance. The better one is selected and copied into the mating pool.

2.3.1.5. Crossover and mutation. The selected individuals may undergo a mutation or a crossover.

2.3.1.6. Mixing. The parent population and child population are combined. The size of the combined population is 2P.

2.3.1.7. Non-dominated sorting. The combined population is sorted by non-dominated sorting.

2.3.1.8. Reverse elite. The best children of each generation are kept and used as parents in the next generation. The rest of the children are removed. In this way, the best result will be found.

2.3.1.9. End. The algorithm stops when the fixed number of generations is reached.

2.3.1.10. Output. The final population is returned with fronts ordered from best to worst, meaning from the lowest front number to the highest.
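The ranking used in the sorting steps above can be illustrated with a short Python sketch. It is a simplified, quadratic-time version written for clarity rather than the fast sorting procedure of NSGA-II itself, and the function names are hypothetical. Each candidate is assumed to be scored by a pair (MSE, R), where a lower MSE and a higher R are better; the first front returned corresponds to front number 1 in the text.

```python
def dominates(a, b):
    """True if score a = (mse, r) is no worse than b on both objectives and better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated_fronts(scores):
    """Split candidate indices into fronts; the first front is the best."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(scores[j], scores[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(scores, front):
    """Crowding distance of each index in one front; boundary points get infinity."""
    distance = {i: 0.0 for i in front}
    for k in range(2):  # objective 0 is MSE, objective 1 is R
        ordered = sorted(front, key=lambda i: scores[i][k])
        low, high = scores[ordered[0]][k], scores[ordered[-1]][k]
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if high == low:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            distance[cur] += (scores[nxt][k] - scores[prev][k]) / (high - low)
    return distance
```

Within a front, candidates with a larger crowding distance sit in less crowded regions, which is why sorting in descending order of crowding distance favours a diverse set of solutions.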

2.4.1.3. Decision variables. The decision variable is a subset of input variables combined with a range of time delays. The model tries to find the subset of input variables and the time delay, within a set range, which give the best model performance.

2.4.1.4. The GA-ANN model. In the case study below, the GA-ANN model is used to optimize the input variables and time delay for the prediction of NH3-N, CODmn and TOC. This is further elaborated in the following section.

2.5. Case study

A major industrial city (referred to as City X) situated in the south of China possesses two state-of-the-art drinking water source intake monitoring stations on River Z, namely Station A and Station B, which give the city good detection and confirmation components for its EWS. These stations allow raw water quality to be monitored just before the water enters the drinking water treatment plant, as shown in Fig. 3. However, the city does not use any mathematical models which would allow it to characterize a possible accidental pollution event. An alert is launched if a given water quality threshold is exceeded at one alert station; however, no relationship has been drawn between the data monitored at each station. In order to understand the relationship between the monitoring stations, the available data was tested on the GA-ANN data-driven model to study its performance in increasing the response time of the city’s early warning system.

2.4. GA-ANN model formulation

In the GA-ANN model, GA is used to compile a population where each chromosome is a subset of possible input variables drawn from the total set of available input variables. The GA randomly compiles this population and feeds it into the ANN model. Each chromosome goes through an ANN run where the model outputs its performance, expressed as the MSE and R values. The GA then ranks the chromosomes from first place to last place according to model performance. This means that a chromosome with the smallest MSE and largest R value will rank in first place. In this way, the GA is used to analyse which input variable combination gives better model performance for the prediction of the selected output variables.

2.4.1. The best performance problem

The GA-ANN model inputs are the Population number P, the number of Generations G, the number of Input Variables I and the Time Delay Range TD. Once these are selected, the GA-ANN model outputs a population where each chromosome has I input variables out of the total available input variables and has been tested over the time delay range TD, giving the corresponding MSE and R values. Finally, the chromosomes are ranked from best performance to worst performance according to the non-dominated sorting.
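The evaluation loop described in this formulation can be sketched as follows. This is an illustration under several assumptions, not the authors' code: the function names are hypothetical, each chromosome is drawn as a set of distinct variables, the `run_model` callback stands in for a full ANN run (training, validation and testing), and the final ranking uses a simple sort on MSE and then R in place of the non-dominated sorting used in the paper.

```python
import random

# The ten variables measured at Station A (Table 1).
VARIABLES = ["pH", "Turbidity", "Conductivity", "DO", "Water temperature",
             "NH3-N", "CODmn", "TOC", "TP", "Fluoride"]


def random_population(pop_size, n_inputs, rng):
    """Draw pop_size chromosomes, each a tuple of n_inputs distinct variables."""
    return [tuple(rng.sample(VARIABLES, n_inputs)) for _ in range(pop_size)]


def evaluate(population, max_delay, run_model):
    """Run every chromosome for every time delay 1:1 .. 1:max_delay and rank the results."""
    results = []
    for inputs in population:
        for delay in range(1, max_delay + 1):
            mse, r = run_model(inputs, delay)  # stand-in for one ANN run
            results.append({"inputs": inputs, "delay": delay, "mse": mse, "r": r})
    # Simplification: lexicographic ranking (smallest MSE, then largest R).
    return sorted(results, key=lambda res: (res["mse"], -res["r"]))


def placeholder_model(inputs, delay):
    """Arbitrary deterministic scores so the sketch runs; carries no physical meaning."""
    mse = 1.0 / (len(inputs) + delay)
    return mse, 1.0 - mse
```

With P = 10 chromosomes and a time delay range of 1:1 to 1:5, `evaluate` returns 10 x 5 = 50 ranked combinations of input variables and time delay.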

Fig. 3. Southern Chinese City X with two drinking water source intake monitoring stations on River Z, Stations A and B.



Table 1
Measured water quality parameters at Stations A and B.

Measured variable | Station A | Station B
pH | X | X
Turbidity | X | X
Conductivity | X | X
DO | X | X
Water temperature | X | X
NH3-N | X | X
CODmn | X | X
TOC | X | X
TP | X | N/A
Fluoride | X | N/A

The NARX architecture is used for the ANN part of the GA-ANN model in this case, as past values of the model inputs and outputs would be available during deployment. The data was divided into three parts:

1. March 12th 2009 to December 31st 2010 was used for training.
2. January 1st 2011 to December 31st 2011 was used for validation.
3. January 1st 2012 to February 26th 2012 was used for testing.

The number of hidden layers was set to three. The time delay was tested for 2 h (1:2) and for the range from zero hours (1:1), i.e. instantaneous prediction, to 8 h (1:5) ahead of time. The GA-ANN model was run 15 times to find the best input parameter sub-group of three out of the total ten, and 15 times for four out of the total ten, giving 30 runs in total. The results were then combined to study which input parameters give the best performance, and for which time delays the model continues to perform well, for the prediction of NH3-N, CODmn and TOC at Station B using input variables at Station A. Table 4 shows an example of a GA-ANN model run with 3 input variables, a population of 10 and 12 generations. The model finds the best combination of three input variables out of the ten available for a time delay range of 1:1 to 1:5. Table 5 shows the results of the GA-ANN model run from the example in Table 4. The GA-ANN model generated a population of ten containing the best three-out-of-ten input parameter combinations for a time delay range from 1:1 to 1:5. The final outputs of the GA-ANN model are 50 combinations of three input parameters and time delay with the corresponding performance measurements, MSE and R, ranked from best to worst. For this case, the best result for NH3-N prediction is obtained with the input variables Turbidity, DO and NH3-N for a prediction 6 h ahead of time (1:4).
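The time delay notation used above (1:1 for 0 h up to 1:5 for 8 h) can be made concrete with a small Python sketch. The helper names are hypothetical, and the two series are assumed to be equally spaced, time-aligned lists with one reading every 2 h.

```python
def hours_ahead(delay_k, step_hours=2):
    """Lead time in hours for a delay written 1:k, with one reading every step_hours."""
    return (delay_k - 1) * step_hours


def lagged_pairs(station_a, station_b, delay_k):
    """Pair Station A's reading at time t with Station B's reading (k - 1) steps later."""
    shift = delay_k - 1
    # zip stops at the shorter sequence, so trailing A readings without a match are dropped.
    return list(zip(station_a, station_b[shift:]))
```

For instance, `hours_ahead(4)` gives 6, matching the 1:4 delay of the best NH3-N result, and `lagged_pairs(a, b, 2)` yields the input and target pairs for a prediction 2 h ahead.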

2.5.1. Available data and information

Stations A and B are 12 km apart and measure the water quality parameters shown in Table 1. Measurements of physico-chemical parameters (pH, Turbidity, Conductivity, DO, Water Temperature, NH3-N, CODmn, TOC, TP, and Fluoride) are taken every 2 h and are available from March 28th 2009 to February 26th 2012. The time of travel between Stations A and B has been estimated at 2 h. This estimate is based on the distance between the stations measured using Google maps (about 12 km) and an assumed flow velocity of 1.5 m/s (Chen et al., 2007). Tables 2 and 3 give the raw data statistics of Stations A and B.

2.5.2. Increasing the response time of the city’s early warning system

Using the available measured water quality parameters of the city’s two intakes, Stations A and B, the GA-ANN model was tested in order to predict NH3-N, CODmn and TOC at Station B ahead of time using variables measured at Station A. The GA-ANN model was tested to predict these variables 2 h ahead of time, as the time of travel between the two stations had been estimated at 2 h. In doing so, the performance of the model was analysed to understand which variables (three out of ten and four out of ten measured at Station A) are more sensitive for the prediction of NH3-N, CODmn and TOC at Station B. Finally, the same analysis was carried out to test the model’s performance for larger response times, such as prediction 8 h ahead of time.
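The travel-time estimate of Section 2.5.1 can be checked with a one-line calculation, using only the distance and velocity quoted there (the symbols L, v and t are introduced here for illustration):

$$t = \frac{L}{v} = \frac{12\,000\ \text{m}}{1.5\ \text{m s}^{-1}} = 8000\ \text{s} \approx 2.2\ \text{h}$$

This is close to one 2 h measurement interval, which is consistent with treating the 1:2 time delay as the travel time between the two stations.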

3. Results and discussion

Increasing the response time of the city’s early warning system allows the water utilities to take better decisions ahead of time on how to react if abnormally high concentrations are measured at the monitoring station intakes. The water utilities may decide to use an emergency water treatment process or to shut down the plant.

Table 2
Station A raw data statistics.

Parameter | Units | National standard upper limit | Mean | Standard deviation | Range | Minimum | Maximum | Count
pH | – | 9 | 7.76 | 0.325 | 1.32 | 7.05 | 8.37 | 11,738
Turbidity | NTU | – | 85.4 | 96.5 | 1999 | 1 | 2000 | 11,686
Conductivity | µS/cm | – | 357 | 53.2 | 575 | 1 | 576 | 11,684
DO | mg/l | 6 | 8.27 | 1.90 | 7.77 | 4.43 | 12.2 | 11,737
Water temp. | °C | – | 19.0 | 7.37 | 26.8 | 5.4 | 32.2 | 11,694
NH3-N | mg/l | 1 | 0.176 | 0.161 | 2.78 | 0.001 | 2.78 | 11,193
CODmn | mg/l | 6 | 1.89 | 0.998 | 9.9 | 0.1 | 10 | 11,603
TOC | mg/l | – | 2.32 | 2.56 | 163 | 0.1 | 163 | 11,420
TP | mg/l | 0.2 | 0.120 | 0.0826 | 1.81 | 0.003 | 1.82 | 11,684
Fluoride | mg/l | 30 | 0.303 | 0.132 | 0.98 | 0.02 | 1 | 11,711

Table 3
Station B raw data statistics.

Parameter | Units | National standard upper limit | Mean | Standard deviation | Range | Minimum | Maximum | Count
pH | – | 9 | 7.79 | 0.343 | 7.56 | 6.44 | 14 | 11,850
Turbidity | NTU | – | 86.43 | 99.41 | 659 | 9 | 668 | 11,850
Conductivity | µS/cm | – | 330.32 | 76.93 | 766 | 1 | 767 | 11,843
DO | mg/l | 6 | 7.11 | 2.44 | 19.99 | 0.01 | 20 | 11,824
Water temp. | °C | – | 19.56 | 6.81 | 26.2 | 7 | 33.2 | 11,850
NH3-N | mg/l | 1 | 0.227 | 0.38 | 4.989 | 0.003 | 4.99 | 11,840
CODmn | mg/l | 6 | 1.97 | 1.20 | 9.9 | 0.1 | 10 | 11,574
TOC | mg/l | – | 2.12 | 1.70 | 73.6 | 0.1 | 73.7 | 11,633

Table 4
GA-ANN model run example for NH3-N prediction for three out of ten best input parameters and time delay range between 1:1 (0 h) and 1:5 (8 h).

Model input commands | Model input selection | Comments
Population P (Even) | 10 | The model will run in total 120 (P*G) combinations
Generation G | 12 |
Output parameter (For NH3-N enter 1, CODmn enter 2, TOC enter 3, TOX enter 4) | 1 | NH3-N
Number of parameters (>1) | 3 | The model will test combinations of 3 input variables out of the total 10.
Time delay (>0) | 5 | The range will be between 1:1 (0 h) and 1:5 (8 h).

The prediction of NH3-N, CODmn and TOC at Station B ahead of time, using variables measured at Station A, would allow Station B to be prepared for two kinds of accidents. The first involves organic wastes contained in municipal sewage, discharges from abattoirs, food-processing and similar agricultural industries, and organic toxic chemicals from industrial effluents, all of which are reflected by the CODmn and TOC variables. The second involves sewage, fertilisers, agricultural wastes or industrial wastes containing organic nitrogen, free ammonia or ammonium salts, which are reflected by NH3-N (Bartram and Ballance, 1996). Table 6 and Fig. 4 show the percentage of runs in which an input variable was selected as one of the best three or four model inputs out of the total 10 input variables at Station A, i.e. Run Selection (%), for the prediction of NH3-N, CODmn and TOC. In addition, Table 6 shows the performance of the GA-ANN model when using the best three and four selected input variables. Finally, Fig. 4 shows the total run selection for the three predicted output variables.

3.1. Ammonia-nitrogen (NH3-N) prediction

For NH3-N prediction at 2 h ahead of time, it has been observed in Table 6 and Fig. 4 that the input variables from most sensitive to least sensitive are:

TOC, CODmn > TP, NH3-N, Turbidity > Fluoride > Conductivity > pH, DO, Water Temp.

It seems that the most sensitive input variables for the prediction of NH3-N at Station B are TOC, CODmn, TP, NH3-N and Turbidity at Station A. The performance of the GA-ANN model for NH3-N prediction is good, as the MSE is 0.0033, the MPE is 6% and R is 0.92. The targets and outputs are close and have a linear relationship with each other. Figs. 5 and 6 show the distribution of the MSE and R values obtained for three and four input variables.


Table 5
GA-ANN model run results example for NH3-N prediction for three out of ten best input parameters and time delay range between 0 h (1:1) and 8 h (1:5).

Input variables            Time delay   Mean square   Regression   Rank
                           (1:X)        error (MSE)   R
Turb.    DO       NH3-N    4            0.0032        0.9178       1
WT       CODmn    pH       5            0.0032        0.9176       2
NH3-N    Cond.    pH       4            0.0032        0.9173       3
CODmn    Turb.    TP       3            0.0033        0.9162       4
NH3-N    TP       WT       4            0.0033        0.9162       4
Turb.    NH3-N    pH       4            0.0033        0.9162       4
DO       CODmn    pH       5            0.0033        0.9162       4
TOC      pH       pH       5            0.0033        0.9162       4
TOC      Cond.    DO       5            0.0033        0.9162       4
Turb.    DO       WT       4            0.0033        0.9161       5
TOC      CODmn    pH       5            0.0033        0.916        6
pH       CODmn    pH       5            0.0033        0.916        6
CODmn    NH3-N    TP       3            0.0033        0.9159       7
Turb.    TOC      pH       3            0.0033        0.9158       8
pH       DO       TOC      4            0.0033        0.9158       8
CODmn    TOC      pH       3            0.0033        0.9157       9
CODmn    TP       pH       3            0.0033        0.9156       10
pH       DO       TOC      5            0.0033        0.9156       10
TOC      Cond.    NH3-N    5            0.0033        0.9151       11
Turb.    pH       TOC      2            0.0033        0.9149       12
TP       TOC      pH       3            0.0033        0.9149       12
pH       Cond.    CODmn    5            0.0033        0.9149       12
Turb.    CODmn    TOC      2            0.0033        0.9146       13
pH       pH       WT       4            0.0033        0.9146       13
NH3-N    TOC      pH       2            0.0033        0.9145       14
CODmn    WT       pH       3            0.0033        0.9143       15
pH       CODmn    NH3-N    5            0.0033        0.9143       15
Turb.    pH       TP       2            0.0034        0.9142       16
CODmn    pH       TOC      2            0.0034        0.9142       16
pH       Turb.    WT       4            0.0034        0.914        17
Cond.    NH3-N    TP       3            0.0034        0.9138       18
CODmn    WT       TOC      2            0.0034        0.9137       19
NH3-N    pH       WT       4            0.0034        0.9131       20
pH       pH       TP       2            0.0034        0.9125       21
WT       TP       TOC      2            0.0035        0.9115       22
WT       DO       Cond.    3            0.0035        0.9112       23
DO       pH       WT       2            0.0035        0.911        24
Turb.    pH       DO       2            0.0035        0.9108       25
pH       DO       Cond.    3            0.0035        0.9101       26
NH3-N    Cond.    WT       4            0.0035        0.9097       27
pH       Turb.    TP       1            0.0037        0.9056       28
TP       TOC      pH       1            0.0037        0.9045       29
TP       NH3-N    pH       1            0.0037        0.9036       30
DO       TOC      pH       1            0.0038        0.9035       31
pH       Cond.    TOC      1            0.0038        0.9035       31
pH       DO       TP       1            0.0038        0.9035       31
TP       CODmn    pH       1            0.0038        0.9034       32
TOC      NH3-N    pH       1            0.0038        0.9024       33
Cond.    NH3-N    pH       1            0.0038        0.9016       34
pH       DO       TOC      1            0.0039        0.9003       35

*Turb.: Turbidity, Cond.: Conductivity, DO: Dissolved Oxygen, WT: Water Temperature, NH3-N: Ammonia-Nitrogen, CODmn: Chemical Oxygen Demand, TOC: Total Organic Carbon, TP: Total Phosphorus, F: Fluoride.
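The Run Selection (%) reported in Table 6 can be thought of as a tally over run results such as those listed in Table 5. The sketch below is one possible way to compute it, assuming that the percentage is the share of all selected input slots taken by each variable (the columns of Table 6 sum to roughly 100%, which is consistent with this reading). The function name and the three example runs are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def run_selection_percent(runs, variables):
    """Return, for each variable, its share (in %) of all selected input slots.

    runs      -- iterable of tuples, one tuple of selected input names per GA run
    variables -- names of all candidate input variables
    """
    counts = Counter(name for run in runs for name in run)
    total_slots = sum(counts.values())
    if total_slots == 0:
        raise ValueError("no selections to tally")
    # Variables that were never selected get 0.0 because Counter returns 0.
    return {name: 100.0 * counts[name] / total_slots for name in variables}


if __name__ == "__main__":
    candidates = ["pH", "Turb.", "Cond.", "DO", "WT",
                  "NH3-N", "CODmn", "TOC", "TP", "F"]
    # Hypothetical best runs with three selected inputs each.
    example_runs = [
        ("Turb.", "DO", "NH3-N"),
        ("WT", "CODmn", "pH"),
        ("NH3-N", "Cond.", "pH"),
    ]
    shares = run_selection_percent(example_runs, candidates)
    for name in sorted(shares, key=shares.get, reverse=True):
        print(f"{name:6s} {shares[name]:5.1f}%")
```

Sorting the resulting shares in descending order gives a sensitivity ordering of the same kind as the ones quoted in Sections 3.1 to 3.3.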

Table 6
Run selections (%) of input variables at Station A as one of the three or four out of total ten input variables at Station A which give best model performance for 2 h ahead of time for the prediction of NH3-N, CODmn and TOC at Station B.

Output variable             NH3-N                     CODmn                     TOC
Number of input variables   3       4       Total     3       4       Total     3       4       Total

Run selections (%)
pH                          2%      4%      3%        9%      12%     11%       2%      6%      4%
Turbidity                   13%     15%     14%       24%     22%     23%       27%     18%     21%
Conductivity                4%      7%      6%        9%      9%      9%        4%      7%      6%
DO                          4%      3%      3%        4%      7%      5%        7%      3%      4%
Water temp.                 0%      6%      3%        4%      5%      5%        11%     7%      9%
NH3-N                       11%     16%     14%       2%      5%      4%        9%      16%     13%
CODmn                       19%     13%     16%       24%     12%     16%       16%     16%     16%
TOC                         15%     16%     16%       2%      2%      2%        7%      6%      6%
TP                          22%     7%      14%       15%     15%     15%       11%     12%     12%
Fluoride                    11%     12%     11%       7%      11%     10%       7%      7%      7%

Model performance
Mean square error (MSE)     0.0033  0.0033  0.0033    0.201   0.201   0.201     0.101   0.101   0.101
Mean percent error (MPE)    6%      6%      6%        5%      5%      5%        2%      2%      2%
Regression R                0.92    0.92    0.92      0.92    0.83    0.87      0.94    0.94    0.94

Fig. 4. Total run selections of input variables at Station A as one of the three or four out of total ten which gives best model performance for 2 h ahead of time for the prediction of NH3-N, CODmn and TOC at Station B.

Fig. 5. Mean square errors for three and four input variables out of ten for NH3-N prediction at 2 h ahead of time (30 runs).

Fig. 6. Regression R values for three and four input variables out of ten for NH3-N prediction at 2 h ahead of time (30 runs).

3.2. Chemical Oxygen Demand in manganese (CODmn) prediction

For CODmn prediction 2 h ahead of time, Table 6 and Fig. 4 show that the input variables, from most sensitive to least sensitive, are:

Turbidity > CODmn > TP > pH > Fluoride > Conductivity > DO, Water Temp. > NH3-N > TOC

The most sensitive input variables for the prediction of CODmn at Station B therefore appear to be Turbidity and CODmn at Station A. The GA-ANN model performs well for CODmn prediction, with an MSE of 0.201, an MPE of 5% and an R of 0.87. The targets and outputs are close and linearly related. Figs. 7 and 8 show the distribution of the MSE and R values obtained for three and four input variables.

Fig. 7. Mean square errors for three and four input variables out of ten for CODmn prediction at 2 h ahead of time (30 runs).

Fig. 8. Regression R values for three and four input variables out of ten for CODmn prediction at 2 h ahead of time (30 runs).

3.3. Total Organic Carbon (TOC) prediction

For TOC prediction 2 h ahead of time, Table 6 and Fig. 4 show that the input variables, from most sensitive to least sensitive, are:

Turbidity > CODmn > NH3-N > TP > Water Temp. > Fluoride > TOC > Conductivity > pH, DO

The most sensitive input variables for the prediction of TOC at Station B therefore appear to be Turbidity and CODmn at Station A. The GA-ANN model performs well for TOC prediction, with an MSE of 0.101, an MPE of 2% and an R of 0.94. The targets and outputs are close and linearly related. Figs. 9 and 10 show the distribution of the MSE and R values obtained for three and four input variables.

Fig. 9. Mean square errors for three and four input variables out of ten for TOC prediction at 2 h ahead of time (30 runs).

Fig. 10. Regression R values for three and four input variables out of ten for TOC prediction at 2 h ahead of time (30 runs).

3.4. Time delay

Fig. 11 illustrates the total run selections of time delays at Station A between 1:1 and 1:5 which give the best model performance for the prediction of NH3-N, CODmn and TOC at Station B. For NH3-N prediction, 67% of runs selected the time delay of 8 h (1:5) as giving the best model performance, 33% of runs selected 6 h (1:4), and none selected 2 h (1:2) or 0 h (1:1). For CODmn and TOC prediction, 100% of runs selected the time delay of 8 h (1:5). This suggests that the GA-ANN model in this case performs well for larger time delays and may have the capacity to predict more than 8 h ahead of time. Other studies have shown that recurrent networks have difficulties in capturing long-term dependencies (i.e. when inputs at high lags have a significant effect on network outputs). The inclusion of inputs at explicit time lags, resulting in NARX recurrent networks, has been found to considerably improve performance in such cases (Maier and Dandy, 2000).

Fig. 11. Total run selections of time delays at Station A between 1:1 and 1:5 which give best model performance for the prediction of NH3-N, CODmn and TOC at Station B.

3.5. Conclusion

Overall, the GA-ANN model was able to successfully predict the physico-chemical variables NH3-N, CODmn and TOC at Station B 2 h ahead of time using variables measured at Station A and the ANN NARX architecture. The model was also able to show which input variables are more sensitive for the prediction of the three output variables. In addition, for NH3-N, CODmn and TOC prediction the GA-ANN model performed even better for larger time delays. For future studies, data-driven models such as the GA-ANN model can be very useful for water quality prediction at Chinese stations which already measure water quality data but have not performed any modelling in the past. This type of model is easy and fast to use, and water operators can run it on the data that are immediately available from the intake measurements.

However, this study was limited to NH3-N, CODmn and TOC, which only reflect certain municipal and industrial organic wastes. Hence, the model could be further tested for other relevant parameters that were not available and which may reflect other accidental spills. Furthermore, the GA-ANN model developed for this case study can only be applied to this specific case, as it is based on the site's own past data. However, this type of modelling technique can be used in any location where sufficient data are available and can help overcome information availability constraints. Hence, it could be further tested at other locations in China with similar parameters. Finally, this type of modelling technique has proven to be of real interest for better understanding the non-linear relationships between different water quality parameters. It may certainly be a useful technique to better assess water quality for specific locations using past available data.
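The time delay settings discussed in Section 3.4 (1:1 to 1:5) amount to feeding the network inputs at explicit time lags. The sketch below shows one way such lagged input rows could be assembled. It assumes, as an interpretation of the 1:X notation rather than something stated in the text, that 1:X means the X most recent samples at Station A are used, so that with 2 h sampling 1:1 spans 0 h and 1:5 spans 8 h. It covers only the lagged exogenous inputs; the output feedback of a full NARX network is not reproduced. The function name, parameter names and sample series are hypothetical.

```python
import numpy as np


def lagged_inputs(series_a, target_b, n_taps, horizon_steps=1):
    """Build (inputs, targets) pairs from upstream and downstream series.

    series_a      -- Station A measurements, shape (n,) or (n, n_variables)
    target_b      -- Station B variable to predict, shape (n,)
    n_taps        -- number of most recent Station A samples per row (the X in 1:X)
    horizon_steps -- how many sampling intervals ahead the target lies
    """
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(target_b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if n_taps < 1 or horizon_steps < 0:
        raise ValueError("n_taps must be >= 1 and horizon_steps >= 0")
    rows, targets = [], []
    for t in range(n_taps - 1, len(a) - horizon_steps):
        window = a[t - n_taps + 1 : t + 1][::-1]  # newest sample first
        rows.append(window.ravel())
        targets.append(b[t + horizon_steps])
    return np.array(rows), np.array(targets)


if __name__ == "__main__":
    # Hypothetical 2-hourly series; one step ahead corresponds to 2 h.
    upstream = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    downstream = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
    x, y = lagged_inputs(upstream, downstream, n_taps=3, horizon_steps=1)
    print(x)
    print(y)
```

Each row of the returned input array could then be passed to a feed-forward network, with a larger `n_taps` giving the model access to older upstream measurements.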

Acknowledgements

This work was supported by the Tsinghua University-Veolia Environnement Joint Research Center for Advanced Environmental Technology (20093000328).

References

Abrahart, R.J., See, L., Kneale, P.E., 1999. Using pruning algorithms and genetic algorithms to optimise network architectures and forecasting inputs in a neural network rainfall-runoff model. J. Hydroinformatics 01 (2), 103-114.
Adeli, H., 2001. Neural networks in civil engineering: 1989-2000. Comp.-Aided Civ. Infrastruct. Eng. 16, 126-142.
Bartram, J., Ballance, R., 1996. Water Quality Monitoring - a Practical Guide to the Design and Implementation of Freshwater Quality Studies and Monitoring Programmes. UNEP/WHO.
Bowden, G.J., Dandy, G.C., Maier, H.R., 2005. Input determination for neural network models in water resources applications. Part 1 - background and methodology. J. Hydrol. 301 (1-4), 75-92.
Brownlee, J., 2011. Non-dominated Sorting Genetic Algorithm. Retrieved from Clever Algorithms. http://www.cleveralgorithms.com/nature-inspired/evolution/nsga.html.
Burchard-Levine, A., Liu, S., Vince, F., 2012. Drinking water source contamination early warning system and modelling in China: a review. Int. J. Environ. Pollut. Remediat. 1 (1), 13-19.
Cadmium spill threatens water supply in Liuzhou, China, 2012. Retrieved 14 Feb 2012 from CTV News: http://www.ctv.ca/CTVNews/TopStories/20120130/liuzhou-china-cadmium-spill-water-supply-threat-120130/.
Chen, Z., Chen, D., Xu, K., Zhao, Y., Wei, T., Chen, J., Watanabe, M., 2007. Acoustic Doppler current profiler surveys along the Yangtze River. Geomorphology, 155-165.
Flasch, O., Bartz-Beielstein, T., Davtyan, A., Koch, P., Konen, W., Oyetoyan, T.D., Tamutan, M., 2010. Comparing SPO-tuned GP and NARX Prediction Models for Stormwater Tank Fill Level Prediction. IEEE, pp. 1-8.
Flood, I., Kartam, N., 1994. Neural networks in civil engineering. I: principles and understanding. J. Comput. Civ. Eng. 8 (2), 131-148.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning, p. 412.
Grayman, W.M., 2001. Design of Early Warning and Predictive Source-Water Monitoring Systems. AWWA Research Foundation.
Hagan, M.T., Beale, M.H., Demuth, H.B., 2010. Matlab Neural Network 7 User's Guide. Mathworks, Natick, MA.
Hao, F. e, 2006. The significance, difficulty and key technologies of large scale model applied in estimation of non-point source pollution. Acta Sci., 362-365 (in Chinese).
Holland, J., 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.
Hu, B. e, 2009. Safe river water: a ubiquitous and collaborative water quality monitoring solution. Pervasive Mob. Comput. 5, 419-431.
Kuo, J.-T., Wang, Y.-Y., Lung, W.-S., 2006. A hybrid neural-genetic algorithm for reservoir water quality management. Water Res., 1367-1376.
Liu, S., Liu, W., Chen, J., Wang, Q., 2012. Optimal locations of monitoring stations in water distribution systems under multiple demand patterns: a flaw of demand coverage method and modification. Front. Environ. Sci. Eng. 6 (2), 204-212.
Maier, H.R., Dandy, G.C., 2000. Neural Networks for the Prediction and Forecasting of Water Resources Variables: a Review of Modelling Issues and Applications, 15, pp. 101-124.
Maier, H.R., Jain, A., Dandy, G.C., Sudheer, K.P., 2010. Methods used for the development of neural networks for the prediction of water resource variables in river systems: current status and future directions. Environ. Model. Softw. 25, 891-909.
May, R.J., Maier, H.R., Dandy, G.C., 2009. Developing artificial neural networks for water quality modelling and analysis. In: Hanrahan, G. (Ed.), Modelling of Pollutants in Complex Environmental Systems. ILM, pp. 27-61.
Zhao, Y., Nan, J., Cui, F.-y., Guo, L., 2007. Water Quality Forecast through Application of BP Neural Network at Yuqiao Reservoir, 8(9), pp. 1482-1487.
