Water Research 51 (2014) 234–245


A coupled classification – evolutionary optimization model for contamination event detection in water distribution systems

Nurit Oliker, Avi Ostfeld*

Faculty of Civil and Environmental Engineering, Technion – Israel Institute of Technology, Haifa 32000, Israel

Article history: Received 26 July 2013; Received in revised form 25 October 2013; Accepted 26 October 2013; Available online 8 November 2013.

Keywords: Water distribution systems; Water quality; Water security; Event detection; Support vector machine; Sequence analysis.

Abstract

This study describes a decision support system that alerts for contamination events in water distribution systems. The developed model comprises a weighted support vector machine (SVM) for the detection of outliers, and a following sequence analysis for the classification of contamination events. The contribution of this study is an improvement of the contamination event detection ability and a multi-dimensional analysis of the data, differing from the parallel one-dimensional analysis conducted so far. The multivariate analysis examines the relationships between water quality parameters and detects changes in their mutual patterns. The weights of the SVM model accomplish two goals: blurring the difference between the sizes of the two classes' data sets (as there are many more normal/regular than event time measurements), and incorporating the time factor through a time decay coefficient, ascribing higher importance to recent observations when classifying a time step measurement. All model parameters were determined by data driven optimization, so the calibration of the model is completely autonomic. The model was trained and tested on a real water distribution system (WDS) data set with randomly simulated events superimposed on the original measurements. The model is prominent in its ability to detect events that are only partly expressed in the data (i.e., affecting only some of the measured parameters). The model showed high accuracy and better detection ability compared to previous modeling attempts of contamination event detection.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Securing infrastructure is unquestionably an eminent task, vital for the population's welfare. A water distribution system (WDS) is inherently vulnerable, as it comprises numerous exposed elements that can be easily infiltrated. The threat of contamination events in a WDS, whether deliberate, accidental, or natural, raises growing concern worldwide. Many

resources have been invested, both in academia and industry, in the development of contamination warning systems. The two major issues in the application of such systems are the placement of sensors in the network and the analysis of the measured data. The problem of sensor placement has been widely explored, featuring over 90 studies related to sensor placement optimization (Hart and Murray, 2010). The vast majority of sensor placement models use the assumption of a "perfect


sensor", meaning that if a sensor measures any concentration of a contaminant, it will detect it promptly. A few studies assumed that sensors will detect any contaminant above some threshold concentration. In reality, the recognition of a contaminant is a complex task. In the overall development of warning systems, there is a gap in the data analysis element and a need to elaborate the models and, correspondingly, the evaluation of their detection ability. Some attempts were made to develop sensors suitable for identifying specific pollutants according to their unique properties [e.g., the use of light scattering for the detection of a spectral signature (Adams and McCarty, 2007)]. The large variety of pollutants, however, makes it impossible to deal with all of them, and problematic to focus on just some. In addition, the task of pollutant recognition proved rather complex. Thus, the approach of specific contaminant recognition fell from favor, and a more generic approach was adopted. The latter features the use of typically monitored water quality parameters, such as turbidity, electrical conductivity, pH, and residual chlorine concentration, for the detection of abnormal behavior. The premise of this approach is that a contaminant intruding the system will cause some disturbance in the general water quality measurements. Accordingly, information from online water quality sensors may provide an early indication of a pollutant's presence in the WDS. The challenge is then to distinguish between normal behavior of the parameters and changes triggered by contaminant intrusion.

A few studies utilized general water quality measurements for the purpose of contamination event detection. Murray et al. (2010), Perelman et al. (2012), and Arad et al. (2013) developed event detection models featuring a parallel analysis of the water quality parameters: machine learning methods learn the behavior of each parameter's time series and predict the expected measured value of the next time step. That way, the models identify deviations from expected behavior and classify measurement outliers, and the estimations for all the parameters are integrated to assess the probability of an event occurrence. Guepie et al. (2012) developed a model based on residual chlorine decay; their premise was that a contaminant in the WDS will consume a significant fraction of the measured chlorine, so this single parameter would provide sufficient valuable data. The above studies utilized supervised classification methods, through which the classifier learns to distinguish between normal operation and event time measurements according to a given training data set. The lack of real event time measurements requires the use of simulated contamination events for the training and testing of the models. To maintain generality, and in the absence of adequate knowledge, the models apply random disturbances to the measured data to represent the contaminant effect. Uber et al. (2007) provided guidelines for event simulation based on contaminant reaction kinetics and uncertainty. The aforementioned studies utilized water quality time series for contamination event detection by applying simultaneous processes of univariate analysis; the parallel processes were followed by an integration of the analyses for the assessment of the event occurrence probability. A multivariate analysis differs from the studies that were

conducted so far by involving the observation and analysis of more than one parameter at a time. The objectives of this study are: (1) to apply a multivariate analysis of the data, i.e., the simultaneous analysis of all variables in a multi-dimensional space. The multivariate analysis produces a very different description of the system, revealing phenomena in the evolving relationships between parameters; (2) to develop a complete and independent system whose input is the on-line measurements and whose output is a contamination event alert; and (3) to develop a model that requires minimal calibration.

This work suggests a contamination event detection model supplying an autonomic decision support system. The study applies a weighted support vector machine (SVM) classifier for the detection of outlier measurements, and a following sequence analysis for the classification of events. SVM was introduced by Boser et al. (1992). It is a very popular classification method, prominent in its high accuracy and its ability to deal with high-dimensional data. The SVM transforms the learning task into a constrained optimization problem. The training data set is used to create a hyperplane which separates the space into two classes. A decision margin, situated on both sides of the hyperplane, defines the separating area. The objectives that define the hyperplane parameters are: maximization of the separation area, and minimization of the error of misclassified vectors. Naturally, this is a trade-off, since the more errors enabled, the larger the margin will be. The SVM problem can be defined as:

\[ \min_{W,\,b,\,S_i \ge 0} \; \tfrac{1}{2}\|W\|^2 + C \sum_{i=1}^{n} S_i \quad \text{subject to: } y_i (W^{T} x_i + b) \ge 1 - S_i, \;\; i = 1, \dots, n \tag{1} \]

where W is the normal vector to the hyperplane, C is the classifier parameter, S_i are the slack variables, n is the number of vectors in the training data set, y_i is either +1 or −1, indicating the class to which the vector x_i belongs, and b is a coefficient which determines the intercept. The geometric margin (the separating region lying on both sides of the hyperplane) is expressed by 1/||W|| (note that maximizing 1/||W|| is equivalent to minimizing ||W||²). The slack variable S_i holds the misclassification error of the vector x_i: a slack equals 0 for a well-classified vector (not contributing to the error), and takes a positive value for a misclassified vector (according to the error distance). The first part of the objective function in Eq. (1) expresses the geometric margin maximization, while the second part describes the classification error minimization. The slack variables range between 0 and 1 for vectors lying within the separating area, and exceed 1 for misclassified ones. The vectors with positive corresponding slacks are termed support vectors. C is a constant which sets the relative importance of maximizing the margin versus minimizing the slack variables. A higher value of C implies a higher penalty for each misclassified vector, thus reducing the number of support vectors; accordingly, a higher C value decreases the margin area. The constraints assure that all vectors will be classified correctly, excluding the support vectors.


In cases in which the data is not linearly separable, the use of a linear classifier such as the SVM requires some modification of the data. The so-called kernel trick maps the input space data into a feature space using a non-linear function. If the original data is not linearly separable, it is possible that its projection in a higher dimension will be. Trial and error is usually used to determine a suitable kernel function (Ben-Hur and Weston, 2010).
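The effect of the kernel trick can be illustrated with a short, self-contained sketch (not taken from the paper): synthetic "outliers" deviate in both directions from a normal cluster, so no single hyperplane in the input space can isolate them, while an RBF-kernel SVM can. The data, parameter values, and scikit-learn usage are illustrative assumptions.

```python
# Minimal sketch (not from the paper) contrasting a linear and an RBF-kernel SVM
# on data whose outliers deviate in BOTH directions from the normal cluster.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.5, size=(200, 2))            # "normal" cluster near 0
outliers = np.vstack([rng.normal(3.0, 0.5, size=(20, 2)),
                      rng.normal(-3.0, 0.5, size=(20, 2))])
X = np.vstack([normal, outliers])
y = np.hstack([np.ones(len(normal)), -np.ones(len(outliers))])

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)     # Gaussian RBF kernel

print("linear accuracy:", linear.score(X, y))  # ~0.83: tends to label all points normal
print("RBF accuracy:", rbf.score(X, y))        # typically near 1.0
```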

2. Methodology

The proposed methodology (Fig. 1) consists of two major parts: an inner weighted SVM classifier for revealing outlier measurements (i.e., the SVM PARAMETER SELECTION stage), and a

framework of sequence analysis which exploits the binary output of the SVM for events classification (i.e., the OPTIMIZATION STAGE). The algorithm is aimed at classifying multivariate time series data (Fig. 2) as an event or normal operation instance. Below are the main notions of the model, followed by a detailed description, and an illustrative example.

2.1. Weighted non-linear SVM

2.1.1. Non-linear classifier

It is assumed that contamination events can cause deviations in the data in both directions (i.e., each of the measured values can be inclined either in the positive or the negative direction). The data is thus not linearly separable.

Fig. 1 – Methodology scheme.


Fig. 2 – Water quality time series – normal and randomly simulated events.

A linear classifier has a single hyperplane discerning between normal and outlier vectors; a single hyperplane may classify only one direction of deviations as outliers, while the other direction would necessarily belong to the major "normal" class. Hence, a linear classifier is insufficient for this problem and the use of a kernel function is required. Selected by symmetry considerations and some trial and error tests, the applied kernel function is the Gaussian Radial Basis Function, expressed by:

\[ K(X_i, X_j) = \exp\!\left( -\frac{\|X_i - X_j\|^2}{2} \right) \tag{2} \]

where X_i is a given classified vector and X_j is the support vector. The norm represents the Euclidean distance between the classifier boundary and the classified data. This transformation attributes a normal distribution to the data, examining its deviation from the Gaussian function (Buhmann, 2003).
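For concreteness, a small numpy sketch of the kernel in Eq. (2) follows; the vectorized distance computation is an implementation choice, not code from the paper.

```python
# Sketch of the Gaussian RBF kernel of Eq. (2): K(Xi, Xj) = exp(-||Xi - Xj||^2 / 2).
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Kernel matrix K with K[i, j] = exp(-||A[i] - B[j]||^2 / 2)."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / 2.0)

K = rbf_kernel(np.array([[1.0, 0.0]]), np.array([[0.0, 0.0]]))
print(K)  # exp(-0.5) ~ 0.6065
```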

2.1.2. Vector weights

The common SVM treats all input vectors in the same way. In the case of an unbalanced data set, the classifier will not work properly and a correction is required. In this study, two imbalances were taken into account. The first is unbalanced class sizes. When the data set contains significantly more vectors belonging to one class, the classifier might classify every vector as part of that class. Such a classifier may be quite accurate, as most of the data will be well-classified; however, it clearly misses the purpose of classification, failing to distinguish between the two classes. To create a classifier that focuses on separating the two classes, the differences between the two data set sizes

should be blurred. This can be obtained by calculating the ratio between the two data sets and giving each vector a weight inversely correlated to its class size. That way a minor class vector receives a higher penalty as a support vector, and thus higher importance when constructing the classifier. In this study, the WDS data contains many more normal operation measurements than event-time measurements, and thus requires such an adjustment. The second imbalance is related to the time factor. The water quality parameters are measured continuously and produce a set of time series. When classifying an observation using all prior data, it is reasonable to give higher importance to recent observations. Hence, the model includes another weight factor which decays with time. This weight corresponds to the difference between the measurement time and the current classification time. That way the classifier gives higher importance to recent observations, yet fully exploits the entire data set. Thus, for each support vector, the weight attributed to the objective function is a combination of the two factors, unbalanced group sizes and time decay, and is given by:

\[ M_i(j, k, \Delta t) = \exp(-0.0001\,\Delta t) + \frac{\mathrm{Size}(k)}{\mathrm{Size}(j)} \tag{3} \]

where M_i is the weight given to vector i, j is the class to which vector i belongs, k is the opposing class, and Δt is the number of time steps between the measurement and the current classification. The first part of the expression is the time decay factor, and the second is the class size blurring. The equal weights given to the two factors (i.e., the coefficients of the time decay and class size elements both equal one), together with the 0.0001 exponent value, were determined in order to equalize their scales, so that both factors are well represented in the expression. The value of the exponent (0.0001) was


selected to attribute significantly higher weights to recent observations, and yet give a non-negligible weight to previous ones. Including the weight vector M_i in the model expands the formulation of the objective function given in (1) to:

\[ \min_{W,\,b,\,S_i \ge 0} \; \tfrac{1}{2}\|W\|^2 + C \sum_{i=1}^{n} M_i S_i \tag{4} \]
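A minimal sketch of Eqs. (3) and (4) follows. scikit-learn's per-sample weights scale the penalty C per vector, which corresponds to the C·M_i·S_i term; the variable names and toy data are illustrative assumptions, not the authors' code.

```python
# Sketch of the vector weights of Eq. (3) feeding the weighted objective of Eq. (4).
import numpy as np
from sklearn.svm import SVC

def vector_weights(y: np.ndarray, dt: np.ndarray) -> np.ndarray:
    """Mi = exp(-0.0001 * dt) + Size(k) / Size(j), per Eq. (3)."""
    n_normal = np.sum(y == 1)
    n_event = np.sum(y == -1)
    # Size of the opposing class divided by the size of the vector's own class.
    class_ratio = np.where(y == 1, n_event / n_normal, n_normal / n_event)
    return np.exp(-0.0001 * dt) + class_ratio

y = np.array([1, 1, 1, 1, -1])            # 1 = normal, -1 = event time
dt = np.array([4, 3, 2, 1, 0])            # time steps back from the current classification
X = np.random.default_rng(1).normal(size=(5, 6))  # six water quality parameters

M = vector_weights(y, dt)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=M)            # weighted soft-margin SVM
```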

2.2. Sequence analysis

The SVM classifies measurements as outliers or normal operation. Yet, classification of outliers differs from classification of events. A single time-step outlier does not necessarily indicate an event occurrence, as the outlier might originate from measurement noise or error; a sequence of outliers, however, is much stronger evidence of a possible event occurrence. The classification of events is thus based on a sequence analysis of the SVM output. The sequence length being analyzed is selected by the enumeration described below. In every time step, the binary sequence ending in the current one is utilized for classifying it as a normal or an event time step. The sequence is evaluated by a weighted likelihood measure of three weights a1, a2, and a3 (which sum to unity), a threshold alert value T, and three sequence likelihood coefficients of proportion, continuity, and probability:

\[ \mathrm{SLM} = a_1 \cdot \mathrm{proportion} + a_2 \cdot \mathrm{continuity} + a_3 \cdot \mathrm{probability} \tag{5} \]

where SLM = sequence likelihood measure. If SLM > T, the current time step is classified as an event. The sequence likelihood coefficients in (5) are: proportion – the proportion of outliers in the sequence (i.e., the number of outliers divided by the sequence length); continuity – the longest series of successive outliers within the analyzed sequence, divided by the sequence length; and probability – the appearances of the sequence are checked to characterize its probability of representing an event time observation (i.e., the probability is the number of event time appearances divided by the total number of appearances). The probability element represents the known history of the sequence, the proportion expresses its exceptional extent relative to other measurements, and the continuity describes the reliability of the outlier indications; that is to say, the presence of outliers in the sequence is stronger evidence of an event if the outliers are successive. The weights a1, a2, a3, as well as the threshold value T, are determined by the optimization stage described further below.

The model parameters are all set by data driven optimization. As such, the utilization of the model does not require user intervention, and the calibration of the system is completely autonomic. To this end, a part of the data base is dedicated to the model calibration. For modeling, the data base was divided into training, validation, and testing sets. The model is calibrated by iteratively training the classifier on the training data set and testing it on the validation data set to assess its suitability under different parameters. That way the selection of parameter values is accomplished autonomously by data driven optimization. The model includes three data driven parameter selections: (1) enumeration for selecting the SVM parameter, (2) optimization for choosing the three measure weights a1, a2, a3 and the alert threshold T, and (3) enumeration for picking and evaluating sequence lengths.
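A minimal sketch of the sequence likelihood measure of Eq. (5), assuming the training history is stored as counts of (event-time appearances, total appearances) per sequence; the helper names are illustrative, not the authors' code.

```python
# Sketch of Eq. (5): SLM = a1*proportion + a2*continuity + a3*probability.
from typing import Dict, Tuple

def slm(seq: Tuple[int, ...], history: Dict[Tuple[int, ...], Tuple[int, int]],
        a1: float, a2: float, a3: float) -> float:
    L = len(seq)
    proportion = sum(seq) / L
    # Longest run of successive outliers (1s), divided by the sequence length.
    longest, run = 0, 0
    for bit in seq:
        run = run + 1 if bit == 1 else 0
        longest = max(longest, run)
    continuity = longest / L
    # history maps a sequence to (event-time appearances, total appearances).
    events, total = history.get(seq, (0, 0))
    probability = events / total if total > 0 else 0.0
    return a1 * proportion + a2 * continuity + a3 * probability

history = {(0, 1): (2, 3)}  # the (0,1) sequence: 2 of its 3 appearances were events
print(slm((0, 1), history, 0.3, 0.1, 0.6))  # 0.3*0.5 + 0.1*0.5 + 0.6*0.667 ~ 0.60
```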

2.3. Detailed model description

The proposed methodology scheme is given in Fig. 1. It is comprised of the following steps:

2.3.1. Data cleaning

The model comprises a simple data cleaning scheme which consists of removing non-positive values and values which exceed several standard deviations from the mean. Negative or zero values are physically infeasible for the measured water quality parameters and are thus removed from the data set. The threshold for reasonably well-measured data was set to four standard deviations: very exceptional values, situated at least four standard deviations away from the mean, are eliminated.
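The cleaning rule can be sketched in a few lines, assuming pandas is used for the time series; the function name is illustrative.

```python
# Sketch of the data cleaning scheme: drop non-positive readings, then drop
# values more than four standard deviations from the mean.
import pandas as pd

def clean(series: pd.Series, n_std: float = 4.0) -> pd.Series:
    s = series[series > 0]                            # non-positive values are infeasible
    return s[(s - s.mean()).abs() <= n_std * s.std()]  # remove extreme outliers

# usage: cleaned_ph = clean(raw["pH"])
```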

2.3.2. Setting the SVM parameter by enumeration

This step sets the SVM parameter, expressed as C in the objective function (1), through enumeration. This parameter determines the tradeoff between a wider margin area and fewer misclassified vectors. The enumeration process tests the performance of the classifier with different C values, ranging from 0.1 to 20 in intervals of 0.1 (i.e., 200 possible solutions). For each value, the SVM classifier is constructed from the training data and tested on the validation part of the data set. The performance of the classifier is evaluated by maximizing a weighted measure of accuracy and detection ability:

\[ \max \;(\mathrm{Accuracy} + \mathrm{Detection\ ratio}), \quad \mathrm{Accuracy} = \frac{\text{well classified time steps}}{\text{total number of time steps}}, \quad \mathrm{Detection\ ratio} = \frac{\text{detected events number}}{\text{total events number}} \tag{6} \]

The accuracy is estimated by the proportion of well classified vectors out of all classifications. Similarly, the detection ability is estimated by the proportion of detected events out of the total number of events (an event is considered detected if any vector was classified as an outlier during the event). The C value of the classifier which attained the best performance is selected. It should be noted that enumeration was preferred over optimization for determining the C parameter due to the instability of the Accuracy + Detection ratio objective function: a small change in the C value causes a significant change in the objective function, and an optimization model would have a fair chance of converging to a local optimum. If the solution found is multi-optimal and different C values feature the best performance, then the value situated in the smoother area of the function is selected (i.e., the highest measure value surrounded by the most moderate slopes). An example of the objective function versus C is shown in Fig. 3; it can be seen that the objective function is highly non-smooth.
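The enumeration reduces to a small grid search; the sketch below assumes a score_fn callable that trains the weighted SVM with a given C and returns the Eq. (6) measure on the validation set.

```python
# Sketch of the enumeration over C in [0.1, 20] at steps of 0.1 (200 candidates).
import numpy as np

def select_C(score_fn):
    c_grid = np.arange(0.1, 20.0 + 1e-9, 0.1)    # 200 candidate C values
    scores = np.array([score_fn(C) for C in c_grid])
    # On ties, the paper additionally prefers the smoothest region; omitted here.
    return float(c_grid[int(np.argmax(scores))])
```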


Fig. 3 – An example of the enumeration for selecting the SVM parameter C.

2.3.3. Optimization of the measure weights a1, a2, a3, and the alert threshold T

The measure weights and alert threshold of the sequence analysis are set by a GA optimization (Deb, 2001). The GA parameters are given in Table 1. Each solution vector includes a1, a2, a3, and T. The objective function is given in (6) and is identical to the one used in the enumeration stage (Section 2.3.2). Each solution in the population is evaluated through classification of the validation data set. The optimal sequence length is selected for each solution in an inner enumeration process, described further below, which is used to obtain the fitness value of the optimization objective (6). Fig. 4 presents an example of the algorithm's progress, converging close to the global optimum of two.

Fig. 4 – An example of the GA progress along generations.
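A toy GA consistent with Table 1 (population 20, roulette parent selection, crossover fraction 0.8, 2 elites, 100 generations, a1 + a2 + a3 = 1) can be sketched as follows. The fitness callable, mutation scale, and constraint-repair step are illustrative assumptions rather than the authors' implementation.

```python
# Toy GA sketch for the variables (a1, a2, a3, T), mirroring Table 1.
import numpy as np

rng = np.random.default_rng(0)
LB = np.array([0.01, 0.01, 0.2, 0.01])   # lower bounds for a1, a2, a3, T
UB = np.array([0.99, 0.99, 0.95, 0.99])  # upper bounds

def repair(v):
    v = np.clip(v, LB, UB)
    v[:3] /= v[:3].sum()                  # (approximately) enforce a1 + a2 + a3 = 1
    return np.clip(v, LB, UB)

def ga(fitness, pop_size=20, generations=100, crossover=0.8, elites=2):
    pop = np.array([repair(rng.uniform(LB, UB)) for _ in range(pop_size)])
    for _ in range(generations):
        fit = np.array([fitness(*ind) for ind in pop])      # Accuracy + Detection ratio
        order = np.argsort(fit)[::-1]
        new = [pop[i].copy() for i in order[:elites]]       # elitism
        p = fit - fit.min() + 1e-9
        p /= p.sum()                                        # roulette-wheel probabilities
        while len(new) < pop_size:
            a, b = pop[rng.choice(pop_size, 2, p=p)]
            if rng.random() < crossover:
                child = np.where(rng.random(4) < 0.5, a, b)  # uniform crossover
            else:
                child = a + rng.normal(0.0, 0.05, 4)         # Gaussian mutation
            new.append(repair(child))
        pop = np.array(new)
    fit = np.array([fitness(*ind) for ind in pop])
    return pop[int(np.argmax(fit))]
```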

2.3.4. Sequence length enumeration

For given a1, a2, a3, and T, the best sequence length is selected through enumeration. Any sequence length produces different likelihood coefficients (i.e., proportion, continuity, probability) and thus results in different threshold comparisons and different classifications. The selection of the sequence length (i.e., the number of previous time steps taken into account in the classification) is determined through enumeration, which identifies the sequence length producing the best objective performance as given in (6). An example of the calculated objective performance (6) as a function of sequence length for given a1, a2, a3, and T is shown in Fig. 5; the best sequence length is 8, yielding an Accuracy + Detection ratio objective value of 1.968. Generally, for most solutions found for the data base used, the best length was around 10 bits. It is likely that this length supplies adequate information on the one hand (representing about 50 min of measurements), while still being short enough that the known data set can provide enough sequence examples for the probability calculation in expression (5). The enumeration ranges from two time steps up to a maximum event duration, assumed to be 36 time steps. The lower limit constrains the classification to utilize at

Table 1 – The GA properties and optimization parameters (Deb, 2001).

GA attributes:
- Population size: 20
- Crossover fraction: 0.8
- Maximum generation number: 100
- Elite solution number: 2
- Population type: Real vector
- Initial population selection: Feasible population (constraint satisfying)
- Parent selection: Roulette

Optimization parameters:
- Decision variables number: 4
- Variables: 3 measure weights (a1, a2, a3) and alert threshold T (a1 = proportion, a2 = continuity, a3 = probability)
- Objective function: Max(Accuracy + Detection ratio) (Eq. (6))
- Constraints number: 1
- Constraint: a1 + a2 + a3 = 1
- Variables lower bound (a1, a2, a3, T): [0.01, 0.01, 0.2, 0.01]
- Variables upper bound (a1, a2, a3, T): [0.99, 0.99, 0.95, 0.99]


least one time step prior to the classified one. The upper limit enables the analyzed sequence to include an entire event record. The notion is that a lower maximum would restrict the analysis to a partial segment of the event-time outlier detections, while a higher limit may provide surplus, degenerate information.

Fig. 5 – Enumeration of sequence length versus Accuracy + Detection ratio for given a1, a2, a3, and T.
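As a sketch, the length selection is again a small grid search over lengths 2 to 36, assuming a score_fn that runs the sequence analysis with a candidate length and returns the Eq. (6) measure on the validation set.

```python
# Sketch of the sequence length enumeration over 2..36 time steps.
def select_sequence_length(score_fn, min_len: int = 2, max_len: int = 36) -> int:
    return max(range(min_len, max_len + 1), key=score_fn)
```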

2.4. Illustrative example

This section is aimed at clarifying the proposed methodology, as depicted in Fig. 1, through a simple illustrative example. Consider a data set which consists of 20 time steps, divided into training and testing sub data sets (Table 2). The first 13 bits (i.e., time steps) are taken as training data (65%), while the last seven (35%) are used for testing (note that part of the described training data set was allocated to the validation data set for the model calibration). In each time step a six-length vector of measurements is received from the on-line sensor. The vector is observed and analyzed in its six-dimensional space. A vector that belongs to the training data set is used to place the separating hyperplane according to the optimization given in (4). A vector belonging to the testing data set is used to classify the measured time step as normal or outlier, depending on its location relative to the separating hyperplane (which divides the space into normal and outlier regimes). Compared to the parallel single-variable analysis, the described process provides a multivariate analysis, as the 6-dimensional measurement vector is classified in a 6-dimensional space by its location relative to a single hyperplane. This differs from the parallel analysis, where each of the 6 measured parameters is analyzed separately, producing six different sequences of outlier detections that are then integrated into a single event alert signal. After invoking the SVM stage, its output is a 20-bit binary sequence, representing the classification of the measurements as normal (0) or outlier (1). Note (Table 2) that a value of 1 in the true results vector is associated with an event occurrence, whereas for the SVM outcome a 1 entry indicates an outlier classification. The last bit of the analyzed sequence corresponds to the classified time step; that is to say, a time step is classified according to the analysis of the sequence ending in that time step. The optimization performs the selection of the optimal set of model variables a1, a2, a3, and T. Given a specific set of a1, a2, a3, and T (i.e., an optimization solution vector), the best sequence length associated with that solution is searched through enumeration, screening segments of two to 36 adjacent time steps. The training data set is used to estimate the probability of sequences, while the validation data is utilized to quantify their performance. For example, consider a1 = 0.3, a2 = 0.1, a3 = 0.6, and T = 0.65. For this case the sequence likelihood measure (SLM) will equal:

\[ \mathrm{SLM} = 0.3 \cdot \mathrm{proportion} + 0.1 \cdot \mathrm{continuity} + 0.6 \cdot \mathrm{probability} \tag{7} \]

To trigger an event at a given time step, the condition is SLM > 0.65. The sequence likelihood measure is calculated at each time step. The probability is a function of the sequence and the training data set, while the proportion and continuity are functions of the analyzed sequence alone. For classifying the first time steps in the validation data set, the missing previous time steps, which belong to the training data set, are utilized. Refer to the SVM classification row of the testing data set in Table 2 (i.e., the vector 0110100). The enumeration starts from a 2-bit sequence.

Table 2 – Illustrative example. Time steps 1–13 form the training data set; time steps 14–20 form the testing data set.

Time step number                      1 2 3 4 5 6 7 8 9 10 11 12 13 | 14 15 16 17 18 19 20 | Accuracy + Detection ratio (Eq. (6))
True results (1 = E, 0 = N)           1 1 1 1 1 0 0 0 0 1  1  0  0  | 0  1  1  1  1  0  0  | NA
SVM classification (1 = O, 0 = N)     0 1 1 1 1 1 0 1 0 1  0  0  0  | 0  1  1  0  1  0  0  | NA
2 bit sequence event classification   NA                            | 0  0  1  0  0  0  0  | 1.57
3 bit sequence event classification   NA                            | 0  0  1  0  0  0  0  | 1.57
4 bit sequence event classification   NA                            | 0  0  0  0  0  0  0  | 0.43

Legend: N = normal, E = event, O = outlier, NA = not applicable.


The first analyzed sequence, used for the classification of the first time step of the testing data set, is (00), where the first zero is taken from the training data set and the second is the first bit of the testing data set. This (00) sequence appeared in the training data set of the SVM classification twice (starting at time step 11 and starting at time step 12). In both instances the last time step occurred under normal operation conditions (refer to time steps 12 and 13, for which the true result is normal operation). The probability of this (00) sequence is thus 0/2 = 0. The proportion and continuity likelihood variables both equal 0 as well (expressing the proportion of outliers and their continuity within the sequence). The SLM is thus 0.3 × 0 + 0.1 × 0 + 0.6 × 0 = 0 < 0.65; as a result the sequence (00) does not trigger an event, and the first time step in the testing data set for the 2-bit sequence analysis is classified as normal (refer to the first zero entry of the 2-bit sequence event classification row in Table 2). The following 2-bit sequence to be evaluated is (01). This sequence appears three times in the training data set of the SVM classification (refer to time steps 1–2, 7–8, and 9–10). Two of those occurrences were while an event took place (time steps 1–2 and time steps 9–10) and one (7–8) was during normal operation. The probability thus equals 2/3, and the proportion and continuity each equal 1/2. The SLM = 0.3 × 0.5 + 0.1 × 0.5 + 0.6 × 0.667 = 0.6 < 0.65; thus the second time step is also classified as normal (i.e., 0) in the 2-bit sequence event classification. This same procedure is repeated for all 2-bit sequences in the testing data set (i.e., the next to be evaluated is (11)), producing the 2-bit sequence event classification vector 0010000. The accuracy of this classification is 4/7, as four of the seven time steps were correctly classified (compare 0010000 to the true results 0111100 of the testing data set). The detection ratio of this classification is 1, as there was one true event during time steps 15 to 18 (refer to the true results of the testing data set), which the 2-bit sequence event classification revealed at time step 16. The Accuracy + Detection ratio objective value is thus 0.57 + 1 = 1.57. Following the same procedure, 3-bit and 4-bit sequences are analyzed, producing the results given in Table 2. The first sequence analyzed for the 3-bit classification is (000), where the first two zeros are taken from the training data set (i.e., time steps 12 and 13). It can be seen that the result of the 4-bit classification is significantly worse than the 2- and 3-bit classifications. It is likely that a 4-bit sequence is long relative to the size of the training data set of the illustrative example: the number of examples available for the probability calculation is small and might lead the classifier to inaccuracy. Yet it should be noted that the performance of the model for each sequence length changes according to the example, and the presented mismatch of the 4-bit sequence represents this special case alone. Following the enumeration of all sequence lengths, the length which maximizes the Accuracy + Detection ratio objective value for a1 = 0.3, a2 = 0.1, a3 = 0.6, and T = 0.65 is stored. This Accuracy + Detection ratio value is then returned to the GA procedure as the objective fitness value of the solution vector, for further altering of a1, a2, a3, and T.
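For verification, the Eq. (6) value of the 2-bit classification can be recomputed directly; the vectors below are copied from Table 2.

```python
# Reproducing the Accuracy + Detection ratio of the 2-bit classification in the
# illustrative example (testing time steps 14..20).
true_results   = [0, 1, 1, 1, 1, 0, 0]   # true results of the testing data set
classification = [0, 0, 1, 0, 0, 0, 0]   # 2-bit sequence event classification

accuracy = sum(t == c for t, c in zip(true_results, classification)) / len(true_results)
# One true event spans steps 15-18; it counts as detected since step 16 is flagged.
detected = any(c == 1 and t == 1 for t, c in zip(true_results, classification))
detection_ratio = 1.0 if detected else 0.0
print(round(accuracy + detection_ratio, 2))  # 0.57 + 1 = 1.57
```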

2.5. Real time model performance and updating

The proposed methodology is aimed at running off-line on historical data to tune its parameters toward the real time event classification goal. The off-line output model parameters are the optimal SVM coefficient C, the sequence length (SL), the measure weights a1, a2, a3, and the threshold T (i.e., a total of six parameters: C, SL, a1, a2, a3, T).


Commonly, an SVM exploits only the training data to construct its classifier. In the case of time dependent vectors, as in this study, the classifier should continuously utilize the latest available data. There is no justification for a time decay weight if the classifier consists only of the training data, as the latest observations, which receive the highest weights in the classification, are never updated. Therefore, for real time event classification an updating model is applied. The assumption is that a contamination event will be detected within 24 h; thus, with a 24 h delay, the data can be used to reconstruct the classifier. In each time step the classifier is reconstructed using all of the data up to the last 24 h. That way the SVM relies on an increasing data set, and its classification reliability improves accordingly. Using the updated, increasing data base, the probability measure associated with the selected off-line best sequence length (SL) is updated on-line. All other off-line model parameters (C, SL, a1, a2, a3, T) remain constant. It should be noted that conceptually those parameters should also be updated occasionally as the data set substantially increases, but this was not explored in this study. The entire computer code of the proposed methodology and metadata on the program structure are provided as supplementary material.
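The updating scheme can be sketched as follows, assuming fit_svm is a callable wrapping the weighted SVM construction of Eq. (4); the 24 h delay corresponds to 288 five-minute time steps.

```python
# Sketch of the real-time updating scheme: at each classification step the
# weighted SVM is rebuilt on all data older than 24 h, since event labels are
# assumed reliable only after that delay.
DELAY = 24 * 12  # 24 h expressed in 5-min time steps

def updated_classifier(fit_svm, X, y, t):
    cutoff = max(0, t - DELAY)       # the newest 24 h are excluded from training
    return fit_svm(X[:cutoff], y[:cutoff])
```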

3. Application

3.1. Data base

The methodology was tested on real data (Fig. 2) collected by a utility in the United States and available from CANARY (2013). It incorporates online multivariate water quality measurements taken every 5 min during four weeks (i.e., ~8000 time steps) through a supervisory control and data acquisition (SCADA) system. All data resemble normal operating conditions and include: total chlorine, electrical conductivity (EC), pH, temperature, total organic carbon (TOC), and turbidity. The data set was divided into three subsets: 50% for training, 20% for validation, and 30% for testing. The training and validation sub data sets are utilized as "off-line" data to train and tune, respectively, the model parameters C, SL, a1, a2, a3, T. The 30% testing data set imitates real-time operation for the exploration of the model performance.
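Since the data is a time series, the split is chronological rather than shuffled; a trivial sketch (proportions from the text, function name illustrative):

```python
# Sketch of the chronological 50/20/30 split; no shuffling, so the sequential
# structure of the time series is preserved.
def split_chronological(data):
    n = len(data)
    i, j = int(0.5 * n), int(0.7 * n)
    return data[:i], data[i:j], data[j:]  # training, validation, testing
```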

3.2. Simulated events

Unfortunately, for the classification task there is no available data measured during an event occurrence. Therefore, to train the SVM classification model, contamination events were simulated and superimposed on the measured routine patterns (see Fig. 2), following Klise and McKenna (2006), McKenna et al. (2008), Perelman et al. (2012), and Arad et al. (2013). The events were characterized by their magnitude, direction, and effect duration, randomly determined by uniform selection from a set range of values, as presented in Table 3. The superimposed events followed the Beta distribution function (Keeping, 2010), varied through its two shape parameters.
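One plausible way to realize such an event, consistent with Table 3 but not the authors' published code, is to add a Beta(2, 2)-shaped bump of random duration, strength, and sign to one parameter's series (parameter ranges below follow the "Low" event type):

```python
# Sketch of a randomly simulated event superimposed on a measured series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulate_event(series: np.ndarray) -> np.ndarray:
    duration = rng.integers(36, 73)                  # 3-6 h of 5-min steps
    strength = rng.uniform(0.5, 2.0) * series.std()  # 0.5-2 standard deviations
    sign = rng.choice([-1.0, 1.0])                   # deviation direction
    start = rng.integers(0, len(series) - duration)  # assumes len(series) > duration
    profile = stats.beta.pdf(np.linspace(0, 1, duration), 2, 2)
    profile /= profile.max()                         # peak scaled to 1
    out = series.copy()
    out[start:start + duration] += sign * strength * profile
    return out
```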


Table 3 – The ranges of the simulated event properties. The frequency of all events ranged between once and twice a day; the Beta distribution shape parameters were both set to two. (The third column-group header, "Number of influenced parameters", is inferred from Section 3.4.)

Event type   Number of five-minute time steps   Strength (standard    Number of influenced
             (total duration in hrs)            deviations)           parameters
             Minimum      Maximum               Minimum   Maximum     Minimum   Maximum
Slight       24 (2)       48 (4)                0         1           6         6
Low          36 (3)       72 (6)                0.5       2           6         6
Medium       48 (4)       72 (6)                1         2.5         6         6
Partial      36 (3)       72 (6)                0.5       2           1         6

The event duration was assumed to be between two and 6 h (Table 3). The magnitude strength ranged from zero to 2.5 standard deviations. The direction of the deviations was also set randomly. Both magnitude and direction were set separately for each water quality parameter; that is to say, in the generation of events, each parameter's response was assumed independent, and each was distorted separately. The average frequency of events was between once and twice a day, and the occurrence timing had no restrictions (i.e., enabling long normal operation periods along with overlapping events).

3.3. Model assessment

The proposed model was applied to the aforementioned data, composed of the six water quality parameter time series and the superimposed simulated events. The model was tested with diverse scenarios of simulated events, differing in their duration, strength, direction, frequency, and number of affected parameters (Table 3). The "Partial" event type comprised simulated events which distorted only some of the parameters. The "off-line" run, using the training and validation sub data sets for calibrating the model parameters C, SL, a1, a2, a3, T, lasted approximately 3.5 h on an Intel Core i7-860 processor (8 MB cache, 2.80 GHz). The model performance was then checked on the remaining "untouched" 30% testing sub data set for its Accuracy and Detection ratio (Eq. (6)). The online updating process of the model has a very short execution time: it includes a single construction of the weighted SVM classifier (Eq. (4)) and a simple calculation of the probability factor in Eq. (5), which sums to less than a tenth of a second per update. The average model performances over 15 runs for each event type (see also Table 3) are presented in Fig. 6. It can be seen from Fig. 6 that, as expected, the average Accuracy and Detection ratio increase along with the event distortion intensity: from approximately 0.42 Detection ratio and 0.83 Accuracy for the Slight event type, to approximately 0.99 Detection ratio and 0.97 Accuracy for the Medium type.

3.4. Model comparisons

To evaluate the model performance, two comparisons were made: the first to Arad et al. (2013) and the second to CANARY (2013). Table 4 and Figs. 7 and 8 describe the comparisons to Arad et al. (2013). Arad et al. (2013) suggested an event detection model having two optional decision rules for triggering an event alert: decision rule 1, which requires at least three outliers out of the six measured water quality parameters, and decision rule 2, which requires at least five. Table 4 describes the comparison results, featuring the models' average performances over 15 trials for different event scenario types. For events of Low/Medium intensity impacting all six parameters, the Arad et al. (2013) model showed a slight advantage when using decision rule 1. However, the current model substantially outperformed Arad et al. (2013) for all other, Partial type, distortions. Figs. 7 and 8 graph the results of Table 4 and present, for each case, the minimum and maximum Accuracy and Detection ratio values over the 15 runs. It can be seen from Fig. 7 that the average detection ratio of the current model is substantially better compared to Arad et al. (2013) in the case of one or two distorted parameters, whereas for the accuracy the improvement exists but is less apparent (Fig. 8). The multivariate analysis may reveal outliers which are exceptional in only some of the dimensions; the Arad et al. (2013) model attributes equal weight to all parameters and requires a measurement to be exceptional in at least 3 or 5 of them. Therefore, the presented model shows superiority in the detection of partly influencing events. The differences are less noticeable in the accuracy measure, as there are many more classified time steps than events and the relative mistake is smaller; in all cases most of the time steps are classified as normal operation. Another possible reason for the relative improvement of the presented model is the suitability of the structural risk minimization principle used in the SVM, as opposed to the empirical

Fig. 6 – Average model performances over 15 runs for each event type (see also Table 3).


Table 4 – Comparison to Arad et al. (2013).

                                       Current model           Arad et al. (2013),     Arad et al. (2013),
                                                               decision rule 1         decision rule 2
Event type   No. of influenced   Detection   Accuracy    Detection   Accuracy    Detection   Accuracy
             parameters          ratio                   ratio                   ratio
Low/medium   6                   0.98        0.94        0.99        0.96        0.92        0.93
Partial      3                   0.93        0.96        0.85        0.90        0.77        0.93
Partial      2                   0.97        0.94        0.72        0.90        0.68        0.80
Partial      1                   0.58        0.92        0.38        0.91        0.30        0.83

risk minimization principle used in the artificial neural network utilized by Arad et al. (2013). Structural risk minimization is an inductive principle for the selection of the kernel function (shown in Eq. (2)) that explores the trade-off between hypothesis space complexity and fitting quality; empirical risk minimization is the specific case of minimizing only the empirical error. Structural risk minimization screens a chosen class of functions, whereas empirical risk minimization requires a single chosen function, so the former performs a broader search (Vapnik and Strain, 1977).

CANARY (2013) has the general structure of the Arad et al. (2013) model, featuring a parallel outlier detection for each of the measured time series, integrated into an assessment of event probability. The outlier detection is performed by a combination of several algorithms: a linear filter, a multivariate nearest-neighbor algorithm, and a set-point proximity algorithm. The comparison between the suggested model and CANARY (2013) is presented in Table 5. It includes 10 event scenarios that were run once with the current model and five times with CANARY (2013), each time with a different set of algorithm parameters. The parameter values marked with an asterisk in Table 5 are the changes applied relative to the default CANARY parameters. The performance of the suggested model was significantly better, featuring a higher detection ratio and accuracy. For the 'Low' type events (defined in Table 3) the average detection ratio of the current model was 0.95, against 0.89 for CANARY (2013); the accuracy was 0.93, compared to 0.79 for CANARY (2013). For the 'Medium' type events the current model achieved a detection ratio and accuracy of 1 and 0.95, respectively, where CANARY (2013) achieved 0.92 and 0.75.

Unlike the suggested model and the Arad et al. (2013) model, which require no user calibration and produce a single deterministic classification, CANARY (2013) is a user-interactive software package requiring parameter calibration; the comparisons therefore included five sets of parameters. Note, however, that these results are not definitive, and CANARY's performance might be improved by additional tuning of its parameters.

Fig. 7 – Comparison of the average detection ratio to Arad et al. (2013) (15 runs of each event type).

Fig. 8 – Comparison of the average accuracy to Arad et al. (2013) (15 runs of each event type).

4. Conclusion

This study presented the development and application of a coupled SVM – evolutionary optimization methodology for water quality event detection in water distribution systems. Differing from the parallel one-dimensional data analysis conducted thus far for event detection (e.g., Perelman et al., 2012; Arad et al., 2013; CANARY, 2013), the current model is based on a simultaneous, conjoint analysis of the entire water quality time series data set. This is a conceptually different approach from previous modeling efforts. The model showed high accuracy and detection ratio, and substantially increased accuracy and detection ratio when compared to Arad et al. (2013), most salient for contamination events which influenced only part of the water quality parameters. The multivariate analysis reflects the exceptionality of a measurement relative to the entire measurement space and may thus reveal events that are expressed in even a single parameter. That is a notable advantage, as a contaminant intruding the system is likely to affect only some of the measured parameters. Further research is suggested to examine other classification methods for their suitability to event detection.

Table 5 – Comparison to CANARY (2013). Parameter values marked with an asterisk are the changes applied relative to the default CANARY parameters.

Parameters                 Current study   CANARY default   CANARY parameter sensitivity analysis
                                                            Set 2     Set 3     Set 4     Set 5
Event threshold            –               0.90             0.80*     0.70*     0.9       0.9
Precision – free Cl        –               0.0035           0.0035    0.0035    0.01*     0.1*
Precision – conductivity   –               1.00             1.00      1.00      1.5*      2*
Precision – pH             –               0.01             0.01      0.01      0.05*     0.1*

Event      Current study       CANARY default      Set 2               Set 3               Set 4               Set 5
scenario   Detect.  Accuracy   Detect.  Accuracy   Detect.  Accuracy   Detect.  Accuracy   Detect.  Accuracy   Detect.  Accuracy

Low
1          1        0.93       0.83     0.76       0.83     0.75       0.83     0.75       0.75     0.76       0.75     0.77
2          1        0.91       0.80     0.79       1.00     0.78       1.00     0.78       0.80     0.82       0.80     0.83
3          0.75     0.91       0.75     0.78       0.75     0.76       0.75     0.76       0.75     0.80       0.75     0.80
4          1        0.90       1.00     0.78       1.00     0.76       1.00     0.75       1.00     0.79       1.00     0.82
5          1        0.98       1.00     0.81       1.00     0.79       1.00     0.79       1.00     0.82       1.00     0.83
Average    0.95     0.93       0.88     0.79       0.92     0.77       0.92     0.77       0.86     0.80       0.86     0.81

Medium
1          1.00     0.93       0.78     0.74       0.78     0.73       0.78     0.73       0.78     0.76       0.78     0.76
2          1.00     0.95       1.00     0.74       1.00     0.73       1.00     0.73       1.00     0.76       0.88     0.75
3          1.00     0.92       1.00     0.78       1.00     0.76       1.00     0.76       1.00     0.80       1.00     0.81
4          1.00     0.98       0.83     0.75       0.83     0.74       0.83     0.74       0.83     0.77       0.83     0.78
5          1.00     0.97       1.00     0.73       1.00     0.72       1.00     0.72       1.00     0.73       1.00     0.74
Average    1.00     0.95       0.92     0.75       0.92     0.74       0.92     0.74       0.92     0.76       0.90     0.77

Another possible research direction is to replace the supervised classification method, which utilizes the simulated event data to construct the classifier, with an unsupervised method. This trait would contribute to the generality of the model, removing the need to assume the events' effect on the parameters. Currently, along these lines, the development and application of a minimum volume ellipsoid for the classification of outlier measurements is being tested. The ellipsoid may replace the SVM stage of the current model: it is characterized by a quadratic surface discerning the normal regime from the outliers, and its smooth classifier boundary may describe the distribution of the data more readily than the linear SVM, reducing the need for kernel functions.

Acknowledgments

This work was supported by the Technion Funds for Security Research and by the Technion Grand Water Research Institute. We would like to express our deep appreciation and gratitude to Dr. Lina Perelman for providing us with her model code for the event detection comparison and the sampling of events, and for her guidance and close assistance with the developed SVM model. Dr. David Hart, Member of the Technical Staff at Sandia National Laboratories, Albuquerque, New Mexico, USA, is highly acknowledged for his valuable assistance in running CANARY.

Appendix A. Supplementary data Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.watres.2013.10.060.

References

Adams, J.A., McCarty, D., 2007. Real-time on-line monitoring of drinking water for waterborne pathogen contamination warning. Int. J. High Speed Electronics Syst. 17 (4), 643–659. http://dx.doi.org/10.1142/S0129156407004850.
Arad, J., Housh, M., Perelman, L., Ostfeld, A., 2013. A dynamic thresholds scheme for contaminant event detection in water distribution systems. Water Res. 47 (5), 1899–1908. http://dx.doi.org/10.1016/j.watres.2013.01.017.
Ben-Hur, A., Weston, J., 2010. A user's guide to support vector machines. In: Data Mining Techniques for the Life Sciences, Methods in Molecular Biology, vol. 609. Humana Press, pp. 223–239. http://dx.doi.org/10.1007/978-1-60327-241-4_13.
Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, pp. 144–152.
Buhmann, M.D., 2003. Radial Basis Functions: Theory and Implementations. Cambridge University Press.
CANARY, 2013. A Water Quality Event Detection Tool. Available online at: https://software.sandia.gov/trac/canary (accessed 24.10.13).
Deb, K., 2001. Multi-objective Optimization Using Evolutionary Algorithms. John Wiley and Sons, Ltd.
Guepie, B.K., Fillatre, L., Nikiforov, I., 2012. Sequential monitoring of water distribution network. In: IFAC Proceedings Volumes (IFAC-PapersOnline), vol. 16, Part 1, pp. 392–397. http://dx.doi.org/10.3182/20120711-3-BE2027.00114.
Hart, W.E., Murray, R., 2010. Review of sensor placement strategies for contamination warning systems in drinking water distribution systems. J. Water Resour. Plann. Manage. 136 (6), 611–619. http://dx.doi.org/10.1061/(ASCE)WR.1943-5452.0000081.
Keeping, E.S., 2010. Introduction to Statistical Inference. Dover Publications, Inc.
Klise, K.A., McKenna, S.A., 2006. Multivariate application for detecting anomalous water quality. In: Proceedings of the 8th Annual Water Distribution Systems Analysis Symposium, Cincinnati, Ohio, USA, pp. 1–11. http://ascelibrary.org/doi/abs/10.1061/40941%28247%29130.
McKenna, S.A., Wilson, M., Klise, K.A., 2008. Detecting changes in water quality data. J. Am. Water Works Assoc. 100 (1), 74–85. http://www.awwa.org/publications/journal-awwa/abstract/articleid/15801.aspx.
Murray, R., Haxton, T., McKenna, S.A., Hart, D.B., Klise, K.A., Koch, M., Vugrin, E.D., Martin, S., Wilson, M., Cruze, V.A., Cutler, L., 2010. Water Quality Event Detection Systems for Drinking Water Contamination Warning Systems: Development, Testing and Application of CANARY. EPA/600/R-10/036. U.S. Environmental Protection Agency, Office of Research and Development, National Homeland Security Research Center, Cincinnati, Ohio, USA. http://cfpub.epa.gov/si/si_public_record_report.cfm?address=nhsrc/&dirEntryId=221394.
Perelman, L., Arad, J., Housh, M., Ostfeld, A., 2012. Event detection in water distribution systems from multivariate water quality time series. Environ. Sci. Technol. 46 (15), 8212–8219. http://dx.doi.org/10.1021/es3014024.
Uber, J.G., Murray, R., Magnuson, M., Umberg, K., 2007. Evaluating real-time event detection algorithms using synthetic data. In: Restoring Our Natural Habitat – Proceedings of the 2007 World Environmental and Water Resources Congress. http://ascelibrary.org/doi/abs/10.1061/40927%28243%29499.
Vapnik, V., Strain, A., 1977. On structural risk minimization or overall risk in a problem of pattern recognition. Autom. Remote Control 10 (3), 1495–1503.
