
Feature Extraction for Change-Point Detection Using Stationary Subspace Analysis

Duncan A. J. Blythe, Paul von Bünau, Frank C. Meinecke, and Klaus-Robert Müller

Abstract—Detecting changes in high-dimensional time series is difficult because it involves the comparison of probability densities that need to be estimated from finite samples. In this paper, we present the first feature extraction method tailored to change-point detection, which is based on an extended version of stationary subspace analysis. We reduce the dimensionality of the data to the most nonstationary directions, which are most informative for detecting state changes in the time series. In extensive simulations on synthetic data, we show that the accuracy of three change-point detection algorithms is significantly increased by a prior feature extraction step. These findings are confirmed in an application to industrial fault monitoring.

Index Terms—Change-point detection, feature extraction, high-dimensional data, segmentation, stationarity, time-series analysis.

I. INTRODUCTION

CHANGE-POINT detection is a task that appears in a broad range of applications such as biomedical signal processing [1]–[3], speech recognition [4], [5], industrial process monitoring [6], [7], fault state detection [8], and econometrics [9], [10]. The goal of change-point detection is to find the time points at which a time series changes from one macroscopic state to another. As a result, the time series is decomposed into segments [6] of similar behavior. Change-point detection is based on finding changes in the properties of the data, such as in the moments (mean, variance, kurtosis) [6], spectral properties [11], and temporal structure [12], or changes with respect to certain patterns [13]. The choice of any of these aspects depends on the particular application domain as well as on the statistical type of the changes that one aims to detect.

The literature contains many procedures for change-point detection. However, none of these methods addresses finding change points in high dimensions. Likewise, many feature extraction methods have been presented in the literature, but none of them extracts features for change-point detection. To this end, we propose using stationary subspace analysis (SSA) to obtain informative features for change-point detection.

For a large family of general segmentation algorithms, state changes are detected by comparing the empirical distributions between windows of the time series [12], [14], [15]. Estimating and comparing probability densities is a difficult statistical problem, particularly in high dimensions. However, not all directions in the high-dimensional signal space are informative for change-point detection: often there exists a subspace in which the distribution of the data remains constant over time (stationary). This subspace is irrelevant for change-point detection, but increases the overall dimensionality. Moreover, stationary components with a high signal power can make change points invisible to the observer and also to detection algorithms. For example, there are no change points visible in the time series depicted in Fig. 1(a), even though there exists one direction in the 2-D signal space which clearly shows two change points, as can be seen in Fig. 1(b). However, the nonstationary contribution is not visible in the observed signal because of its relatively low power [Fig. 1(c)]. In this example, we also observe that it does not suffice to select channels individually, as neither of them appears informative. In fact, in many application domains such as biomedical engineering [2], [16], [17] or geophysical data analysis [18], it is most plausible that the data is generated as a mixture of underlying sources that cannot be measured directly.

In this paper, we show how to extract useful features for change-point detection by finding the most nonstationary directions using SSA [19]. Even though there exists a wide range of feature extraction methods for classification and regression [20], to date no specialized procedure for feature extraction or for general signal processing [21] has been proposed for change-point detection. In controlled simulations on synthetic data, we show that for three representative change-point detection algorithms the accuracy is significantly increased by a prior feature extraction step, in particular if the data is high dimensional. This effect is consistent over various numbers of dimensions and strengths of change points. In an application to fault monitoring, where the ground truth is available, we show that the proposed feature extraction improves the performance and leads to a dimensionality reduction in which the desired state changes are clearly visible. Moreover, we also show that we can determine the correct dimensionality of the informative subspace.

The remainder of this paper is organized as follows. In Section II, we introduce our feature extraction method, which is based on an extension of SSA. Section III contains the results of our simulations, and in Section IV we present the application to fault monitoring. Our conclusions are outlined in Section V.




Fig. 1. Informative versus uninformative directions for change-point detection. (a) Observed bivariate time series where no pronounced changes are visible. (b) Two underlying sources, where one of them exhibits clearly visible changes. (c) Stationary sources have much higher signal power than the informative nonstationary sources and thus mask the presence of change points in the observed data.

II. RELATIONSHIP TO EXISTING WORK ON FEATURE EXTRACTION

Feature extraction from raw high-dimensional data has been shown to be useful not only for improving the performance of subsequent learning algorithms on the derived features [20] but also for understanding high-dimensional complex physical systems where the relevant information is difficult to identify. In this section, we discuss the relationship of our work to existing work on feature extraction and review the SSA algorithm, which we extend with a view to finding the most nonstationary directions. We then show that this approach corresponds to finding the projection that is most likely to be nonstationary in terms of a statistical hypothesis test.

In many application areas such as computer vision [22], bioinformatics [23], [24], and text classification [25], defining useful features is in fact the main step toward successful machine learning. General feature extraction methods for classification and regression tasks are based on maximizing the mutual information between features and target [26], explaining a given percentage of the variance in the dataset [27], choosing features which maximize the margin between classes [28], or selecting informative subsets of variables through enumerative search (wrapper methods) [20]. However, although feature extraction methods for specific problem settings have become increasingly popular [29], [30], no dedicated feature extraction method has been proposed for change-point detection [6].

Unlike in classical supervised feature selection, where a target variable allows us to measure the informativeness of a feature, for change-point detection we cannot tell whether a feature elicits the changes that we aim to detect, since there is usually no ground truth available. Even so, feature extraction is feasible following the principle that a useful feature should exhibit significant distributional changes over time. Reducing the dimensionality in a preprocessing step should be particularly beneficial for the change-point detection task: most algorithms either explicitly or implicitly approximate probability densities [12], [15] or directly compute a divergence measure between segments of the time series based on summary statistics such as the mean and covariance [6]; both are hard problems whose sample complexities grow exponentially with the number of dimensions.

As we have seen in the example presented in Fig. 1, selecting channels individually (a univariate approach) is not helpful and may lead to suboptimal features. The overall data may be nonstationary even though each dimension seems stationary. Moreover, a single nonstationary source may be expressed across a large number of channels. It is therefore more sensible to estimate a linear projection of the data that contains as much information relating to change points as possible. In this paper, we demonstrate that finding the projection to the most nonstationary direction using SSA significantly increases the performance of change-point detection algorithms.

A. SSA

In this section, we review SSA. SSA [19] factorizes a multivariate time series x(t) ∈ R^D into stationary and nonstationary sources according to the linear mixing model

    x(t) = A s(t) = \begin{pmatrix} A^s & A^n \end{pmatrix} \begin{pmatrix} s^s(t) \\ s^n(t) \end{pmatrix}        (1)

where s^s(t) are the d_s stationary sources, s^n(t) are the d_n (d_n + d_s = D) nonstationary sources, and A is an unknown time-constant, invertible mixing matrix. The spaces spanned by the columns of the mixing matrix, A^s and A^n, are called the s- and n-spaces, respectively. Note that, in contrast to independent component analysis (ICA) [31], there is no independence assumption on the sources s(t).

The aim of SSA is to invert the mixing model (1), given only samples from the mixed sources x(t), i.e., we want to estimate the demixing matrix B̂ that separates the stationary from the nonstationary sources. Applying B̂ to the time series x(t) yields the estimated stationary and nonstationary sources ŝ^s(t) and ŝ^n(t), respectively,

    \begin{pmatrix} \hat{s}^s(t) \\ \hat{s}^n(t) \end{pmatrix} = \hat{B} x(t) = \begin{pmatrix} \hat{B}^s \\ \hat{B}^n \end{pmatrix} x(t) = \begin{pmatrix} \hat{B}^s A^s & \hat{B}^s A^n \\ \hat{B}^n A^s & \hat{B}^n A^n \end{pmatrix} \begin{pmatrix} s^s(t) \\ s^n(t) \end{pmatrix} .        (2)

The submatrices B̂^s ∈ R^{d_s×D} and B̂^n ∈ R^{d_n×D} of the estimated demixing matrix B̂ project to the estimated stationary and nonstationary sources and are called the s-projection and n-projection, respectively. The estimated mixing matrix Â is the inverse of the estimated demixing matrix, Â = B̂^{-1}.

The inverse of the SSA model (1) is not unique: given one demixing matrix B̂, any linear transformation within the two groups of estimated sources leads to another valid separation, because it leaves the stationary resp. nonstationary nature of the sources unchanged. But the separation into s- and n-sources itself is also not unique: adding stationary components to a nonstationary source leaves it nonstationary, whereas the converse is not true. That is, the n-projection can only be identified up to arbitrary contributions from the stationary sources. Hence we cannot recover the true n-sources, but only the true s-sources (up to linear transformations). Conversely, we can identify the true n-space (because the s-projection is orthogonal to it) but not the true s-space. However, in order to extract features for change-point detection, our aim is not to recover the true nonstationary sources, but instead the most nonstationary ones.
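To make the mixing model (1) concrete, the following minimal Python sketch (our own illustration, not code from the paper; the dimensions, change-point locations, and random seed are arbitrary) generates a toy time series in which stationary and nonstationary sources are linearly mixed.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_s, d_n, T = 5, 4, 1, 3000            # observed dimensions, source counts, samples

# Stationary sources: i.i.d. standard normal throughout the recording.
s_stat = rng.standard_normal((T, d_s))

# One nonstationary source: its variance switches at two change points.
s_nonstat = rng.standard_normal((T, d_n))
s_nonstat[1000:2000] *= 3.0               # higher variance between t = 1000 and t = 2000

# Unknown, time-constant, invertible mixing matrix A, as in model (1).
A = rng.standard_normal((D, D))
x = np.hstack([s_stat, s_nonstat]) @ A.T  # observed time series x(t), shape (T, D)
```

In such a mixture, every observed channel contains contributions from all sources, so the change points need not be visible in any single channel, which is exactly the situation depicted in Fig. 1.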


An SSA algorithm depends on a definition of stationarity that the s-projection aims to satisfy. In the SSA algorithms [19], [32], a time series X_t is considered stationary if its mean and covariance are constant over time,

    E[X_{t_1}] = E[X_{t_2}]   and   E[X_{t_1} X_{t_1}^\top] = E[X_{t_2} X_{t_2}^\top]

for all pairs of time points t_1, t_2 ∈ N_0. This is a variant of weak stationarity [33], where we do not take time structure into account.

Following this concept of stationarity, the SSA algorithm [19] finds the s-projection B̂^s that minimizes the difference between the first two moments of the estimated s-sources ŝ^s(t) across epochs of the time series, since we cannot estimate the mean and covariance at a single time point. Thus we divide the samples from x(t) into N nonoverlapping epochs defined by the index sets T_1, ..., T_N ⊂ N_0 and estimate the epoch mean and covariance matrices

    \hat{\mu}_i = \frac{1}{|T_i|} \sum_{t \in T_i} x(t)   and   \hat{\Sigma}_i = \frac{1}{|T_i| - 1} \sum_{t \in T_i} \big( x(t) - \hat{\mu}_i \big)\big( x(t) - \hat{\mu}_i \big)^\top

respectively, for all epochs 1 ≤ i ≤ N. Given an s-projection, the epoch mean and covariance matrix of the estimated s-sources in the i-th epoch are

    \hat{\mu}_i^s = \hat{B}^s \hat{\mu}_i   and   \hat{\Sigma}_i^s = \hat{B}^s \hat{\Sigma}_i (\hat{B}^s)^\top .

The difference in the mean and covariance matrix between two epochs is measured using the Kullback–Leibler divergence between Gaussians. The objective function is the sum of the differences between each epoch and the average epoch. Since the s-sources can only be determined up to an arbitrary linear transformation and since a global translation of the data does not change the difference between epoch distributions, without loss of generality we center and whiten¹ the data such that the average epoch mean and covariance matrix are

    \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_i = 0   and   \frac{1}{N} \sum_{i=1}^{N} \hat{\Sigma}_i = I .        (3)

Moreover, we can restrict the search for the true s-projection to the set of matrices with orthonormal rows, i.e., B̂^s (B̂^s)^⊤ = I. Thus the optimization problem becomes

    \hat{B}^s = \underset{B B^\top = I}{\operatorname{argmin}} \sum_{i=1}^{N} D_{KL}\!\left[ \mathcal{N}(\hat{\mu}_i^s, \hat{\Sigma}_i^s) \,\big\|\, \mathcal{N}(0, I) \right] = \underset{B B^\top = I}{\operatorname{argmin}} \sum_{i=1}^{N} \left( - \log\det \hat{\Sigma}_i^s + (\hat{\mu}_i^s)^\top \hat{\mu}_i^s \right)        (4)

which can be solved efficiently by using multiplicative updates with orthogonal matrices parameterized as matrix exponentials of antisymmetric matrices [19], [34].²

¹A whitening transformation is a basis transformation W that sets the sample covariance matrix to the identity. It can be obtained from the sample covariance matrix Σ̂ as W = Σ̂^{-1/2}.
²An efficient implementation of SSA may be downloaded free of charge at http://www.stationary-subspace-analysis.org/toolbox.

B. Finding the Most Nonstationary Sources

In this section, we describe the procedure for feature extraction for change-point detection and explain its theoretical motivation in detail. In order to extract useful features for change-point detection, we would like to find the projection to the most nonstationary sources. However, the SSA algorithms [19], [32] merely estimate the projection to the most stationary sources and choose the projection to the nonstationary sources to be orthogonal to the found s-projection, which means that all contributions from the stationary sources are projected out from the estimated n-sources. The justification for this choice is that it maximizes the nonstationarity of the n-sources in the case where the covariance between the true n- and s-sources is constant over time. This, however, may not always be the case: significant nonstationarity may well be contained in a changing covariance between s- and n-sources. In fact, we observe this in our application to fault monitoring. Thus, in order to find the most nonstationary sources, we also need to optimize the n-projection.

Before we turn to the optimization problem, let us first analyze the situation more formally. We consider a simple example with one stationary and one nonstationary source with corresponding normalized basis vectors ‖A^s‖ = 1 and ‖A^n‖ = 1, respectively, and let φ be the angle between the two spaces, i.e., cos φ = (A^s)^⊤ A^n. We consider an arbitrary pair of epochs, T_1 and T_2, and show which projection B̂^n maximizes the difference in mean Δμ and variance Δσ between T_1 and T_2. Let X_1 and X_2 be bivariate random variables modeling the distribution of the data in the two epochs, respectively. According to the linear mixing model (1), we can write X_1 and X_2 in terms of the underlying sources,

    X_1 = A^s X^s + A^n X_1^n
    X_2 = A^s X^s + A^n X_2^n

where the univariate random variable X^s represents the stationary source and the two univariate random variables X_1^n and X_2^n model the nonstationary source in the epochs T_1 and T_2, respectively. Without loss of generality, we assume that the true s-projection B^s = (A^n)^⊥ is normalized, ‖B^s‖ = 1. In order to determine the relationship between the true s-projection and the most nonstationary projection, we write the latter in terms of B^s and A^n,

    \hat{B}^n = \alpha B^s + \beta A^n        (5)

with coefficients α, β ∈ R such that ‖B̂^n‖ = 1. In the next step, we observe which n-projection maximizes the difference in mean Δμ and variance Δσ between the two epochs T_1 and T_2. Let us first consider the difference in the mean of the estimated n-sources

    \Delta\mu = E[\hat{B}^n X_1] - E[\hat{B}^n X_2] = \hat{B}^n A^n \left( E[X_1^n] - E[X_2^n] \right) .

This is maximal for B̂^n A^n = 1, i.e., when B̂^n is orthogonal to B^s. Thus, with respect to the difference in the mean, choosing the n-projection B̂^n to be orthogonal to the s-projection is always optimal, irrespective of the type of distribution change between epochs.
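As a quick numerical illustration of this argument (our own sketch; the basis vectors and the mean shift are arbitrary choices), one can parameterize B̂^n as in (5) and scan α: |Δμ| is largest at α = 0, i.e., when B̂^n is orthogonal to B^s.

```python
import numpy as np

A_n = np.array([1.0, 0.0])               # basis vector of the n-space (normalized)
B_s = np.array([0.0, 1.0])               # true s-projection, orthogonal to A_n
delta_mean_n = 2.0                       # E[X1^n] - E[X2^n], arbitrary nonzero value

for alpha in np.linspace(-1.0, 1.0, 5):
    beta = np.sqrt(1.0 - alpha**2)       # keeps ||B_hat_n|| = 1 since B_s is orthogonal to A_n
    B_hat_n = alpha * B_s + beta * A_n   # candidate n-projection, cf. (5)
    delta_mu = (B_hat_n @ A_n) * delta_mean_n
    print(f"alpha={alpha:+.2f}  |delta_mu|={abs(delta_mu):.3f}")
# |delta_mu| peaks at alpha = 0, i.e., when B_hat_n is orthogonal to B_s.
```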



Fig. 2. (a) Two epoch covariance matrices Σ1 and Σ2 where the nonstationarity is confined to changes in the variance along one direction; hence the most nonstationary projection B̂^n is orthogonal to the true stationary projection B^s. (b) The covariance of the two dimensions changes between Σ1 and Σ2, so that we can find a nonstationary projection that is more nonstationary than the orthogonal complement of the true s-projection.

Let us now consider the difference in variance Δσ of the estimated n-sources between epochs. This is given by

    \Delta\sigma = \operatorname{Var}[\hat{B}^n X_1] - \operatorname{Var}[\hat{B}^n X_2]
                 = \beta^2 \left( \operatorname{Var}[X_1^n] - \operatorname{Var}[X_2^n] \right) + 2 \left( \alpha \cos\!\left(\varphi + \tfrac{\pi}{2}\right) + \beta \cos\varphi \right) \underbrace{\left( \operatorname{Cov}[X^s, X_1^n] - \operatorname{Cov}[X^s, X_2^n] \right)}_{= \sigma_{sn}} .

Clearly, when there is no change in the covariance of the s- and n-sources between the two epochs, i.e., σ_sn = 0, the difference Δσ is maximized for B̂^n = (B^s)^⊥; see Fig. 2(a) for an example. However, when the covariance between s- and n-sources does vary, i.e., |σ_sn| > 0, the projection (B^s)^⊥ is no longer the most nonstationary. To see this, consider the derivative of Δσ with respect to α at α = 0,

    \left. \frac{\partial \Delta\sigma}{\partial \alpha} \right|_{\alpha = 0} = 2 \cos\!\left(\varphi + \tfrac{\pi}{2}\right) \sigma_{sn} .

Since this derivative does not vanish, α = 0 [see (5)] is not an extremum when |σ_sn| > 0, which means that the most nonstationary projection is not orthogonal to the true s-projection. This is the case in Fig. 2(b).

Thus, in order to find the projection to the most nonstationary sources, we also need to maximize the nonstationarity of the estimated n-sources. To that end, we simply maximize the SSA objective function (4) for the n-projection,

    \hat{B}^n = \underset{B B^\top = I}{\operatorname{argmax}} \; \sum_{i=1}^{N} \left( - \log\det \hat{\Sigma}_i^n + (\hat{\mu}_i^n)^\top \hat{\mu}_i^n \right)        (6)

where Σ̂_i^n = B̂^n Σ̂_i (B̂^n)^⊤ and μ̂_i^n = B̂^n μ̂_i for all epochs 1 ≤ i ≤ N.

Fig. 3. Average p-values obtained over 100 realizations of the dataset for each setting of the true d_s. The number of dimensions is D = 10; the y-axis displays the true number of stationary sources d_s, and the x-axis displays the candidate value of d_s used to compute the stationary projection using SSA. The red box marks the choice of d_s made most often by the decision rule at the p = 0.01 significance level. The results show that the method picks the correct parameter, on average, over the simulations.

In order to choose the number of epochs N, the timescale of the changes one aims to detect should be taken into account: as a rule of thumb, changes should not occur on a timescale that is an order of magnitude (or more) smaller than the length of the epochs chosen for SSA. Clearly, small sample sizes adversely affect the accuracy of any estimation procedure; these considerations, however, apply to both the n- and s-spaces. Additional nonstationarity in the n-sources may nevertheless suffice to outweigh the effect of small sample sizes on the estimation of parameters that depend only on the s-sources.
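To illustrate how the objective in (6) is evaluated, the following Python sketch (our own illustration; it computes epoch statistics after a whitening step in the spirit of (3), but omits the orthogonal-update optimization described above) scores a candidate n-projection.

```python
import numpy as np

def nonstationarity_score(x, B_n, n_epochs):
    """Evaluate the objective (6) for a candidate n-projection B_n (rows assumed orthonormal).

    x: array of shape (T, D); B_n: array of shape (d_n, D).
    """
    T, D = x.shape
    # Center and whiten so that the average epoch has mean 0 and covariance I, cf. (3).
    x = x - x.mean(axis=0)
    W = np.linalg.inv(np.linalg.cholesky(np.cov(x.T)))   # one valid whitening (footnote 1 uses Sigma^{-1/2})
    x = x @ W.T

    score = 0.0
    for epoch in np.array_split(np.arange(T), n_epochs):
        mu = x[epoch].mean(axis=0)
        Sigma = np.cov(x[epoch].T)
        mu_n = B_n @ mu
        Sigma_n = B_n @ Sigma @ B_n.T
        # Per-epoch term of (6): -log det Sigma_i^n + (mu_i^n)^T mu_i^n.
        score += -np.log(np.linalg.det(Sigma_n)) + mu_n @ mu_n
    return score
```

A candidate B_n with orthonormal rows can be obtained, for instance, from the Q factor of a QR decomposition of a random matrix; the SSA algorithms optimize this score over such matrices using the multiplicative orthogonal updates mentioned in Section II-A.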

C. Relationship to Statistical Testing

In this section, we show that maximizing the SSA objective function to find the most nonstationary sources can be understood from a statistical testing point of view, in that it also minimizes the p-value for rejecting the null hypothesis that the estimated directions are stationary. More precisely, we minimize the p-value of a statistical hypothesis test that compares two models for the data: the null hypothesis H_0 that each epoch follows a standard normal distribution, versus the alternative hypothesis H_A that each epoch is Gaussian distributed with individual mean and covariance matrix. Let X_1, ..., X_N be random variables modeling the distribution of the data in the N epochs. Formally, the hypotheses can be written as

    H_0 : X_1, ..., X_N ∼ N(0, I)
    H_A : X_1 ∼ N(μ_1, Σ_1), ..., X_N ∼ N(μ_N, Σ_N).

In other words, the statistical test tells us whether we should reject the simple model H_0 in favor of the more complex model H_A. This decision is based on the value of the test statistic, whose distribution is known under the null hypothesis H_0. Since H_0 is a special case of H_A and since the parameter estimates are obtained by maximum likelihood, we can use the likelihood ratio test statistic [35], which is the ratio of the likelihood of the data under H_0 and H_A, where the parameters are set to their maximum likelihood estimates.

Let X ⊂ R^{d_n} be the dataset, divided into N epochs T_1, ..., T_N, and let μ̂_1^n, ..., μ̂_N^n and Σ̂_1^n, ..., Σ̂_N^n be the maximum likelihood estimates of the mean and covariance matrices of the estimated n-sources, respectively. Let p_N(x; μ, Σ) be


the probability density function of the multivariate Gaussian distribution. The likelihood ratio test statistic is given by

    \Lambda(X) = -2 \log \frac{ \prod_{x \in X} p_{\mathcal{N}}(x; 0, I) }{ \prod_{i=1}^{N} \prod_{x \in T_i} p_{\mathcal{N}}(x; \hat{\mu}_i^n, \hat{\Sigma}_i^n) }        (7)

which is approximately χ² distributed with N d_n (d_n + 3)/2 degrees of freedom. Using the fact that we have set the average epoch mean and covariance matrix to zero and the identity matrix, respectively,

    \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_i^n = 0   and   \frac{1}{N} \sum_{i=1}^{N} \hat{\Sigma}_i^n = I        (8)

the test statistic simplifies to

    \Lambda(X) = \sum_{i=1}^{N} N_i \left( - d_n - \log\det \hat{\Sigma}_i^n + \| \hat{\mu}_i^n \|^2 + \operatorname{tr}(\hat{\Sigma}_i^n) \right)        (9)

where N_i = |T_i| is the number of data points in the i-th epoch. If every epoch contains the same number of data points (N_1 = · · · = N_N), then maximizing the SSA objective function (6) is equivalent to maximizing the test statistic (9) and hence minimizing the p-value for rejecting the simple (stationary) model for the data. (This can be seen by a simple inspection of the test statistic and the SSA objective function.)

TABLE I
OVERVIEW OF SIMULATIONS PERFORMED AND CORRESPONDING FIGURES REPORTING THE RESULTS. A TICK (✓) DENOTES THAT THE CORRESPONDING PARAMETER WAS VARIED IN THE EXPERIMENT; A CROSS (✗) DENOTES THAT IT WAS KEPT FIXED. ("P." DENOTES THE PANEL WITHIN THE RESPECTIVE FIGURE AND "FI." DENOTES THE FIGURE.)

Setup | D | dn | ds | p | SLCD        | Kohl./Lemm  | CUSUM
  1   | ✗ | ✓  | ✓  | ✗ | Fi. 6, P. 1 | Fi. 8, P. 1 | Fi. 7, P. 1
  2   | ✓ | ✗  | ✓  | ✗ | Fi. 6, P. 2 | Fi. 8, P. 2 | Fi. 7, P. 2
  3   | ✗ | ✗  | ✗  | ✓ | Fi. 6, P. 3 | Fi. 8, P. 3 | Fi. 7, P. 3

As we will see in the application to fault monitoring (Section IV), the p-value of this test furnishes a useful indicator of the number of informative directions for change-point detection. More specifically, we increase the number of candidate stationary sources until the test reports that they are significantly nonstationary. As long as the p-value is large, we may safely conclude that the estimated stationary sources ŝ^s(t) = B̂^s x(t) are indeed stationary and that they may be removed without loss of information for change-point detection. In the specific case of change-point detection, removing additional directions may lead to increases in performance as a result of the reduced dimensionality; this is, of course, a data-dependent question.

In Fig. 3 we illustrate the results of the procedure for choosing d_s, the number of stationary sources, using the dataset described in Section III. The procedure is as follows: for each candidate value d̂_s from 1, ..., D, where D is the dimensionality of the dataset, we compute the projection to the stationary sources using SSA and calculate the test statistic Λ(X) given in (7). We then choose d_s to be the largest candidate value for which we do not reject the null hypothesis at the significance level α. Fig. 3 displays the p-values obtained using SSA for simulated data of dimensionality D = 10, with the true number of stationary sources ranging over d_s = 1, ..., 9 and the candidate parameter ranging over 1, ..., 9. With the significance level α = 0.01 for rejecting the null hypothesis that the projected data is stationary, this procedure returns the correct d_s on average in all cases.
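The test behind this selection rule can be sketched as follows (our own illustration under the stated Gaussian assumptions; the function takes already-estimated s-sources as input rather than running SSA itself, and the epoch splitting assumes roughly equally sized epochs).

```python
import numpy as np
from scipy.stats import chi2

def stationarity_p_value(s_sources, n_epochs):
    """Likelihood-ratio test of H0 (each epoch ~ N(0, I)) for candidate s-sources.

    s_sources: array of shape (T, d) of estimated stationary sources,
    assumed centered and whitened so that (8) holds on average.
    """
    T, d = s_sources.shape
    lam = 0.0
    for epoch in np.array_split(np.arange(T), n_epochs):
        z = s_sources[epoch]
        n_i = len(epoch)
        mu = z.mean(axis=0)
        sigma = (z - mu).T @ (z - mu) / n_i               # maximum likelihood covariance
        lam += n_i * (-np.log(np.linalg.det(sigma))
                      + mu @ mu + np.trace(sigma) - d)    # per-epoch term of (9)
    dof = n_epochs * d * (d + 3) / 2                      # degrees of freedom of the chi^2 approximation
    return chi2.sf(lam, dof)                              # large p-value: no evidence against stationarity
```

In the scheme described above, the candidate d_s is increased as long as this p-value stays above the chosen significance level (α = 0.01 in Fig. 3).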

III. SIMULATIONS

In this section, we demonstrate the ability of SSA to enhance the segmentation performance of three change-point detection algorithms on a synthetic data setup. The algorithms are single linkage clustering with divergence (SLCD) [36], which uses the mean and covariance as test statistics, CUSUM [37], which uses a sequence of hypothesis tests, and the Kohlmorgen/Lemm algorithm [12], which uses a kernel density measure and a hidden Markov model.

For each segmentation algorithm, we compare the performance of the baseline case, in which the dataset is segmented without preprocessing, the case in which the data is preprocessed by projecting to a random subspace, and the case in which the dataset is preprocessed using SSA, for a broad set of parameter settings. We compare performance with respect to the following schemes of parameter variation.

1) The dimensionality D of the time series is fixed and dn, the number of nonstationary sources, is varied.
2) The number dn of nonstationary sources is fixed and ds, the number of stationary sources, is varied.
3) D, dn, and ds are fixed and the power p of the changes in the nonstationary sources is varied.

For two of the change-point algorithms that we test, SLCD and Kohlmorgen/Lemm, all three parameter variation schemes are tested. For CUSUM, the first scheme does not apply, since the method is univariate and requires dn = 1. For each setup and for each realization of the dataset, we repeat the segmentation on the raw dataset, on the estimated nonstationary sources after SSA preprocessing for that dataset, and on a dn-dimensional random projection of the dataset. The random projection acts as a control for the accuracy of the SSA-estimated nonstationary sources for segmentation purposes. Note that we do not test against PCA and ICA for all parameter settings as additional baselines; we show in Section III-D that in many cases PCA and ICA provide no advantage over using a random projection for dimension reduction. Thus, to check that a potential improvement using SSA does not occur purely by virtue of the lower dimensionality, we only evaluate the performance of a random projection in addition to the baseline and SSA.

A. Synthetic Data Generation

The synthetic data which we use to evaluate the performance of change-point detection methods is generated as a linear mixture of stationary and nonstationary sources. The data is further generated time-segment-wise: each segment has fixed length (for example, each segment may be 50 data points long) and each dataset consists of a concatenation of


time segments.³ The d_s stationary sources are distributed normally on each time segment according to N(0, I_{d_s}). The other d_n (nonstationary) source signals s^n(t) are distributed according to the active model k of the time segment; this active model is one of five Gaussian distributions G_k = N(0, Σ_k), where each covariance Σ_k is a diagonal matrix whose eigenvalues are chosen at random from five log-spaced values between σ_1² = 1/p and σ_5² = p. The five covariances, corresponding to the states G_k of the Markov chain, are chosen in this way. The transition between models over consecutive time segments follows a Markov model with transition probabilities

    P_{ij} = \begin{cases} 0.9 & i = j \\ 0.025 & i \neq j. \end{cases}        (10)
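The following sketch illustrates this generative process (our own illustration; the number of segments, the segment length, and the random seed are arbitrary choices, not values prescribed in the text).

```python
import numpy as np

rng = np.random.default_rng(1)
D, d_n, p = 10, 2, 2.0                       # total dimensions, nonstationary sources, power change
d_s = D - d_n
n_segments, seg_len, n_states = 100, 50, 5

# Five candidate covariances for the nonstationary sources: diagonal, with
# eigenvalues drawn from five log-spaced values between 1/p and p.
eigenvalues = np.logspace(np.log10(1 / p), np.log10(p), n_states)
covs = [np.diag(rng.choice(eigenvalues, size=d_n)) for _ in range(n_states)]

# Markov chain over the five states, cf. (10).
P = np.full((n_states, n_states), 0.025)
np.fill_diagonal(P, 0.9)

state, sources, labels = 0, [], []
for _ in range(n_segments):
    s_stat = rng.standard_normal((seg_len, d_s))                       # N(0, I) stationary part
    s_nonstat = rng.multivariate_normal(np.zeros(d_n), covs[state], size=seg_len)
    sources.append(np.hstack([s_stat, s_nonstat]))
    labels.append(state)
    state = rng.choice(n_states, p=P[state])                           # next segment's active model

A = rng.standard_normal((D, D))                                        # random mixing matrix, cf. (1)
x = np.vstack(sources) @ A.T                                           # observed time series
```

A change point is present wherever the active model changes between consecutive segments, which provides the ground truth used to score the detectors below.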

B. Overview of the Algorithms Tested

We test the difference in performance between segmentation without and with feature extraction using SSA. The three algorithms we test are single linkage clustering with a symmetrized divergence measure (SLCD), weighted CUSUM (CUSUM), and the Kohlmorgen–Lemm algorithm (KL). We choose these algorithms to test the efficacy of the feature extraction method because they are representative of the most commonly employed algorithms in the change-point detection literature. In particular, SLCD uses sample statistics, in analogy to, for example, LDA in classification. CUSUM uses hypothesis testing, which is in essence a different approach, applicable only to 1-D data. The KL algorithm represents the semi/nonparametric camp of algorithms. More specifically, each algorithm is accompanied by a parameter τ that controls the tradeoff between the true positive rate (TPR) and false positive rate (FPR): for SLCD this is the number of clusters, for CUSUM it is the threshold set on the log-likelihood ratio, and for the Kohlmorgen/Lemm algorithm it is the parameter controlling how readily a new state is assigned to the model.

Fig. 4. Quantiles of performance (AUC) over 100 realizations of the dataset described in Section III-D. The baseline, SSA, a random projection, each PCA component, and each ICA component are tested as preprocessing for change-point detection using single linkage clustering with a symmetrized divergence measure (SLCD). For the dataset, we set the dimensionality to D = 5, the number of nonstationary sources to dn = 1, and the power change in the nonstationary source to p = 2.8. The labels of the cases tested are displayed along the x-axis: "PC" stands for principal component, "IC" for independent component, "BL" for baseline, and "RP" for random projection. The boxes extend over the 25th and 75th percentiles, with the mean marked by a red line; the crosses denote outliers. The figure shows that, in general, no improvement over a random projection can be expected using ICA or PCA.

C. Performance Measure

In our experiments, we evaluate the algorithms based on an estimate of the area under the ROC curve (AUC) across realizations of the dataset. The true positive rate (TPR) and false positive rate (FPR) are defined with respect to the fixed time segments with which the synthetic dataset is generated; a change point may occur only between two such time segments of fixed length. Each of the change-point algorithms that we test reports changes with respect to the same division into epochs as the synthetic dataset, so the TPR and FPR are well defined.

We use the AUC because it provides information over a range of TPRs and FPRs. In signal detection, the tradeoff achieved between TPR and FPR depends on operational constraints: cancer diagnosis procedures must achieve a high TPR, perhaps at the cost of a higher than desirable FPR, whereas network intrusion detection may need to compromise on the TPR, given the computational demands imposed by too high an FPR. In order to assess detection performance across all such requirements, the AUC provides the most informative measure: all tradeoffs are integrated over. In particular, the AUC corresponds to the area under the ROC curve, i.e.,

    \mathrm{AUC} = \int_0^1 f(g) \, dg

where f = TPR and g = FPR.

³Note the difference here between "time segment" and "epoch": "time segment" refers to a section of data which, according to the generative model for the synthetic data setup, is stationary, whereas we use "epoch" to denote a length of data chosen to calculate sample covariances and means for the purpose of the SSA computations.
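A minimal sketch of this computation (ours, not from the paper): given (FPR, TPR) pairs obtained by sweeping a detector's tradeoff parameter τ, the AUC can be approximated by trapezoidal integration.

```python
import numpy as np

def auc_from_roc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration of TPR over FPR."""
    order = np.argsort(fpr)                               # integrate along increasing FPR
    fpr, tpr = np.asarray(fpr, float)[order], np.asarray(tpr, float)[order]
    # Anchor the curve at (0, 0) and (1, 1) so the integral spans the full FPR range.
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Example: ROC points from sweeping a detector threshold.
print(auc_from_roc([0.1, 0.3, 0.6], [0.5, 0.8, 0.95]))
```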

D. PCA and ICA Provide No Improvement Over Random Projections

Before embarking on a full evaluation of the performance of SSA as opposed to the baseline case, we demonstrate that PCA [27], [38] and ICA [39], [40] are not suitable for extracting appropriate features for change-point detection. To this end, we take a dataset with change points of the same structure as described in Section III-A, consisting of changes in covariance over time, but using dependent non-Gaussian noise as the base distribution of the stationary segments, rather than Gaussian noise. In addition, the dataset is whitened before change-point detection. We set the dimensionality of the dataset to D = 5 and the number of nonstationary sources to dn = 1. We then perform change-point detection using single linkage clustering with a symmetrized divergence measure (see Section III-E) in the baseline setting, using SSA, and using a random projection of the same dimensionality as the SSA projection. In addition, we perform change-point detection on each of the PCA and ICA components. For ICA, we use the FastICA [41] implementation in deflationary mode.⁴

⁴The software is available for download at http://research.ics.tkk.fi/ica/fastica.

The results, as AUC scores, are displayed in Fig. 4, which shows that, on average, no individual PCA or ICA component provides an improvement over the baseline or the random projection, whereas SSA does.
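For concreteness, candidate one-dimensional features for this comparison could be produced as follows (our own sketch using scikit-learn; the paper uses the FastICA implementation referenced in footnote 4, and the data is assumed to have been whitened beforehand).

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def candidate_projections(x, seed=0):
    """Return candidate 1-D features: each PC, each IC, and one random projection.

    x: array of shape (T, D), assumed already whitened.
    """
    rng = np.random.default_rng(seed)
    pcs = PCA().fit_transform(x)                          # principal components, shape (T, D)
    ics = FastICA(algorithm="deflation", random_state=seed).fit_transform(x)
    w = rng.standard_normal(x.shape[1])
    rp = x @ (w / np.linalg.norm(w))                      # random 1-D projection
    return {"PC": pcs, "IC": ics, "RP": rp}

# Each returned column (and the random projection) is then fed separately to the
# change-point detector and scored by AUC, as summarized in Fig. 4.
```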



Fig. 5. Illustration of a case in which SSA significantly improves SLCD: ds = 4 (number of stationary sources), dn = 2 (number of nonstationary sources), p = 2.3 (power change in the nonstationary sources). (a) True decomposition into stationary and nonstationary sources with the true change points marked. (b) Change points which SLCD finds on the entire dataset (sources mixed); clearly, some change points are left undetected. (c) Change points found by SLCD on the estimated nonstationary sources.

Of course, in the case where the nonstationary sources are independent of the stationary sources, they will be included amongst the ICA components; in general, however, the nonstationary sources will not be independent. Moreover, even when independence holds, it remains undetermined which of the ICA components are the nonstationary sources. Likewise, when the nonstationary sources have, for instance, higher variance than the stationary sources, PCA will succeed in extracting them; in general, however, this will not be the case. We therefore continue by comparing the baseline with SSA and random projections.

E. Single Linkage Clustering With Symmetrized Divergence Measure (SLCD)

The first algorithm on which we test the efficacy of the feature extraction method, for a broad range of parameter settings, is SLCD. Single linkage clustering with a symmetrized distance measure is a simple algorithm for change-point detection which nevertheless has the advantage of efficiency and of segmentation based on a parameter-independent distance matrix (thus detection may be repeated for differing tradeoffs between TPR and FPR without reevaluating the distance measure). In particular, segmentation based on single linkage clustering [36] computes a distance measure based on the covariance and mean over time windows to estimate the occurrence of change points. The algorithm consists of the following three steps; a sketch of the dissimilarity computation in step 2 is given at the end of this subsection.

1) The time series is divided into 200 time segments, for which we estimate the epoch means and epoch covariance matrices {(μ̂_i, Σ̂_i)}_{i=1}^{200}.
2) The dissimilarity matrix D ∈ R^{200×200} between the time segments is computed as the symmetrized Kullback–Leibler divergence D_KL between the estimated distributions (up to the first two moments),

    D_{ij} = \tfrac{1}{2} D_{KL}\!\left[ \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_i) \,\big\|\, \mathcal{N}(\hat{\mu}_j, \hat{\Sigma}_j) \right] + \tfrac{1}{2} D_{KL}\!\left[ \mathcal{N}(\hat{\mu}_j, \hat{\Sigma}_j) \,\big\|\, \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_i) \right]

where N(μ, Σ) denotes the Gaussian distribution.
3) Based on the dissimilarity matrix D, single linkage clustering [36] (with the number of clusters set to k = 5) returns an assignment of time segments to clusters, such that a change point occurs whenever two neighboring time segments do not belong to the same cluster.

1) Results for SLCD: The results of the simulations for varying numbers of nonstationary sources in a dataset of 30 channels are shown in Fig. 6(a). When the degree to which the changes are visible is lower, the SSA preprocessing significantly outperforms the baseline method, even for a small number of irrelevant stationary sources.

The results of the simulations for a varying number of stationary dimensions with two nonstationary dimensions are displayed in Fig. 6(b). For small ds, the performance of the baseline and SSA preprocessing is similar; SSA's performance is more robust with respect to the addition of higher numbers of stationary sources, i.e., noise directions. The segmentations produced using SSA preprocessing continue to carry information relating to change points for ds = 30, whereas, for ds ≥ 12, the baseline's AUC approaches 0.5, which corresponds to the accuracy of randomly chosen segmentations.

The results of the simulations for varying power p in the nonstationary sources with D = 20, ds = 16 (number of stationary sources), and dn = 4 are displayed in Fig. 6(c). The performance of both the baseline and the SSA preprocessing



Fig. 6. Results of the simulations for SLCD. (a) Results for a fixed dimensionality of the time series, D = 30, and varying dn, the number of nonstationary sources. (b) Results for a fixed number of nonstationary sources, dn = 2, and varying ds, the number of stationary sources. (c) Results for fixed D = 20, ds = 16, and dn = 4 and for varying p, the power change in the nonstationary sources. Each panel displays the results in terms of the area under the ROC curve computed as per Section III-C. The error bars extend from the 25th to the 75th percentiles.


Fig. 7. Results of the simulations for CUSUM. (a) Results for a fixed number of nonstationary sources, dn = 1, and varying ds, the number of stationary sources. (b) Results for fixed D = 16 and ds = 15 and for varying p, the power change in the nonstationary sources. Each panel displays the results in terms of the area under the ROC curve computed as per Section III-C. The error bars extend from the 25th to the 75th percentiles.

improves with increasing power change p. This effect is evident for lower p for the SSA preprocessing than for the baseline. An illustration of a case in which SSA preprocessing significantly outperforms the baseline is displayed in Fig. 5. The estimated nonstationary sources exhibit a far clearer illustration of the change points than the full dataset; the corresponding segmentation performances reflect this fact.
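The symmetrized divergence in step 2 of the SLCD procedure has a closed form for Gaussian approximations of the segments; the following sketch (our own, using the standard closed-form KL divergence between multivariate Gaussians) computes a single entry D_ij from two segment estimates.

```python
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL divergence D_KL[N(mu0, cov0) || N(mu1, cov1)] between multivariate Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def slcd_dissimilarity(mu_i, cov_i, mu_j, cov_j):
    """Symmetrized divergence D_ij between two time segments (step 2 of SLCD)."""
    return 0.5 * kl_gauss(mu_i, cov_i, mu_j, cov_j) + 0.5 * kl_gauss(mu_j, cov_j, mu_i, cov_i)
```

The resulting 200 × 200 matrix can then be fed to an off-the-shelf single linkage clustering routine (e.g., scipy.cluster.hierarchy.linkage with method="single" on the condensed distance matrix), and change points are reported wherever neighboring segments fall into different clusters.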

F. Weighted CUSUM for Changes in Variance

The second algorithm on which we test is weighted CUSUM. In statistical quality control, CUSUM (the cumulative sum control chart) is a sequential analysis technique developed in 1954 [37]. CUSUM is one of the oldest and most widely used methods for change-point detection; the algorithm is an online method based on a series of log-likelihood ratio tests. The CUSUM algorithm detects a change in a parameter θ of a process p_θ(y) [37] and is asymptotically optimal when the pre-change and post-change parameters are known [6]. For the case in which the target value of the changing parameter is unknown, the weighted CUSUM algorithm is a direct extension of CUSUM [6], obtained by integrating over a parameter interval, as follows. The statistics Λ̃_j^k constitute likelihood ratios between the currently estimated parameter of the nonstationary process and differing target values (values to which the parameter may change), integrated over a measure F,

    \tilde{\Lambda}_j^k = \int_{-\infty}^{\infty} \frac{p_{\theta_1}(y_j, \ldots, y_k)}{p_{\theta_0}(y_j, \ldots, y_k)} \, dF(\theta_1) .        (11)

Thus the Λ̃_j^k denote averages of likelihood ratios over an interval, which are high when the parameter of the data's generative distribution at a given time point differs from the parameter estimated at the time of the last detected change point. Here y_j, ..., y_k denote the time points lying inside a sliding window whose latest time point is y_k. The stopping time is then given as

    t_a = \min\left\{ k : \max_{j \le k} \ln\!\left(\tilde{\Lambda}_j^k\right) \ge h \right\} .        (12)

The function F serves as a weighting function for possible target values of the changed parameter. In principle, the algorithm can thus be applied to multidimensional data. However, as noted in [6], the extension of the CUSUM algorithm to higher dimensions is nontrivial, not only because integrating over possible values of the covariance is computationally expensive but also because various parameterizations can lead to the same likelihood function. Given this, we test the effectiveness of the algorithm in computing 1-D segmentations. In particular, we compare the segmentation performed on the 1-D projection chosen by SSA with the best segmentation over all individual dimensions with respect to the hit rate on each trial. In accordance with [6], we choose F to be a fixed uniform distribution over an interval containing all possible values of the process's variance. We approximate the integral above as a sum over


Fig. 8. Results of the simulations for the KL algorithm. (a) Results for a fixed dimensionality of the time series, D = 30, and varying dn, the number of nonstationary sources. (b) Results for a fixed number of nonstationary sources, dn = 2, and varying ds, the number of stationary sources. (c) Results for fixed D = 20, ds = 16, dn = 4, and for varying p, the power change in the nonstationary sources. Each panel displays the results in terms of the area under the ROC curve computed as per Section III-C. The error bars extend from the 25th to the 75th percentiles.

evenly spaced values on that interval. We approximate the stopping time by setting

    t_a \approx \min\left\{ k : \ln\!\left(\tilde{\Lambda}_{k-W+1}^k\right) \ge h \right\} .        (13)

The exact details of our implementation are as follows (a code sketch of the loop is given after the list). Let T be the number of data points in the dataset X.

1) We set the window size W, the sensitivity constant h, the current time step t_c = W + 1, the reference parameter θ_0 = var({x_1, ..., x_W}), and the grid of target values Θ = {θ_1, ..., θ_r} = {c, c + b, c + 2b, ..., d}.
2) We compute

    \tilde{\Lambda}_j^k = \frac{1}{b} \sum_{i=1}^{r} \frac{p_{\theta_i}(y_j, \ldots, y_k)}{p_{\theta_0}(y_j, \ldots, y_k)} .

3) If ln(Λ̃_j^k) ≥ h, then a change point is reported at time t_c, and t_c is updated so that t_c = t_c + W and θ_0 = var({x_{t_c−W+1}, ..., x_{t_c}}). We return to step 2.
4) Otherwise (ln(Λ̃_j^k) < h), no change point is reported and t_c = t_c + 1. We return to step 2.
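A compact sketch of this loop (our own illustration; we assume a zero-mean Gaussian likelihood parameterized by its variance, and the window size, threshold, and grid bounds below are arbitrary defaults rather than the values used in the experiments):

```python
import numpy as np
from scipy.stats import norm

def weighted_cusum(x, W=50, h=10.0, c=0.1, d=5.0, b=0.1):
    """Weighted CUSUM on a 1-D signal x: report change points in the variance."""
    thetas = np.arange(c, d + b, b)                      # grid of candidate variances
    theta0 = np.var(x[:W])                               # reference variance from the first window
    change_points, t_c = [], W + 1
    while t_c < len(x):
        window = x[t_c - W + 1 : t_c + 1]
        logp0 = norm.logpdf(window, scale=np.sqrt(theta0)).sum()
        # Average of likelihood ratios over the grid of target variances, cf. (11) and step 2.
        ratios = [np.exp(norm.logpdf(window, scale=np.sqrt(t)).sum() - logp0) for t in thetas]
        lam = np.sum(ratios) / b
        if np.log(lam) >= h:                              # step 3: report and re-estimate theta0
            change_points.append(t_c)
            theta0 = np.var(window)
            t_c += W
        else:                                             # step 4: advance by one sample
            t_c += 1
    return change_points
```

Sweeping the threshold h plays the role of the tradeoff parameter τ described in Section III-B.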

1) Results for Weighted CUSUM: In Fig. 7(a), the results for varying numbers of stationary sources are displayed. Weighted CUSUM with SSA preprocessing significantly outperforms the baseline for all values of D (the dimensionality of the time series). Here we set dn = 1, the number of nonstationary sources, for all values of ds, the number of stationary sources. In Fig. 7(b), the results for varying power change p between ergodic sections are displayed for D = 16, ds = 15, and dn = 1. SSA outperforms the baseline for all except very low values of p, the power level change, where all detection schemes fail.

The simulations show that SSA represents a method for choosing a 1-D subspace which renders unidimensional segmentation methods applicable to higher dimensional datasets: the resulting segmentation method on the 1-D derived nonstationary source is simpler to parameterize and more efficient. If the true dimensionality of the nonstationary part is dn = 1, then no information loss should be observed.

G. Kohlmorgen/Lemm Algorithm

The final algorithm on which we test is the KL algorithm. This algorithm is a flexible nonparametric and multivariate method which may be applied in online and offline operation modes. Distinctive about the KL algorithm is that a kernel density estimator, rather than a simple summary statistic, is used to estimate the occurrence of change points. In particular, the

algorithm is based on a standard kernel density estimator with Gaussian kernels and estimation of the optimal segmentation based on a hidden Markov model [12]. More specifically, if we estimate the densities on two arbitrary time segments E_i, E_j of our dataset X with Gaussian kernels, then we can define a distance measure d between time segments via the L2-norm, yielding

    d(E_i, E_j) = \frac{1}{W^2 (4\pi\sigma^2)^{d/2}} \sum_{w,v=0}^{W-1} \left[ \exp\!\left( - \frac{(Y_w - Y_v)^2}{4\sigma^2} \right) - 2 \exp\!\left( - \frac{(Y_w - Z_v)^2}{4\sigma^2} \right) + \exp\!\left( - \frac{(Z_w - Z_v)^2}{4\sigma^2} \right) \right] .

Thus, d(E_i, E_j) measures the distance between the densities estimated via kernel density estimation on the time segments E_i and E_j, where the distance is obtained by taking the difference between the densities as functions with respect to the L2-norm. The final segmentation is then based on the distance matrix between time segments calculated with respect to the distance measure d.

As with weighted CUSUM, it is possible to define algorithms whose sensitivity to distributional changes in reporting change points is related to the value of a parameter C, which controls the probability of transitions to new states in the fitting of the hidden Markov model. However, in [42] it is shown that, when the number of change points is known, one can also derive an algorithm which returns exactly that number of change points; in the simulations we evaluate the performance of the first variant over a full range of parameters to obtain an ROC curve. In addition, we choose the parameter σ according to the rule of thumb given in [12], which sets σ proportional to the mean distance of each data point to its D nearest neighbors, where D is the dimensionality of the data; this is evaluated on a sample set.

The exact implementation we test is based on [12] and [42]. The details are as follows.

1) The time series is divided into time segments.
2) A distance matrix is computed between time segments using kernel density estimation and the L2-norm as described above (a code sketch of this distance is given after the list).



Fig. 9. Pump dataset. The machine under investigation is a pump, driven by an electric motor. The measurements made are of machine vibration at seven sensors. The data alternates between two conditions: normal functionality, and pitting in both gears.


Fig. 10. Pump dataset. Schemes for selecting the parameter ds. (a) Measure BNISE for increasing values of dn. (b) Objective function value on the estimated s-sources as compared to randomly generated data. (c) AUC performance for various values of dn. (d) p-values on the estimated s-sources.

3) The estimated density on each time segment corresponds to a state of the hidden Markov model, so a state sequence is a sequence of estimated densities.
4) Finally, based on the estimated states and the distance matrix, a hidden Markov model is fitted to the data and a change point is reported whenever consecutive epochs have been fitted with differing states.

Note that we test the KL algorithm because it is representative of a family of algorithms in the change-point literature which perform segmentation by fitting a hidden Markov model and using a variant of the Viterbi algorithm to assign change points; see [43] for another example.
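The pairwise distance in step 2 follows directly from the closed form given earlier; the sketch below (our own; σ is a free parameter, chosen in the paper by the nearest-neighbor rule of thumb) computes d(E_i, E_j) for two windows of equal length W.

```python
import numpy as np

def kl_window_distance(Y, Z, sigma):
    """L2 distance between Gaussian kernel density estimates of two windows.

    Y, Z: arrays of shape (W, d) holding the samples of the two time segments.
    """
    W, d = Y.shape

    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return np.exp(-sq / (4 * sigma**2)).sum()

    norm = W**2 * (4 * np.pi * sigma**2) ** (d / 2)
    return (gram(Y, Y) - 2 * gram(Y, Z) + gram(Z, Z)) / norm
```

The full algorithm then fits a hidden Markov model over the matrix of such distances; that step is not shown here.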

1) Results for the Kohlmorgen/Lemm Algorithm: SSA preprocessing improves the segmentation obtained using the KL algorithm for all three setups of the dataset. In particular, the AUCs for varying dn and fixed D are displayed in Fig. 8(a), with D = 30. The AUCs for varying ds and fixed dn are displayed in Fig. 8(b), with dn = 2. The AUCs for varying power change p in the nonstationary sources, with D, dn, and ds fixed, are displayed in Fig. 8(c), with p ranging between 1.1 and 4.0 at increments of 0.1. Of additional interest is that, for varying dn and fixed D, the performance of segmentation with SSA preprocessing is superior for higher values of ds: this implies that the improvement in change-point detection of the KL algorithm due to the reduction in dimensionality to the informative estimated n-sources outweighs the difficulty of estimating the n-sources in the presence of a large number of noise dimensions.

IV. APPLICATION TO FAULT MONITORING

In this section, we apply our feature extraction technique to fault monitoring. The dataset consists of multichannel measurements of machine vibration. The machine under investigation is a pump, driven by an electric motor. The incoming


Fig. 11. Pump dataset. (a) All segmentations are computed using the KL algorithm with the number of change points N specified. The baseline corresponds to segmentation without SSA preprocessing. (b) Segmentation with SSA preprocessing. (c) Real change points superimposed over the raw dataset.

shaft is reduced in speed by two delaying gear combinations (a gear combination is a combination of a driving and a driven gear). Measurements are repeated for two identical machines, where the first shows progressed pitting in both gears and the second is virtually fault free. The rotating speed of the driving shaft is measured with a tachometer.⁵

⁵The dataset can be downloaded free of charge at http://www.ph.tn.tudelft.nl/ypma/mechanical.html.

The pump dataset is semisynthetic insofar as we juxtapose non-temporally-consecutive sections of data from the two pump conditions: sections of data from the first and the second machine are spliced together randomly (with respect to the time axis) to yield a dataset with 10 000 time points in seven channels. An illustration of the dataset is displayed in Fig. 9.

A. Setup

We preprocessed with SSA using a division of the dataset into 30 equally sized epochs and dn estimated nonstationary sources, for ds, the number of stationary sources, ranging between 1 and 6, where D = 7 is the dimensionality of the dataset. Subsequently, we ran the KL algorithm on both the preprocessed and the raw data using a window size of W = 50 and a separation of 50 data points between nonoverlapping epochs.

B. Parameter Choice

To select the parameter ds (and thus dn = D − ds), the number of stationary sources, we use the following scheme. The measure of stationarity over which we optimize in SSA is given by the objective function in (4). For each candidate number of stationary sources d, we compute the estimated projection to the stationary sources using SSA on the first half of the available data, evaluate this objective function on the estimated stationary sources on the second half, and compare the result to the values of the objective function obtained on a dataset in which the time axis has been randomly permuted. This random permutation should produce, on average, a set of approximately stationary

sources, regardless of any nonstationarity present in the estimated stationary sources for that d. In addition, a measure of the information relating to nonstationarity that is lost by choosing the number of stationary sources to be ds, the baseline-normalized integral stationary error (BNISE), can be defined as follows:

    \mathrm{BNISE}(d) = \frac{ L_d(\hat{A}^{-1}, X) - E_X\!\left( L_d(\hat{A}^{-1}, X) \right) }{ \sigma_X\!\left( L_d(\hat{A}^{-1}, X) \right) }        (14)

2MB Sizes 0 Downloads 3 Views