
A Parsimonious Mixture of Gaussian Trees Model for Oversampling in Imbalanced and Multimodal Time-Series Classification

Hong Cao, Member, IEEE, Vincent Y. F. Tan, Member, IEEE, and John Z. F. Pang, Student Member, IEEE

Abstract— We propose a novel framework that uses a parsimonious statistical model, known as a mixture of Gaussian trees, for modeling the possibly multimodal minority class to solve the problem of imbalanced time-series classification. By exploiting the fact that nearby time points are highly correlated due to the smoothness of the time-series, our model significantly reduces the number of covariance parameters to be estimated from O(d^2) to O(Ld), where L is the number of mixture components and d is the dimensionality. Thus, our model is particularly effective for modeling high-dimensional time-series with a limited number of instances in the minority positive class. In addition, the computational complexity for learning the model is only of the order O(Ln+d^2), where n+ is the number of positively labeled samples. We conduct extensive classification experiments based on several well-known time-series data sets (both single- and multimodal) by first randomly generating synthetic instances from our learned mixture model to correct the imbalance. We then compare our results with several state-of-the-art oversampling techniques, and the results demonstrate that when our proposed model is used in oversampling, the same support vector machines classifier achieves much better classification accuracy across the range of data sets. In fact, the proposed method achieves the best average performance 30 times out of 36 multimodal data sets according to the F-value metric. Our results are also highly competitive compared with nonoversampling-based classifiers for dealing with imbalanced time-series data sets.

Index Terms— Gaussian graphical models, imbalanced data set, mixture models, multimodality, oversampling, time-series.

Manuscript received March 24, 2013; accepted February 23, 2014. Date of publication March 12, 2014; date of current version November 17, 2014.
H. Cao is with the Institute for Infocomm Research, Agency for Science Technology and Research, Singapore 138632 (e-mail: [email protected]).
V. Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore, Singapore 119077 (e-mail: [email protected]).
J. Z. F. Pang is with the Institute for High Performance Computing, Agency for Science Technology and Research, Singapore 138632 (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2014.2308321

I. INTRODUCTION

IN BINARY classification, class imbalance is the problem of having significantly fewer samples in one class compared with the other. This imbalance is one of the key challenges [1], [2] because standard classification methods tend to bias toward the class containing the dominant number of learning samples. However, in many real-world applications, the minority class is often the more informative of the two and, indeed, the class of significant interest for accurate modeling. For instance, the known failure occurrences for aircraft are
very rare but of far greater importance (than the abundant number of normal instances) for modeling to predict the next failure. Similar examples appear in fraud detection in financial data, intrusion detection in network forensics, cancer detection in biomedical diagnosis, and so forth.

Motivated by the data paucity and multimodality in the minority class (e.g., there may be two distinct failure modes for aircraft), this paper proposes an oversampling method based on a parsimonious mixture of Gaussian trees model for imbalanced time-series classification. Such a model is shown to: 1) compare excellently with other state-of-the-art methods [1], [3] in terms of classification accuracy; 2) require the estimation of far fewer parameters compared with existing methods; 3) model multimodal minority classes using mixture distributions; 4) model the dependencies between the points in a time-series explicitly; and 5) have a relatively low computational complexity.

Existing techniques for class imbalance can be divided into two categories: those at the algorithm level [4]–[6] and those at the data level [1], [7]–[16]. Algorithm-level approaches correct the class imbalance by incorporating a predefined cost assigned to each class [4]. This class of methods includes those that fall under the category known as uneven dataspace weighting [3]–[5], [13] or the modification of the system operating point on the receiver operating characteristic curve [6]. Though some methods, as reviewed in [2], have reportedly achieved excellent performance with good theoretical guarantees, they require the predetermination of an optimal class cost or dataspace weighting, which can vary on a case-by-case basis.

Data-level methods, also known as sampling methods, correct the imbalance at the data level by undersampling the majority class [12], by oversampling the minority class [1], [7]–[11], [13], [14], or by a combination of both [8]. In general, undersampling is efficient because of the reduced size of the resulting data set. However, it also carries the risk of discarding important samples, resulting in a less representative distribution of learning data for generalization. Oversampling has the advantage of retaining all the existing training samples at the expense of an enlarged learning data set. Depending on the oversampling scheme used, the (judiciously generated) synthetic samples may also enhance the representativeness of the under-represented minority class by introducing rich data variations while correcting the class population imbalance. Oversampling is the approach we adopt in this paper, as it has been found to be highly effective and efficient for reestablishing


the class balance for conventional imbalanced classification problems.

A. Challenges with Time-Series Classification

In this paper, we consider an important family of data sets, namely those that contain time-series data. Time-series data are frequently encountered in many machine learning and data mining disciplines, such as finance, network security, healthcare, weather prediction, and machine diagnostics. This is because the data collected in these fields often contain a temporal aspect. Thus, time-series analytics has become an increasingly important research thrust in data mining.

A time-series instance (or signal) [17] is a vector of real-valued samples from a discretely sampled continuous-time signal, which can be either in the spatial domain or in the time domain. Similar to other types of machine learning data, one can consider each time point of a time-series as a feature dimension and directly apply standard machine learning techniques for modeling and classification. However, there are several problems with this naïve approach. First, the length of a time-series instance is typically long, i.e., the feature space is high dimensional. This, in turn, results in the problem of having an insufficient number of learning samples, especially for the minority class. Second, different time-series instances from the same class usually contain a large amount of within-class data variation compared with other types of data. This is partly a result of signals that are not perfectly aligned along the time axis. Depending on its severity, the misalignment can lead to an increased distance between different time-series instances from the same class (intraclass distance). One solution is to compute the warping distance between two instances by searching for the optimal mapping path that serves to realign the two time series. A given test sample is then classified based on the label of its nearest training neighbor. This algorithm is often referred to as the one nearest neighbor (1NN) classifier with dynamic time warping (DTW) [17]. Besides 1NN-DTW, some other works have also used the Euclidean distance in a reduced feature subspace [18] or the k-nearest neighbors classifier with the DTW distance.

B. Existing Oversampling Approaches for Imbalanced Time-Series Classification

The classification of imbalanced time-series data sets is more difficult than the usual imbalanced classification problem because of the high-dimensional nature of the time-series data and the inherent strong correlation between nearby time points. In this paper, we consider the task of oversampling the positive samples to correct the class imbalance between the minority (positive) and the majority (negative) classes. One naïve approach is to duplicate the existing positive samples as many times as needed to balance the two classes. Clearly, this would lead to overfitting due to the paucity of positively labeled samples. Therefore, existing oversampling methods introduce some variation using synthetically created samples. This can be done using one of the following three approaches. The first approach interpolates between selected positive seed samples and their random positive nearest neighbors to create the synthetic samples.


Well-known methods that adopt this approach include SMOTE [7], B-SMOTE [10], and ADASYN [11]. Their salient differences are as follows. SMOTE selects all positive samples as seed samples and evenly generates the synthetic samples from each selected seed. B-SMOTE identifies a set of hard, nonoutlier positive samples at the classification border and evenly creates the synthetic instances from this border set. ADASYN adopts an adaptive approach where the number of synthetic samples to be generated for each positive seed is determined by the percentage of negative samples in its neighborhood. While SMOTE has been extended in [16] for oversampling in the distance space (instead of the feature space), where the data are first converted to a distance matrix representation, it is not explicitly designed for time-series classification.

The second class of oversampling approaches generates the features of the synthetic samples independently. DataBoost [13], in particular, randomly generates each feature value from a Gaussian restricted to a bounded range [min, max].

The third class of methods involves modeling the distribution of positive samples in a multidimensional feature space, thus considering correlations. The estimated model is then used to randomly generate synthetic samples. A recently developed method, structure-preserving oversampling (SPO) [1], estimates the covariance matrix of the positive class before performing regularization on the eigenspectrum. Following that, synthetic samples are randomly generated. Together with a Support Vector Machines (SVMs) classifier [19], the classification accuracy of SPO was shown to be competitive compared with the other oversampling methods for highly imbalanced time-series classification. Finally, a Gaussian mixture model was also used in [20] but is not adaptable to time-series classification as the correlations between nearby time points are not considered.

None of the above methods exploits the time-series correlations between neighboring time points. We compare our proposed method based on sparse graphical models with the existing methods and observe that the resulting classification performance is significantly better for data sets with a multimodal distribution for the minority class, with the advantage of reduced complexity in terms of storage.

C. Solution via Sparse Graphical Models

To deal with the curse of dimensionality in modeling time-series data, we adopt a machine learning methodology through the formalism of sparse graphical models [21], [22] to explicitly and statistically model dependence across time of the time-series data. This has been done previously in other contexts [23]. Graphical models provide a simple way to visualize the structure of a probabilistic model and, more precisely, the conditional independence relationships among a collection of random variables. Our model results in a dramatic reduction in the number of parameters to learn, and thus we are able to strike a better balance between the dichotomy of data fidelity and overfitting. Graphical models have been used in a variety of areas that deal with high-dimensional^1 data sets such as clinical analysis [24]. See [21] and [22] for overviews.

^1 High-dimensional here refers to the fact that the number of variables exceeds the number of data points available for training.


In particular, we focus on a class of models known as tree-structured graphical models. Such models admit tractable learning [25] and inference [26] algorithms because greedy graph-theoretic [27] algorithms exploiting the tree assumption are often optimal in minimizing some objective function (e.g., the sum of edge weights). Furthermore, to model the positively labeled time-series data, which may contain more than one mode, we use a mixture of trees [28] approach. However, we need to modify the conventional mixture of trees approach to deal with continuous-valued data. This we do through the use of Gaussian graphical models. Our model is thus coined mixture of Gaussian trees.

D. Summary of Our Main Contributions

We summarize our contributions here. First, using sparse graphical models, we dramatically reduce the number of parameters of the model, ameliorating the problem of overfitting. Second, there is only one parameter to tune, i.e., the number of mixture components, which we typically set to two considering the limited number of samples in the minority class. A more principled way to choose this parameter based on the Bayesian information criterion (BIC) [29] is also proposed. Third, the model is well adapted to the problem at hand, i.e., time-series classification. This is because time-series data are highly correlated and our model explicitly takes this into account. Fourth, the model is also able to deal with multimodal time-series data since we use a mixture of Gaussian trees, each of which has one mode. This addresses the issue of large within-class variations of time-series data. See Fig. 1 for an illustration of a data set (Wafer) and the utility of our method to model time-series correlations and multimodality. Finally, our proposed methods significantly outperform other oversampling-based methods on multiple data sets. In fact, the proposed method achieves the best average performance 30 times out of 36 data sets with a multimodal minority class according to the F-value metric [2], [3]. Compared with nonoversampling-based methods, our proposed method achieves the best performance 19 times out of the same 36 data sets.

Fig. 1. (a) Bimodal Wafer data set. Our model captures the correlation among neighboring time points and the bimodal aspect of the data. (b) and (c) Two constituent components separated by our algorithm. (d) The empirical covariance matrix does not have any sparsity structure. In fact, it is singular because of the high dimensionality. (e) and (f) The learned inverse covariance matrices are, however, sparse and explicitly model the temporal structure of the time-series data. Neighboring time points are highly correlated, and this is evidenced by the strong bands along the diagonals.

E. Structure of Paper

This paper is structured as follows. In Section II, we provide a brief overview of the statistical modeling technique known as graphical models [21]. We also recall the method of Chow and Liu [25] and Tan, Anandkumar, and Willsky [30] to learn tree-structured Gaussian graphical models. In Section III, we recall the mixture of trees model and suggest a generalized expectation-maximization-based [31], [32] algorithm to learn the parsimonious mixture of Gaussian trees. In Section IV, we detail the oversampling and classification procedure. In Section V, we present extensive numerical results to corroborate the claim that our model performs well on a diverse range of time-series data sets [33]. We conclude our discussion and suggest some avenues for further research in Section VI.

II. SYSTEM MODEL AND MATHEMATICAL PRELIMINARIES

In this section, we start by stating the notation that we adopt and the binary classification problem. We then review the formalism of graphical models [21], [22], paying particular attention to Gaussian graphical models and tree-structured graphical models. Finally, we review the algorithm by Chow and Liu [25] for learning tree models from data.

A. Notation

Upper case (e.g., X) and lower case (e.g., x) letters, respectively, denote random variables and their realizations. Sets are denoted by calligraphic font (e.g., 𝒳) and the cardinality of finite sets by |·|. The probability distribution of a random variable X is denoted as p_X(x), or simply as p(x) if the random variable X is clear from the context. A random vector is a finite collection of random variables and is denoted by bold-face upper case as X = (X_1, . . . , X_d), where d is the dimension of the random vector in context. A realization of this random vector, a deterministic vector in R^d, is denoted as x = (x_1, . . . , x_d). Given a subset of indices S ⊂ {1, . . . , d}, X_S = {X_i : i ∈ S} denotes the set of random variables indexed by S. Correspondingly, x_S = {x_i : i ∈ S}. Matrices are also denoted using bold-face upper case (e.g., A).

We use standard notation for information-theoretic quantities [34]. For example, the mutual information between the random variables X and Y (a measure of statistical dependence) is denoted as I(X; Y), and the Kullback–Leibler (KL) divergence (a measure of distance) between distributions p and q is denoted as D(p ‖ q). If random variables X and Y are conditionally independent given another random variable Z (i.e., p_{X,Y|Z}(x, y|z) = p_{X|Z}(x|z) p_{Y|Z}(y|z)), we denote this by the Markov chain X − Z − Y. The correlation coefficient between two random variables X and Y, a number between −1 and 1, is defined as ρ(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)).


The multivariate Gaussian probability density function with mean μ ∈ R^d and covariance matrix Σ ∈ S^d_+ (S^d_+ is the cone of positive semidefinite d × d matrices) is denoted as

    N(x; μ, Σ) = (1/√((2π)^d det(Σ))) exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ).    (1)

B. Binary Classification Problem

We consider a binary classification problem where the training data are partitioned into two disjoint subsets D+ := {x^(1), . . . , x^(n+)} and D− := {x̄^(1), . . . , x̄^(n−)}. The former, a small set, is the set of positively labeled samples and the latter, a larger set, is the set of negatively labeled samples. The union D+ ∪ D− is the set of training samples. We would like to use D+ and D− to determine the class labels of a set of samples that are unlabeled, i.e., the test samples Dtest. To do so, we propose a statistical model (based on sparse graphical models) for the samples in D+ and then augment D+ by sampling from the learned distribution so that the resulting positive class has the same number of samples as the negative one. Note that previous work [11] has shown empirically that when the data become more balanced via oversampling, the test performance of the learned model improves. Finally, we perform the classification on the augmented data set.

C. Graphical Models

A graphical model [21], [22] is a probability distribution for which a graph encodes the conditional independence relations among a collection of random variables X = (X_1, . . . , X_d). Consider a simple, undirected graph G = (V, E) with vertex set V = {1, . . . , d} and edge set E ⊂ (V choose 2). The vertices of this graph are in one-to-one correspondence with the random variables in X. In other words, we associate vertex i ∈ V to random variable X_i, which takes on values in some alphabet 𝒳_i. In the sequel, we use the terms variable, vertex, and node interchangeably. We say that the random vector X is Markov on the graph G if the following local Markov property holds:

    p(x_i | x_{V\i}) = p(x_i | x_{N_i})    (2)

where N_i := {j ∈ V : (i, j) ∈ E} is the set of neighbors of i in G and V\i is the set of nodes in V excluding i. Equation (2) says that, conditioned on the Markov blanket, which is the set of neighbors of i, the random variable X_i is independent of all the other random variables in the graph. By the Hammersley–Clifford theorem [22], if the random vector X = (X_1, . . . , X_d) is Markov on the undirected graph G, then the joint distribution p_X(x) (assuming that it is positive everywhere) can be factorized according to the maximal cliques of the graph G

    p_X(x) = ∏_{C ∈ 𝒞} p_{X_C}(x_C)    (3)

where 𝒞 is the set of maximal cliques of G. Equation (3) shows that the number of parameters used to describe the joint distribution is vastly reduced (vis-à-vis the model without any Markov assumptions) since they only depend on the (product of the) joint distributions on the (ideally small) maximal cliques of the graph. For example, if the random variables X_i, i ∈ V, are discrete and take on K = |𝒳_i| values, then without the factorization in (3) (thus Markov on a complete graph), we would need O(K^d) parameters to describe the joint distribution through a conditional probability table. Assuming that X is a Markov chain, i.e., that X_1 − X_2 − · · · − X_d, it can easily be observed that only O(dK^2) parameters are required because the factorization in (3) specializes in this case to p(x) = p(x_1)p(x_2|x_1) · · · p(x_d|x_{d−1}). The number of parameters assuming this factorization is linear in d (since there are d factors). Thus, overfitting can be ameliorated if we assume that the data can be modeled adequately by a sparse graphical model. Another example of a tree is the star graph shown on the left of Fig. 2.

Fig. 2. Example of a tree-structured graphical model known as a star (left). Here, G = (V, E), where V = {1, . . . , d = 5} and E = {(1, j) : 2 ≤ j ≤ 5}. Conditioning on X1 renders (X2, X3) statistically independent of (X4, X5) because node 1 separates (2, 3) and (4, 5). The covariance matrix of the model Σ has a sparse inverse. Namely, it has the sparsity structure given by the matrix on the right, where x denotes a nonzero element.

1) Gaussian Graphical Models: If we assume that X = (X_1, . . . , X_d) is jointly Gaussian, then the inverse covariance matrix J := Σ^{−1} has a sparsity structure corresponding to that of G = (V, E). Such a probabilistic model is known as a Gaussian graphical model. In this case, if we let J_{i,j} be the (i, j)-element of J, then J_{i,j} = 0 if and only if (i, j) ∉ E. For example, the sparsity structure of the inverse covariance matrix of the star is shown in Fig. 2. Again, if we assume the Markov chain or star structure among the d variables, then we only require O(d) parameters to describe the Gaussian graphical model. Such an assumption drastically reduces the number of parameters and thus reduces overfitting. We relax this assumption in Section III-A by considering mixtures.

2) Tree-Structured Graphical Models: The Markov chain is a special case of tree-structured graphical models [35] (or trees). Such models admit efficient and exact inference [26] and learning [25] algorithms. In addition, we are concerned with the classification of time-series data, in which the time point at time i ∈ {1, . . . , d}, namely X_i, is highly correlated with its neighbors X_{i−1} and X_{i+1}. Thus, modeling the data by a tree, a superset of the class of Markov chains, which is used ubiquitously to model time-series data [23], is highly appropriate. We choose to model the data using trees because, first, the learning is conceptually easier (though there are more advanced techniques to specifically learn Markov chains [36]) and, second, the set of tree models is more flexible and does not incur any penalty in terms of the number of parameters, as we shall see. We now review the basic properties of trees. An undirected tree is an undirected graph that does not contain any loops. Equivalently, between every two nodes i, j ∈ V in a tree, there exists a unique path joining them.
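As a quick numerical illustration of the sparsity claim above (a minimal Python/NumPy sketch with made-up values, not taken from the paper), consider a Gauss-Markov chain X1 − X2 − · · · − Xd with unit variances and correlation ρ on every edge. Its covariance is Σ_{ij} = ρ^{|i−j|} (a product of edge correlations along the chain), and its inverse covariance is tridiagonal, i.e., nonzero only on the diagonal and on the chain's edges:

import numpy as np

# Gauss-Markov chain with unit variances and edge correlation rho (illustrative values).
d, rho = 5, 0.7
idx = np.arange(d)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = rho^{|i-j|}
J = np.linalg.inv(Sigma)                             # inverse covariance matrix
print(np.round(J, 3))                                # entries with |i - j| > 1 are numerically zero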


We denote the set of edges joining i and j as Path(i, j) ⊂ E. A tree-structured graphical model is one that is Markov on a tree. The joint distribution of a random vector X Markov on a tree T = (V, E) admits the factorization

    p(x) = ∏_{i∈V} p(x_i) ∏_{(i,j)∈E} p(x_i, x_j) / (p(x_i) p(x_j)).    (4)

This is because the maximal cliques are precisely the edges of the graph, so (3) reduces to (4), and all we need to describe the joint distribution of a tree model are the pairwise statistics on the edges, i.e., p(x_i, x_j) for all (i, j) ∈ E. As with the chain, tree models with discrete variables taking on K values can be described by O(dK^2) parameters.

The focus of this paper is, however, on Gaussian tree models. In this case, it can be easily verified [30] that if the model is normalized such that the variances of the X_i are unity for every i ∈ V, then all elements of the covariance matrix Σ are functions of the set of correlation coefficients on the edges of the graph, i.e., {ρ_{i,j} : (i, j) ∈ E}. Indeed, for general models with variances σ_i^2 := Σ_{i,i}, by exploiting the conditional independences between the random variables (X_1, . . . , X_d), which are Markov on the tree T, it can easily be shown [30] that

    Σ_{i,j} = ρ_{i,j} σ_i σ_j,                               (i, j) ∈ E
    Σ_{i,j} = σ_i σ_j ∏_{(k,l)∈Path(i,j)} ρ_{k,l},           (i, j) ∉ E.    (5)

For example, the covariance of X2 and X3 of the model in Fig. 2 is σ2 σ3 ρ1,2 ρ1,3. Thus, in general, the sets {σ_i : i ∈ V} and {ρ_{i,j} : (i, j) ∈ E} completely characterize the covariance matrix of X. The construction of Σ in (5) results in a positive definite matrix if |ρ_{i,j}| ∈ (0, 1) for all (i, j) ∈ E [30]. Furthermore, J := Σ^{−1} is sparse according to E. Since there are only d − 1 edges in a tree model, 2d − 1 parameters (d − 1 correlation coefficients and d variances) suffice to describe the covariance matrix. Together with the d elements in the mean of X, for Gaussian trees 3d − 1 parameters suffice to describe the entire distribution. This is in contrast to Gaussian models with full covariance matrices, i.e., those without any Markovity assumption. Such unstructured models require d(d + 3)/2 parameters to describe. This is quadratic in d and hence prohibitively large. Thus, the tree-structured Gaussian graphical model is parsimonious.
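To make (5) concrete, the following sketch (Python/NumPy with illustrative names; the authors' released code is in MATLAB) fills in the full covariance matrix from the tree edges, the edge correlations {ρ_{i,j}}, and the standard deviations {σ_i}, by accumulating products of edge correlations along tree paths with a breadth-first traversal started from each node:

import numpy as np
from collections import defaultdict, deque

def tree_covariance(edges, sigma, rho):
    """Build Sigma as in (5): Sigma[i, j] = sigma_i * sigma_j * product of rho along the i-j path."""
    d = len(sigma)
    nbrs = defaultdict(list)
    for (i, j) in edges:
        r = rho[(i, j)] if (i, j) in rho else rho[(j, i)]
        nbrs[i].append((j, r))
        nbrs[j].append((i, r))
    Sigma = np.zeros((d, d))
    for root in range(d):
        corr = {root: 1.0}                 # product of edge correlations on the root-to-node path
        queue = deque([root])
        while queue:
            p = queue.popleft()
            for j, r in nbrs[p]:
                if j not in corr:
                    corr[j] = corr[p] * r
                    queue.append(j)
        for j, c in corr.items():
            Sigma[root, j] = c * sigma[root] * sigma[j]
    return Sigma

# Example for the star of Fig. 2, with the center relabeled as node 0 (0-indexed):
edges = [(0, j) for j in range(1, 5)]
rho = {e: 0.6 for e in edges}
sigma = np.ones(5)
Sigma = tree_covariance(edges, sigma, rho)   # e.g., Sigma[1, 2] == 0.6 * 0.6

Each breadth-first traversal is O(d), so filling in all of Σ costs O(d^2), consistent with the accounting in Section IV-C.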

D. Learning Tree-Structured Graphical Models

Chow and Liu [25] used the factorization in (4) to derive an algorithm to learn trees from data. However, Chow and Liu assumed that the variables are discrete, i.e., they take on finitely many values. Here, we describe an extension of their algorithm to continuous-valued data and, in Section III-B, we use this subroutine to learn mixtures of Gaussian trees.

The setup is as follows: given that samples x^(1), . . . , x^(n) ∈ R^d are independently and identically drawn from a tree-structured graphical model p_X(x), we aim to reconstruct the set of edges of the underlying graph of p, i.e., if X is Markov on G = (V, E), we want to learn E. To do so, consider the following optimization (known as a reverse I-projection):

    min_{q ∈ T_d} D(p̂ ‖ q)    (6)

where T_d denotes the family of tree-structured graphical models taking values in R^d and p̂ is an estimate of the distribution (not necessarily a tree-structured one) obtained from the samples x^(1), . . . , x^(n). Intuitively speaking, the optimization in (6) finds the approximating tree-structured graphical model that is closest in the KL-divergence sense to p̂. By exploiting the factorization in (4), Chow and Liu showed that the optimization problem in (6), which appears to be complicated because it is an optimization over all d^{d−2} spanning trees [27], reduces to the following max-weight spanning tree problem [27]:

    Ê := arg max_{E ∈ T_d} ∑_{(i,j)∈E} Î(X_i; X_j)    (7)

where the maximization in (7) is over the set of (spanning) trees T_d and Î(X_i; X_j) is the empirical mutual information between X_i and X_j computed with respect to p̂

    Î(X_i; X_j) = ∫_{R^2} p̂(x_i, x_j) log[ p̂(x_i, x_j) / (p̂(x_i) p̂(x_j)) ] dx_i dx_j.    (8)

The optimization in (7) can be solved efficiently by Kruskal's algorithm [27], which runs in time O(d^2 log d). Intuitively, the maximization selects the tree with edges that connect random variables that are the most highly correlated, since mutual information [34] measures the degree of correlation. In the jointly Gaussian case, which is of central interest in this paper, the empirical mutual information Î(X_i; X_j), defined in (8), can be expressed in closed form as [34]

    Î(X_i; X_j) = −(1/2) log(1 − ρ̂_{i,j}^2)    (9)

where ρ̂_{i,j} is an estimate of the correlation coefficient between the random variables X_i and X_j. This quantity can be computed via a maximum-likelihood procedure. For example, if μ̂_i := (1/n) ∑_{k=1}^{n} x_i^(k) is the maximum-likelihood estimate of the mean of the ith Gaussian random variable X_i and

    Σ̂_{i,j} := (1/n) ∑_{k=1}^{n} (x_i^(k) − μ̂_i)(x_j^(k) − μ̂_j)    (10)

is the maximum-likelihood estimate of [Σ]_{i,j}, then the correlation coefficient between X_i and X_j can be estimated as ρ̂_{i,j} = Σ̂_{i,j} (Σ̂_{i,i} Σ̂_{j,j})^{−1/2}. Since Î(X_i; X_j) in (9) is monotonically increasing in |ρ̂_{i,j}|, instead of the mutual informations in (7), we can equivalently search for the edge set with the highest sum of |ρ̂_{i,j}|. In this way, we circumvent the need to perform the computation in (9).
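The resulting Gaussian Chow-Liu step can be summarized in a short sketch (Python/NumPy; the function and variable names are illustrative and not from the authors' MATLAB release). It estimates the correlations as above and then finds a maximum-weight spanning tree of the complete graph weighted by |ρ̂_{i,j}|; Prim's algorithm is used here in place of Kruskal's purely for brevity and yields the same tree. The returned quantities follow the (edges, σ, ρ) convention of the tree_covariance sketch in Section II-C.

import numpy as np

def gaussian_chow_liu(X):
    """Learn a Gaussian tree from samples X of shape (n, d): max-weight spanning tree over |rho|."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)       # ML covariance estimate, as in (10)
    sigma = np.sqrt(np.diag(Sigma_hat))
    R = Sigma_hat / np.outer(sigma, sigma)                # empirical correlations rho_ij
    W = np.abs(R)                                         # edge weights: maximize the sum of |rho|
    # Prim's algorithm for a maximum-weight spanning tree on the complete graph.
    in_tree = np.zeros(d, dtype=bool)
    in_tree[0] = True
    best_w = W[0].copy()                                  # best weight connecting each node to the tree
    best_parent = np.zeros(d, dtype=int)
    edges = []
    for _ in range(d - 1):
        j = int(np.argmax(np.where(in_tree, -np.inf, best_w)))   # heaviest edge leaving the tree
        edges.append((best_parent[j], j))
        in_tree[j] = True
        improve = W[j] > best_w
        best_parent[improve] = j
        best_w = np.maximum(best_w, W[j])
    rho_edges = {(i, j): R[i, j] for (i, j) in edges}
    return edges, mu, sigma, rho_edges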


III. MIXTURE OF GAUSSIAN TREES MODEL FOR THE POSITIVE CLASS

In this section, we review the mixture of trees model introduced by Meilă and Jordan [28] and the EM-based algorithm used to learn such models from data. However, because our data are continuous-valued, it is necessary to modify the work by Meilă and Jordan [28] to handle such noncategorical data. Thus, we suggest the mixture of Gaussian trees model and also a learning algorithm for such models by combining the EM framework in [28] and the learning algorithm for Gaussian tree models [30] described in Section II-D. We will use the learning algorithm to model the time-series data from the positive class, which typically has very few samples (compared with the negative class) but can be multimodal in nature.

A. Mixture of Gaussian Trees

Even though trees and their variants have been employed successfully in a variety of machine learning tasks such as classification [37], a Gaussian tree can only model unimodal data, i.e., it has one peak. Thus, a natural generalization to effectively model data with more than one mode (or multimodal data) is to consider a convex combination of two or more Gaussian trees. Such a model is characterized by a set of nonnegative mixing coefficients {α_l : 1 ≤ l ≤ L} that sum to one (i.e., ∑_{l=1}^{L} α_l = 1) and a collection of constituent Gaussian tree models {p_l(x) = N(x; μ_l, Σ_l) : 1 ≤ l ≤ L}. Since each p_l is a Gaussian tree model, J_l := Σ_l^{−1} has a sparsity pattern that corresponds to a tree and, in particular, J_l has only O(d) nonzero elements. The probability density function of the random vector X, following a mixture of Gaussian trees model M_L, satisfies

    M_L :   p^(L)(x; θ) = ∑_{l=1}^{L} α_l p_l(x)    (11)

where θ := {α_l, μ_l, Σ_l : 1 ≤ l ≤ L}. Because each mixture component p_l is a Gaussian tree, the convex combination p^(L) can have up to L modes. Since the positive class, which is modeled using a mixture of Gaussian trees, has few samples, we typically set L = 1 or 2 to avoid overfitting. Note that the number of parameters required to describe the model is small. Indeed, only 3dL − 1 parameters suffice. This number is linear in d and hence the model is parsimonious. We are thus able to prevent overfitting and represent multimodal data well.

There is flexibility in how many mixture components L to use in (11). If L is too large, we overfit, while if L is too small, the model may not represent the data well. One way to automatically learn the number of mixture components L is to use the BIC [29]. This criterion stipulates that the model with the smallest BIC score should be chosen. The BIC score for model M_L is defined as

    BIC(L) := min_θ { −2 log p^(L)(x^(1), . . . , x^(n); θ) } + k(L) log n    (12)

where, because of the independent identically distributed nature of the data, the first term is in fact the sum of n terms of the form log p^(L)(x^(k); θ), and k(L) = 3dL − 1 is the number of parameters of the model M_L; this is because, per tree, there are d means, d variances, and d − 1 correlation coefficients that need to be specified and, globally, we also need to specify L − 1 mixing parameters. This works out to k(L) = 3dL − 1. The first term in (12) represents the fit to data; it is the maximized log-likelihood of the data under model M_L, maximized over all parameters θ. The second term in (12) represents a penalty for using a more complex model. Thus, the BIC in (12) represents a natural tradeoff between data fidelity and a penalty for overfitting. We will see in Section V-E that, in practice, the BIC is a good model selection procedure to estimate L.
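A minimal sketch of (12) (Python/NumPy; the fit_mixture helper named in the comment is a hypothetical stand-in for the generalized EM routine of Section III-B, not part of the paper's code):

import numpy as np

def bic_score(loglik, L, d, n):
    """BIC score (12) for an L-component mixture of Gaussian trees fit to n samples of dimension d."""
    k = 3 * d * L - 1   # per tree: d means + d variances + (d - 1) correlations; plus L - 1 mixing weights
    return -2.0 * loglik + k * np.log(n)

# Model selection: pick the L with the smallest score, e.g. (fit_mixture is hypothetical)
# L_best = min([1, 2], key=lambda L: bic_score(fit_mixture(X_pos, L).loglik, L, d, n))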

B. Learning Mixture of Gaussian Trees

A central question in the study of mixture models is the maximum-likelihood learning of the model parameters from data. As in Section II-D, we assume that we are given a set of samples {x^(1), . . . , x^(n)} ⊂ R^d, each drawn independently and identically from a probability distribution p(x) that is either an exact mixture of Gaussian trees or can be well modeled by such a mixture. We would like to learn the parameters of the model θ assuming that L is given. This learned model will then be used for oversampling. The usual way of learning mixture models is to regard cluster assignments as latent variables and employ an EM procedure [32] that alternates between an expectation step (E-step) and a maximization step (M-step). A canonical example is Gaussian mixture modeling [21, Section 9.2.2]. However, since we want to take the time-series correlations into account, we have to perform further projections onto the set of Gaussian trees using the technique described in Section II-D.

First, we initialize the parameters α_l, μ_l, Σ_l for 1 ≤ l ≤ L to feasible values. For example, we can set α_l = 1/L. The means μ_l ∈ R^d can be drawn randomly from some continuous distribution in R^d (for example, a Gaussian). Alternatively, we can use the k-means algorithm to obtain good estimates of μ_l. Finally, Σ_l ∈ S^d_+ can be set to any positive definite covariance matrix. To do so, we generate a matrix A at random and set Σ = AA^T. Then, based on these initial parameters, we compute the log-likelihood of the data under the current model.

Second, for the E-step, we compute the responsibilities [21] as

    γ_{kl} := α_l N(x^(k); μ_l, Σ_l) / ∑_{j=1}^{L} α_j N(x^(k); μ_j, Σ_j)    (13)

where k ∈ {1, . . . , n} indexes samples and l ∈ {1, . . . , L} indexes mixture components. Roughly speaking, γ_{kl} represents the probability that sample x^(k) belongs to mixture component l, and the γ_{kl} are the latent variables (class assignments) in this EM procedure. Note that ∑_{l=1}^{L} γ_{kl} = 1 for all k.

Third, for the M-step, we set n_l := ∑_{k=1}^{n} γ_{kl} to be the effective number of samples in mixture component l. Note that ∑_{l=1}^{L} n_l = n. We then reestimate the intermediate parameters consisting of the mixing coefficients

    α_l^new = n_l / n    (14)

the means

    μ_l^new = (1/n_l) ∑_{k=1}^{n} γ_{kl} x^(k)    (15)

and the intermediate covariances

    Σ̃_l^new = (1/n_l) ∑_{k=1}^{n} γ_{kl} (x^(k) − μ_l)(x^(k) − μ_l)^T    (16)

for each l ∈ {1, . . . , L}.
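In vectorized form, the updates (14)-(16) can be sketched as follows (Python/NumPy, illustrative only; here the intermediate covariances are centered at the freshly updated means, the usual convention in Gaussian-mixture M-steps):

import numpy as np

def m_step_intermediate(X, gamma):
    """Intermediate M-step (14)-(16): X has shape (n, d), gamma has shape (n, L)."""
    n, d = X.shape
    nl = gamma.sum(axis=0)                              # effective counts n_l, one per component
    alpha_new = nl / n                                  # (14)
    mu_new = (gamma.T @ X) / nl[:, None]                # (15), shape (L, d)
    Sigma_tilde = []
    for l in range(gamma.shape[1]):
        Xc = X - mu_new[l]                              # samples centered at the updated mean
        Sigma_tilde.append((gamma[:, l, None] * Xc).T @ Xc / nl[l])   # (16)
    return alpha_new, mu_new, Sigma_tilde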


However, note from (16) that the constituent Gaussians p̃_l(x) := N(x; μ_l^new, Σ̃_l^new) are not tree-structured (because (Σ̃_l^new)^{−1} is, in general, not sparse with respect to a tree), but we require them to be so because we would like to learn a parsimonious model with only O(d) parameters. It is at this point that we depart from the traditional procedure used to learn the parameters of a Gaussian mixture model. We must now project Σ̃_l^new onto the set of trees, i.e., perform the optimization (6) for each component, resulting in the lth edge set being updated as follows:

    Ê_l^new := arg max_{E ∈ T_d} ∑_{(i,j)∈E} Î_{p̃_l}(X_i; X_j)    (17)

where Î_{p̃_l}(X_i; X_j) is the empirical mutual information between X_i and X_j computed with respect to the Gaussian p̃_l, which has mean and covariance given by (15) and (16), respectively. The empirical mutual information Î_{p̃_l}(X_i; X_j) is given by the formula in (9) with Σ̃_l^new in place of Σ̂. The updated covariance matrix Σ_l^new is then given by (5) with edge set Ê_l^new. In addition, the correlation coefficients along the edges and the variances are computed based on the intermediate covariance Σ̃_l^new. To find the (unique) path between two nodes efficiently, we employ Dijkstra's shortest path algorithm (see Section IV-C) on Ê_l^new [27]. The covariance Σ_l^new is guaranteed to have an inverse whose sparsity structure corresponds to the tree with edge set Ê_l^new.

After the M-step, we recompute the log-likelihood of the n samples under the new parameters {α_l^new, μ_l^new, Σ_l^new : 1 ≤ l ≤ L}. We then iterate and recompute the responsibilities in the E-step in (13). The algorithm is terminated when the difference between the log-likelihoods of the samples between two iterations falls below a predefined small threshold. In practice, for our data sets, we have found that the number of iterations it takes for the algorithm to converge is small (< 10). This is because we provide good initializations using k-means.

One appealing feature of the EM-based algorithm described above is that the log-likelihoods do not decrease with the iteration count [21], [32]. There is a subtlety here, and we cannot simply convince ourselves that this is true by the usual EM argument. In the proposed method, we do not exactly maximize the likelihood of the model under the old parameters at each iteration; rather, we find a set of parameters such that the likelihood of the model increases (or, rather, does not decrease). This is because the optimization problem in (6) does not correspond exactly to solving the maximum-likelihood problem. Thus, we are in fact using a generalized EM procedure [31]. When the log-likelihood converges, it means that we have found a stationary point in the maximization of the log-likelihood over the set of parameters θ. Thus, we can use the log-likelihood at each iteration as a test for the convergence of the algorithm.

C. Implementation Details

One potential problem when dealing with high-dimensional time-series data (i.e., when d is large) is that the evaluation of the likelihoods N(x^(k); μ_l, Σ_l) in (13) results in very small values. Hence, the computation of the responsibilities in (13) is ill-conditioned. When implementing the algorithm, we perform all computations in the log-domain. For the L = 2 case, the computation in (13) reduces to the addition and subsequent division of two small numbers. Division is easily done in the log-domain, while for addition, we use the identity

    log(a + b) ≡ log{exp[log(a) − log(b)] + 1} + log(b).    (18)

With this modification, all computations are well conditioned.
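A log-domain E-step can be sketched as follows (Python with SciPy, illustrative; the paper's own implementation is in MATLAB). Here logsumexp plays the role that the identity (18) plays for L = 2 components:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities_log_domain(X, alphas, mus, Sigmas):
    """Responsibilities (13) computed entirely in the log-domain (Section III-C)."""
    log_w = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=S)   # log alpha_l + log N(x; mu_l, Sigma_l)
        for a, m, S in zip(alphas, mus, Sigmas)
    ])
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))  # gamma_kl, rows sum to 1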

IV. OVERSAMPLING AND CLASSIFICATION FRAMEWORK USING A MIXTURE OF GAUSSIAN TREES

In this section, we describe the framework for oversampling and subsequent classification using the mixture of Gaussian trees model. We then discuss the advantages of the proposed oversampling and classification framework compared with existing techniques such as SPO [1]. Finally, we derive the computational complexity.

Consider a general binary classification problem in which we are provided with a set of positively labeled samples D+ := {x^(1), . . . , x^(n+)} and a set of negatively labeled samples D− := {x̄^(1), . . . , x̄^(n−)}. These form the training samples, which are used to learn a classifier to classify test samples Dtest. The cardinality of the set of positively labeled samples n+ = |D+| is assumed to be much smaller than the cardinality of the set of negatively labeled samples n− = |D−|. As mentioned in Section I, this frequently occurs in real-life scenarios where the positive class could represent samples drawn from an anomalous event, while the negative class could represent samples from the typical event. At a high level, we plan to augment the samples in the positive class by first modeling the underlying distribution of the samples in D+ and then sampling from that learned distribution to create a new positively labeled class D+′ such that n+′ = |D+′| = n−.

A. Details of the Oversampling and Classification Procedure

First, we use the procedure described in Section III-B to learn a mixture of Gaussian trees from D+. For this, we either fix L or find its best value via the BIC score in (12). For simplicity, we usually set L = 2 so that we can model multimodal data from a limited number of samples and the number of parameters is also small.

Second, we generate n− − n+ samples from p^(L). Suppose that L = 2; then we sample from a Bernoulli random variable B ∈ {1, 2} with bias α_1. If B = 1 (resp. B = 2), then we sample a vector randomly from N(μ_1, Σ_1) (resp. from N(μ_2, Σ_2)). To sample vectors from a multivariate Gaussian N(μ, Σ), one usually performs the Cholesky decomposition of the covariance matrix Σ = LL^T, where L is lower triangular, samples a vector z from N(0, I), and then performs the linear transformation Lz + μ to obtain the desired sample. However, the computational complexity of the Cholesky decomposition is cubic in d. By noting that each component is a tree, we can implement this step more judiciously by starting from any node in the tree (the root) and directing all undirected edges away from the root. For example, suppose X_1 is the root of the tree in Fig. 2. Then, we can sample all other variables conditioned on their sole parents in a BFS fashion, i.e., from the conditional distributions p(x_j | x_1), j = 2, . . . , 5. The complexity of such a procedure exploiting the learned tree structure is linear in d.
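A sketch of this linear-time sampler (Python/NumPy, illustrative names, not the authors' code). For a tree edge (p, j) with correlation ρ, the conditional law of X_j given X_p = x_p is Gaussian with mean μ_j + ρ σ_j/σ_p (x_p − μ_p) and variance σ_j^2 (1 − ρ^2), a standard bivariate-Gaussian fact:

import numpy as np
from collections import defaultdict, deque

def sample_gaussian_tree(edges, mu, sigma, rho, rng, root=0):
    """Draw one sample from a Gaussian tree by ancestral (BFS) sampling away from the root."""
    d = len(mu)
    nbrs = defaultdict(list)
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    x = np.empty(d)
    x[root] = rng.normal(mu[root], sigma[root])
    visited = {root}
    queue = deque([root])
    while queue:                                   # breadth-first traversal away from the root
        p = queue.popleft()
        for j in nbrs[p]:
            if j in visited:
                continue
            r = rho[(p, j)] if (p, j) in rho else rho[(j, p)]
            cond_mean = mu[j] + r * sigma[j] / sigma[p] * (x[p] - mu[p])
            cond_std = sigma[j] * np.sqrt(1.0 - r * r)
            x[j] = rng.normal(cond_mean, cond_std)
            visited.add(j)
            queue.append(j)
    return x

# usage: x = sample_gaussian_tree(edges, mu, sigma, rho, np.random.default_rng(0))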


Algorithm 1 Oversampling from a Mixture of Gaussian Trees
Input: Training data D+, D−; test data Dtest; tolerance ε > 0; number of components L
Output: Class labels for Dtest
Init: Initialize α_l, μ_l, and Σ_l to feasible values
Calculate: Log-likelihood as in (11)
while (τ > ε) do
    E-step: Calculate responsibilities γ_{kl} using (13)
    M-step I: Update (α_l^new, μ_l^new, Σ̃_l^new) using (14)–(16)
    M-step II: Update tree Ê_l^new using (17) (Chow–Liu)
    M-step III: Update covariance to Σ_l^new using Dijkstra's algorithm and (5)
    Calculate: Log-likelihood as in (11)
    Calculate: τ = Difference in successive log-likelihoods
end while
Calculate: Generate n− − n+ samples from the model in a breadth-first search (BFS) fashion
Calculate: Learn SVM classifier to label samples in Dtest

Having augmented the samples in D+ to form a new data set D+′, we can then use a standard SVM [19] on D− and D+′ to classify the unlabeled samples. The whole procedure is summarized in Algorithm 1.
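For completeness, the outer loop of Algorithm 1 after the EM fit can be sketched as follows (illustrative Python, not the authors' MATLAB release). Here sample_gaussian_tree is the sampler sketched in Section IV-A, the (edges, mu, sigma, rho) tuples and mixing weights are assumed to come from the fitted mixture, n− > n+ is assumed, and scikit-learn's SVC stands in for the SVM of [19], with (C, g) to be chosen by the grid search described in Section V:

import numpy as np
from sklearn.svm import SVC

def oversample_and_train(X_pos, X_neg, alphas, trees, rng, C=1.0, gamma="scale"):
    """Generate n- minus n+ synthetic positives from the fitted mixture, then train an RBF SVM."""
    n_new = len(X_neg) - len(X_pos)                        # assumes the negative class is larger
    comps = rng.choice(len(alphas), size=n_new, p=alphas)  # component index for each synthetic sample
    synth = np.array([sample_gaussian_tree(*trees[l], rng) for l in comps])
    X = np.vstack([X_pos, synth, X_neg])
    y = np.concatenate([np.ones(len(X_pos) + n_new), np.zeros(len(X_neg))])
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)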


B. Advantages of the Proposed Method

In imbalanced time-series classification, the minority class is a high-dimensional data set, i.e., the length of the time-series far exceeds the number of samples. Simultaneously, its distribution may be multimodal in nature due to large within-class data variations. We would thus like to learn a parsimonious model so that the minority class may be modeled with limited parameters before synthetic examples are generated to balance the training data. We now highlight a few features of the proposed oversampling method that explain why it performs well on a diverse range of data sets, as we will see in Section V.

First, the number of parameters in our model is small (of order O(Ld)). This is in contrast to other oversampling methods for dealing with imbalanced data, such as SPO [1], where the corresponding number is O(d^2). It is well known in machine learning that the fewer the number of parameters, the less likely it is that overfitting occurs (Occam's razor [21]).

Second, the mixture of Gaussian trees model has no parameters to tune except for L, which is set to 1 or 2, and this choice can be determined by the BIC (12); see Section V-E. Other methods such as SPO [1] require an additional postprocessing step on the eigenspectrum of the learned covariance matrix Σ to truncate small eigenvalues of Σ. Doing this in a principled way requires cross-validation, which is computationally expensive. Regularizing the eigenspectrum is required because there are too few positively labeled samples (relative to the dimension) for Σ to be positive definite. However, our model does not suffer from this problem due to its parsimony.

Third, the model is well adapted to time-series data since each component is Markov on a tree [23]. Time-series data are characterized by the fact that each sample in time is highly correlated to its neighbors and the correlation decays as the distance between the two nodes increases. By learning a tree, we explicitly incorporate this important characteristic of the data set into our model.

Finally, we are able to handle multimodal data because the model is a mixture of trees. We have observed visually that the positive class usually consists of data that can be naturally clustered into multiple groups. See Fig. 1, where data from the Wafer data set are displayed; the data set is clearly bimodal. SPO [1] only handles positively labeled unimodal data well. Even though SPO was proposed and tested for a highly imbalanced scenario with limited positive samples (≤ 50), this unimodal assumption is restrictive for the generic imbalanced time-series classification that we consider here.

C. Computational Complexity

The advantages of the proposed method outweigh its computational requirements, which are higher than those of SPO [1]. Recomputing the statistics in the M-step in (14)–(16) requires O(nd^2) operations (additions and multiplications), as we need to compute pairwise statistics for each of the n samples. Learning the Chow-Liu tree for each component given the intermediate covariance matrix involves a max-weight spanning tree procedure and requires O(d^2 log d) operations [27]. Dijkstra's algorithm requires O(d^2) operations [27], and the same number of operations is required for filling in the rest of the entries of the covariance matrix using (5). To find the unique path between any two nodes in the learned tree, we use the graphshortestpath function in MATLAB. This step, however, can be further optimized using recursive algorithms on trees to exploit the fact that the unique path between two faraway nodes contains many subpaths that are essentially overlapping. We leave this extension for future work. Since there are L components, these complexities sum up to give O(Ld^2(n + log d)). The E-step requires on the order of O(Lnd) operations, which is dominated by the M-step, so the overall computational complexity is O(Ld^2(n + log d)). We have assumed in this calculation that the number of iterations of the generalized EM procedure is constant. We observe, in practice, that this is less than 5 with a good initialization, say via k-means. The dependence on d^2 cannot be avoided because we are using pairwise statistics in our modeling.

V. NUMERICAL RESULTS

A. Illustration on a Toy Data Set

Starting with a toy data set (with d = 2 dimensions), we show how our mixture model handles multimodal data. In Fig. 3, a data set consisting of a large number of negatively labeled samples (black) and a small number of positively labeled samples (blue) is shown. The positive samples are randomly selected from the positive class of a balanced data set. We deliberately generated the data set so that the positively labeled samples are drawn from a mixture distribution whose components have distinctly different means.


Fig. 3. Visual comparison: the positive samples, negative samples, and synthetically generated positive samples are the blue pluses, black dots, and red circles, respectively. Observe that our proposed 2MoGT method (mixture of Gaussian trees with L = 2 components) is able to model the positive class (blue) better than SPO [1] and DataBoost [13], which assume that the data are unimodal. Also, 2MoGT exhibits more spread (or variance) compared with SMOTE [7]. This increased variability results in better support vectors [19] to separate the classes for subsequent classification.

Note that, since d = 2 (for ease of visualization), the tree aspect of the model cannot be visualized since there are only two nodes in the trees. We generated synthetic samples to augment the positive class using the following methods: SPO [1], SMOTE [7], B-SMOTE [10], ADASYN [11], and DataBoost [13].

We observe that SMOTE generates synthetic samples (red) that closely resemble the positively labeled samples. This is because SMOTE selects each sample as a seed for interpolating a new sample with a randomly selected nearest neighbor. B-SMOTE and ADASYN generate the majority of their synthetic samples near the boundary with the negative class. This is to protect the hard-to-classify positive samples. However, this distorts the distribution of the original positive class. Though these three methods work well in general at maintaining the two separate distributions of our positive class, the mechanism of interpolating with a random nearest neighbor lacks the ability to compensate for the lack of samples in the positive class. Thus, we see that empty internal areas within each cluster remain largely empty and none of the synthetic samples have been generated in the vicinity of the current positive class. Two other methods, SPO and DataBoost, do not model the multimodal data well since many synthetic samples fall in regions with low probability mass. This is because both methods assume that the data are unimodal. Though the unimodal assumption has the advantage of requiring few positive samples for modeling, these oversampling methods are unable to adapt well to the multimodal scenario. The synthetically generated samples do not concur with the true positive distribution and thus the additional computational expense is not well justified.

Our proposed method based on mixtures of Gaussian trees accounts for the multimodal aspect of the data. The synthetic samples occupy some of the empty spaces within each cluster and they also expand into the vicinity of the region occupied by the available positive samples.

Compared with the other methods, the (augmented) synthetic data set generated by 2MoGT appears to be the most similar to the balanced multimodal data set.

B. Data Sets, Competing Methods, and Figures of Merit

We constructed 54 imbalanced two-class data sets from 10 multiclass data sets (FaceAll, TwoPatterns, Fish, SLeaf, OSULeaf, Wafer, Yoga, 50words, SynCtrl, and Adiac) from the UCR time-series repository [33] for numerical evaluation. See Table I for a description of the data sets. These 10 data sets were selected from the 20 available UCR data sets because they contain sufficiently large numbers of samples to enable us to simulate highly skewed binary classification, in which the size of the training data in the negative class is many times that of the positive class and, at the same time, the positive class still contains a meaningful number of samples to facilitate learning. Out of the 10 UCR data sets, FaceAll, SLeaf, Fish, TwoPatterns, OSULeaf, 50words, SynCtrl, and Adiac have 14, 15, 7, 4, 6, 50, 6, and 7 classes, respectively. We systematically converted them into multiple imbalanced two-class data sets by selecting one original class, or merging several, to form the positive class and using the remaining classes to form the negative class. The indices of the selected positive classes are appended to the name of the data set. For instance, FaceAll2_3 means that the classes originally labeled 2 and 3 are selected to form the positive class.

Note that we distinguish two types of imbalanced data sets, which we call multimodal and unimodal. Multimodal here refers to the case in which multiple original classes are selected to form the new positive class, while unimodal means that only one original class is selected.


TABLE I
IMBALANCED MULTIMODAL (MM WITH 30 IN TOTAL) AND UNIMODAL (UM WITH 18) TIME-SERIES DATA SETS CONSTRUCTED FROM EIGHT MULTICLASS DATA SETS FROM THE UCR REPOSITORY [33], WHERE * DENOTES THE INDEX OF THE ORIGINAL CLASS INCLUDED IN OUR POSITIVE CLASS

By merging original classes, we simulate the situation in which the positive class is indeed multimodal, and it is this situation in which our proposed algorithm is expected to perform well, since it takes the multimodality into account explicitly. Note that we do not construct multimodal imbalanced data sets for Wafer and Yoga since the original data sets contain only two classes. We do not construct unimodal imbalanced data sets for Adiac, Fish, 50words, SynCtrl, and OSULeaf either, since each of their original classes contains a limited number of samples.

For each binary data set, the training and test data are divided randomly and all available samples are included in the training or test set. As shown in Table I, we made sure that the number of positively labeled training samples does not exceed 50% of the total available positive samples. For each training set, the number of positive samples is no more than 200 to simulate the scenario where the number of positive instances is limited. The imbalance ratios (i.e., the ratio of the number of negative samples to positive ones) are between 1.9 and 15. For many of our data sets, the number of positive samples n+ is comparable with the time-series length d, except for SLeaf, 50words, Fish, OSULeaf, and Adiac, since they all have very limited numbers of positive samples. The paucity of such data sets with respect to the high dimensionality of the time-series is one of the key challenges mentioned earlier.

The figures of merit that we use to assess the accuracy of the algorithms are the F-value and G-mean [2], [3]. These are simple functions of the precision and recall. Because of space constraints, we only tabulate the F-values for the various algorithms and data sets. We observed that the behavior of the G-means is similar to that of the F-values. All the imbalanced data sets and the tables for the G-means can be found at the following link: http://goo.gl/vuwGW1. For reproducibility of our results, the MATLAB source code for our method, LMoGT, is made available at the same link.

C. Comparisons with Oversampling Methods

The comparisons of our mixture of Gaussian trees method for oversampling with other oversampling methods are detailed in Tables II and III.

During the SVM-based classification with a radial basis function kernel, we perform a log-scale grid search and choose the best combination of SVM parameters (C, g) for each oversampling method and each data set through cross-validation. Our methods are termed 1MoGT and 2MoGT, where the numeral denotes the number of mixture components, which is set to either 1 or 2. In the 1MoGT case, note that we are simply modeling the positively labeled data using one Gaussian tree model. We ran all experiments over 10 independent runs and computed the means and standard deviations of the F-values and G-means. This is to confirm that the variability across different oversampled data sets is not large and that our results are consistent.

For the multimodal data sets, we observe from Table II that 2MoGT outperformed all the remaining oversampling methods with good consistency. Here, 2MoGT achieves the best average performance 28 times (according to the F-value) out of a total of 36 data sets. In particular, we found for some difficult-to-classify data sets (e.g., TwoPattern4_1, SLeaf1_2, 50words3_4, and Adiac7_8_9_10) that 2MoGT achieves large margins of improvement over all other oversampling methods. The good consistency and large improvement margins of 2MoGT suggest that it is important to model the distribution of such data sets with more than one mixture component for subsequent oversampling. As discussed previously, our proposed parsimonious model requires significantly fewer parameters for achieving good modeling outcomes in this process, where the positive samples available for modeling can be very limited compared with their corresponding time-series length.

We have also benchmarked 2MoGT against a recently developed integrated oversampling (INOS) method [3], which enhances and integrates SPO with interpolation-based oversampling techniques. Using the suggested integration ratio of 70%, we find that, for the multimodal data sets, INOS achieves better performance than SPO, but its performance is not comparable with that of 2MoGT. For instance, INOS achieves average F-values of 92.5%, 85.6%, 87.4%, 81.6%, 65.3%, 78.2%, 98.6%, and 71.5% for FaceAll, TwoPatterns, Fish, SLeaf, OSULeaf, 50words, SynCtrl, and Adiac, respectively. The corresponding average F-values of 2MoGT are 96.5%, 87.8%, 88.0%, 87.2%, 66.4%, 90.2%, 99.0%, and 79.9%, with clear winning margins for six out of the eight multimodal data sets.

For the unimodal data sets, we observe from Table III that 1MoGT and 2MoGT perform competitively across all 18 imbalanced time-series data sets compared with the other oversampling methods.


TABLE II
Comparison of F-Values for Various Oversampling Methods on Multimodal Positive Class

TABLE III
Comparison of F-Values for Various Oversampling Methods on Unimodal Positive Class

Moreover, SPO also incorporates regularization and cleaning mechanisms to boost its performance. Without such additional measures, our 1MoGT performs slightly worse than SPO, but it requires significantly fewer parameters for modeling the time-series. Apart from SPO, our 1MoGT and 2MoGT perform the best, achieving the best F-value on seven data sets and the best G-mean on eight data sets out of a total of 18. We also notice that when 1MoGT outperforms 2MoGT, the differences in performance are not very pronounced. This can be attributed to the fact that we also learn the mixing parameters αl: if the positive class is strongly unimodal and we set L = 2, then either the inferred α1 or α2 will be small (close to zero), and that component is automatically de-emphasized.
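As a minimal illustration of this de-emphasis, the snippet below uses the standard mixture-model M-step for the mixing weights (the average posterior responsibility per component); the responsibility values are hypothetical numbers chosen to mimic a strongly unimodal positive class, not values from our experiments.

```python
import numpy as np

# Posterior responsibilities gamma[i, l] = P(component l | sample i) from the E-step.
# Hypothetical values: component 0 explains nearly every positive sample.
gamma = np.array([[0.99, 0.01],
                  [0.98, 0.02],
                  [0.97, 0.03],
                  [0.99, 0.01]])

# M-step: each mixing weight alpha_l is the average responsibility of component l.
alpha = gamma.mean(axis=0)   # approximately [0.9825, 0.0175]
print(alpha)

# During oversampling, components are selected with probability alpha_l, so the
# near-zero component is almost never used to generate synthetic samples.
rng = np.random.default_rng(0)
counts = np.bincount(rng.choice(2, size=1000, p=alpha), minlength=2)
print(counts)                # the count for the second component is tiny
```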

D. Comparisons With Nonoversampling Methods

We also compared our proposed 2MoGT and 1MoGT with other (nonoversampling) techniques for dealing with imbalanced time-series classification, namely 1NN-DTW [17], 1NN [18], EasyEnsemble [12], and BalanceCascade [12]. 1NN-DTW and 1NN are widely used for time-series classification, whereas EasyEnsemble and BalanceCascade are ensemble- and undersampling-based methods for learning from imbalanced data. Our results in terms of the F-value are shown in Tables IV and V.


TABLE IV
Comparison of F-Values to Other Nonoversampling Techniques on Multimodal Positive Class

TABLE V
Comparison of F-Values to Other Nonoversampling Techniques on Unimodal Positive Class

Out of the six methods, we find that 1NN-DTW achieves the best average F-value and G-mean for the TwoPatterns, FaceAll, and SynCtrl data sets, while 2MoGT achieves the best average performance for the remaining data sets, including SLeaf, 50Words, Wafer, Yoga, Fish, OSULeaf, and Adiac. Though both 1NN-DTW and 2MoGT are winners on several types of data sets, we note that 1NN-DTW is a clear winner over 2MoGT only on TwoPatterns. Here, we regard a winning margin of 3% or more in the average F-value as a confident indication that one method is a clear winner over another on a data set.


TABLE VI
Comparison of BIC Value (12) for Uni- and Multimodal Data Sets

On the other hand, 2MoGT is the clear winner over 1NN-DTW on six data sets: SLeaf, Yoga, 50Words, Fish, OSULeaf, and Adiac. The average F-value and G-mean of 2MoGT over the five types of unimodal data sets are 93.7% and 92.9%, respectively, which are 1.3% and 1.1% higher than those of 1NN-DTW. Similarly, the average F-value and G-mean of 2MoGT over the eight types of multimodal data sets are 86.9% and 90.4%, respectively, which are 3.9% and 2.0% higher than those of 1NN-DTW. In addition, we find that 1MoGT performs competitively on many unimodal data sets. Thus, 1MoGT, which has a small number of parameters (exactly 2d − 1), can be used to fit very simple unimodal Gaussian tree models to the data, which ameliorates overfitting. Besides accuracy, it is also well known that classifying a test sample with 1NN-DTW is computationally intensive [17].

E. Usefulness of the BIC in Selecting the Number of Mixtures

In Table VI, we show, using the SLeaf and TwoPatterns data sets, that the BIC [29] correlates well with the classification accuracy as measured by the F-value. The final column, BIC accuracy, indicates whether the BIC metric correctly predicts which of L = 1 and L = 2 results in the higher F-value; note that a smaller BIC value (12) is preferred. The BIC does not correlate as well with the nominal modality of the data set, because the minority positive class may itself already be multimodal, so the BIC value may not reflect the modality label assigned to the positive class. We do not consider larger values of L because the number of samples in the minority class is assumed to be small, so a small value of L is likely to work well in practice.
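As a sketch of this model-selection step, the snippet below uses one common form of the BIC, k ln n+ − 2 ln L̂, which is consistent with "smaller is preferred" above. Since (12) is not reproduced here, the parameter count below (2d − 1 free parameters per Gaussian tree, as stated for 1MoGT, plus L − 1 mixing weights) and the fit_mogt API are assumptions for illustration only.

```python
import numpy as np

def bic(log_likelihood, n_pos, d, L):
    """BIC = k * ln(n+) - 2 * ln(L_hat); smaller is better.

    Assumed parameter count: 2d - 1 free parameters per Gaussian tree
    (as stated for 1MoGT in the text) plus L - 1 free mixing weights.
    """
    k = L * (2 * d - 1) + (L - 1)
    return k * np.log(n_pos) - 2.0 * log_likelihood

# Hypothetical usage: fit both candidate models on the positive class and keep the smaller BIC.
# ll1 = fit_mogt(X_pos, L=1).loglik   # assumed fitting API
# ll2 = fit_mogt(X_pos, L=2).loglik
# n_pos, d = X_pos.shape
# L_best = 1 if bic(ll1, n_pos, d, 1) < bic(ll2, n_pos, d, 2) else 2
```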

VI. CONCLUSION

We proposed oversampling from a mixture of Gaussian trees model to correct the class imbalance in time-series classification. Our parsimonious model captures both the dependencies among neighboring time points and the multimodal nature of the positive class, and it is computationally efficient to learn. Most importantly, we validated the performance of the proposed method on 36 multimodal and 18 unimodal data sets and observed that our model yields outstanding classification performance vis-à-vis state-of-the-art oversampling and nonoversampling methods for imbalanced classification of time-series data.

We would like to explore the following three extensions. First, for very high-dimensional data, we can relax the tree assumption even further and consider learning forests [38], which are acyclic graphs that are not necessarily connected; this can be done by simply removing weak edges from the constituent trees. Second, we would like to develop recursive algorithms on trees that find subpaths within long paths to further reduce the computational complexity (see Section IV-C). Finally, we would like to explore the applicability of our method to multiclass classification problems, perhaps using the one-versus-all strategy propounded by Rifkin and Klautau [39]. Some preliminary work along these lines, applied to human activity classification, was reported in [40].

ACKNOWLEDGMENT

The authors would like to thank Dr. S.-K. Ng and Dr. X.-L. Li for their constructive criticism, which helped to improve the quality of this paper.

REFERENCES

[1] H. Cao, X.-L. Li, Y.-K. Woon, and S.-K. Ng, "SPO: Structure preserving oversampling for imbalanced time series classification," in Proc. IEEE 11th Int. Conf. Data Mining, Dec. 2011, pp. 1008–1013.
[2] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[3] H. Cao, X.-L. Li, Y.-K. Woon, and S.-K. Ng, "Integrated oversampling for imbalanced time series classification," IEEE Trans. Knowl. Data Eng., vol. 25, no. 12, pp. 2809–2822, Apr. 2013.
[4] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognit., vol. 40, no. 12, pp. 3358–3378, Dec. 2007.
[5] H. Cao and A. C. Kot, "Manipulation detection on image patches using FusionBoost," IEEE Trans. Inf. Forensics Security, vol. 7, no. 3, pp. 992–1002, Jan. 2012.
[6] M. A. Maloof, "Learning when data sets are imbalanced and when costs are unequal and unknown," in Proc. Int. Conf. Mach. Learn., Workshop Learn. Imbalanced Data Sets, 2003, pp. 1–8.


[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[8] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Comput. Intell., vol. 20, no. 1, pp. 18–36, 2004.
[9] G. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004.
[10] H. Han, W. Y. Wang, and B. H. Mao, "Borderline-SMOTE: A new oversampling method in imbalanced data sets learning," in Proc. Int. Conf. Intell. Comput., 2005, pp. 878–887.
[11] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in Proc. Int. Conf. Neural Netw., 2008, pp. 1322–1328.
[12] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Trans. Syst., Man, Cybern., vol. 39, no. 2, pp. 539–550, Apr. 2009.
[13] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.
[14] T. Jo and N. Japkowicz, "Class imbalances versus small disjuncts," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 40–49, 2004.
[15] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser, "SVMs modeling for highly imbalanced classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 1, pp. 281–288, Feb. 2009.
[16] S. Koknar-Tezel and L. J. Latecki, "Improving SVM classification on imbalanced time series data sets with ghost points," Knowl. Inf. Syst., vol. 28, no. 1, pp. 1–23, Jul. 2011.
[17] X. Xi, E. Keogh, C. Shelton, and L. Wei, "Fast time series classification using numerosity reduction," in Proc. Int. Conf. Mach. Learn., 2006, pp. 1033–1040.
[18] M. N. Nguyen, X.-L. Li, and S.-K. Ng, "Positive unlabeled learning for time series classification," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2011, pp. 1421–1426.
[19] V. N. Vapnik, The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995.
[20] D. Fecker, V. Märgner, and T. Fingscheidt, "Density-induced oversampling for highly imbalanced datasets," Proc. SPIE, vol. 8661, pp. 86610P-1–86610P-11, 2013, doi: 10.1117/12.2003973.
[21] C. M. Bishop, Pattern Recognition and Machine Learning. Berlin, Germany: Springer-Verlag, 2008.
[22] S. Lauritzen, Graphical Models. Oxford, U.K.: Oxford Univ. Press, 1996.
[23] F. Bach and M. I. Jordan, "Learning graphical models for stationary time series," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2189–2199, Aug. 2004.
[24] A. Simpson et al., "Beyond atopy: Multiple patterns of sensitization in relation to asthma in a birth cohort study," Amer. J. Respir. Crit. Care Med., vol. 181, no. 11, pp. 1200–1206, 2010.
[25] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inf. Theory, vol. 14, no. 3, pp. 462–467, May 1968.
[26] F. Kschischang, B. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[27] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. New York, NY, USA: McGraw-Hill, 2003.
[28] M. Meilă and M. I. Jordan, "Learning with mixtures of trees," J. Mach. Learn. Res., vol. 1, pp. 1–48, Oct. 2000.
[29] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978.
[30] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, "Learning Gaussian tree models: Analysis of error exponents and extremal structures," IEEE Trans. Signal Process., vol. 58, no. 5, pp. 2701–2714, May 2010.
[31] M. I. Jordan, Learning in Graphical Models. Cambridge, MA, USA: MIT Press, 1999.
[32] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.
[33] E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. (2006). UCR Time Series Classification/Clustering Page [Online]. Available: http://www.cs.ucr.edu/eamonn/time_series_data
[34] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York, NY, USA: Wiley-Interscience, 2006.


[35] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 2nd ed. San Mateo, CA, USA: Morgan Kaufmann, 1988.
[36] K.-U. Höffgen, "Learning and robust learning of product distributions," in Proc. 6th Annu. COLT, 1993, pp. 77–83.
[37] V. Y. F. Tan, S. Sanghavi, J. W. Fisher, and A. S. Willsky, "Learning graphical models for hypothesis testing and classification," IEEE Trans. Signal Process., vol. 58, no. 11, pp. 5481–5495, Nov. 2010.
[38] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, "Learning high-dimensional Markov forest distributions: Analysis of error rates," J. Mach. Learn. Res., vol. 12, pp. 1617–1653, May 2011.
[39] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," J. Mach. Learn. Res., vol. 5, pp. 101–141, Nov. 2004.
[40] H. Cao, M.-N. Nguyen, C. Phua, S. Krishnaswamy, and X.-L. Li, "An integrated framework for human activity classification," in Proc. UbiComp, 2012, pp. 331–340.

Hong Cao (S’08–M’11) received the B.Eng. (Hons.) and Ph.D. degrees from Nanyang Technological University, Singapore, in 2001 and 2010, respectively. He currently heads the Distributed Analytics Laboratory, Institute for Infocomm Research, Agency for Science, Technology, and Research, Singapore. His current research interests include stream and time-series data mining, machine learning, and multimedia forensics. Dr. Cao received the Best Paper Award at IWDW 2010 and an honorary mention at ISCAS 2010 for his previous work in image forensics. In recent years, his team has also won international and local benchmarking data challenges, such as GE FlightQuest Phase I, the Opportunity Activity Recognition Challenge 2011, and the Up-Singapore Hackathon 2012. He currently serves as Secretary for the IEEE Signal Processing Society, Singapore.

Vincent Y. F. Tan (S’07–M’11) received the B.A. and M.Eng. degrees in electrical and information sciences from Sidney Sussex College, Cambridge University, Cambridge, U.K., in 2005, and the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2011. He was a Post-Doctoral Researcher with the Department of Electrical and Computer Engineering (ECE), University of Wisconsin-Madison, Madison, WI, USA, and a Scientist with the Institute for Infocomm Research, Agency for Science, Technology, and Research, Singapore. Since 2014, he has been an Assistant Professor with the Departments of ECE and Mathematics, National University of Singapore, Singapore. His current research interests include information theory as well as learning and inference of graphical models. Dr. Tan is a recipient of the Charles Lamb Prize in 2005, a Cambridge University Engineering Department Prize awarded annually to the top candidate in electrical and information sciences, and the MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize in 2011. He is a member of the IEEE Machine Learning for Signal Processing Technical Committee and a Technical Program Committee Member of the 2014 International Symposium on Information Theory.

John Z. F. Pang (S’13) received the B.Sc. degree (Hons.) in mathematical sciences from the Nanyang Technological University, Singapore, in 2013. He was a Research Intern with the Data Analytics Department, Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore, in 2012. He is currently a Research Engineer with the Computing Science Department, Institute for High Performance Computing, A*STAR. His current research interests include optimization, network science, complex systems, modeling, and simulation.
