
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 5, MAY 2015

A One-Class Kernel Fisher Criterion for Outlier Detection

Franck Dufrenois

Abstract— Recently, Dufrenois and Noyer proposed a one-class Fisher linear discriminant to isolate normal data from outliers. In this paper, a kernelized version of their criterion is presented. Whereas the original formulation relies on an iterative optimization process alternating between subspace selection and clustering, I show here that their criterion has an upper bound that makes these two problems independent. In particular, the estimation of the label vector is formulated as an unconstrained binary linear problem (UBLP), which can be solved using an iterative perturbation method. Once the label vector is estimated, an optimal projection subspace is obtained by solving a generalized eigenvalue problem. Like many other kernel methods, the performance of the proposed approach depends on the choice of the kernel. Constructed with a Gaussian kernel, I show that the proposed contrast measure is an efficient indicator for selecting an optimal kernel width. This property simplifies the model selection problem, which is typically solved by costly (generalized) cross-validation procedures. Initialization, convergence analysis, and computational complexity are also discussed. Lastly, the proposed algorithm is compared with recent novelty detectors on synthetic and real data sets.

Index Terms— Kernel Fisher criterion, outlier detection, unsupervised learning.

I. INTRODUCTION

IN THE field of kernel-based learning methods, the most commonly used and cited outlier detection method is certainly the one-class support vector machine (OC-SVM). Within the well-established formalism of SVMs, OC-SVM tries to isolate a target data subset from the rest of the data set either by a hyperplane [2], [3] or a hypersphere [4], [5], but without using class labels. This method is therefore considered unsupervised. However, in OC-SVM, the characterization of outliers is not based on a specific identification criterion but is reduced to the selection of its free parameters, that is, the expected fraction of outliers and the kernel width. In practical use, prior knowledge of these parameters is unrealistic, and their choice is rather left to the user. Indeed, labeled samples from the outlier population are usually not available, and the idea that identifying outliers in a data set requires specifying the expected fraction of outliers in advance is a conceptual shortcoming. To avoid what has been described as the dog chasing its tail problem, Tax and Duin [6] introduce artificial outliers in the one-class classifier to obtain

an estimate of the fraction of outliers. This strategy allows the one-class classifier to minimize the volume it occupies in feature space and provides a better fitting of the decision boundary around the target data. Roth [7] proposes to identify outliers in a more formal way. Indeed, he considers that data mapped into the induced feature space can be explained by a Gaussian model, provided that the free parameters of the approach are correctly chosen. This link then allows unsupervised identification of outliers by way of hypothesis testing. The ability to detect atypical data without a priori knowledge of their distribution constitutes serious progress on this issue. However, this method suffers from several disadvantages, mainly the need to compute a time-consuming normalization integral, which considerably reduces its field of application. Other works consider the possibility of collecting typical data from a labeled source data set and thus deriving reference statistics; it is then possible to predict outliers in the unlabeled target data set. This strategy falls into the category of semisupervised learning methods. In this case, outlier identification schemes can work in offline or online mode depending on the type of application studied. In real-time environments, recent fault diagnosis schemes for nonlinear dynamical systems have been proposed using neural networks [8] or the concept of reservoir computation models [9]. In offline mode, Gao et al. [10] propose a framework based on a finite mixture model which models both the data and the constraints imposed by labeled examples. Sugiyama et al. [11] propose an inlier-based outlier detector which uses the density ratio between the joint distributions of the training and test data. Recently, Li and Tsang [13] introduced a new relative outlier detector which combines the objective of maximum mean discrepancy (MMD) [14] and structural risk minimization to learn both a decision boundary and the data labels. The outlier/dominant discrimination is introduced by the MMD criterion, which is based on the Euclidean distance between the center of the labeled data set (normal data) and that of the unlabeled one. In [1], the outlier/dominant discrimination is formalized from the properties of the diagonal elements of the hat matrix, which are traditionally used as outlier diagnostics in linear regression. Considering a Gaussian model for the data, a one-class Fisher discriminant is then derived, but it is limited to linear target data. In this paper, we propose to extend their work by defining a one-class kernel Fisher discriminant criterion to separate atypical data from nonlinear normal data. The maximization of the proposed criterion provides both an optimal projection subspace from which we can derive a decision boundary
and an optimal indicator vector for data classification. We show that these two tasks can be separated. First, the estimation of the optimal state of the indicator vector is formulated as the maximization of an unconstrained binary linear problem (UBLP), easily solved with a perturbation method. Second, the subspace selection problem is equivalent to solving a generalized eigenvalue problem. Constructed with the Gaussian kernel function, our contrast measure overcomes one of the main shortcomings of most previous methods. Indeed, the proposed criterion does not require prior assumptions on the expected fraction of outliers. The kernel width corresponding to the optimal separation yields a maximal response of our contrast measure regardless of the contamination rate. This property simplifies the model selection problem, which is traditionally solved by costly cross-validation procedures.

II. NOTATIONS

We will denote by S, O, and D the whole data population of size n, the outlier population of size $n_O$, and the dominant population of size $n_D$, respectively. Let $1_n$ be the unitary row vector of size n and $I_n$ the corresponding identity matrix. Let $Z = [z_1, z_2, \ldots, z_n]$ be the corresponding data matrix, where the ith column $z_i \in \mathbb{R}^d$. The class labels of the data sets O and D will be defined by two binary indicator vectors $1_O$ and $1_D$ ($1_D = 1 - 1_O$), respectively, where the ith component $1_{O_i} = 1$ if $z_i \in O$ and 0 if $z_i \in D$. We define $I_D = \mathrm{diag}(1_D)$ and $I_O = \mathrm{diag}(1_O)$ as the binary diagonal indicator matrices corresponding to the sets D and O, respectively. We will denote by $\Pi_P = I - (1/|P|)\, 1_P^T 1_n$ the centering matrix with respect to the data population P ($P \in \{S, O, D\}$), where $1_P$ is the binary indicator row vector of P and $|P|$ its cardinality. Classification of nonlinear data sets can be successfully achieved by linear techniques in a feature space induced by kernel functions. In this framework, we call $\varphi: \chi \to \mathcal{H}$ a mapping from the input space $\chi$ to a high-dimensional or infinite-dimensional Hilbert space $\mathcal{H}$, and let $\Phi = (\varphi(z_1), \varphi(z_2), \ldots, \varphi(z_n))$ be the sequence of images of the input data matrix Z in $\mathcal{H}$. In practice, the mapping $\varphi$ is advantageously replaced by a positive definite kernel function $k: \chi \times \chi \to \mathbb{R}$ that encodes the inner product in $\mathcal{H}$. Thus, we can construct a kernel matrix or Gram matrix K, which is equal to the product $\Phi^T \Phi$. In this paper, we will consider the Gaussian kernel Gram matrix $K(\sigma)$, which is parameterized by a bandwidth $\sigma$.

III. ASSUMPTIONS

Outlier detection is a challenging task, especially if the problem is treated in an unsupervised way. The lack of a priori knowledge about outliers leads statisticians to infer a statistical model for the typical data in order to predict outliers. In this paper, we propose to differentiate the two data subsets from the density that they have in the feature space. In particular, a data population will be considered dominant or typical if its density is higher than that of the outlier population and it represents the majority of the whole data population, that is, $n_D/n > 0.5$. The density of a data population over
another can be characterized by the kernel Gram matrix, which displays a specific partitioning when the kernel parameter reaches a given value. For the sake of discussion, we assume that the data matrix Z is reordered such that the first $n_D$ data points are in the set D and the next $n_O$ are in O. Then, the Gaussian kernel Gram matrix K exhibits the following partitioning:

$$K = \begin{pmatrix} K_D & K_{DO} \\ K_{DO}^T & K_O \end{pmatrix} \qquad (1)$$

where $K_D \in \mathbb{R}^{n_D \times n_D}$, $K_O \in \mathbb{R}^{n_O \times n_O}$, and $K_{DO} \in \mathbb{R}^{n_D \times n_O}$ are, respectively, the kernel matrices defined for the dominant domain, the outlier domain, and the cross-domain between the dominant and the outlier domains. Then, let us consider the following proposition.

Proposition 1: The dominant population will be said to be ideally separable from the outlier population if the Gaussian kernel Gram matrix $K(\sigma)$ tends toward a specific configuration, or ideal configuration, when its bandwidth $\sigma$ reaches an optimal value $\sigma_{opt}$ or a certain range of values

$$\begin{cases} K_D(\sigma) \underset{\sigma \to \sigma_{opt}}{\longrightarrow} 1^T 1 \\ K_O(\sigma) \underset{\sigma \to \sigma_{opt}}{\longrightarrow} I \\ K_{DO}(\sigma) \underset{\sigma \to \sigma_{opt}}{\longrightarrow} 0. \end{cases} \qquad (2)$$
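As a small numerical illustration of the ideal configuration (2) (not taken from the paper: the kernel convention exp(−||z_i − z_j||²/(2σ²)), the toy cluster parameters, and the helper name are assumptions of this sketch), one can inspect how the three blocks of K(σ) behave as σ varies:

```python
import numpy as np

rng = np.random.default_rng(0)
Z_D = rng.normal(0.0, 0.5, (80, 2))          # tight dominant cluster (assumed toy data)
Z_O = rng.uniform(-6.0, 6.0, (20, 2))        # scattered outliers (assumed toy data)
Z = np.vstack([Z_D, Z_O])                    # reordered as in the text: D first, then O

def gaussian_kernel(Z, sigma):
    """K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)) -- assumed kernel convention."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

for sigma in (0.1, 1.0, 10.0):
    K = gaussian_kernel(Z, sigma)
    K_D, K_O, K_DO = K[:80, :80], K[80:, 80:], K[:80, 80:]
    print(sigma,
          np.abs(K_D - 1.0).mean(),           # small when K_D is close to 1^T 1
          np.abs(K_O - np.eye(20)).mean(),    # small when K_O is close to I
          np.abs(K_DO).mean())                # small when K_DO is close to 0
```

For intermediate bandwidths the three deviations are simultaneously small, which is the regime Proposition 1 refers to.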

The degree of separability of the data set S will depend on the distance between its kernel Gram matrix and the ideal partitioning model (2), and will impose the conditions of convergence for the proposed algorithm. Of course, the reader must note that the assumption (2) represents an ideal scenario which will serve as useful input for the following demonstrations.

IV. PREVIOUS WORK

Recently, Dufrenois and Noyer [1] proposed a new contrast measure to isolate a target data population D corrupted by an outlier population O. This measure is derived from the properties of the subspace decomposition of the hat matrix, traditionally used as an outlier diagnostic in linear regression. Its formulation is very similar to that of the one-class Fisher linear discriminant. More precisely, let $F \in \mathbb{R}^{d \times m}$ be a linear transformation that maps each $z\,(\in \mathbb{R}^d)$ into a smaller dimensional space as follows: $F: \mathbb{R}^d \to \mathbb{R}^m$, $z \mapsto \tilde z = F^T z$, with $m < d$. Following [1], the separation between the two data populations consists in finding the optimal projective subspace F that maximizes or minimizes the following contrast measure:

$$C_{X/Y}(F) = \frac{F^T E_X F}{F^T E_Y^{\delta} F} \qquad (3)$$

where $E_X$ and $E_Y^{\delta}$ are two suitably defined sample covariance matrices and $\delta$ is a regularization parameter which prevents the numerical instabilities introduced by the inversion of $E_Y$. From the properties induced by the subspace decomposition of the hat matrix, Dufrenois and Noyer show that the projective subspace $F^*$ which optimally separates D from O
is conjointly the solution of the following criteria:

$$F^* = \arg\max_F C_{O/D}(F) = \arg\max_F C_{O/S}(F) = \arg\min_F C_{D/S}(F) \qquad (4)$$

with

$$\begin{cases} E_O = Z\,\Pi_D\, I_O\, \Pi_D^T\, Z^T \\ E_D = Z\,\Pi_D\, I_D\, \Pi_D^T\, Z^T \\ E_S^{\delta} = Z\,\Pi_D\, \Pi_D^T\, Z^T + \delta I \end{cases} \qquad (5)$$

where $E_O \in \mathbb{R}^{d \times d}$ is logically named the outlier covariance matrix, $E_D \in \mathbb{R}^{d \times d}$ the dominant covariance matrix, and $E_S^{\delta} \in \mathbb{R}^{d \times d}$ the regularized total covariance matrix. As can be seen from (5), the covariance matrices are centered with respect to D, which means that the data subset D serves as the statistical reference. Each criterion in (4) is equivalent to solving a generalized eigenvalue problem (see [1]). Of course, these criteria are only meaningful if the data are labeled, that is, if the indicator vector $1_O$ ($1_D$) is known beforehand. In practice, labeled data covering diversified outlier instances are rarely available or would be too costly to obtain. We therefore consider the vector $1_O$ as a new unknown of our problem. In the sequel, we propose to study the criterion $C_{O/S}(F, 1_O)$, which represents the contrast measure between the outlier population and the whole data set, and to formulate the problem of separation between the sets D and O as the following maximization-type problem:

$$(F^*, 1_O^*) = \arg\max_{F,\, 1_O} C_{O/S}(F, 1_O). \qquad (6)$$

Unfortunately, the previous problem is not well posed because, whatever F, we have $C_{O/S}(F, 1_O) \le C_{O/S}(F, 1)$ and then $1_O^* = 1$ is a trivial solution of (6). In the next section, we propose a slightly modified version of the contrast measure $C_{O/S}$.

V. MODIFIED CONTRAST MEASURE

Let $\bar 1_O = (n_D/(n\,n_O))^{1/4}\, 1_O$ be a weighted version of the binary indicator vector $1_O$ and $\bar I_O = \mathrm{diag}(\bar 1_O)\,\mathrm{diag}(\bar 1_O) = (n_D/(n\,n_O))^{1/2}\, \mathrm{diag}(1_O)$ be its corresponding weighted indicator matrix. Then, we define the following contrast measure:

$$\bar C_{O/S}(F, \bar 1_O) = \frac{F^T \left(Z\,\Pi_D\, \bar I_O\, \Pi_D^T\, Z^T\right) F}{F^T \left(Z\,\Pi_D\, \Pi_D^T\, Z^T + \delta I\right) F}. \qquad (7)$$
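As a small numerical check of the weighting used in (7) (the population sizes below and the reordering with the dominant points first are illustrative assumptions), the following NumPy lines verify the identity $\mathrm{diag}(\bar 1_O)\,\mathrm{diag}(\bar 1_O) = (n_D/(n\,n_O))^{1/2}\,\mathrm{diag}(1_O)$:

```python
import numpy as np

n_D, n_O = 100, 20                                     # assumed sizes
n = n_D + n_O
one_O = np.r_[np.zeros(n_D), np.ones(n_O)]             # binary indicator 1_O (D first, then O)
one_O_bar = (n_D / (n * n_O)) ** 0.25 * one_O          # weighted indicator vector
I_O_bar = np.diag(one_O_bar) @ np.diag(one_O_bar)      # weighted indicator matrix of Eq. (7)
assert np.allclose(I_O_bar, np.sqrt(n_D / (n * n_O)) * np.diag(one_O))
```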

The binary outlier indicator matrix $I_O$ in $E_O$ [see (5)] has been replaced by its weighted version. The factor $(n_D/(n\,n_O))^{1/2}$ in $\bar I_O$ weights the influence of the outlier population in the computation of the outlier covariance matrix by controlling the relative ratio between the size of the dominant population and the size of the outlier population. At first glance, such a regularizer may seem surprising but, as we will see later, its influence changes the nature of the problem and offers the conditions to turn an ill-posed problem into a well-posed one. With these considerations, we propose to formulate our outlier detection problem as follows.

Proposition 2: If the data set S fulfils the partition model in (2), then the problem of separation between the dominant set D and the outlier set O defined by

$$(F^*, 1_O^*) = \arg\max_{F,\, 1_O} \bar C_{O/S}(F, 1_O) \qquad (8)$$

is well posed. In this case, the pair $(F^*, 1_O^*)$ optimally separates the dominant set D from the outlier set O, that is, in the sense of the proposed contrast measure $\bar C_{O/S}(F, 1_O)$.

In the sequel, we present a methodology to find the optimal projective subspace $F^*$ and the optimal outlier indicator $1_O^*$.

VI. SEPARATED COMPUTATION

A. Upper Bound of $\bar C_{O/S}(F, 1_O)$

The proposed contrast measure (7) is equivalent to a Rayleigh-like quotient, which is among the most popular LDA criteria. In standard LDA, subspace selection and class labels are traditionally estimated in an iterative fashion [15], [16]. This kind of algorithm, summarized under the name discriminant clustering, assumes that the data clusters are statistically well defined, that is, the intra/intercluster distances are well captured. In this situation, K-means algorithms can be used efficiently. However, this assumption cannot be maintained in our paper since the outlier cluster is not statistically identifiable. Recently, Ye et al. [17] have shown that the objective function in discriminant clustering has an upper bound which is independent of subspace selection. This remarkable result simplifies the optimization problem and provides a direct extension to nonlinear data using the kernel trick. Relying on this result, we propose the following theorem.

Theorem 1: Let $G = \Pi_D^T\, Z^T Z\, \Pi_D$ be the $n \times n$ symmetric, positive semidefinite Gram matrix centered with respect to the set D, and let $\delta$ be a fixed regularization parameter. For any F and $1_O$, the contrast measure $\bar C_{O/S}(F, 1_O)$ has the following upper bound:

$$\bar C_{O/S}(F, 1_O) \le h(1_O) \qquad (9)$$

where $h(1_O)$ is defined by

$$h(1_O) = \bar 1_O^T\, \mathrm{diag}(L_{11}, L_{22}, \ldots, L_{nn})\, \bar 1_O \qquad (10)$$

where $\mathrm{diag}(L_{11}, L_{22}, \ldots, L_{nn})$ is a diagonal matrix composed of the diagonal elements of the following matrix:

$$L = I - \left(I + \frac{G}{\delta}\right)^{-1}. \qquad (11)$$

This theorem is a direct modification of Ye's theorem (see [17] for the proof). It is now obvious that the Gram matrix G allows us to switch easily from the linear to the nonlinear case by using the kernel trick. With the notations defined in Section II, it is straightforward to deduce the expression of the kernel Gram matrix centered with respect to the set D

$$G_K = \Pi_D^T\, \Phi^T \Phi\, \Pi_D = \Pi_D^T\, K\, \Pi_D. \qquad (12)$$

Since the Gaussian kernel has been chosen throughout this paper, the kernel Gram matrix will be implicitly parameterized by the bandwidth $\sigma$: $G_K(\sigma)$.
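As a minimal illustration of how $G_K(\sigma)$ and the diagonal of L can be formed in practice, here is a short NumPy sketch; it assumes a precomputed Gaussian kernel matrix K and a boolean mask marking the current dominant set, the helper name is hypothetical, and the nonsymmetric form of $\Pi_D$ follows the reindexed expression (35) given later in Section VIII:

```python
import numpy as np

def L_diagonal(K, dominant_mask, delta=0.1):
    """Diagonal of L = I - (I + G_K/delta)^(-1), with G_K = Pi_D^T K Pi_D [Eqs. (11)-(12)]."""
    n = K.shape[0]
    n_D = dominant_mask.sum()
    one_D = dominant_mask.astype(float)[None, :]           # indicator row vector 1_D
    Pi_D = np.eye(n) - one_D.T @ np.ones((1, n)) / n_D     # Pi_D = I - (1/n_D) 1_D^T 1_n
    G_K = Pi_D.T @ K @ Pi_D                                # centered kernel Gram matrix
    L = np.eye(n) - np.linalg.inv(np.eye(n) + G_K / delta)
    return np.diag(L).copy()
```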


B. Computation of $1_O$

With the inequality (9), if the vector $1_O^*$ is an optimal solution of problem (8), it is also one for the upper bound $h(1_O)$, and vice versa. Then, we propose to formulate the computation of the state of $1_O$ as a linear programming problem based on the upper bound $h(1_O)$ defined in (10).

Proposition 3: If $1_O$ is the solution of (8), it is also the solution of

$$\max_{\bar 1_O}\; h(\bar 1_O) \quad \text{subject to}\quad \bar 1_O \in \left\{0,\ \left(\frac{n_D}{n\,n_O}\right)^{1/4}\right\}^n \qquad (13)$$

where $h(\bar 1_O)$ is a concave function and the global optimum of (13) gives an optimal separation between the set D and the set O when the bandwidth $\sigma$ reaches a critical value or a certain range of values.

Equation (13) is similar to an unconstrained binary linear problem (UBLP) where $1_O$ is an n-dimensional binary variable. At first glance, (13) is a nondeterministic polynomial time hard problem, commonly solved by exact methods such as branch and bound techniques or heuristic methods such as simulated annealing, tabu search, and genetic algorithms. Moreover, (13) introduces a further difficulty since the diagonal coefficients of the matrix L in (11) also depend on the solution through the centering matrix $\Pi_D$. This specificity makes it difficult to use the previous optimization schemes in a standard way. However, as h is assumed to be a concave objective function, the optimization of (13) can be simplified. In the sequel, we propose an original optimization scheme whose main steps are based on the proof of Proposition 3.

Proof: Without loss of generality and for simplicity purposes, let us consider the reindexing of the data matrix Z as defined in Section III. From this notation, let $(1_{O(k)})_{k=0}^{n}$ be a sequence of binary vectors indexed by k, where the kth element of the sequence is defined by $1_{O(k)} = (\underbrace{0, \ldots, 0}_{k}, \underbrace{1, \ldots, 1}_{n-k})$. The vector $1_{O(k)}$ represents the state of the reordered outlier indicator vector at the index k, where the first k terms belong to the set D and the others to O. Now, from this notation, it is obvious that the expected optimal state of $1_{O(k)}$ will be reached for $k = n_D$. Following the same reasoning, the weighted outlier indicator vector $\bar 1_O$ can be replaced by its reordered version $\bar 1_{O(k)} = (k/(n(n-k)))^{1/4}\, 1_{O(k)}$. Let $(h_{(k)})_{k=0}^{n}$ be the sequence generated from the values of $h(\bar 1_O)$. The kth element of this sequence is given by

$$h_{(k)} = h(\bar 1_{O(k)}) = l_{(k)} \left(\frac{k}{n(n-k)}\right)^{1/2} \qquad (14)$$

where the term $l_{(k)}$ is defined by

$$l_{(k)} = \begin{cases} \mathrm{trace}(L) & \text{if } k = 0 \\ \mathrm{trace}(L) - \sum_{i=1}^{k} L_{ii} & \text{otherwise.} \end{cases} \qquad (15)$$
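A short numerical sketch of the sequence (14)-(15) (the two-level diagonal values and population sizes below are illustrative assumptions in the spirit of (16)) shows the peak of $h_{(k)}$ at $k = n_D$:

```python
import numpy as np

def h_sequence(L_diag_reordered):
    """Evaluate h_(k) of Eqs. (14)-(15) for k = 1..n-1, assuming the diagonal of L has been
    reordered so that its first entries correspond to the dominant set D."""
    n = len(L_diag_reordered)
    cum = np.cumsum(L_diag_reordered)
    ks = np.arange(1, n)
    l_k = L_diag_reordered.sum() - cum[:-1]          # l_(k) = trace(L) - sum_{i<=k} L_ii
    return ks, l_k * np.sqrt(ks / (n * (n - ks)))    # h_(k) = l_(k) (k / (n(n-k)))^(1/2)

# idealized two-level distribution of the diagonal of L, in the spirit of Eq. (16)
n_D, n_O, eps_D, eps_O = 100, 20, 0.05, 0.9          # assumed illustrative values
diag_L = np.concatenate([np.full(n_D, eps_D), np.full(n_O, eps_O)])
ks, h = h_sequence(diag_L)
print(ks[np.argmax(h)])                              # 100: the peak sits at k = n_D
```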


As $L \succeq 0$ (positive semidefinite), it is obvious that the sequence $(l_{(k)})_{k=0}^{n}$ is decreasing with the index k and that the trivial vector $1_{O(k=0)} = 1$ is the global solution of $h(1_O)$. This result confirms that an optimization problem based only on the objective function $h(1_O)$ would be ill posed. To simplify the demonstration, we consider an ideal situation in which the dominant population is clearly isolated from the outlier population or, in other words, its kernel Gram matrix shows an ideal partitioning as defined in (2). Under these conditions, the distribution of the diagonal coefficients of L clearly splits into two distinct subdistributions, one relative to the dominant population D and the other corresponding to the outlier population O

$$(L_{11}, L_{22}, \ldots, L_{nn}) \sim (\underbrace{\varepsilon_D, \ldots, \varepsilon_D}_{n_D},\ \underbrace{\varepsilon_O, \ldots, \varepsilon_O}_{n_O}) \qquad (16)$$

with $(\varepsilon_D, \varepsilon_O) \in [0, 1]^2$. By substituting (16) into (14), $(h_{(k)})_{k=0}^{n}$ can be expressed from two adjacent sequences $(h_{1(l)})_{l=0}^{k}$ and $(h_{2(l)})_{l=k+1}^{n}$, whose lth elements are defined by

$$\begin{cases} h_{1(l)} = \left(n_O\, \varepsilon_O + (n_D - l)\,\varepsilon_D\right) \left(\frac{l}{n(n-l)}\right)^{1/2} \\[1mm] h_{2(l)} = (n - l)\, \varepsilon_O \left(\frac{l}{n(n-l)}\right)^{1/2}. \end{cases} \qquad (17)$$

To show the concavity of $(h_{(k)})_{k=0}^{n}$, the behavior of the sequences $(h_{1(l)})_{l=0}^{k}$ and $(h_{2(l)})_{l=k+1}^{n}$ in their respective intervals is studied. Our analysis is restricted to the case $n_D > n_O$, which is the main assumption of our paper. Under this condition, we propose to show that the sequences $h_1$ and $h_2$ have an increasing and a decreasing behavior, respectively.

1) Decrease of the Sequence $h_2$: First, we show that the sequence $h_2$ is decreasing. Indeed, assume that the sequence $h_2$ is strictly decreasing, that is, $h_{2(k)} > h_{2(k+1)}$ for every k. We obtain the following inequality:

$$2k + 1 > n \qquad (18)$$

which is always verified from $k = n_D$ since we assume that $n_D > n/2$.

2) Increase of the Sequence $h_1$: The goal here is to show that the sequence $h_1$ is strictly increasing until the value $k = n_D$ when the ratio $\varepsilon_O/\varepsilon_D$ reaches a certain range of values. Thus, assume that the relation $h_{1(k)} < h_{1(k+1)}$ is true, which is equivalent to verifying the following inequality:

$$u_{(k)}\, u_{(n-k-1)} < v_{(k)} \quad \text{for } k \in \{0, \ldots, n-1\} \qquad (19)$$

where $u_{(k)} = (k/(k+1))^{1/2}$ and $v_{(k)} = 1 - 1/(n_O(\varepsilon_O/\varepsilon_D) + n_D - k)$. It is easy to show that the sequence $(u_{(k)})_{k=0}^{n-1}$ is strictly increasing whereas $(v_{(k)})_{k=0}^{n-1}$ is strictly decreasing, both of them being lower than 1. The sequence $(u_{(n-k-1)})_{k=0}^{n-1}$ in (19) is the sequence $(u_{(k)})_{k=0}^{n}$ subjected to an axial reflection ($-k$) and a translation ($n-1$). Then, the sequence $(u_{(n-k-1)})_{k=0}^{n-1}$ is strictly decreasing and has an axis of symmetry at $k = \lfloor (n-1)/2 \rfloor$ with the sequence $(u_{(k)})_{k=0}^{n-1}$, where the symbol $\lfloor x \rfloor$ denotes the integer part of x.


Fig. 1. Example: (a) noisy linear cluster (•) with outliers (+). (b) Distribution of the diagonal coefficients of the matrix L for σ = 1 (• dominant pop., + outliers pop.). (c) Behavior of the corresponding objective function h (k) with respect to the reindexed indicator vector and for different values of the bandwidth σ .

Thus, we can conclude that the product $(u_{(k)}\, u_{(n-k-1)})_{k=0}^{n-1}$ defines a concave sequence bounded by one, with a maximum equal to $(n-1)/(n+1)$ at $k = \lfloor (n-1)/2 \rfloor$. The sequence $(v_{(k)})_{k=0}^{n-1}$ is also lower than one but decreasing, and it is parameterized by the ratio $\varepsilon_O/\varepsilon_D$. The break in the increase of $h_1$ is reached when the index k verifies $u_{(k)}\, u_{(n-k-1)} = v_{(k)}$, and its value varies with respect to the ratio $\varepsilon_O/\varepsilon_D$. Thus, the problem is then to find the value of the ratio $\varepsilon_O/\varepsilon_D$, or the value of $\sigma$, which shifts the intersection of the sequence $(v_{(k)})_{k=0}^{n-1}$ with $(u_{(k)} u_{(n-k-1)})_{k=0}^{n-1}$ at least until a value $k \ge n_D$, that is, when the sequence $(v_{(k)})_{k=0}^{n-1}$ becomes an upper bound of $(u_{(k)} u_{(n-k-1)})_{k=0}^{n-1}$. After some algebraic manipulations, we deduce from (19) the value of $\varepsilon_O/\varepsilon_D$ which constrains the sequence $h_1$ to be strictly increasing until $k = n_D$

$$\frac{\varepsilon_O}{\varepsilon_D} \ge \beta \qquad (20)$$

where

$$\beta = \frac{1}{n_O}\left(\max_k\left(\frac{k}{1 - u_{(k)}\, u_{(n-k-1)}} + k\right) - n_D\right).$$

$\beta$ provides a theoretical lower bound of the ratio $\varepsilon_O/\varepsilon_D$ in order that the global optimum of (13) corresponds to an optimal separation of the set S. Recall that this demonstration is based on the assumption that S verifies the partitioning model (2), which leads to a binary partition of the distribution of the diagonal coefficients of L. Of course, for real data sets, this partitioning model is not so contrasted and the partitioning of the distribution of the $L_{ii}$ is more blurred. Indeed, the two subdistributions show more variability but tend on average toward two values which become more and more distinct when the bandwidth is properly selected.

3) Illustration of the Behavior of $h_{(k)}$ With Respect to the Bandwidth $\sigma$: To illustrate the previous analysis, we generate a toy data set composed of a noisy linear cluster of 100 data points [black dots in Fig. 1(a)] corrupted by 20 uniformly distributed outliers [symbols + in Fig. 1(a)]. We assume that this data set verifies approximately the partitioning model (2). Hence, the behavior of the diagonal elements of the matrix L computed from this data set is far from being ideal (binary partition) but shows more variability. However, when the value

of the kernel width reaches a specific value ($\sigma = 1$ in this example), we clearly identify in Fig. 1(b) two subdistributions of the $L_{ii}$, one relating to the dominant population [marks • in Fig. 1(b)] and the other to the outlier population [marks + in Fig. 1(b)]. Fig. 1(c) shows the behavior of the sequence $h_{(k)}$ (14) with respect to the reindexed indicator vector $1_{O(k)}$ and for different values of the bandwidth $\sigma$ in {0.01, 0.08, 0.1, 0.2, 0.3, 0.5, 1, 2, 4, 10, 100}. First, we can note that each curve $h_{(k)}(\sigma)$ is concave and has a distinct maximum, which is located by a vertical dashed line. Each sequence $h_{(k)}(\sigma)$ is explained by two adjacent sequences: an increasing sequence corresponding to the sequence $h_{1(k)}$ (17) and generated from the dominant population, and a decreasing one corresponding to the sequence $h_{2(k)}$ (17) and generated from the outlier population. Second, we clearly observe that there exists a range of values of $\sigma$ which provides a stationary maximum of $h_{(k)}(\sigma)$ corresponding to the optimal index $k = n_D$ ($n_D = 100$ in our example). In particular, for $\sigma = 1$, the peak of the sequence $h_{(k)}(\sigma)$ is clearly identified and localized at $k = n_D$. These results, obtained on a toy data set, confirm the properties stated in Proposition 3.

4) Maximization of Our Modified Problem: The previous analysis, based on the concavity of h, led us to propose an iterative algorithm that checks the sign of the objective function's gradient when the state of the indicator vector is locally perturbed. Let us define $1_O^{(t)}$ as the state of the indicator vector at step t and $1_O^{(t)}(j)$ as the state of this vector affected by a change of its jth component. Depending on whether the data point j belongs to the set $D^{(t)}$ or $O^{(t)}$ at step t, we have two possible variations of the jth component of $1_O^{(t)}$

$$\bar 1_O^{(t)}(j) = \begin{cases} \left(1_O^{(t)} + e(j)\right) \left(\frac{n_D^{(t)} - 1}{n\,(n_O^{(t)} + 1)}\right)^{1/4} & \text{if } j \in D^{(t)} \\[2mm] \left(1_O^{(t)} - e(j)\right) \left(\frac{n_D^{(t)} + 1}{n\,(n_O^{(t)} - 1)}\right)^{1/4} & \text{if } j \in O^{(t)} \end{cases} \qquad (21)$$

where $e(j) = (0, \ldots, 0, 1, 0, \ldots, 0)$, with the 1 in the jth position, is an elementary state.


Then, we deduce the expressions of $h(\bar 1_O^{(t)})$ and $h^{(t)}(j)$

$$h(\bar 1_O^{(t)}) = h^{(t)} = l^{(t)} \left(\frac{n_D^{(t)}}{n\, n_O^{(t)}}\right)^{1/2} \qquad (22)$$

$$h^{(t)}(j) = \begin{cases} \left(l^{(t)} + L_{jj}^{(t)}\right) \left(\frac{n_D^{(t)} - 1}{n\,(n_O^{(t)} + 1)}\right)^{1/2} & \text{if } j \in D^{(t)} \\[2mm] \left(l^{(t)} - L_{jj}^{(t)}\right) \left(\frac{n_D^{(t)} + 1}{n\,(n_O^{(t)} - 1)}\right)^{1/2} & \text{if } j \in O^{(t)} \end{cases} \qquad (23)$$

where $l^{(t)} = \mathrm{trace}(L) - \sum_{i \in D^{(t)}} L_{ii}^{(t)}$. As h is a concave objective function, this perturbation involves either a positive or a negative variation of h depending on which side of the concavity the current state of the vector $1_O^{(t)}$ is on. Thus, we can derive the following label assignment rules based on the sign of the objective function's gradient:

$$h^{(t)}(j \in D^{(t)}) - h^{(t)} \;\underset{j \in D^{(t+1)}}{\overset{j \in O^{(t+1)}}{\gtrless}}\; 0 \qquad (24)$$

$$h^{(t)}(j \in O^{(t)}) - h^{(t)} \;\underset{j \in O^{(t+1)}}{\overset{j \in D^{(t+1)}}{\gtrless}}\; 0. \qquad (25)$$

The binary rule (24) means that, if the data point j belongs to the set D at step t, then it will stay in the same index set if the sign of the objective function's gradient is negative, or will move to the set O otherwise. Following the same reasoning, we obtain the rule (25). In summary, (24) and (25) state that a positive variation of the objective function at the index j corresponds to a change of set, while a negative variation keeps the data point j in its current set. Thus, from (23)–(25), we can deduce the updating rule of the index set $O^{(t+1)}$ at step $t+1$

$$O^{(t+1)} = \left\{ m \in D^{(t)} \;/\; L_{mm}^{(t)} > t_D^{(t)} \right\} \cup \left( O^{(t)} \setminus \left\{ n \in O^{(t)} \;/\; L_{nn}^{(t)} < t_O^{(t)} \right\} \right) \qquad (26)$$

where the thresholds $t_D^{(t)}$ and $t_O^{(t)}$ are defined by

$$\begin{cases} t_D^{(t)} = \left(\left(\frac{n_D^{(t)}(n_O^{(t)} + 1)}{n_O^{(t)}(n_D^{(t)} - 1)}\right)^{1/2} - 1\right) l^{(t)} \\[2mm] t_O^{(t)} = \left(1 - \left(\frac{n_D^{(t)}(n_O^{(t)} - 1)}{n_O^{(t)}(n_D^{(t)} + 1)}\right)^{1/2}\right) l^{(t)}. \end{cases} \qquad (27)$$

Thus, each component of $O^{(t)}$ is updated from (26), and the method terminates when the variation $|h^{(t+1)} - h^{(t)}|$ is lower than a tolerance value. The main steps of the maximization of the criterion are summarized in Algorithm 1.

Algorithm 1: Proposed perturbation method
  Input: Gaussian kernel matrix K(σ)
  Output: 1_O^*
  Set tol ← 1e−4, t ← 0
  Random or supervised initialization of 1_O^(t=0) and O^(t=0)
  O^(t+1) ← O^(t=0)
  Compute:
    - the diagonal elements of L^(t)(σ)        (Eq. 11)
    - the objective function h^(t)             (Eq. 22)
    - the thresholds t_O^(t) and t_D^(t)       (Eq. 27)
  while |h^(t+1) − h^(t)| > tol   /* convergence */ do
    for j ∈ (1, 2, ..., n) do
      if j ∉ O^(t) then
        if L_jj^(t) > t_D^(t) then
          1_O^(t+1)(j) = 1 and O^(t+1) = O^(t+1) ∪ {j}
        end
      else
        if L_jj^(t) < t_O^(t) then
          1_O^(t+1)(j) = 0 and O^(t+1) = O^(t+1) \ {j}
        end
      end
    end
    Compute:
      - the diagonal elements of L^(t+1)(σ)    (Eq. 11)
      - the objective function h^(t+1)         (Eq. 22)
      - the thresholds t_O^(t+1) and t_D^(t+1) (Eq. 27)
    t ← t + 1
  end
  if 1_O^(t+1) == 1 then   /* no convergence */
    break
  end
  1_O^* ← 1_O^(t) and O^* = O^(t)
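To make the label-update loop concrete, here is a hedged Python/NumPy sketch of Algorithm 1; it reuses the hypothetical L_diagonal helper sketched after (12), the random initialization of roughly n/2 + 1 outlier labels mirrors the setup of Section X-A, and the simultaneous (vectorized) application of the thresholds (26)–(27), the max_iter cap, and the guards for tiny sets are implementation choices of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def perturbation_method(K, delta=0.1, tol=1e-4, max_iter=100, seed=0):
    """Sketch of Algorithm 1: returns a boolean outlier indicator (True = outlier),
    or None when the trivial labeling 1_O = 1 (or a degenerate split) is reached."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    is_out = np.zeros(n, dtype=bool)
    is_out[rng.choice(n, n // 2 + 1, replace=False)] = True   # random initialization
    h_prev = np.inf
    for _ in range(max_iter):
        D = ~is_out
        n_D, n_O = int(D.sum()), int(is_out.sum())
        if n_O == 0 or n_D <= 1:
            return None                                  # degenerate split
        Lii = L_diagonal(K, D, delta)                    # diagonal of L, Eqs. (11)-(12)
        l_t = Lii[is_out].sum()                          # l^(t) = trace(L) - sum_{i in D} L_ii
        h_t = l_t * np.sqrt(n_D / (n * n_O))             # objective h^(t), Eq. (22)
        if abs(h_t - h_prev) <= tol:
            break
        h_prev = h_t
        # thresholds of Eq. (27)
        t_D = (np.sqrt(n_D * (n_O + 1) / (n_O * (n_D - 1))) - 1.0) * l_t
        t_O = (1.0 - np.sqrt(n_D * max(n_O - 1, 0) / (n_O * (n_D + 1)))) * l_t
        # update rule of Eq. (26), applied to all indices at once
        is_out = (is_out | (D & (Lii > t_D))) & ~(is_out & (Lii < t_O))
    return None if is_out.all() else is_out
```

Returning None mirrors the "no convergence" branch of Algorithm 1, where the trivial solution $1_O^* = 1$ is rejected.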

5) Convergence Analysis: The condition for convergence has been previously proven for the ideal case, that is, when we assume an ideal partitioning of the kernel Gram matrix. Of course, in real situations, the partitioning is blurred and the objective function $h^{(t)}$ is noisy. Now, let us consider the general case. The condition for convergence is derived from the behavior of the sequence $h^{(t)}$ for $t = 1, 2, \ldots$. First, since the weighting factor $(n_D^{(t)}/(n(n - n_D^{(t)})))^{1/2}$ in $h^{(t)}$ (22) is upper bounded by 1 (we assume, of course, that $n_D^{(t)} \ne n$ for $t = 1, 2, \ldots$), we can conclude that the sequence $h^{(t)}$ is bounded by $\mathrm{trace}(L)$. Therefore, it is sufficient to show that $h^{(t)}$ is strictly monotonically increasing, that is, $h^{(t)} < h^{(t+1)}$ for $t = 1, 2, \ldots$. Let us suppose we start with an initial estimate $n_D^{(t=0)}$ or $n_O^{(t=0)}$ given by the initialization step. Let $J = D^{(t)} \cap O^{(t+1)}$ be the subset that moves from the set D to O and $I = O^{(t)} \cap D^{(t+1)}$ the subset that moves from the set O to D, between t and t+1, respectively. Then, we can easily deduce the expression of $h^{(t+1)}$

$$h^{(t+1)} = l^{(t+1)} \left(\frac{n_D^{(t+1)}}{n\, n_O^{(t+1)}}\right)^{1/2}$$

where $l^{(t+1)} = l^{(t)} - \sum_{i \in I} L_{ii}^{(t+1)} + \sum_{j \in J} L_{jj}^{(t+1)}$. Let $n_J^{(t+1)}$ and $n_I^{(t+1)}$ be the sizes of the sets J and I, respectively, and $\Delta n^{(t+1)} = n_I^{(t+1)} - n_J^{(t+1)}$ the difference in size between the two sets; then, with a little algebra, we obtain the following inequality:

$$l^{(t)}\, w^{(t)} > \sum_{i \in I} L_{ii}^{(t+1)} - \sum_{j \in J} L_{jj}^{(t+1)} \qquad (28)$$


with

$$w^{(t)} = 1 - \left(\frac{n_D^{(t)}\left(n - n_D^{(t)} - \Delta n^{(t+1)}\right)}{\left(n - n_D^{(t)}\right)\left(n_D^{(t)} + \Delta n^{(t+1)}\right)}\right)^{1/2}.$$

If (28) is verified at each step, the convergence of the algorithm is ensured. The convergence of the algorithm is strongly dependent on the choice of the kernel bandwidth. Indeed, if the bandwidth is not correctly selected, the number of data points that move from the set O to the set D may be exactly $n - n_D^{(t)}$ and, of course, no data point moves from D to the set O. Then the size variation is $\Delta n^{(t+1)} = n_I^{(t+1)} = n - n_D^{(t)}$, making $w^{(t)} = 1$ and $h^{(t+1)}$ unbounded ($n_D^{(t+1)} = n$). This situation corresponds to the solution $1_O^* = 1$, which is rejected (no convergence, see Algorithm 1).

VII. SUBSPACE ESTIMATION: A GENERALIZED EIGENVALUE PROBLEM

Let $\Phi_D = (\varphi_D(z_1), \varphi_D(z_2), \ldots, \varphi_D(z_n)) = \Phi\, \Pi_D$ be the sequence of images centered with respect to the sample mean of the set D. Consider the following transformation in the centered feature space:

$$w_D(z) = F^T \varphi_D(z). \qquad (29)$$

An important property of any kernel-based learning technique is that every solution $F \in \mathcal{H}$ can be written as an expansion in terms of mapped training data, that is

$$F = \sum_{i=1}^{n} \alpha_i\, \varphi(z_i) = \Phi\, \alpha^T \qquad (30)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$. Substituting $Z\,\Pi_D$ by $\Phi\,\Pi_D$ in the matrices $E_S^{\delta}$ and $E_O$ (5) and replacing F by (30) in (3), $\bar C_{O/S}$ can be formulated in the Hilbert space as

$$\bar C_{O/S}(\alpha, 1_O) = \frac{\alpha\, E_O\, \alpha^T}{\alpha\, E_S^{\delta}\, \alpha^T} \qquad (31)$$

where $E_S^{\delta} = K\, \Pi_D\, \Pi_D^T\, K + \delta I$ and $E_O = K\, \Pi_D\, \bar I_O\, \Pi_D^T\, K$ are the kernel versions of the covariance matrices for the data set S and the data subset O, respectively. Equation (31) is obtained by replacing the inner product $\Phi^T \Phi$ by the Gram matrix K (see Section II). The solution of (31) is equivalent to solving the following generalized eigenvalue problem (GEP):

$$E_O\, \alpha = \lambda\, E_S^{\delta}\, \alpha. \qquad (32)$$

As the matrix $E_O$ is symmetric positive definite, (32) provides positive eigenvalues. Then, the solution of (32) amounts to retaining the eigenvector $\alpha$ corresponding to the largest positive eigenvalue. Once the optimal vector $\alpha^*$ has been estimated, we can construct a decision function which separates the dominant population from the outliers. From (29) and (30), we deduce the expression of the decision function, which predicts whether a data point z belongs to D or to O, as

$$f(z) = \mathrm{sign}\left(|w_D^*(z)| - \mathrm{bound}\right) \qquad (33)$$

where $w_D^*(z) = \alpha^* \left(k_z - (1/n_D)\, K\, 1_D^T\right)$ with $k_z = \Phi^T \varphi(z)$, and $\mathrm{bound} = \max_{i \in D} |w_D^*(z_i)|$. The term called bound is computed from the data point of the set D corresponding to the maximal value of the decision function $w_D^*(z)$ and delimits the decision boundary isolating the dominant set from the outliers.
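The following sketch shows how the expansion coefficients α and the decision rule (33) can be obtained from a kernel matrix and a label vector (e.g., the output of the perturbation-method sketch). The function names are hypothetical, SciPy's symmetric-definite eigensolver is used for (32), and treating predicted outliers as the positive (+1) class is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def fit_subspace(K, is_out, delta=0.1):
    """Solve the generalized eigenvalue problem (32); return (alpha, largest eigenvalue)."""
    n = K.shape[0]
    D = ~is_out
    n_D, n_O = int(D.sum()), int(is_out.sum())
    one_D = D.astype(float)[None, :]
    Pi_D = np.eye(n) - one_D.T @ np.ones((1, n)) / n_D       # Pi_D = I - (1/n_D) 1_D^T 1_n
    I_O_bar = np.sqrt(n_D / (n * n_O)) * np.diag(is_out.astype(float))   # weighted indicator
    E_O = K @ Pi_D @ I_O_bar @ Pi_D.T @ K                    # kernel outlier covariance
    E_S = K @ Pi_D @ Pi_D.T @ K + delta * np.eye(n)          # regularized total covariance
    evals, evecs = eigh(E_O, E_S)                            # eigenvalues in ascending order
    return evecs[:, -1], evals[-1]                           # keep the largest eigenvalue

def decision_function(K_eval, K_train, is_out, alpha):
    """Decision rule of Eq. (33); returns +1 for predicted outliers, -1 otherwise."""
    D = ~is_out
    c = K_train[:, D].sum(axis=1) / D.sum()                  # (1/n_D) K 1_D^T
    w_eval = (K_eval - c[None, :]) @ alpha                   # w_D(z) = alpha (k_z - c)
    w_train = (K_train - c[None, :]) @ alpha
    bound = np.abs(w_train[D]).max()                         # max_{i in D} |w_D(z_i)|
    return np.sign(np.abs(w_eval) - bound)
```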

VIII. MODEL SELECTION

A recurrent issue in machine learning methods is the model selection problem. Our approach is not exempt from this problem since it is parameterized by two free parameters: the bandwidth $\sigma$ of the Gaussian kernel and the regularization parameter $\delta$.

A. Selection of the Bandwidth $\sigma$

We propose to show here that our one-class kernel Fisher criterion (31) provides a maximal response when the value of $\sigma$ reaches an optimal value $\sigma_{opt}$. This statement leads us to formulate the following proposition.

Proposition 4: Consider that the data set S fulfils the partition model (2). Let $\lambda(\sigma)$ be the value of the criterion $\bar C_{O/S}(\alpha^*, 1_O^*)$ for a given $\sigma$; then there is an optimal kernel width $\sigma_{opt}$ that verifies

$$\lambda(\sigma) < \lambda(\sigma_{opt}), \quad \text{for all } \sigma \ne \sigma_{opt} \qquad (34)$$

where the vectors $1_O^*$ and $\alpha^*$ in $\bar C_{O/S}$ have been beforehand estimated by the perturbation method and by solving (32), respectively.

Proof: First, we observe that the covariance matrices $E_S^{\delta}$ and $E_O$ are based on the same matrix composition $\Pi_D\, A\, \Pi_D^T$ with $A = I_O$ or $A = I$. It is obvious that $\Pi_D\, A\, \Pi_D^T$ is positive semidefinite (psd) and, since the kernel Gram matrix K is psd, $E_S^{\delta}$ and $E_O$ are also psd. Now, let us detail the matrix composition $\Pi_D\, A\, \Pi_D^T$. Considering the arrangement adopted in (1), the reindexed centering matrix $\Pi_D$ is defined by

$$\Pi_D = \begin{pmatrix} U & V \\ 0 & I \end{pmatrix} \qquad (35)$$

with $V = -(1/n_D)\, 1_{n_D}^T 1_{n_O}$ and $U = I_{n_D \times n_D} - (1/n_D)\, 1_{n_D}^T 1_{n_D}$. Contrary to $\Pi_S$, $\Pi_D$ is a nonorthogonal projector ($\Pi_D = \Pi_D^2$, but $\Pi_D \ne \Pi_D^T$). However, the composition $\Pi_D\, A\, \Pi_D^T$ is symmetric. Indeed, we obtain

$$\Pi_D\, I_O\, \Pi_D^T = \begin{pmatrix} W & V \\ V^T & I \end{pmatrix} \qquad (36)$$

$$\Pi_D\, I\, \Pi_D^T = \begin{pmatrix} W + U & V \\ V^T & I \end{pmatrix} \qquad (37)$$

with $W = (n_O/n_D^2)\, 1_{n_D}^T 1_{n_D}$. Then, it is obvious that $E_O$ and $E_S^{\delta}$ only differ in the content of their first diagonal block. Using the fact that the diagonal blocks of a psd matrix are also psd, we can conclude that the submatrices W and W + U in (36) and (37) are psd. As the matrix U is also psd, we can deduce the following result:

$$\frac{\alpha\, E_O\, \alpha^T}{\alpha\, E_S^{\delta}\, \alpha^T} \le 1 \quad \text{for any } \sigma.$$


The ratio gets closer to one as $\sigma$ tends to $\sigma_{opt}$. Indeed, consider the ideal case where

$$K \underset{\sigma \to \sigma_{opt}}{\longrightarrow} \begin{pmatrix} 1^T 1 & 0 \\ 0 & I \end{pmatrix}$$

then it is straightforward to show that

$$1^T 1 \cdot U \cdot 1^T 1 \underset{\sigma \to \sigma_{opt}}{\longrightarrow} 0$$

which completes the proof.

B. Selection of the Regularization Parameter $\delta$

The choice of the regularization parameter is by itself a fundamental problem in learning theory, since the performance of most regularized learning algorithms crucially depends on the choice of such a parameter. In our algorithm, a regularization parameter $\delta$ is added to the total scatter matrix $E_S$ to overcome the numerical instability of its inversion for high-dimensional data sets. In [18], a general statement is mentioned which stipulates that a nonnegligible value of the regularization parameter $\delta$ can improve the separability of the two populations. This tendency seems to be verified in all the experiments studied here, and we propose to set $\delta = 1e{-}1$ in this paper. Of course, a more detailed study will have to be conducted in future work.

C. Overall Implementation Details

The overall methodology, logically named discriminative kernel hat matrix (DKHM) by reference to the original work in [1], is presented in Algorithm 2.

Algorithm 2: DKHM
  Input: Z_train, Z_test (if given), X: range interval for σ
  Output: 1_O_opt and α_opt
  Set δ (default value = 0.1)
  C_max ← −inf
  for σ in 10^X do
    Compute:
      - the Gaussian kernel matrix K(Z_train, σ)
      - the indicator vector 1_O^* (see Algorithm 1)
      - the covariance matrices E_O and E_S^δ
      - the subspace vector α^* by (32)
      - the Fisher score C̄_{O/S}(α^*, 1_O^*) by (31)
    if C̄_{O/S}(α^*, 1_O^*) > C_max then
      C_max = C̄_{O/S}(α^*, 1_O^*)
      α_opt = α^*
      1_O_opt = 1_O^*
    end
  end
  if Z_test is not empty then   /* prediction step */
    f(Z_test) by (33)
  end
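Chaining the earlier sketches gives a compact, hypothetical version of the DKHM loop; the default grid 10^X with X = {−2, ..., 2} follows the setup described in Section IX, and gaussian_kernel, perturbation_method, and fit_subspace refer to the illustrative helpers sketched above, not to released code.

```python
import numpy as np

def dkhm(Z_train, sigmas=10.0 ** np.arange(-2, 3), delta=0.1):
    """Sweep the kernel width and keep the (sigma, labels, alpha) with the largest contrast."""
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(Z_train, sigma)           # sketched in Section III
        is_out = perturbation_method(K, delta=delta)  # Algorithm 1 sketch
        if is_out is None:                            # trivial or degenerate labeling: skip
            continue
        alpha, contrast = fit_subspace(K, is_out, delta=delta)   # GEP (32)
        if best is None or contrast > best[0]:
            best = (contrast, sigma, is_out, alpha)
    return best                                       # (C_max, sigma_opt, 1_O_opt, alpha_opt)
```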


IX. EXPERIMENTS

A. Data Sets

The experiments are conducted on two noisy synthetic data sets and seven real-world data sets. The statistics of the studied data sets are shown in Table I.

TABLE I: DATA SETS

More precisely, the sine-noise distribution is composed of a sine wave perturbed by an outlier population [Fig. 3(c)]. The sine wave consists of 200 samples uniformly distributed along $y = 4 + 2\sin(2\pi x/10) + e$ with $x \in [-5, 5]$ and $e \sim N(0, \sigma^2 = 0.7)$. The outlier population is composed of 40 noise points randomly drawn from the area $\{(x, y)\,|\, x \in [0, 4],\ y \in [0, 4]\}$ and 10 noise points randomly drawn around the vertical line $x = -2$. The ring-noise distribution is based on a dominant elliptically shaped data subset (200 data points) generated from the following polar model: $x(\theta) = 8\cos(\theta) + e_x$, $y(\theta) = -30 + 12\sin(\theta) + e_y$ with $\theta \in [0, 2\pi]$, $e_x \sim N(0, \sigma^2 = 3)$ and $e_y \sim N(0, \sigma^2 = 1.0)$ [Fig. 3(a)]. The outlier population consists of two residual Gaussian clusters. The first one (40 data points) is generated from a bivariate normal model with mean $(-5, 0)$ and covariance matrix $\mathrm{diag}(2, 30)$, and the second one (10 data points) is defined by its mean $(2, -30)$ and its covariance matrix $\mathrm{diag}(4, 10)$.

Next, the Diabete, Heart, B-cancer, and Sonar data sets are real two-class problems which are often used as benchmark data in numerous works on outlier detection or novelty detection [12], [19]. In our experiments, we artificially convert each data set into a single-class problem by considering the larger class as the dominant set and the other as the outliers. Another experiment commonly used for illustrating the outlier detection problem is the recognition of handwritten digits. In the MNIST database, the data are divided into a training set and a test set, each consisting of 28 × 28 gray-level images of digits. To illustrate the performance of our approach, we propose to recognize the digit 1 as the dominant set and to classify the other digits (digits different from 1 and erroneous handwritten digits) in the outlier set. Moreover, we reduce the dimensionality of the original data (d = 784) by using a supervised projection method, namely LDA. We train the LDA approach to separate the digit 1 from the digit 0 and retain the most significant LDA vectors; five vectors have been retained. Then, both the training set vectors and the test set vectors are projected onto the subspace spanned by the five LDA vectors.
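For reproducibility of the toy experiments, a possible NumPy generation of the two synthetic sets follows; the spread of the 10 points around the line x = −2 and their vertical extent are not specified in the text and are therefore arbitrary assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)

# sine-noise: 200 inliers on y = 4 + 2 sin(2 pi x / 10) + e, e ~ N(0, sigma^2 = 0.7),
# plus 40 outliers uniform in [0,4]x[0,4] and 10 outliers around the vertical line x = -2
x = rng.uniform(-5.0, 5.0, 200)
inliers = np.column_stack([x, 4.0 + 2.0 * np.sin(2.0 * np.pi * x / 10.0)
                              + rng.normal(0.0, np.sqrt(0.7), 200)])
out_box = rng.uniform(0.0, 4.0, (40, 2))
out_line = np.column_stack([-2.0 + rng.normal(0.0, 0.2, 10),   # horizontal spread: assumed
                            rng.uniform(0.0, 8.0, 10)])        # vertical extent: assumed
sine_noise = np.vstack([inliers, out_box, out_line])

# ring-noise: elliptic dominant cluster plus two residual Gaussian outlier clusters
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
ring = np.column_stack([8.0 * np.cos(theta) + rng.normal(0.0, np.sqrt(3.0), 200),
                        -30.0 + 12.0 * np.sin(theta) + rng.normal(0.0, 1.0, 200)])
out1 = rng.multivariate_normal([-5.0, 0.0], np.diag([2.0, 30.0]), 40)
out2 = rng.multivariate_normal([2.0, -30.0], np.diag([4.0, 10.0]), 10)
ring_noise = np.vstack([ring, out1, out2])
```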


TABLE II: INFLUENCE OF THE INITIALIZATION (25 TRIALS)

B. Alternative Novelty Detectors

We have chosen to compare DKHM with recent cutting-edge methods in outlier detection. We distinguish two classes of methods: unsupervised and semisupervised novelty detectors. In the first group, we study the OC-SVM [2], the one-class kernel Fisher discriminant (OC-KFD) [7], and the one-class rate distortion algorithm (OC-RD) [20]. In the second group, we choose the least-squares outlier detection (LSOD) method [12], the maximum likelihood outlier detection method (MLOD) [11], and the maximum mean discrepancy relative outlier detection algorithm (MMD-ROD) [13]. To conduct tests in both unsupervised and semisupervised modes, we define the source domain, which includes the labeled instances, and the target domain for the unlabeled set. In the unsupervised mode, the dominant data and outliers are completely assigned to the target domain, and the source domain is empty. In the semisupervised mode, outliers are detected relative to the dominant data. Thus, half of the dominant data is transferred into the source domain and the other half into the target domain. Because prior knowledge of anomalous data is not always available in practice, outliers are assigned to the target domain. To analyze the robustness of the proposed methods, the outliers are gradually transferred into the target domain in the following proportions: r = {0.2, 0.5, 0.9}. Outlier detection or novelty detection is basically a class imbalance problem. In imbalanced environments, traditional performance measures such as the error rate or the accuracy, extracted from the confusion matrix and used for two-class problems, are unsuited. In this paper, we use the F-measure to evaluate the classification performance of the detectors [21]. All the methods are evaluated by the average performance over 25 repetitions.

C. Parameter Setup

A good selection of the hyperparameters of each method affects the performance of a novelty detector. This step is crucial in many learning algorithms. In this experiment, most of the methods use a Gaussian kernel.

1) DKHM: It has two free parameters: the regularization term δ and the width σ of the Gaussian kernel. In accordance with Section VIII-B, the value of δ is set to 1e−1. The kernel width σ is picked in the range 10^X with X = {−2, −1, 0, 1, 2}, and the optimal value is selected from the highest value of the contrast measure as described in Section VIII-A. Of course, no learning step with a cross-validation strategy is used for adjusting the parameter σ.

The state of the optimal indicator vector $1_{O_{opt}}$ is used to compute the F-measure. In Algorithm 2, all the data initialize the matrix $Z_{train}$, whereas the matrix $Z_{test}$ is empty (no prediction step).

2) OC-SVM: We have chosen to use the ν-SVM of Schölkopf et al. [3]. However, the ν-SVM solution depends on the outlier ratio ν and the Gaussian kernel width σ. As no plausible selection criterion is available, we develop a grid search approach to select the optimal pair (σ, ν). The best (σ, ν) combination in the sense of the F-measure is obtained by N-fold, M-times cross-validation (N = 2, M = 10). The range for σ is [0.1, 10] with a sampling step of 0.4, and the range for ν is [0.1, 0.9] with a sampling step of 0.1.

3) OC-KFD: It is parameterized by a regularization term δ and the width σ of the Gaussian kernel. A cross-validated likelihood criterion is used to infer the free parameters. The performance of OC-KFD is highly dependent on the initial range of the pair (γ, σ). We decided to set δ = 1e−4 and define σ in [10, ..., 150] with an increment of 10 for the 2-D synthetic data sets, and to set δ = 1e−1 and define σ in [1.5e4, ..., 3e4] with a step of 1000 for the multidimensional real data sets. This preselection is necessary in order to obtain coherent results.

4) OC-RD: It is parameterized by an inverse temperature term β (see [20]). However, no plausible selection criterion is proposed in [20]. As previously, we adopt a grid-search-based strategy to adjust this parameter.

5) LSOD and MLOD: Three parameters affect the performance of these methods: a regularization term δ, the width σ of the Gaussian kernel, and the number k of Gaussian kernels used. According to [12], δ is chosen from 10^{−3,−2,−1,0,1,2,3} and k is chosen among {10, 50, 100, 150, 200}. These two algorithms use the score of leave-one-out cross-validation to solve the model selection problem (see [11], [12] for more details).

6) MMD-ROD: The objective function requires the correct selection of three parameters: the width σ of the Gaussian kernel and two tradeoff parameters, C and η. Following [13], C is chosen from 10^{−3,−2,−1,0,1,2,3} and η is defined in 10^{−1,0,1,2,3}. However, MMD-ROD does not offer a plausible selection criterion. As previously, we propose to choose the combination (σ, C, η) such that the resulting F-measure is maximal during the validation step. As mentioned before, this setup can be considered as the upper limit of MMD-ROD's performance. The reader should note that in [13], artificial outliers are generated and added to the validation set to improve the classification results. In this paper, all the experiments are conducted without adding artificial outliers.
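Since all methods are scored with the F-measure, a minimal implementation of that score is given below; treating the outlier class as the positive class is an assumption of this sketch.

```python
import numpy as np

def f_measure(y_true_out, y_pred_out):
    """F-measure on the outlier class, from boolean indicator vectors."""
    y_true_out, y_pred_out = np.asarray(y_true_out, bool), np.asarray(y_pred_out, bool)
    tp = np.sum(y_true_out & y_pred_out)
    fp = np.sum(~y_true_out & y_pred_out)
    fn = np.sum(y_true_out & ~y_pred_out)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```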


TABLE III: COMPARATIVE TEST (25 TRIALS)

X. RESULTS

A. Influence of the Initialization

In this part, we analyze the behavior of our algorithm with respect to the initialization of the state of the outlier indicator vector. This analysis is based on the sine-noise and the breast cancer data sets. We fix the kernel width close to the optimum, that is, σ = 0.5 for the sine-noise data set and σ = 100 for the breast cancer data set. We consider two modes of initialization: n/2 + 1 coefficients of $1_O^{(t=0)}$ are randomly selected either from the whole data set (mode S-Random) or from the set D (mode D-Random) and set to the value 1. We also analyze the behavior of the algorithm according to the contamination rate by varying r in {0.2, 0.5, 0.9}. Table II shows the results in terms of classification from the F-measure and the number of iterations needed for convergence. All these measures are averaged over 25 repetitions. As we can see in Table II, the performance of our algorithm is not very sensitive to the initialization mode, since the classification results are very close whatever the mode used. On the sine-noise data set, it seems that the D-Random mode gives more stable results than the S-Random one (see the values of the standard deviations in columns 2 and 4). This tendency is confirmed for the real data set. We also note that our algorithm converges in a few iterations, and more quickly in the D-Random mode. Given the prohibitive cost that would result from the development of a supervised initialization step and the results of Table II, we continue with a random initialization step in the following experiments.

Fig. 2. Computational cost (in seconds) of different outlier detection methods.

B. Unsupervised Mode

The unsupervised experiments, which are summarized in Table III, lead to the following analysis: generally speaking, DKHM outperforms most of the studied unsupervised novelty detectors. However, we can note that the performance level of our algorithm varies depending on the data set. Indeed, a detailed analysis shows that the ellipse-noise, sine-noise, b-cancer, and digit 1 data sets fit the partitioning model adopted in (2) very well. These good results show that a unique value of the kernel width exists and correctly separates these data sets.


Fig. 3. Decision boundaries on the 2-D synthetic data sets. (a) and (c): Ring-noise and sine-noise data sets. (b) and (d) Corresponding classification results (see text).

On the other hand, the partitioning model (2) seems less suited for the heart, diabete, and sonar data sets since, when the contamination rate increases, the average F-measure quickly decreases to values lower than 80%. We can note that OC-RD shows wrong classification results most of the time; it is penalized by its ball-based search strategy. Despite a supervised search of its free parameters to obtain the best F-measure, OC-SVM is less efficient than our approach. We note also that OC-KFD fails for most of the data sets. Indeed, its model selection problem requires solving a multidimensional normalization integral, which becomes rather time consuming when the dimension and the size of the data exceed some values (i.e., d > 5 and n > 500). In Table III, the mark - means that OC-KFD fails to give a numerical solution in a reasonable time. However, for small sample size problems, OC-KFD gives very good results, especially when the contamination rate is small. But an increase of the outlier percentage involves a decrease of its performance. The reason is that the outlier detection scheme adopted in OC-KFD is based on the estimation of a confidence envelope, which can be sensitive to a high contamination rate and produce high classification errors. In addition, we noticed that the cross-validated likelihood function with respect to the kernel width may exhibit several local maxima, making the performance of the method sensitive to the initial range of the kernel width. In contrast, the proposed contrast measure gives a unique maximum when the data set S verifies the partitioning model (2).

C. Semisupervised Mode

The last three columns of Table III summarize the quantitative results obtained with the semisupervised novelty detectors. The best results for this mode are in bold. We can note that MMD-ROD clearly outperforms LSOD and MLOD and confirms the classification performance reported in [13]. However, despite a validation step for selecting its free parameters, MMD-ROD shows inferior performance to the proposed algorithm most of the time (see the results on sine-noise, ring-noise, heart, digit 1, and diabete). MMD-ROD performs better on the sonar and b-cancer data sets. Notice that

MMD-ROD requires the selection of three hyperparameters during the validation step, making it rather time consuming with large data sets.

D. Computational Cost

Fig. 2 shows the efficiency of the studied algorithms, measured by the CPU time (in seconds). Since the computational costs of OC-KFD are significantly above those of the other methods, they are not shown in Fig. 2. These average values correspond to 25 trials and are given at optimality, that is, when the free parameters are set to their optimal values. For small or medium sized data sets, DKHM is very competitive against most other novelty detectors, but for large data sets its effectiveness decreases significantly. Indeed, in each experiment, the inversion of the kernel Gram matrix in L takes more than 80% of the total time, while the update of the state of the indicator vector needs less than 1% of the total time.

E. Decision Boundaries

In this part, we illustrate graphically the classification results obtained by the proposed approach on the synthetic data sets. The experiment is conducted in unsupervised mode. Fig. 3(a) and (c) shows the ring-noise and the sine-noise data sets as defined in Section IX-A. These tests are difficult because the dominant data subsets are locally corrupted by outliers. In particular, the ring-noise data set is severely corrupted by clustered outliers, making the separation difficult. Fig. 3(b) and (d) shows the classification results obtained using our method when the contamination rate is r = 0.9. Each figure shows one classification result among the 25 trials. As can be noted, despite a random initialization of the state of the indicator vector $1_O^{(t=0)}$, our algorithm automatically determines a coherent value of the kernel width for separating the dominant data set (filled black circles) from the outliers (+ marks).


Fig. 4. Traffic detection: (a) background view. Evolution of $\bar C(\alpha^*, 1_O^*)$ according to σ for (b) Scenario 1, (c) Scenario 2, and (d) Scenario 3. The vertical arrows show the maximal value of $\bar C(\alpha^*, 1_O^*)$ with respect to σ.

Fig. 5. Traffic detection on a motorway. First row: Scenario 1. Middle row: Scenario 2. Last row: Scenario 3. (a) #14. (b) σ = 359. (c) σopt = 130. (d) #115. (e) σ = 359. (f) σopt = 16. (g) #40. (h) σ = 359. (i) σopt = 130.

We also observe that our perturbation method provides stable classification results despite an increase of the outlier percentage (see rows 2 and 3, column DKHM, of Table III), and that the decision boundaries estimated from (32) and (33) smoothly enclose the main part of the data [Fig. 3(b) and (d)]. However, when outliers are highly structured, the percentage of false alarms may increase, generating locally isolated decision boundaries and decreasing the value of the F-measure. This situation is illustrated with the ring-noise data set, where a small outlier cluster produces a wrong decision boundary [Fig. 3(b)].

XI. APPLICATION TO OBJECT DETECTION IN VIDEO SEQUENCES

In this experiment, we consider a real-world application of outlier detection. We examine the performance of our method in object detection from red-green-blue (RGB) images. Here, the goal is to identify novel vehicles in a target motorway view relative to its background. In the background view [Fig. 4(a)],

no vehicles are traveling on the road, and hence vehicles appearing in the target view are considered relatively novel. The video sequence is composed of 120 images of size 60 × 80. For each color channel, the background and the target images are divided into a series of corresponding overlapping 3 × 3 pixel block pairs, and a sparse distance matrix is computed from the squared difference of each block pair. Next, all the distance matrices are summed. To analyze the behavior of our method with respect to the contamination rate, we consider three scenarios. In Scenario 1, a single vehicle far from the camera appears in the target view, which corresponds to the smallest contamination rate [Fig. 5(a)]. In Scenario 2, we consider another single vehicle closer to the camera, which of course increases the contamination rate [Fig. 5(d)]. In Scenario 3, three vehicles
are considered, which provides the highest contamination rate [Fig. 5(g)]. As we can note in the second and third columns of Fig. 5, the selection of the kernel width influences the detection performance. When its value is not optimally selected, the targets are wrongly extracted [white areas in Fig. 5(b), (e), and (h)]. When the value of the kernel width provides a maximal response of the contrast measure [Fig. 4(b)–(d)], the white areas in Fig. 5(c), (f), and (i) correctly delimit the corresponding targets. We can also notice that our approach shows very good detection performance despite the change of the contamination rate.

XII. CONCLUSION

In this paper, we detail a new algorithm which formalizes the problem of outlier detection as the solution of a one-class kernel Fisher discriminant. The maximization of our criterion provides both the optimal projective subspace and the label vector separating the normal data from the outliers. By deriving an upper bound of our criterion, we have shown that these two problems can be solved separately. First, the optimal label vector is iteratively obtained by a tricky perturbation method and, once the label vector is given, the projective subspace is estimated by solving a generalized eigenvalue problem. We have demonstrated that our latent label assignment algorithm exhibits global convergence behavior, that is, for any random initialization, the sequence of iterates generated by the algorithm converges to a stationary point. The approach proposed in this paper overcomes several recurrent shortcomings of one-class classifiers. First, we have demonstrated that our Gaussian kernel-based contrast measure is an efficient indicator to select the optimal kernel width. This property simplifies the model selection problem traditionally solved by costly cross-validation procedures. Next, we do not need any prior assumption about the fraction of outliers which corrupts the data set. Such prior assumptions are common drawbacks of standard unsupervised novelty detectors such as one-class SVM and kernel PCA. The solution of the generalized eigenvalue problem induced by subspace selection provides a decision boundary to achieve good generalization on unseen data. Some extensions of this paper are under consideration: as mentioned previously, the computation of the matrix L, and particularly the inversion operation, is the most time-consuming part, which constrains DKHM to work on medium-sized data sets. Our future goal is to reduce this cost to deal with large-scale problems. Our approach is based on a simplified partitioning model of the kernel Gram matrix which assumes that only one value of the kernel width is optimal. This assumption may restrict the use of the proposed method to some kinds of data sets. Another point is then to relax this constraint by considering a multiple kernel-based extension of the proposed method.

REFERENCES

[1] F. Dufrenois and J. C. Noyer, "Formulating robust linear regression estimation as a one-class LDA criterion: Discriminative hat matrix," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 2, pp. 262–273, Feb. 2013.
[2] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA, USA: MIT Press, 2002.

[3] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Comput., vol. 13, no. 7, pp. 1443–1471, Jul. 2001.
[4] D. Tax, "One class classification," Ph.D. dissertation, Delft Univ. Technology, Delft, The Netherlands, 2001.
[5] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Mach. Learn., vol. 54, no. 1, pp. 45–66, Jan. 2004.
[6] D. M. J. Tax and R. P. W. Duin, "Uniform object generation for optimizing one-class classifiers," J. Mach. Learn. Res., vol. 2, pp. 155–173, Jan. 2001.
[7] V. Roth, "Kernel Fisher discriminants for outlier detection," Neural Comput., vol. 18, no. 4, pp. 942–960, Apr. 2006.
[8] H. Ferdowsi, S. Jagannathan, and M. Zawodniok, "An online outlier identification and removal scheme for improving fault detection performance," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 908–919, May 2014.
[9] H. Chen, P. Tino, A. Rodan, and X. Yao, "Learning in the model space for cognitive fault diagnosis," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 1, pp. 124–136, Jan. 2014.
[10] J. Gao, H. Cheng, and P. Tan, "Semi-supervised outlier detection," in Proc. ACM Symp. Appl. Comput., Dijon, France, Apr. 2006, pp. 635–636.
[11] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Ann. Inst. Statist. Math., vol. 60, no. 4, pp. 699–746, Dec. 2008.
[12] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori, "Statistical outlier detection using direct density ratio estimation," Knowl. Inform. Syst., vol. 26, no. 2, pp. 309–336, Feb. 2011.
[13] S. Li and I. Tsang, "Learning to locate relative outliers," in Proc. Asian Conf. Mach. Learn., Taoyuan, Taiwan, Nov. 2011, pp. 47–62.
[14] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, "Integrating structured biological data by kernel maximum mean discrepancy," Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[15] C. Ding and T. Li, "Adaptive dimension reduction using discriminant analysis and K-means clustering," in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, USA, Jun. 2007, pp. 521–528.
[16] F. De la Torre Frade and T. Kanade, "Discriminative cluster analysis," in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 241–248.
[17] J. Ye, Z. Zhao, and M. Wu, "Discriminative K-means for clustering," in Proc. 21st Conf. Neural Inform. Process. Syst., Vancouver, BC, Canada, Dec. 2007.
[18] F. Dufrenois and J. C. Noyer, "A kernel hat matrix based rejection criterion for outlier removal in support vector regression," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, USA, Jun. 2009, pp. 736–743.
[19] G. Lanckriet, L. Ghaoui, and M. Jordan, "Robust novelty detection with single-class MPM," in Proc. 16th Conf. Neural Inform. Process. Syst., Vancouver, BC, Canada, Dec. 2002, pp. 905–912.
[20] K. Crammer and G. Chechnik, "A needle in a haystack: Local one-class optimization," in Proc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, Jun. 2004, pp. 711–718.
[21] V. Garcia, J. Sanchez, R. Mollineda, R. Alejo, and J. Sotoca, "The class imbalance problem in pattern classification and learning," in Proc. 2nd Congr. Espanol Inform., vol. 1, Zaragoza, Spain, Sep. 2007, pp. 283–291.

Franck Dufrenois received the B.S. and M.S. degrees in telecommunication engineering and the Ph.D. degree in electronics from the University of Sciences and Technologies, Lille, France, in 1989, 1990, and 1994, respectively.
He is an Associate Professor with the University of Littoral Côte d'Opale, Calais, France. His current research interests include machine learning, robust statistics, and image processing.
