Incremental Generalized Discriminative Common Vectors for Image Classification

Katerine Diaz-Chito, Francesc J. Ferri, Senior Member, IEEE, and Wladimiro Díaz-Villanueva

Abstract— Subspace-based methods have become popular due to their ability to appropriately represent complex data in such a way that both dimensionality is reduced and discriminativeness is enhanced. Several recent works have concentrated on the discriminative common vector (DCV) method and other closely related algorithms also based on the concept of null space. In this paper, we present a generalized incremental formulation of the DCV methods, which allows the update of a given model by considering the addition of new examples even from unseen classes. Having efficient incremental formulations of well-behaved batch algorithms allows us to conveniently adapt previously trained classifiers without the need of recomputing them from scratch. The proposed generalized incremental method has been empirically validated in different case studies from different application domains (faces, objects, and handwritten digits) considering several different scenarios in which new data are continuously added at different rates starting from an initial model.

Index Terms— Generalized discriminative common vector (GDCV), image classification, incremental learning, null space, subspace-based methods.

I. INTRODUCTION

Statistical methods based on subspaces have been broadly used and studied in computer vision and related fields due to their appealing capabilities and good practical behavior in many interesting recognition and classification tasks. In the particular context of image classification, and other areas using nontrivial visual descriptors, the ratio of the dimensionality of the input space to the training set size is usually very large. This fact plays a key role in many subspace-based learning methods, because it may influence both their generalization ability and efficiency. This situation is usually referred to as the small sample size (SSS) case [1].

Among the most popular methods for image classification based on subspaces, we find linear discriminant analysis (LDA), which tries to maximize the distance between classes while minimizing the distance between the samples within the same class [2].

Manuscript received January 14, 2014; revised September 5, 2014; accepted September 7, 2014. Date of publication September 22, 2014; date of current version July 15, 2015. This work was supported in part by the European Regional Development Fund and in part by the Spanish Government Consolider Ingenio 2010 Programme under Grant CSD2007-00018.

K. Diaz-Chito is with the Centre de Visió per Computador, Universitat Autònoma de Barcelona, Barcelona 08193, Spain (e-mail: [email protected]).

F. J. Ferri and W. Díaz-Villanueva are with the Departament d'Informàtica, Universitat de València, Valencia 46010, Spain (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2014.2356856

The null space linear discriminant analysis (NLDA) approach proposed in [1] consists of projecting onto the null space of the within-class scatter matrix, and then using principal component analysis (PCA) to maximize the between-class scatter. Huang et al. [3] introduced the PCA+Null approach, which first reduces the input space dimension by applying PCA and then removes the null space of the total scatter matrix of the transformed samples. Ye [4] introduced alternative formulations that lead to orthogonal LDA, which is equivalent to NLDA under very mild conditions [5].

Another algorithm sharing the same concepts and goals as NLDA is the discriminative common vectors (DCVs) approach [6], which is also a variation of LDA [7], initially proposed for face recognition tasks. This algorithm has been shown to attain a very good tradeoff between performance and efficiency in several application domains [6], [8]. The methodology consists of constructing a linear mapping onto the null space of the within-class scatter matrix, in which all training data collapse into the so-called common vectors [9], and then computing the matrix that maximizes the scatter among these. Classification of new data can finally be accomplished by first projecting and then measuring an appropriate similarity to the DCV of each class [6].

Tamura and Zhao [10] proposed a generalized version of the DCV approach called rough common vectors (RCVs). This approach relaxes the definition of the involved subspaces and is able to produce good results in practice regardless of the ratio of dimensions to training examples, which constitutes one of the limitations of the original DCV approach. The basic idea of RCV is to divide the feature space into two complementary subspaces. One is spanned by the eigenvectors corresponding to the largest eigenvalues of the within-class scatter matrix. The other is spanned by the eigenvectors corresponding to the smallest eigenvalues, including zeros.

The above methods were originally formulated in a batch mode setting, where learning takes into account the training set as a whole and a final model is obtained. Despite its broad scope, this learning strategy is not suitable in a large number of practical problems in which data come from dynamic environments and/or are acquired over time. In these cases, there is a need to adaptively update one or several previously given models in an incremental way. Most incremental methods use progressive corrections applied to a previously learned model. These corrections are usually calculated from the sparse newly arrived data, with a consequent reduction in time and space costs. For example, Pang et al. [11] proposed an incremental LDA (ILDA) algorithm for the classification of data streams.


Ye et al. [12] proposed another ILDA alternative, called Incremental Dimension Reduction via QR Decomposition, which computes the optimal discriminant vectors by evolving the subspace spanned by the class means. Zhao and Yuen [13] presented a further incremental algorithm, called Incremental Linear Discriminant Analysis based on Generalized Singular Value Decomposition, which relies on the fast singular value decomposition, whereas an ILDA algorithm to accurately update the discriminant vectors of the dual space was proposed in [14]. Kim et al. [15] proposed an ILDA method that applies the concept of the sufficient spanning set to perform the update.

Regarding the DCV method, some incremental approaches have been proposed too. Ferri et al. [16] and Lu et al. [17] presented incremental DCV (IDCV) methods using difference subspaces and one-sample updates. Diaz-Chito et al. [18], [19] proposed more general IDCV versions using Gram–Schmidt orthonormalization and eigendecomposition. According to the empirical results reported, and consistently with the comparisons in [6], the IDCV was clearly superior to all previous ILDA proposals.

Previously proposed IDCV methods only address the SSS case, and they can potentially face computational, numerical, and generalization problems as the number of samples approaches the number of dimensions. In this paper, we propose a novel incremental DCV methodology, called incremental generalized discriminative common vectors (IGDCV), which is able to provide successful classifiers in all possible cases, regardless of the SSS assumption. IGDCV is based on the update of the projection matrices corresponding to the extended null space of the within-class scatter matrix and constitutes a generalization and extension of previous works [16], [19]. By relaxing the null space condition, the resulting method can be applied beyond the SSS case. Moreover, several different update scenarios are considered depending on whether new data contain samples from new classes or not.

The rest of this paper is organized as follows. Section II describes the details of several null space-based methods, which are important from the point of view of this paper. Section III presents the IGDCV approach for different ways of incorporating new data, which constitutes the main contribution of this paper. Section IV describes the empirical validation of the proposed approach. Finally, Section V contains the main conclusions and some ideas about further research.

II. NULL SPACE BASED LINEAR DISCRIMINANT METHODS

A. Notation

Let the training set X be composed of C classes, where every class j has m_j samples. The total number of samples in the training set is M = \sum_{j=1}^{C} m_j. Let x_i^j denote the d-dimensional column vector corresponding to the i-th sample of the j-th class. The within-class scatter matrix, S_w^X, is defined as

S_w^X = \sum_{j=1}^{C} \sum_{i=1}^{m_j} (x_i^j − x̄_j)(x_i^j − x̄_j)^T = X_c X_c^T    (1)

where x̄_j is the average of the samples in the j-th class, and the centered data matrix, X_c, consists of the column vectors (x_i^j − x̄_j) for all j = 1, . . . , C and i = 1, . . . , m_j. Correspondingly, the between-class scatter matrix is given by

S_b^X = \sum_{j=1}^{C} m_j (x̄_j − x̄)(x̄_j − x̄)^T    (2)

where x̄ is the overall mean.
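For concreteness, the following NumPy sketch computes S_w^X and S_b^X exactly as in (1) and (2); the function and variable names are ours and not part of the paper.

```python
import numpy as np

def scatter_matrices(class_samples):
    """Compute within-class (Sw) and between-class (Sb) scatter matrices.

    class_samples: list with one array per class, each of shape (d, m_j),
    samples stored as columns, following Eqs. (1) and (2).
    """
    d = class_samples[0].shape[0]
    overall_mean = np.hstack(class_samples).mean(axis=1, keepdims=True)

    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for Xj in class_samples:
        m_j = Xj.shape[1]
        mean_j = Xj.mean(axis=1, keepdims=True)
        Xc_j = Xj - mean_j                  # centered class data
        Sw += Xc_j @ Xc_j.T                 # Eq. (1), per-class term
        diff = mean_j - overall_mean
        Sb += m_j * (diff @ diff.T)         # Eq. (2), per-class term
    return Sw, Sb
```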

The DCV method [6], which is equivalent to the null space method [1], [7], consists of finding a projection matrix, W ∈ R^{d×(C−1)}, that maximizes the projected between-class scatter, subject to the constraint that the subspace generated by W belongs to the null space of S_w^X

max_W  tr(W^T S_b^X W)
s.t.   tr(W^T S_w^X W) = 0.    (3)

In this expression, the trace operator, tr(·), is used as a scalar measure of the scatter associated to the corresponding matrices. It can be shown that this problem is equivalent to the maximization of the modified Fisher criterion (ratio of total scatter to within-class scatter) and is a convenient alternative in the SSS case to the classical Fisher criterion (ratio of between-class scatter to within-class scatter) [1], [6].

A convenient and efficient way of implementing the DCV method consists of obtaining an orthonormal basis of the null space, N(S_w^X), as a first step. When projecting the training data onto this subspace, at most C distinct common vectors are obtained [6]. The second step consists of obtaining the final (C − 1)-dimensional subspace that maximizes the projected total scatter. This second step is straightforward and very efficient, as it can be done using only the C common vectors. Given its huge size when dealing with high-dimensional data, the projection onto N(S_w^X) in the first step of the method is implicitly managed using an orthonormal basis, Q, of its complement, the range space, R(S_w^X). Any sample x can be projected onto N(S_w^X) as (x − QQ^T x). Apart from the obvious choice of using eigendecompositions in both steps, Cevikalp et al. [6] proposed using Gram–Schmidt orthonormalization on difference subspaces as an efficient option. This procedure has been recently improved using different alternatives [8], [20]–[22]. Nevertheless, the asymptotic cost is O(dM^2) for all methods.
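As a small illustration of this implicit handling of the null space, the sketch below (with our own naming, and using scipy.linalg.orth as the orthonormalization routine) obtains Q from the centered data matrix and projects a sample onto N(S_w^X):

```python
import numpy as np
from scipy.linalg import orth

def dcv_null_space_projection(Xc, x):
    """Project x onto N(Sw^X) using an orthonormal basis Q of the range space.

    Xc is the d x M within-class centered data matrix, so Sw^X = Xc Xc^T and
    R(Sw^X) = R(Xc). The common vectors of each class are obtained by
    projecting the class means in exactly this way."""
    Q = orth(Xc)                  # orthonormal basis of R(Sw^X)
    return x - Q @ (Q.T @ x)      # component of x in N(Sw^X)
```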

B. Extended Null Space Methods

The procedure to obtain a projection basis and the corresponding common vectors can be relaxed through the extension of the null space of S_w^X (which implies restricting the corresponding range space) [10]. By extending the null space with new, almost null directions, the method can be conveniently applied in more general situations and not only in the SSS case. Moreover, by introducing data variability in the null space, the final projection will exhibit less sensitivity to noise and outliers, and this could lead to a better generalization ability of the corresponding classifiers.

Let Q be a basis for a restricted range subspace, where some almost null directions have been removed. The resulting amount of variability can be measured from the within-class scatter matrix projected onto this subspace as tr(Q^T S_w^X Q). This quantity is at most tr(S_w^X) when no directions are removed, and it decreases as more important directions disappear from Q. Consequently, the relative within-class variability preserved in the restricted range subspace (equivalently, one minus the relative variability left in the extended null space) can be written as

α = 1 − tr((I − QQ^T) S_w^X (I − QQ^T)) / tr(S_w^X) = tr(Q^T S_w^X Q) / tr(S_w^X).

The parameter α takes values in [0, 1]. When α = 1, the projection is equivalent to the one obtained by DCV. For different particular values of α < 1, different projections can be obtained with different levels of preserved variability. Note that decreasing variability in the restricted range space directly results in increasing variability in the corresponding extended null space.

The projection basis fulfilling the above conditions for a given value of α can be obtained through the eigendecomposition of S_w^X. In particular, we denote as Λ_1 and U_1 the diagonal matrix that contains the nonzero eigenvalues in descending order and its corresponding eigenvectors, respectively. These ordered eigenvectors can be split into two matrices, U_1 = [U_α U_ᾱ], whose corresponding eigenvalues sum up to α · tr(S_w^X) and (1 − α) · tr(S_w^X), respectively. Consequently, the intraclass scatter matrix can be decomposed in terms of its nonzero eigenvalues and eigenvectors as

S_w^X = U_1 Λ_1 U_1^T = [U_α U_ᾱ] diag(Λ_α, Λ_ᾱ) [U_α U_ᾱ]^T.

Note that U_1 is an orthonormal basis of the range space of S_w^X and that U_α spans a subspace that approximates it as α → 1. We will refer to this subspace as the restricted range space, while its complement will be referred to as the extended null space. The basis U_ᾱ generates a subspace of the extended null space, which is associated to directions with very small but nonzero intraclass variation. The relation among the different subspaces and their corresponding bases is shown in Fig. 1.

Fig. 1. Partition of a vector space into range and null subspaces according to the ordered eigenvectors and eigenvalues of S_w^X.

Taking into account this partition into subspaces, the generalized discriminative common vector (GDCV) method (referred to as RCV in [10]) can be summarized as Algorithm 1. This algorithm boils down to the plain DCV algorithm if α = 1.

Algorithm 1 GDCV Method
1: Obtain U_1 and Λ_1 such that S_w^X = U_1 Λ_1 U_1^T.
2: Split U_1 = [U_α U_ᾱ] and Λ_1 = diag(Λ_α, Λ_ᾱ), in such a way that Λ_ᾱ contains the smallest eigenvalues in Λ_1 and tr(Λ_α) = α · tr(Λ_1).
3: Project the class means as x_com^j = x̄_j − U_α U_α^T x̄_j. These are the so-called common vectors of each class.
4: Define X_com = [x_com^1, . . . , x_com^C] and let X_c^com be its centered version with regard to the mean x̄_com = (1/C) \sum_{j=1}^{C} x_com^j.
5: Obtain the projection W ∈ R^{d×(C−1)} such that tr(W^T X_c^com (X_c^com)^T W) is maximum.
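The following Python sketch illustrates one way Algorithm 1 could be implemented with NumPy; the function and variable names (gdcv, class_samples, and so on) are ours and not from the paper, and the full d × d eigendecomposition is used only for clarity.

```python
import numpy as np

def gdcv(class_samples, alpha=0.95):
    """Sketch of Algorithm 1 (GDCV). class_samples: list of (d, m_j) arrays."""
    C = len(class_samples)
    means = [Xj.mean(axis=1, keepdims=True) for Xj in class_samples]
    Xc = np.hstack([Xj - mu for Xj, mu in zip(class_samples, means)])
    Sw = Xc @ Xc.T

    # Step 1: eigendecomposition of Sw, sorted in descending order.
    evals, evecs = np.linalg.eigh(Sw)
    order = np.argsort(evals)[::-1]
    evals, evecs = np.clip(evals[order], 0.0, None), evecs[:, order]

    # Step 2: keep the leading eigenvectors whose eigenvalues sum to alpha*tr(Sw).
    nonzero = evals > 1e-10 * evals[0]
    lam, U1 = evals[nonzero], evecs[:, nonzero]
    k = int(np.searchsorted(np.cumsum(lam), alpha * lam.sum())) + 1
    U_alpha, lam_alpha = U1[:, :k], lam[:k]

    # Step 3: common vectors (class means projected onto the extended null space).
    X_com = np.hstack([mu - U_alpha @ (U_alpha.T @ mu) for mu in means])

    # Steps 4-5: projection maximizing the scatter of the centered common vectors.
    Xc_com = X_com - X_com.mean(axis=1, keepdims=True)
    W = np.linalg.svd(Xc_com, full_matrices=False)[0][:, :C - 1]
    return W, U_alpha, lam_alpha, X_com
```

In the SSS case one would instead eigendecompose the much smaller X_c^T X_c or compute the SVD of X_c, as discussed in the text; the dense d × d route above is only meant to follow the algorithm literally.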

Note that, when α < 1, the eigendecomposition in Step 1 is needed to conveniently extend the null space, and other more efficient alternatives are not possible [16], [17]. Either by eigendecomposing the smaller (in the SSS case) X_c^T X_c instead of X_c X_c^T, or by computing the SVD of X_c, the asymptotic cost of the GDCV method is in O(dM^2), with a hidden term proportional to M^3 that usually becomes very important in practice. It is worth noting that Steps 3–5 in the above algorithm are the same as in the original DCV method, and several different ways of efficiently implementing them are possible. Nevertheless, their impact on the total cost, which is dominated by the costs of Steps 1 and 2, is almost negligible.

III. INCREMENTAL FORMULATION OF EXTENDED NULL SPACE METHODS

A. Notation

Consider the general case in which, once an initial data set, X, has been used to obtain U_α, Λ_α, and W by applying the GDCV method, a new (smaller) data set Y consisting of n_j samples from each class (with N = \sum_{j=1}^{C} n_j) is given. Let Z = [X Y], and let Y_c and Z_c be the corresponding centered versions of Y and Z, each one with regard to its own mean. The corresponding within-class scatter matrices will be S_w^Y = Y_c Y_c^T and S_w^Z = Z_c Z_c^T. The goal now is to obtain U_α', Λ_α', and W' corresponding to the complete data set, Z, without having to apply the GDCV algorithm to Z. Instead, they will be obtained incrementally by adding the effect of the new data, Y, into the previous solution corresponding to X.

In Section III-B, the specific case α = 1, in which the scatter in the null space is zero and the incremental method is exact, will be considered. Then, the more interesting case in which α < 1 will be introduced. This general formulation and the corresponding generalized incremental algorithm constitute the main contribution of this paper.

B. Incremental Discriminant Common Vectors

From the definition of the scatter matrices, and manipulating the expressions in a similar way as in [23], it can be shown that

S_w^Z = S_w^X + S_w^Y + A A^T    (4)

Fig. 2. Relation among dimensions in the subspaces associated to data sets X and Z .

where A = [a_1, . . . , a_C] is a matrix whose columns are the C weighted average differences given by

a_j = \sqrt{m_j n_j / (m_j + n_j)} (x̄_j − ȳ_j),   j = 1, . . . , C.

In terms of the eigendecompositions of the matrices S_w^Z and S_w^X in (4), we have

U_1' Λ_1' U_1'^T = U_1 Λ_1 U_1^T + Y_c Y_c^T + A A^T.    (5)

Note that the difference between the range subspace generated by U_1 and the new one generated by U_1' consists of the new orthogonal directions contained both in the centered new data set, Y_c, and in the average differences in A. These orthogonal directions can be obtained by projecting the corresponding vectors using I − U_1 U_1^T. In other words, the vectors [Y_c A] − U_1 U_1^T [Y_c A] are orthogonal to U_1 and contain the sought new directions. Consequently, a basis for the new range space generated by U_1' can also be written as [U_1 V_1], where V_1 is obtained as

V_1 = orth([Y_c A] − U_1 U_1^T [Y_c A])

where orth(·) refers to any orthonormalization procedure. This construction constitutes an incremental characterization of the range space of S_w^Z with regard to the one of S_w^X. This situation is graphically shown in Fig. 2. As both U_1' and [U_1 V_1] generate the same subspace, R(S_w^Z), they are in general related by a rotation, R_1, as

U_1' = [U_1 V_1] R_1.

By substituting this into (5) and by projecting these scatters onto R(S_w^Z) as [U_1 V_1]^T (·) [U_1 V_1], we obtain

R_1 Λ_1' R_1^T = diag(Λ_1, 0) + [U_1 V_1]^T Y_c Y_c^T [U_1 V_1] + [U_1 V_1]^T A A^T [U_1 V_1] ≡ M_1.    (6)

Note that the matrix M_1 at the right-hand side of the above equation is a small square matrix of size at most M + N that can be efficiently built from U_1, Λ_1, Y_c, and A. Consequently, both the rotation R_1 and the new eigenvalues Λ_1' can be obtained by eigendecomposing M_1. This relation leads to an incremental algorithm that runs in time O(dr^2), where r is the rank of the Z set [18]. Although this algorithm is in practice much faster than the batch one, it is worth noting that in the worst case, in which all samples are linearly independent, its asymptotic cost is the same.
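As an illustration of this exact α = 1 update, the following sketch builds M_1 of Eq. (6) and recovers the new range-space basis and eigenvalues; all names are ours, and the handling of numerically zero eigenvalues is omitted for brevity.

```python
import numpy as np
from scipy.linalg import orth

def idcv_update(U1, lam1, class_means, class_counts, new_class_samples):
    """Sketch of the exact (alpha = 1) incremental update of Section III-B.

    U1, lam1          : basis and eigenvalues of the range space of Sw^X
    class_means       : list of (d, 1) class means of X
    class_counts      : list of m_j
    new_class_samples : list of (d, n_j) arrays with the new data Y (same classes)
    """
    # Centered new data Yc and weighted mean-difference columns a_j.
    Yc_cols, A_cols = [], []
    for mu_x, m_j, Yj in zip(class_means, class_counts, new_class_samples):
        n_j = Yj.shape[1]
        mu_y = Yj.mean(axis=1, keepdims=True)
        Yc_cols.append(Yj - mu_y)
        A_cols.append(np.sqrt(m_j * n_j / (m_j + n_j)) * (mu_x - mu_y))
    B = np.hstack(Yc_cols + A_cols)                  # columns of [Yc A]

    # New orthogonal directions not yet spanned by U1.
    V1 = orth(B - U1 @ (U1.T @ B))
    U1V1 = np.hstack([U1, V1])

    # Small matrix M1 of Eq. (6): diag(Lam1, 0) + [U1 V1]^T (Yc Yc^T + A A^T) [U1 V1].
    P = U1V1.T @ B
    M1 = np.diag(np.concatenate([lam1, np.zeros(V1.shape[1])])) + P @ P.T

    lam1_new, R1 = np.linalg.eigh(M1)
    order = np.argsort(lam1_new)[::-1]
    lam1_new, R1 = lam1_new[order], R1[:, order]
    U1_new = U1V1 @ R1                               # updated range-space basis
    return U1_new, lam1_new
```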

C. Incremental Generalized Discriminant Common Vectors

The previous approach can be generalized to the case in which the null space is extended by removing part of the corresponding range space. In this case, the only information kept about X is the basis of the restricted range space, U_α, and its corresponding eigenvalues, Λ_α (apart from the final projection W). Our aim now is to construct an approximated version of U_α' using only this information.

First, observe that the orthogonal basis of R(S_w^Z) can be written as U_1' = [U_α U_ᾱ V_1] R_1, from which we are only interested in the first basis vectors, U_α' = [U_α U_ᾱ V_1] R_α. Note that V_1 and U_ᾱ are no longer available. Instead, an alternative basis parameterized by α can be obtained as

V_α = orth([Y_c A] − U_α U_α^T [Y_c A]).

Together with an additional orthonormal basis, D, the directions in V_α complete [U_α V_α] to a basis [U_α V_α D] of R(S_w^Z); in particular, [U_ᾱ V_1] and [V_α D] span the same subspace, which can be expressed as

[U_ᾱ V_1] = [V_α D] S

for a given rotation, S. The relations among the different subspaces needed to extend the space generated by U_α to the one generated by U_1' are graphically shown in Fig. 3.

Fig. 3. Relation among the different subspaces whose direct sum is the range space of S_w^Z. The bases in shaded regions span the same subspace. The dark-shaded areas are not available if only the previous U_α is kept.

Equation (5) can now be rewritten in terms of U_α, V_α, D, and R_1 (instead of U_1') and using U_α and U_ᾱ (instead of U_1). These scatter matrices can now be projected onto a subspace slightly smaller than R(S_w^Z) as [U_α V_α]^T (·) [U_α V_α], which leads to

\begin{bmatrix} I & 0 \\ 0 & [I\ 0]S \end{bmatrix} R_1 Λ_1' R_1^T \begin{bmatrix} I & 0 \\ 0 & [I\ 0]S \end{bmatrix}^T = \begin{bmatrix} 0 & 0 \\ 0 & V_α^T U_ᾱ \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & Λ_ᾱ \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & V_α^T U_ᾱ \end{bmatrix}^T + M_α    (7)

where

M_α = diag(Λ_α, 0) + [U_α V_α]^T Y_c Y_c^T [U_α V_α] + [U_α V_α]^T A A^T [U_α V_α]

Fig. 4. Ideal restricted range space and eigenspectra corresponding to S_w^Z (bottom), and approximated restricted range space and eigenspectra of the matrix M_α (top).

Fig. 5. IGDCV algorithm.

is a smaller, parameterized version of M_1 that approaches it as α → 1. The scatter associated to Λ_ᾱ in (7) will be very close to zero and, by neglecting it, we can identify the projected S_w^Z with M_α. In addition, under the same circumstances, we can identify the highest eigenvalues and corresponding eigenvectors of S_w^Z and M_α. In particular, the matrix M_α can be decomposed as

M_α = R̃_1 Λ̃_1 R̃_1^T

from which we can obtain the eigenvectors, R̃_β, corresponding to the largest eigenvalues, Λ̃_β, such that tr(Λ̃_β) = β · tr(Λ̃_1). Consequently, final accurate approximations for the incremental extended null space projection with parameter α can be written as

U_α' ≈ [U_α V_α] R̃_β
Λ_α' ≈ Λ̃_β.

Note that the proportion β is defined with regard to tr(M_α) = tr(Λ̃_1), whereas α refers to tr(S_w^Z) = tr(Λ_1'), and they will be different in general. The different bases of the subspaces and the corresponding eigenspectra are shown in Fig. 4.

Let t_α = tr(M_α) = tr(Λ̃_1), t_X = tr(S_w^X) = tr(Λ_1), and t_Z = tr(S_w^Z) = tr(Λ_1'). The appropriate value for β can be obtained by observing that the final projected intraclass scatter (the sum of the darker eigenvalues down in Fig. 4) is tr(Λ̃_β) = β · t_α and that this scatter should ideally be equal to α · t_Z (the sum of the darker eigenvalues up in Fig. 4). The final scatter, t_Z, is not available, but it can be estimated from the matrices in (7) as

t_Z ≈ (1 − α) · tr(S_w^X) + t_α = (1 − α) · t_X + t_α

by neglecting the effect of the projection [U_α V_α]^T (·) [U_α V_α] on the trace operator. Note that (1 − α) · t_X corresponds to the amount of intraclass variance that has been removed from X at the previous iteration.

Fig. 6. Random partitions considered in the experiments. (a) Adding samples. (b) Adding classes.

Consequently, we have β · t_α = α · [(1 − α) · t_X + t_α]. The final expression for β, written in terms of the ratio between the traces of the two diagonal matrices available, is

β = (1 − α) · tr(Λ_α) / tr(Λ̃_1) + α

using the fact that tr(Λ_α) = α · t_X. From this expression, it can be seen that α ≤ β ≤ 1 (since the above trace ratio always lies between 0 and 1) and that both values get closer as they approach their limit value of 1. By considering the proposed approximation, the (few) directions that are lost (depending on α) are compensated by taking (further) directions from the new data (according to β), in such a way that the estimated intraclass variance is the same. Consequently, the quality of the approximation will depend on how representative the new data in Y are, compared with the previous data in X. In the ideal case in which both X and Y are large enough and independent identically distributed according to a particular probability distribution (well behaved, as is usually assumed for LDA), the corresponding null space would be optimally approximated.

The IGDCV algorithm is schematized in Fig. 5, along with the asymptotic cost corresponding to each of its steps.
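The sketch below outlines the update schematized in Fig. 5 for the case in which the new block Y contains samples of already seen classes. The names are ours and this is only an approximation of the authors' implementation (for instance, no special treatment of degenerate blocks is included here).

```python
import numpy as np
from scipy.linalg import orth

def igdcv_update(U_alpha, lam_alpha, class_means, class_counts,
                 new_class_samples, alpha=0.95):
    """Sketch of the IGDCV update for a new block Y of already seen classes.

    U_alpha, lam_alpha : restricted range-space basis of Sw^X and its eigenvalues
    class_means        : list of (d, 1) class means of X
    class_counts       : list of m_j
    new_class_samples  : list of (d, n_j) arrays forming the new block Y
    """
    # Centered new data Yc and weighted mean-difference columns a_j.
    Yc_cols, A_cols = [], []
    for mu_x, m_j, Yj in zip(class_means, class_counts, new_class_samples):
        n_j = Yj.shape[1]
        mu_y = Yj.mean(axis=1, keepdims=True)
        Yc_cols.append(Yj - mu_y)
        A_cols.append(np.sqrt(m_j * n_j / (m_j + n_j)) * (mu_x - mu_y))
    B = np.hstack(Yc_cols + A_cols)                  # columns of [Yc A]

    # New directions orthogonal to the previous restricted range space.
    V_alpha = orth(B - U_alpha @ (U_alpha.T @ B))
    basis = np.hstack([U_alpha, V_alpha])

    # Small matrix M_alpha of Eq. (7) and its eigendecomposition.
    P = basis.T @ B
    M = P @ P.T
    k_a = U_alpha.shape[1]
    M[:k_a, :k_a] += np.diag(lam_alpha)
    lam_t, R_t = np.linalg.eigh(M)
    order = np.argsort(lam_t)[::-1]
    lam_t, R_t = np.clip(lam_t[order], 0.0, None), R_t[:, order]

    # beta = (1 - alpha) * tr(Lam_alpha) / tr(Lam_tilde_1) + alpha
    beta = (1.0 - alpha) * lam_alpha.sum() / lam_t.sum() + alpha

    # Keep the leading eigenpairs whose eigenvalues sum to beta * tr(M_alpha).
    k = int(np.searchsorted(np.cumsum(lam_t), beta * lam_t.sum())) + 1
    return basis @ R_t[:, :k], lam_t[:k]             # U_alpha', Lam_alpha'
```

After this update, the class means and counts must also be refreshed (x̄_j ← (m_j x̄_j + n_j ȳ_j)/(m_j + n_j) and m_j ← m_j + n_j), and Steps 3–5 of Algorithm 1 then yield the new common vectors and the final projection W'.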

Fig. 7. Sample images from the three databases used, along with the corresponding details. C is the number of classes, M_j is the total number of samples per class, m_j is the number of samples per class in X as the incremental algorithm progresses, and n_j is the number of samples per class in each block, Y.

Once the basis of the restricted range space has been updated using this algorithm, the DCVs and the final projection are computed as in Steps 3–5 of the basic GDCV algorithm in Section II.

D. Particular Cases of Unitary Updates and New Classes in Y

The IGDCV algorithm and all the derived expressions, including the decomposition of the intraclass scatter matrix, are also valid in several particular cases, with some precautions. In the case of having one single sample of a particular class in Y, that is, n_j = 1 for some j, the corresponding mean is ȳ_j = y_1^j and all expressions to compute A are still valid. The corresponding column in Y_c will consequently be the zero vector and it can be safely removed. If the same happens for all classes, that is, if n_j = 1 for all j, then the whole matrix Y_c can be replaced by the zero matrix (S_w^Y = 0), and then S_w^Z = S_w^X + A A^T.

If some of the data vectors in Y correspond to new classes, which are not present in X, the expressions of Section III-C are valid by extending the value of C and setting m_j = 0 in X for all new classes. If either m_j or n_j is zero for a class j, the corresponding mean is undefined and its column in A, a_j, should be set to zero. If all the data vectors in Y correspond to new classes, then the whole matrix A is the zero matrix and can be removed from all expressions. In this case, the within-class scatter matrix becomes S_w^Z = S_w^X + S_w^Y.

E. Computational Cost

The overall cost of the proposed incremental update is dominated by the cost of Step 7, O(dr^2), where r is the expected rank of the preserved range space, which heavily depends on the parameter α. Note that the behavior of the algorithm will heavily depend on how the expected rank of the restricted range space grows at each incremental update. The worst case of the proposed algorithm corresponds to α = 1, which leads to an exact solution that is equivalent to the DCV and IDCV algorithms.
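The particular cases of Section III-D above reduce to a few guards when assembling Y_c and A. The sketch below is one possible way to encode them (our own naming and data layout, not the authors'); the column order is irrelevant because only the product B B^T or an orthonormal basis of the columns is ever used.

```python
import numpy as np

def block_matrices(classes):
    """Build the columns of [Yc A] for a new block Y, covering the particular
    cases of Section III-D. `classes` is a list of tuples (mu_x, m_j, Yj) where
    mu_x is None and m_j = 0 for classes not present in X, and Yj is None for
    classes receiving no new samples."""
    cols = []
    for mu_x, m_j, Yj in classes:
        if Yj is None:                       # class only in X: nothing new to add
            continue
        n_j = Yj.shape[1]
        mu_y = Yj.mean(axis=1, keepdims=True)
        if n_j > 1:                          # n_j = 1: the Yc column is zero, drop it
            cols.append(Yj - mu_y)
        if m_j > 0:                          # m_j = 0 (new class): a_j = 0, drop it
            cols.append(np.sqrt(m_j * n_j / (m_j + n_j)) * (mu_x - mu_y))
    return np.hstack(cols) if cols else None
```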

Even though previously proposed incremental approaches [17], [18] are clearly more efficient, as no eigendecomposition and rotation are needed when α = 1, they are only valid in the SSS case, whereas our proposal generalizes the IDCV under all possible circumstances. Observe that the incremental formulation in Sections III-A–D could be easily changed to obtain an incremental algorithm with a parameter k, the number of dimensions to keep, instead of α, the amount of scatter to preserve.

Regardless of how the rank of the preserved range space, r, is kept small, the true benefit of the incremental algorithm is obtained when r is significantly smaller than M + N, which is the rank of the data set Z in the case of linear independence. Note that the advantage of the proposed IGDCV over IDCV will greatly increase in large-scale and stream-like problems, where both the dimension, d, and the stream length, M, are very large, whereas the expected rank is roughly constant.

IV. EXPERIMENTS AND RESULTS

This section describes the experiments carried out to validate and assess the proposed IGDCV algorithm, both in terms of how it approximates the performance of the batch GDCV method for different values of α and also by measuring the decrease in computational time between the incremental and batch learning algorithms. To this end, two different empirical scenarios have been considered: 1) having a model with a fixed number of classes (adding samples), whose size is continuously updated at different rates, and 2) having a model that is continuously updated by adding samples from strictly new classes (adding classes). These two cases, along with their range of parameter values, are representative of the wide range of possible cases that have been outlined in Section III-D.

In all the experiments, an initial model is obtained (using the corresponding batch algorithm) and then small amounts of data (that is, the Y set) are progressively added to the previously learned model. From one iteration to the next in the incremental algorithm, only the restricted range space basis, U_α, and its corresponding eigenvalues, Λ_α, along with the updated values x̄_j and m_j for each class, are considered. Note that, by applying IGDCV along a number of iterations, approximation errors are expected to be magnified. In this way, it will be possible to better observe potential deviations of the approximate results given by our proposal from the ideal ones obtained with the corresponding batch algorithm.

Apart from measuring CPU times, the final projections obtained at each iteration of the incremental algorithm are used to construct the nearest neighbor classifier in the transformed space, as in all previous, closely related studies [6], [10]. All classifiers obtained during the process are assessed by measuring their corresponding error rate on a separate test set. The whole incremental run (which has a varying number of iterations depending on n_j) has been repeated ten times for different random partitions of the database into three subsets: 1) initial training; 2) incremental training; and 3) test sets. The sizes of these sets are ∼20%, 50%, and 30% of the total number

Fig. 8. Left: accuracy versus training samples per class (m_j) in the GDCV method for different values of α. Right: corresponding increase in the dimensionality of the restricted within-class range space. (a) and (b) Cmu–Pie. (c) and (d) Coil-20. (e) and (f) Mnist.

of images available depending on the particular number of classes and samples per class in each data set. For each run of each particular experiment, the incremental training set has

been randomly split into blocks composed either of a given number of samples from each class (adding samples scenario) or a given number of new classes (adding classes scenario).
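Returning to the evaluation protocol described above, the nearest neighbor assessment in the transformed space can be sketched as follows, assuming that W and the common vectors X_com come from the earlier sketches and that labels are integer class indices (our own conventions, not the authors' code):

```python
import numpy as np

def nn_error_rate(W, X_com, X_test, y_test):
    """Assign each test column to the class whose projected common vector is
    closest, and return the resulting error rate."""
    refs = W.T @ X_com                     # one (C-1)-dimensional vector per class
    proj = W.T @ X_test
    d2 = ((proj[:, None, :] - refs[:, :, None]) ** 2).sum(axis=0)   # C x n_test
    pred = d2.argmin(axis=0)
    return float((pred != np.asarray(y_test)).mean())
```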

Fig. 9. Accuracy (left) and rank of the restricted within-class range space (right) computed by the incremental algorithm versus training samples per class (m_j) for different incremental set sizes (n_j) and two fixed values α = 0.95 and α = 0.85. (a) and (b) Cmu–Pie. (c) and (d) Coil-20. (e) and (f) Mnist.

This scheme is graphically shown in Fig. 6. The final CPU times and error rates presented are the corresponding averages. All data sets in the experiments consist of 8-bit gray level images with values in the range [0, 1]. The face images in the

corresponding data sets have been aligned using central points in eyes and mouth as references. The binary images of digits have been converted to gray level ones using the distance transform [18], [24]. The image resolution in

Fig. 10. CPU time in seconds (left) and CPU relative time in % (right) versus training samples per class (m_j) in the IGDCV and GDCV methods for different incremental set sizes (n_j) and a fixed value α = 0.95. (a) and (b) Cmu–Pie. (c) and (d) Coil-20. (e) and (f) Mnist.

all databases was set to a relatively low value without significantly degrading the final performance of the classifiers in order to study the behavior of the algorithms at the

boundary of the SSS case. All algorithms have been run on a computer with a dual-core AMD processor at 1 GHz and 8 GB of RAM.
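Regarding the preprocessing mentioned above, one possible distance-transform conversion of binary digit images to gray level is sketched below; the exact variant used in [18], [24] may differ, and the clipping value is an arbitrary choice of ours.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def binary_to_gray(binary_img, clip=5.0):
    """Convert a binary digit image to a gray-level image in [0, 1] using the
    Euclidean distance transform (one possible variant, not necessarily the
    one used in the paper)."""
    # Distance of every background pixel to the nearest foreground pixel.
    dist = distance_transform_edt(binary_img == 0)
    return 1.0 - np.clip(dist, 0.0, clip) / clip   # foreground = 1, fades to 0
```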

Fig. 11. Sample images and corresponding properties of the two databases used in the experiments adding new classes. C is the total number of classes, M_j is the total number of samples per class in the data set, m_j = n_j is the number of samples per class used for training, and C_x and C_y are the numbers of classes in the initial and incremental sets, respectively.

A. Experiments Increasing the Number of Samples per Class

For these experiments, three databases containing images of faces, objects, and digits have been considered. First, the Cmu–Pie data set [25], which is composed of pictures of faces with changes in pose, expression, and illumination, has been selected. In addition, the Coil-20 data set [26], which contains images of different objects with viewpoints changing from 0◦ to 355◦ in increments of 5◦, and the Mnist data set [27], which contains a variety of examples for the digits from 0 to 9, have been considered. The main features of these data sets are shown in Fig. 7. In particular, their dimensionality and the range of sizes (per class) of the corresponding X (m_j) and Y (n_j) sets for all particular runs in this experiment are shown.

A common feature of the three databases considered is that α = 1 is not the best option when applying the GDCV method. In other words, the results of the plain DCV approach can be improved by removing some of the variability in the range subspace. To see this, and to use the corresponding results as a reference, the batch GDCV algorithm with different values of α has been applied to training data from all databases with the same effective sizes (the ranges in the m_j column in Fig. 7) that will be used later on in the incremental algorithms. The corresponding results are graphically shown in Fig. 8, where the averaged accuracies obtained with different values of α are plotted for increasing training set sizes at the left-hand side of the figure. The ranks corresponding to the obtained restricted within-class range spaces are shown at the right-hand side of the same figure.

For these and the following figures, the dark-shaded region in the plots indicates that the ratio of the dimension, d, to the sample size, C · m_j, is below 1 (and the setting α = 1 is not possible [6]). Note that this condition is only fulfilled by Cmu–Pie when m_j ≥ 24. In addition, the light-shaded region indicates that this ratio is roughly below 1.5 and above 1. It can be clearly observed in the case of the Cmu–Pie and Mnist databases that the performance results for α = 1 degrade as the ratio of dimension to sample size gets closer to 1.

Before this point, all settings lead to improvements as the number of training samples increases. In general, α values
