

Autogrouped Sparse Representation for Visual Analysis

Jiashi Feng, Xiao-Tong Yuan, Zilei Wang, Member, IEEE, Huan Xu, and Shuicheng Yan, Senior Member, IEEE

Abstract— In image classification, recognition, or retrieval systems, image contents are commonly described by global features. However, global features generally contain noise from the background, occlusion, or irrelevant objects in the images. Thus, only part of the global feature elements is informative for describing the objects of interest and useful for image analysis tasks. In this paper, we propose algorithms to automatically discover subgroups of highly correlated feature elements within predefined global features. To this end, we first propose a novel mixture sparse regression (MSR) method, which groups the elements of a single vector according to the membership conveyed by their sparse regression coefficients. Based on MSR, we proceed to develop the autogrouped sparse representation (ASR), which groups correlated feature elements together by fusing their individual sparse representations over multiple samples. We apply ASR/MSR to two practical visual analysis tasks: 1) multilabel image classification and 2) motion segmentation. Comprehensive experimental evaluations show that our proposed methods achieve superior performance compared with the state-of-the-art methods on these two tasks.

Index Terms— Object recognition, image classification, sparse coding.

I. INTRODUCTION

MOST current image classification, recognition, and retrieval systems represent images by global features, which are statistical aggregations of local features [1]. Each element within the global feature vector corresponds to a certain

visual pattern. For example, in the bag-of-words feature [2], each element corresponds to a "visual word". Based on the obtained global feature vectors, image similarity is measured by the distance between these vectors for subsequent image recognition tasks. Typically, the distance is calculated by treating the feature vector as a whole [1]. Though this computation is often efficient, the strategy ignores the fact that different elements describe visual patterns from different objects contained in a single image. This may render the resulting image similarity inaccurate with respect to an object of interest, due to contamination by noise from other objects and the background in the same image. Thus, we cannot obtain correct ranking results based on global feature similarity. To handle such mutual interference of multiple objects in images, several previous works propose to perform segmentation or detection as a pre-processing step before feature extraction [3]. However, such pre-processing is quite complicated and its performance may not be satisfactory.

In this work, instead of segmenting the objects in advance, we propose to partition the extracted global feature vector into several sub-vectors. Each sub-vector consists of several "correlated" elements which together describe a typical visual pattern of an object, while the correlation between two different sub-vectors is weak. Thus, by treating each sub-vector individually in the distance calculation, visual patterns from different objects (i.e., non-correlated elements) are considered separately and no longer interfere with each other. We are then able to obtain more accurate image similarity specific to one object of interest, which is immune to interference from other objects and the background. This is similar in spirit to object segmentation; however, we process the global features directly instead of the raw images. We also show that our approach is much simpler and more efficient than object segmentation, yet its performance is satisfactory for image classification.

To automatically discover the subgroups within a feature vector, we propose a Mixture Sparse Regression (MSR) method. MSR is motivated by the problem of mixture regression, where only a mixture of observations from different regression models is provided and the individual regression models must be estimated. MSR identifies the elements sharing the same sparse regression coefficients with respect to a provided design matrix, and simultaneously estimates the values of those coefficients. Thus, the elements within the input vector sharing the same regression coefficients naturally form a subgroup. It is therefore straightforward to apply MSR to discover the subgroups of a visual feature vector.
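For intuition, here is a minimal sketch (ours, not from the paper) of the per-group distance computation that such a partition enables; the group index sets below are hypothetical placeholders standing in for discovered subgroups.

```python
import numpy as np

def groupwise_distances(x, y, groups):
    """Per-group Euclidean distances between two global feature vectors.

    x, y   : 1-D feature vectors of equal length p.
    groups : list of index arrays, one per element group C_k.
    Each visual pattern is compared without interference from
    elements belonging to other groups.
    """
    return np.array([np.linalg.norm(x[idx] - y[idx]) for idx in groups])

# Toy usage: p = 6 elements split into K = 2 hypothetical groups.
x = np.array([0.9, 0.8, 0.1, 0.0, 0.2, 0.1])
y = np.array([0.9, 0.7, 0.1, 0.9, 0.8, 0.7])  # differs mainly in group 2
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
print(groupwise_distances(x, y, groups))  # small for group 1, large for group 2
```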



Fig. 1. Illustration of the proposed auto-grouped sparse representation (ASR) method. The elements of the image-level representation vectors describe different visual patterns (denoted by different shapes in the vector, e.g., triangle, square). In ASR, the feature elements are divided into K groups according to their individual sparse representations. Each group represents a set of correlated elements. In turn, the element groups help identify the underlying group-wise sparse representations. Such mutual boosting is represented by the bi-directional arrows. Based on the group-wise sparse representations, a multigraph is constructed to describe the similarity between the input and the basis images. The similarity is determined by the sparse representation coefficients obtained by ASR. For better viewing, please refer to the color version.

We then extend MSR to the Auto-grouped Sparse Representation (ASR) method, which automatically learns element groups shared by multiple samples. More specifically, ASR performs a single sparse representation (SR) for the feature elements of each image with respect to all the other training samples, and meanwhile encourages correlated elements to share the same sparse representation model, as in MSR. Thus the correlated elements can be "auto-grouped" together by identifying whether they possess similar sparse representation models. This process is illustrated in Fig. 1. The similarity of images is then calculated based on the sparse representation coefficients.

SR has seen many successful applications in face recognition [4] and image classification [5], [6]. However, in traditional SR, the linear representation coefficients are obtained by best approximating the feature vector of the input sample as a linear combination of the whole basis feature vectors. In contrast, ASR calculates the SR coefficients with respect to the element groups separately. In other words, each feature group is associated with its individual SR coefficients. Thus ASR is able to alleviate the interference from irrelevant groups and provide better SR estimates.

Note that the proposed MSR and ASR are general methods and can also be applied to other intrinsic group identification tasks, such as motion segmentation. In this work, we examine their applicability in two practical visual analysis tasks. The first application is to build a multigraph via ASR [7] for more accurately classifying multi-label images. Compared with a conventional single-edge graph, our multigraph achieves state-of-the-art performance on the NUS-WIDE-LITE database. The second application is two-view motion segmentation. MSR segments the motion trajectories by grouping the corresponding mixture of linear regression models. Compared with previous well-performing methods [8], [9], MSR significantly decreases the segmentation error rates and offers more accurate and stable segmentation results.

II. RELATED WORK

The proposed work aims at automatically uncovering the group structures across multiple feature entries and simultaneously calculating the underlying sparse representations within


each group. The most intuitive approach to this problem is the Expectation-Maximization (EM) method [8]. EM regards the group assignments as hidden variables, and iterates between inference over the hidden variables and parameter estimation of the decoupled models until a local optimum is reached. Gaffney et al. [10] applied the EM method to trajectory clustering under the assumption that the motion trajectories are generated from a mixture regression model. The well-documented limitation of the EM method is the local nature of its optimization; the final solution is thus typically sensitive to initialization.

The second type of approach relies on convex relaxation. Quadrianto et al. [11] proposed to solve the regression model with a mixture of several regression vectors by relaxing the assignment variables into continuous ones. Their experimental results show that the convex formulation performs better than the EM method on a number of benchmark datasets. However, their formulation seems hard to generalize to the sparse representation setting. Indeed, to the best of our knowledge, there has been no effort on solving the sparse mixture regression problem in a convex optimization framework.

Sparse representation (SR) has proven to be a powerful tool in computer vision tasks including image classification, image enhancement, and motion segmentation [12]. Given a query datum, SR aims to find its parsimonious linear representation over an over-complete basis. By pursuing such parsimony, also called sparsity, SR can provide more interpretable models for further data analysis [13]. However, SR models the linear relationship among data from only one view; the estimation is thus sensitive to noise and outliers. To alleviate this issue, multi-task sparse representation (MTSR) has recently been advocated [6]. In MTSR, each task performs a single SR from one view of the data, and multiple concurrent tasks are encouraged to aggregate useful information from different views through proper joint constraints. Thus MTSR can produce a more accurate and robust representation of the data than a single SR. However, an important premise of MTSR is that the task structure is specified in advance, which can rarely be obtained in practice. In contrast, our proposed ASR method can automatically discover the intrinsic groups.

Our method is directly inspired by the convex relaxation of clustering [14], where the authors employ sparsity-inducing norms to enforce the fusion of data points. Sparsity-inducing norms have emerged as flexible tools that allow variable selection in penalized linear models [15], [16]. In this paper, we combine these lines of research into our framework of auto-grouped sparse representation.

III. MIXTURE SPARSE REGRESSION

Given p observations composing a vector y ∈ R^p, each element y_i corresponds to one observation. The regressors are provided as A ∈ R^{p×n}, and each row vector A_i is the regressor for y_i. There are K different regression models {ω_1, ..., ω_K} for the p observations. For each observation, we have y_i = A_i ω_i, where ω_i is the corresponding regression model. The goal of mixture sparse regression is to find the groups of observations in y and simultaneously estimate the parameters of the regression models. In particular, we have a



prior structure information for the regression models: most of the parameters are zero, namely the models are sparse. In this work, we propose a novel method to address this problem. It estimates the regression model for each observation individually and meanwhile enforces the estimated models to fuse into K models. The proposed objective function is as follows:

$$\min_{C_k,\, \omega_k} \; \sum_{k=1}^{K} \left( \frac{1}{2} \left\| y_{C_k} - A_{C_k} \omega_k \right\|_2^2 + \lambda \left\| \omega_k \right\|_1 \right), \qquad (1)$$

where C_k ⊆ {1, ..., p} is the set of feature element indices contained in the kth group, y_{C_k} denotes the elements of y indexed by C_k, and A_{C_k} denotes the rows of A indexed by C_k. In the above optimization problem, each element of y is assigned to its corresponding group such that the overall loss is minimized. The regularization term ‖·‖_1 accounts for the prior that the regression models are sparse.

The above objective function poses a combinatorial optimization problem and is in general computationally intractable. Following the relaxation technique introduced in [14], we replace the hard constraint on the number of groups with a fusion-encouraging constraint on the models {w_i}_{i=1}^p ⊂ R^n:

$$\min_{w_i} \; \frac{1}{2} \sum_{i=1}^{p} \left| y_i - A_i w_i \right|^2 + \lambda \sum_{i=1}^{p} \left\| w_i \right\|_1, \quad \text{s.t.} \quad \sum_{i<j} 1_{w_i \neq w_j} \le t. \qquad (2)$$
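As a quick illustration (ours, not from the paper) of how the constraint in (2) behaves, the following counts the distinct-model pairs for a stack of candidate w_i's; with t = 0 the feasible set collapses to identical models.

```python
import numpy as np

def n_distinct_pairs(W, tol=1e-8):
    """Left-hand side of the constraint in (2): the number of
    pairs (i, j), i < j, with w_i != w_j (rows of W)."""
    p = len(W)
    return sum(np.any(np.abs(W[i] - W[j]) > tol)
               for i in range(p) for j in range(i + 1, p))

W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(n_distinct_pairs(W))  # 2: rows 0 and 1 are fused, row 2 is separate
```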

The indicator function 1_{w_i ≠ w_j} takes value 1 if w_i ≠ w_j and 0 otherwise. Intuitively, the constraint Σ_{i<j} 1_{w_i ≠ w_j} ≤ t limits the number of distinct w_i's and thus constrains the number of groups. When t ≥ p(p − 1)/2, each w_i may form an individual group. As t decreases, more w_i's are forced to take the same value; when t = 0, all of the w_i's are identical.

However, the combinatorial problem (2) is hard to solve. To circumvent this difficulty, we replace the indicator 1_{w_i ≠ w_j} by ‖w_i − w_j‖_∞ [14], which convexifies problem (2) to:

$$\min_{w_i \in \mathcal{D}} \; \sum_{i=1}^{p} \left( \frac{1}{2} \left| y_i - A_i w_i \right|^2 + \lambda \left\| w_i \right\|_1 \right), \quad \text{s.t.} \quad \sum_{i<j} \left\| w_i - w_j \right\|_\infty \le t.$$

Similar to the indicator, the constraint above also encourages most of the w_i's to be the same and fuses them together. We also introduce a convex constraint set D = {w : ‖w‖_2 ≤ D} to prevent the magnitude of w_i from becoming arbitrarily large. The above problem is equivalent to the following penalized form with an appropriate choice of penalty parameter β:

$$\min_{w_i \in \mathcal{D}} \; \sum_{i=1}^{p} \left( \frac{1}{2} \left| y_i - A_i w_i \right|^2 + \lambda \left\| w_i \right\|_1 \right) + \beta \sum_{i<j} \left\| w_i - w_j \right\|_\infty. \qquad (3)$$

Obviously, the objective function in (3) contains a smooth loss term and two non-smooth regularization terms. We write them separately as follows:

$$f_M(w) := \frac{1}{2} \sum_{i=1}^{p} \left| y_i - A_i w_i \right|^2,$$

$$r_M(w) := \lambda \sum_{i=1}^{p} \left\| w_i \right\|_1 + \beta \sum_{i<j} \left\| w_i - w_j \right\|_\infty.$$

Here f_M(w) is the smooth loss function and r_M(w) is the non-smooth regularization. Though the non-smooth terms do not destroy the convexity of the problem, they generally slow down the convergence of traditional convex solvers (e.g., gradient descent). Fortunately, problems with such non-smooth terms can be solved by the smoothing approximation technique introduced in [17], whose details are given in the next section.

In practice, the recovered regression models w_i may not readily form K distinct groups, due to noise in the data. To obtain the final K groups, we build an affinity graph of the w_i's and then cluster them into K groups by spectral clustering [18]. This also provides the feature element groups {y_{C_k}}_{k=1}^K accordingly. Finally, we refine the estimates {ω_k}_{k=1}^K by performing sparse regression on each group separately.
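The post-processing just described, together with the final per-group refinement, can be sketched as follows. This is our illustration, not the authors' code: the Gaussian kernel on pairwise ℓ∞ distances is an assumed affinity choice (the paper specifies only an affinity graph over the w_i's followed by spectral clustering [18]), and scikit-learn's Lasso is used for the per-group sparse regression, with its alpha rescaled to match the (1/2)‖·‖² + λ‖·‖₁ form of (1).

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso

def group_and_refine(W, y, A, K, lam, sigma=1.0):
    """Post-process MSR solutions: cluster the recovered per-observation
    models w_i into K groups, then refit one sparse model per group.

    W : (p, n) matrix whose i-th row is the learned model w_i.
    y : (p,) observation vector;  A : (p, n) design matrix.
    """
    # Affinity graph over the w_i's; the Gaussian kernel on l_inf
    # distances is an assumption (the paper does not fix a kernel).
    d = np.max(np.abs(W[:, None, :] - W[None, :, :]), axis=2)
    affinity = np.exp(-(d ** 2) / (2 * sigma ** 2))
    labels = SpectralClustering(n_clusters=K, affinity='precomputed',
                                random_state=0).fit_predict(affinity)

    groups, models = [], []
    for k in range(K):
        idx = np.flatnonzero(labels == k)  # element group C_k
        # Refinement: sparse regression on each group separately.
        # sklearn's Lasso minimizes (1/(2*n))||y - Aw||^2 + alpha*||w||_1,
        # so alpha = lam / |C_k| matches (1/2)||.||^2 + lam*||.||_1 in (1).
        reg = Lasso(alpha=lam / len(idx), fit_intercept=False, max_iter=10000)
        reg.fit(A[idx], y[idx])
        groups.append(idx)
        models.append(reg.coef_)
    return groups, models
```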

IV. AUTO-GROUPED SPARSE REPRESENTATION

In this section, we introduce the novel auto-grouped sparse representation (ASR) method, an extension of MSR, to automatically identify the intrinsic element groups of image global feature vectors. Each group consists of heavily correlated feature elements, while irrelevant feature elements are isolated into different groups. Thus, the negative interference among different objects can be effectively alleviated by simply treating the feature subgroups separately.

Before presenting the algorithmic details, we first illustrate the basic idea of ASR with the example in Fig. 1. For an input image, the elements in its global feature that describe different visual patterns from the same object (e.g., human eyes and mouth, denoted as triangle and star in the figure) are heavily correlated. These elements should be grouped together and admit an identical sparse representation model over the basis. In this sparse representation, only the basis images containing such a human face pattern are activated. Similarly, the other elements of the feature are also partitioned into several groups by ASR. Each group contains several correlated feature elements which together describe a certain characteristic visual pattern. Thus, ASR constructs multiple sparse representations over the basis for a single image. Each sparse representation can be used to build a single-edge graph, and combining the multiple single-edge graphs induces a multigraph. Such a multigraph provides more flexible and accurate image similarity than a traditional single-edge graph, since each edge is constructed without the effect of irrelevant feature elements.

Formally, given n image features {y^1, ..., y^n} ⊂ R^p, we aim at finding K non-overlapping groups of the feature elements in each feature y^j: {y^j_{C_1}, ..., y^j_{C_K}}. Here C_k ⊆ {1, ..., p} is an index set indicating the feature elements in the kth group. Each group consists of correlated feature elements



and describes a specific mid-level visual pattern. ASR finds the groups by performing mixture sparse regression for all the features w.r.t. a provided basis A ∈ R^{p×m}, and reveals the element groups by investigating the affinity of their sparse regression coefficients. ASR solves the following problem:

$$\min_{C_k,\, \omega_k^j} \; \sum_{j=1}^{n} \sum_{k=1}^{K} \frac{1}{2} \left\| y_{C_k}^j - A_{C_k} \omega_k^j \right\|_2^2 + \lambda \sum_{j=1}^{n} \sum_{k=1}^{K} \left\| \omega_k^j \right\|_1. \qquad (4)$$
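With the shared partition {C_k} held fixed, (4) decouples into independent Lasso fits, one per image j and group k, yielding the group-wise coefficients ω_k^j that define the edge layers of the multigraph. A sketch under the same scikit-learn rescaling convention as the MSR refinement sketch above; this is a hypothetical helper, not the paper's solver, which optimizes the partition and the coefficients jointly via the relaxation below.

```python
import numpy as np
from sklearn.linear_model import Lasso

def asr_coefficients(Y, A, groups, lam):
    """Group-wise sparse representations for a fixed shared partition.

    Y      : (p, n) matrix whose j-th column is the image feature y^j.
    A      : (p, m) shared basis.
    groups : list of K index arrays C_k (shared across all images).
    Returns coef[j][k] = omega_k^j, the coefficients defining the kth
    edge layer of the multigraph for image j.
    """
    coef = []
    for j in range(Y.shape[1]):
        per_group = []
        for idx in groups:
            reg = Lasso(alpha=lam / len(idx), fit_intercept=False,
                        max_iter=10000)
            reg.fit(A[idx], Y[idx, j])
            per_group.append(reg.coef_)
        coef.append(per_group)
    return coef
```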

Note that the objective function of ASR in Equation (4) is essentially different from that of MSR in Equation (1). Applying MSR in a straightforward way would yield a different partition for each target feature. In contrast, ASR enforces the groups to be consistent across all the input features. The rationale is that element groups derived from multiple samples are more robust and reliable than those relying on a single sample.

Similar to MSR, the objective function of ASR in Equation (4) can be relaxed to:

$$\min_{w_i^j \in \mathcal{D},\, c(i)} \; \sum_{j=1}^{n} \sum_{i=1}^{p} \left( \frac{1}{2} \left| y_i^j - A_i w_i^j \right|^2 + \lambda \left\| w_i^j \right\|_1 \right) + \beta \sum_{i,i'=1}^{p} 1_{c(i) = c(i')} \sum_{j=1}^{n} \left\| w_i^j - w_{i'}^j \right\|_\infty,$$

$$\text{s.t.} \quad c(i) \in \{1, \ldots, K\}, \; \forall i. \qquad (5)$$

Here c(i) maps the feature element index i to the index of one of the K clusters. The regularization enforces the difference between the coefficients w_i^j and w_{i'}^j to be small. To see this, note that ‖w_i^j − w_{i'}^j‖_∞ ≥ 0, and therefore Σ_{j=1}^n ‖w_i^j − w_{i'}^j‖_∞ equals the ℓ_1 norm of the vector of coefficient differences [‖w_i^1 − w_{i'}^1‖_∞, ..., ‖w_i^n − w_{i'}^n‖_∞] ∈ R^n. The sparsity-inducing property of the ℓ_1 norm encourages most of the differences ‖w_i^j − w_{i'}^j‖_∞ to be zero. When the elements i and i' do not enter the same cluster, 1_{c(i)=c(i')} = 0 and the difference of the corresponding vectors is not penalized.

The objective function in (5) has a similar structure to that of MSR in (3): it also contains a smooth term plus two non-smooth terms:

$$f_A(w) := \frac{1}{2} \sum_{j=1}^{n} \sum_{i=1}^{p} \left| y_i^j - A_i w_i^j \right|^2,$$

$$r_A(w) := \lambda \sum_{j=1}^{n} \sum_{i=1}^{p} \left\| w_i^j \right\|_1 + \beta \sum_{i,i'=1}^{p} 1_{c(i) = c(i')} \sum_{j=1}^{n} \left\| w_i^j - w_{i'}^j \right\|_\infty.$$

Algorithm 1 Auto-Grouped Sparse Representation (ASR)
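To make the cross-sample fusion term in r_A concrete, here is a direct, unoptimized evaluation of it (ours; the array layout is an assumption):

```python
import numpy as np

def asr_fusion_term(W, labels):
    """Fusion part of r_A: sum over element pairs in the same cluster of
    sum_j ||w_i^j - w_{i'}^j||_inf.

    W      : (p, n, m) array with W[i, j] = coefficient vector w_i^j.
    labels : (p,) cluster assignment c(i).
    Each unordered pair is counted once here; summing ordered pairs as
    written in (5) would simply double the value (the diagonal is zero).
    """
    p = W.shape[0]
    total = 0.0
    for i in range(p):
        for i2 in range(i + 1, p):
            if labels[i] == labels[i2]:  # indicator 1_{c(i) = c(i')}
                # l_inf norm per image j, then summed over j
                total += np.sum(np.max(np.abs(W[i] - W[i2]), axis=1))
    return total
```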
