Use of customizing kernel sparse representation for hyperspectral image classification

Bin Qi,1,2 Chunhui Zhao,1,* and Guisheng Yin2

1 College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
2 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
*Corresponding author: [email protected]

Received 2 September 2014; revised 3 November 2014; accepted 5 December 2014; posted 10 December 2014 (Doc. ID 222150); published 23 January 2015

Sparse representation-based classification (SRC) has attracted increasing attention in the remote-sensed hyperspectral community for its competitive performance against established classification algorithms. Kernel sparse representation-based classification (KSRC) is a nonlinear extension of SRC that makes pixels from different classes linearly separable. However, KSRC only projects data from the original space into the feature space with a predefined parameter, without integrating a priori domain knowledge, such as the contributions of different spectral features. In this study, customizing kernel sparse representation-based classification (CKSRC) is proposed by incorporating the kth nearest neighbor density as a weighting scheme in traditional kernels. Analyses were conducted on two publicly available data sets. In comparison with other classification algorithms, the proposed CKSRC further increases the overall classification accuracy and presents robust classification results under different selections of training samples. © 2015 Optical Society of America

OCIS codes: (100.0100) Image processing; (280.0280) Remote sensing and sensors; (110.2960) Image analysis.
http://dx.doi.org/10.1364/AO.54.000707

1. Introduction

Hyperspectral image classification is a basic research topic in remote sensing and plays a fundamental role in a variety of paramount applications, such as food processing, environmental survey, geological investigation, and military applications [1,2]. Hyperspectral images are captured by remote sensing sensors in hundreds of narrow and approximately continuous spectral channels. Pixels are represented by spectral vectors, with each entry corresponding to a spectral channel [3]. Generally, the spectral characteristic of an object is reflected at specific wavelengths, even for objects of the same species but different types [4]. Thanks to remote sensing technology, this natural phenomenon can be observed by remote sensing sensors over a wide spectral range. Such

a large number of spectral channels implies high-dimensional data, which enables the discrimination of objects but also presents potential challenges for classification algorithms [5]. The purpose of hyperspectral image classification is to use labeled training samples from several distinct object classes to correctly determine the class to which a new test sample belongs [4].

The support vector machine (SVM) was proposed by Cortes and Vapnik as a classification approach in the fields of pattern recognition and machine learning based on the structural risk minimization principle [6]. It has proved to be an efficient high-dimensional classification algorithm, which aims at providing a tradeoff between hypothesis space complexity and the quality of fitting the training data. Many previous publications have shown competitive performance when applying SVMs to hyperspectral image classification [7–9]. Various variations of SVM-based algorithms have also been proposed to improve the classification performance. In [10], Chen et al. proposed a novel subspace mechanism, named optimizing subspace SVM ensemble (OSSE), as an improvement of the random subspace SVM ensemble (RSSE), in which discriminating subspaces are selected for the individual SVMs. A genetic algorithm is incorporated to optimize the selected subspaces, with the Jeffries–Matusita distance adopted as a criterion. Xue et al. proposed a hyperspectral image classification approach named HAPSO-SVM by integrating harmonic analysis, particle swarm optimization, and the support vector machine in [11]. Initially, harmonic analysis is applied to transform pixels from the spectral domain to the frequency domain. Then particle swarm optimization is adopted to optimize the penalty parameter and kernel parameter for the SVM. Finally, the extracted features are classified with the optimized model. In [12], Gurram and Kwon proposed a kernel-based contextual classification approach built on Hilbert space embedding. The proposed contextual support vector machine jointly exploits both local spectral and spatial information in a reproducing kernel Hilbert space. A novel hierarchical semi-supervised SVM was proposed by Shao et al. [13], in which unlabeled samples are utilized by means of their cluster features. In [14], Gu and Feng optimized the Laplacian support vector machine by introducing distance metric learning. In the procedure of constructing a graph with distance metric learning, equivalence and nonequivalence pairwise constraints are imposed to better capture the similarity of samples from different classes.

Sparse representation of signals has found use in various applications, such as face recognition [15], pedestrian detection [16], image restoration [17], image similarity assessment [18], and hyperspectral image classification [19].
One of its instantiations, the minimum description length principle [20], stipulates that in high-dimensional data, the information that yields the most compact representation should be preferred for specific decision-making tasks such as classification [15]. In sparse representation, a natural signal can be represented as a linear combination of basis atoms from an overcomplete dictionary (i.e., the number of basis atoms exceeds the dimension of the signal). The sparse dictionary can be constructed either from a mathematical model of the training data or from the training data directly [21]. The purpose of sparse representation is to reconstruct the original signal by a compact representation. Much work has been done on searching for an optimal representation that is sufficiently sparse or meets a required sparsity level [22,23]. K-SVD is a well-known algorithm for constructing the sparse dictionary [22]. To learn an overcomplete dictionary for a set of training signals, K-SVD seeks the sparse dictionary leading to the best possible representation of each signal under strict sparsity constraints. It works as an iterative scheme that alternates between sparse coding of the training data with respect to the current dictionary and an update process for the dictionary atoms so as to better fit the training data [18]. Even though an immense variety of models have been applied for exploiting the structure of the dictionary, one particularly simple approach that constructs the dictionary directly from the training data works effectively [15]. In this study, the sparse dictionary is constructed in this manner.

In the classification of hyperspectral images, it is often hard to linearly separate pixels of the same species but different types. Kernel methods are efficient approaches that enable algorithms to operate in a high-dimensional, implicit feature space without computing the mapping of the data, but rather by simply computing the inner products between all pairs of data in the feature space [24]. Many researchers have incorporated kernel methods into the calculation of sparse representations (denoted kernel sparse representation) [19,21,25,26]. However, the full potential of kernel sparse representation has not been fully explored; for example, kernels can be customized by integrating a priori knowledge from the training data [3]. In this study, kth nearest neighbor density estimation (kNNDE) is developed and incorporated as a spectral weighting scheme in the customizing kernels. The design of the kNNDE spectral weighting scheme is based on the assumption that important features should have a closer kth nearest neighbor (i.e., a pixel from the same class that has the smallest distance to the query pixel) and a farther kth nearest enemy (i.e., a pixel from a different class that has the smallest distance to the query pixel), whereas noisy features exhibit the opposite behavior. As such, spectral features with higher discrimination should be given comparatively higher weights, promoting their contributions to the classification.

For convenient citation, we use the following abbreviations: "SVM" refers to the support vector machine, "CSVM" to the support vector machine with customizing kernel, "SRC" to the sparse representation-based classifier, "KSRC" to the kernel sparse representation-based classifier, and "CKSRC" to the customizing kernel sparse representation-based classifier. Notation-wise, scalars are denoted by lowercase letters. Vectors and matrices are denoted by lowercase and uppercase bold letters, respectively. The remainder of this paper is organized as follows. In Section 2, we review traditional SRC and KSRC, and introduce a way to incorporate kth nearest neighbor density estimation as a weighting scheme in customizing kernels. In Section 3, we introduce the data sets and experimental design. Experiments are carried out to assess the performance of the proposed CKSRC in comparison with SVM, CSVM, SRC, and KSRC in Section 4. Finally, this paper ends with the conclusions in Section 5.

2. Hyperspectral Image Classification Based on Sparse Representation

The basic problem for hyperspectral image classification is to use labeled training samples from c classes to correctly assign a class label to a new test sample. Given sufficient training samples with d spectral bands, the sparse dictionary for the ith class is arranged as the matrix D_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}] \in \mathbb{R}^{d \times n_i}, whose columns are the n_i training samples \{a_{i,j}\}_{j=1,2,\ldots,n_i} (referred to as atoms) from the ith class, and n = \sum_{i=1}^{c} n_i is the total number of training samples.

A. Sparse Representation-Based Classifier

In sparse representation, it is assumed that the signal can be considered as a linear summation of basis atoms with associated sparse coefficients [15]. Thus, an unknown test sample x \in \mathbb{R}^d will approximately lie in the linear span of the atoms from its own class. Suppose the test sample x comes from the ith class; then x can be represented as

x = \alpha_{i,1} a_{i,1} + \alpha_{i,2} a_{i,2} + \cdots + \alpha_{i,n_i} a_{i,n_i},   (1)

where \alpha_i = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}]^T \in \mathbb{R}^{n_i} is the sparse coefficient vector associated with the ith class. Since the membership of the test sample x is initially unknown, a joint dictionary is defined by concatenating all n training samples from all c classes as

D = [D_1, D_2, \ldots, D_c] = [a_{1,1}, a_{1,2}, \ldots, a_{c,n_c}],   (2)

where each column of D is a basis atom of the joint dictionary. Thus, the linear representation of x can be written in terms of the joint dictionary as

x = D\alpha,   (3)

where \alpha = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T \in \mathbb{R}^n is the sparse coefficient vector whose entries are zero except those associated with the ith class. The index set on which \alpha has nonzero entries is the support of \alpha. The number of nonzero entries in \alpha is called the sparsity level and is denoted by K = \|\alpha\|_0. Given the joint dictionary D, the sparse coefficient vector \alpha is obtained by solving the following optimization problem:

\hat{\alpha} = \arg\min_{\alpha} \|x - D\alpha\|_2 \quad \text{subject to} \quad \|\alpha\|_0 \le K_0,   (4)

where K_0 is a preset upper bound on the sparsity level. The problem in Eq. (4) is NP-hard (nondeterministic polynomial-time hard), but it can be approximately solved by greedy algorithms such as orthogonal matching pursuit (OMP) [27]. For the ith class, let \delta_i : \mathbb{R}^n \to \mathbb{R}^n be the function that selects the sparse coefficients associated with the ith class. That is, for any \hat{\alpha} \in \mathbb{R}^n, \delta_i(\hat{\alpha}) \in \mathbb{R}^n is a column vector whose nonzero entries are the elements in \hat{\alpha} that are associated with the ith class. Using only the coefficients associated with the ith class, one can approximate the given test sample x as D\delta_i(\hat{\alpha}). The sample x can then be classified based on these approximations by assigning it to the class that minimizes the reconstruction error between x and D\delta_i(\hat{\alpha}):

\text{Class}(x) = \arg\min_{i=1,2,\ldots,c} \|x - D\delta_i(\hat{\alpha})\|_2.   (5)
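The SRC procedure of Eqs. (4) and (5) can be sketched as follows. This is an illustrative NumPy implementation under our own naming, not the authors' code; the greedy solver is a plain OMP, and any dictionary, labels, and sparsity level supplied to it are toy values.

```python
import numpy as np

def omp(D, x, K0):
    """Greedy orthogonal matching pursuit: approximately solve
    min ||x - D a||_2 subject to ||a||_0 <= K0, as in Eq. (4)."""
    residual = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(K0):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit on the current support, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha[support] = coef
    return alpha

def src_classify(D, labels, x, K0):
    """Assign x to the class minimizing the reconstruction error, Eq. (5)."""
    labels = np.asarray(labels)
    alpha = omp(D, x, K0)
    errors = {}
    for c in np.unique(labels):
        delta = np.where(labels == c, alpha, 0.0)  # delta_i(alpha_hat)
        errors[c] = np.linalg.norm(x - D @ delta)
    return min(errors, key=errors.get)
```

With a small dictionary whose columns are normalized training spectra and a label array marking each column's class, `src_classify(D, labels, x, K0)` returns the predicted class of a test spectrum `x`.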

B. Kernel Sparse Representation-Based Classifier

It is a common phenomenon that hyperspectral data are not linearly separable among classes. The kernel method provides a natural way to project the original data into a high-dimensional feature space [28]. A kernel function \kappa : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} is defined as the inner product

\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle,   (6)

where \phi(\cdot) is a feature map from \mathbb{R}^d to a Hilbert space. Commonly used kernels include the linear kernel, polynomial kernel, sigmoid kernel, and Gaussian radial basis function (RBF) kernel. In this study, the RBF kernel is applied to the sparse representation-based classifier, as given by

\kappa(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),   (7)
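As a concrete illustration, the RBF kernel of Eq. (7), and the pairwise kernel evaluations used throughout this section, can be computed as in the following NumPy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """Gaussian RBF kernel of Eq. (7): exp(-gamma * ||xi - xj||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def kernel_matrix(A, B, gamma=1.0):
    """Pairwise RBF kernel values between the columns of A and B."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, for all column pairs at once.
    sq = (np.sum(A ** 2, axis=0)[:, None]
          + np.sum(B ** 2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

Calling `kernel_matrix(D, D, gamma)` on the joint dictionary yields the matrix K_D introduced below, and `kernel_matrix(D, x[:, None], gamma)` yields the vector of kernel values between the atoms and a test sample.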

where \gamma is a parameter that controls the width and tunes the smoothing of the kernel function.

For easier illustration, the joint dictionary D is expressed as D = [a_1, a_2, \ldots, a_n], whose columns \{a_i\}_{i=1}^{n} are the n training samples from the c classes. Let x \in \mathbb{R}^d be a sample of interest and \phi(x) be its representation in the feature space. The kernel sparse representation of \phi(x) can be formulated as a linear combination of the basis atoms in the feature space as

\phi(x) = [\phi(a_1), \phi(a_2), \ldots, \phi(a_n)] [\alpha^{\phi}_1, \alpha^{\phi}_2, \ldots, \alpha^{\phi}_n]^T = D^{\phi} \alpha^{\phi},   (8)

where the columns of D^{\phi} are the representations of the basis atoms in the feature space and \alpha^{\phi} is the sparse coefficient vector of \phi(x). Similar to the optimization problem in the original space, \alpha^{\phi} can be recovered by solving the following problem [21,29]:

\hat{\alpha}^{\phi} = \arg\min_{\alpha^{\phi}} \|\phi(x) - D^{\phi} \alpha^{\phi}\|_2 \quad \text{subject to} \quad \|\alpha^{\phi}\|_0 \le K_0.   (9)

Let K_D \in \mathbb{R}^{n \times n} be the kernel matrix

K_D = \begin{pmatrix} \kappa(a_1, a_1) & \kappa(a_1, a_2) & \cdots & \kappa(a_1, a_n) \\ \kappa(a_2, a_1) & \kappa(a_2, a_2) & \cdots & \kappa(a_2, a_n) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(a_n, a_1) & \kappa(a_n, a_2) & \cdots & \kappa(a_n, a_n) \end{pmatrix},   (10)

where the (i,j)th entry of K_D is \kappa(a_i, a_j), and let \kappa_{D,x} = [\kappa(a_1, x), \kappa(a_2, x), \ldots, \kappa(a_n, x)]^T \in \mathbb{R}^n be the column vector whose ith entry is \kappa(a_i, x). The correlation between \phi(x) and a basis atom \phi(a_i) is computed by

\langle \phi(x), \phi(a_i) \rangle = \langle \phi(a_i), \phi(x) \rangle = \kappa(a_i, x) = (\kappa_{D,x})_i,   (11)

where (\kappa_{D,x})_i is the ith element of the vector \kappa_{D,x}. The linear representation of \phi(x) can be written as a combination of the basis atoms in the feature space as

K_D \beta = \kappa_{D,x},   (12)

where \beta is the sparse coefficient vector of \phi(x) in the feature space. Assuming \phi(x) is projected onto a set of selected dictionary atoms \{\phi(a_i)\}_{i \in \Omega}, the sparse coefficient vector \beta_{\Omega} is given by

(K_D)_{\Omega,\Omega} \, \beta_{\Omega} = (\kappa_{D,x})_{\Omega},   (13)

where (K_D)_{\Omega,\Omega} is formed by the rows and columns of K_D indexed on \Omega, and (\kappa_{D,x})_{\Omega} is the subvector of \kappa_{D,x} consisting of the rows indexed on \Omega. Correspondingly, \beta_{\Omega} can be obtained as

\beta_{\Omega} = (K_D)_{\Omega,\Omega}^{-1} (\kappa_{D,x})_{\Omega}.   (14)

The residual vector between \phi(x) and its approximation using the selected atoms \{\phi(a_i)\}_{i \in \Omega} can be expressed as

\phi(r) = \phi(x) - D^{\phi} \alpha^{\phi} = \phi(x) - (D^{\phi})_{:,\Omega} (K_D)_{\Omega,\Omega}^{-1} (\kappa_{D,x})_{\Omega},   (15)

where (D^{\phi})_{:,\Omega} is formed by the columns of D^{\phi} indexed on \Omega. Substituting Eq. (15) into Eq. (9), the sparse coefficient vector \hat{\alpha}^{\phi} can be obtained under a strict sparsity constraint. The reconstruction error between the test sample \phi(x) and its approximation from the ith class is given by

\|\phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi})\|_2 = \langle \phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi}), \phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi}) \rangle^{1/2}
= (\langle \phi(x), \phi(x) \rangle - 2 \langle D^{\phi} \delta_i(\hat{\alpha}^{\phi}), \phi(x) \rangle + \langle D^{\phi} \delta_i(\hat{\alpha}^{\phi}), D^{\phi} \delta_i(\hat{\alpha}^{\phi}) \rangle)^{1/2}
= (\kappa(x, x) - 2 \delta_i(\hat{\alpha}^{\phi})^T \kappa_{D,x} + \delta_i(\hat{\alpha}^{\phi})^T K_D \, \delta_i(\hat{\alpha}^{\phi}))^{1/2}.   (16)

Then the sample x can be assigned to the class that minimizes the reconstruction error between \phi(x) and D^{\phi} \delta_i(\hat{\alpha}^{\phi}):

\text{Class}(x) = \arg\min_{i=1,2,\ldots,c} \|\phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi})\|_2.   (17)

C. Customizing Kernel Based on the kth Nearest Neighbor Density Estimation

In probability and statistics, density estimation is the construction of an estimate of an unobservable underlying probability density function based on a finite set of observations [30]. The kth nearest neighbor density is defined as

\hat{f}^k(x) = \frac{k}{2 n d_k(x)},   (18)

where x is a sample point, n is the total number of samples, d_k(x) is the distance between x and its kth nearest neighbor, and f(x) is the density function with its kth nearest neighbor density estimate denoted as \hat{f}^k(x). Given a sample x from the \tau(x)th class, the homogeneous density is defined as

\hat{f}^k_{ho}(x) = \frac{k}{2 (n_{\tau(x)} - 1) d^k_{ho}(x)},   (19)

where \tau(x) is the function that indicates the class label of sample x, n_{\tau(x)} is the number of training samples from the \tau(x)th class, and d^k_{ho}(x) is the distance between x and its kth nearest neighbor. The heterogeneous density is defined as

\hat{f}^k_{he}(x) = \frac{k}{2 \left( \sum_{l=1, l \ne \tau(x)}^{c} n_l \right) d^k_{he}(x)},   (20)

where d^k_{he}(x) is the distance between x and its kth nearest enemy, and c is the number of classes. The density ratio between the homogeneous density and the heterogeneous density is given by

\frac{\hat{f}^k_{ho}(x)}{\hat{f}^k_{he}(x)} = \frac{\sum_{l=1, l \ne \tau(x)}^{c} n_l}{n_{\tau(x)} - 1} \cdot \frac{d^k_{he}(x)}{d^k_{ho}(x)}.   (21)

In physics, the density of a material is defined as its mass per unit volume. This can be extended to the approximation of the kth nearest neighbor density as

\hat{f}(x) \approx \lim_{k \to 1,\, d_k(x) \to 0} \hat{f}^k(x) = \frac{1}{2 n d(x)},   (22)

where d(x) is the distance between x and its nearest neighbor. Thus, the density ratio in Eq. (21) can be rewritten as

\frac{\hat{f}^k_{ho}(x)}{\hat{f}^k_{he}(x)} \approx \frac{\sum_{l=1, l \ne \tau(x)}^{c} n_l}{n_{\tau(x)} - 1} \cdot \frac{d_{he}(x)}{d_{ho}(x)},   (23)

where d_{ho}(x) is the distance between x and its nearest neighbor, and d_{he}(x) is the distance between x and its nearest enemy. Consider a training set that consists of n vectors from a d-dimensional space, x_i \in \mathbb{R}^d, i = 1, 2, \ldots, n. Each sample is represented as a vector x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,d}]^T, where the element x_{i,j} corresponds to the reflectance value of x_i in the jth spectral band. The spectral weight \omega_j for the jth spectral band is defined by incorporating the homogeneous density and heterogeneous density as

\omega_j = \frac{\sum_{i=1}^{n} \left[ \sum_{l=1, l \ne \tau(x_i)}^{c} n_l \right] d_{he,j}(x_i)}{\sum_{i=1}^{n} (n_{\tau(x_i)} - 1) \, d_{ho,j}(x_i)} = \frac{\sum_{i=1}^{n} \left[ \sum_{l=1, l \ne \tau(x_i)}^{c} n_l \right] \|x_{i,j} - NE_j(x_i)\|_2}{\sum_{i=1}^{n} (n_{\tau(x_i)} - 1) \|x_{i,j} - NN_j(x_i)\|_2},   (24)

where d_{ho,j}(x_i) is the distance between x_i and its nearest neighbor NN(x_i) in the jth dimension, NN_j(x_i) is the jth entry of NN(x_i), d_{he,j}(x_i) is the distance between x_i and its nearest enemy NE(x_i) in the jth dimension, and NE_j(x_i) is the jth entry of NE(x_i). Features with higher homogeneous density and lower heterogeneous density have a smaller within-class difference and a bigger between-class difference; such features should be highlighted in the classification. On the other hand, features with lower homogeneous density and higher heterogeneous density have a bigger within-class difference and a smaller between-class difference; a smaller weight should be assigned to such features to reduce the adverse effect of such noisy bands. To reflect the contribution of each feature to the classification, a weight vector \omega = [\omega_1, \omega_2, \ldots, \omega_d]^T is used to scale each element x_{i,j} before mapping it into the feature space, where \omega_j is defined in Eq. (24). To simplify notation, a diagonal weighting matrix \Psi = \mathrm{diag}(\omega) is introduced. The customizing Gaussian radial basis function kernel is given by

\kappa_{\Psi}(x_i, x_j) = \exp(-\gamma \|\Psi(x_i - x_j)\|^2),   (25)
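The spectral weighting of Eq. (24) and the customizing kernel of Eq. (25) can be sketched as follows. This is a minimal NumPy illustration under our own naming (not the authors' code); it uses the k = 1 (nearest neighbor / nearest enemy) form of Eq. (23), full-spectral-space neighbors, and per-band absolute distances, which is how we read Eq. (24).

```python
import numpy as np

def spectral_weights(X, y):
    """Per-band weights of Eq. (24). For each training sample, find its
    nearest neighbor (same class) and nearest enemy (other class) in the
    full spectral space, then accumulate per-band distances to them.
    X: (n, d) array of training spectra; y: (n,) integer class labels."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    num = np.zeros(d)
    den = np.zeros(d)
    for i in range(n):
        same = (y == y[i])
        n_same = same.sum()
        nn = np.argmin(np.where(same, dist[i], np.inf))   # nearest neighbor
        ne = np.argmin(np.where(~same, dist[i], np.inf))  # nearest enemy
        num += (n - n_same) * np.abs(X[i] - X[ne])  # sum_{l != tau} n_l
        den += (n_same - 1) * np.abs(X[i] - X[nn])
    return num / np.maximum(den, 1e-12)

def customized_rbf(xi, xj, w, gamma=1.0):
    """Customizing RBF kernel of Eq. (25) with Psi = diag(w)."""
    return np.exp(-gamma * np.sum((w * (xi - xj)) ** 2))
```

A discriminative band (small within-class, large between-class distances) receives a large weight, while a band that is identical across classes receives a weight near zero, so it is suppressed before the kernel mapping.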

where \gamma is the smoothing parameter. Algorithm 1 below summarizes the classification procedure of the proposed CKSRC.

Algorithm 1: Customizing Kernel Sparse Representation-Based Classification
Input: a matrix of training samples D = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{d \times n} for c classes; a test sample x \in \mathbb{R}^d.
Output: Class(x).
1. Normalize the columns of D.
2. Calculate the spectral weight \omega with Eq. (24).
3. Calculate the sparse coefficients for sample x by solving the following l_0-minimization problem with the customizing Gaussian radial basis function kernel: \hat{\alpha}^{\phi} = \arg\min_{\alpha^{\phi}} \|\phi(x) - D^{\phi} \alpha^{\phi}\|_2 subject to \|\alpha^{\phi}\|_0 \le K_0.
4. Compute the reconstruction error \|\phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi})\|_2 for i = 1, 2, \ldots, c.
5. Output Class(x) = \arg\min_{i=1,2,\ldots,c} \|\phi(x) - D^{\phi} \delta_i(\hat{\alpha}^{\phi})\|_2.

3. Experimental Design

A. Experimental Data Sets and Simulation Environment

In this study, public vegetation reflectance data from northwest Indiana's Indian Pines (AVIRIS sensor, 12 June 1992, ftp://ftp.rcn.purdue.edu/biehl/MultiSpec/92AV3C) [31] and Center of Pavia data [29] are used to evaluate the performance of the classifiers (Fig. 1). For the AVIRIS data, the hyperspectral image consists of a scene of size 145 × 145 pixels, with a spatial resolution of 20 m/pixel, and has 220 bands across the spectral range from 0.2 to 2.4 μm. In the experiments, 20 water absorption bands (numbers 104–108, 153–163) are discarded [32]. From the 16 land-cover classes available in the reference map, 9 classes are selected, since the pixels from the residual classes are too few to sufficiently represent the characteristics of the corresponding objects. Among the 9 classes, "corn-notill" and "corn-mintill" are of the same species of corn with different types. Additionally, "soybean-notill," "soybean-mintill," and "soybean-cleantill" are of the same species of soybean with different types. Another publicly available data set used in this study is the urban image from the Center of Pavia, which was acquired by the reflective optics system imaging spectrometer (ROSIS). The ROSIS sensor generates 115 spectral bands ranging from 0.43 to 0.86 μm and has a spatial resolution of 1.3 m/pixel. The experimental environment is as follows: the simulation software is MATLAB R2012b, the operating system is 64-bit Windows 8, the processor is an Intel Core i7-3740QM CPU at 2.70 GHz, and the memory is 16 GB.

Fig. 1. Hyperspectral data sets: (a) sample band of the AVIRIS data set (band 120), (b) average reflectance profiles of the AVIRIS data set, (c) sample band of the Center of Pavia data set (band 60), and (d) average reflectance profiles of the Center of Pavia data set.

B. Training and Testing Data Sets

Experimental analysis is organized into 4 parts. The first aims at analyzing the effectiveness of the proposed classifier (CKSRC). A comparison with SVM, CSVM, SRC, and KSRC is provided as an assessment. For convenient citation, we denote SVM,

CSVM, SRC, and KSRC as contrastive algorithms. The numbers of training and testing pixels in each class for the AVIRIS data set and the Center of Pavia data set are shown in Tables 1 and 2, respectively. In the second experiment, a comparison between CKSRC and the contrastive algorithms is conducted with different percentages of the training sample. To evaluate the robustness of the customizing kernel, training data with different sample sizes are acquired. For each class, all the samples are randomly partitioned into 10 subgroups of approximately equal size. Initially, the percentage of training data is set as 0.1; that is, of the 10 subgroups, 1 subgroup is retained as training data and the remaining 9 subgroups are used as testing data. Then the percentage is set as 0.2, so that 2 subgroups are retained as training data and the remaining 8 subgroups are used as testing data. In a similar manner, different percentages of the training sample are chosen until the percentage reaches 0.9. For each selection of training sample percentage, the experiment is repeated 10 times, with different combinations of subgroups used as training data and the remaining subgroups used as testing data. In this manner, the performance of the different classifiers can be evaluated with respect to different selections of training samples. To demonstrate the significance of the improvement of the proposed CKSRC over the contrastive algorithms, Welch's t-test is conducted in the third part on the overall classification accuracy with different percentages of the training sample. Finally, the discussion of the parameter setting and sparsity level selection for SRC, KSRC, and CKSRC is given in the fourth part.

Table 1. Number of Training and Testing Pixels for the AVIRIS Data Set

Class                Training   Testing
C1-Corn-notill          717      1434
C2-Corn-mintill         417       834
C3-Grass-pasture        248       497
C4-Grass-trees          373       747
C5-Hay-windrowed        244       489
C6-Soybean-notill       484       968
C7-Soybean-mintill     1234      2468
C8-Soybean-clean        307       614
C9-Woods                647      1294
Total                  4671      9345

Table 2. Number of Training and Testing Pixels for the Center of Pavia Data Set

Class          Training   Testing
C1-Water          2355      4712
C2-Trees           270       542
C3-Asphalt         110       220
C4-Brick            95       191
C5-Bitumen         235       470
C6-Tile            330       660
C7-Shadow          260       520
C8-Meadow         1525      3059
C9-Soil            100       204
Total             5280     10578

4. Results and Discussion

A. Classification Accuracy

The classification performance for each class, the Cohen kappa coefficient (a robust measure of the degree of agreement), the overall classification accuracy (i.e., the percentage of correctly classified pixels among all the test pixels considered), and the average classification accuracy (i.e., the mean of the per-class classification accuracies) on the AVIRIS data and the Center of Pavia data are shown in Tables 3 and 4, respectively. The maximum value of each row is shown in boldface. It is clearly seen that CKSRC exhibits the best overall classification accuracy on both data sets. For the AVIRIS data set, in comparison with SVM, CSVM, SRC, and KSRC, the proposed CKSRC shows an increase in overall classification accuracy of 2.04%, 4.22%, 11.59%, and 0.24%, respectively. It can also be concluded from the classification accuracies that the traditional SVM exhibits fine performance on the distinctive classes (such as C3, C4, C5, C9). However, when SVM is used to classify pixels from the same species of different types (such as C1, C2, and C6, C7, C8), the classification accuracy decreases considerably, whereas the proposed CKSRC shows comparatively higher classification accuracies on these classes. Additionally, SRC in the original space performs poorly; incorporating the kernel mapping method significantly improves the overall classification accuracy. For the Center of Pavia data set, in comparison with SVM, CSVM, SRC, and KSRC, the proposed CKSRC shows an increase in overall classification accuracy of 0.83%, 1.42%, 0.78%, and 0.14%, respectively. The highest classification accuracy was obtained when differentiating the first class (water). The computation times of the different classification algorithms are summarized in Table 5.

Table 3. Comparison of Cohen Kappa Coefficients, Overall Classification Accuracies (%), Average Classification Accuracies (%), and Classification Accuracies (%) Yielded by the SVM, CSVM, SRC, KSRC, and CKSRC Algorithms on the AVIRIS Data Set^a

                                    SVM     CSVM     SRC     KSRC    CKSRC
Cohen kappa coefficient           0.9394   0.9140  0.8271   0.9606   0.9633
Overall classification accuracy    94.84    92.66   85.29    96.64    96.88
Average classification accuracy    95.46    93.39   85.97    96.97    97.12
C1-Corn-notill                     91.07    90.38   79.43    93.58    94.28
C2-Corn-mintill                    89.57    84.41   73.98    93.76    93.17
C3-Grass-pasture                   98.99    97.59   88.53    98.99    98.59
C4-Grass-trees                     99.87    99.46   93.98    99.87    99.73
C5-Hay-windrowed                  100.00   100.00   99.59    99.80    99.80
C6-Soybean-notill                  90.50    82.54   76.45    94.73    95.76
C7-Soybean-mintill                 94.53    92.14   84.44    96.60    96.92
C8-Soybean-clean                   94.79    94.30   78.50    95.44    95.93
C9-Woods                           99.85    99.69   98.84   100.00    99.92

^a The maximum value of each row is shown in boldface.

Table 4. Comparison of Cohen Kappa Coefficients, Overall Classification Accuracies (%), Average Classification Accuracies (%), and Classification Accuracies (%) Yielded by the SVM, CSVM, SRC, KSRC, and CKSRC Algorithms on the Center of Pavia Data Set^a

                                    SVM     CSVM     SRC     KSRC    CKSRC
Cohen kappa coefficient           0.9732   0.9651  0.9740   0.9833   0.9851
Overall classification accuracy    98.12    97.53   98.17    98.81    98.95
Average classification accuracy    92.33    91.26   94.52    96.31    96.57
C1-Water                           99.98    99.98   99.94   100.00   100.00
C2-Trees                           97.79    95.02   95.02    96.49    97.42
C3-Asphalt                         96.36    81.36   95.91    95.45    95.00
C4-Brick                           92.67    69.11   86.39    93.72    91.62
C5-Bitumen                         98.09    95.11   97.23    98.09    98.51
C6-Tile                            95.30    95.30   92.27    95.91    96.36
C7-Shadow                          94.62    89.42   94.23    95.77    95.77
C8-Meadow                          99.77    99.44   99.51    99.71    99.80
C9-Soil                            56.37    96.57   90.20    91.67    94.61

^a The maximum value of each row is shown in boldface.

Table 5. Comparison of Computation Time (s) of the SVM, CSVM, SRC, KSRC, and CKSRC Algorithms on the AVIRIS and Center of Pavia Data Sets

                   SVM    CSVM    SRC    KSRC   CKSRC
AVIRIS             9.4    21.9   22.2    93.8   112.7
Center of Pavia    7.4     9.1   73.1   104.8   113.4

B. Classification Accuracy with Different Percentages of the Training Sample

In the second experiment, the robustness of the classification algorithms is evaluated for different percentages of the training sample. Each classifier is tested with the percentage of training sample varying from 0.1 to 0.9. For each training sample percentage, the experiment is repeated 10 times with different selections of the training data. The average overall classification accuracy with respect to the different percentages of the training sample is shown in Fig. 2. The figure shows that the proposed CKSRC performs better than the other contrastive algorithms and exhibits strong robustness to the selection of training samples for both the AVIRIS data set and the Center of Pavia data set.

Fig. 2. Average overall classification accuracy with different percentages of the training sample. (a) AVIRIS data set. (b) Center of Pavia data set.

C. Welch's t-Test

To demonstrate the significance of the improvement of the proposed CKSRC, Welch's t-test is applied in this study. We assume the overall classification accuracy Y of CKSRC is a random variable from a normal distribution with mean \mu and variance \sigma^2 as

The overall classification accuracy Y of CKSRC is a random variable from a normal distribution with mean μ and variance σ²:

Y ∼ N(μ, σ²).  (26)

The overall classification accuracy Y′ of the contrastive algorithm is a random variable from a normal distribution with mean μ₀ and variance σ₀²:

Y′ ∼ N(μ₀, σ₀²).  (27)

The sets of observations from the CKSRC and contrastive algorithms are {Y₁, Y₂, …, Yₙ} and {Y′₁, Y′₂, …, Y′ₙ}, respectively. The sample means and variances of the observations are denoted Ȳ, S² and Ȳ′, S₀², respectively. Assuming Y₁, Y₂, …, Yₙ are independent and identically distributed, Ȳ can be considered a random variable that is normally distributed with mean μ and variance σ²/n:

Ȳ ∼ N(μ, σ²/n).  (28)

Similarly, Ȳ′ is normally distributed with mean μ₀ and variance σ₀²/n:

Ȳ′ ∼ N(μ₀, σ₀²/n).  (29)

Then Ȳ − Ȳ′ ∼ N(μ − μ₀, σ²/n + σ₀²/n). Set the hypotheses as

H₀: μ ≤ μ₀,  H₁: μ > μ₀.  (30)

The test statistic of the two-sample Z test is

Z = (Ȳ − Ȳ′ − ξ) / √(σ²/n + σ₀²/n),  (31)

where ξ is a constant. In this study, however, σ² and σ₀² are unknown. Replacing them with their estimates, the sample variances S² and S₀², leads to the test statistic of Welch's t-test:

t = (Ȳ − Ȳ′ − ξ) / √(S²/n + S₀²/n).  (32)

To match the hypotheses of Eq. (30), the constant ξ is set to 0. The distribution of the above test statistic is approximated by an ordinary Student's t distribution with the degrees of freedom

ν ≈ (S²/n + S₀²/n)² / [(S²/n)²/(n − 1) + (S₀²/n)²/(n − 1)].  (33)

The probability of rejecting H₀ conditioned on H₀ being valid is

P(reject H₀ | H₀ valid) = P(Z ≥ k) = P((Ȳ − Ȳ′)/√(S²/n + S₀²/n) ≥ k),  (34)

where the quantile k is determined by the significance level. In this study, the significance level α is set to 0.1. We require that this probability of error not exceed α:

P((Ȳ − Ȳ′)/√(S²/n + S₀²/n) ≥ k) ≤ α.  (35)
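The test statistic of Eq. (32) (with ξ = 0) and the degrees of freedom of Eq. (33) can be sketched in a few lines; this is an illustration, not the authors' code, and the function name and toy accuracy values are assumptions:

```python
import math
from statistics import mean, variance

def welch_t(y, y0):
    """Welch's t statistic (Eq. 32, with xi = 0) and approximate
    degrees of freedom (Eq. 33) for two sets of accuracy observations."""
    n, n0 = len(y), len(y0)
    s2, s02 = variance(y), variance(y0)  # sample variances S^2 and S0^2
    t = (mean(y) - mean(y0)) / math.sqrt(s2 / n + s02 / n0)
    nu = (s2 / n + s02 / n0) ** 2 / (
        (s2 / n) ** 2 / (n - 1) + (s02 / n0) ** 2 / (n0 - 1))
    return t, nu

# Toy accuracy observations (assumed data, for illustration only)
t, nu = welch_t([96.1, 96.4, 96.0, 96.5], [95.0, 95.2, 94.9, 95.1])
```

H₀ is rejected when t exceeds the quantile k = t₀.₁(ν), as tabulated in Table 6.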

Table 6 shows the quantile k = t_α(ν) = t₀.₁(ν) for different selections of the percentage of training samples on the AVIRIS and Center of Pavia data sets. The observation values of (Ȳ − Ȳ′)/√(S²/n + S₀²/n) for the

proposed algorithms and contrastive algorithms are shown in Table 7. Most of the observation values fall in the rejection region, so we accept H₁: the proposed approaches achieve a significant improvement over the other approaches.

D. Effect of the Sparsity Level and RBF Kernel Parameter for the Sparse Representation-Based Classifier

The effects of the sparsity level K0 and the RBF kernel parameter γ used in the above experiments are examined in this final experiment. We randomly select half of the data as training samples and use the remaining data as testing samples. The sparsity level K0 is varied from 10 to 200 with a step size of 10. For the AVIRIS data set, the RBF kernel parameter is varied from 2^-6 to 2^-4 for KSRC and from 2^-11 to 2^-9 for CKSRC (prior experiments were conducted to approximately estimate the parameter ranges). For the Center of Pavia data set, the RBF kernel parameter is varied from 2^-4 to 2^-2 for KSRC and from 2^-8.5 to 2^-6.5 for CKSRC. The overall classification accuracies with different parameter settings for the AVIRIS and Center of Pavia data sets are shown in Figs. 3 and 4, respectively. Since the RBF kernel is not incorporated in SRC, only the overall classification accuracy for different selections of the sparsity level is plotted in Figs. 3(a) and 4(a).

For both the AVIRIS and Center of Pavia data, the highest overall classification accuracy for SRC is obtained with K0 = 10 and decreases with increasing sparsity level. This decrease might lie in the unequal number of basis atoms per class in the sparse dictionary: in linear sparse representation, increasing the sparsity level may decrease the probability that basis atoms from small classes are retained in the sparse coefficients. For the AVIRIS data set, one can observe from Figs. 3(b) and 3(c) that γ = 2^-5 and γ = 2^-9.5 lead to the highest overall classification accuracy for KSRC and CKSRC, respectively. For a fixed γ, the performance of KSRC and CKSRC generally improves as K0 increases. For the Center of Pavia data set, the variation across different selections of the sparsity level is very small. From Figs. 4(b) and 4(c), γ = 2^-2.5 and γ = 2^-7.5 lead to comparably higher overall classification accuracy for KSRC and CKSRC, respectively. This stable performance suggests that empirical values of K0 and γ can also be used.

Table 6. Quantile t0.1(ν) of the Contrastive Algorithms with Different Percentages of the Training Sample for the AVIRIS and Center of Pavia Data Sets

Percentage of              AVIRIS                    Center of Pavia
Training Sample    SVM   CSVM   SRC   KSRC     SVM   CSVM   SRC   KSRC
0.1               1.30   1.30   1.33  1.30     1.29  1.29   1.29  1.29
0.2               1.30   1.29   1.36  1.29     1.29  1.29   1.29  1.29
0.3               1.29   1.29   1.31  1.29     1.28  1.28   1.28  1.28
0.4               1.30   1.29   1.31  1.29     1.28  1.29   1.28  1.28
0.5               1.29   1.29   1.33  1.29     1.28  1.29   1.28  1.28
0.6               1.29   1.30   1.30  1.30     1.28  1.29   1.28  1.28
0.7               1.29   1.30   1.33  1.29     1.29  1.29   1.28  1.29
0.8               1.30   1.31   1.31  1.30     1.29  1.29   1.29  1.29
0.9               1.33   1.31   1.42  1.32     1.29  1.30   1.29  1.29

Table 7. Observation Values (Ȳ − Ȳ′)/√(S²/n + S₀²/n) of the Contrastive Algorithms with Different Percentages of the Training Sample for the AVIRIS and Center of Pavia Data Sets

Percentage of              AVIRIS                    Center of Pavia
Training Sample    SVM   CSVM   SRC    KSRC     SVM    CSVM   SRC    KSRC
0.1               1.14   5.82   47.55  1.38     9.43   4.11   9.23   1.44
0.2               2.18   9.76   51.17  1.42    12.03   4.55  10.82   1.42
0.3               3.95  15.55   72.62  2.36    18.22   6.93  14.90   2.53
0.4               3.90  13.40   61.33  2.09    22.01   5.19  14.59   1.68
0.5               5.10  10.82   49.21  2.05    19.10   4.11  17.76   1.50
0.6               5.35   8.72   75.79  1.77    12.08   3.27  14.57   1.28
0.7               7.38   9.66   48.97  2.37     8.48   3.00  11.78   1.29
0.8               6.47   9.37   56.87  2.09     3.49   1.66   6.82   0.58
0.9               4.78   8.63   30.37  1.28     0.14   1.41   5.61   0.64
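The parameter sweep described above amounts to a grid search over K0 and γ. A minimal sketch follows; the scoring function here is a toy stand-in for training a classifier and measuring overall accuracy (an assumption, not the paper's experiment):

```python
def grid_search(score, k0_grid, gamma_grid):
    """Return (best_score, best_K0, best_gamma) over the parameter grid."""
    return max((score(k0, g), k0, g) for k0 in k0_grid for g in gamma_grid)

k0_grid = range(10, 201, 10)  # sparsity level K0: 10..200, step 10
gamma_grid = [2.0 ** e for e in (-6, -5.5, -5, -4.5, -4)]  # KSRC range, AVIRIS

# Toy score peaking at K0 = 150 and gamma = 2^-5 (illustrative only);
# in the real sweep this would train KSRC and return overall accuracy.
toy = lambda k0, g: -1e-4 * (k0 - 150) ** 2 - (g - 2.0 ** -5) ** 2
best = grid_search(toy, k0_grid, gamma_grid)
```

The same loop covers the CKSRC ranges by swapping in the corresponding γ grid.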

Fig. 3. Effect of sparsity level and RBF kernel parameter on the AVIRIS data set: (a) SRC, (b) KSRC, (c) CKSRC.

Fig. 4. Effect of sparsity level and RBF kernel parameter on the Center of Pavia data set: (a) SRC, (b) KSRC, (c) CKSRC.

5. Conclusion

In this study, a new customizing kernel is proposed for the sparse representation-based classifier for hyperspectral image classification. The ratio between homogeneous density and heterogeneous density is incorporated as a weight in the customizing kernel to increase the separability of each class. For the AVIRIS data set, the further increase in accuracy mainly comes from pixels belonging to the same species but of different types. Moreover, experiments for robust performance evaluation were conducted: with different selections of training samples and training sample percentages, CKSRC consistently exhibits fine performance, and Welch's t-test confirms the significant improvement of the proposed algorithm.

This study was partially supported by the National Natural Science Foundation of China (Nos. 61405041 and 61275010), the China Postdoctoral Science Foundation (Grant No. 2014M551221), the Heilongjiang Postdoctoral Science Found (Grant No. LBH-Z13057), the Key Program of Heilongjiang Natural Science Foundation (No. ZD201216), and the Program Excellent Academic Leaders of Harbin (No. RC2013XK009003).

References
1. F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens. 42, 1778–1790 (2004).
2. B. Qi, C. Zhao, E. Youn, and C. Nansen, "Use of weighting algorithms to improve traditional support vector machine based classifications of reflectance data," Opt. Express 19, 26816–26826 (2011).
3. B. Guo, S. R. Gunn, R. I. Damper, and J. D. B. Nelson, "Customizing kernel functions for SVM-based hyperspectral image classification," IEEE Trans. Image Process. 17, 622–629 (2008).
4. B. Qi, C. Zhao, and G. Yin, "Feature weighting algorithms for classification of hyperspectral images using a support vector machine," Appl. Opt. 53, 2839–2846 (2014).
5. M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson, "Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles," IEEE Trans. Geosci. Remote Sens. 46, 3804–3814 (2008).
6. C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn. 20, 273–297 (1995).
7. M. Pal and G. M. Foody, "Feature selection for classification of hyperspectral data by SVM," IEEE Trans. Geosci. Remote Sens. 48, 2297–2307 (2010).
8. Y. Bazi and F. Melgani, "Toward an optimal SVM classification system for hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sens. 44, 3374–3385 (2006).
9. B.-C. Kuo, H.-H. Ho, C.-H. Li, C.-C. Hung, and J.-S. Taur, "A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 7, 317–326 (2014).
10. Y. Chen, X. Zhao, and Z. Lin, "Optimizing subspace SVM ensemble for hyperspectral imagery classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 7, 1295–1305 (2014).
11. Z. Xue, P. Du, and H. Su, "Harmonic analysis for hyperspectral image classification integrated with PSO optimized SVM," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 7, 2131–2146 (2014).
12. P. Gurram and H. Kwon, "Contextual SVM using Hilbert space embedding for hyperspectral classification," IEEE Geosci. Remote Sens. Lett. 10, 1031–1035 (2013).
13. Z. Shao, L. Zhang, X. Zhou, and L. Ding, "A novel hierarchical semisupervised SVM for classification of hyperspectral images," IEEE Geosci. Remote Sens. Lett. 11, 1609–1613 (2014).
14. Y. Gu and K. Feng, "Optimized Laplacian SVM with distance metric learning for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 6, 1109–1117 (2013).
15. J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell. 31, 210–227 (2009).
16. B. Qi, V. John, Z. Liu, and S. Mita, "Use of sparse representation for pedestrian detection in thermal images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (IEEE, 2014), pp. 274–280.
17. J. Zhang, D. Zhao, and W. Gao, "Group-based sparse representation for image restoration," IEEE Trans. Image Process. 23, 3336–3351 (2014).
18. L.-W. Kang, C.-Y. Hsu, H.-W. Chen, C.-S. Lu, C.-Y. Lin, and S.-C. Pei, "Feature-based sparse representation for image similarity assessment," IEEE Trans. Multimedia 13, 1019–1030 (2011).
19. J. Liu, Z. Wu, Z. Wei, L. Xiao, and L. Sun, "Spatial-spectral kernel sparse representation for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 6, 2462–2471 (2013).
20. M. L. Wong, W. Lam, and K. S. Leung, "Using evolutionary programming and minimum description length principle for data mining of Bayesian networks," IEEE Trans. Pattern Anal. Mach. Intell. 21, 174–178 (1999).
21. H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, "Kernel dictionary learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2012), pp. 2021–2024.
22. M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process. 54, 4311–4322 (2006).
23. M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process. 15, 3736–3745 (2006).
24. M. A. Aizerman, E. A. Braverman, and L. I. Rozonoer, "Theoretical foundations of the potential function method in pattern recognition learning," Automat. Remote Control 25, 821–837 (1964).
25. Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Hyperspectral image classification using dictionary-based sparse representation," IEEE Trans. Geosci. Remote Sens. 49, 3973–3985 (2011).
26. S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, "Joint sparse representation for robust multimodal biometrics recognition," IEEE Trans. Pattern Anal. Mach. Intell. 36, 113–126 (2014).
27. J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory 53, 4655–4666 (2007).
28. G. Camps-Valls and L. Bruzzone, "Kernel-based methods for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens. 43, 1351–1362 (2005).
29. Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Hyperspectral image classification via kernel sparse representation," IEEE Trans. Geosci. Remote Sens. 51, 217–231 (2013).
30. A. Banerjee and P. Burlina, "Efficient particle filtering via sparse kernel density estimation," IEEE Trans. Image Process. 19, 2480–2490 (2010).
31. A. M. Filippi, R. Archibald, B. L. Bhaduri, and E. A. Bright, "Hyperspectral agricultural mapping using support vector machine-based endmember extraction (SVM-BEE)," Opt. Express 17, 23823–23842 (2009).
32. L. Fang, S. Li, X. Kang, and J. A. Benediktsson, "Spectral-spatial hyperspectral image classification via multiscale adaptive sparse representation," IEEE Trans. Geosci. Remote Sens. 52, 7738–7749 (2014).
