This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS


Brief Papers

Nonsmooth ICA Contrast Minimization Using a Riemannian Nelder–Mead Method

Suviseshamuthu Easter Selvan

Abstract— This brief concerns the design and application of a Riemannian Nelder–Mead algorithm to minimize a Hartley-entropy-based contrast function so as to reliably estimate the sources from their mixtures. Despite its nondifferentiability, the contrast function is endowed with attractive properties such as discriminacy, and hence warrants an effort to be effectively handled by a derivative-free optimizer. Aside from tailoring the Nelder–Mead technique to the constraint set, namely, the oblique manifold, the source separation results attained in an empirical study with quasi-correlated synthetic signals and digital images are presented, which favor the proposed method on a comparative basis.

Index Terms— Hartley entropy, Nelder–Mead algorithm, oblique manifold, quasi-correlated sources.

Manuscript received June 10, 2013; revised February 27, 2014; accepted March 6, 2014. This brief presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian Science Policy Office. The author is with the Department of Mathematical Engineering, Université catholique de Louvain, Louvain-la-Neuve 1348, Belgium (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2311036

I. INTRODUCTION

Independent component analysis (ICA) has been extensively applied during recent years in several areas pertaining to multichannel signal processing in order to estimate the underlying independent sources from a set of mixed observations. The objective of ICA is to find a linear nonorthogonal coordinate system for multivariate data, which minimizes the statistical dependence among the axial projections of the data. The statistical dependence can either be accurately measured using the mutual information (MI) or be indirectly assessed with surrogate functions, e.g., kurtosis and negentropy. Therefore, the cornerstones for developing an ICA algorithm are the following: 1) choice of a dependence measure (contrast function); 2) a mechanism to estimate the contrast function from samples; and 3) design of a robust optimization technique to minimize the contrast function. A surrogate contrast function may lend itself to the implementation of a gradient-based and computationally efficient optimization method; however, such an algorithm, in general, would compromise the accuracy of the ICA estimate. On the contrary, applications abound in practice, wherein despite the computational complexity, the sources ought to be estimated

precisely with the help of MI, for instance, offline demixing of electroencephalogram or magnetic resonance image data. The MI-based contrast can be expressed using an entropy measure that quantifies the randomness of the data. The Shannon-entropy-based MI contrast has been employed by many in the field of ICA. Admittedly, the global minimization of this contrast will indeed lead to perfect source separation under the assumptions of strict source independence and non-Gaussianity. Nonetheless, the foreseen difficulties—the estimation of Shannon entropy from mixture samples and the avoidance of spurious mixing local minima—are to be dealt with. In addition, to evaluate this contrast function, the source densities have to be estimated via a parametric or nonparametric scheme. While parametric density estimation requires prior knowledge of the source distributions, nonparametric techniques face computational bottlenecks. As Rényi's entropy generalizes the Shannon entropy and shares a few key properties, an ICA method minimizing the MI based on Rényi's quadratic entropy was ventured in [1]. Nevertheless, Pham et al. [2] argued that relying on Rényi's entropy in an ICA algorithm without restricting the value of the Rényi exponent¹ would no longer guarantee the recovery of the original sources; in other words, the quadratic-entropy-based MI cannot be deemed an ICA contrast. Instead, the Hartley entropy can be regarded as a suitable alternative to estimate the MI, since it preserves the ICA contrast properties. In fact, the range-based contrast function introduced in [3] and investigated in [4] is akin to the MI contrast based on the Hartley entropy measure. It has been demonstrated in [4] that this contrast is endowed with several desirable properties.
Primarily, the range-based contrast enjoys the discriminacy property, which means that all the local minima of the contrast function correspond to the separation of the original sources or at least lead to a satisfactory solution [4, p. 26]. Thus, the discriminacy property is related to the local minimum points and not only to the global one. Recall that the Shannon-entropy-based contrast can have mixing local minima [5], [6], but not the range-based one, which is proven to be discriminant. The key contribution of this brief is to apply a Riemannian derivative-free optimization strategy to minimize the Hartley-entropy-based MI, and as a consequence, to produce an accurate ICA demixing matrix estimate. The downside of the contrast function considered in this brief is its nondifferentiability, which does not allow the use of a classical gradient-based method. Even its discrete-gradient approximation via a second-order Taylor expansion was reported to be computationally demanding as the problem dimension grows. A remedial approach has been suggested in [7] to handle the

¹For the Hartley, Shannon, and quadratic entropy, the Rényi exponent equals 0, 1, and 2, respectively.

2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


minimization of the range-based contrast without the derivative information and to maintain the orthogonality constraint using Givens rotations. Subsequently, Lee et al. [8] recommended a framework to relax the orthogonality constraint while minimizing the same contrast function, but without specifying a derivative-free algorithm for this task. Furthermore, an effort has been made in [10] to minimize the range-based contrast with a cross-entropy (CE) method and to intrinsically satisfy the unit-norm column constraint² by sampling from the von Mises–Fisher distribution. In spite of an improvement in the solution quality, the population-based CE method inherently suffers from computational overheads, and it is bereft of convergence guarantees. Therefore, a direct search method—mesh adaptive direct search (MADS) on the oblique manifold (OB)—was recently proposed in [11], which extends the convergence results of MADS to a Riemannian setting; note, however, that the algorithm incurs a significant computational burden. Notwithstanding the contrast function being piecewise smooth, as shown in Section IV, a viable expedient in principle is to use Riemannian versions of subgradient techniques, e.g., [12]. Unfortunately, implementing subgradient algorithms often proves to be tedious, which, from an engineering standpoint, warrants the exploration of methods that are simpler to implement, such as the Nelder–Mead. This brief concerns the adaptation of the well-known Nelder–Mead algorithm to the manifold of interest (OB), since this direct search method has acquired effervescent popularity in solving multidimensional unconstrained problems. Albeit the lack of theoretical results, it is a widely sought optimizer in many fields, e.g., chemistry, chemical engineering, and medicine, as evidenced by thousands of references in [13] and its inclusion in MATLAB, R (open source), and NAG (numerical library).
In addition, it uses a small number of function evaluations per iteration, and especially, its rules for adapting the simplex shape to the local contours of many low-dimensional functions lead to low iteration counts. A repeated restart strategy, as outlined in Remark 1, ensures that the Nelder–Mead converges to a local minimum in practice. In the present setting, it turns out to be an unmixing local minimum thanks to the discriminant contrast function.

II. PRELIMINARIES

The essential ingredients for adapting the Nelder–Mead simplex algorithm to a Riemannian manifold are the Riemannian average, exponential map, and logarithm map, which replace their counterparts—the arithmetic mean, sum, and difference of vectors, respectively—in the Euclidean setting. A methodology has been derived in [14] to generalize the Nelder–Mead algorithm to Riemannian manifolds. Interested readers may refer to [15] and [16] to delve into related studies.

Definition 1 (Oblique Manifold): The set of n × n matrices with unit Euclidean norm columns is referred to as the oblique manifold

OB(n) = {X ∈ R^{n×n} : ddiag(X⊤X) = I_n}

²One may refer to [9] for the benefits derived from replacing the orthogonality constraint with the unit-norm column constraint.

where ddiag(·) constructs a diagonal matrix with the diagonal elements of the matrix in the argument, and I_n represents the n × n identity matrix. Precisely, OB(n) is the Cartesian product of n copies of the hypersphere S^{n−1}, and hence, it is not the orthogonal group. This means that X is not constrained to be orthogonal, but simply to have unit-norm columns.

Definition 2 (Riemannian Average on OB): Given a set of points P_i ∈ OB, i = 1, 2, . . . , N, lying on an open Riemannian ball B of sufficiently small radius, the Riemannian average is defined as the (global) minimizer M of

g(P) = (1/2) Σ_{i=1}^{N} d(P, P_i)²    (1)

where d(·, ·) denotes the geodesic distance on OB from an arbitrary point P ∈ OB to each point P_i. For instance, if the points on S^{n−1} happen to lie on an open half-sphere (i.e., the radius of B is π/2), there exists a unique M, and (1) can be globally minimized according to Afsari et al. [17].

Definition 3 (OB Exponential Map): Let P be a point of OB and V ∈ T_P OB be a tangent vector to OB at P. The corresponding exponential map, denoted as Exp_P : T_P OB → OB, is defined by

Exp_P(V) := P diag(cos(ζ(V))) + V diag(sin(ζ(V)) ⊘ ζ(V))

where ⊘ denotes the element-wise division, ζ(·) returns the 2-norm of each column of the matrix in the argument, and diag(·) represents the diagonal matrix whose diagonal entries are the components of the vector in the argument.

Definition 4 (OB Logarithm Map): Let P belong to OB and let P′ be the endpoint of the geodesic segment that starts at P, has the direction V ∈ T_P OB, and has the length ‖V‖. Then, the logarithm map, denoted as Log_P : Exp_P(U) → T_P OB, where U ⊂ T_P OB contains 0, is given by

Log_P(P′) := (P′ − P ddiag(P⊤P′)) ddiag(1 − (P⊤P′)^{•[2]})^{−1/2} ddiag(arccos(P⊤P′))

where (·)^{•[2]} denotes the element-wise power-2 operation of the matrix of interest. Note that the difference between the points P and P′ on OB turns out to be the velocity vector V ∈ T_P OB computed using the logarithm map V = Log_P(P′). Similarly, a point on the parameterized geodesic curve γ passing through P and P′ can be obtained by evaluating

γ(ρ) := Exp_P(ρV)    (2)

at a specific value of ρ.

III. NELDER–MEAD SIMPLEX ON OB

At iteration k ≥ 0, a nondegenerate simplex, whose n(n − 1) + 1 vertices belong to OB, is constructed. The contrast (objective) function f is evaluated at the vertices of the simplex in order to label the vertices as


Algorithm 1 Nelder–Mead Simplex on OB
1: Input: real-valued function f defined on OB; convergence thresholds ε1, ε2 ≥ 0; search geodesic given by (4); shrink mechanism defined in (5);
2: Initialization: k := 0; initial simplex with n(n − 1) + 1 vertices on OB;
3: Order the vertices to satisfy (3), and label B, S, and W;
4: while max_i |f_B − f_i| > ε1 & max_i ‖B − X_i‖_max > ε2 do
5:   Estimate the Riemannian average M of all vertices except W using Algorithm 2;
6:   if an acceptable vertex is determined following the stepwise procedure described in Section III then
7:     Replace W with the acceptable vertex;
8:   else
9:     Perform a shrink as in (5), and generate new vertices except B;
10:  end if
11:  k := k + 1;
12:  Reorder the vertices to identify B, S, and W;
13: end while
14: return X° := B;
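Line 5 of Algorithm 1 requires a Riemannian average of the vertices. As a rough illustration (not the author's MATLAB code; the helper names are hypothetical), the intrinsic gradient-descent idea can be sketched columnwise on OB with NumPy:

```python
import numpy as np

def exp_map(P, V, eps=1e-12):
    """Columnwise sphere exponential map on OB (cf. Definition 3)."""
    z = np.linalg.norm(V, axis=0)
    sinc = np.where(z > eps, np.sin(z) / np.where(z > eps, z, 1.0), 1.0)
    return P * np.cos(z) + V * sinc

def log_map(P, Q, eps=1e-12):
    """Columnwise sphere logarithm map on OB (cf. Definition 4)."""
    c = np.clip(np.sum(P * Q, axis=0), -1.0, 1.0)   # columnwise <p, q>
    W = Q - P * c
    nw = np.linalg.norm(W, axis=0)
    scale = np.where(nw > eps, np.arccos(c) / np.where(nw > eps, nw, 1.0), 0.0)
    return W * scale

def riemannian_average(points, tol=1e-8, max_iter=100):
    """Gradient-descent Riemannian average of points on OB."""
    M = points[0].copy()                              # M[0] := X_1
    for _ in range(max_iter):
        dM = np.mean([log_map(M, X) for X in points], axis=0)
        if np.linalg.norm(dM, 'fro') <= tol:          # ||dM||_F <= eps_3
            break
        M = exp_map(M, dM)                            # M[l+1] := Exp_{M[l]}(dM)
    return M
```

For points confined to a small geodesic ball, a handful of iterations typically suffices; for two points, a single step already lands on the geodesic midpoint.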

X_1[k], X_2[k], . . . , X_{n(n−1)+1}[k], which are associated with the function values in the increasing order

f_1[k] ≤ f_2[k] ≤ · · · ≤ f_{n(n−1)+1}[k]    (3)

where f_i[k] implies f(X_i[k]). Since the minimum of f is sought for the problem at hand, the labels X_1[k], X_{n(n−1)+1}[k], and X_{n(n−1)}[k] refer to the best, worst, and second-worst vertices, respectively. For simplicity's sake, hereafter the notations B := X_1[k], W := X_{n(n−1)+1}[k], and S := X_{n(n−1)}[k] are followed without including the iteration index k; likewise, the corresponding function values are denoted as f_B, f_W, and f_S, respectively. The Riemannian average of all the vertices in a simplex except W is referred to as M, and the function value evaluated at M is denoted as f_M. Furthermore, by using (2), the search geodesic Γ emanating from W and linking M is expressed as

Γ(ρ) := Exp_M(−ρ Log_M(W)).    (4)
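All trial points of the forthcoming steps are evaluations of the search geodesic (4) at fixed values of ρ. A minimal NumPy sketch follows (illustrative only; the exponential and logarithm maps of Definitions 3–4 are restated so the snippet stands alone):

```python
import numpy as np

def exp_map(P, V, eps=1e-12):
    # Definition 3, applied columnwise on OB
    z = np.linalg.norm(V, axis=0)
    sinc = np.where(z > eps, np.sin(z) / np.where(z > eps, z, 1.0), 1.0)
    return P * np.cos(z) + V * sinc

def log_map(P, Q, eps=1e-12):
    # Definition 4, applied columnwise on OB
    c = np.clip(np.sum(P * Q, axis=0), -1.0, 1.0)
    W = Q - P * c
    nw = np.linalg.norm(W, axis=0)
    return W * np.where(nw > eps, np.arccos(c) / np.where(nw > eps, nw, 1.0), 0.0)

def gamma(M, W_vertex, rho):
    # Search geodesic (4): Gamma(rho) = Exp_M(-rho Log_M(W)).
    # rho = 1, 2, 0.5, -0.5 yield the reflection, expansion, and the
    # outside/inside contraction points used in Section III.
    return exp_map(M, -rho * log_map(M, W_vertex))
```

By construction, Γ(0) recovers M and Γ(−1) recovers W, while every evaluation stays on OB (unit-norm columns).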

In a generic iteration, Algorithm 1 attempts to find an acceptable vertex in a sequence of steps; if it succeeds in finding one, then W is replaced with that vertex, and the algorithm skips the remaining steps to proceed to the next iteration. On the other hand, if it encounters failure, a shrink is performed to generate a new set of vertices except B, to form a new simplex in the subsequent iteration.
1) Reflect: Compute the reflection of W about M given by R := Γ(1), and evaluate f_R := f(R). If f_B ≤ f_R < f_S, accept R, and terminate the iteration.
2) Expand: If f_R < f_B, perform an expansion to obtain E := Γ(2), and evaluate f_E := f(E). Accept E or R based on whether f_E < f_R or not, and terminate the iteration.
3) Contract Outside: If f_S ≤ f_R < f_W, perform an outside contraction to compute Č := Γ(0.5), and evaluate f_Č := f(Č). Accept Č, provided f_Č ≤ f_R, and terminate the iteration; otherwise perform a shrink.
4) Contract Inside: If f_R ≥ f_W, perform an inside contraction to calculate Ĉ := Γ(−0.5), and evaluate f_Ĉ := f(Ĉ). Suppose that f_Ĉ < f_W, accept Ĉ, and terminate the iteration; or else, perform a shrink.
5) Shrink: Replace all the n(n − 1) vertices except B with

X_i := Exp_B(σ Log_B(X_i))    (5)

where i = 2, 3, . . . , n(n − 1) + 1 and σ = 0.5.

Algorithm 2 Riemannian Average on OB
1: Input: X_1, X_2, . . . , X_{n(n−1)} ∈ OB; ε3 ≥ 0;
2: Initialization: l := 0; M[0] := X_1;
3: ΔM := (1/(n(n−1))) Σ_{i=1}^{n(n−1)} Log_{M[l]}(X_i);
4: while ‖ΔM‖_F > ε3 do
5:   M[l + 1] := Exp_{M[l]}(ΔM);
6:   l := l + 1;
7:   Compute ΔM;
8: end while
9: return M := M[l];

The Riemannian average is estimated with a gradient descent approach [18] as summarized in Algorithm 2.

Remark 1: As has already been observed in the Euclidean setting [19], the Nelder–Mead algorithm is prone to converge to a nonstationary point even on some convex functions due to the premature collapse of the simplex. This difficulty can be mitigated by a restart strategy proposed in [20], which constructs an initial simplex around the solution obtained in the previous phase. This restart procedure is repeated until there is no improvement in the contrast function value, which implies that a local minimum is attained.

Remark 2: Note that the exact computation of the Riemannian average using Algorithm 2 is not a crucial requirement in the minimization process. On the other hand, a local minimizer of (1), namely, the Karcher mean, could possibly lie outside the convex hull of the points [17], which is unacceptable. For this reason, the Riemannian average is preferred to a local center of mass on the manifold of interest. In spite of that, in a practical setting, it suffices to have a rough notion of the mean, M̂ = [m̂^{(1)} m̂^{(2)} · · · m̂^{(n)}], of the vertices X_1, X_2, . . . , X_{n(n−1)}, given by

m̂^{(j)} := (Σ_{i=1}^{n(n−1)} x_i^{(j)}) / ‖Σ_{i=1}^{n(n−1)} x_i^{(j)}‖_2,  j = 1, 2, . . . , n

provided the x_i^{(j)}, i = 1, 2, . . . , n(n − 1), are confined to a hemisphere.

IV. CONTRAST FUNCTION BASED ON HARTLEY ENTROPY

Given T observations with n variables, Y ∈ R^{n×T}, and a candidate demixing matrix, X = [x^{(1)} x^{(2)} · · · x^{(n)}], x^{(j)} ∈ R^n, the source estimates can be deduced as A = [a^{(1)} a^{(2)} · · · a^{(n)}]⊤ = X⊤Y. The MI between the estimated sources is measured as f(X) := Σ_{j=1}^{n} H(a^{(j)}) − H(a^{(1)}, a^{(2)}, . . . , a^{(n)}), where H(·) represents the entropy;


owing to the reasons mentioned in Section I, the Hartley entropy³ is adopted in this brief. Consequently, the Hartley-entropy-based ICA contrast can be expressed as

f(X; Y) := Σ_{j=1}^{n} log R(x^{(j)⊤} Y) − log|det X|    (6)

with R(·) = sup(·) − inf(·) being the range function. Note that the −log|det X| term ensures that the algorithm will not converge to a solution belonging to the set of rank-deficient matrices, and thus prevents the recovery of any source more than once. Essentially, the choice of OB relaxes the requirement for whitening and allows one to deal with correlated sources. In practice, an estimate is preferred for the range function, as it remains sensitive to outliers in the observations. For instance, in [7], a range estimate is proposed as

R̄_m([a_1, a_2, . . . , a_T]) := (1/m) Σ_{r=1}^{m} R_r([a_1, a_2, . . . , a_T])
R_r([a_1, a_2, . . . , a_T]) := a_{(T−r+1)} − a_{(r)}

where the subscript enclosed in parentheses indicates the order statistic of the observation; the component index of a is ignored for convenience. One may resort to an empirically determined value for m, e.g., [7, p. 816], due to the reliance on samples to estimate the range. Further discussion on the choice of m is deferred to Section V. By denoting the mixing matrix as Z ∈ R^{n×n} and the original sources as C = [c^{(1)} c^{(2)} · · · c^{(n)}]⊤, c^{(j)} ∈ R^T, the strict superadditivity property of the range function enables us to write R(x^{(j)⊤} Y) = R(x^{(j)⊤} ZC) = Σ_{p=1}^{n} |G_{jp}| R(c^{(p)}), where G = X⊤Z is known as the global (transfer) matrix relating C to A. As a result, (6) is equivalent to

f(G; C) := Σ_{j=1}^{n} log [ Σ_{p=1}^{n} |G_{jp}| R(c^{(p)}) ] − log|det G|.    (7)
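To make the contrast concrete, a small Python sketch follows (an illustration, not the paper's implementation; the function names are hypothetical) of (6) with the averaged order-statistic range estimate R̄_m of [7]:

```python
import numpy as np

def range_est(a, m):
    # Averaged order-statistic range estimate R_bar_m: mean over
    # r = 1..m of a_(T-r+1) - a_(r), the r-th quasi-range of the sample.
    s = np.sort(np.asarray(a))
    T = s.size
    return np.mean([s[T - r] - s[r - 1] for r in range(1, m + 1)])

def contrast(X, Y, m=10):
    # Hartley-entropy-based contrast (6):
    # sum_j log R(x^(j)T Y) - log|det X|, with R replaced by R_bar_m.
    val = -np.log(abs(np.linalg.det(X)))
    for j in range(X.shape[1]):
        val += np.log(range_est(X[:, j] @ Y, m))
    return val
```

With m = 1 the estimate reduces to the plain sample range max − min; larger m trades bias for robustness to outliers, as discussed in Section V.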

Notice that the nondifferentiability issue stems from the absolute value in (7), as pointed out in [4].

V. EXPERIMENTAL RESULTS

The source separation performance of the proposed technique—the Nelder–Mead algorithm on OB minimizing the Hartley-entropy-based ICA contrast (OB-NMH)—has been rigorously tested with mixtures derived from correlated signal/image data having bimodal or multimodal distributions with a finite support width. Indeed, quasi-correlated sources with multimodal density distributions are commonly encountered in real situations, which will pose challenges to an ICA algorithm [5]. We have included the following ICA schemes in the empirical study to assess how OB-NMH fares with respect to a few recent and popular approaches: 1) FastICA [21]; 2) joint approximate diagonalization of eigenmatrices (JADE) [22]; 3) information-maximization

³The Hartley entropy of a discrete random variable Z with possible outcomes 1, 2, . . . , q is given by the cardinality of Z, i.e., H(Z) := log q = log(card(Z)). In the context of sources having distributions with a finite support, the range function is equivalent to the cardinality.

(infomax) [23]; 4) support width ICA (SWICA) [7]; 5) nonorthogonal SWICA (NOSWICA) [8]; 6) entropy bound minimization ICA (EBMICA) [24]; and 7) ICA by quadratic measures of independence (QICA) [25]. The FastICA, JADE, and infomax algorithms are not-so-recent yet quite popular and extensively used in several applications. The SWICA minimizes the range-based ICA contrast (6) advocated in this brief subject to the orthogonality constraint among the components, whereas the NOSWICA relaxes this constraint. The EBMICA and QICA are the most recent ones aiming at accurate estimation of the underlying sources. While the EBMICA employs an accurate entropy estimator, the QICA relies on an exhaustive search technique to achieve a superior outcome. Interestingly, both NOSWICA and EBMICA maintain the unit-norm column constraint, thereby sharing a similarity with our framework. The parameter values in Algorithm 1 are set to be the same as followed in the MATLAB routine fminsearch. The software system was implemented in MATLAB R2011b on a MacBook Pro (Intel Core i7 2.2-GHz CPU, 8-GB 1333-MHz DDR3) using Mac OS X Lion 10.7.4. In view of the scale and order indeterminacies of an ICA algorithm, the source estimates from the simulations were first rescaled and reordered, and then their root-mean-square error (RMSE) values with respect to the original sources were computed. Given the original sources c^{(j)} = [c_1^{(j)}, c_2^{(j)}, . . . , c_T^{(j)}], j = 1, 2, . . . , n, and the estimated ones (after rescaling and reordering) a^{(j)} = x^{(j)⊤} Y = [a_1^{(j)}, a_2^{(j)}, . . . , a_T^{(j)}], j = 1, 2, . . . , n, the RMSE is calculated as

RMSE = [ Σ_{j=1}^{n} Σ_{t=1}^{T} (c_t^{(j)} − a_t^{(j)})² / Σ_{j=1}^{n} Σ_{t=1}^{T} (c_t^{(j)})² ]^{1/2}.    (8)

The RMSE in (8) is always nonnegative and scale invariant; besides, this measure does not suffer from the permutation ambiguity pertinent to an ICA approach, since the source estimates are reordered. For a simulation study involving synthetic signals, we generated six signals for every trial within the amplitude range [−1.5, 1.5] (i.e., by disregarding the samples generated outside this interval), each consisting of 10 000 samples, and randomly drawn from six different Gaussian mixture distributions (GMDs). Each GMD comprises six Gaussian distributions, whose means, standard deviations, and mixing coefficients are uniform randomly distributed values on the intervals [−1.5, 1.5], [0, 1], and [0, 1], respectively. The bounded signals are preferred because Hartley-entropy-based ICA schemes are only meant for demixing the mixtures of sources with a finite support width. The signal samples were drawn from the GMDs, since it is more likely to observe signals following multimodal distributions in real scenarios. During each trial, a multidimensional synthetic signal of size 10 000 × 6 was mixed by a 6 × 6 nonorthogonal mixing matrix with elements drawn from a uniform distribution on the interval [0, 1]. Prior to demixing, the mixtures were whitened due to the supposition that this preprocessing step would improve the convergence of an ICA algorithm irrespective of the constraint imposed on the components. Nevertheless, it must


Fig. 1. Amplitude error between each original source and estimated sample from a mixture of six synthetic signals. Each signal comprises 10 000 samples drawn from a GMD, and the underlying Gaussian distributions have the means, standard deviations, and mixing coefficients uniform randomly distributed on the intervals [−1.5, 1.5], [0, 1], and [0, 1], respectively. For grading the performance quantitatively, the RMSE values of FastICA, JADE, infomax, SWICA, NOSWICA, EBMICA, QICA, and OB-NMH are listed as 0.362, 0.242, 0.270, 0.384, 0.549, 0.191, 0.492, and 0.034, respectively.

TABLE I. Mean and standard deviation of RMSE values for the source estimates from the experimented ICA methods using synthetic signals, face, and scene images. This simulation involves 25 trial runs with a mixture of six signals, each comprising 10 000 samples drawn from six different GMDs in the amplitude range [−1.5, 1.5], and six randomly selected 350 × 300 face and 256 × 256 scene images from a pool of nine images in each category. The values in boldface represent the minimum attained among the investigated approaches.

be borne in mind that the whitening of mixtures is not a mandatory step for the OB-NMH. To demonstrate that the OB-NMH substantially improves the separation performance

in comparison with seven other aforementioned ICA methods, the RMSEs from 25 independent trial runs have been measured, and the corresponding mean and standard deviation values are recorded in Table I. In addition, to visually appreciate how the demixing performance of OB-NMH compares with that of the well-known approaches, the amplitude error between each estimated and original source sample is plotted in Fig. 1 for all the algorithms in a single trial. For further validation, an experiment has been set up with the following image data sets: 1) nine face images of size 350 × 300 from the data set of Informatics and Mathematical Modelling, Technical University of Denmark (http://www.imm.dtu.dk/~aam/) [26] and 2) nine natural scenes of size 256 × 256 from the data set provided by Computational Visual Cognition Laboratory, Massachusetts Institute of Technology (http://cvcl.mit.edu/database.htm) [27]. The images under these categories share a common underlying pattern [8], and the illustrations in Fig. 2 exemplify the correlation among such image sources. In each trial, six images were randomly selected from a pool of nine face/scenery images, and the pixel gray levels in each image were concatenated columnwise to construct the sources. As a result, the multidimensional data derived from the face images and natural scenes are of size 105 000 × 6 and 65 536 × 6, respectively. As described earlier, the mixtures of multidimensional image data were obtained with the help of a random 6 × 6 nonorthogonal mixing matrix in each trial.
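The evaluation protocol above (rescale the estimates, then score with (8)) can be sketched as follows. The least-squares rescaling rule is an assumption on my part, since the brief does not spell out how the scale indeterminacy is resolved; reordering is taken as already done:

```python
import numpy as np

def rescale(C, A):
    # Per-component least-squares scale (and sign) aligning each
    # estimate a^(j) (row of A) with its source c^(j) (row of C).
    g = np.sum(C * A, axis=1) / np.sum(A * A, axis=1)
    return A * g[:, None]

def rmse(C, A):
    # Eq. (8): sqrt( sum_{j,t} (c - a)^2 / sum_{j,t} c^2 ),
    # nonnegative and invariant to a joint rescaling of C and A.
    return np.sqrt(np.sum((C - A) ** 2) / np.sum(C ** 2))
```

With this convention, an estimate that differs from the source only by a per-component gain (including a sign flip) scores an RMSE of zero after rescaling.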


Fig. 2. Correlation matrices pertaining to the (a) face and (b) scene images to illustrate the fact that the correlation among the sources is often significant.

The prewhitened mixtures were allowed to be demixed with all the investigated ICA algorithms. Subsequently, in each image category, the mean and standard deviation of RMSEs were computed for the source estimates from 25 independent trials, and entered in Table I. Finally, to subjectively assess the source separation quality of OB-NMH, the reconstructed sources with the experimented algorithms in a specific instance of face image simulation are collocated in Fig. 3 along with the respective RMSE values. From the empirical results summarized in Table I, one can infer that the computational advantages of surrogate-contrast-based ICA techniques are counterbalanced by an inferior source separation, as evidenced by the respective RMSE values. Even though the SWICA and NOSWICA minimize the same contrast function as in (6), the former approach relies on a rudimentary optimization strategy in order to merely demonstrate the usefulness of (6), and the latter disappointingly lacks robustness as cautioned in [4, p. 262]. Notwithstanding the exhaustive search method resorted to, the QICA often fails to estimate the sources satisfactorily, as corroborated by a large RMSE value. The EBMICA yields demixing matrix estimates of acceptable quality with less computational effort. Worth mentioning is that the OB-NMH outperforms the rest by recovering the sources with the least RMSE value. Besides, the outcome of Wilcoxon's signed-rank test⁴ alludes to the fact that the reduction in the RMSE from the proposed algorithm is statistically significant. As a consequence, the OB-NMH may be recommended in applications where the solution accuracy is of paramount interest.

⁴Except in one out of 21 instances, the p-value obtained is 1.2290 × 10⁻⁵, which implies that the RMSE from the OB-NMH is less than that of the method under comparison in all the 25 trial runs. Note that the p-value corresponding to the EBMICA versus OB-NMH in the simulation with synthetic signals is 4.0729 × 10⁻⁵, as the OB-NMH outclassed the EBMICA in 24 trial runs.

Fig. 3. Recovered sources while demixing a mixture of six face images of size 350 × 300 for visual scrutiny. (a) Original source images. (b) Mixed images. (c)–(j) Estimated and reordered sources from (c) FastICA, (d) JADE, (e) infomax, (f) SWICA, (g) NOSWICA, (h) EBMICA, (i) QICA, and (j) OB-NMH with the RMSE values of 0.462, 0.437, 0.434, 0.337, 0.347, 0.366, 0.337, and 0.062, respectively.

TABLE II. Demixing performance of NOSWICA and OB-NMH in 25 trial runs by randomly choosing six out of nine natural, aerial, and texture images of size 200 × 200 from each class. The mean RMSEs of OB-NMH (in boldface) are lower than those of NOSWICA in all instances with a statistical significance.

The NOSWICA, in particular, is analogous to our approach in the sense that the contrast function and the constraint remain the same as those of the OB-NMH. Therefore, to ascertain the benefits offered by Algorithm 1 in terms of demixing performance, the mean RMSEs of source-separated natural, aerial, and texture images from 25 trial runs, each involving six randomly chosen and mixed images, are juxtaposed in Table II. The natural images were taken from the Berkeley segmentation data set and benchmark


Fig. 4. Impact of the bounded and unbounded AWGN added to the sources on the demixing performance of OB-NMH measured by the RMSE.


ties with "sharp bounds" facilitate the range estimation, the image pixel gray levels were bounded by the interval [0, 255] subsequent to the introduction of Gaussian noise of increasing variance. Interestingly enough, the outcome complies with our conjecture. The determination of m as per [7] guarantees that the range estimation error is bounded above by a small value ε with a high probability, as evidenced by an outstanding source separation in simulations. Nevertheless, depending on the input data type (e.g., digital images, audio, and analog electrical signals), it is speculated that even smaller values of m—on the order of tens—would preserve the solution quality while relieving the computational burden (see Fig. 5). In spite of the remarkable performance of the Hartley-entropy-based ICA, it has to be cautioned, though, that this scheme is not recommended for separating sources with thin-tailed density distributions for obvious reasons.

VI. CONCLUSION

A Riemannian framework that minimizes the range-based contrast is succinctly presented for recovering the quasi-correlated sources from their mixtures. While the contrast function is devoid of spurious local minima, it warrants the use of a nonsmooth optimizer. In this respect, how effectively a Nelder–Mead algorithm tailored to the OB could separate the sources upon multiple restarts is evidenced from diverse simulations. A future research direction would be to attempt subgradient methods to verify whether the extra effort required in the implementation process is commensurate with the performance improvements.

ACKNOWLEDGMENT

Fig. 5. RMSE values corresponding to the choice of m in the interval [2, 500], while the sources were estimated from a mixture of six 200 × 200 face and scene images by the OB-NMH scheme. The red asterisk corresponds to the RMSE resulted for the m value determined according to [7].

(https://www.eecs.berkeley.edu/Research/Projects/CS/vision/bs ds/) [28], whereas the aerial and texture images were acquired from the Signal and Image Processing Institute, University of Southern California (http://sipi.usc.edu/database/). It is surmised that the unsatisfactory performance of NOSWICA is attributed to the inherent drawback of a deflation scheme, wherein the source estimation error gets accumulated [11, p. 2488]. The range-based contrast is premised on the boundedness assumption of the original sources, e.g., digital images. In order to validate its robustness when the supports of the source densities have “fuzzy hedges” due to noise, a simulation has been performed with six natural images contaminated with the additive white Gaussian noise (AWGN) (see Fig. 4). Prior to mixing them, the sources were corrupted with the AWGN having zero mean and the standard deviation within the interval [0, 1]. As anticipated, the performance of the OB-NMH degrades in proportion to the noise level. A stratagem to cope with this difficulty is a careful determination of m. Furthermore, to investigate whether the source densi-

The author would like to express his sincere gratitude for the insightful, constructive, and meticulous comments of the reviewers and the Associate Editor. Furthermore, he would like to thank Prof. P.-A. Absil, Université catholique de Louvain, Belgium, for going through an earlier version of this brief and providing suggestions.

REFERENCES

[1] K. E. Hild, D. Erdogmus, and J. Príncipe, "Blind source separation using Renyi's mutual information," IEEE Signal Process. Lett., vol. 8, no. 6, pp. 174–176, Jun. 2001.
[2] D.-T. Pham, F. Vrins, and M. Verleysen, "On the risk of using Rényi's entropy for blind source separation," IEEE Trans. Signal Process., vol. 56, no. 10, pp. 4611–4620, Oct. 2008.
[3] D.-T. Pham, "Blind separation of instantaneous mixture of sources based on order statistics," IEEE Trans. Signal Process., vol. 48, no. 2, pp. 363–375, Feb. 2000.
[4] F. Vrins, "Contrast properties of entropic criteria for blind source separation: A unifying framework based on information-theoretic inequalities," Ph.D. dissertation, Faculté des Sci. Appliquées, Univ. catholique de Louvain, Louvain-la-Neuve, Belgium, Mar. 2007.
[5] F. Vrins, D.-T. Pham, and M. Verleysen, "Mixing and non-mixing local minima of the entropy contrast for blind source separation," IEEE Trans. Inf. Theory, vol. 53, no. 3, pp. 1030–1042, Mar. 2007.
[6] D.-T. Pham and F. Vrins, "Local minima of information-theoretic criteria in blind source separation," IEEE Signal Process. Lett., vol. 12, no. 11, pp. 788–791, Nov. 2005.
[7] F. Vrins, J. A. Lee, and M. Verleysen, "A minimum-range approach to blind extraction of bounded sources," IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 809–822, May 2007.


[8] J. A. Lee, F. Vrins, and M. Verleysen, "Non-orthogonal support-width ICA," in Proc. 14th Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Bruges, Belgium, Apr. 2006, pp. 351–358.
[9] S. E. Selvan, U. Amato, C. Qi, K. A. Gallivan, M. F. Carfora, M. Larobina, and B. Alfano, "Descent algorithms on oblique manifold for source-adaptive ICA contrast," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1930–1947, Dec. 2012.
[10] S. E. Selvan, A. Chattopadhyay, U. Amato, and P.-A. Absil, "Range-based non-orthogonal ICA using cross-entropy method," in Proc. 20th Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Bruges, Belgium, Apr. 2012, pp. 519–524.
[11] S. E. Selvan, P. B. Borckmans, A. Chattopadhyay, and P.-A. Absil, "Spherical mesh adaptive direct search for separating quasi-uncorrelated sources by range-based independent component analysis," Neural Comput., vol. 25, no. 9, pp. 2486–2522, Sep. 2013.
[12] G. Dirr, U. Helmke, and C. Lageman, "Nonsmooth Riemannian optimization with applications to sphere packing and grasping," in Lagrangian and Hamiltonian Methods for Nonlinear Control, vol. 366, F. Allgöwer, P. Fleming, P. Kokotovic, A. B. Kurzhanski, H. Kwakernaak, A. Rantzer, J. N. Tsitsiklis, F. Bullo, and K. Fujimoto, Eds. New York, NY, USA: Springer-Verlag, 2007, pp. 29–45.
[13] F. H. Walters, L. R. Parker, S. L. Morgan, and S. N. Deming, Sequential Simplex Optimization. Boca Raton, FL, USA: CRC, 1991.
[14] D. W. Dreisigmeyer. (2007, Feb.). Direct Search Algorithms Over Riemannian Manifolds [Online]. Available: http://www.optimization-online.org/DB_HTML/2007/08/1742.html
[15] S. Lee, M. Choi, H. Kim, and F. C. Park, "Geometric direct search algorithms for image registration," IEEE Trans. Image Process., vol. 16, no. 9, pp. 2215–2224, Sep. 2007.
[16] H. E. Cetingul, B. Afsari, M. J. Wright, P. M. Thompson, and R. Vidal, "Group action induced averaging for HARDI processing," in Proc. 9th IEEE ISBI, Princeton, NJ, USA, May 2012, pp. 1389–1392.
[17] B. Afsari, R. Tron, and R. Vidal, "On the convergence of gradient descent for finding the Riemannian center of mass," SIAM J. Control Optim., vol. 51, no. 3, pp. 2230–2260, 2013.


[18] P. T. Fletcher, C. Lu, S. M. Pizer, and S. Joshi, "Principal geodesic analysis for the study of nonlinear statistics of shape," IEEE Trans. Med. Imag., vol. 23, no. 8, pp. 995–1005, Aug. 2004.
[19] K. I. M. McKinnon, "Convergence of the Nelder–Mead simplex method to a nonstationary point," SIAM J. Optim., vol. 9, no. 1, pp. 148–158, 1998.
[20] Q. H. Zhao, D. Urošević, N. Mladenović, and P. Hansen, "A restarted and modified simplex search for unconstrained optimization," Comput. Oper. Res., vol. 36, no. 12, pp. 3263–3271, Dec. 2009.
[21] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626–634, May 1999.
[22] J.-F. Cardoso and A. Souloumiac, "Blind beamforming for non Gaussian signals," IEE Proc. F Radar Signal Process., vol. 140, no. 6, pp. 362–370, Dec. 1993.
[23] S. Makeig, A. J. Bell, T.-P. Jung, and T. J. Sejnowski, "Independent component analysis of electroencephalographic data," Adv. Neural Inf. Process. Syst., vol. 8, no. 1, pp. 145–151, 1996.
[24] X.-L. Li and T. Adalı, "Independent component analysis by entropy bound minimization," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5151–5164, Oct. 2010.
[25] S. Seth, M. Rao, I. Park, and J. C. Príncipe, "A unified framework for quadratic measures of independence," IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3624–3635, Aug. 2011.
[26] M. B. Stegmann, B. K. Ersbøll, and R. Larsen, "FAME—A flexible appearance modelling environment," IEEE Trans. Med. Imag., vol. 22, no. 10, pp. 1319–1331, Oct. 2003.
[27] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[28] D. R. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th IEEE Int. Conf. Comput. Vis., vol. 2, Vancouver, BC, Canada, Jul. 2001, pp. 416–425.
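As an illustrative aside on the role played by m in the discussion above: the sketch below shows one plausible form of a quasi-range-based contrast, in which each estimated source contributes the log of a range estimate obtained by averaging its m smallest and m largest sample order statistics (which tempers the plain max-minus-min range against noise near the support edges). The function name and interface are hypothetical, and this is not claimed to be the exact estimator of [7].

```python
import numpy as np

def range_contrast(Y, m):
    """Sum over estimated sources of the log of an m-term range estimate.

    Y : (n, T) array of current source estimates (rows = signals).
    m : number of extreme order statistics averaged at each tail.
    """
    total = 0.0
    for y in Y:
        s = np.sort(y)
        # Average the m smallest and m largest samples; with m = 1 this
        # reduces to the ordinary statistical range max(y) - min(y).
        r = s[-m:].mean() - s[:m].mean()
        total += np.log(r)
    return total

# Toy usage: for a bounded (uniform on [0, 1]) source with many samples,
# the m-term range estimate is close to the true support width 1.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=(1, 10000))
print(range_contrast(y, m=20))
```

Averaging several extreme order statistics instead of taking the single extremes is one way to make the range estimate less sensitive to outliers, which is consistent with the observation above that a careful determination of m mitigates the degradation caused by AWGN.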
