
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 6, JUNE 2014

Distributed Dictionary Learning for Sparse Representation in Sensor Networks

Junli Liang, Miaohua Zhang, Xianyu Zeng, and Guoyang Yu

Abstract— This paper develops a distributed dictionary learning algorithm for the sparse representation of data distributed across the nodes of a sensor network, in settings where the data are sensitive or private, where there is no fusion center, or where the application involves big data. The main contributions of this paper are: 1) we decouple the combined dictionary atom update and nonzero coefficient revision procedure into a two-stage operation to facilitate distributed computation, first updating the dictionary atom via the eigenvalue decomposition of the sum of the residual (correlation) matrices across the nodes, and then implementing a local projection operation at each node to obtain the related representation coefficients; 2) we cast the aforementioned atom update problem as a set of decentralized optimization subproblems with consensus constraints, simplify the multiplier update for the symmetric undirected graphs encountered in sensor networks, and minimize the separable subproblems iteratively to attain consistent estimates; and 3) dictionary atoms are typically constrained to be of unit norm to avoid the scaling ambiguity, and we efficiently solve the resultant hidden convex subproblems by determining the optimal Lagrange multiplier. Experiments show that the proposed algorithm is a viable alternative for distributed dictionary learning and is suitable for the sensor network environment.

Index Terms— Distributed dictionary learning, sparse representation, distributed data, K-SVD, alternating-direction method of multipliers (ADMM), Lagrange multiplier, sensor networks, eigenvalue decomposition (EVD), fusion center, big data.

I. INTRODUCTION

Recent progress in sensor technology and distributed processing has facilitated the development of sensor networks [1]–[5]. In the conventional centralized scenario, the storage and computation load of the fusion center grows with the number of nodes and may exceed the available system resources. Moreover, the nodes are discouraged from sharing their local data when they collect sensitive or private information [6]–[14]. Furthermore, such a centralized scenario lacks robustness because a breakdown of the fusion center often causes the entire network

Manuscript received December 27, 2012; revised July 13, 2013 and November 24, 2013; accepted March 11, 2014. Date of publication April 10, 2014; date of current version May 7, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61172123, in part by the Research Youth Star of Shaanxi Province under Grant 2012KJXX-35, in part by the Research Project of Shaanxi Education Department under Grant 12JK0526, in part by the Doctor Supervisor Fund of University under Grant 20126118110004, and in part by FOK YING TUNG Education Fund under Grant 141119. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Adrian G. Bors. The authors are with the Xi’an University of Technology, Xi’an 710048, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2316373

to collapse. Therefore, in many applications, a decentralized scenario, i.e., a distributed scheme, is often considered [6]–[14]. Particularly, in this era of data explosion ("Big Data"), it is extremely important for networks to be able to solve large-scale problems with a huge volume of features or training examples, including the decentralized storage of these datasets and the accompanying distributed processing. In such a distributed scheme, each network node is able to process its own data to extract useful information by implementing some local computation, communication, and storage operations. Furthermore, the nodes can find a globally optimal (or approximately optimal) solution to the large-scale task by collaborating with their neighbors. In this case, the storage and computation load is shared by all the nodes, and such an architecture is robust to individual node failure.

In recent years there has been a growing interest in the sparse representation of signals [10], [11], [15]–[18], [46]. The sparse representation of a given signal involves two aspects: sparse coding and dictionary design. A wide variety of algorithms [19]–[23], including matching pursuit, orthogonal matching pursuit (OMP), and basis pursuit, have been developed to solve the former, which is an NP-hard problem. The approaches to the latter can be broadly classified into two categories, namely, analytic ones and learning-based (i.e., adaptive) ones. The ability of analytic dictionaries to accurately represent signals is limited by the uncertainty of whether they match the signals in question. In contrast, learning-based approaches deduce the dictionary adaptively from the given training data. Therefore, adaptive dictionaries match the signals more finely and thus have the potential to perform better in practical applications.

Generally, similar to the training purpose in machine learning (in order to represent and analyze all samples under a common standard [51]), adaptive dictionary learning algorithms learn a common dictionary (a general projection basis) from many representative training images, so that one can represent the given signals and thus evaluate, analyze, and process them under the same standard efficiently [17], [47]–[50]. (Whether many kinds (different styles) or a specific kind of training image is selected depends on the processing task at hand; in the former case the obtained dictionary is the so-called "global dictionary.") For example, Elad et al. learned a dictionary from many face images of various locations to fill in missing face data or compress face images [17]. Yang et al. learned a dictionary from 70 images for single-image super-resolution reconstruction [47]. For hypergraph Laplacian sparse coding, Gao et al. learned a dictionary from a dataset with many categories of images [48]. Zheng et al. learned a dictionary from many



training images for image classification or clustering [49]. Mairal et al. [50] learned a dictionary from 24 images to obtain a nonlinear image mapping and thus restore noisy images, or for digital art authentication. In addition, they [50] also learned a dictionary from 3000 images for compressed sensing. Adaptive dictionary learning algorithms include the method of optimal directions (MOD) [18], K-SVD [17], and so on (see [15] and [16] for details). In particular, the K-SVD method [17], a generalization of the K-means clustering process, is an iterative method that alternates between sparse coding and dictionary update, and has been shown to fit the training data more finely and capture their salient features more accurately [17], [42].

Distributed storage and processing in sensor networks promote the development of distributed sparse representation, such as distributed basis pursuit [10], distributed sparse linear regression [53], and distributed sensor perception [11]. Distributed computation frameworks can be broadly classified into three categories, namely, diffusion [54], gossip strategies [6], and the consensus-based alternating-direction method of multipliers (ADMM) [7]. Although this paper also uses the ADMM framework, it pays more attention to adaptively learning a common dictionary from the distributed data and sparsely representing them, rather than finding a common sparse vector for them [10], [11]. The potential applications include distributed fusion, distributed source coding, distributed scene representation, distributed video coding, multiview image processing, and distributed multiview video coding [12], [13], [24]–[29], etc. Besides, in such privacy-sensitive tasks as digital art authentication [6], [14], [50], authentic paintings distributed across multiple museums are neither public nor shareable, but need to be "integrated" to protect the authentic paintings and distinguish imitations.

This paper proposes a distributed dictionary learning algorithm for sparse representation of the data distributed across network nodes. It is worthwhile to highlight several aspects of the proposed algorithm here:

1) Unlike the centralized K-SVD algorithm [17], which requires gathering the distributed data together to adaptively learn a common dictionary, the proposed algorithm learns such a dictionary in a completely distributed manner, where each node implements local operations on its private data and only exchanges the temporarily learned dictionary atoms with its neighboring nodes rather than collecting the distributed data together. Finally, all the nodes obtain a consistent dictionary from the distributed data;

2) Although both the centralized K-SVD algorithm [17] and the proposed one renew the dictionary atoms successively in the dictionary update step, in the former the dictionary atom update and the related nonzero coefficient revision are not only simultaneously optimized but also coupled across all nodes. In contrast, the proposed algorithm decouples the combined procedure into a two-step operation to adapt to the distributed data, first updating the dictionary atom via the eigenvalue decomposition (EVD) of the sum of the residual (correlation) matrices across the nodes and then implementing a local

Fig. 1. A generic sensor network.

projection operation to obtain the related representation coefficients at each node;

3) Motivated by the fact that ADMM blends the decomposability of dual ascent with the superior convergence property of the method of multipliers [7], this paper casts the aforementioned atom update problem as a set of decentralized optimization subproblems with consensus constraints. We simplify the multiplier update for the symmetric undirected graphs in sensor networks, so that each node can implement the sparse-coding and dictionary update operations locally and exchange its local atom with its neighbors to obtain the consistent dictionary;

4) In order to avoid the scaling ambiguity, dictionary atoms are typically constrained to be of unit norm [15]–[18]. We efficiently solve the resultant hidden convex subproblems by determining the optimal Lagrange multiplier [30], [43].

The rest of this paper is organized as follows. The problem is formulated in Section II. The distributed dictionary learning algorithm is developed in Section III. Experimental results are presented in Section IV. Conclusions are drawn in Section V.

Notation: Vectors and matrices are denoted by boldface lowercase and uppercase letters, respectively. $\|\cdot\|_F$ denotes the Frobenius norm of a vector or a matrix, $(\cdot)^T$ denotes the transpose of a matrix or vector, and $I$ denotes the identity matrix of an appropriate dimension. $\mathrm{diag}\{v\}$ represents a diagonal matrix in which the elements of $v$ are on the diagonal. $x_k^l$ represents the $l$th row of the sparse matrix $X_k$ corresponding to the $k$th node, and $d^n$ stands for the $n$th atom of the related set. $\mathrm{trace}\{\cdot\}$ denotes the trace of a matrix, and $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors. Other mathematical symbols are defined after their first appearance.

II. PROBLEM FORMULATION

Consider a network with $K$ nodes modeled by an undirected graph $G(\mathcal{V}, E)$ with vertices $\mathcal{V} = \{1, 2, \ldots, K\}$ corresponding to the nodes (blue circles in Fig. 1) and edges $E$ (black straight lines) describing the links among communicating nodes [1]–[5]. Each node is capable of storing some local data, performing computations on it, and exchanging messages with its neighbors.1

1 In this paper, we consider general sensor networks, in which a (sensor) node should be understood as an abstract entity, including a camera sensor, a wireless sensor, a medical device, a sensing device, or, more generally, a networked computer.
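As a concrete illustration (not taken from the paper), the communication structure a node needs is just its neighbor set $Ne(k)$; a minimal Python sketch, assuming the undirected graph is given as an edge list, is:

```python
def neighbor_sets(num_nodes, edges):
    """Build Ne(k) for an undirected graph given as a list of (k, k') edges."""
    ne = {k: set() for k in range(num_nodes)}
    for k, kp in edges:
        ne[k].add(kp)
        ne[kp].add(k)  # undirected: k' in Ne(k) implies k in Ne(k')
    return ne

# Example: a small ring of K = 6 nodes (hypothetical topology).
K = 6
Ne = neighbor_sets(K, [(k, (k + 1) % K) for k in range(K)])
```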


Assume a group of training examples $\{Y_1, Y_2, \ldots, Y_K\}$ distributed across these $K$ nodes, each with $L$ examples. When all the examples can be collected by a fusion center, the centralized scheme [17], [18] can be applied to obtain an adaptive dictionary by solving the following centralized optimization problem:

$$\min_{D, X} \|Y - DX\|_F^2 = \sum_{k=1}^{K} \|Y_k - DX_k\|_F^2 \quad \text{subject to } \|X_k(\bullet)\|_0 \le T_0, \qquad (1)$$

where $D = [d^1 \ldots d^n \ldots d^N] \in R^{M \times N}$ is the so-called dictionary, $d^n \in R^{M \times 1}$ is the $n$th dictionary atom with unit norm for $n = 1, \ldots, N$, and $X = [X_1\ X_2 \ldots X_K] \in R^{N \times LK}$ is the sparse representation coefficient matrix. $\|X_k(\bullet)\|_0 \le T_0$ represents the constraint that the number of nonzero coefficients of each column of $X_k$ must not be larger than a prespecified threshold $T_0$.

However, when these distributed data (training examples) cannot be gathered or there is no fusion center, as mentioned in Section I, a decentralized counterpart is expected. The objective of this paper is to find a common adaptive dictionary for the $K$ nodes to sparsely represent the distributed data $\{Y_1, Y_2, \ldots, Y_K\}$. Through local computations and communication among neighboring nodes, all the $K$ nodes can finally obtain a consistent dictionary, which attains approximately or even exactly the same performance as that of the centralized counterpart (if it existed).

III. PROPOSED ALGORITHM

In this section, we first review the conventional (centralized) K-SVD algorithm, and then develop a new distributed dictionary learning algorithm to sparsely represent the data distributed across the nodes of sensor networks.

A. Review of the Centralized K-SVD Algorithm

The K-SVD algorithm [17], a generalization of the K-means clustering process, alternates between sparse coding of the examples (based on the current dictionary) and dictionary atom update to fit the data. The former, the sparse coding step, obtains the sparse coefficients using a pursuit algorithm (e.g., OMP) while keeping the dictionary fixed. In the latter, atoms are updated not simultaneously but one at a time, i.e., renewing one of the atoms while keeping the others fixed until all atoms are updated, and the related nonzero coefficients are revised simultaneously with the update of each atom.

To update the $m$th atom $d^m$ in (1), the K-SVD algorithm first computes the residual matrix $E_k^m$ of the $k$th node:

$$E_k^m = Y_k - \sum_{n=1, n \neq m}^{N} d^n x_k^n, \qquad (2)$$

where $x_k^n$ is the $n$th row of $X_k$. Besides, for $k = 1, \ldots, K$, define $\omega_k^m$ as the group of indices pointing to the nonzero elements of $x_k^m$, i.e.,

$$\omega_k^m = \{\, i \mid 1 \le i \le L \ \text{and}\ x_k^m(i) \neq 0 \,\}, \qquad (3)$$

and $\Omega_k^m$ is a matrix of size $L \times \mathrm{card}\{\omega_k^m\}$, with ones on the $(\omega_k^m(i), i)$th entries and zeros elsewhere, where $\mathrm{card}\{\omega_k^m\}$ stands for the number of elements in the group $\omega_k^m$ and $x_k^m(i)$ denotes the $i$th element of $x_k^m$. Then, the K-SVD algorithm finds the closest rank-1 matrix (in Frobenius norm) that approximates the augmented residual matrix $E_\Omega^m = [E_1^m\Omega_1^m\ \ E_2^m\Omega_2^m\ \ldots\ E_K^m\Omega_K^m]$ (all $K$ nodes) by minimizing the following objective function:

$$\min_{d^m, \tilde{x}} \|E_\Omega^m - d^m \tilde{x}\|_F^2, \qquad (4)$$

where $d^m \in R^{M \times 1}$ and $\tilde{x} \in R^{1 \times (\sum_{k=1}^K \mathrm{card}\{\omega_k^m\})}$. After implementing the singular value decomposition (SVD) of $E_\Omega^m$, the K-SVD algorithm updates the $m$th atom $d^m$ in (1) with the left singular vector related to the largest singular value (LSV), replaces the original nonzero coefficients $[x_1^m\Omega_1^m\ \ x_2^m\Omega_2^m \ldots x_K^m\Omega_K^m]$ with the right singular vector (related to the LSV) multiplied by the LSV, and then revises the other atoms in the same manner until all atoms are updated. The dictionary update and sparse-coding procedures are alternated until a prespecified stopping criterion is satisfied.
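To make this update concrete, the following minimal sketch (an illustrative simplification, not the authors' code) performs one centralized K-SVD atom update: it forms the restricted residual matrix $E_\Omega^m$, takes its rank-1 SVD approximation, and writes back the atom and the revised nonzero coefficients.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, m):
    """One centralized K-SVD update of atom m and its nonzero coefficients.

    Y: M x L data (all nodes' examples gathered), D: M x N dictionary,
    X: N x L sparse coefficients. D and X are modified in place.
    """
    omega = np.flatnonzero(X[m, :])          # columns that use atom m
    if omega.size == 0:
        return
    # Residual without atom m, restricted to the columns in omega.
    E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, m], X[m, omega])
    # Best rank-1 approximation via SVD.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, m] = U[:, 0]                        # left singular vector (unit norm)
    X[m, omega] = s[0] * Vt[0, :]            # revised nonzero coefficients
```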

B. New Distributed Dictionary Learning Algorithm

However, when the data (examples) are distributed (stored) across the nodes of a sensor network, it is difficult to update the dictionary atom by implementing the SVD of the distributed residual matrices, which cannot be gathered together. To meet the distributed processing requirement of sensor networks, this subsection develops a distributed dictionary learning algorithm to sparsely represent the data $\{Y_1, Y_2, \ldots, Y_K\}$ distributed across the $K$ nodes only by means of local computations and communications among neighboring nodes, rather than by collecting $\{Y_1, Y_2, \ldots, Y_K\}$ together.

1) Decoupling the Combined Dictionary Atom Update and Nonzero Coefficient Revision Procedure Into Two-Stage Operations: The key difficulty for decentralizing the centralized K-SVD algorithm lies in the fact that the dictionary atom update and the related nonzero coefficient revision are not only simultaneously optimized but also coupled across all nodes, as shown in Eq. (4). Here we simplify (4) into an equivalent form that is suitable for distributed processing. We rewrite Eq. (4) in a trace computation form as

$$\begin{aligned}
\{d^m, \tilde{x}\} &= \arg\min_{d^m, \tilde{x}} \mathrm{trace}\big\{[E_\Omega^m - d^m\tilde{x}]^T [E_\Omega^m - d^m\tilde{x}]\big\} \\
&= \arg\min_{d^m, \tilde{x}} \mathrm{trace}\big\{(E_\Omega^m)^T E_\Omega^m\big\} - 2\tilde{x}(E_\Omega^m)^T d^m + \tilde{x}\tilde{x}^T \\
&\propto \arg\min_{d^m, \tilde{x}} -2\tilde{x}(E_\Omega^m)^T d^m + \tilde{x}\tilde{x}^T,
\end{aligned} \qquad (5)$$

where the unit norm constraint on the dictionary atom is applied, i.e., $(d^m)^T d^m = 1$. Setting the partial derivative of Eq. (5) with respect to (w.r.t.) $\tilde{x}$ to zero yields

$$-2(d^m)^T E_\Omega^m + 2\tilde{x} = 0 \ \Rightarrow\ \tilde{x} = (d^m)^T E_\Omega^m. \qquad (6)$$

Inserting $\tilde{x} = (d^m)^T E_\Omega^m$ into Eq. (5) yields

$$\min_{d^m} -(d^m)^T E_\Omega^m (E_\Omega^m)^T d^m = -(d^m)^T \Big[\sum_{k=1}^{K} E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T\Big] d^m, \qquad (7)$$

which implies that the optimal $d^m$ for Eq. (4) or (7) is actually the eigenvector related to the largest eigenvalue of the sum of the distributed residual correlation matrices $\sum_{k=1}^{K} E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T$ from all the $K$ nodes.

Note that Eq. (4) is a simultaneous optimization problem of two coupled variables, i.e., the update of the dictionary atom $d^m$ is combined with the revision of the related nonzero coefficients $\tilde{x}$. In addition, this optimization problem is coupled across all the $K$ nodes due to $E_\Omega^m$. Compared with Eq. (4), the equivalent two-step operations shown in Eqs. (7) and (6) decouple this procedure, first updating the dictionary atom and then implementing a simple projection operation to obtain the related sparse representation coefficients, i.e., $\tilde{x} = (d^m)^T E_\Omega^m$. The main benefit of this simplification is that the revision of the related nonzero coefficients can be implemented locally at each node once all the nodes have obtained the dictionary atom $d^m$, i.e., $\tilde{x}_k = (d^m)^T E_k^m\Omega_k^m$ for $k = 1, \ldots, K$, where $\tilde{x}_k$ is the $k$th segment of $\tilde{x} = [\tilde{x}_1 \ldots \tilde{x}_k \ldots \tilde{x}_K]$. In other words, if the dictionary atom update can be solved in a distributed way, the combined procedure of the dictionary atom update and the related nonzero coefficient revision can be achieved in a completely distributed manner.

2) Distributed Update of Dictionary Atom and Simplification of Lagrange Multipliers: In Eq. (7) the atom $d^m$ is related to the $K$ distributed residual matrices across the $K$ nodes, i.e., $E_k^m\Omega_k^m$ for $k = 1, \ldots, K$, and thus it cannot be directly determined from the SVD of the augmented residual matrix $E_\Omega^m$. To adapt to the distributed storage, we replace the global variable $d^m$ with $K$ local ones $\{d_k^m\}_{k=1}^K$ and also enforce the consensus constraints $d_k^m = d_{k'}^m$ for $k, k' \in \{1, 2, \ldots, K\}$. Obviously, the main benefit is that the local variable $d_k^m$ can be separately computed from the local residual matrix $E_k^m\Omega_k^m$ for $k = 1, \ldots, K$. Since each node can communicate directly only with its neighboring nodes, only the constraints $d_k^m = d_{k'}^m$, $k' \in Ne(k)$, among the neighboring nodes are used in the actual computations rather than all the constraints $d_k^m = d_{k'}^m$ for $k, k' \in \{1, 2, \ldots, K\}$, where $Ne(k)$ stands for the neighbors of the $k$th node. In fact, the two kinds of constraints are equivalent because the graph is connected. Based on the consideration above, we reformulate (7) as an equivalent problem with consensus constraints:

$$\min_{\{d_k^m\}} -\sum_{k=1}^{K} (d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m \quad \text{subject to } d_k^m = d_{k'}^m,\ k' \in Ne(k),\ k = 1, \ldots, K. \qquad (8)$$

Motivated by the fact that the ADMM method [7] blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers, and coordinates the solutions of small local subproblems to find a solution to a large-scale global problem, we introduce Lagrange multipliers $\lambda_{k,k'}$ corresponding to the constraints $d_k^m = d_{k'}^m$ for $k = 1, \ldots, K$ and $k' \in Ne(k)$. Thus, the augmented Lagrangian of Eq. (8) can be written in an unconstrained form as

$$\min_{\{d_k^m\},\, \lambda_{k,k'}} -\sum_{k=1}^{K} (d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k,k'},\, d_k^m - d_{k'}^m \big\rangle + \frac{\rho}{2}\sum_{k=1}^{K}\sum_{k' \in Ne(k)} \|d_k^m - d_{k'}^m\|_F^2, \qquad (9)$$

where $\rho > 0$ is the so-called augmented Lagrangian parameter in the ADMM algorithm (see [7] for details). After initialization, ADMM solves Eq. (9) using the following iterations:

$$\{d_k^m(t+1)\} = \arg\min_{\{d_k^m\}} -\sum_{k=1}^{K} (d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k,k'}(t),\, d_k^m - d_{k'}^m(t) \big\rangle + \frac{\rho}{2}\sum_{k=1}^{K}\sum_{k' \in Ne(k)} \|d_k^m - d_{k'}^m(t)\|_F^2, \qquad (10)$$

and

$$\lambda_{k,k'}(t+1) = \lambda_{k,k'}(t) + \rho\big(d_k^m(t+1) - d_{k'}^m(t+1)\big), \quad k = 1, \ldots, K,\ k' \in Ne(k). \qquad (11)$$

Note that the undirected graph is actually a symmetric one [10], [11], [32], i.e., if $k' \in Ne(k)$, then $k \in Ne(k')$. Therefore, Eq. (11) also contains

$$\lambda_{k',k}(t+1) = \lambda_{k',k}(t) + \rho\big(d_{k'}^m(t+1) - d_k^m(t+1)\big), \quad k' = 1, \ldots, K,\ k \in Ne(k'). \qquad (12)$$

Besides, the symmetric characteristics imply that the term $\sum_{k=1}^{K}\sum_{k' \in Ne(k)} \langle \lambda_{k,k'},\, d_k^m - d_{k'}^m \rangle$ of Eq. (9) can be rewritten in another form as

$$\begin{aligned}
\sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k,k'},\, d_k^m - d_{k'}^m \big\rangle
&= \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \Big(\big\langle \lambda_{k,k'},\, d_k^m \big\rangle - \big\langle \lambda_{k,k'},\, d_{k'}^m \big\rangle\Big) \\
&= \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k,k'},\, d_k^m \big\rangle - \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k',k},\, d_k^m \big\rangle \\
&= \sum_{k=1}^{K}\sum_{k' \in Ne(k)} \big\langle \lambda_{k,k'} - \lambda_{k',k},\, d_k^m \big\rangle. \qquad (13)
\end{aligned}$$


Furthermore, by combining Eqs. (11)–(13), we have

$$\begin{aligned}
\sum_{k' \in Ne(k)} \big(\lambda_{k,k'}(t) - \lambda_{k',k}(t)\big)
&= \sum_{k' \in Ne(k)} \Big[\lambda_{k,k'}(t-1) + \rho\big(d_k^m(t) - d_{k'}^m(t)\big) - \lambda_{k',k}(t-1) - \rho\big(d_{k'}^m(t) - d_k^m(t)\big)\Big] \\
&= \sum_{k' \in Ne(k)} \big(\lambda_{k,k'}(t-1) - \lambda_{k',k}(t-1)\big) + 2\rho \sum_{k' \in Ne(k)} \big(d_k^m(t) - d_{k'}^m(t)\big). \qquad (14)
\end{aligned}$$

Combining Eqs. (10)–(11) with (14), we define new Lagrange multipliers $\lambda_k$, $k = 1, \ldots, K$, and thus Eqs. (10)–(11) can be simplified into

$$\{d_k^m(t+1)\} = \arg\min_{\{d_k^m\}} -\sum_{k=1}^{K} (d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \sum_{k=1}^{K} \big\langle \lambda_k(t),\, d_k^m \big\rangle + \frac{\rho}{2}\sum_{k=1}^{K}\sum_{k' \in Ne(k)} \|d_k^m - d_{k'}^m(t)\|_F^2, \qquad (15)$$

and

$$\lambda_k(t+1) = \lambda_k(t) + 2\rho \sum_{k' \in Ne(k)} \big(d_k^m(t+1) - d_{k'}^m(t+1)\big), \quad k = 1, \ldots, K, \qquad (16)$$

where $\lambda_k(t) = \sum_{k' \in Ne(k)} \big(\lambda_{k,k'}(t) - \lambda_{k',k}(t)\big)$. Obviously, via the derivations in (12)–(16), the Lagrange multipliers $\lambda_{k,k'}$, for $k = 1, \ldots, K$ and $k' \in Ne(k)$, reduce to $\lambda_k$ for $k = 1, \ldots, K$.
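The simplified iteration (15)–(16) has a direct per-node implementation in which only atoms are exchanged between neighbors. The following sketch (an assumption-laden illustration, not the authors' code) shows that structure for one atom; `solve_local_atom` is a hypothetical helper returning the minimizer of the $k$th subproblem (a concrete sketch of it is given at the end of Section III-B.3 below).

```python
import numpy as np

def distributed_atom_update(A, Ne, rho, T, solve_local_atom):
    """Consensus ADMM iterations (15)-(16) for one dictionary atom.

    A[k]  : local residual correlation matrix E_k^m Omega_k^m (E_k^m Omega_k^m)^T.
    Ne[k] : neighbors of node k.  solve_local_atom is assumed to return the
            unit-norm minimizer of the kth subproblem given A[k], lambda_k,
            and the atoms most recently received from the neighbors.
    """
    K, M = len(A), A[0].shape[0]
    d = [np.random.randn(M) for _ in range(K)]
    d = [dk / np.linalg.norm(dk) for dk in d]
    lam = [np.zeros(M) for _ in range(K)]
    for _ in range(T):
        # Local atom updates, Eq. (15): each node uses only its own data
        # and the neighbors' previous atoms.
        d_new = [solve_local_atom(A[k], lam[k], [d[j] for j in Ne[k]], rho)
                 for k in range(K)]
        # Simplified multiplier update, Eq. (16).
        for k in range(K):
            lam[k] = lam[k] + 2 * rho * sum(d_new[k] - d_new[j] for j in Ne[k])
        d = d_new
    return d
```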

Eq. (15) implies that $d_k^m(t+1)$ for $k = 1, \ldots, K$ can be solved in a separable form. To develop a distributed algorithm, we separate Eq. (15) into $K$ subproblems. For convenience of presentation, we take the $k$th subproblem as an example:

$$\begin{aligned}
\{d_k^m(t+1)\} &= \arg\min_{d_k^m} -(d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \big\langle \lambda_k(t),\, d_k^m \big\rangle + \frac{\rho}{2}\sum_{k' \in Ne(k)} \|d_k^m - d_{k'}^m(t)\|_F^2 \\
&= \arg\min_{d_k^m} -(d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \lambda_k^T(t)\, d_k^m + \frac{\rho}{2}\sum_{k' \in Ne(k)} \Big[(d_k^m)^T d_k^m - 2 (d_k^m)^T d_{k'}^m(t) + \big(d_{k'}^m(t)\big)^T d_{k'}^m(t)\Big] \\
&\propto \arg\min_{d_k^m} -(d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + c_k (d_k^m)^T d_k^m + \Big[\lambda_k(t) - \rho\!\!\sum_{k' \in Ne(k)}\!\! d_{k'}^m(t)\Big]^T d_k^m, \qquad (17)
\end{aligned}$$

where $c_k = \frac{\rho}{2}\,\mathrm{card}(k)$ and $\mathrm{card}(k)$ denotes the number of neighbors of the $k$th node. Note that dictionary atoms are typically constrained to be of unit norm in order to avoid the scaling ambiguity [15]–[18]. Thus, the optimization problem above becomes

$$\min_{d_k^m} -(d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + c_k (d_k^m)^T d_k^m + \Big[\lambda_k(t) - \rho\!\!\sum_{k' \in Ne(k)}\!\! d_{k'}^m(t)\Big]^T d_k^m \quad \text{subject to } (d_k^m)^T d_k^m = 1. \qquad (18)$$

Furthermore, due to the unit norm constraint, (18) is equivalent to the following optimization problem:

$$\min_{d_k^m} -(d_k^m)^T E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T d_k^m + \Big[\lambda_k(t) - \rho\!\!\sum_{k' \in Ne(k)}\!\! d_{k'}^m(t)\Big]^T d_k^m \quad \text{subject to } (d_k^m)^T d_k^m = 1. \qquad (19)$$

3) Solution to Local Atom Update: For convenience of description, we define $A_k^m = E_k^m\Omega_k^m (E_k^m\Omega_k^m)^T$ and $b_k^m = \frac{1}{2}\big[\lambda_k(t) - \rho\sum_{k' \in Ne(k)} d_{k'}^m(t)\big]$. Thus, Eq. (18) becomes

$$\min_{d_k^m} (d_k^m)^T \big(c_k I - A_k^m\big) d_k^m + 2 (b_k^m)^T d_k^m \quad \text{subject to } (d_k^m)^T d_k^m = 1. \qquad (20)$$

We define the Lagrangian $f: R^{M\times 1} \times R \rightarrow R$ [30] associated with (20) as

$$f(d_k^m, \lambda) = (d_k^m)^T \big(c_k I - A_k^m\big) d_k^m + 2 (b_k^m)^T d_k^m + \lambda\big((d_k^m)^T d_k^m - 1\big). \qquad (21)$$

Setting the partial derivatives of $f(d_k^m, \lambda)$ w.r.t. both $d_k^m$ and $\lambda$ to zero yields

$$2\big(c_k I - A_k^m\big) d_k^m + 2 b_k^m + 2\lambda d_k^m = 0, \qquad (22)$$

and $(d_k^m)^T d_k^m = 1$, respectively, which are known as the Lagrange equations. The analytical solution to (22) is given by

$$d_k^m = -\big(\lambda I + c_k I - A_k^m\big)^{-1} b_k^m. \qquad (23)$$

Inserting (23) into $(d_k^m)^T d_k^m = 1$ yields

$$g(\lambda) = (b_k^m)^T \big(\lambda I + c_k I - A_k^m\big)^{-2} b_k^m - 1, \qquad (24)$$

which shows that the optimal value of the Lagrange multiplier $\breve{\lambda}$ is the root of $g(\lambda) = 0$. The eigenvalue decomposition (EVD) of the positive semidefinite matrix $A_k^m$ yields

$$A_k^m = U V U^T, \qquad (25)$$

where the eigenvector matrix $U = [u_1\ u_2 \ldots u_M]$ and the related diagonal eigenvalue matrix $V = \mathrm{diag}\{[v_1, v_2, \ldots, v_M]\}$, in which the eigenvalues are arranged in descending order, i.e., $v_1 \ge v_2 \ge \ldots \ge v_M \ge 0$. Thus,

$$g(\lambda) = (b_k^m)^T U \big(\lambda I + c_k I - V\big)^{-2} U^T b_k^m - 1, \qquad (26)$$

which can be rewritten in the following scalar secular form:

$$g(\lambda) = \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^2} - 1. \qquad (27)$$

Since $\lim_{\lambda \to v_1 - c_k} g(\lambda) = +\infty$, $\lim_{\lambda \to +\infty} g(\lambda) = -1$, and

$$\frac{\partial g(\lambda)}{\partial \lambda} = -2\sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^3} < 0, \quad \lambda \in (v_1 - c_k, +\infty), \qquad (28)$$

$g(\lambda)$ is a monotonically decreasing function of $\lambda \in (v_1 - c_k, +\infty)$ and there is a unique solution $\breve{\lambda}$ to $g(\lambda) = 0$. To check the rationality of the value $\breve{\lambda}$, inserting (23) into (21) yields the related dual optimization problem [30]:

$$\begin{aligned}
\max_{\lambda} h(\lambda) &= (b_k^m)^T U (\lambda I + c_k I - V)^{-1} (c_k I - V) (\lambda I + c_k I - V)^{-1} U^T b_k^m - 2 (b_k^m)^T U (\lambda I + c_k I - V)^{-1} U^T b_k^m \\
&\quad + \lambda\Big[(b_k^m)^T U (\lambda I + c_k I - V)^{-2} U^T b_k^m - 1\Big] \\
&= \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2 (c_k - v_n)}{(\lambda + c_k - v_n)^2} - 2\sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{\lambda + c_k - v_n} + \lambda\Big[\sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^2} - 1\Big] \\
&= -\sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{\lambda + c_k - v_n} - \lambda. \qquad (29)
\end{aligned}$$

The first derivative of $h(\lambda)$ w.r.t. $\lambda$ is

$$\frac{\partial h(\lambda)}{\partial \lambda} = \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^2} - 1, \qquad (30)$$

which shows that $\partial h(\lambda)/\partial \lambda\,|_{\lambda = \breve{\lambda}} = 0$. Furthermore, the second derivative of $h(\lambda)$ w.r.t. $\lambda$ is given by

$$\frac{\partial^2 h(\lambda)}{\partial \lambda^2} = -\sum_{n=1}^{M} \frac{2\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^3}, \qquad (31)$$

which indicates that $\partial^2 h(\lambda)/\partial \lambda^2\,|_{\lambda = \breve{\lambda}} < 0$ due to $\breve{\lambda} \in (v_1 - c_k, +\infty)$. Eqs. (30) and (31) imply that $h(\breve{\lambda})$ is the maximum of $h(\lambda)$ in the interval $\lambda \in (v_1 - c_k, +\infty)$. In other words, there exists a (dual) minimum of (18) at $\breve{d}_k^m = -(\breve{\lambda} I + c_k I - A_k^m)^{-1} b_k^m$ [30], [55].

To facilitate determining the optimal Lagrange multiplier $\breve{\lambda}$, we replace $v_n$ in (27) with $v_1$ and $v_M$ to obtain the related tighter upper and lower bounds, respectively:

$$v_M - c_k + \|b_k^m\|_F \le \breve{\lambda} \le v_1 - c_k + \|b_k^m\|_F. \qquad (32)$$

Once the root (the Lagrange multiplier $\breve{\lambda}$) of (27) is determined from the interval $\big[v_M - c_k + \|b_k^m\|_F,\ v_1 - c_k + \|b_k^m\|_F\big]$, e.g., by a simple bisection method, the dictionary atom $\breve{d}_k^m$ is determined by (23).
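The following sketch of the local solver is offered under the stated assumptions (the tolerance and the safeguard against leaving the interval $(v_1 - c_k, +\infty)$ are illustrative choices, not from the paper): it forms $A_k^m$, $b_k^m$, and $c_k$, computes the EVD (25), locates the root of the secular equation (27) within the bounds (32) by bisection, and recovers the atom via (23). It can serve as the `solve_local_atom` routine assumed in the earlier ADMM sketch.

```python
import numpy as np

def solve_local_atom(A, lam_k, neighbor_atoms, rho, tol=1e-10):
    """Local atom update of Eqs. (20)-(23): min d^T(cI - A)d + 2 b^T d, ||d|| = 1."""
    c = 0.5 * rho * len(neighbor_atoms)                 # c_k = (rho/2) card(k)
    b = 0.5 * (lam_k - rho * sum(neighbor_atoms))       # b_k^m
    v, U = np.linalg.eigh(A)                            # EVD, Eq. (25)
    v, U = v[::-1], U[:, ::-1]                          # descending eigenvalues
    w = (U.T @ b) ** 2                                  # ((b^T u_n)^2)

    def g(lmbda):                                       # secular function, Eq. (27)
        return np.sum(w / (lmbda + c - v) ** 2) - 1.0

    # Bracket from the bounds in Eq. (32), then bisect g(lambda) = 0.
    lo = max(v[-1] - c + np.linalg.norm(b), v[0] - c + 1e-12)
    hi = v[0] - c + np.linalg.norm(b)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    lmbda = 0.5 * (lo + hi)
    d = -np.linalg.solve((lmbda + c) * np.eye(len(b)) - A, b)   # Eq. (23)
    return d / np.linalg.norm(d)            # guard against numerical drift
```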

[Algorithm box: Distributed Dictionary Learning Algorithm]

4) Description and Discussion of the Proposed Algorithm: In the proposed algorithm, we set the termination criterion of the $m$th atom update in terms of the consensus property, i.e., $\epsilon(t) = \frac{1}{MK(K-1)}\sum_{k=1}^{K}\sum_{l=1, l\neq k}^{K} \|d_l^m(t) - d_k^m(t)\|_1$ must not be larger than a prespecified tolerance (e.g., $10^{-5}$). The proposed distributed dictionary learning algorithm can then be summarized as in the algorithm box above.

Here we evaluate the related communication, storage, and computation costs of the proposed distributed scenario. Sensor nodes transmit their local atoms to their neighboring nodes in parallel, which amounts to $M \times NT$ values per node in total. In addition, each node obtains the common atom using the ADMM method, which requires $O\{\frac{4}{3}M^3 NP + TM^2\}$ multiplications (mainly coming from Eq. (27)). Note that the computations involved in (27) depend on the number of nodes and are independent of the sample number, which implies that the distributed algorithm has a potential advantage for "Big Data" applications. Moreover, each node stores only the local training samples $Y_k \in R^{M\times L}$ and the related sparse matrix $X_k \in R^{N\times L}$. For the centralized counterpart, however, the fusion center needs to receive the samples $Y \in R^{M\times LK}$ transmitted from all the nodes. Additionally, the fusion center must store all the training samples $Y \in R^{M\times LK}$ and the related sparse matrix $X \in R^{N\times LK}$, and for each atom update it must implement the SVD of the residual matrix $E_\Omega^m$ (of high dimension, roughly $M \times LK/N$ in an average sense) to obtain $d^m$ and $\tilde{x}$.

In some distributed computation applications (motivated by robustness or by sensitive and private data), where the volume of distributed data is low, the entire communication, storage, and computation cost summed over all the nodes in the distributed scenario may exceed that of the centralized one (if it existed). However, in this case there exist


heavier communication, storage, and computation overheads at the fusion center of the centralized scenario than at each node of the distributed one. In contrast, for "Big Data" applications (e.g., learning a dictionary from 3000 images [50]), i.e., $LK \gg NT$, it is desirable to share the storage and computation in parallel among the network nodes so as to alleviate the huge load of the centralized scenario, especially at the fusion center.

In summary, the centralized scenario is suitable for the situation where the fusion center of a robust network can gather a low volume of sharable data distributed across the network nodes. In contrast, the distributed scenario is suitable for the following cases: i) the network nodes are discouraged from sharing the data because they collect sensitive or private information (e.g., the applications in this paper); ii) the centralized scenario is fragile (it collapses easily); and iii) the training set is extremely large in volume, i.e., the "Big Data" application.

IV. SIMULATION AND EXPERIMENTAL RESULTS

In this section, synthetic signals and real-world images are processed to assess the performance of the proposed algorithm.2

A. Experiment 1: Dictionary Recovery From Synthetic Data

In this subsection, we explore the ability of the proposed algorithm to recover the original dictionary from synthetic signals distributed across the nodes of a sensor network. Similar to the synthetic experiment in [17], we randomly generate a dictionary $D$ consisting of 50 atoms of dimension 20 (i.e., $M = 20$ and $N = 50$), each column of which is normalized to unit Frobenius norm. In addition, a randomly generated network with $K = 30$ nodes, as shown in Fig. 1, is assumed for these synthetic signals. Furthermore, 30000 training examples of dimension 20 are created as random combinations of 3 arbitrary dictionary atoms, where the sparse coefficients are uniformly distributed and their locations are random and independent. These training examples are evenly partitioned across the 30 nodes, so that each node of the network holds 1000 training examples locally.

To recover the initial dictionary $D$ from the distributed signals, the proposed algorithm uses the DCT dictionary as the initialization and OMP (with the constraint $T_0 = 3$, i.e., 3 coefficients) as the sparse-coding method. In addition, $\rho = 1$, $P = 100$, and $T = 1000$ are set for the related iterations. For comparison, the centralized K-SVD and MOD dictionary learning algorithms are run on the collected synthetic data. Similar to [17], the recovered dictionaries of these algorithms are compared against the initial dictionary $D$ by sweeping through the columns of $D$, finding the closest column (in Frobenius-norm distance) in the recovered dictionary, and measuring the distance via $1 - |d_i^T \tilde{d}_i|$, where $d_i$ and $\tilde{d}_i$ are the atoms of the initial dictionary and the recovered one, respectively. A distance less than 0.01 is considered a success.

2 Here it must be pointed out that, apart from the proposed distributed algorithm, the other (centralized) methods are implemented for comparison under the assumption that the distributed data could be collected together.
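For concreteness, the matching criterion just described can be evaluated as in the following sketch (an illustration under the assumption that both dictionaries have unit-norm columns):

```python
import numpy as np

def recovery_rate(D_true, D_hat, thresh=0.01):
    """Fraction of true atoms matched by some recovered atom within the
    distance 1 - |d_i^T d~_i| < thresh (columns assumed unit norm)."""
    hits = 0
    for i in range(D_true.shape[1]):
        dist = 1.0 - np.abs(D_hat.T @ D_true[:, i])   # distances to all recovered atoms
        if dist.min() < thresh:
            hits += 1
    return hits / D_true.shape[1]
```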

Fig. 2. Recovery rate versus P.

The performance of these algorithms is measured by the recovery rate (i.e., the average number of successes) over 50 Monte Carlo runs, and the recovery rates of the proposed distributed algorithm, the centralized K-SVD algorithm, and the MOD method are plotted in Fig. 2. From this figure we can see that the recovery rates of all three algorithms improve as the iteration number increases. Similar to the synthetic experiment of [17], the centralized K-SVD method has a higher recovery rate than MOD. Since the proposed method, within a proper number of iterations (to attain consensus with the centralized one), approximates or nearly equals the centralized one, it is also more accurate than MOD in this synthetic experiment. A point worth mentioning here is that the proposed algorithm recovers the dictionary in a completely distributed manner. Since the dictionaries obtained by the proposed distributed algorithm are almost exactly the same at all nodes, only one of them is shown in Fig. 3(a). Besides, the initial dictionary $D$ and those from the other algorithms in one run are shown in Fig. 3(b)–(d) for reference.

Since the proposed algorithm is a distributed and iterative one, the consistency and convergence performance among the nodes need to be verified. Here we apply the following measures to quantitatively evaluate the performance of the proposed algorithm:

1) Average coefficient difference errors (ACDE) among the nodes (see the residual convergence of [7] for details):

$$\mathrm{ACDE}(k, T) = \frac{1}{MN(K-1)} \sum_{l=1, l\neq k}^{K} \|D_l - D_k\|_1, \qquad (33)$$

where $D_k$ represents the dictionary obtained by the $k$th node at the $T$th iteration.

2) Coefficient difference errors (CDE) between the true value and the estimates of all $K$ nodes at the $p$th iteration:

$$\mathrm{CDE}(p) = \frac{1}{MNK} \sum_{k=1}^{K} \|D_k(p) - D\|_1, \qquad (34)$$

where $D_k(p)$ represents the dictionary obtained by the $k$th node at the $p$th iteration.

Fig. 3. Dictionaries in Experiment 1 (first row: Proposed; second row: Initial; third row: K-SVD; last row: MOD); unrecovered atoms (i.e., those whose distances are larger than 0.01) are marked in red boxes.

Fig. 4. ACDEs versus T.

Fig. 5. ACDEs versus P.

Fig. 6. CDEs versus P.

3) Difference errors (DE) of the $k$th node between the estimates at the $p$th and $(p+1)$th iterations:

$$\mathrm{DE}(k, p) = \frac{1}{MN} \|D_k(p+1) - D_k(p)\|_1. \qquad (35)$$

Fig. 7. DEs versus P for all nodes.
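Assuming the per-node dictionaries are stored as $M \times N$ arrays and reading $\|\cdot\|_1$ as the entrywise $\ell_1$ norm, the three measures above can be computed directly from their definitions, as in the following sketch:

```python
import numpy as np

def acde(D_list, k):
    """Eq. (33): average l1 difference between node k and the other nodes."""
    K = len(D_list)
    M, N = D_list[k].shape
    return sum(np.abs(D_list[l] - D_list[k]).sum()
               for l in range(K) if l != k) / (M * N * (K - 1))

def cde(D_list, D_true):
    """Eq. (34): l1 error between the true dictionary and the nodes' estimates."""
    K = len(D_list)
    M, N = D_true.shape
    return sum(np.abs(Dk - D_true).sum() for Dk in D_list) / (M * N * K)

def de(D_prev, D_curr):
    """Eq. (35): l1 change of one node's dictionary between two iterations."""
    M, N = D_curr.shape
    return np.abs(D_curr - D_prev).sum() / (M * N)
```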

When T varies from 100 to 1000 (for p = 1) , the related ACDEs of these K nodes are computed and plotted in Fig. 4, from which it can be seen that the related ACDEs obtained from the proposed algorithm decrease with the increase of the iteration number T . Especially, when T = 1000 the related ACDE of the proposed algorithm is not larger than 0.000073. Besides, when p varies from 1 to 100 while keeping T = 1000, the related ACDEs are plotted in Fig. 5. From Figs. 4 and 5, we can see that the proposed algorithm is of efficient consistency performance and thus can obtain the consistent dictionaries with negligible difference for all nodes

within a proper number of iterations ($T$). Then, we compute the CDEs of the proposed algorithm and show them in Fig. 6. For comparison, we also display those of the MOD and K-SVD methods. It is clear that all three algorithms have essentially converged after 40 iterations and the obtained dictionaries closely approach the true values. Fig. 7 displays the DEs of the proposed algorithm, which further shows that the proposed algorithm has good convergence performance. Next, the effect of the number of training examples per node on the performance of the proposed algorithm is investigated. We use the same parameters but with a randomly generated network of $K = 40$ nodes, where 750 training examples are assigned to each node. Due to space limitations, and because the consensus property among these nodes is nearly the same as in the aforementioned simulation, we only plot the averaged recovery rates and CDEs versus the iteration number P in Figs. 2 and 6, respectively. From Fig. 2, we can see that when the number


Fig. 8. Distributed training images in Experiment 2.

of training examples per node decreases from 1000 to 750, the related recovery rate of the proposed algorithm drops by about 1%, whereas Fig. 6 shows that the related CDEs increase by about 0.003 but still outperform MOD.

Finally, the influence of multiple distributions on the performance of the proposed algorithm is explored. We use the same parameters (still 30 nodes, each with 1000 training examples) but different distributions of the training data across the nodes, in which the training examples of the 1st–10th, 11th–20th, and 21st–30th nodes are generated as random combinations of 3 arbitrary atoms drawn from the 1st–40th, 6th–45th, and 11th–50th dictionary atoms, respectively. Similarly, we also plot the averaged recovery rates and CDEs versus the iteration number P in Figs. 2 and 6, respectively. From them, it can be seen that when the training examples vary from a single distribution to multiple distributions, the recovery rate of the proposed algorithm drops by nearly 2%, whereas the related CDEs increase by 0.006.

B. Experiment 2: Filling Missing Data of Ancient Frescoes

The protection of historical relics is one of the most important tasks of museums [27], [31], [50]. Generally, in order to recover the missing data of ancient frescoes as accurately as possible, works of the same dynasty and style should be provided. However, they are often held by different museums. For example, the Dunhuang frescoes are held by museums in London, Paris, Washington, Lanzhou, etc. In this subsection we apply the distributed dictionary learning algorithm to fill in the missing data of ancient frescoes while keeping the reference works private [27], [31], [50]. A similar missing data recovery problem can be found in [17].

Fig. 9. Recovery results in Experiment 2 (first row: image to recover; second row: Reference, K-SVD (all); last row: Proposed, MOD (all)).

In this experiment, we use the six fresco images shown in Fig. 8 [52] to train a common dictionary and then apply a method similar to that in [17] to fill in the missing pixels of a fresco (70% of the pixels are missing), as shown at the top of Fig. 9. Without loss of generality, we assume that the six fresco images are distributed across six different sites (certainly, the more references are available at each site, the better the results will be). These sites form an annular topology network, in which each node only communicates with its two nearest neighbors and can implement some local computations. The proposed algorithm divides the 6 images into patches of size 8 × 8 × 3 using a step size of 2. Besides, $\rho = 1$, $T = 200$, and $P = 40$ are set for the proposed algorithm.3 The aforementioned centralized K-SVD and MOD methods are also implemented for comparison.

3 When the number of nodes is relatively small (here K = 6), a smaller iteration number (e.g., T = 200) suffices.

TABLE I: PSNRs OF IMAGE RECOVERY RESULTS IN EXPERIMENT 2

The quality of image restoration is measured in terms of the peak signal-to-noise ratio (PSNR), and the related PSNR values of these methods are listed in Table I (including the results obtained with a local dictionary learned from a single image; see lines 2 and 3 of Table I for details). Besides, Fig. 9 displays some recovery results. Furthermore, the dictionaries obtained by both the centralized K-SVD and the proposed approaches are given in Fig. 10. From the PSNRs in Table I, the recovery results in Fig. 9, and the obtained dictionaries in Fig. 10, it can be seen that the proposed method obtains, in a completely distributed manner, recovery results of approximately the same high quality as the centralized methods. In addition, it can be found that the dictionary learned from all the distributed images achieves better recovery results in terms of PSNR than those learned from a single local image.

Fig. 10. Dictionaries in Experiment 2 (first row: Proposed; second row: K-SVD); the atoms from the proposed algorithm that differ from those of K-SVD (i.e., their distances are larger than 0.01) are marked in red boxes.
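A rough sketch of the patch-wise filling step follows, based on the masked sparse-coding idea of [17]: each (vectorized) patch is coded on its known pixels only, and the learned dictionary predicts the missing ones. The greedy `omp` helper and the column rescaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy orthogonal matching pursuit: at most T0 nonzero coefficients."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(T0):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x

def fill_patch(D, patch, known_mask, T0=10):
    """Code a patch on its known pixels only, then predict the missing ones."""
    Dk = D[known_mask, :]
    # Rescale columns so the masked atoms are (roughly) unit norm.
    norms = np.linalg.norm(Dk, axis=0) + 1e-12
    x = omp(Dk / norms, patch[known_mask], T0) / norms
    estimate = D @ x
    filled = patch.copy()
    filled[~known_mask] = estimate[~known_mask]
    return filled
```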

C. Experiment 3: Distributed Image Fusion

[Algorithm box: Distributed Image Fusion Algorithm]

Fig. 11. Images to fuse in Experiment 3.

Fig. 12. Dictionaries in Experiment 3 (first row: Proposed; last row: Centralized K-SVD); the atoms from the proposed algorithm that differ from those of K-SVD (i.e., their distances are larger than 0.01) are marked in red boxes.

Fig. 13. Fused results in Experiment 3 (first row (left): MOD; first row (right): K-SVD; second row (left): Proposed; second row (right): Wavelet; last row (left): HOSVD; last row (right): LAP).

Due to perception or observation constraints, imaging instruments are not capable of providing all spatial or spectral information in a single image, which hinders human and machine perception as well as further image processing and analysis. One possible solution to this problem is image fusion [12], [33]–[39], the process of combining two or more source images of the same scene, provided by various sensors, into a single highly informative image. In recent years, image fusion techniques have been widely applied in computer vision, medical applications, and remote sensing. Without loss of generality, we assume that all the smart imaging sensors can perform some local computations and communicate with their neighbors. In this paper, it is assumed that image registration has already been performed, and we only pay attention to the fusion step. Motivated by the salient ability of dictionary learning to extract image features [15]–[18], we develop a distributed image fusion algorithm (summarized in the algorithm box above) using the proposed distributed dictionary learning method.

In this experiment, we consider the multiple source images shown in Fig. 11, which are obtained by multiple devices with different bands and locations [41]. Each device represents a sensor node at its spatial location. In addition, we assume that the six nodes form an annular topology network, in which

each node only communicates with its two neighbors. The same parameters as in Experiment 2 are used to run the proposed distributed dictionary learning algorithm, i.e., patch size 8 × 8, step size 2, $T_0 = 10$, $\rho = 1$, $T = 200$, and $P = 40$. For comparison, the dictionary obtained by the centralized K-SVD approach is also displayed in Fig. 12. To assess the proposed image fusion algorithm, the related evaluation metrics, such as the QABF metric [33], ENtropy (EN) [34], and Overall Cross Entropy (OCE) [35] (generally, it is believed that the more information the fused result carries, the larger its QABF and EN and the smaller its OCE; see [33]–[35] and [38] for details), are computed, and the performance of some classical image fusion algorithms [17], [18], [36]–[39], including Wavelet, HOSVD, and LAP, is evaluated as well. The fused results of all the algorithms are shown in Fig. 13, and the related performance measurements under the different criteria are listed in Table II (only the best results are highlighted). They indicate that the dictionary learning-based methods preserve the edge and landform detail information efficiently and thus exhibit better visual quality and quantitative metrics (large QABF and EN, and small OCE), which confirms that the representation and evaluation of the competing image patches can be performed efficiently with an adaptively learned common dictionary (i.e., under the same standard). In addition, the fused result of the proposed algorithm approaches that of K-SVD and shows satisfactory QABF, OCE, and EN performance compared with the other algorithms.
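As one example of the quantitative criteria, the ENtropy (EN) of a fused image can be computed from its gray-level histogram; the following is a minimal sketch of the standard Shannon-entropy form (the exact variant used in the cited evaluation may differ):

```python
import numpy as np

def image_entropy(img, levels=256):
    """Shannon entropy (in bits) of an 8-bit grayscale image's histogram."""
    hist, _ = np.histogram(img.ravel(), bins=levels, range=(0, levels))
    p = hist.astype(float) / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```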


TABLE II: METRICS OF IMAGE FUSION RESULTS IN EXPERIMENT 3

V. CONCLUSION

Unlike the centralized K-SVD method, which needs to gather the distributed data together, this paper focuses on adaptively learning a common dictionary from the data distributed across the nodes of sensor networks and sparsely representing these data. This is achieved by decoupling the combined dictionary atom update and nonzero coefficient revision procedure, simplifying the Lagrange multiplier revision, and solving for the local atoms. Simulation and experimental results are presented to illustrate the effectiveness of the proposed method. Big Data applications of the distributed dictionary learning method will be investigated in the future.

APPENDIX

In this appendix, we prove that the optimization problem shown in (18) (or (19)) is a global and convex one in a hidden form [30], [43], [55] and satisfies [44, Lemma 4.1, p. 257], which ensures that the atoms obtained by the distributed method converge to those of the centralized one.

Lemma 1: The optimization problem in (18) is equivalent to the one with the following inequality constraint:

$$\min_{d_k^m} (d_k^m)^T \big(c_k I - A_k^m\big) d_k^m + 2 (b_k^m)^T d_k^m \quad \text{subject to } (d_k^m)^T d_k^m \le 1.$$

Proof: For the problem $\min_{d_k^m} (d_k^m)^T (c_k I - A_k^m) d_k^m + 2 (b_k^m)^T d_k^m$ under the inequality constraint $(d_k^m)^T d_k^m \le 1$, we assume that the minimal value of $(d_k^m)^T (c_k I - A_k^m) d_k^m + 2 (b_k^m)^T d_k^m$ with respect to $d_k^m$ is attained at $(d_k^m)^T d_k^m = \xi$, where $0 < \xi \le 1$. Similar to Part 3 of Subsection B in Section III, we define the following Lagrangian $\tilde{f}: R^{M\times 1} \times R \rightarrow R$ [30] associated with $\min_{d_k^m} (d_k^m)^T (c_k I - A_k^m) d_k^m + 2 (b_k^m)^T d_k^m$ under the equality constraint $(d_k^m)^T d_k^m = \xi$:

$$\tilde{f}(d_k^m, \lambda) = (d_k^m)^T \big(c_k I - A_k^m\big) d_k^m + 2 (b_k^m)^T d_k^m + \lambda\big((d_k^m)^T d_k^m - \xi\big), \qquad (36)$$

from which we define the related dual function as

$$\tilde{g}(\lambda) = \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda + c_k - v_n)^2} - \xi. \qquad (37)$$

$\tilde{g}(\lambda)$ is a monotonically decreasing function of $\lambda \in (v_1 - c_k, +\infty)$; in particular, $\lim_{\lambda \to v_1 - c_k} \tilde{g}(\lambda) = +\infty$ and $\lim_{\lambda \to +\infty} \tilde{g}(\lambda) = -\xi$. Therefore, if $\xi_1 > \xi_2$ and $\xi_1, \xi_2 \in (0, 1]$, the root of $\sum_{n=1}^{M} \frac{((b_k^m)^T u_n)^2}{(\lambda + c_k - v_n)^2} - \xi_1 = 0$ is certainly not larger than that of $\sum_{n=1}^{M} \frac{((b_k^m)^T u_n)^2}{(\lambda + c_k - v_n)^2} - \xi_2 = 0$. In other words, the larger the scalar $\xi \in (0, 1]$ is, the smaller the root $\tilde{\lambda}$ of $\tilde{g}(\lambda) = 0$ is.

Inserting $\tilde{\lambda}$ into the primal objective function $(d_k^m)^T (c_k I - A_k^m) d_k^m + 2 (b_k^m)^T d_k^m$ yields

$$c_k\,\xi - \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2 v_n}{(\tilde{\lambda} + c_k - v_n)^2} - 2\sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{\tilde{\lambda} + c_k - v_n},$$

which implies that the smaller the $\tilde{\lambda} \in (v_1 - c_k, +\infty)$ is, the smaller the primal objective function. It is obvious that the larger the scalar $\xi \in (0, 1]$ is, the smaller the primal objective function is. Note that 1 is the largest value available for $\xi$. Therefore, the objective function $\min_{d_k^m} (d_k^m)^T (c_k I - A_k^m) d_k^m + 2 (b_k^m)^T d_k^m$ with the inequality constraint $(d_k^m)^T d_k^m \le 1$ achieves its minimum value at the boundary $(d_k^m)^T d_k^m = 1$.

Lemma 2: According to Lemma 1,

$$\min_{d_k^m} (d_k^m)^T \big(c_k I - A_k^m\big) d_k^m + 2 (b_k^m)^T d_k^m$$

is equivalent to

$$\min_{d_k^m} -(d_k^m)^T A_k^m d_k^m + 2 (b_k^m)^T d_k^m$$

due to the unit norm constraint $(d_k^m)^T d_k^m = 1$. Furthermore, it is equivalent to

$$\max_{d_k^m} (d_k^m)^T A_k^m d_k^m + \big(-2(b_k^m)^T\big) d_k^m \quad \text{subject to } (d_k^m)^T d_k^m \le 1, \qquad (38)$$

which has the same form as (1) in [43]. In terms of the $\epsilon$-subdifferentials of convex functions and $\epsilon$-normal directions defined in [43], (38) has a global maximizer and is convex in a hidden form. In other words, the optimization problems shown in (18) and (19) are actually both global and convex ones in a hidden form.

Proof: Note that the Euclidean ball $(d_k^m)^T d_k^m \le 1$ actually defines a convex set for $d_k^m$, and both $(d_k^m)^T A_k^m d_k^m - 2(b_k^m)^T d_k^m$


and $(d_k^m)^T d_k^m$ are also convex functions. Since these representations are equivalent, here we only discuss the form in (38). For convenience of representation, we define $\bar{\lambda} = \tilde{\lambda} + c_k$. Obviously, $\bar{\lambda} \in (v_1, +\infty)$ and $\bar{d}_k^m = -(\bar{\lambda} I - A_k^m)^{-1} b_k^m$, which shows that

$$A_k^m \bar{d}_k^m + (-b_k^m) = \bar{\lambda}\, \bar{d}_k^m, \quad \text{and} \quad -A_k^m + \bar{\lambda} I \ \text{is positive semidefinite}, \qquad (39)$$

which implies that, according to Theorem III as well as the $\epsilon$-subdifferentials of convex functions and $\epsilon$-normal directions in [43], the sufficient and necessary conditions in (39) ensure that (38) has a global maximizer and is convex in a hidden form. In other words, the optimization problem shown in (18) (or (19)) is a global and convex one in a hidden form (see Line 3 of Paragraph 2 in the introduction part, Line 4 of Page 446, and Equations (11)–(14) of Page 449 in [43] for details).

Furthermore, we prove that a similar lemma (as [44, Lemma 4.1, p. 257]) holds for the hidden convex functions above, which ensures that the atoms obtained by the distributed method converge to those of the centralized one. Similar to the part between Eqs. (4.86) and (4.87) of [44, Lemma 4.1], we define

$$d^* = \arg\min_{d \in S} \{J_1(d) + J_2(d)\} = \arg\min_{d \in S} -d^T A_k^m d + 2 (b_k^m)^T d + c_k d^T d$$

for the optimization problem of this paper, where the hidden convex functions $J_1(d) = -d^T A_k^m d$ and $J_2(d) = 2 (b_k^m)^T d + c_k d^T d$ are continuously differentiable, and $S$ is an ellipse set of $R^M$, i.e., $\{d \in S \mid d^T d = 1\}$; then

$$d^* = \arg\min_{d} \{J_1(d) + [\nabla J_2(d^*)]^T d\}, \qquad (40)$$

where

$$d^* = \arg\min_{d} d^T \big(c_k I - A_k^m\big) d + 2 (b_k^m)^T d \quad \text{subject to } d^T d = 1. \qquad (41)$$

According to the Lagrange multiplier method (or see Part 3 of Subsection B in Section III for details), the solution to Eq. (41) is given by

$$d^* = -\big(\lambda^* I + c_k I - A_k^m\big)^{-1} b_k^m, \qquad (42)$$

where $\lambda^*$ satisfies

$$(b_k^m)^T \big(\lambda^* I + c_k I - A_k^m\big)^{-2} b_k^m = \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2}{(\lambda^* + c_k - v_n)^2} = 1. \qquad (43)$$

According to [44, eq. (4.85)],

$$d^\dagger = \arg\min_{d} J_1(d) + [\nabla J_2(d^*)]^T d = \arg\min_{d} -d^T A_k^m d + \big(2(b_k^m)^T + 2 c_k (d^*)^T\big) d, \qquad (44)$$

and

$$d^\ddagger = \arg\min_{d} J_2(d) - [\nabla J_2(d^*)]^T d = \arg\min_{d} 2 (b_k^m)^T d + c_k d^T d - \big(2(b_k^m)^T + 2 c_k (d^*)^T\big) d = \arg\min_{d} c_k d^T d - 2 c_k (d^*)^T d. \qquad (45)$$

Obviously, setting the derivative of (45) w.r.t. $d$ to zero yields $d^\ddagger = d^*$. According to Part 3 of Subsection B in Section III as well as the Lagrangian function, the solution to (44) is given by

$$d^\dagger = -\big(\lambda^\dagger I - A_k^m\big)^{-1} \big(b_k^m + c_k d^*\big), \qquad (46)$$

where $\lambda^\dagger$ satisfies $(b_k^m + c_k d^*)^T (\lambda^\dagger I - A_k^m)^{-2} (b_k^m + c_k d^*) = 1$. Inserting $d^* = -(\lambda^* I + c_k I - A_k^m)^{-1} b_k^m$ of (42) into this condition yields

$$\begin{aligned}
(b_k^m + c_k d^*)^T \big(\lambda^\dagger I - A_k^m\big)^{-2} (b_k^m + c_k d^*)
&= \big(b_k^m - c_k(\lambda^* I + c_k I - A_k^m)^{-1} b_k^m\big)^T \big(\lambda^\dagger I - A_k^m\big)^{-2} \big(b_k^m - c_k(\lambda^* I + c_k I - A_k^m)^{-1} b_k^m\big) \\
&= \sum_{n=1}^{M} \frac{\big((b_k^m)^T u_n\big)^2 (\lambda^* - v_n)^2}{(\lambda^* + c_k - v_n)^2 (\lambda^\dagger - v_n)^2} = 1. \qquad (47)
\end{aligned}$$

By comparing (43) with (47), it is easily found that $\lambda^\dagger = \lambda^*$. In addition, inserting $\lambda^* = \lambda^\dagger$ into (42) yields

$$\big(\lambda^\dagger I - A_k^m\big) d^* = -\big(b_k^m + c_k d^*\big) \ \Rightarrow\ d^* = -\big(\lambda^\dagger I - A_k^m\big)^{-1} \big(b_k^m + c_k d^*\big). \qquad (48)$$

Furthermore, by comparing (46) with (48), it is also easily found that $d^\dagger = d^*$. Obviously, $d^\dagger = d^\ddagger = d^*$, which shows that [44, Lemma 4.1, p. 257] holds for (18) for $k = 1, \ldots, K$. Therefore, it also holds for the hidden convex objective functions shown in (15) due to the separable property of both (15) and its optimization variables $a = [(d_1^m)^T\ (d_2^m)^T \cdots (d_K^m)^T]^T$. Based on the derivation above, we let $G_1(a)$ denote the hidden convex optimization problem shown in (15) and $G_2(a) = 0$, as in [44]. Due to the unit norm constraints on $\{d_k^m\}_{k=1}^K$, $\{d_k^m\}_{k=1}^K$ is always bounded and every limit point of $d_k^m(t)$ is an optimal solution to the original problem (18). Since $G_1(a)$ in Eq. (15) satisfies [44, Lemma 4.1], and by virtue of [44, Proposition 4.2], we follow the derivations shown in [44, eqs. (4.87)–(4.99)], and thus the atoms $\{d_k^m(t)\}_{k=1}^K$ obtained by the proposed iterative method converge to the optimal $d^m$ shown in (7).

REFERENCES

[1] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "Wireless sensor networks: A survey," Comput. Netw., vol. 38, no. 4, pp. 393–422, 2002. [2] S. Soro and W. Heinzelman, "A survey of visual sensor networks," Adv. Multimedia, vol. 2009, article ID 640386, 2009. [3] B. Rinner and W. Wolf, "An introduction to distributed smart cameras," Proc. IEEE, vol. 96, no. 10, pp. 1565–1575, Oct. 2008. [4] W. Dargie and C. Poellabauer, Fundamentals of Wireless Sensor Networks: Theory and Practice. New York, NY, USA: Wiley, 2010. [5] B. Bhanu, C. V. Ravishankar, A. K. Roy-Chowdhury, A. K. Aghajan, and H. D. Terzopoulos, Distributed Video Sensor Networks. New York, NY, USA: Springer-Verlag, 2011. [6] A. Dimarkis, S. Kar, J. Moura, M. Rabbat, and A. Scaglione, "Gossip algorithms for distributed signal processing," Proc. IEEE, vol. 98, no. 11, pp. 1847–1864, Nov. 2010. [7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011. [8] F. Zhao and L. Guibas, Wireless Sensor Networks: An Information Processing Approach. San Mateo, CA, USA: Morgan Kaufmann, 2004.


[9] S. Chouvardas, K. Slavakis, and S. Theodoridis, “Adaptive robust distributed learning in diffusion sensor networks,” IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4692–4707, Oct. 2011. [10] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Augiar, and M. Püschel, “Distributed basis pursuit,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 1942–1956, Mar. 2012. [11] A. Y. Yang, M. Gastpar, R. Bajcsy, and S. S. Sastry, “Distributed sensor perception via sparse representation,” Proc. IEEE, vol. 98, no. 6, pp. 1077–1088, Jun. 2010. [12] M. Cetin et al., “Distributed fusion in sensor networks,” IEEE Signal Process. Mag., vol. 23, no. 4, pp. 42–45, Jul. 2006. [13] Z. Xiong, A. D. Liveris, and S. Cheng, “Distributed source coding for sensor networks,” IEEE Signal Process. Mag., vol. 21, no. 5, pp. 80–94, Sep. 2004. [14] R. Olfati-Saber, J. A. Fax, and R. M. Murray, “Consensus and cooperation in networked multi-agent systems,” Proc. IEEE, vol. 95, no. 1, pp. 215–233, Jan. 2007. [15] I. Tosic and P. Frossard, “Dictionary learning,” IEEE Signal Process. Mag., vol. 28, no. 2, pp. 27–38, Mar. 2011. [16] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proc. IEEE, vol. 98, no. 6, pp. 1045–1057, Jun. 2010. [17] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Jul. 2006. [18] K. Engan, S. O. Asae, and J. H. Husoy, “Multi-frame compression: Theory and design,” Signal Process., vol. 80, no. 10, pp. 2121–2140, 2000. [19] S. Mallat and Z. Zhang, “Matching pursuits with time-frquency dictionaries,” IEEE Trans. Signal Process., vol. 41, no. 2, pp. 3397–3415, Dec. 1993. [20] S. Chen, S. A. Billings, and W. Luo, “Orghogonal least squares methods and their application to non-linear system identifacation,” Int. J. Control, vol. 50, no. 5, pp. 1873–1896, 1989. [21] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001. [22] I. F. Gorodnisky and B. D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997. [23] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996. [24] H. W. Chen, L. W. Kang, and C. S. Lu, “Dictionary learning-based distributed compressive video sensing,” in Proc. PCS, Dec. 2010, pp. 210–213. [25] T. T. Do, Y. Chen, D. T. Nguyen, N. Nguyen, L. Gan, and T. D. Tran, “Distributed compressed video sensing,” in Proc. IEEE ICIP, Nov. 2009, pp. 1393–1396. [26] J. J. Vayrynen, L. L. Qvist, and T. Honkela, “Sparse distributed representations for words with thresholded independent component analysis,” in Proc. IJCNN, Aug. 2007, pp. 1031–1036. [27] S. Lyu, D. N. Rockmore, and H. Faid, “A digital technique for art authentication,” Proc. Nat. Acad. Sci. USA, vol. 101, no. 49, pp. 17006–17010, 2004. [28] C. Guillemot, F. Pereira, L. Torres, T. Ebrahimi, R. Leonardi, and J. Ostermann, “Distributed monoview and multiview video coding,” IEEE Signal Process. Mag., vol. 24, no. 5, pp. 67–76, Sep. 2007. [29] I. Tosic and P. Frossard, “Dictionary learning for stereo image representation,” IEEE Trans. Image Process., vol. 20, no. 4, pp. 921–934, Apr. 2011. [30] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004. [31] C. R. 
Johnson et al., “Image processing for artist identification,” IEEE Signal Process. Mag., vol. 25, no. 4, pp. 37–48, Jul. 2008. [32] J. Bondy and U. Murty, Graph Theory. New York, NY, USA: Springer-Verlag, 2008. [33] C. S. Xydeas and V. Petrovic, “Objective image fusion performance measure,” Electron. Lett., vol. 36, no. 4, pp. 308–309, Feb. 2000. [34] Z. Wang and A. Bovik, “A universal image quality index,” IEEE Signal Process. Lett., vol. 9, no. 3, pp. 81–84, Mar. 2002. [35] Matlab Library [Online]. Available: http://www.cs.rug.nl/~rudy [36] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. Qgden, “Pyramid methods in signal processing,” RCA Eng., vol. 29, no. 6, pp. 33–41, 1984. [37] G. Pajares and J. M. de la Cruz, “A wavelet-based image fusion tutorial,” Pattern Recognit., vol. 37, no. 9, pp. 1855–1872, 2004.

2541

[38] J. Liang, Y. He, D. Liu, and X. Zeng, “Image fusion using higher order singular value decomposition,” IEEE Trans. Image Process., vol. 21, no. 5, pp. 2898–2909, May 2012. [39] Imagefusion Toolbox [Online]. Available: http://www.imagefusion.org/ [40] J. H. Hugues, D. J. Graham, and D. N. Rockmore, “Quantification of artistic style through sparse coding analysis in the drawings of pieter Bruegel the elder,” Proc. Nat. Acad. Sci. USA, vol. 107, no. 4, pp. 1279–1283, 2009. [41] Image Registration and Fusion Systems [Online]. Available: http://www.imgfsr.com/ifsr_if.html [42] M. Elad and M. Aharon, “Image denosing via sparse and redundant representations over learned dictionary,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006. [43] J. Hiriart-Urruty, “Global optimality conditions in maximizing a convex quadratic function under convex quadratic constraints,” J. Global Optim., vol. 21, no. 4, pp. 445–455, 2001. [44] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Belmont, MA, USA: Athena Scientific, 1999. [45] L. F. Gray, F. J. Flanigan, J. L. Kazdan, D. L. Frank, and B. Fristedt, Calculus Two: Linear and Nonlinear Functions. Berlin, Germany: Springer-Verlag, 1990. [46] J. Liang et al., “Robust ellipse fitting based on sparse combination of data points,” IEEE Trans. Image Process., vol. 22, no. 6, pp. 2207–2218, Jun. 2013. [47] S. Yang, M. Wang, Y. Chen, and Y. Sun, “Single-image super-resolution reconstruction via learned geometric dictionaries and clustered sparse coding,” IEEE Trans. Image Process., vol. 21, no. 9, pp. 4016–4028, Sep. 2012. [48] S. Gao, I. W. Tsang, and L. Chia, “Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications,” IEEE Trans. Pattern Anal. Mach. Learn., vol. 35, no. 1, pp. 92–104, Jan. 2013. [49] M. Zheng et al., “Graph regularized sparse coding for image representation,” IEEE Trans. Image Process., vol. 20, no. 5, pp. 1327–1336, May 2011. [50] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012. [51] E. Alpayd, Introduction to Machine Learning (Adaptive Computation and Machine Learning). Cambridge, MA, USA: MIT Press, 2004. [52] Dunhuang Frescoes [Online]. Available: http://www.mafengwo.cn/i/ 1094676.html [53] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,” IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, Oct. 2010. [54] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, Aug. 2012. [55] R. G. Lorenz and S. P. Boyd, “Robust minimum variance beamforming,” IEEE Trans. Signal Process., vol. 53, no. 5, pp. 1684–1696, May 2005.

Junli Liang was born in China. He received the Ph.D. degree in signal and information processing from the Institute of Acoustics, Chinese Academy of Sciences, in 2007. His current research interests include statistical, distributed, and high-dimensional signal processing, image processing, and pattern recognition.

Miaohua Zhang was born in China. Her current research interests include image processing and pattern recognition.

Xianyu Zeng was born in China. Her current research interests include optimization.

Guoyang Yu was born in China. His current research interests include image processing and pattern recognition.
