IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015


Deep and Shallow Architecture of Multilayer Neural Networks

Chih-Hung Chang

Abstract— This paper focuses on the deep and shallow architecture of multilayer neural networks (MNNs). Determining whether an MNN can be replaced by another MNN with fewer layers is equivalent to studying the topological conjugacy of its hidden layers. This paper provides a systematic methodology for deciding when two hidden spaces are topologically conjugate. Furthermore, criteria are presented for some specific cases.

Index Terms— Deep architecture, factor-like matrix, multilayer neural networks (MNNs), sofic shift, topological entropy.

I. INTRODUCTION

IN THE past few decades, multilayer neural networks (MNNs) have received considerable attention and have been successfully applied to many areas, such as combinatorial optimization, signal processing, pattern recognition, and artificial intelligence (AI) [1]–[10]. An MNN is a coupled system of neural network (NN) equations. One important reason for coupling NNs is the simulation of the visual systems of mammals [11], [12] (each layer symbolizes a single cortex in the visual system), and it has been shown that the mammalian brain is organized in a deep architecture1 [13]; namely, the number of layers in the MNNs of a mammalian brain is large. Because of the architectural depth of the mammalian brain, scientists have been interested in learning and training deep architectures [14], [15] since 2002. Hinton et al. [16] achieved great success on deep architectures using deep belief networks (DBNs) and restricted Boltzmann machine [17] methods, which have since been applied to many fields, for example, classification tasks, regression, dimensionality reduction, modeling textures, modeling motion, natural language processing, object segmentation, information retrieval, robotics, and collaborative filtering. The best general reference here is [2], where the reader can find a complete bibliography. A related topic is how to distinguish different hidden layers and how two hidden layers make a difference [18]. MNNs have been widely applied in multidisciplinary fields, such as signal propagation between neurons, spatially and temporally periodic patterns, mechanisms for short-term memory, motion perception, image processing, pattern

Manuscript received March 26, 2014; revised December 28, 2014; accepted December 28, 2014. Date of publication January 15, 2015; date of current version September 16, 2015. This work was supported by the Ministry of Science and Technology under Contract MOST 103-2115-M-390-004.
The author is with the Department of Applied Mathematics, National University of Kaohsiung, Kaohsiung 81148, Taiwan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNNLS.2014.2387439
1 Aside from the simulation of the visual systems of mammals, deep architectures are frequently used to learn complicated functions expressing high-level abstractions, such as language and AI-level tasks.

recognition, and drug-induced visual hallucinations [19]–[29]. Arena et al. [30] applied MNNs to deal with the main issues in automatic classification and spot validation that arise during an automatic DNA microarray analysis procedure. It is known that the mammalian retina represents the visual world in a set of about a dozen different feature-detecting representations. MNNs are capable of qualitatively reproducing the same full set of space–time patterns as the living retina; this suggests that MNNs can then be used to predict the set of retinal responses to more complex patterns. Readers are referred to [31]–[33] and the references therein for more details.

DBNs are known as the composition of multiple layers whose top two layers have undirected symmetric connections between them and form an associative memory, while the lower layers receive top-down directed connections from the layer above. The MNNs elucidated in this paper adopt the structure of cellular neural networks; that is, the upper layers receive bottom-up directed connections from the layer below. Such a setting is convenient for understanding the methodology proposed herein, and the discussion can be applied to more complicated networks, such as DBNs and Hopfield NNs, with minor modification. The related works are still under preparation.

Because of the learning algorithm and training process, the investigation of mosaic solutions is essential in studying MNNs. Such models produce abundant output patterns and make the learning algorithm more efficient. In NNs, many types of activation function, for instance, linear, McCulloch–Pitts, signum, sigmoid, and ramp functions, are chosen for specific purposes. The activation function determines the conditions under which the synapses are activated. Different activation functions yield different output function spaces and hence different dynamical systems.
Ban and Chang [34] considered the activation function f(x) = (1/2)(|x + 1| − |x − 1|), which was introduced in [35]. It has been shown that such an activation function has many applications in the area of image processing (see [36] and the references therein). The topics of pattern formation and spatial chaos for mosaic solutions have been discussed in [37]–[40]. Ban and Chang [34] indicated that the solution space of an MNN is a subshift of finite type (SFT, also known as a Markov shift), while the output space is a sofic shift (also known as a hidden Markov shift). Moreover, the topological entropy is formulated explicitly to characterize the complexity of the solution and output spaces. The result can be applied to hidden spaces via an analogous elucidation.
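For readers who want to experiment, the output function above is straightforward to implement; the snippet below is an illustrative sketch of ours, not code from [34]:

```python
def f(x):
    """Output function f(x) = (|x + 1| - |x - 1|) / 2: it is the identity on
    [-1, 1] and saturates at -1 and +1 outside, so every mosaic state
    (|x| > 1) is mapped to an output symbol in {-1, +1}."""
    return (abs(x + 1) - abs(x - 1)) / 2
```

The saturation property is what turns real-valued mosaic states into the symbolic patterns studied in the rest of the paper.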


The aim of this paper is to set up the mathematical foundation for the deep and shallow architecture of MNNs with respect to the above activation function. Depth of architecture refers to the number of coupled layers in the function learned. It is known that the mammalian brain is organized in a deep architecture, with a given input represented at multiple layers. Mathematically, demonstrating the deep architecture of MNNs amounts to illustrating the topological conjugacy between hidden layers. The method provided herein is more general: an easy extension leads us to the classical McCulloch–Pitts model and the signum activation function. Theorem 3 demonstrates a necessary and sufficient condition for the topological conjugacy of the output space and the hidden space in a two-layer network with the nearest neighborhood under some assumptions. Once the assumption is removed, Theorem 5 provides a sufficient and checkable condition for determining whether the output space and the hidden space are topologically conjugate. In general, let Y(i) and Y(j) represent the ith and jth hidden spaces in an N-layer NN for 1 ≤ i < j ≤ N, where N is a positive integer. (Note that Y(N) is the output space.) Theorem 10 reveals a necessary and sufficient condition for the topological conjugacy of Y(i) and Y(j) under some assumptions, and Theorem 11 addresses a sufficient condition for the conjugacy without any assumption. It is remarkable that these conditions are all checkable.

This paper is organized as follows. The upcoming section illustrates some related results and background used in this paper. The demonstration of the deep and shallow architecture of two-layer NNs is given in Section III, which unveils the main idea of the proposed methodology. Section IV then reveals the conditions that decide whether the number of layers of a general MNN can be reduced. The conclusion is given in Section V.
II. PRELIMINARY

This paper considers the deep and shallow architecture of MNNs. For self-containment, this section introduces some notation and results from our previous work that are used in this paper. A 1-D MNN is realized as

  (d/dt) x_i^(k)(t) = −x_i^(k)(t) + z^(k) + a^(k) f(x_i^(k)(t)) + Σ_{ℓ∈N} b_ℓ^(k) f(x_{i+ℓ}^(k−1)(t))
  (d/dt) x_i^(1)(t) = −x_i^(1)(t) + z^(1) + a^(1) f(x_i^(1)(t)) + Σ_{ℓ∈N} a_ℓ^(1) f(x_{i+ℓ}^(1)(t))    (1)

for some N ∈ ℕ, k = 2, …, N, and i ∈ ℤ. See Fig. 1 for a three-layer NN. We call the finite subset N ⊂ ℤ the neighborhood, and the piecewise linear map f(x) = (1/2)(|x + 1| − |x − 1|) is called the output function. The template T = [A, B, z] is composed of a feedback template A = (A1, A2) with A1 = (a^(1), …, a^(N)) and A2 = (a_ℓ^(1))_{ℓ∈N}, a controlling template B = (B2, …, BN) with Bk = (b_ℓ^(k))_{ℓ∈N} for k ≥ 2, and a threshold z = (z^(1), …, z^(N)). A stationary solution x = (x_i^(1), …, x_i^(N))_{i∈ℤ} ∈ ℝ^{ℤ_{∞×N}} of (1) is called mosaic if |x_i^(k)| > 1 for 1 ≤ k ≤ N and i ∈ ℤ. The output y = (y_i^(1) ⋯ y_i^(N))_{i∈ℤ} ∈ {−1, 1}^{ℤ_{∞×N}}

Fig. 1. Three-layer NNs with nearest neighborhood.

of a mosaic solution is called a mosaic pattern, where y_i^(k) = f(x_i^(k)). The solution space Y of (1) stores the mosaic patterns y, and the output space Y(N) of (1) is the collection of the output patterns in Y; more precisely

Y(N) = {(y_i^(N))_{i∈ℤ} : (y_i^(1) ⋯ y_i^(N))_{i∈ℤ} ∈ Y}.

A neighborhood N is called the nearest neighborhood if N = {−1, 1}. In the rest of this section, we recall the framework of two-layer NNs with the nearest neighborhood for clarity. To investigate the complexity of the behavior of (1), the prescription of parameters is essential. In general, there are infinitely many choices of templates. Since, for MNNs, the neighborhood N is finite and the template is invariant for each i, the solution space is determined by the so-called basic set of admissible local patterns B = (B(1), B(2)), where B(1) (resp. B(2)) is the basic set of admissible local patterns of the first (resp. second) layer. Ban and Chang [34] demonstrated that the parameter space can be divided into finitely many equivalent regions so that two templates T1 and T2 of (1) assert the same solution space if and only if T1 and T2 belong to the same region. In other words, only finitely many different behaviors can be observed once the neighborhood N is given.

The basic set of admissible local patterns of the first layer is a subset of {−−−, −−+, −+−, −++, +−−, +−+, ++−, +++}. For the second layer, we denote by α ◇ α1 α2 the two-layer local pattern with α on the second layer above α1 α2 on the first layer. (Here we refer to the symbols 1 and −1 as + and −, respectively.) The basic set of admissible local patterns of the second layer is a subset of the ordered set {p1, …, p8}, where

p1 = − ◇ −−,  p2 = − ◇ −+,  p3 = − ◇ +−,  p4 = − ◇ ++,
p5 = + ◇ −−,  p6 = + ◇ −+,  p7 = + ◇ +−,  p8 = + ◇ ++    (2)

To investigate the structure of the solution space, we assign the local patterns an ordering, since the solution space Y is determined by the basic set of admissible local patterns. We then define the ordering matrix to clarify the global patterns in the solution space. The ordering matrix of the two-layer NNs is defined as follows.


X2 = (X2(p, q))_{1≤p,q≤8}, with rows and columns indexed by p1, …, p8. Writing p_k = t_k ◇ b_k, where t_k is the second-layer symbol and b_k the first-layer pair, the entry X2(p, q) is the two-layer pattern obtained by placing p_p and p_q side by side: its second-layer row is t_p t_q and its first-layer row is b_p b_q. For instance, X2(1, 1) = −− ◇ −−−− and X2(8, 8) = ++ ◇ ++++.

It is observed that X2(p, q) consists of two local patterns in B(2), and X2 is self-similar; more specifically, if we write

X2 =
( X2;11  X2;12 )
( X2;21  X2;22 )

where each X2;ij is a 4 × 4 matrix, then the bottom (first-layer) patterns of X2;ij(p, q) and X2;i′j′(p, q) are identical for all i, i′, j, j′, p, and q. Let

a00 = −−,  a01 = −+,  a10 = +−,  a11 = ++    (3)

and define the concatenation a_{i1 i2} a_{i2′ i3} = ∅ if and only if i2 ≠ i2′. If a_{i1 i2} a_{i2′ i3} ≠ ∅, then it is a pattern of size 3 × 1 and is denoted by a_{i1 i2 i3}. The ordering matrix of the first layer is defined by

           −−     −+     +−     ++
X1 = −− ( −−−    −−+     ∅      ∅  )
     −+ (  ∅      ∅     −+−    −++ )
     +− ( +−−    +−+     ∅      ∅  )
     ++ (  ∅      ∅     ++−    +++ )    (6)

while the transition matrix of the first layer T1 ∈ M4×4({0, 1}) is defined by

T1(i, j) = 1 if and only if X1(i, j) ∈ B(1).    (4)

Moreover, letting the matrix product of X1 represent the generation of local patterns, it is observed that

            −−      −+      +−      ++
X1² = −− ( −−−−    −−−+    −−+−    −−++ )
      −+ ( −+−−    −+−+    −++−    −+++ )
      +− ( +−−−    +−−+    +−+−    +−++ )
      ++ ( ++−−    ++−+    +++−    ++++ )    (7)

stores all patterns of size 4 × 1. Herein p_k is defined in (2), and the transition matrix of the second layer T2 ∈ M8×8({0, 1}) is defined by

T2(i, j) = 1 if and only if p_i, p_j ∈ B(2).    (9)

As the ordering matrix records the behavior of the global patterns, the transition matrix relates to the number of global patterns. Suppose B(T) = (B(1), B(2)) is the basic set of admissible local patterns of (1) with respect to the template T. The transition matrix T is defined by

T(i, j) = 1, if p_i, p_j ∈ B(2) and α_{i−1} α_{j−1} α_{i+1}, α_{j−1} α_{i+1} α_{j+1} ∈ B(1);  T(i, j) = 0, otherwise.    (5)

Recall that the Kronecker product (tensor product) of matrices M ∈ M_{k1×k2}(R) and N ∈ M_{ℓ1×ℓ2}(R) is defined by

M ⊗ N = (M(i, j) N) ∈ M_{k1 ℓ1 × k2 ℓ2}(R)    (8)

while the Hadamard product (entrywise product) of matrices P, Q ∈ M_{r×r}(R) is given by (P ∘ Q)(i, j) = P(i, j) Q(i, j).

Ban and Chang [34] obtained the formula for the transition matrix T as follows.

Theorem 1: Suppose T is the transition matrix of (1), and T1 and T2 are the transition matrices of (1) with respect to the first and second layers, respectively. Decompose T1² = (T_{i,j})_{i,j=1}^{2} into four 2 × 2 matrices and define T̃1 by T̃1(p, q) = T_{i,j}(k, l), where p = 2i + j − 2 and q = 2k + l − 2. (10) Then

T = T2 ∘ (E2 ⊗ T̃1)    (11)

(11)

where E_k is the k × k matrix with all entries being 1 [34]. Ban and Chang [34] assert that the solution space of an MNN is a so-called shift of finite type (SFT, also known as a topological Markov shift) in symbolic dynamical systems, and the output space is a sofic shift, that is, the image of an SFT under a surjective map. (A detailed introduction to symbolic dynamical systems is given in [41].) One of the most frequently used quantities for measuring spatial complexity is the topological entropy, which measures the growth rate of the number of the global patterns with


respect to the size of lattices. Let X be a symbolic space and let Γ_n(X) denote the number of patterns in X of length n. The topological entropy of X is defined by

h(X) = lim_{n→∞} log Γ_n(X) / n

provided the limit exists. The symbolic transition matrix and labeled graph are introduced to formulate the topological entropy of the output space of an MNN. Suppose the transition matrix T is determined. Set the alphabet A = {a00, a01, a10, a11}, where a_{ij} is defined in (3). Define the symbolic transition matrix as

S = ( ( a00 a01 ; a10 a11 ) ⊗ E4 ) ∘ T,  with S(i, j) = ∅ if T(i, j) = 0.    (12)

Herein ∅ means there exists no local pattern in B related to its corresponding entry in the ordering matrix. Every symbolic transition matrix induces a labeled graph G = (G, L), which consists of an underlying graph G = (V, E) and a labeling L : E → A assigning to each edge a label from the finite alphabet A; here V and E refer to the sets of vertices and edges, respectively. Let V = {p1, …, p8}, and let e_{ij} ∈ E if init(e_{ij}) = p_i, ter(e_{ij}) = p_j, and T(i, j) = 1, where ter(e) and init(e) denote the terminal and initial vertices of the edge e ∈ E, respectively. Define L : E → A by

L(e_{ij}) = a_{kl},  k = [(i − 1)/4],  l = [(j − 1)/4]

where [·] is the Gauss (floor) function.

A labeled graph G = (G, L) is called right-resolving if the restriction of L to E_I is one-to-one for every vertex I, where E_I consists of the edges starting from I. If G is not right-resolving, there exists a labeled graph H, derived by applying the subset construction method (SCM) to G, such that the sofic shift defined by H is identical to the original space. The new labeled graph H = (H, L′) is constructed as follows. The vertices I of H are the nonempty subsets of the vertex set V of G. If I ∈ V′ and a ∈ A, let J denote the set of terminal vertices of edges in G starting at some vertex in I and labeled a; that is, J is the set of vertices reachable from I using the edges labeled a. There are two cases.
1) If J = ∅, do nothing.
2) If J ≠ ∅, J ∈ V′, then draw an edge in H from I to J labeled a.
Carrying this out for each I ∈ V′ and each a ∈ A produces the labeled graph H. Each vertex I in H then has at most one edge with a given label starting at I, which implies that H is right-resolving. The topological entropy of the output space then follows.

Theorem 2: Let G be the labeled graph obtained from the transition matrix T of (1). Let H be the labeled graph obtained by applying the SCM to G, and let H be the transition matrix of H with spectral radius ρ_H. Then the topological entropy of the output space Y(2) is [34]

h(Y(2)) = log ρ_H.

(13)
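The SCM described above is the classical subset construction; the following is a minimal Python sketch (function names and the edge encoding are ours, and the sample data are loosely modeled on Example 1 below):

```python
from collections import deque

def subset_construction(edges):
    """Sketch of the SCM: `edges` maps a vertex to (label, target) pairs.
    Vertices of the new graph are sets of old vertices; every new vertex
    gets at most one outgoing edge per label, so the resulting labeled
    graph is right-resolving."""
    queue = deque(frozenset([v]) for v in edges)
    seen = set(queue)
    graph = {}
    while queue:
        I = queue.popleft()
        targets = {}
        for v in I:
            for label, w in edges.get(v, ()):
                targets.setdefault(label, set()).add(w)
        graph[I] = {label: frozenset(J) for label, J in targets.items()}
        for J in graph[I].values():
            if J not in seen:
                seen.add(J)
                queue.append(J)
    return graph

# Illustrative edge data: vertex 4 has two outgoing edges labeled a01,
# so the original graph is not right-resolving.
edges = {3: [('a01', 6)],
         4: [('a01', 5), ('a01', 6)],
         5: [('a10', 3), ('a11', 6)],
         6: [('a10', 3), ('a11', 6)]}
scm = subset_construction(edges)   # the merged state {5, 6} plays the role of an extra vertex
```

By construction, each subset state has at most one target set per label, which is exactly the right-resolving property.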

We conclude this section with the following example, which is elucidated in [34].

Example 1: Suppose 0 < −a_1^(1) < −a_{−1}^(1) and 0 < −b_1^(2) < −b_{−1}^(2). Pick [m, n] = [2, 3] in the a^(1)–z^(1) plane and [m, n] = [2, 2] in the a^(2)–z^(2) plane. [For instance, T = (A, B, z) with A1 = (2.2, 1.7), A2 = (−4, −2), B = (−2.6, −1.4), and z = (−1.2, 0.3).] The basic sets of admissible local patterns for the first and second layers are

B(1) = {−+−, −++, +−+, +−−, −−+} and B(2) = {p3, p4, p5, p6}

(in the notation of (2)), respectively. The transition matrices for the first and second layers are

T1 =
( 0 1 0 0 )
( 0 0 1 1 )
( 1 1 0 0 )
( 0 0 0 0 )

and

T2 =
( 0 0 0 0 0 0 0 0 )
( 0 0 0 0 0 0 0 0 )
( 0 0 1 1 1 1 0 0 )
( 0 0 1 1 1 1 0 0 )
( 0 0 1 1 1 1 0 0 )
( 0 0 1 1 1 1 0 0 )
( 0 0 0 0 0 0 0 0 )
( 0 0 0 0 0 0 0 0 )

respectively. Observe that decomposing T1² according to (10) gives T̃1, and Theorem 1 then yields the transition matrix T of the MNN. By (12), the nonzero entries of the symbolic transition matrix S are

S(5, 3) = S(6, 3) = a10,  S(4, 5) = a01,  S(3, 6) = S(4, 6) = a01,  S(5, 6) = S(6, 6) = a11

with every other entry equal to ∅ (and, correspondingly, T(i, j) = 1 exactly at these entries).
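The construction in Theorem 1 is mechanical; the following NumPy sketch implements its recipe (the function name and 0-based indexing are ours):

```python
import numpy as np

def theorem1_transition(T1, T2):
    """Sketch of Theorem 1: decompose T1^2 into four 2x2 blocks T_{i,j},
    reindex the blocks into the 4x4 matrix T1_tilde, and mask
    E2 (x) T1_tilde with T2 via the Hadamard (entrywise) product."""
    S = (T1 @ T1 > 0).astype(int)            # T1^2 as a 0-1 matrix
    T1t = np.zeros((4, 4), dtype=int)
    for i in range(2):                       # block row (1-based i in the text)
        for j in range(2):                   # block column
            block = S[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            for k in range(2):
                for l in range(2):
                    # p = 2i + j - 2 and q = 2k + l - 2 in 1-based notation
                    T1t[2 * i + j, 2 * k + l] = block[k, l]
    E2 = np.ones((2, 2), dtype=int)
    return T2 * np.kron(E2, T1t)             # T = T2 o (E2 (x) T1~)
```

With T2 the all-ones matrix, the mask is trivial and the result is simply E2 ⊗ T̃1, which is a convenient sanity check of the block structure.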


First, we consider a simplified case in which the ordering matrix of the two-layer NNs is represented as X = (x_{pq})_{1≤p,q≤4}. Rows and columns are indexed by the four vertical pairs − ◇ −, − ◇ +, + ◇ −, + ◇ + (a second-layer symbol over a first-layer symbol), and x_{pq} is the 2 × 2 pattern obtained by juxtaposing the pth and qth pairs: its top row consists of the two second-layer symbols and its bottom row of the two first-layer symbols.

Fig. 2. Labeled graph H obtained by applying the SCM to G. An extra vertex p9 = {p5, p6} is created so that H is right-resolving.

Since the labeled graph G obtained from T is not right-resolving, applying the SCM to G derives a right-resolving labeled graph H (Fig. 2). The transition matrix of H, indexed by p3, p4, p5, p6, {p5, p6}, is

H =
( 0 0 0 1 0 )
( 0 0 0 0 1 )
( 1 0 0 1 0 )
( 1 0 0 1 0 )
( 0 0 1 1 0 )

Theorem 2 indicates that the topological entropy of the output space Y(2) is h(Y(2)) = log ρ_H = log g, where g = (1 + √5)/2 is the golden mean.

Readers interested in more mathematical results or details are referred to our previous papers and the references therein. The topological structure of the output spaces of multilayer cellular NNs is investigated in [38]; meanwhile, the zeta function, which reveals the information of all periodic solutions, is demonstrated to be a rational function. Later on, finding the so-called dimension groups of the hidden and output spaces asserts whether these spaces are shift equivalent or not [37]. From the mathematical aspect, shift equivalence of two spaces is a necessary (but not sufficient) condition for topological conjugacy.

III. DEEP AND SHALLOW ARCHITECTURE OF TWO-LAYER NEURAL NETWORKS

The investigation above works not only for the output space but also for the hidden spaces of an MNN. More specifically, suppose Y is the solution space of an N-layer NN, Y(N) is the output space, and Y(i) is a hidden space for 1 ≤ i ≤ N − 1. Theorem 2 reveals the topological entropy of the output space h(Y(N)). With a minor modification, Theorem 2 can also be used to compute the topological entropy of the hidden space h(Y(i)) for 1 ≤ i ≤ N − 1.

This section studies the deep and shallow architecture of two-layer NNs. The investigation of the deep–shallow architecture of MNNs is equivalent to elucidating the topological conjugacy of two nearest spaces in its solution space. To be precise, a two-layer NN can be replaced by a single-layer NN if and only if Y(1) ≅ Y(2).
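The entropy computation of Example 1 can be checked numerically; a sketch (H copied from the example, NumPy assumed):

```python
import numpy as np

# Transition matrix H of the right-resolving presentation from Example 1.
H = np.array([[0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0]])

rho = max(abs(np.linalg.eigvals(H)))   # spectral radius of H
entropy = np.log(rho)                  # h(Y^(2)) = log rho_H (Theorem 2)
golden = (1 + 5 ** 0.5) / 2            # the golden mean
```

The spectral radius is attained on the recurrent part of the graph, whose 2 × 2 submatrix has characteristic polynomial λ² − λ − 1, hence the golden mean.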




Suppose the transition matrix T is given. To simplify the notation, set the vertex set V = {0, 1, 2, 3}, where 0 ≡ − ◇ −, 1 ≡ − ◇ +, 2 ≡ + ◇ −, and 3 ≡ + ◇ + (each a second-layer symbol over a first-layer symbol).

There exists an edge e ∈ E if and only if T(init(e) + 1, ter(e) + 1) = 1. This makes G = (V, E) a graph presentation of T. Furthermore, consider A = {α0, α1, α2, α3}, where α0 = −−, α1 = −+, α2 = +−, and α3 = ++. Define L(1), L(2) : E → A by

L(1)(e) = α_{2τ(i(e)) + τ(t(e))}    (14)

L(2)(e) = α_{2[i(e)/2] + [t(e)/2]}    (15)

where τ(c) := c mod 2 and [·] is the Gauss (floor) function. These two labelings L(1) and L(2) define two labeled graphs G(1) = (G, L(1)) and G(2) = (G, L(2)), respectively. Ban and Chang [34] demonstrated that the output space Y(2) is a sofic shift with graph presentation G(2). With a small modification, we also get that the hidden space Y(1) is a sofic shift with graph presentation G(1). It is worth emphasizing that G(1) and G(2) share the same underlying graph but have different labelings. In other words, Y(1) and Y(2) have the same transition matrices T(1) = T(2), and the symbolic transition matrix S(ℓ) of G(ℓ) is defined by

S(ℓ)(p, q) = α_j, if T(ℓ)(p, q) = 1 and L(ℓ)((p, q)) = α_j for some j;  S(ℓ)(p, q) = ∅, otherwise.    (16)

Recall that ∅ means there exists no local pattern in B related to the corresponding entry in X. Here comes the first main result of this investigation.

Theorem 3: Suppose Y(1) and Y(2) are irreducible sofic shifts. If G is an essential graph, and G(1) and G(2) are both right-resolving, then Y(1) ≅ Y(2) if and only if P S(1) = S(2) P, where

P =
( 1 0 0 0 )
( 0 0 1 0 )
( 0 1 0 0 )
( 0 0 0 1 )
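Since P is the permutation matrix exchanging the second and third indices, the criterion of Theorem 3 can be tested entrywise; the sketch below uses illustrative symbolic matrices of our own, not data from the paper:

```python
# P swaps indices 1 and 2 (0-based), so P @ S1 == S2 @ P amounts to the
# entrywise condition S2[i][j] == S1[sigma[i]][sigma[j]] with the
# involution sigma = (0)(1 2)(3).  None plays the role of the empty symbol.
SIGMA = [0, 2, 1, 3]

def conjugate_by_P(S1, S2):
    n = len(S1)
    return all(S2[i][j] == S1[SIGMA[i]][SIGMA[j]]
               for i in range(n) for j in range(n))

E = None
S1 = [[E, 'a01', E, E],
      ['a10', E, E, 'a11'],
      [E, E, 'a00', E],
      [E, 'a01', E, E]]
# Build S2 so that P S1 = S2 P holds by construction.
S2 = [[S1[SIGMA[i]][SIGMA[j]] for j in range(4)] for i in range(4)]
```

Because SIGMA is an involution, the same check works in both directions.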


Fig. 3. Enlarged ordering matrix of two-layer NNs.

Proof: Without loss of generality, we may assume that the vertex set of G is V = {0, 1, 2, 3}. Suppose P S(1) = S(2) P. Define (∂, Φ) : G(1) → G(2) by

∂(v) = 3 − v for v = 1, 2;  ∂(v) = v otherwise;  Φ(e) = e′ for all v ∈ V, e ∈ E    (17)

where init(e′) = ∂(init(e)) and ter(e′) = ∂(ter(e)). Then ∂ is one-to-one and onto. P S(1) = S(2) P asserts P T(1) = T(2) P and

T(1)(p, q) = T(2)(1 + ∂(p − 1), 1 + ∂(q − 1))    (18)

for 1 ≤ p, q ≤ 4. That is, an edge e ∈ E infers another edge e′ with init(e′) = ∂(init(e)) and ter(e′) = ∂(ter(e)). Hence, Φ is well defined, one-to-one, and onto. This shows that (∂, Φ) is a graph isomorphism. Moreover, it can be verified that [∂(v)/2] = τ(v) for v = 0, 1, 2, 3. For each e ∈ E,

L(2)(Φ(e)) = α_{2[∂(init(e))/2] + [∂(ter(e))/2]}

= α_{2τ(init(e)) + τ(ter(e))} = L(1)(e).

This demonstrates that (∂, Φ) : G(1) → G(2) is a labeled-graph isomorphism; in other words, Y(1) ≅ Y(2).

Conversely, suppose that Y(1) ≅ Y(2). Since Y(1), Y(2) are irreducible and S(1) and S(2) are right-resolving, there exists a conjugacy matrix D such that D S(1) = S(2) D and D is a 0-1 matrix. Similarly as above, D defines a graph isomorphism (∂, Φ) : G(1) → G(2). If ∂(0) ≠ 0, it comes immediately from D S(1) = S(2) D that ∂(0) = 1 and ∂(2) = 0. Assume that ∂(1) = 2 and ∂(3) = 3. It is easily seen that L(2)(Φ(2, i)) = L(1)(2, i) for all 0 ≤ i ≤ 3, which contradicts D S(1) = S(2) D. Another contradiction occurs for the case ∂(1) = 3, ∂(3) = 2. This forces ∂(0) = 0. Repeating the same procedure derives ∂(1) = 2, ∂(2) = 1, and ∂(3) = 3. In other words, D = P. This completes the proof. ∎

Now we consider the deep–shallow architecture of the original two-layer NNs based on the above result for the simplified case. Start by reconstructing the ordering matrix X2 as a 16 × 16 matrix, enlarging the size of local patterns into a rectangle (Fig. 3). Without ambiguity, we still refer to the enlarged ordering matrix as X2. It comes immediately that X2 is still self-similar. Enlarge the local patterns of (1) so that they are the patterns in the newly constructed ordering matrix. For example, if − ◇ −− is an admissible local pattern in B(2), then the following eight local patterns are selected: X2(1, 1), X2(1, 5), X2(2, 3), X2(2, 7), X2(9, 1), X2(9, 5), X2(10, 3), and X2(10, 7). Write

X2 =
( X11 X12 X13 X14 )
( X21 X22 X23 X24 )
( X31 X32 X33 X34 )
( X41 X42 X43 X44 )

with

Xij =
( x_{ij;11} x_{ij;12} x_{ij;13} x_{ij;14} )
( x_{ij;21} x_{ij;22} x_{ij;23} x_{ij;24} )
( x_{ij;31} x_{ij;32} x_{ij;33} x_{ij;34} )
( x_{ij;41} x_{ij;42} x_{ij;43} x_{ij;44} )

for 1 ≤ i, j ≤ 4, as in Fig. 3. Here x_{ij;kl} means the pattern with a_{r1 r2} a_{r2′ r3} on the second layer over a_{s1 s2} a_{s2′ s3} on the first layer, where

r1 = [(i − 1)/2],  r2 = i − 1 − 2 r1,  r2′ = [(j − 1)/2],  r3 = j − 1 − 2 r2′,
s1 = [(k − 1)/2],  s2 = k − 1 − 2 s1,  s2′ = [(l − 1)/2],  s3 = l − 1 − 2 s2′.    (19)

If a_{r1 r2} a_{r2′ r3} = ∅ or a_{s1 s2} a_{s2′ s3} = ∅, then x_{ij;kl} = ∅. Furthermore, if x_{ij;kl} ≠ ∅, then it is denoted by the pattern a_{r1} a_{r2} a_{r3} ◇ a_{s1} a_{s2} a_{s3} in {+, −}^{Z_{3×2}}.

Embed the local patterns into patterns of size 4 × 2 and treat the local pattern

y0 y1 y2 y3 ◇ u0 u1 u2 u3

(with the second layer y on top) as the concatenation of the two smaller patterns y0 y1 ◇ u0 u1 and y2 y3 ◇ u2 u3. Before addressing the main theorem, we first give the order of these patterns. Since every pattern is composed of two 2 × 2 blocks, we can encode each pattern via a function {−, +}^{Z_{2×2}} → {1, 2, …, 16} defined as

y1 y2 ◇ u1 u2 ↦ 8χ(y1) + 4χ(y2) + 2χ(u1) + χ(u2) + 1

where χ : {−, +} → {0, 1}, given by χ(−) = 0 and χ(+) = 1, is used for assigning the weight for the symbols


+ and −. Note that this defines the ordering matrix X2 observed in Fig. 3. Define two labelings

L(1)(y0 y1 y2 y3 ◇ u0 u1 u2 u3) = u0 u1 u2 u3,
L(2)(y0 y1 y2 y3 ◇ u0 u1 u2 u3) = y0 y1 y2 y3.

Then G(i) = (G, L(i)) is the labeled graph presentation of Y(i) for i = 1, 2. Furthermore, define η : {−, +}^{Z_{2×2}} → {−, +}^{Z_{2×2}} by η(y1 y2 ◇ u1 u2) = u1 u2 ◇ y1 y2, and define P4 ∈ R^{16×16} by P4(i, j) = 1 if some pattern y1 y2 ◇ u1 u2 has encoding i and its image under η has encoding j, and P4(i, j) = 0 otherwise. Let S(1) and S(2) be the symbolic transition matrices of G(1) and G(2), respectively. The following theorem is derived via a method analogous to the proof of Theorem 3, and thus we omit the proof.

Theorem 4: If G(1) and G(2) are both right-resolving, then Y(1) ≅ Y(2) if and only if S(1) P4 = P4 S(2).

It remains to investigate the deep and shallow architecture of a two-layer NN in the case where either G(1) or G(2) is not right-resolving. First, we introduce the factor-like matrix. A 0-1 matrix F ∈ R^{m×n} is called factor-like if Σ_j F(i, j) ≤ 1 for 1 ≤ i ≤ m. The following theorem demonstrates a sufficient condition for Y(1) ≅ Y(2).

Theorem 5: Suppose G(1) and G(2) are the labeled graph presentations of Y(1) and Y(2), respectively. Let H(i) be the labeled graph obtained by applying the SCM to G(i) and let H(i) be the symbolic transition matrix of H(i) for i = 1, 2. If there exist factor-like matrices F_{1,2} and F_{2,1} such that F_{1,2} H(1) = H(2) F_{1,2} and F_{2,1} H(2) = H(1) F_{2,1}, then Y(1) ≅ Y(2).

Proof: With a small modification of the proof of Theorem 3, the existence of a factor-like matrix F_{1,2} with F_{1,2} H(1) = H(2) F_{1,2} induces a one-to-one map from Y(1) to Y(2). Similarly, there is a one-to-one map from Y(2) to Y(1) if there exists a factor-like matrix F_{2,1} such that F_{2,1} H(2) = H(1) F_{2,1}. The desired result Y(1) ≅ Y(2) then follows. ∎

The upcoming example for simplified two-layer NNs concludes this section.
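The hypotheses of Theorem 5 are mechanical to verify; the sketch below checks the factor-like property and the intertwining condition on the underlying 0-1 transition structure (the symbolic labels are ignored in this simplified check, and the function names are ours):

```python
import numpy as np

def is_factor_like(F):
    """A 0-1 matrix is factor-like if every row sums to at most 1."""
    F = np.asarray(F)
    return set(F.flatten().tolist()) <= {0, 1} and bool((F.sum(axis=1) <= 1).all())

def intertwines(F, H1, H2):
    """Check the Theorem 5-style condition F @ H1 == H2 @ F for 0-1
    transition matrices (labels dropped; only the skeleton is compared)."""
    return np.array_equal(F @ H1, H2 @ F)
```

In practice one would enumerate candidate factor-like matrices and test both directions, F_{1,2} and F_{2,1}.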
Example 2: Suppose the local patterns of a simplified two-layer NN are the following (each 2 × 2 pattern stacks a second-layer pair over a first-layer pair):

−+ ◇ −+,  −+ ◇ ++,  +− ◇ −−,  +− ◇ ++,  ++ ◇ −−,  ++ ◇ ++.

In other words, the transition matrix of Y is

T =
( 0 0 0 1 )
( 0 0 0 1 )
( 1 0 0 1 )
( 1 0 0 1 )

It comes that T is not irreducible. Fig. 4 shows that the underlying graph G is not essential unless v2 and v3 are erased. Note that the new graph G′ obtained by deleting these two vertices remains a graph presentation of Y. Furthermore,

Fig. 4. Underlying graph G for the solution space in Example 2 is not essential until v2 and v3 are erased.

the symbolic transition matrices of Y(1) and Y(2) are identical, say

S(1) = S(2) =
( ∅   a01 )
( a10 a11 )

It follows immediately that Y(1) ≅ Y(2).

IV. ARCHITECTURE OF GENERAL MULTILAYER NEURAL NETWORKS

Following the elucidation of the topological conjugacy of the output space and the hidden space in a two-layer network with the nearest neighborhood, this section extends the methodology proposed in the previous section to illustrate the deep–shallow architecture of N-layer NNs with arbitrary neighborhood, for an arbitrary integer N ≥ 2.

A. Architecture of Simplified MNNs

To clarify the main idea, the discussion starts with studying simplified N-layer NNs. In this case, the ordering matrix X_N is indexed by {− ◇ − ◇ ⋯ ◇ −, − ◇ − ◇ ⋯ ◇ +, …, + ◇ + ◇ ⋯ ◇ +}. Suppose B ⊆ {−, +}^{Z_{2×N}} is a basic set of admissible local patterns determined by a partition of the parameter space and Y ⊆ {−, +}^{Z_{∞×N}} presents the solution space. The system then induces an output space Y(N) and N − 1 hidden spaces Y(i) for i = 1, 2, …, N − 1. Recall that κ1 ◇ κ2 indicates the pattern with κ1 above κ2. Let A = {α0, α1, α2, α3}, where α0 = −−, α1 = −+, α2 = +−, and α3 = ++. For i = 1, 2, …, N, define the labeling L(i) : B → A by

L(i)(y_ℓ^(N) y_r^(N) ◇ ⋯ ◇ y_ℓ^(1) y_r^(1)) = y_ℓ^(i) y_r^(i).

The transition matrix T(i) and the symbolic transition matrix S(i) for the space Y(i) are then defined analogously to the earlier discussion, where 1 ≤ i ≤ N.

Suppose N = 3. Let V = {0, 1, …, 7}, where 0 = − ◇ − ◇ −, 1 = − ◇ − ◇ +, …, 7 = + ◇ + ◇ +. The labelings L(1), L(2), L(3) : E → A can be expressed explicitly by

L(1)(e) = α_{2τ(i(e)) + τ(t(e))}

(20)

L(2)(e) = α_{2τ([i(e)/2]) + τ([t(e)/2])}    (21)

L(3)(e) = α_{2[i(e)/4] + [t(e)/4]}    (22)

where τ(c) := c mod 2 and [·] is the Gauss (floor) function. Theorem 6 extends Theorem 3 to simplified three-layer NNs.
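The labelings (20)–(22) are simple bit manipulations on the vertex encoding; a sketch (function names are ours):

```python
# A vertex v in {0, ..., 7} encodes a vertical triple of symbols, and each
# labeling reads off one layer of the pair of vertices joined by an edge,
# returning the index of the corresponding alpha symbol.
def tau(c):
    return c % 2

def label1(u, v):          # (20): index 2*tau(i(e)) + tau(t(e))
    return 2 * tau(u) + tau(v)

def label2(u, v):          # (21): index 2*tau([i(e)/2]) + tau([t(e)/2])
    return 2 * tau(u // 2) + tau(v // 2)

def label3(u, v):          # (22): index 2*[i(e)/4] + [t(e)/4]
    return 2 * (u // 4) + (v // 4)
```

Each labeling extracts one bit per endpoint, so the three labelings together recover all three layers of the edge pattern.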


Theorem 6: Suppose Y(i) is irreducible for 1 ≤ i ≤ 3 and G is essential. Let

P3,2 = {(P ⊗ I2) diag(C_p)_{1≤p≤4} : C_p = I2, J2}
P2,1 = {K^{-1}((P ⊗ I2) diag(C_p)_{1≤p≤4})K : C_p = I2, J2}
P3,1 = {L^{-1}((P ⊗ I2) diag(C_p)_{1≤p≤4})L : C_p = I2, J2}

where

P = ( 1 0 0 0
      0 0 1 0
      0 1 0 0
      0 0 0 1 ),

K = ( 1 0 0 0 0 0 0 0
      0 0 0 0 1 0 0 0
      0 1 0 0 0 0 0 0
      0 0 0 0 0 1 0 0
      0 0 1 0 0 0 0 0
      0 0 0 0 0 0 1 0
      0 0 0 1 0 0 0 0
      0 0 0 0 0 0 0 1 ),

L = ( 1 0 0 0 0 0 0 0
      0 0 1 0 0 0 0 0
      0 1 0 0 0 0 0 0
      0 0 0 1 0 0 0 0
      0 0 0 0 1 0 0 0
      0 0 0 0 0 0 1 0
      0 0 0 0 0 1 0 0
      0 0 0 0 0 0 0 1 )

and

I2 = ( 1 0
       0 1 ),   J2 = ( 0 1
                       1 0 ).

If G(i) is right-resolving for 1 ≤ i ≤ 3, then Y(i) ≅ Y(j) if and only if S(j) P_{j,i} = P_{j,i} S(i) for some P_{j,i} ∈ P_{j,i}, where 1 ≤ i < j ≤ 3.

Proof: We show that Y(2) ≅ Y(3) if and only if P3,2 S(2) = S(3) P3,2 for some P3,2 ∈ P3,2. The proofs of Y(1) ≅ Y(2) and Y(1) ≅ Y(3) are analogous.

Observe that L(2)(v, v′) = L(3)(v + 1, v′ + 1) for v, v′ ∈ {0, 2, 4, 6} ⊂ V. Let V_k = {2k, 2k + 1}, k = 0, 1, 2, 3. Set G̃ = (Ṽ, Ẽ) with Ṽ = {V0, V1, V2, V3} and (V, V′) ∈ Ẽ if there exist v ∈ V, v′ ∈ V′ such that (v, v′) ∈ E. In other words, G̃ is obtained from G by bundling the vertices carrying the same labels under L(2) and L(3). Theorem 4 demonstrates that X_{G̃(2)} ≅ X_{G̃(3)} if and only if P S_{G̃(2)} = S_{G̃(3)} P. It can be seen that Y(2) ≅ Y(3) if and only if X_{G̃(2)} ≅ X_{G̃(3)} and there are conjugacies in V_k, k = 0, 1, 2, 3, where X_A refers to the space induced by A. Therefore, Y(2) ≅ Y(3) if and only if P3,2 S(2) = S(3) P3,2 for some P3,2 ∈ P3,2. ∎

It follows from Theorem 6 that two layers are topologically conjugate if and only if their symbolic transition matrices are similar under a suitable permutation matrix, provided their labeled graph presentations are right-resolving. For N ≥ 3, Theorem 7 reveals a necessary and sufficient condition for the conjugacy of Y(i) and Y(j), 1 ≤ i < j ≤ N. The proof is routine and analogous to that of Theorem 6, thus is omitted.

Theorem 7: Suppose Y(i) is irreducible and G(i) is right-resolving for 1 ≤ i ≤ N. Let

P_{N,N−1} = {(P_{N−1,N−2} ⊗ I2) diag(C_p)_{1≤p≤2^{N−1}} :
             P_{N−1,N−2} ∈ P_{N−1,N−2}, C_p ∈ {I2, J2} for all p}   (23)

and

P_{j,i} = {K_{j,i}^{-1} P_{N,N−1} K_{j,i} : P_{N,N−1} ∈ P_{N,N−1}},
           1 ≤ i < j ≤ N, (i, j) ≠ (N − 1, N)   (24)

where K_{j,i} is the permutation matrix that bundles the vertices carrying the same label under L(i) and L(j). Then Y(i) ≅ Y(j) if and only if S(j) P_{j,i} = P_{j,i} S(i) for some P_{j,i} ∈ P_{j,i}, where 1 ≤ i < j ≤ N.
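The candidate family P3,2 of Theorem 6 is small enough to enumerate by brute force. The sketch below (using numpy; the helper `block_diag` and the list name `P32` are our own) builds all (P ⊗ I2) diag(C1, ..., C4) with each C_p ∈ {I2, J2} and confirms that each of the 2^4 = 16 candidates is an 8 × 8 permutation matrix; testing the conjugacy criterion then amounts to checking P3,2 S(2) = S(3) P3,2 for each candidate.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=int)
J2 = np.array([[0, 1], [1, 0]])
P = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])

def block_diag(blocks):
    """Simple block-diagonal assembly, avoiding extra dependencies."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n), dtype=int)
    r = 0
    for b in blocks:
        k = b.shape[0]
        out[r:r + k, r:r + k] = b
        r += k
    return out

# all members (P (x) I2) . diag(C_1, ..., C_4), C_p in {I2, J2}
P32 = [np.kron(P, I2) @ block_diag(cs)
       for cs in itertools.product([I2, J2], repeat=4)]

print(len(P32))  # 16 candidates
# each candidate is an 8x8 permutation matrix
print(all((M.sum(axis=0) == 1).all() and (M.sum(axis=1) == 1).all()
          for M in P32))
```

A conjugacy check would then loop over `P32` and compare the relabeled symbolic transition matrices entrywise.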

B. Architecture of General MNNs

The regularity observed in the simplified MNNs is preserved in general MNNs. To illustrate this property, the discussion of two-layer NNs comes first. In this case, each layer is derived from local patterns of size k × 2, where k ≥ 3. Suppose k = 3. We transfer the local patterns to size 4 × 2 by considering the set

B̄ := { y0^(2) y1^(2) y2^(2) y3^(2) ⋄ y0^(1) y1^(1) y2^(1) y3^(1) :
       y0^(2) y1^(2) y2^(2) ⋄ y0^(1) y1^(1) y2^(1) ∈ B and
       y1^(2) y2^(2) y3^(2) ⋄ y1^(1) y2^(1) y3^(1) ∈ B }.

Every pattern in B̄ is, as in the previous section, numbered by the function χ̄: {−,+}^{Z_{2×2}} → {1, 2, ..., 16} defined as

χ̄(y1 y2 ⋄ u1 u2) = 8χ(y1) + 4χ(y2) + 2χ(u1) + χ(u2) + 1

where χ: {−,+} → {0, 1} is given by χ(−) = 0 and χ(+) = 1. The ordering matrix is then defined analogously. Furthermore, set

L(1)(y0 y1 y2 y3 ⋄ u0 u1 u2 u3) = u0 u1 u2 u3
L(2)(y0 y1 y2 y3 ⋄ u0 u1 u2 u3) = y0 y1 y2 y3.

Then G(i) = (G, L(i)) is the labeled graph of Y(i) for i = 1, 2. Notably, considering local patterns of length 4 makes general MNNs look like simplified MNNs. To see this, let η: {−,+}^{Z_{2×2}} → {−,+}^{Z_{2×2}} be defined as

η(y1 y2 ⋄ u1 u2) = u1 u2 ⋄ y1 y2

and set P4(i, j) = 1 if χ̄(y1 y2 ⋄ u1 u2) = i and χ̄ ∘ η(y1 y2 ⋄ u1 u2) = j, and P4(i, j) = 0 otherwise.
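The transfer from 3 × 2 to 4 × 2 local patterns above is a standard sliding-block recoding: a wide pattern is admissible exactly when both of its overlapping 3 × 2 sub-patterns lie in B. A minimal sketch (the tuple representation and the helper name `widen` are our own) is the following.

```python
import itertools

def widen(B):
    """Build the 4-wide set B_bar from a set B of 3-wide two-row patterns.

    Patterns are (top_row, bottom_row) tuples of '-'/'+' strings.
    """
    wide = set()
    for (y, u), (yy, uu) in itertools.product(B, repeat=2):
        # overlap condition: the right 2 columns of the first block must
        # match the left 2 columns of the second block, in both rows
        if y[1:] == yy[:2] and u[1:] == uu[:2]:
            wide.add((y + yy[2], u + uu[2]))
    return wide

# toy basic set: vertical stripes only
B = {('-+-', '+-+'), ('+-+', '-+-')}
print(len(widen(B)))  # 2
```

Here only the two cross-matched pairs satisfy the overlap condition, so `widen(B)` contains exactly the two 4-wide stripe patterns.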

It is seen that Theorem 8 is an extension of Theorem 3 without the restriction to the nearest neighborhood. The proof is similar, thus is omitted.

Theorem 8: Suppose Y(i) is irreducible and G(i) is right-resolving for i = 1, 2. Then Y(1) ≅ Y(2) if and only if S(1) P4 = P4 S(2).

For a two-layer NN with arbitrary neighborhood, without loss of generality, we may assume that the size of the local pattern is 2ℓ × N for some ℓ ∈ N. For the case N = 2, set the order of each pattern by

χ̄(y1 y2 ··· y_{2ℓ} ⋄ u1 u2 ··· u_{2ℓ}) = 1 + Σ_{i=1}^{2ℓ} (2^{2ℓ−i} χ(u_i) + 2^{4ℓ−i} χ(y_i)).

Similar to the above, let Y(1) and Y(2) be the output space and hidden space extracted from the solution space Y. Define two labelings as

L(1)(y0 y1 ··· y_{k−1} ⋄ u0 u1 ··· u_{k−1}) = u0 u1 ··· u_{k−1}
L(2)(y0 y1 ··· y_{k−1} ⋄ u0 u1 ··· u_{k−1}) = y0 y1 ··· y_{k−1}.

Define η: {−,+}^{Z_{k×2}} → {−,+}^{Z_{k×2}} and P_k ∈ R^{2^{2k} × 2^{2k}} by

η(y0 y1 ··· y_{k−1} ⋄ u0 u1 ··· u_{k−1}) = u0 u1 ··· u_{k−1} ⋄ y0 y1 ··· y_{k−1}

and P_k(i, j) = 1 if χ̄(y0 y1 ··· y_{k−1} ⋄ u0 u1 ··· u_{k−1}) = i and χ̄ ∘ η(y0 y1 ··· y_{k−1} ⋄ u0 u1 ··· u_{k−1}) = j, and P_k(i, j) = 0 otherwise.
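For k = 2 the matrix P_k is a 16 × 16 permutation matrix that can be tabulated directly from the numbering and the row swap η. A sketch follows (using numpy; `chi`, `number`, and `Pk` are our names for χ, the pattern numbering, and P_k). Since η is an involution, the resulting matrix is symmetric and squares to the identity.

```python
import itertools
import numpy as np

def chi(s):
    return 0 if s == '-' else 1

def number(y, u):
    """Number a 2x2 pattern (top row y, bottom row u) from 1 to 16."""
    return 8 * chi(y[0]) + 4 * chi(y[1]) + 2 * chi(u[0]) + chi(u[1]) + 1

Pk = np.zeros((16, 16), dtype=int)
for y in itertools.product('-+', repeat=2):
    for u in itertools.product('-+', repeat=2):
        Pk[number(y, u) - 1, number(u, y) - 1] = 1  # eta swaps the two rows

# eta is an involution, so Pk is symmetric and Pk @ Pk = I
print(np.array_equal(Pk @ Pk, np.eye(16, dtype=int)))  # True
print(Pk.trace())  # 4: exactly the patterns with y = u are fixed by eta
```

The conjugacy test of Theorem 9 then reduces to a single matrix comparison, S(1) Pk versus Pk S(2).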

With these definitions, we have the following theorem.

Theorem 9: If G(1) and G(2) are both right-resolving, then Y(1) ≅ Y(2) if and only if S(1) P_k = P_k S(2).

Combining Theorems 7 and 9 leads to Theorem 10, which addresses the scheme for elucidating the deep-shallow architecture of general MNNs. For an N-layer network whose local patterns are of size 2ℓ × N, suppose the basic patterns are ordered by χ̄: {−,+}^{Z_{2ℓ×N}} → {1, 2, ..., 2^{2ℓN}}, and P_k is defined analogously to the above. Meanwhile, the labeling L(i) for Y(i) is chosen so that G(i) = (G, L(i)) is the labeled graph presentation of Y(i) for 1 ≤ i ≤ N. Theorem 10 follows from the discussion in the proofs of Theorems 7 and 9.

Theorem 10: Suppose S(i) is the symbolic transition matrix of G(i) = (G, L(i)) and G(i) is right-resolving for i = 1, 2, ..., n. Let

P_{n,n−1;k} = {(P_{n−1,n−2} ⊗ I2) diag(C_p)_{1≤p≤2^{n−1}} :
               P_{n−1,n−2} ∈ P_{n−1,n−2;k}, C_p is an ℓ × ℓ permutation matrix}

and

P_{j,i;k} = {K_{j,i;k}^{-1} P_{n,n−1;k} K_{j,i;k} : P_{n,n−1;k} ∈ P_{n,n−1;k}},
             1 ≤ i < j ≤ n, (i, j) ≠ (n − 1, n)

where K_{j,i;k} is the permutation that bundles those vertices carrying the same label under L(i) and L(j), and P_{2,1;k} = {P_k}. Then Y(i) ≅ Y(j) if and only if S(j) P_{j,i;k} = P_{j,i;k} S(i) for some P_{j,i;k} ∈ P_{j,i;k}, where 1 ≤ i < j ≤ N.

To illustrate the deep-shallow architecture of an MNN in the case that the G(i) are not right-resolving, the existence of factor-like matrices provides a sufficient condition. Theorem 11 is derived via an argument similar to the proof of Theorem 5; hence we omit the proof.

Theorem 11: Suppose G(i) and G(j) are the labeled graph presentations of Y(i) and Y(j), respectively, where 1 ≤ i < j ≤ N. Let H(i) and H(j) be the labeled graphs obtained by applying the SCM to G(i) and G(j), whose symbolic transition matrices are H(i) and H(j), respectively. If there exist factor-like matrices F_{i,j} and F_{j,i} such that F_{i,j} H(i) = H(j) F_{i,j} and F_{j,i} H(j) = H(i) F_{j,i}, then Y(i) ≅ Y(j).

V. CONCLUSION

This paper illustrates the deep and shallow architecture of MNNs (1). Ban and Chang [34] show that the solution space Y is a topological Markov chain and that the output space Y(N) is a sofic shift in the sense of symbolic dynamical systems. Analogously, every hidden space Y(i), 1 ≤ i ≤ N − 1, is also a sofic shift. It is therefore natural to consider the deep and shallow architecture of MNNs. The present elucidation proposes a scheme to determine whether or not an N-layer NN can be replaced by a K-layer NN with K < N. The main contributions of this paper are Theorems 10 and 11, which propose a scheme for determining whether two given layers in an MNN are topologically conjugate. The ith, (i + 1)th, ..., (j − 1)th layers are redundant, and hence can be removed, if Y(i) ≅ Y(j) for 1 ≤ i < j ≤ N. In other words, the given N-layer network can be replaced by an (N + i − j)-layer network. Theorem 10 indicates that two layers are conjugate if and only if one layer can be obtained by relabeling the other, provided their corresponding labeled graphs are both right-resolving. On the other hand, Theorem 11 shows that the existence of factor-like matrices provides a sufficient condition for the topological conjugacy of two layers. It is remarkable that a small modification of the above procedure allows for decoupling the solution space of an N-layer NN into k subspaces for arbitrary 2 ≤ k ≤ N. More explicitly, one can treat several hidden layers as an individual space, and then consider the conjugacy of two specific spaces.

ACKNOWLEDGMENT

The author wishes to express his gratitude to the anonymous referees, whose comments led to a significant improvement of this paper.
REFERENCES

[1] Y. A. Alsultanny and M. M. Aqel, "Pattern recognition using multilayer neural-genetic algorithm," Neurocomputing, vol. 51, pp. 237–247, Apr. 2003.
[2] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[3] M. A. Cohen and S. Grossberg, "Absolute stability of global pattern formation and parallel memory storage by competitive neural networks," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 5, pp. 815–826, Sep./Oct. 1983.
[4] G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen, "Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps," IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 698–713, Sep. 1992.
[5] G. A. Carpenter, S. Grossberg, and J. H. Reynolds, "ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network," Neural Netw., vol. 4, no. 5, pp. 565–588, 1991.
[6] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol. 2, no. 5, pp. 359–366, 1989.
[7] J. J. Hopfield and D. W. Tank, "'Neural' computation of decisions in optimization problems," Biol. Cybern., vol. 52, no. 3, pp. 141–152, 1985.


[8] C. Peterson and B. Söderberg, "A new method for mapping optimization problems onto neural networks," Int. J. Neural Syst., vol. 1, no. 1, pp. 3–22, 1989.
[9] B. Widrow, "Layered neural nets for pattern recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 7, pp. 1109–1118, Jul. 1988.
[10] B. Widrow and M. A. Lehr, "30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation," Proc. IEEE, vol. 78, no. 9, pp. 1415–1442, Sep. 1990.
[11] K. Fukushima, "Artificial vision by multi-layered neural networks: Neocognitron and its advances," Neural Netw., vol. 37, pp. 103–119, Jan. 2013.
[12] K. Fukushima, "Training multi-layered neural network neocognitron," Neural Netw., vol. 40, pp. 18–31, Apr. 2013.
[13] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, "A quantitative theory of immediate visual recognition," Prog. Brain Res., vol. 165, pp. 33–56, Oct. 2007.
[14] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large-Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. Cambridge, MA, USA: MIT Press, 2007.
[15] P. E. Utgoff and D. J. Stracuzzi, "Many-layered learning," Neural Comput., vol. 14, no. 10, pp. 2497–2529, Oct. 2002.
[16] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
[17] Y. Freund and D. Haussler, "Unsupervised learning of distributions on binary vectors using two layer networks," Baskin Center Comput. Eng. Inf. Sci., Univ. California, Santa Cruz, CA, USA, Tech. Rep. UCSC-CRL-94-25, 1994.
[18] V. Kůrková and M. Sanguineti, "Can two hidden layers make a difference?" in Adaptive and Natural Computing Algorithms (Lecture Notes in Computer Science), vol. 7823, M. Tomassini, A. Antonioni, F. Daolio, and P. Buesser, Eds. New York, NY, USA: Springer-Verlag, 2013, pp. 30–39.
[19] P. Arena, S. Baglio, L. Fortuna, and G. Manganaro, "Self-organization in a two-layer CNN," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 45, no. 2, pp. 157–162, Feb. 1998.
[20] P. C. Bressloff, J. D. Cowan, M. Golubitsky, P. J. Thomas, and M. C. Wiener, "Geometric visual hallucinations, Euclidean symmetry and the functional architecture of striate cortex," Philosoph. Trans. Roy. Soc. London Ser. B, Biol. Sci., vol. 356, no. 1407, pp. 299–330, 2001.
[21] S. Coombes, "Waves, bumps, and patterns in neural field theories," Biol. Cybern., vol. 93, no. 2, pp. 91–108, 2005.
[22] K. R. Crounse and L. O. Chua, "Methods for image processing and pattern formation in cellular neural networks: A tutorial," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 42, no. 10, pp. 583–601, Oct. 1995.
[23] S. Coombes and M. Owen, "Evans functions for integral neural field equations with Heaviside firing rate function," SIAM J. Appl. Dyn. Syst., vol. 3, no. 4, pp. 574–600, 2004.
[24] L. O. Chua and T. Roska, Cellular Neural Networks and Visual Computing: Foundations and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[25] D. Cai, L. Tao, A. V. Rangan, and D. W. McLaughlin, "Kinetic theory for neuronal network dynamics," Commun. Math. Sci., vol. 4, no. 1, pp. 97–127, 2006.
[26] A. Longtin and P. K. Swain, Eds., Stochastic Dynamics of Neural and Genetic Networks, vol. 16. College Park, MD, USA: AIP Pub., 2006.
[27] J. Moehlis, "Canards for a reduction of the Hodgkin–Huxley equations," J. Math. Biol., vol. 52, no. 2, pp. 141–153, 2006.
[28] J. Peng, D. Zhang, and X. Liao, "A digital image encryption algorithm based on hyper-chaotic cellular neural network," Fund. Inf., vol. 90, no. 3, pp. 269–282, 2009.
[29] Z. Yang, Y. Nishio, and A. Ushida, "Image processing of two-layer CNNs—Applications and their stability," IEICE Trans. Fundam., vol. E85-A, no. 9, pp. 2052–2060, 2002.
[30] P. Arena, L. Fortuna, and L. Occhipinti, "A CNN algorithm for real time analysis of DNA microarrays," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 49, no. 3, pp. 335–340, Mar. 2002.
[31] D. Bálya, B. Roska, T. Roska, and F. S. Werblin, "A CNN framework for modeling parallel processing in a mammalian retina," Int. J. Circuit Theory Appl., vol. 30, nos. 2–3, pp. 363–393, Mar./Jun. 2002.
[32] H. Lorach, R. Benosman, O. Marre, S.-H. Ieng, J. A. Sahel, and S. Picaud, "Artificial retina: The multichannel processing of the mammalian retina achieved with a neuromorphic asynchronous light acquisition device," J. Neural Eng., vol. 9, p. 066004, Dec. 2012.
[33] A. Wohrer and P. Kornprobst, "Virtual retina: A biological retina model and simulator, with contrast gain control," J. Comput. Neurosci., vol. 26, no. 2, pp. 219–249, Apr. 2009.
[34] J.-C. Ban and C.-H. Chang, "The learning problem of multi-layer neural networks," Neural Netw., vol. 46, pp. 116–123, Oct. 2013.
[35] L. O. Chua and L. Yang, "Cellular neural networks: Theory," IEEE Trans. Circuits Syst., vol. 35, no. 10, pp. 1257–1272, Oct. 1988.
[36] L. O. Chua, CNN: A Paradigm for Complexity, vol. 31. Singapore: World Scientific, 1998.
[37] J.-C. Ban, C.-H. Chang, and S.-S. Lin, "On the structure of multi-layer cellular neural networks," J. Differ. Equ., vol. 252, no. 8, pp. 4563–4597, Apr. 2012.
[38] J.-C. Ban, C.-H. Chang, S.-S. Lin, and Y.-H. Lin, "Spatial complexity in multi-layer cellular neural networks," J. Differ. Equ., vol. 246, no. 2, pp. 552–580, Jan. 2009.
[39] S.-N. Chow, J. Mallet-Paret, and E. S. Van Vleck, "Pattern formation and spatial chaos in spatially discrete evolution equations," Random Comput. Dyn., vol. 4, no. 2, pp. 109–178, 1996.
[40] J. Juang and S.-S. Lin, "Cellular neural networks: Mosaic pattern and spatial chaos," SIAM J. Appl. Math., vol. 60, no. 3, pp. 891–915, Feb./Mar. 2000.
[41] D. Lind and B. Marcus, An Introduction to Symbolic Dynamics and Coding. Cambridge, U.K.: Cambridge Univ. Press, 1995.

Chih-Hung Chang was born in Taichung, Taiwan, in 1976. He received the bachelor’s and master’s degrees in mathematics from National Tsing Hua University, Hsinchu, Taiwan, in 1998 and 2000, respectively, and the Ph.D. degree in applied mathematics from National Chiao Tung University, Hsinchu, in 2008. He joined the Department of Applied Mathematics, Feng Chia University, Taichung, in 2011, as an Assistant Professor. He has been with the National University of Kaohsiung, Kaohsiung, Taiwan, since 2014. His current research interests include multilayer neural networks, cellular automata, and symbolic dynamical systems.
