
Data-Driven Interval Type-2 Neural Fuzzy System With High Learning Accuracy and Improved Model Interpretability Chia-Feng Juang, Senior Member, IEEE, and Chi-You Chen

Abstract—Current studies of type-2 neural fuzzy systems (NFSs) primarily focus on building a fuzzy model with high accuracy and disregard the interpretability of fuzzy rules. This paper proposes a data-driven interval type-2 (IT2) NFS with improved model interpretability (DIT2NFS-IP). The DIT2NFS-IP uses IT2 fuzzy sets in its antecedent part and intervals in its zero-order Takagi–Sugeno–Kang-type consequent part for rule form simplicity. The initial rule base is generated by a self-splitting clustering algorithm in the input–output space. The DIT2NFS-IP uses a two-phase parameter-learning algorithm to design an accurate model with improved rule interpretability. In the first phase, a new cost function that considers both accuracy and transparent fuzzy set partition is defined. The antecedent and consequent parameters are learned through gradient descent and rule-ordered recursive least squares algorithms, respectively, to achieve cost function minimization. The second phase performs a fuzzy set reduction, followed by consequent parameter learning to improve accuracy. Comparisons with different type-1 and type-2 fuzzy systems (FSs) in five data-based modeling and prediction problems verify the performance of the DIT2NFS-IP in both model accuracy and interpretability.

Index Terms—Fuzzy neural networks (FNNs), interpretable fuzzy systems (FSs), sequence prediction, type-2 FSs.

I. INTRODUCTION

FUZZY SYSTEMS (FSs), in contrast to other models such as neural networks, have the advantage of interpretability because they are composed of fuzzy if–then rules. For neural networks, it can be difficult to understand the meaning of each node and of the weights connecting the different nodes. Interpretability makes it possible to understand the linguistic relationship between the input and output variables of a system through inspection of the rule base. The manual derivation of fuzzy if–then rules from personal experience and expert knowledge can take advantage of this high interpretability, but it is time consuming and can lead to inaccuracies, particularly in complex systems. Instead of relying primarily on expert

Manuscript received January 16, 2012; revised August 8, 2012 and October 17, 2012; accepted November 8, 2012. Date of publication December 19, 2012; date of current version November 18, 2013. This work was supported by the National Science Council, Taiwan, under Grant NSC 100-2628-E-005-005MY2. This paper was recommended by Editor E. C. C. Tsang. The authors are with the Department of Electrical Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected] edu.tw; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2012.2230253

knowledge, researchers have proposed automated approaches for building FSs to address this rule acquisition bottleneck [1]–[3]. An FS designed with neural learning is typically called a neural FS (NFS) or a fuzzy neural network (FNN). The two major considerations in FS design are accuracy and interpretability. The primary focus of early NFS designs was improving fuzzy model accuracy; interpretability was often disregarded. In recent years, fuzzy model interpretability has received more attention. This creates the new challenge of designing an NFS that achieves both accuracy and an interpretable rule base from the data. This challenge motivates the proposal in this paper of a new type-2 NFS for data-driven FS design problems that considers both model accuracy and interpretability. This new type-2 NFS is thus termed a data-driven interval type-2 (IT2) NFS with improved model interpretability (DIT2NFS-IP). Many data-driven type-1 [4]–[15] and IT2 [16]–[22] NFSs have been proposed. IT2 FSs are extensions of type-1 FSs in which the membership value of an IT2 fuzzy set is an interval. Many studies have shown that type-2 FSs outperform their type-1 counterparts in different problems [16]–[23]. One important task in type-2 NFS design is the initial rule parameter assignment. The author in [16] manually assigned the initial antecedent fuzzy set parameters regardless of the distribution of the training data. The type-2 self-organizing NFS (T2SONFS) in [19] used a one-pass clustering algorithm on the input data to initialize the antecedent fuzzy set parameters. In this approach, the initial fuzzy set parameters were determined by the location of a single training sample, which does not consider rule-bounded input–output data distributions. Both the discrete IT2 FS (DIT2FS) [20] and the type-2 fuzzy neural structure [21] used fuzzy c-means to determine the initial antecedent fuzzy set parameters.
The performance of these type-2 FSs depends on the random initialization of the fuzzy c-means. In the DIT2NFS-IP, a variance-based self-splitting clustering (VSSC) algorithm is proposed to properly determine the initial antecedent and consequent parameters of an IT2 fuzzy rule according to the distribution of the input–output data. Proper assignment of the initial rule parameters using the VSSC algorithm improves the subsequent parameter-learning performance, enabling the DIT2NFS-IP to achieve a high level of learning performance with a smaller number of rules than the fuzzy models used for comparison in the simulation examples in Section VI. In addition, the DIT2NFS-IP uses a new parameter-learning approach to improve model interpretability.

2168-2267 © 2012 IEEE


Current studies of type-2 NFSs [16]–[22] (and most type-1 NFSs) only focus on building a fuzzy model with high accuracy; they disregard rule interpretability. The DIT2NFS-IP considers both accuracy and interpretability—its major contribution is the use of a new learning algorithm to improve model interpretability with a minimal loss of accuracy. The following three factors of model interpretability [3], [24], [25] are addressed in the DIT2NFS-IP: 1) fuzzy partition transparency, also defined as low-level fuzzy model interpretability [26] (e.g., fuzzy sets are distinguishable and normal, cover the universe of discourse (UOD), and have a moderate number per variable); 2) simplicity of the rule base (e.g., type of rules); and 3) simplicity of the fuzzy models (e.g., number of rules and inputs). Current NFS studies that consider fuzzy partition transparency focus only on type-1 FSs [27]–[31]. It is difficult to assign a semantic value to each fuzzy set when there is significant overlap. To address this problem, the authors in [27] proposed a semantic constraint on the sum of all membership values (per variable) for input data in order to improve fuzzy set distinguishability. This approach, however, can only restrain highly overlapped fuzzy sets and cannot reduce the fuzzy set number. As a result, all of the original fuzzy sets are preserved, and some regions of the UOD may not be covered by the fuzzy sets. The author in [28] proposed a regularized learning approach to merge similar fuzzy sets (categorized in the same group) during the parameter-learning approach. However, the UOD may not be properly covered, and some of the merged fuzzy sets may still be highly overlapped even though they show low similarity. The authors in [29] applied a similarity measure to the fuzzy sets tuned through a constraint-free parameter-learning process in order to merge fuzzy sets with a high degree of similarity. 
In this approach, when the number of highly similar fuzzy sets is small after the parameter learning, the number of remaining fuzzy sets may be too large for adequate transparency. In contrast to the techniques used to improve type-1 fuzzy partition transparency described previously, the DIT2NFS-IP uses a new parameter-learning algorithm that assures that the type-2 fuzzy sets are distinguishable, small in number, and properly cover the UOD. The DIT2NFS-IP uses a zero-order Takagi–Sugeno–Kang (TSK) type-2 consequent (where the consequent value is an interval) in order to simplify the rule form and meet factor two of the interpretability guidelines stated earlier. Instead of using a grid-type partition, the VSSC locates type-2 fuzzy rules by flexibly partitioning the input space in order to generate a moderate rule base (interpretability factor three) for high accuracy. Parameter learning consists of two phases. In addition to output error reduction, two constraints are imposed in the cost function during the first parameter-learning phase to improve fuzzy set transparency (interpretability factor one). The second phase first merges fuzzy sets for fuzzy set number reduction (interpretability factor one) and then tunes the consequent parameters for further accuracy improvement without degrading transparency. The simulation results show that the DIT2NFS-IP improves interpretability with an acceptable degradation in modeling accuracy. Moreover, due to its impressive learning ability, the DIT2NFS-IP shows better modeling accuracy than most of the type-1 and type-2 fuzzy models used for comparison.

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 43, NO. 6, DECEMBER 2013

Fig. 1. IT2 fuzzy set with an uncertain width, where MV denotes membership value.

The rest of this paper is organized as follows. Section II introduces the DIT2NFS-IP functions. Section III introduces the VSSC algorithm for structure learning and IT2 fuzzy set initialization. Section IV introduces the constraints for improving fuzzy set partition transparency. Section V introduces the two-phase parameter-learning method. Section VI presents the simulation results and comparisons with different type-1 and type-2 FNNs. Section VII discusses the characteristics of the DIT2NFS-IP. Finally, Section VIII offers conclusions.

II. DIT2NFS-IP FUNCTIONS

The DIT2NFS-IP uses zero-order TSK-type rules, and each rule has the following form:

$$\text{Rule } i:\ \text{IF } x_1 \text{ is } \tilde A^i_1 \text{ AND } \ldots \text{ AND } x_n \text{ is } \tilde A^i_n,\ \text{THEN } y \text{ is } \left[w^i_l,\ w^i_r\right],\quad i = 1, \ldots, M \tag{1}$$

where $\tilde A^i_j$, $j = 1, \ldots, n$, are IT2 fuzzy sets, $w^i_l$ and $w^i_r$ are real numbers, the indices $l$ and $r$ denote the left and right limits, respectively, and $M$ is the number of rules. The inputs $x_j$, $j = 1, \ldots, n$, are crisp values and are scaled to fall within the range $[-1, 1]$ or $[0, 1]$. For the $i$th fuzzy set $\tilde A^i_j$ in the input variable $x_j$, a Gaussian membership function (MF) with a fixed mean $m^i_j$ and an uncertain width that takes on values in $[\sigma^i_{j1}, \sigma^i_{j2}]$ is used (see Fig. 1), i.e.,

$$\mu_{ji}(x_j) = \exp\!\left(-\left(\frac{x_j - m^i_j}{\sigma^i_j}\right)^2\right) \equiv N\!\left(m^i_j, \sigma^i_j; x_j\right),\quad \sigma^i_j \in \left[\sigma^i_{j1}, \sigma^i_{j2}\right]. \tag{2}$$

The MF value is an interval $[\underline{\mu}_{ji}, \overline{\mu}_{ji}]$, where

$$\overline{\mu}_{ji}(x_j) = N\!\left(m^i_j, \sigma^i_{j2}; x_j\right),\quad 0.1 \le \sigma^i_{j2} \le 0.8 \tag{3}$$

$$\underline{\mu}_{ji}(x_j) = N\!\left(m^i_j, \sigma^i_{j1}; x_j\right),\quad 0.06 \le \sigma^i_{j1} \le 0.8. \tag{4}$$

The limits on the values of $\sigma^i_{j2}$ and $\sigma^i_{j1}$ exist to avoid generating a fuzzy set with an unreasonably large or small width. The values of the upper and lower limits are determined by observing the distribution of the fuzzy sets in the scaled input domain under the consideration of interpretability. The lower limit of the width $\sigma^i_{j1}$ of the inner fuzzy set is set to 0.06 so that it is smaller than the lower limit ($=0.1$) of the width $\sigma^i_{j2}$ of the outer fuzzy set. Unlike previous type-2 NFSs [17], [19]–[21], the number of fuzzy sets in each input variable may be smaller


than the number of rules due to the use of a fuzzy set reduction operation in the DIT2NFS-IP. The fuzzy meet operation uses an algebraic product function to compute the firing strength $F^i$, which is an interval given as follows:

$$F^i = \left[\underline{f}^i,\ \overline{f}^i\right] \tag{5}$$

where

$$\underline{f}^i = \prod_{j=1}^{n} \underline{\mu}_{ji},\qquad \overline{f}^i = \prod_{j=1}^{n} \overline{\mu}_{ji}. \tag{6}$$

In the type reduction operation, an interval output $[y_l, y_r]$ is computed using a weighted average operation and the Karnik–Mendel (KM) iterative procedure [32]. In that procedure, the consequent parameters are reordered in ascending order. Let $w_l = (w^1_l, \ldots, w^M_l)^T$ and $w_r = (w^1_r, \ldots, w^M_r)^T$ denote the original rule-ordered consequent values, and let $z_l = (z^1_l, \ldots, z^M_l)^T$ and $z_r = (z^1_r, \ldots, z^M_r)^T$ denote the reordered sequences, where $z^1_l \le z^2_l \le \cdots \le z^M_l$ and $z^1_r \le z^2_r \le \cdots \le z^M_r$. According to [32], the relationship between $w_l$, $w_r$, $z_l$, and $z_r$ is

$$z_l = Q_l w_l,\qquad z_r = Q_r w_r \tag{7}$$

where $Q_l$ and $Q_r$ are $M \times M$ permutation matrices. The original rule firing strength orders $\underline{f} = (\underline{f}^1, \underline{f}^2, \ldots, \underline{f}^M)^T$ and $\overline{f} = (\overline{f}^1, \overline{f}^2, \ldots, \overline{f}^M)^T$ are reordered accordingly. To compute the output $y_l$, the new rule orders are expressed as follows:

$$\underline{g}_l = Q_l \underline{f},\qquad \overline{g}_l = Q_l \overline{f}. \tag{8}$$

Similarly, to compute the output $y_r$, the new rule orders are expressed as follows:

$$\underline{g}_r = Q_r \underline{f},\qquad \overline{g}_r = Q_r \overline{f}. \tag{9}$$

The output $y_l$ is computed using the weighted average method

$$y_l = \frac{\sum_{i=1}^{L} \overline{g}^i_l z^i_l + \sum_{i=L+1}^{M} \underline{g}^i_l z^i_l}{\sum_{i=1}^{L} \overline{g}^i_l + \sum_{i=L+1}^{M} \underline{g}^i_l} = \frac{\overline{f}^T Q_l^T E_1^T E_1 Q_l w_l + \underline{f}^T Q_l^T E_2^T E_2 Q_l w_l}{p_l^T Q_l \overline{f} + q_l^T Q_l \underline{f}} \tag{10}$$

where $L$ denotes the left crossover point computed from the KM iterative procedure and

$$p_l = (\underbrace{1, \ldots, 1}_{L}, 0, \ldots, 0)^T \in \mathbb{R}^{M \times 1},\qquad q_l = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{M-L})^T \in \mathbb{R}^{M \times 1} \tag{11}$$

$$E_1 = (e_1, e_2, \ldots, e_L)^T \in \mathbb{R}^{L \times M},\qquad E_2 = (e_{L+1}, e_{L+2}, \ldots, e_M)^T \in \mathbb{R}^{(M-L) \times M} \tag{12}$$

where $e_i \in \mathbb{R}^{M \times 1}$, $i = 1, \ldots, M$, denote the elementary vectors. These $M$ elementary vectors are partitioned disjointly across $E_1$ and $E_2$ in (12) in order to represent the summations in (10) in vector form. Similarly, the output $y_r$ is computed as follows:

$$y_r = \frac{\sum_{i=1}^{R} \underline{g}^i_r z^i_r + \sum_{i=R+1}^{M} \overline{g}^i_r z^i_r}{\sum_{i=1}^{R} \underline{g}^i_r + \sum_{i=R+1}^{M} \overline{g}^i_r} = \frac{\underline{f}^T Q_r^T E_3^T E_3 Q_r w_r + \overline{f}^T Q_r^T E_4^T E_4 Q_r w_r}{p_r^T Q_r \underline{f} + q_r^T Q_r \overline{f}} \tag{13}$$

where $R$ denotes the right crossover point computed from the KM iterative procedure and

$$p_r = (\underbrace{1, \ldots, 1}_{R}, 0, \ldots, 0)^T \in \mathbb{R}^{M \times 1},\qquad q_r = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{M-R})^T \in \mathbb{R}^{M \times 1} \tag{14}$$

$$E_3 = (e_1, e_2, \ldots, e_R)^T \in \mathbb{R}^{R \times M},\qquad E_4 = (e_{R+1}, e_{R+2}, \ldots, e_M)^T \in \mathbb{R}^{(M-R) \times M} \tag{15}$$

where $e_i \in \mathbb{R}^{M \times 1}$. The outputs $y_l$ and $y_r$ in (10) and (13) are expressed in the original rule-ordered format for ease of deriving the parameter-learning algorithm introduced in Section V-B. The defuzzification operation defuzzifies the interval $[y_l, y_r]$ by computing the average of $y_l$ and $y_r$. Hence, the defuzzified output is

$$y = \frac{y_l + y_r}{2}. \tag{16}$$

The desired outputs are normalized within the range of $[-1, 1]$ in the parameter-learning process.
III. VSSC ALGORITHM FOR TYPE-2 RULE INITIALIZATION

In the proposed VSSC algorithm, each cluster corresponds to a rule in the DIT2NFS-IP. Beginning with an empty set, the VSSC algorithm generates IT2 fuzzy rules by performing clustering in the normalized input–output space. A data-based multi-input and single-output learning problem is considered for the sake of clarity. For an input datum $[x_1(t), \ldots, x_n(t)]$ at time step $t$, the corresponding desired output is denoted as $y^d(t+1)$. Let $a(t) = [\ell_1 x_1(t), \ell_2 x_2(t), \ldots, \ell_n x_n(t), \ell_{n+1} y^d(t+1)] \in \mathbb{R}^{n+1}$ denote a weighted input–output training vector. The weights $\ell_1, \ldots, \ell_{n+1}$ give the relative importance of each vector component in the cluster formation. The VSSC algorithm sets $\ell_1 = \cdots = \ell_n = 1$ for the input variables and a higher $\ell_{n+1}$ value for the output variable to emphasize its importance. If the weight $\ell_{n+1}$ is set to a very high value, then $\ell_{n+1} y^d(t+1)$ will dominate the clustering result. Therefore, this paper sets $\ell_{n+1}$ to two. Clustering is performed on the vector $a(t)$. The center of cluster $i$ is denoted by $c_i = [c_{i1}, \ldots, c_{i,n+1}] \in \mathbb{R}^{n+1}$. The number of data



in cluster $i$ is denoted by $N_i$. The sample variance $V^i$ of the samples belonging to cluster $i$ is given by

$$V^i = \frac{1}{N_i} \sum_{a(t) \in \text{cluster } i} \left\| a(t) - c_i \right\|^2. \tag{17}$$

The sample variance $V^i$ is used as the criterion to determine which cluster should be split into two. When the splitting operation is performed on $r$ clusters, the VSSC algorithm finds

$$I = \arg\max_{1 \le i \le r} V^i. \tag{18}$$

Next, cluster $I$ is split into two clusters, resulting in a total of $r+1$ clusters. To assign initial centers to the two newly split clusters, the two input–output data that are nearest to the original cluster $I$, i.e., those with the smallest distance $D(a(t), c_I)$, are found. The distance between a datum and a cluster with center $c_i$ is calculated as follows:

$$D\left(a(t), c_i\right) = \sum_{j=1}^{n+1} \left(c_{ij} - a_j(t)\right)^2. \tag{19}$$

Suppose that the two data are $a(i_1)$ and $a(i_2)$. The initial centers of the original cluster $I$ and the new cluster $r+1$ are assigned as follows:

$$c_I = a(i_1),\qquad c_{r+1} = a(i_2). \tag{20}$$

Equation (20) shows that, for the original cluster $I$ that performs the splitting, its original center $c_I$ is replaced by $a(i_1)$. After splitting, the new centers of these $r+1$ clusters are recomputed by the k-means algorithm [1]. The splitting process ends when a predefined number of clusters (rules) is met. The entire algorithm is summarized as follows.

Begin VSSC Algorithm
Initialization. Set $r = 1$ and find $c_1$.
Step 1. Find the cluster $I$ with the maximum variance $V^I$ using (18).
Step 2. If [the end criterion is met (i.e., a predefined number of clusters is met)] {Stop}
Else {Split cluster $I$ into two new clusters. Set $r \leftarrow r+1$. Assign the initial centers $c_I$ and $c_{r+1}$ of these two clusters using (20).}
Step 3. Find the centers of the existing $r$ clusters using the following k-means algorithm.
{Set Iteration $= 0$. Set $N_i = 1$, $i = 1, \ldots, r$.
Step 3a. {For each input–output datum $a(t)$, find $i^* = \arg\min_{1 \le i \le r} D(a(t), c_i)$ and set $N_{i^*} \leftarrow N_{i^*} + 1$.}
Step 3b. {Recompute $c_{i^*}$ using

$$c_{i^*} \leftarrow c_{i^*} + \frac{1}{N_{i^*}} \left(a(t) - c_{i^*}\right) \tag{21}$$

}
Iteration $\leftarrow$ Iteration $+\,1$.
If (the 2-norm of any cluster center change is larger than a threshold $\Delta\ (= 10^{-5})$) and (Iteration $< 20$) {Return to Step 3a}}
Go to Step 1.
End Algorithm

The number of rules $M$ in the DIT2NFS-IP is equal to the number of clusters $r$ at the end of the VSSC algorithm, where one cluster represents one rule. The initial values of the antecedent and consequent parameters are assigned according to the distribution of the input–output data in each cluster. The initial mean $m^i_j$ of the IT2 MF in (2) is assigned as follows:

$$m^i_j = c_{ij},\quad j = 1, \ldots, n;\quad i = 1, \ldots, M. \tag{22}$$

i i The initial values of the two uncertain widths σj1 and σj2 in (2) are assigned according to the distribution of data in each cluster. i is assigned to be the standard deviation The larger width σj2 i of all the Ni input data in cluster i. The smaller width σj1 is assigned to be standard deviation of the top Ni /2 input data with the shortest distance to the center of cluster i. That is ⎛ ⎞ 1 ⎜ 2⎟ i x(t) = − c i ⎠ σj2 (23) N ⎝ i x(t)∈cluster i

i σj1

⎛ ⎜ 1 ⎜ ⎜ = Ni ⎜ 2 ⎜ ⎝

⎞ x(t)∈cluster i Ni and in the top 2

⎟ ⎟ x(t) − c 2 ⎟ ⎟ i ⎟ ⎠

(24)

where c i = [cii , . . . , cin ] ∈ Rn is the center of cluster i in the input space. The initial center of the consequent interval [wli , wri ] of rule i is set to be the mean of the output data in cluster i, i.e., cin+1 /n+1 . The standard deviation σyi of the output data in cluster i is given by ⎛ ⎞ 1 ⎝ (y d (t+1)−cin+1 /n+1 )2⎠. (25) σyi = Ni d

y (t+1)∈cluster i

Based on the mean and standard deviation of the desired output in cluster i, the initial consequent interval [wli , wri ] of rule i is set as follows: i i i cn+1 ci − σyi , n+1 + σyi . wl , w r = (26) n+1 n+1 The center and length of the initial consequent interval are set according to the distribution of the output data belonging to each rule.
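The splitting-plus-refinement loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `vssc` and `assign` are hypothetical, batch k-means updates replace the incremental rule (21), and the component weighting of the training vectors is omitted.

```python
import numpy as np

def assign(data, centers):
    """Assign each datum to its nearest center (squared Euclidean distance, cf. (19))."""
    d = ((data[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def vssc(data, num_clusters, kmeans_iters=20):
    """Variance-based self-splitting clustering sketch: repeatedly split the
    cluster with the largest sample variance (17)-(18), seed the two children
    with the two members nearest the old center (20), then refine by k-means."""
    centers = [data.mean(axis=0)]                       # start with one cluster
    while len(centers) < num_clusters:
        labels = assign(data, centers)
        # per-cluster sample variance V^i: mean squared distance to the center
        variances = [((data[labels == i] - c) ** 2).sum(axis=1).mean()
                     if np.any(labels == i) else 0.0
                     for i, c in enumerate(centers)]
        I = int(np.argmax(variances))                   # cluster to split, eq. (18)
        members = data[labels == I]
        order = np.argsort(((members - centers[I]) ** 2).sum(axis=1))
        centers[I] = members[order[0]]                  # re-seed split cluster, eq. (20)
        centers.append(members[order[1]])
        for _ in range(kmeans_iters):                   # batch k-means refinement
            labels = assign(data, centers)
            centers = [data[labels == i].mean(axis=0) if np.any(labels == i)
                       else centers[i] for i in range(len(centers))]
    return np.array(centers)
```

On two well-separated point groups, one split followed by the k-means refinement recovers the two group means, mirroring how one cluster becomes one rule in the DIT2NFS-IP.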


Fig. 2. Distributions of IT2 fuzzy sets. (a) Some fuzzy sets are highly overlapped, and part of the UOD is not properly covered. (b) Distinguishable fuzzy sets that properly cover the UOD.

IV. SEMANTIC CONSTRAINTS FOR INTERPRETABILITY IMPROVEMENT

One important factor in FS interpretability is fuzzy set partition transparency; fuzzy sets must be interpretable and understandable by humans. For example, Fig. 2(a) and (b) shows indistinguishable and distinguishable fuzzy sets, respectively. In Fig. 2(a), the fuzzy sets are seriously overlapped, and part of the UOD is not properly covered. This section introduces two constraints imposed on IT2 fuzzy set parameter learning for the purpose of interpretability improvement. The first constraint groups and merges similar IT2 fuzzy sets to reduce the number of fuzzy sets in each input variable. The second ensures that the merged fuzzy sets are distinguishable and properly cover the UOD.

A. Constraint One: Fuzzy Set Grouping Constraint

To judge if two fuzzy sets are highly overlapped, a similarity measure is generally used. Although the literature on similarity measures for type-1 fuzzy sets is quite extensive [33], only a few similarity measures for IT2 fuzzy sets [34] have appeared to date. For two IT2 fuzzy sets $\tilde A^1_j$ and $\tilde A^2_j$, computation of their similarity degree $S(\tilde A^1_j, \tilde A^2_j)$ is much more complex than for type-1 fuzzy sets, particularly for those with primary Gaussian MFs. This paper extends the type-1 distance-based similarity measure and grouping approach in [28] to an IT2 distance-based similarity measure (IT2DSM) and grouping. The proposed IT2DSM for two IT2 fuzzy sets $\tilde A^1_j$ and $\tilde A^2_j$ is given as follows:

$$S\left(\tilde A^1_j, \tilde A^2_j\right) = \frac{1}{1 + d\left(\tilde A^1_j, \tilde A^2_j\right)} \in (0, 1] \tag{27}$$

where the estimated distance $d(\tilde A^1_j, \tilde A^2_j)$ between two IT2 Gaussian MFs in (2) is given by

$$d\left(\tilde A^1_j, \tilde A^2_j\right) = \sqrt{\left(m^1_j - m^2_j\right)^2 + 0.5\left(\sigma^1_{j1} - \sigma^2_{j1}\right)^2 + 0.5\left(\sigma^1_{j2} - \sigma^2_{j2}\right)^2}. \tag{28}$$
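A direct transcription of (27) and (28), together with the greedy reference-based grouping described next in this section, might look like the following sketch. The function names are illustrative only, and each IT2 Gaussian set is represented by its three free parameters $(m, \sigma_{1}, \sigma_{2})$.

```python
import math

def it2_distance(a, b):
    """Distance (28) between two IT2 Gaussian sets given as (m, sigma1, sigma2)."""
    (m1, s11, s12), (m2, s21, s22) = a, b
    return math.sqrt((m1 - m2) ** 2
                     + 0.5 * (s11 - s21) ** 2
                     + 0.5 * (s12 - s22) ** 2)

def it2_similarity(a, b):
    """IT2DSM (27): S = 1/(1 + d), which lies in (0, 1]."""
    return 1.0 / (1.0 + it2_distance(a, b))

def group_fuzzy_sets(sets, s_th=0.8):
    """Greedy grouping: the first member of each group is its reference; a set
    joins the first group whose reference it matches with S >= s_th."""
    groups = []
    for i, fs in enumerate(sets):
        for g in groups:
            if it2_similarity(fs, sets[g[0]]) >= s_th:
                g.append(i)
                break
        else:
            groups.append([i])      # fs becomes the reference of a new group
    return groups
```

Two sets with identical widths and centers 0.05 apart have $d = 0.05$ and $S \approx 0.95$, so they land in one group at the paper's threshold $S_{th} = 0.8$, while a set whose center is a full unit away ($S = 0.5$) starts a new group.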

This distance measure is defined based on the differences of the three free parameters (one center and two widths) between two IT2 fuzzy sets and is an extension of that defined for


type-1 fuzzy sets [28]. Suppose that there are $M$ fuzzy sets in variable $x_j$. The IT2DSM separates these $M$ fuzzy sets into $g$ groups using a similarity degree threshold $S_{th}$ (0.8 in this paper). Let $\tilde A^{k*}_j$ be the reference fuzzy set of group $G_k$. A fuzzy set $\tilde A^i_j$ is categorized into group $G_k$ if $S(\tilde A^i_j, \tilde A^{k*}_j) \ge S_{th}$. Initially, fuzzy set $\tilde A^1_j$ is assigned as the reference fuzzy set of the first group $G_1$. For the other fuzzy sets $\tilde A^i_j$, $2 \le i \le M$, if $S(\tilde A^i_j, \tilde A^{1*}_j) \ge S_{th}$, then $\tilde A^i_j$ belongs to group $G_1$, i.e., $\tilde A^i_j \in G_1$. The first fuzzy set that does not satisfy the similarity criterion is assigned as the reference fuzzy set $\tilde A^{2*}_j$ of group $G_2$. For the remaining fuzzy sets $\tilde A^i_j$, $2 \le i \le M$ and $\tilde A^i_j \notin G_1$, if $S(\tilde A^i_j, \tilde A^{2*}_j) \ge S_{th}$, then $\tilde A^i_j$ belongs to group $G_2$, i.e., $\tilde A^i_j \in G_2$. Again, the first fuzzy set that does not satisfy the similarity criterion is assigned as the reference fuzzy set of group $G_3$. The aforementioned process is repeated until all fuzzy sets are grouped. For example, the fuzzy sets in Fig. 2(a) can be divided into four groups: $G_1 = \{\tilde A^2_j\}$, $G_2 = \{\tilde A^1_j\}$, $G_3 = \{\tilde A^4_j, \tilde A^5_j\}$, and $G_4 = \{\tilde A^3_j, \tilde A^6_j\}$. After grouping, the fuzzy sets in the same group are forced to move as close as possible via parameter learning so that they can be merged for interpretability improvement without deteriorating the modeling accuracy. To this end, the objective $C^{(1)} = 0$ is used in the parameter learning, where

$$C^{(1)} = \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{g} \sum_{\tilde A^i_j \in G_k} \left(m^i_j - \bar m^k_j\right)^2 + \frac{1}{4} \sum_{j=1}^{n} \sum_{k=1}^{g} \sum_{\tilde A^i_j \in G_k} \left(\sigma^i_{j2} - \bar\sigma^k_{j2}\right)^2 + \frac{1}{4} \sum_{j=1}^{n} \sum_{k=1}^{g} \sum_{\tilde A^i_j \in G_k} \left(\sigma^i_{j1} - \bar\sigma^k_{j1}\right)^2 \tag{29}$$

where $\bar m^k_j$, $\bar\sigma^k_{j2}$, and $\bar\sigma^k_{j1}$ are the averages of the centers $m^i_j$ and uncertain widths $\sigma^i_{j2}$ and $\sigma^i_{j1}$, respectively, of the fuzzy sets $\tilde A^i_j$ in the same group $G_k$. Constraint (29) forces the fuzzy sets in the same group to move toward their average during the parameter-learning process. This constraint is used in the cost function of the parameter learning introduced in Section V. This grouping process is repeated at the beginning of each cycle of parameter learning because a tuned fuzzy set may move from one group to another and the number of groups may vary during the learning process.

B. Constraint Two: Distinguishability Constraint

During the grouping process, fuzzy sets in different groups may move too close or too far apart after parameter learning. To improve fuzzy set distinguishability, another constraint is imposed on the summed membership values for an input variable. For example, consider two IT2 fuzzy sets $\tilde A^1_j$ and $\tilde A^2_j$, as shown in Fig. 3. When $\tilde A^1_j$ and $\tilde A^2_j$ are very close to each other and the variable $x_j$ falls between them, the summed membership value will be large; when two fuzzy sets $\tilde A^3_j$ and $\tilde A^4_j$ are very far from each other (as shown in Fig. 3), the summed membership value of an input $x'_j$ falling between them will be very small. Based on this observation, the summed membership values of type-1 fuzzy sets for an input datum are forced to be smaller than a constant value in [27]. However, this constraint forces the fuzzy sets to move apart, preventing a reduction of the total fuzzy set number, and it does not consider proper coverage of the UOD.

Fig. 3. Distributions of fuzzy sets and the membership values for different inputs $x_j$ and $x'_j$.

This paper proposes a new distinguishability constraint that avoids fuzzy sets being moved too close together or too far apart. The grouping constraint forces fuzzy sets to move closer to each other, and the distinguishability constraint forces fuzzy sets to move farther apart, so the two operate in opposition. To address this, the proposed distinguishability constraint only measures the membership value of the reference IT2 fuzzy set in each group. The reference fuzzy sets are used as references for both the similarity and distinguishability measures in the two constraints. For each input variable $x_j$, the sum of the upper membership values of all reference IT2 fuzzy sets $\tilde A^{k*}_j$, $k = 1, \ldots, g$, is computed and is denoted as $M_R(x_j)$, i.e.,

$$M_R(x_j) = \sum_{k=1}^{g} \overline{\mu}_{jk^*}(x_j). \tag{30}$$

The value of $M_R(x_j)$ is related to the distinguishability of the fuzzy sets and is forced to be within the range $[0.5, 1.5]$ so that the groups of fuzzy sets are properly located in the UOD, i.e., neither too far apart (when $M_R(x_j) < 0.5$) nor too close together (when $M_R(x_j) > 1.5$). To this end, the distinguishability constraint is assigned as follows:

$$C^{(2)} = \frac{1}{2} \sum_{j=1}^{n} \varepsilon(x_j) \tag{31}$$

where

$$\varepsilon(x_j) = \begin{cases} 0, & 0.5 \le M_R(x_j) \le 1.5 \\ \left(M_R(x_j) - 0.5\right)^2, & M_R(x_j) < 0.5 \\ \left(M_R(x_j) - 1.5\right)^2, & M_R(x_j) > 1.5. \end{cases} \tag{32}$$

The objective $C^{(2)} = 0$ is used in the cost function of the parameter learning introduced in Section V.

V. DIT2NFS-IP PARAMETER LEARNING

A. Parameter Learning—Phase One

Parameter learning of the DIT2NFS-IP consists of two learning phases. The first phase tunes all of the free parameters in the antecedent and consequent parts by using a new cost function that considers both model accuracy and model interpretability. For example, consider a single-output case; the cost function used in the parameter learning for output error reduction is given as follows:

$$E_{rr}(t) = \frac{1}{2}\left(y(t+1) - y^d(t+1)\right)^2 \tag{33}$$

where $y(t+1)$ and $y^d(t+1)$ denote the DIT2NFS-IP output and the desired output, respectively, for an input datum $[x_1(t), \ldots, x_n(t)]$. Minimization of $E_{rr}$ improves model accuracy without considering model interpretability. A new cost function $E^{Cost}$ that considers both accuracy and interpretability is proposed. The new cost function adds the constraints in (29) and (31) to the cost function in (33) as two additional objectives. The objective of parameter learning is to minimize the new cost function

$$E^{Cost} = E_{rr} + \gamma(k) \cdot C^{(1)} + \beta(k) \cdot C^{(2)} \tag{34}$$

where $0 \le \gamma(k) < 1$ and $0 \le \beta(k) < 1$ are iteration-dependent regulation coefficients that give relative importance to the objectives $C^{(1)}$ and $C^{(2)}$. Equation (34) dynamically combines the three objectives of model accuracy improvement (measured by $E_{rr}$), reduction of the number of fuzzy sets (measured by $C^{(1)}$), and fuzzy set distinguishability improvement (measured by $C^{(2)}$) into a single cost function. This formulation makes it feasible to apply the gradient descent algorithm to tune the free parameters in an IT2 fuzzy model. Minimization of this cost function helps to obtain an accurate fuzzy model while considering model interpretability. For the two regulation coefficients, this paper sets

$$\gamma(k) = \gamma_0 \times \left(1 - k/\mathrm{Max\_iteration}\right),\quad k = 1, \ldots, \mathrm{Max\_iteration} \tag{35}$$

and

$$\beta(k) = \beta_0 \times \left(k/\mathrm{Max\_iteration}\right),\quad k = 1, \ldots, \mathrm{Max\_iteration}. \tag{36}$$
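The schedules in (35) and (36) and the combined cost (34) are simple to compute; the following is a minimal sketch with hypothetical helper names, using the paper's initial values $\gamma_0 = 0.5$ and $\beta_0 = 0.02$ as defaults.

```python
def regulation_coeffs(k, max_iteration, gamma0=0.5, beta0=0.02):
    """Iteration-dependent coefficients (35)-(36): gamma decays linearly
    toward 0 while beta grows linearly toward beta0 over the run."""
    gamma = gamma0 * (1.0 - k / max_iteration)
    beta = beta0 * (k / max_iteration)
    return gamma, beta

def total_cost(err, c1, c2, k, max_iteration):
    """Combined cost (34): E_Cost = Err + gamma(k)*C1 + beta(k)*C2."""
    gamma, beta = regulation_coeffs(k, max_iteration)
    return err + gamma * c1 + beta * c2
```

Early iterations therefore weight the grouping objective $C^{(1)}$ heavily, while by the final iteration $\gamma(k)$ has decayed to zero and $\beta(k)$ has reached $\beta_0$, shifting emphasis to distinguishability.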

Based on the relative experimental values of $E_{rr}$, $C^{(1)}$, and $C^{(2)}$, the initial values of the regulation coefficients are set to $\gamma_0 = 0.5$ and $\beta_0 = 0.02$ for all examples in this paper. The coefficients $\gamma(k)$ and $\beta(k)$ decrease and increase with the iteration number $k$, respectively. In the early phase of learning, neighboring fuzzy sets are forced to form a group to be subsequently merged for fuzzy set number reduction. Therefore, a relatively higher regulation coefficient $\gamma(k)$ is assigned. The distinguishability constraint in (31) is based on the fuzzy set grouping result; therefore, a relatively smaller value of $\beta(k)$ is assigned in the early parameter-learning phase. When the learning converges, the distinguishability of fuzzy sets in different groups becomes more important, and therefore, a larger value of $\beta(k)$ is assigned. The consequent parameters only affect the output error $E_{rr}$ in (33). In other words, tuning the consequent parameters to minimize the cost function $E^{Cost}$ is equivalent to minimizing the error function $E_{rr}$. Therefore, the rule-ordered recursive least squares (RLS) algorithm [19] is applied to tune the consequent parameters. The algorithm keeps the consequent values in their original rule order, despite their alteration and


reordering by the KM algorithm during the parameter-learning process. Equations (10) and (13) can be re-expressed as T

T yl = φ T l wl , φ l =

T T T T f QT l El El Ql +f Ql E2 E2 Ql T pT l Ql f +q l Ql f

∈ 1×M (37)

T

T yr = φ T r wr , φ l =

T T T f T QT r E3 E3 Qr +f Qr E4 E4 Qr T pT r Qr f +q r Qr f

∈ 1×M

1787

variable is equal to the number of groups, and each group is reduced to an IT2 fuzzy set with center mkj and uncertain widths i i σ kj2 and σ kj1 , computed from the average of mij , σj2 , and σj1 , respectively, of the fuzzy sets Aij in the same group Gk . This merging operation will slightly decrease the model accuracy. Therefore, the consequent parameters are retuned using the rule-ordered RLS algorithm with Max_iteration/2 iterations in order to improve the accuracy of the reduced fuzzy model.

(38) respectively. As a result, the output y in (16) can be reexpressed as % T T & wl 1 1 T T φ w l + φ w r = φl φ r y = (yl + yr ) = wr 2 2 l ⎡ r⎤ wl1 ⎢ .. ⎥ ⎢ . ⎥ ⎥ ⎢ % 1 & M 1 M ⎢ wlM ⎥ ⎥ = φ l . . . φ l φ r . . . φr ⎢ ⎢ w1 ⎥ ⎢ r ⎥ ⎢ . ⎥ ⎣ .. ⎦ wM ⎡ w1 ⎤r l ⎢ wr1 ⎥ % 1 1 & M M ⎢ . ⎥

T ⎥ = φ l φr . . . φ l φ r ⎢ ⎢ .. ⎥ = φ w ⎣ M⎦ wl wrM T

(39)

T

T where φl = 0.5φT l and φr = 0.5φr . The components in the consequent parameter vector w are arranged in rule order and are updated by executing the following rule-ordered RLS algorithm:

VI. SIMULATIONS

This section describes five examples of DIT2NFS-IP simulation for verification of its performance in model accuracy and interpretability. To see the influence of the proposed interpretability improvement approach on model accuracy, learning of the DIT2NFS-IP considering only output accuracy, i.e., $E_{\mathrm{Cost}} = Err$ in (34), is applied to the same examples, and this model is denoted as DIT2NFS-AC. In each example, the accuracy of the DIT2NFS-IP is compared to various type-1 and type-2 fuzzy models using zero-order TSK-type or Mamdani-type fuzzy rules. The use of first-order TSK-type fuzzy rules in a fuzzy model could improve model accuracy. However, it is more difficult to interpret the complex consequent parts associated with this type. Therefore, the following comparisons do not consider fuzzy models with first-order TSK-type fuzzy rules.

Example 1 (System Identification With Noise): This example uses the DIT2NFS-IP to identify a nonlinear system with noise. The plant to be identified is governed by the difference equation [19]

$$
\hat{y}(t+1) = \frac{\hat{y}(t)}{1+\hat{y}^2(t)} + u^3(t).
\tag{41}
$$

$$
\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{B}(t+1)\,\boldsymbol{\phi}'(t+1)
$$

B. Parameter Learning—Phase Two

The clean training patterns are generated with $u(t) = \sin(2\pi t/100)$, $t = 1, \ldots, 200$, and $\hat{y}(1) = 0$. It is assumed that each measured $\hat{y}$ value contains noise; the noisy value is denoted by $y_n$. The added noise is artificially generated white noise with a uniform distribution. Simulations with noise generated in the ranges $[-0.1, 0.1]$ and $[-0.5, 0.5]$ are conducted. Noise is added to the original 200 clean training patterns. For network training, the inputs are $y_n(t)$ and $u(t)$, and the output is $y_n(t+1)$. As in [19], the number of rules was set to six in the VSSC algorithm, and the total number of iterations Max_iteration was set to 500. After training, the original 200 clean patterns were used to test the noise resistance ability. Learning accuracy is evaluated using the root-mean-square error (rmse)

$$
\mathrm{rmse} = \sqrt{\frac{1}{200}\sum_{t=1}^{200}\left[y(t+1) - \hat{y}(t+1)\right]^2}.
\tag{42}
$$
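Under the setup above, the training data for this example and the rmse of (42) can be sketched as follows (plain-Python helpers; the function names are ours):

```python
import math
import random

def plant(y, u):
    # difference equation (41): y(t+1) = y(t) / (1 + y(t)^2) + u(t)^3
    return y / (1.0 + y * y) + u ** 3

def make_patterns(noise_amp=0.0, seed=0):
    """Generate the 200 training patterns of Example 1: a clean trajectory
    driven by u(t) = sin(2*pi*t/100) from y(1) = 0, then uniform white
    noise in [-noise_amp, noise_amp] added to each measured output."""
    rng = random.Random(seed)
    u = [math.sin(2.0 * math.pi * t / 100.0) for t in range(1, 201)]
    y = [0.0]
    for t in range(200):
        y.append(plant(y[-1], u[t]))
    yn = [v + rng.uniform(-noise_amp, noise_amp) for v in y]
    # training pattern at step t: inputs (yn(t), u(t)) -> target yn(t+1)
    return [((yn[t], u[t]), yn[t + 1]) for t in range(200)]

def rmse(pred, target):
    # root-mean-square error, eq. (42)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred))
```

Because the uniform noise is bounded by its amplitude, the rmse between the noisy and clean targets never exceeds the chosen noise level.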

After the parameter learning in phase one is completed, groups of fuzzy sets are formed, and fuzzy sets in the same group are almost completely overlapped. The second phase in parameter learning merges those fuzzy sets in the same group in order to reduce their total number and maintain interpretability. After the merging operation, the number of fuzzy sets in each

The rmse between the actual system output y(t + 1) and the clean output yˆ(t + 1) = y d (t + 1) is calculated. There were 20 Monte Carlo realizations of the aforesaid training and test processes, and the rmses of all the realizations were averaged. Table I shows the rmses and the number of rules and fuzzy sets of the DIT2NFS-AC and DIT2NFS-IP. Because the training

$$
\times \left[y^d(t+1) - \boldsymbol{\phi}'^T(t+1)\,\boldsymbol{w}(t)\right].
$$

$$
\boldsymbol{B}(t+1) = \frac{1}{\lambda}\left[\boldsymbol{B}(t) - \frac{\boldsymbol{B}(t)\,\boldsymbol{\phi}'(t+1)\,\boldsymbol{\phi}'^T(t+1)\,\boldsymbol{B}(t)}{\lambda + \boldsymbol{\phi}'^T(t+1)\,\boldsymbol{B}(t)\,\boldsymbol{\phi}'(t+1)}\right]
\tag{40}
$$

where $0 < \lambda \le 1$ is a forgetting factor ($\lambda = 0.9995$ in this paper, as in [18] and [19]) and $\boldsymbol{B} \in \Re^{2M \times 2M}$. The initial value of $\boldsymbol{B}$ is generally set to $\delta \cdot \boldsymbol{I}$, with $\delta$ being a large positive constant [35]. This paper sets $\delta$ to ten. Each time a new training sample is presented, $w_l^i$ and $w_r^i$ are updated using (40). The crossover points $L$ and $R$, and the matrices $\boldsymbol{E}_i$, $i = 1, \ldots, 4$, $\boldsymbol{Q}_l$, and $\boldsymbol{Q}_r$ in (37)–(40), are updated accordingly. The antecedent parameters in the DIT2NFS-IP are tuned by using a gradient descent algorithm. Detailed learning equations can be found in the Appendix.
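A compact sketch of the forgetting-factor RLS recursion (40), with B initialized to δ·I (δ = 10) and λ = 0.9995 as in the text; the plain-list linear algebra and the helper names are ours:

```python
def rls_init(dim, delta=10.0):
    # B(0) = delta * I, with delta a large positive constant
    return [[delta if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def rls_update(w, B, phi, y_d, lam=0.9995):
    """One step of the forgetting-factor RLS recursion (40):
    B <- (1/lam) * (B - B phi phi^T B / (lam + phi^T B phi)),
    w <- w + B_new * phi * (y_d - phi^T w)."""
    n = len(w)
    Bphi = [sum(B[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * Bphi[i] for i in range(n))
    # B is symmetric, so (B phi phi^T B)[i][j] = Bphi[i] * Bphi[j]
    B_new = [[(B[i][j] - Bphi[i] * Bphi[j] / denom) / lam for j in range(n)]
             for i in range(n)]
    err = y_d - sum(phi[i] * w[i] for i in range(n))
    Bnphi = [sum(B_new[i][j] * phi[j] for j in range(n)) for i in range(n)]
    w_new = [w[i] + Bnphi[i] * err for i in range(n)]
    return w_new, B_new
```

Each presented training sample calls `rls_update` once, matching the sample-by-sample consequent update described above.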

1788

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 43, NO. 6, DECEMBER 2013

TABLE I
Performances of DIT2NFS-IP and Various Type-1 and Type-2 Fuzzy Models for Noisy Plant Identification With Different Noise Levels in Example 1

Fig. 4. Distributions of fuzzy sets in the input variables for different fuzzy models in Example 1. (a) DIT2NFS-AC. (b) DIT2NFS-S1. (c) DIT2NFS-IP.

data are different for each of the 20 realizations, the generated number of fuzzy sets in the DIT2NFS-IP varies accordingly. This explains the decimal average number shown in Table I. Table I also shows the performance of the DIT2NFS-IP after stage one of parameter learning (i.e., without the fuzzy set reduction operation); this model is denoted as DIT2NFS-S1. The DIT2NFS-AC shows a smaller rmse because this model only considered learning accuracy. The DIT2NFS-S1 shows an rmse close to that of the DIT2NFS-IP; however, the latter used a smaller number of fuzzy sets. Fig. 4(a)–(c) shows the final fuzzy set distributions of the DIT2NFS-AC, DIT2NFS-S1, and DIT2NFS-IP, respectively, with noise in the range of [−0.5, 0.5]. Fig. 4(a) shows that there were six fuzzy sets in each input variable, with some highly overlapped sets because interpretability was not considered in the DIT2NFS-AC. Fig. 4(b) shows that fuzzy sets in the same group were almost completely overlapped in the DIT2NFS-S1. Fig. 4(c) shows that, in the DIT2NFS-IP, the number of fuzzy sets was reduced from six to five in variable u(t) and from six to three in input variable ŷ(t). The DIT2NFS-IP shows the advantages of distinguishable and reduced fuzzy sets for interpretability improvement without a serious degradation in learning accuracy. For the purpose of comparison, the interval type-2 fuzzy logic system (T2FLS) with the gradient descent algorithm [16] and the T2SONFS [19] were applied to the same problem. These two systems use Mamdani-type rules with a center-of-sets type reduction, where each reduced IT2 Gaussian MF becomes a type-1 interval. As a result, the consequent part in these two systems can be regarded as a zero-order TSK-type consequent, as in the DIT2NFS-IP. The designs of the T2FLS and T2SONFS only consider model accuracy. In this and the following examples, these two type-2 FSs used an identical

Fig. 5. Test result of the chaotic series prediction problem using the DIT2NFS-IP in Example 2.

number of Max_iteration training iterations and rules as in the DIT2NFS. This study also compared the performance of a type-1 FNN, the self-constructing neural fuzzy inference network (SONFIN) [35]. Table I shows the performances of the T2FLS, T2SONFS, and SONFIN. The results show that the DIT2NFS-AC has the smallest average rmse for both noise levels among these fuzzy models. The DIT2NFS-IP shows a smaller rmse than the T2FLS, T2SONFS, and SONFIN. In addition, the DIT2NFS-IP shows the advantage of improved model interpretability and a smaller number of fuzzy sets than the other fuzzy models.

Example 2 (Chaotic Sequence Prediction): This example studies the prediction of the following Mackey–Glass chaotic time series:

$$
\frac{ds(t)}{dt} = \frac{0.2\,s(t-\tau)}{1+s^{10}(t-\tau)} - 0.1\,s(t)
\tag{43}
$$

where τ > 17. As in previous studies [36]–[41], parameter τ is set to be 30, and s(0) = 1.2. Four past values are used to predict

JUANG AND CHEN: IT-2 NFS WITH LEARNING ACCURACY AND MODEL INTERPRETABILITY


TABLE II
Performances of DIT2NFS-IP and Various Type-1 and Type-2 Fuzzy Models for the Chaotic Sequence Prediction Problem in Example 2, Where "NA" Denotes Not Available

Fig. 6. Distributions of fuzzy sets in the input variables for different fuzzy models in Example 2. (a) DIT2NFS-AC. (b) DIT2NFS-S1. (c) DIT2NFS-IP.

s(t), and the input–output data format is [s(t − 24), s(t − 18), s(t − 12), s(t − 6); s(t)]. One thousand patterns were generated from t = 124 to t = 1123, with the first 500 patterns used for training and the last 500 for testing. Training of the DIT2NFS-IP was conducted with six rules and Max_iteration = 1600. Fig. 5 shows the prediction results for the test patterns using the trained DIT2NFS-IP. Table II shows the training and test rmses of the DIT2NFS-IP, DIT2NFS-S1, and DIT2NFS-AC. As in Example 1, the DIT2NFS-S1 shows a smaller rmse than the DIT2NFS-IP; however, the latter used a much smaller number of fuzzy sets. The DIT2NFS-IP improved model interpretability and reduced the total number of fuzzy sets from 24 in the DIT2NFS-AC and DIT2NFS-S1 to only 8. Fig. 6 shows the distributions of fuzzy sets in the DIT2NFS-AC, DIT2NFS-S1, and DIT2NFS-IP. The DIT2NFS-S1 forced the highly overlapped fuzzy sets in the DIT2NFS-AC into the same group, and the DIT2NFS-IP performed the fuzzy set reduction operation to obtain distinguishable fuzzy sets.
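A rough way to generate the series of (43) is first-order Euler integration with the pre-history held at s(0); this is an assumption for illustration only, since the cited studies typically integrate the delay equation more accurately (e.g., with Runge–Kutta). The helper names are ours:

```python
def mackey_glass(n, tau=30, s0=1.2):
    """Euler-integrated Mackey-Glass series (43) with unit time step:
    ds/dt = 0.2 s(t-tau) / (1 + s(t-tau)^10) - 0.1 s(t).
    The history before t = 0 is held constant at s0 (a simplification)."""
    hist = tau
    s = [s0] * (hist + 1)          # times -tau .. 0
    for _ in range(n):
        s_tau = s[-hist - 1]       # delayed value s(t - tau)
        ds = 0.2 * s_tau / (1.0 + s_tau ** 10) - 0.1 * s[-1]
        s.append(s[-1] + ds)
    return s[hist:]                # values s(0) .. s(n)

def embed(s, lags=(24, 18, 12, 6)):
    # input-output pairs [s(t-24), s(t-18), s(t-12), s(t-6); s(t)]
    t0 = max(lags)
    return [([s[t - l] for l in lags], s[t]) for t in range(t0, len(s))]
```

The bounded production and linear decay terms keep the Euler iterate in a physically plausible range, so the embedded patterns are well behaved.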

The performance of the DIT2NFS-IP was compared with some recently developed type-1 fuzzy models with Mamdani-type or zero-order TSK-type consequents. These type-1 fuzzy models included a neuro-fuzzy function approximator (NEFPROX) [36], a fuzzy interpolative reasoning method by Chen and Ko [37], a fuzzy model proposed by Huang and Shen (called the HS method) [38], a self-organizing fuzzy modified least squares network [39], and a weighted fuzzy rule interpolation method based on IT2 Gaussian fuzzy sets (FRI-IT2) [41]. Table II shows the performance of these type-1 and type-2 fuzzy models, indicating that the DIT2NFS-AC achieved the smallest test error among all the fuzzy models when only model accuracy was considered. The DIT2NFS-IP shows a larger test error than the T2SONFS and T2FLS because the former considered both model accuracy and interpretability. Table II shows that only six fuzzy sets were used in the DIT2NFS-IP, in contrast to the 24 sets used in the T2SONFS and T2FLS. Except for the T2SONFS and T2FLS, the test error of the DIT2NFS-IP was smaller than that of the FRI-IT2 and all of the type-1 fuzzy models used for comparison.


TABLE III
Performances of DIT2NFS-IP and Various Type-1 and Type-2 Fuzzy Models for the Stock Data in Example 3

Fig. 7. Distributions of IT2 fuzzy sets in the input variables x1, …, x10 of the DIT2NFS-IP in Example 3.

Example 3 (Stock Data): In this and the following example, the DIT2NFS-IP was applied to two real data sets that were also studied in [20]. The problem addressed in this example is the prediction of the price of a stock based on ten short-, mid-, and long-term financial indicators. There were 50 training and 50 test stock data sets, as in [20]. Training of the DIT2NFS-IP was conducted with six rules and Max_iteration = 800. Table III shows the total number of rules and IT2 fuzzy sets, as well as the training and test rmses, of the DIT2NFS-AC and DIT2NFS-IP. There were six IT2 fuzzy sets in each variable of the DIT2NFS-AC. Fig. 7 shows the distribution of the IT2 fuzzy sets in the DIT2NFS-IP; there were fewer than six IT2 fuzzy sets in each variable, with only one IT2 FS in some variables. For comparison purposes, Table III shows the performances of several type-2 fuzzy models, including the T2FLS, a discrete IT2 FS with a Mizumoto (zero-order TSK) fuzzy rule base (DIT2-MFR) [20], and a discrete IT2 FS with a linguistic (Mamdani-type) fuzzy rule base (DIT2-LFR) [20]. These type-2 fuzzy models were designed to improve model accuracy without considering model interpretability. In the DIT2-MFR and DIT2-LFR, the authors used rule numbers ranging from 2 to 20 and showed the best test results from these trials. Table III shows the best result reported in [20]. In contrast to this exhaustive rule selection approach, the number of rules in the DIT2NFS-AC and DIT2NFS-IP was simply set to six, as in Examples 1 and 2. The results show that the DIT2NFS-AC achieved a smaller test error than the best results of the DIT2-MFR and DIT2-LFR. Table III also shows the reported results of different type-1 fuzzy models from [20], including a zero-order adaptive neural fuzzy inference system [4] and an FS modeling method proposed by Sproule et al. [42]. The results show that the DIT2NFS-AC achieved a smaller test error than these type-1

fuzzy models. Other than the DIT2NFS-AC and DIT2-MFR, the DIT2NFS-IP shows not only a higher test accuracy than the other fuzzy models used for comparison but also improved model interpretability, with distinguishable fuzzy sets and a reduction of the number of fuzzy sets from 60 to 23.

Example 4 (Human Plant Control Data): The problem addressed in this example was to train an FS to imitate a human operator's actions in the control of a polymerization process of monomers, where the operator needed to decide on the set point for the monomer flow rate based on five input variables [42]. There were a total of 70 data sets, of which 50 were randomly selected to train the models and the remaining 20 were used for testing, as in [20] and [42]. In this paper, the data set generation approach was repeated for 50 runs, and the average performance of each fuzzy model was reported. Training of the DIT2NFS-IP was conducted with three rules and Max_iteration = 2000. Table IV shows the total number of rules and fuzzy sets, and the training and test rmses, of the DIT2NFS-IP and the type-1 and type-2 fuzzy models used for comparison in Example 3. The results show that the training error of the DIT2NFS-AC was smaller than that of the DIT2NFS-IP. However, the DIT2NFS-IP shows a smaller test error than the DIT2NFS-AC because the imposed constraints on the fuzzy set distribution may have avoided overtraining. There were three IT2 fuzzy sets in each variable of the DIT2NFS-AC. Fig. 8 shows the distribution of the IT2 fuzzy sets in the DIT2NFS-IP; there was only one IT2 FS in most variables. The DIT2NFS-IP not only shows a smaller test error than the other type-1 and type-2 fuzzy models used for this comparison but also improved model interpretability, with distinguishable fuzzy sets and a reduction of the number of fuzzy sets from 15 to 7.


TABLE IV
Performances of DIT2NFS-IP and Various Type-1 and Type-2 Fuzzy Models for the Human Plant Data in Example 4

Fig. 8. Distributions of IT2 fuzzy sets in the input variables x1, …, x5 of the DIT2NFS-IP in Example 4.

In addition to the original training/test data set generation approach in [20], experiments were conducted with the more widely used fivefold cross-validation approach for training/test data generation. The data set was ordered in a random sequence and evenly divided into five folds, with four folds for training and one for testing. After exchanging the role of each fold, the same training and testing procedure was performed four more times. For the fivefold data set, the average training and test rmses of the T2FLS with three rules (15 fuzzy sets) were 970.3 and 994.4, respectively. The training and test rmses of the DIT2NFS-AC with three rules (15 fuzzy sets) were 51.2 and 250.1, respectively. The training and test rmses of the DIT2NFS-IP with three rules (seven fuzzy sets) were 77.6 and 165.7, respectively. The performance comparison results among the different models on this fivefold data set were similar to those on the original data set.

Example 5 (Concrete Compressive Strength Data): This was a real-world problem concerning the estimation of concrete compressive strength using age and seven ingredients. The estimation model consisted of eight inputs and one output. The actual data set contains 1030 examples and can be downloaded from the University of California, Irvine (http://archive.ics.uci.edu/ml/) or Keel (http://www.keel.es/). The fivefold cross-validation data sets in Keel were used. Training of the DIT2NFS-IP was conducted with four rules and Max_iteration = 500. Table V shows the total number of rules and fuzzy sets, and the training and test rmses, of the DIT2NFS-AC and DIT2NFS-IP. Because the number of fuzzy sets in the DIT2NFS-IP varied with each of the five data sets, the average number is a decimal. Table V shows the performance of the T2FLS and the reported results of various type-1 fuzzy models that were applied to the same problem [43].
These type-1 fuzzy models included a general fuzzy rule interpolative reasoning method [43], a fuzzy

model proposed by Chen and Ko [37], the HS method [38], and weighted fuzzy rule interpolation (WFRI) [40]. Table V shows that the DIT2NFS-AC achieved the smallest test error. The DIT2NFS-IP achieved not only a much smaller test error but also a much smaller number of rules and fuzzy sets than the type-1 fuzzy models used for this comparison. The DIT2NFS-IP improved model interpretability with distinguishable fuzzy sets and reduced the number of fuzzy sets from 32 to an average of 18.6. In other words, there were only two or three fuzzy sets in each of the eight input attributes in the DIT2NFS-IP.
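The fivefold data-generation procedure used in Examples 4 and 5 can be sketched as follows (the helper name is ours):

```python
import random

def five_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 into a random order, divide them evenly into
    five folds, and return the five (train, test) index splits in which
    each fold serves once as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    splits = []
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits
```

Every sample appears in exactly one test fold, so the five test rmses together cover the whole data set once.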

VII. DISCUSSIONS

A. Discussions on the MOEAs and the DIT2NFS-IP

The design of a fuzzy model that considers both model accuracy and interpretability is essentially a multiobjective optimization problem. In this context, the application of multiobjective evolutionary algorithms (MOEAs) to design accurate and interpretable type-1 FSs has been proposed in several studies [24], [25], [44]–[49]. Two opposing objective functions, the maximization of accuracy and the reduction of model complexity (or minimization of the number of rules), are defined in the state-of-the-art MOEA methods [25], [49]. These methods partition the input space into grids and obtain the optimal number of rules through an MOEA. For such methods, the search space increases with the input dimension. Although the Wang and Mendel method has been used to obtain an initial rule base to reduce the search space, a large memory is still needed in the selection process to store the weighting of all possible rules. In contrast to this approach of pruning from a large rule base, the DIT2NFS-IP uses a clustering-based rule growing approach to reduce memory storage. The DIT2NFS-IP does not search for the optimal number of rules. Instead, with a given number


TABLE V
Performances of DIT2NFS-IP and Various Type-1 and Type-2 Fuzzy Models for the Concrete Compressive Strength Data in Example 5

TABLE VI
Performance of DIT2NFS-IP With Different Regulation Coefficient Values in the Cost Function

of rules, the DIT2NFS-IP aims to find an accurate model with improved interpretability at the fuzzy set level. The DIT2NFS-IP uses an adaptive cost function in (34) to improve model interpretability, and therefore, the online gradient descent algorithm is used in the two parameter-learning phases. This approach is different from the offline population-based optimization algorithms that find model parameters using a fixed cost function [50]–[52]. The objective functions $C^{(1)}$ and $C^{(2)}$ in (34) are proposed to reduce the number of fuzzy sets (the granularity of the fuzzy partition) and improve fuzzy set distinguishability, respectively. In [25] and [49], the granularities of fuzzy partitions are not individually defined in a single objective function and are therefore not optimized, even though they are learned. As for the objective of fuzzy set distinguishability, current multiobjective genetic algorithm (MOGA) methods achieve this goal by initially assigning evenly distributed fuzzy sets to an input variable for a given partition granularity. Parameters in an MF are not tuned in the early MOGA methods [24], [44]–[46]. In [25], all of the free parameters in an MF were tuned to improve model accuracy; distinguishability is preserved by assigning complex and dynamic search ranges to each free parameter. Instead of using these complex ranges, the DIT2NFS-IP defines the new objective function $C^{(2)}$ to preserve fuzzy set distinguishability. To reduce the search space and provide fast convergence, the concept of tuning only one lateral displacement per partition to preserve fuzzy set distinguishability in an input variable was proposed in [49]. However, this approach also restricts model accuracy because of the reduced degrees of freedom of the search space. As described in [49], adapting each individual MF while learning the fuzzy model structure using MOGA methods could lead to a complex system and degrade learning performance.
In particular, the type-2 fuzzy model considered in this paper is much more complex than a type-1 fuzzy model, and therefore, the MOGA-based type-1 fuzzy model design methods cannot be directly applied.

B. Regulation Coefficient Selection

The influence of the regulation coefficients γ(k) and β(k) in (34) on the performance of the DIT2NFS-IP is discussed through the chaotic sequence problem (Example 2). The original DIT2NFS-IP with γ0 = 0.5 and β0 = 0.02 is denoted as DIT2NFS-IP(original). Table VI shows the performance of four modified DIT2NFS-IP models with different coefficient values. In the first modified model, i.e., DIT2NFS-IP(I), the coefficient γ0 was set to zero, indicating that neighboring fuzzy sets were not forced to form a group. As a result, the DIT2NFS-IP(I) used a much larger number of fuzzy sets than the DIT2NFS-IP(original) and showed less distinguishability among multiple fuzzy sets (i.e., a larger value of $C^{(2)}$). The DIT2NFS-IP(I) also shows larger training and test errors than the DIT2NFS-IP(original) because the minimization of a larger value of $C^{(2)}$ in the former restricts model accuracy. The DIT2NFS-IP(II) uses a larger value of β0 (= 0.2) than the DIT2NFS-IP(original), and thus, the former shows a smaller value (zero) of $C^{(2)}$ and generates two more fuzzy sets than the latter. The two additional fuzzy sets provide more free parameters for modeling accuracy improvement with no loss of distinguishability. Therefore, the DIT2NFS-IP(II) shows smaller training and test errors than the DIT2NFS-IP(original). A similar result applies to the DIT2NFS-IP(III), whose γ0 and β0 values were assigned to be ten times those of the DIT2NFS-IP(original). The last modified model, the DIT2NFS-IP(IV), uses the fixed coefficients γ(k) = β(k) = 0.5; the other modified models used dynamically changing coefficients. In contrast to the DIT2NFS-IP(original), a higher weighting was assigned to $C^{(2)}$ in the DIT2NFS-IP(IV). The result is that a larger number of fuzzy sets were generated with a smaller value of $C^{(2)}$.
In contrast to the other modified DIT2NFS-IP models, whose β(k) influenced model accuracy primarily at the end of the learning process, the constant β(k) in the DIT2NFS-IP(IV) influenced model accuracy throughout the learning process. Therefore, the training and test


TABLE VII
Performance of DIT2NFS-AC With Different Numbers of Rules in Example 2

errors of the DIT2NFS-IP(IV) are greater than those of the other models in Table VI, although a larger number of fuzzy sets are used. The result verifies the advantage of using dynamically changing coefficients.

C. Discussions on the VSSC Algorithm

The maximum number of iterations in this algorithm is set to 20 if the algorithm does not converge (i.e., Δ > 10^{-5}) within that period. One reason for this setting is to reduce computation time. Another reason is that small changes to the centers of all clusters have little influence on the fuzzy model because the centers are all further tuned using the gradient descent algorithm. The high value of 20 is conservatively selected in this paper. In fact, the VSSC algorithm shows quick convergence and takes fewer than four iterations in all five examples studied in Section VI. The performance of the DIT2NFS-AC with different numbers of rules (i.e., different numbers of clusters in the VSSC algorithm) is studied by taking the chaotic sequence problem as an example. Table VII shows the training and test rmses of the DIT2NFS-AC with different numbers of rules. The results show that the errors tend to decrease as the number of rules increases. However, too many rules may cause overtraining; a saturation phenomenon or a degradation in test performance may occur.

VIII. CONCLUSION

This paper has proposed a new interpretable IT2 NFS, the DIT2NFS-IP, that achieves high learning accuracy and improved interpretability. The proposed VSSC-based rule initialization method and the hybrid parameter-learning algorithm aid in designing an accurate fuzzy model comprising a small rule base of simple zero-order TSK-type rules. These simplicities are important factors in model interpretability. To generate transparent fuzzy set partitions, two new constraints regarding fuzzy set grouping, distinguishability, and proper covering of the UOD have been proposed. The simulation results show that the use of the two constraints in the parameter learning of the DIT2NFS-IP not only reduces the number of fuzzy sets but also improves model interpretability. Simulation results also show that the degradation of training accuracy is acceptable when training the DIT2NFS-IP with the two imposed constraints. Some examples even show that the test accuracy of the DIT2NFS-IP is improved by using interpretable fuzzy rules. Comparisons with different type-1 and type-2 FSs that only consider model accuracy show that the DIT2NFS-IP achieves higher or competitive learning accuracy. Future studies will examine possible applications of the DIT2NFS-IP to other practical data-based learning problems.

APPENDIX

This appendix derives the antecedent parameter-learning equations using a gradient descent algorithm. For convenience in notation of the gradient descent results, (10) can be re-expressed as

$$
y_l = \frac{\bar{\boldsymbol{f}}^T \boldsymbol{Q}_l^T \boldsymbol{E}_1^T \boldsymbol{E}_1 \boldsymbol{Q}_l \boldsymbol{w}_l + \boldsymbol{f}^T \boldsymbol{Q}_l^T \boldsymbol{E}_2^T \boldsymbol{E}_2 \boldsymbol{Q}_l \boldsymbol{w}_l}{\boldsymbol{p}_l^T \boldsymbol{Q}_l \bar{\boldsymbol{f}} + \boldsymbol{q}_l^T \boldsymbol{Q}_l \boldsymbol{f}}
= \frac{\bar{\boldsymbol{f}}^T \boldsymbol{a}_l + \boldsymbol{f}^T \boldsymbol{b}_l}{\bar{\boldsymbol{f}}^T \boldsymbol{c}_l + \boldsymbol{f}^T \boldsymbol{d}_l}
\tag{A1}
$$

where

$$
\boldsymbol{a}_l = \boldsymbol{Q}_l^T \boldsymbol{E}_1^T \boldsymbol{E}_1 \boldsymbol{Q}_l \boldsymbol{w}_l,\quad
\boldsymbol{b}_l = \boldsymbol{Q}_l^T \boldsymbol{E}_2^T \boldsymbol{E}_2 \boldsymbol{Q}_l \boldsymbol{w}_l,\quad
\boldsymbol{c}_l = \boldsymbol{Q}_l^T \boldsymbol{p}_l,\quad
\boldsymbol{d}_l = \boldsymbol{Q}_l^T \boldsymbol{q}_l \in \Re^{M\times 1}.
\tag{A2}
$$

Similarly, (13) can be re-expressed as

$$
y_r = \frac{\boldsymbol{f}^T \boldsymbol{Q}_r^T \boldsymbol{E}_3^T \boldsymbol{E}_3 \boldsymbol{Q}_r \boldsymbol{w}_r + \bar{\boldsymbol{f}}^T \boldsymbol{Q}_r^T \boldsymbol{E}_4^T \boldsymbol{E}_4 \boldsymbol{Q}_r \boldsymbol{w}_r}{\boldsymbol{p}_r^T \boldsymbol{Q}_r \boldsymbol{f} + \boldsymbol{q}_r^T \boldsymbol{Q}_r \bar{\boldsymbol{f}}}
= \frac{\boldsymbol{f}^T \boldsymbol{a}_r + \bar{\boldsymbol{f}}^T \boldsymbol{b}_r}{\boldsymbol{f}^T \boldsymbol{c}_r + \bar{\boldsymbol{f}}^T \boldsymbol{d}_r}
\tag{A3}
$$

where

$$
\boldsymbol{a}_r = \boldsymbol{Q}_r^T \boldsymbol{E}_3^T \boldsymbol{E}_3 \boldsymbol{Q}_r \boldsymbol{w}_r,\quad
\boldsymbol{b}_r = \boldsymbol{Q}_r^T \boldsymbol{E}_4^T \boldsymbol{E}_4 \boldsymbol{Q}_r \boldsymbol{w}_r,\quad
\boldsymbol{c}_r = \boldsymbol{Q}_r^T \boldsymbol{p}_r,\quad
\boldsymbol{d}_r = \boldsymbol{Q}_r^T \boldsymbol{q}_r \in \Re^{M\times 1}.
\tag{A4}
$$

Let $\lambda_j^i$ denote one of the free antecedent parameters in rule $i$ and variable $j$. Using the gradient descent algorithm, we have

$$
\lambda_j^i(t+1) = \lambda_j^i(t) - \eta\,\frac{\partial E_{\mathrm{Cost}}}{\partial \lambda_j^i(t)}
\tag{A5}
$$

$$
\frac{\partial E_{\mathrm{Cost}}}{\partial \lambda_j^i} = \frac{\partial Err}{\partial \lambda_j^i} + \gamma(t)\frac{\partial C^{(1)}}{\partial \lambda_j^i} + \beta(t)\frac{\partial C^{(2)}}{\partial \lambda_j^i}
\tag{A6}
$$

where $\eta$ is the learning constant (this paper sets $\eta = 0.05$). The gradient of each term is derived as follows.

1) Derivative $\partial Err/\partial \lambda_j^i$:

$$
\frac{\partial Err}{\partial \lambda_j^i}
= \frac{\partial Err}{\partial y}\left(\frac{\partial y}{\partial y_l}\frac{\partial y_l}{\partial \lambda_j^i} + \frac{\partial y}{\partial y_r}\frac{\partial y_r}{\partial \lambda_j^i}\right)
= \frac{1}{2}\left(y - y^d\right)\left[\left(\frac{\partial y_l}{\partial f^i} + \frac{\partial y_r}{\partial f^i}\right)\frac{\partial f^i}{\partial \lambda_j^i} + \left(\frac{\partial y_l}{\partial \bar{f}^i} + \frac{\partial y_r}{\partial \bar{f}^i}\right)\frac{\partial \bar{f}^i}{\partial \lambda_j^i}\right]
\tag{A7}
$$

where

$$
\frac{\partial y_l}{\partial f^i} = \frac{b_{l,i} - y_l\, d_{l,i}}{\bar{\boldsymbol{f}}^T \boldsymbol{c}_l + \boldsymbol{f}^T \boldsymbol{d}_l}, \qquad
\frac{\partial y_r}{\partial f^i} = \frac{a_{r,i} - y_r\, c_{r,i}}{\boldsymbol{f}^T \boldsymbol{c}_r + \bar{\boldsymbol{f}}^T \boldsymbol{d}_r}.
\tag{A8}
$$

The parameters $m_j^i$, $\sigma_{j2}^i$, and $\sigma_{j1}^i$ in (3) and (4) are updated according to (A7). If $\lambda_j^i = m_j^i$, then we have

$$
\frac{\partial \bar{f}^i}{\partial m_j^i} = 2 \times \bar{f}^i \times \frac{x_j - m_j^i}{\left(\sigma_{j2}^i\right)^2}, \qquad
\frac{\partial f^i}{\partial m_j^i} = 2 \times f^i \times \frac{x_j - m_j^i}{\left(\sigma_{j1}^i\right)^2}.
\tag{A9}
$$


Similarly, if $\lambda_j^i = \sigma_{j2}^i$, then we have

$$
\frac{\partial \bar{f}^i}{\partial \sigma_{j2}^i} = 2 \times \bar{f}^i \times \frac{\left(x_j - m_j^i\right)^2}{\left(\sigma_{j2}^i\right)^3}, \qquad
\frac{\partial f^i}{\partial \sigma_{j2}^i} = 0.
\tag{A10}
$$

If $\lambda_j^i = \sigma_{j1}^i$, then we have

$$
\frac{\partial \bar{f}^i}{\partial \sigma_{j1}^i} = 0, \qquad
\frac{\partial f^i}{\partial \sigma_{j1}^i} = 2 \times f^i \times \frac{\left(x_j - m_j^i\right)^2}{\left(\sigma_{j1}^i\right)^3}.
\tag{A11}
$$
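The closed-form derivatives (A9)–(A11) can be checked numerically, assuming a Gaussian membership function of the form exp(−((x − m)/σ)²), which is consistent with the factors of 2 above (the function names are ours):

```python
import math

def mu(x, m, sigma):
    # Gaussian membership exp(-((x - m)/sigma)^2), matching (A9)-(A11)
    return math.exp(-((x - m) / sigma) ** 2)

def dmu_dm(x, m, sigma):
    # closed form from (A9): 2 * mu * (x - m) / sigma^2
    return 2.0 * mu(x, m, sigma) * (x - m) / sigma ** 2

def dmu_dsigma(x, m, sigma):
    # closed form from (A10)/(A11): 2 * mu * (x - m)^2 / sigma^3
    return 2.0 * mu(x, m, sigma) * (x - m) ** 2 / sigma ** 3
```

A central finite difference agrees with the closed forms to high precision.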

2) Derivative of $\partial C^{(1)}/\partial \lambda_j^i$: The derived results when $\lambda_j^i = m_j^i$, $\sigma_{j2}^i$, or $\sigma_{j1}^i$ are as follows:

$$
\frac{\partial C^{(1)}}{\partial m_j^i} = m_j^i - \bar{m}_j^k, \quad
\frac{\partial C^{(1)}}{\partial \sigma_{j2}^i} = \frac{1}{2}\left(\sigma_{j2}^i - \bar{\sigma}_{j2}^k\right), \quad
\frac{\partial C^{(1)}}{\partial \sigma_{j1}^i} = \frac{1}{2}\left(\sigma_{j1}^i - \bar{\sigma}_{j1}^k\right).
\tag{A12}
$$

The average center $\bar{m}_j^k$ and uncertain widths $\bar{\sigma}_{j1}^k$ and $\bar{\sigma}_{j2}^k$ of the merged fuzzy set in the group toward which the fuzzy sets move are also adaptively updated using

$$
\frac{\partial C^{(1)}}{\partial \bar{m}_j^k} = -\left(m_j^i - \bar{m}_j^k\right), \quad
\frac{\partial C^{(1)}}{\partial \bar{\sigma}_{j2}^k} = -\frac{1}{2}\left(\sigma_{j2}^i - \bar{\sigma}_{j2}^k\right), \quad
\frac{\partial C^{(1)}}{\partial \bar{\sigma}_{j1}^k} = -\frac{1}{2}\left(\sigma_{j1}^i - \bar{\sigma}_{j1}^k\right).
\tag{A13}
$$

3) Derivative of $\partial C^{(2)}/\partial \lambda_j^i$: If $M_R(x_j) \le 0.5$, then we have

$$
\frac{\partial C^{(2)}}{\partial m_j^{k^*}} = 2 \times \left(M_R(x_j) - 0.5\right) \times \mu_{jk^*}(x_j) \times \frac{x_j - m_j^{k^*}}{\left(\sigma_{j2}^{k^*}\right)^2}, \quad
\frac{\partial C^{(2)}}{\partial \sigma_{j2}^{k^*}} = 2 \times \left(M_R(x_j) - 0.5\right) \times \mu_{jk^*}(x_j) \times \frac{\left(x_j - m_j^{k^*}\right)^2}{\left(\sigma_{j2}^{k^*}\right)^3}.
\tag{A14}
$$

If $M_R(x_j) > 1.5$, then we have

$$
\frac{\partial C^{(2)}}{\partial m_j^{k^*}} = 2 \times \left(M_R(x_j) - 1.5\right) \times \mu_{jk^*}(x_j) \times \frac{x_j - m_j^{k^*}}{\left(\sigma_{j2}^{k^*}\right)^2}, \quad
\frac{\partial C^{(2)}}{\partial \sigma_{j2}^{k^*}} = 2 \times \left(M_R(x_j) - 1.5\right) \times \mu_{jk^*}(x_j) \times \frac{\left(x_j - m_j^{k^*}\right)^2}{\left(\sigma_{j2}^{k^*}\right)^3}.
\tag{A15}
$$

If $0.5 < M_R(x_j) < 1.5$, then we have

$$
\frac{\partial C^{(2)}}{\partial m_j^{k^*}} = 0, \qquad \frac{\partial C^{(2)}}{\partial \sigma_{j2}^{k^*}} = 0.
\tag{A16}
$$
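Putting (A5), (A6), (A12), and (A13) together for a single rule center, one gradient step that pulls a center and its group average toward each other can be sketched as follows (the names are ours; η = 0.05 as in (A5)):

```python
ETA = 0.05  # learning constant eta used in (A5)

def total_gradient(d_err, d_c1, d_c2, gamma, beta):
    # (A6): accuracy gradient plus the weighted interpretability gradients
    return d_err + gamma * d_c1 + beta * d_c2

def c1_center_step(m_i, m_bar_k, gamma):
    """Apply (A5) with only the C1 term active to a rule center m_i and
    its group average m_bar_k, using the gradients (A12) and (A13);
    the two values move toward each other, tightening the group."""
    g_i = m_i - m_bar_k        # (A12)
    g_bar = -(m_i - m_bar_k)   # (A13)
    return m_i - ETA * gamma * g_i, m_bar_k - ETA * gamma * g_bar
```

Repeated steps shrink the gap between a center and its group average, which is what drives the almost complete overlap exploited in phase two.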

REFERENCES

[1] C. T. Lin and C. S. G. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems. Englewood Cliffs, NJ: Prentice-Hall, May 1996.
[2] O. Cordon, F. Herrera, F. Hoffmann, and L. Magdalena, Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. Singapore: World Scientific, 2001.
[3] J. Casillas, O. Cordon, F. Herrera, and L. Magdalena, Interpretability Issues in Fuzzy Modeling. Berlin, Germany: Springer-Verlag, 2003.
[4] J.-S. R. Jang, "ANFIS: Adaptive-network-based fuzzy inference system," IEEE Trans. Syst., Man, Cybern., vol. 23, no. 3, pp. 665–685, May/Jun. 1993.
[5] N. K. Kasabov and Q. Song, "DENFIS: Dynamic evolving neural-fuzzy inference system and its application for time-series prediction," IEEE Trans. Fuzzy Syst., vol. 10, no. 2, pp. 144–154, Apr. 2002.
[6] P. P. Angelov and D. Filev, "An approach to online identification of Takagi–Sugeno fuzzy models," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 484–498, Feb. 2004.
[7] D. Kukolj and E. Levi, "Identification of complex systems based on neural and Takagi–Sugeno fuzzy model," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 272–282, Feb. 2004.
[8] C. F. Juang and C. M. Chang, "Human body posture classification by a neural fuzzy network and home care system application," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 6, pp. 984–994, Nov. 2007.
[9] E. D. Lughofer, "FLEXFIS: A robust incremental learning approach for evolving Takagi–Sugeno fuzzy models," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1393–1410, Dec. 2008.
[10] W. L. Tung and C. Quek, "eFSM—A novel online neural-fuzzy semantic memory model," IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 136–157, Jan. 2010.
[11] C. F. Juang, T. C. Chen, and W. Y. Cheng, "Speedup of implementing fuzzy neural networks with high-dimensional inputs through parallel processing on graphic processing units," IEEE Trans. Fuzzy Syst., vol. 19, no. 4, pp. 717–728, Aug. 2011.
[12] Y. H. Chien, W. Y. Wang, Y. G. Leu, and T. T. Lee, "Robust adaptive controller design for a class of uncertain nonlinear systems using online T–S fuzzy-neural modeling approach," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 2, pp. 542–552, Apr. 2011.
[13] P. Angelov, "Fuzzily connected multimodel systems evolving autonomously from data streams," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 4, pp. 898–910, Aug. 2011.
[14] I. del Campo, K. Basterretxea, J. Echanobe, G. Bosque, and F. Doctor, "A system-on-chip development of a neuro-fuzzy embedded agent for ambient-intelligence environments," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 501–512, Apr. 2012.
[15] A. H. Sonbol, M. S. Fadali, and S. Jafarzadeh, "TSK fuzzy function approximators: Design and accuracy analysis," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 702–712, Jun. 2012.
[16] J. M. Mendel, "Computing derivatives in interval type-2 fuzzy logic systems," IEEE Trans. Fuzzy Syst., vol. 12, no. 1, pp. 84–98, Feb. 2004.
[17] H. Hagras, "Comments on 'Dynamical optimal training for interval type-2 fuzzy neural network (T2FNN)'," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 5, pp. 1206–1209, Oct. 2006.
[18] C. F. Juang and Y. W. Tsao, "A self-evolving interval type-2 fuzzy neural network with on-line structure and parameter learning," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1411–1424, Dec. 2008.
[19] C. F. Juang and Y. W. Tsao, "A type-2 self-organizing neural fuzzy system and its FPGA implementation," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 6, pp. 1537–1548, Dec. 2008.
[20] O. Uncu and İ. B. Türksen, "Discrete interval type-2 fuzzy system models using uncertainty in learning parameters," IEEE Trans. Fuzzy Syst., vol. 15, no. 1, pp. 90–106, Feb. 2007.
[21] R. H. Abiyev and O. Kaynak, "Type 2 fuzzy neural structure for identification and control of time-varying plants," IEEE Trans. Ind. Electron., vol. 57, no. 12, pp. 4147–4159, Dec. 2010.
[22] C. F. Juang, R. B. Huang, and W. Y. Cheng, "An interval type-2 fuzzy neural network with support vector regression for noisy regression problems," IEEE Trans. Fuzzy Syst., vol. 18, no. 4, pp. 686–699, Aug. 2010.
[23] M. A. Khanesar, E. Kayacan, M. Teshnehlab, and O. Kaynak, "Analysis of the noise reduction property of type-2 fuzzy logic systems using a novel type-2 membership function," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 5, pp. 1395–1406, Oct. 2011.
[24] H. Ishibuchi and T. Yamamoto, "Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining," Fuzzy Sets Syst., vol. 141, no. 1, pp. 59–88, Jan. 2004.


[25] P. Pulkkinen and H. Koivisto, "A dynamically constrained multiobjective genetic fuzzy system for regression problems," IEEE Trans. Fuzzy Syst., vol. 18, no. 1, pp. 161–177, Feb. 2010.
[26] S. M. Zhou and J. Q. Gan, "Low-level interpretability and high-level interpretability: A unified view of data-driven interpretable fuzzy system modeling," Fuzzy Sets Syst., vol. 159, no. 23, pp. 3091–3131, Dec. 2008.
[27] J. V. de Oliveira, "Semantic constraints for membership function optimization," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 29, no. 1, pp. 128–138, Jan. 1999.
[28] Y. Jin, "Fuzzy modeling of high-dimensional systems: Complexity reduction and interpretability improvement," IEEE Trans. Fuzzy Syst., vol. 8, no. 2, pp. 212–221, Apr. 2000.
[29] R. Paiva and A. Dourado, "Interpretability and learning in neuro-fuzzy systems," Fuzzy Sets Syst., vol. 147, no. 1, pp. 17–38, Oct. 2004.
[30] R. Mikut, J. Jakel, and L. Groll, "Interpretability issues in data-based learning of fuzzy systems," Fuzzy Sets Syst., vol. 150, no. 2, pp. 179–197, Mar. 2005.
[31] L. Chen, C. L. P. Chen, and W. Pedrycz, "A gradient-descent-based approach for transparent linguistic interface generation in fuzzy models," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 5, pp. 1219–1230, Oct. 2010.
[32] J. M. Mendel, Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Upper Saddle River, NJ: Prentice-Hall, 2001.
[33] V. V. Cross and T. A. Sudkamp, Similarity and Compatibility in Fuzzy Set Theory: Assessment and Application. Heidelberg, Germany: Physica-Verlag, 2002.
[34] D. Wu and J. M. Mendel, "A vector similarity measure for interval type-2 fuzzy sets," in Proc. IEEE Int. Conf. Fuzzy Syst., Jul. 2007, pp. 1–6.
[35] C. F. Juang and C. T. Lin, "An on-line self-constructing neural fuzzy inference network and its applications," IEEE Trans. Fuzzy Syst., vol. 6, no. 1, pp. 12–32, Feb. 1998.
[36] D. Nauck and R. Kruse, "Neuro-fuzzy systems for function approximation," Fuzzy Sets Syst., vol. 101, no. 2, pp. 261–271, Jan. 1999.
[37] S. M. Chen and Y. K. Ko, "Fuzzy interpolative reasoning for sparse fuzzy-rule-based systems based on cutting and transformation techniques," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1626–1648, Dec. 2008.
[38] Z. H. Huang and Q. Shen, "Fuzzy interpolation and extrapolation: A practical approach," IEEE Trans. Fuzzy Syst., vol. 16, no. 1, pp. 13–28, Feb. 2008.
[39] J. D. J. Rubio, "SOFMLS: Online self-organizing fuzzy modified least squares network," IEEE Trans. Fuzzy Syst., vol. 17, no. 6, pp. 1296–1309, Dec. 2009.
[40] S. M. Chen and Y. C. Chang, "Weighted fuzzy rule interpolation based on GA-based weight-learning techniques," IEEE Trans. Fuzzy Syst., vol. 19, no. 4, pp. 729–744, Aug. 2011.
[41] S. M. Chen and Y. C. Chang, "Fuzzy rule interpolation based on interval type-2 Gaussian fuzzy sets and genetic algorithms," in Proc. IEEE Int. Conf. Fuzzy Syst., Taipei, Taiwan, Jun. 2011, pp. 448–454.
[42] B. A. Sproule, M. Bazoon, K. I. Shulman, I. B. Turksen, and C. A. Naranjo, "Fuzzy logic pharmacokinetic modeling: An application to lithium concentration prediction," Clin. Pharmacol. Ther., vol. 62, no. 1, pp. 29–40, Jul. 1997.
[43] P. Baranyi, L. T. Koczy, and T. D. Gedeon, "A generalized concept for fuzzy rule interpolation," IEEE Trans. Fuzzy Syst., vol. 12, no. 6, pp. 820–837, Dec. 2004.
[44] H. Ishibuchi, T. Murata, and I. B. Türksen, "Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems," Fuzzy Sets Syst., vol. 89, no. 2, pp. 135–150, Jul. 1997.
[45] M. Cococcioni, P. Ducange, B. Lazzerini, and F. Marcelloni, "A Pareto-based multi-objective evolutionary approach to the identification of Mamdani fuzzy systems," Soft Comput., vol. 11, no. 11, pp. 1013–1031, Sep. 2007.
[46] H. Ishibuchi and Y. Nojima, "Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning," Int. J. Approx. Reason., vol. 44, no. 1, pp. 4–31, Jan. 2007.

[47] M. Antonelli, P. Ducange, B. Lazzerini, and F. Marcelloni, "Multiobjective evolutionary learning of granularity, membership function parameters and rules of Mamdani fuzzy systems," Evol. Intell., vol. 2, no. 1/2, pp. 21–37, Nov. 2009.
[48] M. J. Gacto, R. Alcala, and F. Herrera, "Integration of an index to preserve the semantic interpretability in the multi-objective evolutionary rule selection and tuning of linguistic fuzzy systems," IEEE Trans. Fuzzy Syst., vol. 18, no. 3, pp. 515–531, Jun. 2010.
[49] R. Alcala, M. J. Gacto, and F. Herrera, "A fast and scalable multiobjective genetic fuzzy system for linguistic fuzzy modeling in high-dimensional regression problems," IEEE Trans. Fuzzy Syst., vol. 19, no. 4, pp. 666–681, Aug. 2011.
[50] C. F. Juang, "Temporal problems solved by dynamic fuzzy network based on genetic algorithm with variable-length chromosomes," Fuzzy Sets Syst., vol. 142, no. 2, pp. 199–219, Mar. 2004.
[51] W. Huang, L. Ding, S. K. Oh, C. W. Jeong, and S. C. Joo, "Identification of fuzzy inference system based on information granulation," KSII Trans. Internet Inf. Syst., vol. 4, no. 4, pp. 575–594, 2010.
[52] W. Huang, S. K. Oh, and S. C. Joo, "Identification of fuzzy inference systems using multiobjective space search algorithm and information granulation," J. Elect. Eng. Technol., vol. 6, no. 6, pp. 853–866, Nov. 2011.

Chia-Feng Juang (M'99–SM'08) received the B.S. and Ph.D. degrees in control engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1993 and 1997, respectively.

Since 2001, he has been with the Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, where he became a Full Professor in 2007 and has been a Distinguished Professor since 2009. He is an Editor of the Journal of Information Science and Engineering and the International Journal of Computational Intelligence in Control. He has authored or coauthored seven book chapters, more than 75 journal papers (including 40 IEEE journal papers), and more than 85 conference papers. Six of his published journal papers have been recognized as highly cited papers according to the Institute for Scientific Information Essential Science Indicators. His current research interests include computational intelligence (CI), field-programmable-gate-array chip design of CI techniques, intelligent control, computer vision, speech signal processing, and evolutionary robots.

Dr. Juang was the recipient of the Youth Automatic Control Engineering Award from the Chinese Automatic Control Society, Taiwan, in 2006 and the Outstanding Youth Award from the Taiwan System Science and Engineering Society, Taiwan, in 2010. He was the Program Cochair of the IEEE International Conference on Industrial Electronics and Applications in 2010 and was the Program Chair of the International Conference on Fuzzy Theory and Its Applications in 2012.

Chi-You Chen received the B.S. degree in electrical engineering from National Chung Hsing University, Taichung, Taiwan, in 2009, where he is currently working toward the M.S. degree. His research interests include type-2 fuzzy systems (FSs), interpretable FSs, and neural fuzzy chips.