A Further Study on Mining DNA Motifs Using Fuzzy Self-Organizing Maps.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 1, JANUARY 2016


Sarwar Tapan and Dianhui Wang, Senior Member, IEEE

Abstract— Self-organizing map (SOM)-based motif mining, despite being a promising approach, mostly fails to offer a consistent interpretation of clusters with respect to the mixed composition of signal and noise in the nodes. The main reason for this shortcoming lies in the similarity metrics used in data assignment: they are specially designed with a biological interpretation for this domain, but they are not meant to consider the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters, which are supposedly noise dominated, and degrades the overall system clarity in motif discovery. This paper aims to improve the explicability of the learning process by introducing a composite similarity function (CSF) that is specially designed for the k-mer-to-cluster similarity measure with respect to the degree of motif properties and embedded noise in the cluster. The proposed motif finding algorithm is built on our previous work, robust elicitation algorithms for discovering (READ) [1], and is termed READ deoxyribonucleic acid (DNA) motifs using CSFs (READcsf). It performs slightly better than READ and shows some remarkable improvements over the SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing data sets. A real data set containing multiple motifs is used to explore the potential of READcsf for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from the JASPAR database demonstrate that our algorithm is promising for discovering multiple motifs simultaneously.

Index Terms— Composite similarity metrics, computational deoxyribonucleic acid (DNA) motif discovery, fuzzy self-organizing maps (FSOMs), robust elicitation algorithms.

Manuscript received November 18, 2014; revised April 7, 2015; accepted May 17, 2015. Date of publication June 9, 2015; date of current version December 17, 2015. The authors are with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3086, Australia (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2015.2435155

I. INTRODUCTION

In continuation of our previous study on fuzzy self-organizing map (FSOM)-based motif discovery [1], this paper addresses a persistent problem in existing SOM-based tools [2]–[5]: they share a critical limitation in addressing the practical fuzzy mixture of signal and noise in the clusters. These tools ignore the presence of noise in the clusters at all clustering states and optimize the clusters based only on their degree of motif properties, despite the known fact that, owing to the specific nature of the problem, nearly every cluster comprises some degree of noise. Ignoring the embedded noise limits the explicability of the noise-dominated clusters that occupy the largest portion of the maps, which is a common problem in any clustering-based approach to this task. The primary reason is the critical limitation of the existing similarity metrics, which are meant to consider the motif properties and ignore the embedded noise in the clusters during k-mer (k-length subsequence) assignment. Improvement in this aspect of SOM-based motif discovery is necessary and has motivated this paper.

This paper extends our robust elicitation algorithms for discovering (READ) framework [1] to address previously unsolved issues in this approach to motif discovery. The technical contributions of this paper include the following.
1) Improving the explicability of clustering algorithms using SOM networks by introducing a new similarity metric that is designed to offer a rational treatment of the embedded noise mixture in clusters.
2) Investigating the learning behavior of SOM networks for the subtle pattern discovery task.

The nature of the problem necessitates describing two challenging properties of the k-mer data set.
1) A considerably low signal-to-noise ratio [6], [7], which causes noise-dominated clusters to largely populate the maps.
2) Natural degeneration caused by evolutionary pressure, owing to which motif (signal) elements (binding sites) often closely resemble noise. This causes the unavoidable presence of some degree of noise in the putative motif clusters.

Thus, an explicable clustering requires both the signal and the noise elements in the clusters (and their characteristics) to be considered jointly and complementarily, possibly through specially designed similarity metrics in the clustering process. This contrasts with the existing similarity metrics, which are mostly designed with a signal-only characterization approach to motif discovery.

The use of biologically inclined similarity metrics, such as the Mismatch Score (MISCORE) [8] and the log-likelihood metric [9], offers an explicable assignment of the putative binding site k-mers to putative clusters (i.e., clusters with a good degree of motif properties) by characterizing functional motif properties in the clusters [8]. Their use in the iterative optimization of clusters aims to consistently improve the degree of motif properties of the clusters irrespective of their dominant signal type. This causes a nontrivial inconsistency throughout the clustering process, since every noise-dominated cluster is persistently pushed toward a better degree of motif properties in a manner that only suits the optimization of putative motif clusters. Thus, applying such similarity metrics limits the interpretation to the putative motif clusters only, as the noise-dominated ones become inexplicable, which is a major drawback in terms of system clarity.

2162-237X © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



This paper proposes a new and adaptive similarity metric, named composite similarity function (CSF), which is designed to address the discrete composition of signal and noise in the clusters during k-mer distribution. The CSF-based similarity quantification between a k-mer and a cluster gives a composite similarity measure with respect to the current signal composition (noise level) of the cluster using two separate but complementary modeling schemes, connected through an adaptive composition weight. The first component of the CSF is our MISCORE [8], a signal modeling scheme with biological interpretation that measures the potential of a k-mer by characterizing several motif properties of a cluster. The second is a newly developed background signal modeling scheme, named B-MISCORE [10], which gives the similarity measure of a motif and its elements to the backgrounds through a large random sampling of the backgrounds (see Section II). Technically, applying CSFs in clustering-based motif discovery offers the following benefits:
1) a consistent interpretation of all clusters in the system;
2) a useful indication of the noise level of each cluster throughout the iterations, offering an effective monitoring of the ongoing clustering process;
3) a means of embedding a discrete optimization of the putative clusters throughout the iterations, potentially increasing the chances of obtaining more putative motif candidates (detailed in Section III-D).

The remainder of this paper is organized as follows. Section II provides some preliminaries used in this paper. Section III details the proposed CSFs. Section IV describes the READ deoxyribonucleic acid (DNA) motifs using CSFs (READcsf) algorithm. Section V reports our experimental results. Section VI concludes this paper.

II. PRELIMINARIES

The positional frequency matrix (PFM)-based motif model, denoted by $M$, is a matrix $M = [f(b_i, j)]_{4 \times k}$, where $b_i \in \chi = \{A, C, G, T\}$ and $j = 1, \ldots, k$, and each entry $f(b_i, j)$ represents the probability of nucleotide $b_i$ at the $j$th position. Similarly, a k-mer $K_s = q_1 q_2 \cdots q_k$ is encoded as a binary matrix $K = [k(b_i, j)]_{4 \times k}$ with $k(q_j, j) = 1$ and $k(b_i, j) = 0$ for $b_i \neq q_j$.

A. Background Modeling Using B-MISCORE

B-MISCORE [10] is a new modeling scheme for evaluating a motif or its elements with respect to the backgrounds. First, a large collection of random sets, denoted by $\zeta = \{G_1, G_2, G_3, \ldots\}$ with $|\zeta| \geq 1000$, is generated, where a random set $G_l$ consists of randomly grouped k-mers from the backgrounds, i.e., $G_l = \{K_1, K_2, K_3, \ldots\}$ with $25 \leq |G_l| \leq 50$. Then, the background probability of each $K \in \mathcal{D}$, where $\mathcal{D}$ is the k-mer data set produced from the input sequences, is computed using a first-order Markov chain transition matrix $\beta = [\pi(a, a')]_{4 \times 4}$ as

$$P(K, M_B) = p(b_1) \prod_{\forall (a, a')} \pi(a, a')^{k(a, a')} \tag{1}$$

where $k(a, a')$ gives the count of dinucleotide $aa'$ in $K$ and $p(b_1)$ is the independent background probability of the nucleotide appearing at the first position in $K$. The background probabilities of the k-mers are globally normalized over the k-mer data set $\mathcal{D}$ as

$$P_n(K) = \frac{P(K) - \min_{\forall K \in \mathcal{D}}\{P(K)\}}{\max_{\forall K \in \mathcal{D}}\{P(K)\} - \min_{\forall K \in \mathcal{D}}\{P(K)\}}. \tag{2}$$

The background score of $K_i \in \mathcal{D}$ can then be measured using a random k-mer set, namely $G_l$, as

$$d_B(K_i, G_l) = \frac{1}{|G_l|} \sum_{\forall K_p \in G_l} P_n(K_p)\, d(K_i, K_p) \tag{3}$$

where $d(K_i, K_p)$ is the Hamming distance [8] between two k-mers. It can be deduced from (3) that $d_B(K_i, G_l)$ is a weighted measure of $K_i$ being a background class element with respect to $\forall K_p \in G_l$, where $d(K_i, K_p)$ applies the weight to the contribution of each $K_p$ ($\forall K_p \in G_l$) in evaluating the similarity of $K_i$ to the backgrounds. Then, the large collection of random sets $\zeta$ is used to obtain a discriminative background score of $K_i$ (B-MISCORE), denoted by $r_b(K_i)$, as

$$r_b(K_i) = \min_{\forall G_l \in \zeta}\{d_B(K_i, G_l)\} \tag{4}$$

where a smaller $r_b(K_i)$ score represents a higher chance of $K_i$ being a background class element, and vice versa. For a given set $S$ of k-mers, the B-MISCORE-based model score (BMMS) can be written as

$$R_b(S) = \frac{1}{|S|} \sum_{\forall K \in M_S} r_b(K) \tag{5}$$

where a larger $R_b(\ast)$ score represents a higher potential of the model being a putative motif, and vice versa.

B. Fuzzy-SOM Batch Learning

Let $\mathcal{D}$ be the set of all binary encoded k-mers from the input sequences and let $N$ represent the number of nodes in the FSOM network, where the $j$th node has a 2-D grid coordinate $z_j = [z_{j1}, z_{j2}]$ and a node PFM $M_j$. Then the batch update rule of FSOM can be written from [1] as

$$M_j(t+1) = \frac{\sum_{i=1}^{|\mathcal{D}|} \sum_{k=1}^{N} \mu_{ki}^{m}(t)\, h_{jk}(t)\, K_i}{\sum_{i=1}^{|\mathcal{D}|} \sum_{k=1}^{N} \mu_{ki}^{m}(t)\, h_{jk}(t)} \tag{6}$$

where $\mu_{ki}(t)$ is the fuzzy membership [11] of $K_i$ to $M_k(t)$ that can be computed as

$$\mu_{ki}(t+1) = \left[ \sum_{l=1}^{N} \left( \frac{\Phi(K_i, M_k(t))}{\Phi(K_i, M_l(t))} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{7}$$

where $\Phi$ is a similarity metric and the exponent $m > 1$ controls the amount of fuzziness in $\mu$. In (6), $h_{jk}(t)$ is the following neighborhood function:

$$h_{jk}(t) = \exp\left( -\frac{\| z_j - z_k \|^2}{2 \sigma(t)^2} \right) \tag{8}$$

where the neighborhood range $\sigma(t)$ can be monotonically shrunk using the criterion mentioned in [12] as

$$\sigma(t+1) = \sigma(t_0) \exp\left( -2 \sigma(t_0) \frac{t}{t_{\max}} \right) \tag{9}$$

where $\sigma(t_0)$ is a fairly large initial $\sigma$ and $t_{\max}$ is the maximum epoch set by the user.
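As an illustration of the background modeling in Section II-A, the following minimal sketch walks through (1)-(5) on plain strings. It is not the authors' implementation: the function names are hypothetical, the Hamming distance is assumed to be normalized by the k-mer length so that scores stay in [0, 1], the weight in (3) is read as the normalized probability of each background k-mer, and every background k-mer is assumed to have a normalized probability available in the `pn` dictionary.

```python
def markov_prob(kmer, p0, trans):
    """Eq. (1): first-order Markov background probability of a k-mer.

    p0[a] is the independent probability of nucleotide a at position 1;
    trans[a][b] is the transition probability from a to b.
    """
    prob = p0[kmer[0]]
    for a, b in zip(kmer, kmer[1:]):
        prob *= trans[a][b]  # one factor per dinucleotide occurrence
    return prob


def normalize(probs):
    """Eq. (2): global min-max normalization over the k-mer data set."""
    lo, hi = min(probs.values()), max(probs.values())
    span = hi - lo
    # Assumption: if all probabilities are equal, map everything to 0.
    return {k: (p - lo) / span if span > 0 else 0.0 for k, p in probs.items()}


def hamming(a, b):
    """Hamming distance between equal-length k-mers, scaled to [0, 1]."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def d_b(kmer, group, pn):
    """Eq. (3): weighted background score against one random set."""
    return sum(pn[kp] * hamming(kmer, kp) for kp in group) / len(group)


def b_miscore(kmer, groups, pn):
    """Eq. (4): smallest background score over all random sets."""
    return min(d_b(kmer, g, pn) for g in groups)


def bmms(kmers, groups, pn):
    """Eq. (5): mean B-MISCORE of a set of k-mers."""
    return sum(b_miscore(k, groups, pn) for k in kmers) / len(kmers)
```

In practice the collection `groups` would hold at least 1000 random sets of 25 to 50 background k-mers each, as stated above; the sketch leaves the sampling itself to the caller.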


III. COMPOSITE SIMILARITY FUNCTIONS

The CSF associated with a given $j$th node at the $t$th iteration can be written as

$$\Phi_j(K_i, M_j(t)) = \lambda_j(t)\, r(K_i, M_j(t)) + (1 - \lambda_j(t))\,[1 - r_b(K_i)] \tag{10}$$

where $r(K_i, M_j(t))$ is the MISCORE-based similarity [8] between $K_i$ and $M_j(t)$ with respect to the motif properties in the $j$th cluster and $r_b(K_i)$ is the B-MISCORE-based similarity measure of $K_i$ to the backgrounds. In (10), $\lambda_j(t)$, with $0 < \lambda_j(t) < 1$, is the (adaptive) composition weight that reflects the current noise level in the node: a higher value represents signal domination over noise in the $j$th cluster and sensibly assigns a higher weight to the $r(K_i, M_j(t))$ score in the CSF-based similarity measure, and vice versa.

B-MISCORE in (4) gives the similarity of $K_i$ to the backgrounds, where a smaller score indicates a higher similarity, that is, a higher chance of $K_i$ being categorized as noise in the data set. In the CSF-based similarity measure, the complement of the B-MISCORE score, $[1 - r_b(K_i)]$, is taken to express the dissimilarity of $K_i$ to the backgrounds (a smaller value indicating a greater dissimilarity). This makes it complementary to MISCORE in measuring the merit of the k-mers, where $r(K_i, M_j(t))$ gives how similar $K_i$ is to $M_j(t)$ with respect to the embedded motif properties in $M_j(t)$ and $[1 - r_b(K_i)]$ gives how dissimilar $K_i$ is from the backgrounds. This follows the sensible understanding that a higher dissimilarity of a k-mer from the backgrounds and a higher similarity of the k-mer to a given cluster with a good degree of motif properties jointly indicate a higher potential of the k-mer to be a putative motif element. In (10), a smaller $\Phi_j(K_i, M_j(t))$ score gives a higher composite similarity between $K_i$ and $M_j(t)$, and vice versa.

A. Adaptation

The adaptation of the composition weights is functionally required to reflect the changes in the noise level of each node throughout the clustering iterations. The composition weights are updated at the end of each iteration as

$$\lambda_j(t+1) = \frac{\sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)\, r_b(K_i)}{\sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)\,[r(K_i, M_j(t)) + r_b(K_i)]} \tag{11}$$

where $\mathcal{D}$ is the k-mer data set and $\mu_{ji}(t)$ is the fuzzy membership of the $i$th k-mer to the $j$th node at the $t$th iteration that can be computed using the CSFs as

$$\mu_{ji}(t) = \left[ \sum_{l=1}^{N} \left( \frac{\Phi_j(K_i, M_j(t))}{\Phi_l(K_i, M_l(t))} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{12}$$

where $\lambda_j(t)$ reflects the current noise level in the $j$th node and $N$ is the number of nodes in the network.

Equation (11) can be rewritten by dividing the terms by $\Delta_j = \sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)$ as

$$\lambda_j(t+1) = \frac{\Delta_j^{-1} \sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)\, r_b(K_i)}{\Delta_j^{-1} \sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)\, r(K_i, M_j(t)) + \Delta_j^{-1} \sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)\, r_b(K_i)} = \frac{R_b^f(M_j(t))}{R^f(M_j(t)) + R_b^f(M_j(t))} \tag{13}$$

where $R^f(M_j(t))$ is the fuzzy extension of the MISCORE-based motif score (MMS), previously described in [1], for quantifying the motif properties of a given fuzzy cluster, and $R_b^f(M_j(t))$ gives the background similarity score of the fuzzy cluster, which can be rationally interpreted as the current degree of noise in the cluster, implying $\lambda_j(t)$ as the current noise level of the $j$th cluster at the $t$th iteration. Note that (11) yields $\lambda_j(t+1) > \lambda_j(t)$ when the motif characteristics of the $j$th node increase and, consequently, its noise level decreases in terms of its dissimilarity to the backgrounds, while $\lambda_j(t+1) < \lambda_j(t)$ is caused by the opposite.

The adaptation process enables $\lambda_j(t)$ to reveal a relative measure of the signal and noise composition in the $j$th cluster (for $1 \leq j \leq N$) at any clustering cycle $t$. During the initialization stage of training, each cluster usually demonstrates a high presence of noise, and gradually some of the clusters improve in terms of motif modeling. The value of $\lambda$ is associated with the degree of mixture of signals and noise in each node (cluster), which can be useful in discriminating potential motif models from the random ones. Note that this value is only a relative indicator rather than a physical quantification of the noise level in the cluster. The implementation of CSFs in FSOM is conceptualized in Fig. 1.

Fig. 1. Conceptualization of CSF-based clustering of k-mers in FSOM network for DNA motif discovery, where nodes are illustrated with discrete signal and noise composition.

Remark 1: In typical SOM-based algorithms for DNA motif discovery, CSFs can be applied to find the best matching unit, denoted by $c_i(t)$, for a given $K_i$ at the $t$th iteration as $c_i(t) = \arg\min_{l}\{\Phi_l(K_i, M_l(t))\}$. The adaptation given in (11) can then be simplified for a crisp set of k-mers as

$$\lambda_j(t+1) = \frac{\sum_{\forall K \in V_j(t)} r_b(K)}{\sum_{\forall K \in V_j(t)} [r(K, M_j(t)) + r_b(K)]}$$

where $V_j(t)$ is the $j$th crisp cluster produced at the $t$th iteration.

B. Demonstration on Signal Discrimination

The demonstration uses a 10 × 10 FSOM network trained on a k-mer ($k = 12$) data set generated from a set of promoter sequences of coregulated genes that contain a known motif of the CREB [13] transcription factor. The objective is to visualize the effectiveness of CSF adaptation in discriminating the putative (signal dominated) nodes from the nonputative (noise) nodes with respect to the following three modes of the composition parameter ($\lambda$):
1) $\lambda_{\text{Adaptive}}$: adaptive composition as shown in (10);
2) $\lambda_{0.5}$: equal weight composition in (10), yielding $\Phi_j(K_i, M_j(t)) = 0.5 \times r(K_i, M_j(t)) + 0.5 \times [1 - r_b(K_i)]$;
3) $\lambda_{1.0}$: B-MISCORE omitted composition that rewrites (10) as $\Phi_j(K_i, M_j(t)) = r(K_i, M_j(t))$.

We applied the z-score to statistically measure the relative degree of discrimination between a set of signal nodes ($S_q$) and a set of noise nodes ($N_q$) as

$$Z(S_q, N_q) = \frac{E\{Q(S_q)\} - E\{Q(N_q)\}}{\mathrm{std}\{Q(N_q)\} \times C} \tag{14}$$

where $E\{\ast\}$ is the expectation on $q$ models, $Q(\ast)$ is the respective model quantification for the adaptation modes, and $C = 3$ is a scaling constant for visualization. In typical SOM-based motif discovery, a limited number of top scoring models are extracted as putative signal nodes from a trained map. Similarly, in this demonstration, nodes are evaluated at each iteration, and the top $q$ and the bottom $q$ scoring nodes are categorized as the putative signal and the noise nodes, respectively, for the $Z(S_q, N_q)$ score computation. The network is given the same initial state for each of the modes of adaptation during each run. This is repeated separately for $q \in \{5, 10, 15, 20\}$, and a 10-run average is presented in Fig. 2.

Fig. 2. Discrimination of signal nodes from the noise nodes by different modes of adaptation in CSFs using CREB [13] transcription factor data set.

Observations: This visualization shows that the adaptive mode of the composition ($\lambda_{\text{Adaptive}}$), which is functionally required in CSFs, offers a better discrimination of signal nodes than the other two modes considered. This demonstrates the usefulness of combining background referencing by B-MISCORE with the adaptive composition used in CSFs. Fig. 2 also shows a rational decrease in the degree of discrimination as $q$ increases, which agrees with the distribution of signals in the nodes of the maps. Similar results were observed on other data sets in our unreported experiments.

C. Demonstration on CSF Initialization

In another experiment, we observe the impact of two initialization processes of the CSF weights on the signal-discrimination ability: 1) random value initialization and 2) same value initialization. The experiment was set to separate $n$ putative signal nodes, which carry a comparatively lower degree of noise, from the rest of the nodes, which are mostly noise dominated, based on their respective noise-level indication. This allowed comparing the CSF adaptation for these two types of nodes with supposedly opposite signal characteristics in the map throughout the iterations. In the implementation, $n = 10$ nodes were first separated as signal nodes during each iteration using their respective noise-level indicator, while the rest of the nodes were categorized as noise nodes in the network. Then, for each initialization, mean{∗} and std{∗} of the CSF weights [$\lambda(t)$] of these two types of clusters were plotted iteration-wise in Fig. 3.

Observations: This visualization shows that CSF adaptation is capable of effectively discriminating the putative nodes from the noise-dominated ones in the map by assigning a comparatively higher $\lambda(t)$ value to the signal nodes throughout the major portion of the training, regardless of the initialization applied. That is, the initialization of the CSF weights has a negligible impact on the adaptation, which supports its algorithmic robustness. Hence, a random initialization of the weights can be conveniently applied, as used in the experiments in this paper.

D. Benefits

This section describes the key benefits of the proposed CSF-based similarity measure, in contrast to the use of traditional similarity metrics in SOM-based (clustering-based) motif discovery, as follows.
1) The proposed CSFs address a major limitation of the state-of-the-art clustering approaches, which inconsistently apply the same (analogous) optimization to clusters with opposite signal characteristics, causing inexplicable noise clusters to largely populate the maps. CSFs resolve this inconsistency by refining the clusters to become more motif like, depending on their signal composition. In other words, CSFs ensure that the degree of optimization applied to each cluster is directly related to the present degree of motif properties and embedded noise in that cluster during each iteration, offering a consistent interpretation of every cluster in the maps and, as a result, an improved system clarity.
2) The proposed CSF adaptation reveals the current noise level in each cluster (node) throughout the iterations, which enables the monitoring of the ongoing clustering process. In this aspect, the proposed similarity function is more useful than the traditional similarity metrics, which are not meant to: a) reveal the degree of embedded noise in the clusters at any clustering iteration and b) enable the monitoring of the quality of the ongoing clustering.
3) CSFs follow the argument that if a putative binding site k-mer has a similar match to multiple putative clusters, then it is useful to consider their individual merit in terms of the present degree of motif properties and embedded noise in the clusters for a more appropriate assignment of the k-mer. In this manner, CSFs enable embedding a discrete optimization of the putative clusters throughout the iterations, which potentially increases the chances of producing more putative motif candidates in the maps. In contrast, a discrete optimization of the putative clusters in the state-of-the-art clustering-based approaches can only be applied as a separate process after the post-training clusters are evaluated by external motif scoring metrics.

E. FSOM Learning Using CSFs

The update equation of FSOM learning in READcsf is given in (6), which is the same as that used in READ [1]. However, the CSF implementation distinguishes the k-mer distribution through the fuzzy membership computation in READcsf as

$$\mu_{li}(t) = \left[ \sum_{q=1}^{N} \left( \frac{\Phi_l(K_i, M_l(t))}{\Phi_q(K_i, M_q(t))} \right)^{\frac{2}{m-1}} \right]^{-1}. \tag{15}$$
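The interplay of the CSF in (10), the membership computation in (15), and the weight adaptation in (11) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the READcsf code: the function names are hypothetical, the MISCORE values `r` and B-MISCORE values `rb` are taken as precomputed inputs in (0, 1), and all CSF scores are assumed to be strictly positive so that the ratios in (15) are defined.

```python
def csf(lam, r, rb):
    """Eq. (10): composite similarity; a smaller value means more similar."""
    return lam * r + (1.0 - lam) * (1.0 - rb)


def memberships(phi, m):
    """Eq. (15): fuzzy memberships from CSF scores.

    phi[l][i] is the CSF score of k-mer i at node l (assumed > 0).
    Returns mu with mu[l][i] the membership of k-mer i to node l.
    """
    n_nodes, n_kmers = len(phi), len(phi[0])
    power = 2.0 / (m - 1.0)  # m > 1 controls the fuzziness
    mu = [[0.0] * n_kmers for _ in range(n_nodes)]
    for i in range(n_kmers):
        for l in range(n_nodes):
            total = sum((phi[l][i] / phi[q][i]) ** power
                        for q in range(n_nodes))
            mu[l][i] = 1.0 / total
    return mu


def adapt_lambda(mu_j, r_j, rb):
    """Eq. (11): updated composition weight of one node.

    mu_j[i], r_j[i], rb[i] are the membership, MISCORE, and B-MISCORE
    of k-mer i with respect to node j.
    """
    num = sum(u * b for u, b in zip(mu_j, rb))
    den = sum(u * (a + b) for u, a, b in zip(mu_j, r_j, rb))
    return num / den
```

Note that for an exponent close to 1 (such as m = 1.025, giving a power of 80) the ratios in `memberships` can exceed the floating-point range when the scores differ by several orders of magnitude, so a production implementation would likely work in log space; the sketch ignores this for clarity.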

The classical FCM objective [11] can then be written as

$$J_m(t) = \sum_{i=1}^{|\mathcal{D}|} \sum_{j=1}^{N} \mu_{ji}^{m}(t)\, \Phi_j(K_i, M_j(t)) \tag{16}$$

where $\Phi(\ast, \ast)$, $\mathcal{D}$, $N$, and $m$ take the denotations previously described (the similarity function, the k-mer data set, the number of nodes, and the fuzziness exponent, respectively). The decrease in $J_m(t)$ can be monitored along with the increase of the performance coefficient $pc(\mu(t)) = (1/|\mathcal{D}|) \sum_{j=1}^{N} \sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t)^2$ to stop FSOM training in READcsf when the neighborhood range is sufficiently shrunk [1].

IV. READcsf: IMPROVED READ WITH CSFs

The CSF-based similarity measure is implemented in the FSOMs of our READ system [1] for motif discovery for two reasons: 1) to demonstrate the usefulness of CSFs in clustering-based motif discovery and 2) to obtain a new motif mining tool (READcsf) that benefits from the synergy between: a) the FSOM-based soft clustering that addresses the underlying fuzziness in the data sets and b) the CSF-enabled treatment of the fuzzy signal and noise composition of the clusters. This section describes the READcsf algorithm (see its pseudocode in Algorithm 1), emphasizing: 1) the training of multiple FSOMs using the CSF-based similarity measure and 2) the CSF-based motif scoring functions for node calibration. For the technical description of the components common to READ and READcsf, the reader is referred to [1].

A. Overview

The 2-D output grid of READcsf is a lattice of $N = R \times C$ nodes, where $R$ and $C$ are the number of rows and columns, respectively. The $j$th node, $j = 1, 2, \ldots, R \times C$, has a 2-D coordinate $z_j = [z_{j1}, z_{j2}]$ in the lattice. The $j$th node is initialized with: 1) a randomly generated PFM $M_j(t_0)$ and 2) a randomly initialized composition weight $0 < \lambda_j(t_0) < 1$. The learning steps at the $t$th iteration are then as follows.
1) Membership Computation: Compute the fuzzy membership of each k-mer to every node using CSFs.
2) Prototype Updating: Update the node prototypes using the fuzzy membership distribution of k-mers and the grid-based neighborhood cooperation between the nodes.
3) CSF Adaptation: Update the composition weight $\lambda_j(t)$ based on the current noise level of the $j$th node.

Post-training nodes in the map are evaluated and ranked using the proposed CSF-based motif scoring metric given in (17). Multiple FSOMs are trained for variable k-mer lengths ($k_{\min} \leq k \leq k_{\max}$) because the length of the motif elements is unknown. The user-defined top $T$ candidates are then returned as the final motifs through an open competition between the candidates with different consensus lengths (k-mer lengths) extracted from multiple FSOMs. An overview of the READcsf algorithm is presented in Fig. 4, illustrating the parallel training of multiple FSOMs.

Fig. 3. Demonstration on the effects of different initialization values of CSF weights. (a) Random initialization of the weights. (b) Same initialization value of the weights, i.e., $\forall \lambda_j(t_0) = 0.5$.

Fig. 4. READcsf algorithm overview.

B. Candidate Evaluation

The post-training nodes need calibration to identify the putative candidates in the maps. The CSF-based similarity measure gives a new motif scoring function that can be written as

$$Q(M_j(t')) = \frac{\sum_{i=1}^{|\mathcal{D}|} \Phi_j(K_i, M_j(t'))\, \mu_{ji}(t')}{\sum_{i=1}^{|\mathcal{D}|} \mu_{ji}(t')} \tag{17}$$

where $t'$ ($t' \leq t_{\max}$) is the final training iteration, $\lambda_j(t')$ is the final noise level of the $j$th node, and $\mu_{ji}(t')$ is the final fuzzy membership between $K_i$ and $M_j(t')$. By applying the CSF description given in (10) and by algebraic derivation, (17) can be rewritten as

$$Q(M_j(t')) = \lambda_j(t')\, R^f(M_j(t')) + [1 - \lambda_j(t')]\,\bigl[1 - R_b^f(M_j(t'))\bigr] \tag{18}$$

where $R^f(\ast)$ is the fuzzy-MMS metric [1] for quantifying the degree of motif properties and $[1 - R_b^f(\ast)]$ is the complement of the fuzzy-BMMS metric for measuring the dissimilarity of a fuzzy cluster from the backgrounds. Smaller values of both scores, combined through the final signal composition weight of the cluster, yield a smaller $Q(\ast)$ score, which is functionally required to calibrate a fuzzy cluster as a putative motif candidate in the map. In (18), the following holds: $0 < \lambda_j(t'), R^f(\ast), R_b^f(\ast) < 1$.

C. Postprocessing

The top $T$ candidates (user defined) are selected to be optimized by their grid-based neighboring nodes, followed by their defuzzification (decoding), as applied in READ [1].
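The prototype-updating step of Section IV-A, built on (6), (8), and (9), and the candidate score in (17) and (18) can be sketched as follows. This is an illustrative sketch under assumptions, not the READcsf implementation: the function names are hypothetical, k-mers are plain strings over A, C, G, T, and the MISCORE values `r_j` and B-MISCORE values `rb` are assumed to be precomputed.

```python
import math


def neighborhood(zj, zk, sigma):
    """Eq. (8): Gaussian neighborhood between two 2-D grid coordinates."""
    d2 = (zj[0] - zk[0]) ** 2 + (zj[1] - zk[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def next_sigma(sigma0, t, t_max):
    """Eq. (9): neighborhood range for iteration t + 1."""
    return sigma0 * math.exp(-2.0 * sigma0 * t / t_max)


def one_hot(kmer):
    """Binary 4 x k encoding of a k-mer (rows ordered A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for base in kmer] for b in "ACGT"]


def batch_update(kmers, mu, coords, sigma, m):
    """Eq. (6): batch update of every node PFM.

    mu[q][i] is the membership of k-mer i to node q; coords[q] is the
    grid coordinate of node q. Returns one 4 x k PFM per node.
    """
    n, k = len(coords), len(kmers[0])
    encoded = [one_hot(s) for s in kmers]
    pfms = []
    for j in range(n):
        num = [[0.0] * k for _ in range(4)]
        den = 0.0
        for i, enc in enumerate(encoded):
            # Combined weight of k-mer i for node j over all nodes q.
            w = sum(mu[q][i] ** m * neighborhood(coords[j], coords[q], sigma)
                    for q in range(n))
            den += w
            for b in range(4):
                for p in range(k):
                    num[b][p] += w * enc[b][p]
        pfms.append([[v / den for v in row] for row in num])
    return pfms


def q_score(lam, mu_j, r_j, rb):
    """Eq. (17): membership-weighted mean CSF score of one node."""
    weighted = sum(u * (lam * a + (1.0 - lam) * (1.0 - b))
                   for u, a, b in zip(mu_j, r_j, rb))
    return weighted / sum(mu_j)


def q_score_decomposed(lam, mu_j, r_j, rb):
    """Eq. (18): the same score from the fuzzy MMS and fuzzy BMMS."""
    total = sum(mu_j)
    rf = sum(u * a for u, a in zip(mu_j, r_j)) / total
    rbf = sum(u * b for u, b in zip(mu_j, rb)) / total
    return lam * rf + (1.0 - lam) * (1.0 - rbf)
```

The two scoring functions agree up to rounding, which mirrors the algebraic step from (17) to (18); nodes would then be ranked in ascending order of the score, since a smaller value marks a more putative candidate.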


Algorithm 1 READcsf Learning Pseudocodes 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28:

START input: , T , N = R × C, tmax . ensur e: = {}; 5 ≤ R ≤ 100; 5 ≤ C ≤ 100; 1. Initialization: For the j -th node, j = 1, 2, . . . , N: Generate a random PFM M j (t0 ). Allocate a 2D coordinate as z j = [z j 1 , z j 2 ]. Randomly initialize λ j (t0 ), i.e., 0 < λ j (t0 ) < 1. 2. Training: for t = 1 : tmax do M j (t)[4×k] ⇐ [0]4×k ; h j (t) ⇐ 0; j = 1, . . . , N. 2.1. task: fuzzy membership computation for i = 1 : || do for l = 1 : N do 2 −1 N l (K i ,Ml (t )) m−1 μli (t) ⇐ . q ( K i ,Mq (t )) q=1 end for end for 2.2. task: node updates computation for i = 1 : || do for j = 1 : N do for l = 1 : N do m (t) h (t) K . M j (t) ⇐ M j (t) + μli jl i m (t) h (t). h j (t) ⇐ h j (t) + μli jl end for end for end for 2.3. task: adaptation for j = 1 : N do M (t ) M j (t + 1) ⇐ h j(t ) .

119

READ system was introduced with a primary focus on addressing: 1) the underlying fuzziness in the characterizing features of the motif models and 2) the practically fuzzy association of the motif instances (binding sites) to multiple and different motif models. Aiming to address such inherent fuzziness in DNA motifs in their discovery, READ system adopted modified FSOMs with an unsupervised soft-clustering approach and several heuristics-based postprocessing schemes for effective motif mining [1]. In contrast, READcsf primarily focuses on addressing the explicability aspect of the clusters in the map by introducing a means of quantitatively measuring the signal and noise composition in a cluster at any given clustering state through the use of a novel background-similarity measure [10]. READcsf introduces a novel and sensible approach for cluster analysis in motif discovery in addition to offering the features of its predecessor. V. P ERFORMANCE E VALUATION

The candidates extracted from multiple FSOMs are then refined with the postprocessing scheme detailed in [1]. The top T candidates are then returned as the final motifs through an open contest among the motif candidates with different consensus lengths extracted from multiple maps.

This section reports the performance evaluation with two objectives: 1) to demonstrate the benefits of applying CSFs through a performance comparison between READ and READcsf and 2) to study the usefulness of READcsf as a motif mining tool in comparison with other SOM-based tools, i.e., SOMBRERO [2] and SOMEA [5], and other prominent tools, namely, MEME [14], AlignACE [15], and WEEDER [16]. This paper adopts the performance measures used in [1], i.e., recall (R), precision (P), and F-measure (F) rates. There exist a large number of different algorithms and tools for DNA motif discovery in the current literature. However, due to the constraints on time, resources, and succinctness of this paper, we have carefully selected a couple of those tools in the quantitative evaluation based on the following reasons: 1) SOMBRERO and SOMEA represent a class of SOM-based tools that are the recent developments in SOM-based/clustering-based motif discovery and due to their high relevancy to this paper and 2) MEME, AlignACE, and WEEDER represent a very prominent group of tools that are developed on different state-of-the-art computational approaches. In addition, our previous work [1] can be referred for a comprehensive performance evaluation between several state-of-the-art clustering-based approaches, namely: 1) a standard FCM-based [11] approach; 2) a classical batch learning SOM-based [17] approach; and 3) our FSOM-based [1] approach for DNA motif discovery task. We also acknowledge the fact that the mining tools, founded on different approaches and algorithms, have different strengths and weaknesses, and a comparison of the performance of these tools is not completely fair due to several unavoidable reasons. Thus, a perfect performance benchmarking is neither expected nor achievable, and the results reported here should only serve as references.

    end for
    $\lambda_j(t+1) \Leftarrow \dfrac{\sum_{i} \mu_{ji}(t)\, r_b(K_i)}{\sum_{i} \mu_{ji}(t)\,[\,r(K_i, M_j(t)) + r_b(K_i)\,]}$, with both sums running over all k-mers $K_i$ ($i = 1$ up to the set cardinality).
    $\sigma(t+1) \Leftarrow \sigma(t_0)\exp\{-2\sigma(t_0)\,t/t_{\max}\}$.
    Stop training if the termination condition is satisfied.
end for
3. Motif extraction and post-processing:
    • Evaluate and rank each node using (17).
    • Extract top $T$ candidates from the ranking.
    • Apply post-processing as described in [1].
    • Retain candidates for an open contest among the candidates extracted from multiple maps with different motif length.
END
Notations: $|\ast|$: set cardinality; $T$: number of motifs to return; $N$: number of nodes in the map; $t_{\max}$: maximum number of training epochs; $t_0$: initialization stage of the map; $M_j(t)$: node PFM of the $j$-th node at the $t$-th epoch; $h_{jl}(t)$: neighborhood function given in (8); and $M_j(t)$, $h_j(t)$: two computing components for node updating.
Note: This pseudocode applies to FSOM training for a given consensus length $k$, and multiple FSOM trainings are required for user-defined $k_{\min} \le k \le k_{\max}$.

D. Relation to READ System

It is deemed to be meaningful to describe the relationship between READcsf and its predecessor READ system [1].

A. Results on Real DNA Data Sets

TABLE I
PERFORMANCE EVALUATION USING REAL DATA SETS

TABLE II
STATISTICAL DESCRIPTION OF THE EIGHT REAL DATA SETS

Due to the significance of the results obtained on real data sets in DNA motif discovery, we used eight real data sets

in the evaluation. These data sets, collected from [13] and [18], are composed of the real promoter sequences of coregulated genes that contain verified motifs (functional binding sites) that bind to the ERE, MEF2, SRF, CREB, E2F, MyoD, CRP, and GCN4 transcription factors (TFs). Each data set contains a varying number of sequences and one verified motif with known locations of its instances (known binding sites) in the sequences. These data sets are useful for evaluating the tools with respect to the original sequence properties in finding known motifs. The statistical features of these data sets are given in Table II.

During each run of READcsf on a data set, multiple FSOMs were trained with random map sizes between 10 × 10 and 20 × 20 for each consensus length $k$ ($k_{\min} \le k \le k_{\max}$) such that $(k_{\min}, k_{\max}) = (l-3, l+3)$, where $l$ is the consensus length of the true motif in the data set. Then, the top 10 candidates were extracted from each map. The composition weight associated with each node was randomly initialized as $0 < \lambda(t_0) < 1$. The initial neighborhood range $\sigma(t_0) = 3$, fuzziness regulator $m = 1.025$, and maximum epoch $t_{\max} = 100$ were set for training as applied in [1]. The training and parameter settings of READ, SOMBRERO, and SOMEA were described in [1]. For the sake of a fair comparison, READcsf, READ, SOMBRERO, and SOMEA were given the same map size, a random initialization of nodes, the same maximum number of epochs, the same expected motif width (k-mer length), and the same number of candidates (top 10) to be returned during each run on each data set. In addition, MEME, AlignACE, and WEEDER were run on these data sets using the parameter settings detailed in [1].

The best motif found during each run of a tool on each data set in terms of F-measure was saved, and the recall (R), precision (P), and F-measure (F) rates obtained by these motifs were recorded. The average recall, precision, and F-measure rates over 10 runs obtained by the tools on each data set are presented in Table I. READcsf outperformed READ on seven of eight data sets in terms of F-measure, indicating the benefits of using CSFs, since the CSF implementation is what distinguishes READcsf from READ.

In comparison with the other SOM-based tools, READcsf (0.74) obtained a 25.7% improvement over SOMBRERO (0.55) and a 2.7% improvement over SOMEA (0.72) in terms of the average F-measure computed on the data sets (the quoted percentages are numerically consistent with differences expressed relative to the READcsf score, e.g., (0.74 − 0.55)/0.74 ≈ 25.7%). In addition, READcsf (0.77) obtained a 10.4% improved average recall rate over SOMBRERO (0.69), indicating an improved ability to retrieve true binding sites. It also obtained a 31.9% and a 6.9% improved average precision rate over SOMBRERO and SOMEA, respectively.

In comparison with the other tools considered, the average F-measure of READcsf (0.74) on these data sets shows an improvement over MEME (0.65) and AlignACE (0.69) and is also slightly better than that of WEEDER (0.72). Notably, the average recall rate of READcsf (0.77) is higher than those of MEME (0.64) and AlignACE (0.69) and slightly better than that of WEEDER (0.75), even though similar average precision rates of these tools are observed. The improvement in the recall rates without compromising the precision rates enabled READcsf to outperform the other tools considered.

Remark 2: It was previously shown in [1] that the operational complexity of the standard SOM and that of the FSOM can be similar in practical implementation, i.e., the two are approximately equal.
This can be achieved by customizing FSOM learning without losing its integrity. We extend this understanding to READcsf training, which comprises: 1) B-MISCORE computation of k-mers and 2) FSOM learning, so that its overall operational requirement is the cost of computing $r_b(K)$ plus that of FSOM learning. The first term imposes only a minor increase in training time because it can be precomputed, implying no major difference between the computation time of READcsf and that of the state-of-the-art SOMs. In a demonstration, the 10-run average training times of READcsf and standard SOMs on the eight data sets were found to be 115.62 and 99.20 s, respectively (roughly 17% longer for READcsf), where the same number of nodes and a fixed number of cycles were set for a fair comparison on a machine with an Intel Core i7-3612QM CPU at 2.10 GHz.

B. Results on Multiple Motif Data Sets

TABLE III
PERFORMANCE EVALUATION USING MULTIPLE MOTIF DATA SETS

Computational tools are expected to be capable of finding multiple motifs if these exist in the query set of input sequences. However, the F-measure-based performance evaluation of a motif mining task requires knowing the specific locations of the instances (binding sites) of the different motifs in the set of sequences, as a prerequisite to the recall and precision measures. To the best of our knowledge, it is difficult to find a set of coregulated sequences in which the locations of the binding sites of different transcription factors are annotated in the same sequence collection, and which could therefore be applied in the quantitative performance evaluation of tools in multiple motif-mining tasks. Therefore, due to the lack of real data sets with such properties in the public databases, we adopted five artificial data sets from our previous studies [1], [5] in this evaluation. Applying these data sets serves two other purposes.

1) Each data set contains 20 sequences of real promoters taken from relevant species, and each data set has three verified motifs, each for a different TF, whose known instances are arbitrarily planted in the promoters. These data sets are useful in evaluating the tools in terms of simultaneously mining multiple motifs, imitating a plausible scenario in real-world motif mining, where the input set of promoter sequences may harbor multiple functional motifs of different TFs.
2) Each data set is composed of considerably long sequences and features a problematically low signal-to-noise ratio (≤0.0018). These challenging features test the ability of the tools to find motifs in large data sets in a simulated environment.

Note that these results serve only as a reference rather than a complete scalability benchmarking of the tools, which is beyond the scope of this paper.

READcsf, READ, SOMEA, SOMBRERO, MEME, and WEEDER were run on these data sets. The training and parameter settings of these tools were kept similar to those applied in the single-motif discovery task. However, due to the increased size of the data sets, the SOM/FSOM-based tools were given a larger map size of 20 × 20, and all the tools were set to return the top 20 candidates during each run. Then, the best motif for each TF in terms of the F-measure was recorded during each run of the tools on a data set, and the average R, P, and F-measure rates over 10 runs are presented in Table III. The results show that READcsf obtained a 4.3% improvement in terms of the average F-measure and an 8.8% improvement in terms of the average recall rate over READ on these data sets. Their average precision rates were found to be similar, i.e., READcsf (0.41) and READ (0.40), which can be attributed to applying the same postprocessing scheme for model refinement described in [1]. Thus, it is deducible that the improvement in the F-measures of READcsf over READ is caused by its higher recall of the true motif instances, which is potentially facilitated by

the CSF-based similarity computation, revealing the usefulness of CSFs over traditional similarity metrics in FSOM-based motif discovery. READcsf also produced the best average F-measure among the SOM/FSOM-based tools considered: READcsf (0.47) obtained a 23.4% improvement over SOMBRERO (0.36) and a 10.6% improvement over SOMEA (0.42) in terms of the average F-measure on these data sets, demonstrating its potential to produce more useful mining results than the existing SOM/FSOM-based tools. In comparison with the other tools, MEME obtained the best average F-measure on these data sets. Note that the SOM-based (and FSOM-based) tools face two major performance biases compared with the other tools (e.g., MEME) on multimotif data sets: 1) the selection of a proper map size and 2) the selection of a k-mer length that simultaneously satisfies multiple motifs, as elaborated in [1]. Despite these biases, READcsf (0.47) obtained an average F-measure similar to that of MEME (0.48). Notably, READcsf (0.57) obtained a 28.1% improved average recall rate over MEME (0.41), which is advantageous in this complicated and low-performance motif discovery exercise.

Fig. 5. Verified motif logos of the SWI4 and SWI6 TFs collected from the JASPAR [21] database, compared with the logos discovered by READcsf and MEME.

Demonstration Using a Real Data Set: To examine the capability of READcsf in discovering multiple motifs, a real data set containing the instances of multiple motifs is examined. First, we collected a set of coregulated sequences from [19] and [20] for each of the SWI4 and SWI6 transcription proteins. Then, we selected only the common sequences (intersection) of the two sets. This gives a sequence collection (named SWI4_SWI6) that contains the instances of both SWI4 and SWI6 TFs, however, with no information on the specific locations of the binding sites in the sequences. The SWI4_SWI6 sequence set contains 78 sequences with an average length of 717.5 bp.
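The construction of the common sequence collection described above can be sketched in a few lines of Python. The sketch assumes that each TF-specific collection is held as a dictionary mapping a sequence identifier to its nucleotide string; the identifiers, the toy sequences, and the function names are hypothetical and are not the data from [19] or [20].

```python
def common_sequences(set_a, set_b):
    """Keep only the sequences whose identifiers occur in both collections."""
    shared_ids = sorted(set_a.keys() & set_b.keys())
    return {seq_id: set_a[seq_id] for seq_id in shared_ids}


def summarize(collection):
    """Return (number of sequences, average sequence length in bp)."""
    count = len(collection)
    if count == 0:
        return 0, 0.0
    mean_length = sum(len(seq) for seq in collection.values()) / count
    return count, mean_length


if __name__ == "__main__":
    # Toy stand-ins for the two TF-specific collections (hypothetical data).
    swi4 = {"g1": "ACGTAC", "g2": "TTGACGCG", "g3": "CGCGAAAA"}
    swi6 = {"g2": "TTGACGCG", "g3": "CGCGAAAA", "g4": "ACGCGT"}
    shared = common_sequences(swi4, swi6)
    print(sorted(shared), summarize(shared))
```

Intersecting on identifiers rather than on raw sequence strings is a design assumption of this sketch; it keeps the step robust to differences in how the two sources trim or pad the same promoter.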
The unavailability of the binding-site locations makes this discovery exercise mimic a practical motif finding task. We ran READcsf, MEME, and WEEDER on this data set, and each tool was allowed to return at most the top 20 motifs during each run. After a careful inspection of the list of motifs returned and a comparison with the verified logos of these motifs collected from the JASPAR [21] database, it was observed that READcsf was able to find both motifs simultaneously in each run. For a qualitative comparison, the best samples of logos discovered by these tools over 10 runs are presented in Fig. 5 alongside the verified logos from the JASPAR database, where READcsf shows a promising performance in discovering multiple motifs. It was observed that READcsf returned those two motifs within the top five candidate motifs in a run, while the other tools considered either could not recognize these motifs in higher ranks in their returned lists or discovered one of the motifs and missed the other in a run. Thus, to quantitatively measure this motif-recognizability performance of the tools, we adopted the mean rank score $\phi$ from [8], computed as

$$\phi(M) = \frac{q(q+1)}{2\sum_{i=1}^{q} \operatorname{rank}(M_i)}$$

where $q$ is the number of relevant items (motifs) whose rank orders are considered and a higher $\phi(\ast)$ indicates a better motif recognizability; $\phi = 1$ when the $q$ motifs occupy the top $q$ ranks. We observed the following mean rank scores for the tools in finding these motifs over 10 runs.
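As an illustration of the mean rank score defined above, the following minimal Python sketch evaluates $\phi$ for a list of ranks. The function name and the example ranks are hypothetical and chosen only to show how the score behaves; they are not the scores obtained by the tools in this experiment.

```python
def mean_rank_score(ranks):
    """Mean rank score: q(q + 1) / (2 * sum of ranks), with q = len(ranks).

    Ranks are 1-based positions of the relevant motifs in a returned list.
    The score equals 1.0 when the q motifs occupy ranks 1..q and decreases
    as the motifs appear further down the list.
    """
    q = len(ranks)
    if q == 0:
        raise ValueError("at least one rank is required")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks must be 1-based positive integers")
    return q * (q + 1) / (2 * sum(ranks))


if __name__ == "__main__":
    # Hypothetical rank pairs for two motifs in a returned list of 20.
    for ranks in ([1, 2], [2, 5], [3, 17]):
        print(ranks, round(mean_rank_score(ranks), 3))
```

A tool that misses one of the motifs entirely has no rank for it, so a convention for that case (for example, assigning a rank just beyond the end of the returned list) would have to be chosen separately; the sketch leaves that decision to the caller.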

The observed mean rank scores show that READcsf discovered both motifs in top rankings and outperformed the other tools. Previous studies [1], [2], [5] revealed that SOM-based clustering approaches are capable of returning multiple candidate motifs simultaneously in the same search, where multiple candidates are usually found to share a partial representation of the same motif and often represent different motifs with significantly diverse properties. The latter observation leads to their usefulness in discovering multiple motifs in the query sequences.

C. Robustness Analysis

In the SOM-based tools, an improper map size degrades the quality of clustering and the motif mining performance. To robustly handle the negative effects of the map size setting, READ and READcsf adopted: 1) FSOMs for soft partitioning of the k-mer data set and 2) a postprocessing scheme [1] capable of quickly turning a noisy motif model into a desired one by acquiring left-behind subtle motif elements and by iteratively removing noise from the model. These two mechanisms enabled READ to be more robust in handling inaccurate


map sizes than SOMBRERO and SOMEA, as demonstrated in [1]. READcsf is anticipated to demonstrate similar robustness. However, the effect of CSFs on this aspect is of interest and is investigated in this section. READcsf, READ, SOMBRERO, and SOMEA were run on the real data sets using standard map sizes of 10 × 10, 15 × 15, and 20 × 20, while the other training parameters were kept the same as those applied in the single motif discovery task described in Section V-A. A 10-run average of the F-measure obtained by each tool for each map size on each data set is presented in Table IV for comparison. Table IV also includes the standard deviation, as the robustness indicator, computed over the average F-measures obtained by the tools on the different map sizes. This shows that READcsf produces the smallest standard deviation in most of the cases, indicating better robustness than the other SOM/FSOM-based tools considered in handling improper map size settings. The improvement in robustness of READcsf over the READ system can reasonably be attributed to the CSF implementation in the clustering process. In addition, this observation indicates that the rational treatment of signal and noise composition in the clusters by CSFs can jointly improve such robustness in clustering-based motif elicitation when applied together with postprocessing schemes [1] especially designed for this task.

TABLE IV
ROBUSTNESS ANALYSIS OF SOM/FSOM-BASED TOOLS IN HANDLING DIFFERENT MAP SIZES

D. Discussion

This section presents a discussion on how motif elements are discriminated by CSFs through using signal and noise characteristics and quantifying their composition in the clusters. Let $M_r$ denote a random model, $M_t$ a true motif model, $K_r$ a random k-mer, and $K_t$ a true binding site k-mer. A simplified and general interpretation of the CSF-based similarity measure can then be described using the following four cases.
Case 1: The CSF score of $(K_t, M_t)$ is smaller, owing to the combined effects of: 1) a good degree of motif properties in $M_t$; 2) consequently, a smaller degree of embedded noise in the cluster, represented by a larger $\lambda$ value; and 3) a smaller $[1 - r_b(K_t)]$ score, indicating an inherently higher dissimilarity of $K_t$ to the backgrounds.

Case 2: The CSF score of $(K_r, M_t)$ is larger than in Case 1, due to a larger $[1 - r_b(K_r)]$ score, indicating a higher degree of noise resemblance of $K_r$.

Case 3: The CSF score of $(K_t, M_r)$ is larger than in Case 1, owing to the combined effects of the following: 1) a random model $M_r$ is likely to have a higher degree of embedded noise, represented by a smaller $\lambda$ value associated with $M_r$ and 2) the MISCORE-based similarity $r(K_t, M_r)$ is larger due to the absence of motif properties in a random model.

Case 4: The CSF score of $(K_r, M_r)$ is larger than in Case 1, with some degree of randomness caused by the stochastic nature of the $r(K_r, M_r)$ quantification due to the absence of motif properties in a random (noise) model [8]. That is, the relationship between a random k-mer and a noise-dominated model imposes some degree of uncertainty on the modeling scheme, which is a persistent problem in all existing approaches to signal (motif) discrimination due to the special characteristics of this problem. However, it is observed in [8] that $R(M_t)$ is smaller than $E\{R(M_r)\}$, where $E\{\ast\}$ is the expectation over a large number of random models, implying $r(K_t, M_t) < r(K_r, M_r)$ and, consequently, a smaller CSF score for $(K_t, M_t)$ than for $(K_r, M_r)$ in an average case.

VI. CONCLUSION

A consistent interpretation of clusters through an explicable distribution of k-mers with respect to the embedded signal and noise characteristics in the clusters is a fundamental requirement for system clarity. This requirement, however, had not previously been met, largely because of the persistently problematic role of the existing domain-specific similarity metrics. This paper has addressed this problem by introducing CSFs, which are capable of measuring the degree of noise embedded in each cluster and of utilizing this information to discriminate putative motif clusters from the noise-dominated ones during k-mer distribution throughout the training.
This offers two significant benefits in SOM-based motif discovery: 1) an improved explicability of all clusters in the maps, with practical benefits in terms of improved motif mining results and 2) a new similarity measure that improves several problematic aspects of the classical SOM-based approaches, which, because they apply the existing similarity metrics, indiscriminately treat clusters with different degrees of signal and noise composition. This paper has described the CSF implementation to introduce READcsf as an improved mining tool that has shown promising improvement in terms of discovery results over READ, SOMBRERO, SOMEA, and the other tools considered in the experiments, revealing the usefulness of the technical solutions presented. The outcome of this paper may potentially lead to new directions of future research on: 1) novel similarity metrics for DNA motif mining and 2) noise information-based clustering techniques in SOM/FSOM-based motif mining. In addition, further research can be conducted on advanced characterization of noise and motif elements in biological data sets to benefit computational motif mining tools.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their insightful comments that truly helped them to improve the quality of this publication. The authors would also like to thank the previous research group members, Dr. N. K. Lee from Universiti Malaysia Sarawak, Malaysia, Dr. S. Li from CRISO, Australia, and Dr. M. Alhamdoosh, for contributing to discussions and data set collection.

REFERENCES

[1] D. Wang and S. Tapan, “A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10, pp. 1677–1688, Oct. 2013.
[2] S. Mahony, D. Hendrix, A. Golden, T. J. Smith, and D. S. Rokhsar, “Transcription factor binding site identification using the self-organizing map,” Bioinformatics, vol. 21, no. 9, pp. 1807–1814, May 2005.
[3] D. Liu, X. Xiong, Z.-G. Hou, and B. DasGupta, “Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks,” Neural Netw., vol. 18, nos. 5–6, pp. 835–842, Jun./Jul. 2005.
[4] D. Liu, B. DasGupta, and H. Zhang, “Motif discoveries in unaligned molecular sequences using self-organizing neural networks,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 919–928, Jul. 2006.
[5] N. K. Lee and D. Wang, “SOMEA: Self-organizing map based extraction algorithm for DNA motif identification with heterogeneous model,” BMC Bioinformat., vol. 12, no. 1, p. S16, Feb. 2011.
[6] W. W. Wasserman and A. Sandelin, “Applied bioinformatics for the identification of regulatory elements,” Nature Rev. Genet., vol. 5, no. 4, pp. 276–287, 2004.
[7] K. D. MacIsaac and E. Fraenkel, “Practical strategies for discovering regulatory DNA sequence motifs,” PLoS Comput. Biol., vol. 2, no. 4, p. e36, Apr. 2006.
[8] D. Wang and S. Tapan, “MISCORE: A new scoring function for characterizing DNA regulatory motifs in promoter sequences,” BMC Syst. Biol., vol. 6, no. 2, p. S4, Dec. 2012.
[9] G. D. Stormo and D. S. Fields, “Specificity, free energy and information content in protein-DNA interactions,” Trends Biochem. Sci., vol. 23, no. 3, pp. 109–113, Mar. 1998.
[10] D. Wang, “B-MISCORE: A new similarity metric for self-organization of DNA k-mers,” La Trobe Univ., Melbourne, VIC, Australia, Tech. Rep. LTU-22-06-2013, Jun. 2013. [Online]. Available: http://homepage.cs.latrobe.edu.au/dwang/BMISCORE.pdf
[11] J. C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. Norwell, MA, USA: Kluwer, 1981.
[12] M. M. Van Hulle, “Self-organizing maps,” in Handbook of Natural Computing: Theory, Experiments, and Applications. New York, NY, USA: Springer-Verlag, 2012.
[13] Z. Wei and S. T. Jensen, “GAME: Detecting cis-regulatory elements using a genetic algorithm,” Bioinformatics, vol. 22, no. 13, pp. 1577–1584, Apr. 2006.
[14] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Mach. Learn., vol. 21, nos. 1–2, pp. 51–80, Oct./Nov. 1995.
[15] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church, “Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation,” Nature Biotechnol., vol. 16, no. 10, pp. 939–945, Oct. 1998.
[16] G. Pavesi, G. Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17, no. 1, pp. S207–S214, Apr. 2001.
[17] T. Kohonen, Self-Organizing Maps (Information Sciences). Berlin, Germany: Springer-Verlag, 1995.
[18] J. Zhu and M. Q. Zhang, “SCPD: A promoter database of the yeast Saccharomyces cerevisiae,” Bioinformatics, vol. 15, nos. 7–8, pp. 607–611, Jul./Aug. 1999.
[19] C. T. Harbison et al., “Transcriptional regulatory code of a eukaryotic genome,” Nature, vol. 431, no. 7004, pp. 99–104, Sep. 2004.
[20] K. D. MacIsaac, T. Wang, D. B. Gordon, D. K. Gifford, G. D. Stormo, and E. Fraenkel, “An improved map of conserved regulatory sites for Saccharomyces cerevisiae,” BMC Bioinformat., vol. 7, no. 1, p. 113, Mar. 2006.
[21] D. Vlieghe et al., “A new generation of JASPAR, the open-access repository for transcription factor binding site profiles,” Nucl. Acids Res., vol. 34, pp. D95–D97, Jan. 2006.

Sarwar Tapan received the bachelor’s degree in computer science from the University of Wollongong (Malaysia Campus), Wollongong, NSW, Australia, in 2004, the master’s degree in cognitive sciences from Universiti Malaysia Sarawak, Kota Samarahan, Malaysia, in 2008, and the Ph.D. degree in computer science from La Trobe University, Melbourne, VIC, Australia, in 2013. His current research interests include the applications of intelligent computing techniques in decision support systems, data visualization, data mining and business intelligence, predictive modeling, and biological sequence analysis, with an emphasis on the computational discovery of regulatory DNA motifs.

Dianhui Wang (M’03–SM’05) received the Ph.D. degree from Northeastern University, Shenyang, China, in 1995. He was a Post-Doctoral Fellow with Nanyang Technological University, Singapore, and a Researcher with The Hong Kong Polytechnic University, Hong Kong, from 1995 to 2001. He joined La Trobe University, Melbourne, VIC, Australia, in 2001, where he is currently a Reader and an Associate Professor with the Department of Computer Science and Information Technology. He is an Adjunct Professor with the State Key Laboratory of Synthetical Automation of Process Industries, Northeastern University. His current research interests include data mining and computational intelligence techniques for bioinformatics and engineering applications, and randomized learning algorithms for big data modelling. Dr. Wang also serves as an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON CYBERNETICS, Information Sciences, Neurocomputing, and the International Journal of Machine Learning and Cybernetics.
