Journal of Bioinformatics and Computational Biology Vol. 12, No. 5 (2014) 1450027 (29 pages) © The Authors DOI: 10.1142/S0219720014500279

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by UNIVERSITY OF DAYTON on 12/29/14. For personal use only.

IncMD: Incremental trie-based structural motif discovery algorithm

Ghada Badr*,†,§, Isra Al-Turaiki*,¶, Marcel Turcotte‡,|| and Hassan Mathkour*,**

*College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia

†IRI - The City of Scientific Research and Technological Applications, University and Research District, P. O. 21934, New Borg Alarab, Alexandria, Egypt

‡Faculty of Engineering, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada

§ [email protected][email protected] ||[email protected] **[email protected]

Received 1 October 2013
Revised 1 June 2014
Accepted 5 September 2014
Published 31 October 2014

The discovery of common RNA secondary structure motifs is an important problem in bioinformatics. The presence of such motifs is usually associated with key biological functions. However, the identification of structural motifs is far from easy. Unlike motifs in sequences, which have conserved bases, structural motifs have common structure arrangements even if the underlying sequences are different. Over the past few years, hundreds of algorithms have been published for the discovery of sequential motifs, while less work has been done for the structural motifs case. Current structural motif discovery algorithms are limited in terms of accuracy and scalability. In this paper, we present an incremental and scalable algorithm for discovering RNA secondary structure motifs, namely IncMD. We consider the structural motif discovery as a frequent pattern mining problem and tackle it using a modified a priori algorithm. IncMD uses data structures, trie-based linked lists of prefixes (LLP), to accelerate the search and retrieval of patterns, support counting, and candidate generation. We modify the candidate generation step in order to adapt it to the RNA secondary structure representation. IncMD constructs the frequent patterns incrementally from RNA secondary structure basic elements, using nesting and joining operations. The notion of a motif group is introduced in order to simulate an alignment of motifs that only differ in the number of unpaired bases. In addition, we use a cluster beam approach to select motifs that will survive to the next iterations of the search. Results indicate that IncMD can perform better than some of the available structural motif discovery algorithms in terms of sensitivity (Sn), positive predictive value (PPV), and specificity (Sp). The empirical results also show that the algorithm is scalable and runs faster than all of the compared algorithms.

Keywords: Motif discovery; RNA secondary structure; data mining; A priori; Trie.


1. Introduction

With the explosive growth in biological data, the need arises for methods to extract meaningful information. One way is to discover common patterns, also called motifs, in the data. The presence of motifs usually indicates that they play key biological roles, such as transcriptional and post-transcriptional regulation. Motifs in biological sequences (e.g. DNA) are stretches of repeated nucleotide patterns; in this case they are called sequential motifs. Motifs can also occur in biological structures such as RNA structures. RNA is found as a single strand where individual bases can bond with each other, forming base pairs.1 Bonding makes RNA fold into a structure called the secondary structure. A motif in RNA secondary structures appears as a pattern of conserved base pairs, and in this case it is called a structural motif.2,3

Over the past few years, hundreds of algorithms have been developed to tackle the sequential motif discovery problem, while less work has been done on the discovery of structural motifs. Unlike sequential motifs, which are conserved in sequence, structural motifs may have completely different sequences, yet share a common structure. Figure 1 shows two RNA sequences and their common secondary structure motif. Although the sequences of the motif are different, they fold into the same secondary structure.

An important application of structural motif discovery is the identification of noncoding RNAs (ncRNAs). These are RNA molecules that do not code for proteins but were found to play significant biological roles in translation, splicing, and gene regulation. Some ncRNAs are small, ranging in length between 80 and 150 nucleotides, while others are classified as long ncRNAs, such as RNAIII, which can be 500 nucleotides long. ncRNAs can act in cis, as in riboswitches, which bind to a metabolite, causing the fold of the transcript to change. The fold changes can have an effect on transcription termination or translation initiation.4 ncRNAs can also act in trans, as microRNAs do. A comprehensive survey of ncRNAs can be found in Ref. 5.

Fig. 1. Example of two RNA sequences and their common structural motif.

1.1. Problem definition


In this paper, we present an algorithm for the structural motif discovery problem. Before proceeding with the problem definition, it is necessary to define RNA secondary structure.

Definition 1. Given an RNA sequence $R = \{r_1, r_2, \ldots, r_n\}$ of length $n$, the secondary structure $S$ of $R$ is a set of base pairs $(r_i, r_j)$, where $1 \le i < j \le n$, that satisfies the following two conditions1:

(1) Each base is paired at most once.
(2) If $(r_i, r_j), (r_k, r_l) \in S$, then $i < k < j \iff i < l < j$.

In rare situations, the two conditions above may be violated. When a base is paired more than once, more complex structures are formed, such as triple and quadruple structures. If the second condition is not satisfied, the resulting structure is called a pseudoknot.1

There are different rules for base pairing.1 A Watson–Crick base pair is formed when A bonds with U through a double hydrogen bond or when G bonds with C through a triple hydrogen bond. A wobble base pair is formed when G bonds with U by a single hydrogen bond. There are other pairing rules, such as G-A and U-C pairs, but they are less common.1

Now our problem can be formulated as follows: given the RNA sequences $R_1, R_2, \ldots, R_N$, find the set of secondary structure motifs that are present in at least $q$ input sequences, where $q \le N$. The motifs are hypothesized to be responsible for the function or regulation of the RNA sequences. The cases of base triples, quadruples, and pseudoknots are not handled in this paper.

There are many ways to represent RNA secondary structures. We adopt the most common representation, dot-bracket notation (DBN).6 Using this representation, a secondary structure is described as a string over the alphabet $\Sigma = \{\texttt{(}, \texttt{.}, \texttt{)}\}$, where matched brackets indicate a base pair and a dot represents an unpaired base. Different combinations of dots and bracket pairs result in different basic RNA structure elements.

When a number of unpaired bases are contained inside a base pair, a hairpin loop is formed. A set of consecutive base pairs is called stacked pairs. An internal loop is a loop with two base pairs and at least one unpaired base on each side of the loop. A bulge is like an internal loop in that it has two base pairs, but only one side of the loop has unpaired bases. A multi-loop is any loop with three or more base pairs. Finally, an external base refers to any unpaired base that is not contained in a loop. Any RNA secondary structure can be decomposed into one or more types of these elements.7 Figure 2 shows the basic RNA secondary structure building blocks. The secondary structure in Fig. 1 can be constructed by putting together a hairpin-loop, stacked pairs, and an internal loop.
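As an illustration of Definition 1 and the dot-bracket notation, the following minimal Python sketch converts a DBN string into base pairs and checks the two conditions of the definition. The function names `parse_dot_bracket` and `is_secondary_structure` are hypothetical and are not part of IncMD; positions are 1-based to match the definition.

```python
def parse_dot_bracket(dbn):
    """Return the base pairs (i, j), 1-based with i < j, encoded by a DBN string."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, ch in enumerate(dbn, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_secondary_structure(pairs, n):
    """Check conditions (1) and (2) of Definition 1 for a sequence of length n."""
    used = set()
    for i, j in pairs:
        # Indices must be in range and each base may be paired at most once.
        if not (1 <= i < j <= n) or i in used or j in used:
            return False
        used.update((i, j))
    for i, j in pairs:
        for k, l in pairs:
            # Condition (2): k lies inside (i, j) exactly when l does.
            if (i < k < j) != (i < l < j):
                return False
    return True


pairs = parse_dot_bracket("((...))")
print(pairs, is_secondary_structure(pairs, 7))
```

A crossing pair set such as `[(1, 3), (2, 4)]` fails condition (2); it corresponds to a pseudoknot, which cannot be written with a single bracket type.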


Fig. 2. RNA secondary structure basic building blocks.

1.2. Contribution

In this paper, we consider structural motif discovery as a frequent pattern mining problem and propose an incremental approach to solve it. The proposed algorithm, called IncMD, combines the well-known data mining algorithm a priori8 with the use of the Trie data structure. This work is also inspired by our previous algorithm, TrieAMD,9 for the discovery of sequential motifs. The Trie data structure, implemented as linked lists of prefixes (LLP),10 is used to store the input sequences so as to accelerate the search and retrieval of motifs, support counting, and candidate generation. The candidate generation step is modified so that candidate motifs can be generated incrementally using two operations: nest and join. This allows structures to be built from the basic RNA structure elements shown in Fig. 2. The nesting operation creates new structures by adding a base pair, a bulge, or an internal loop to an existing structure. The joining operation exploits adjacency information between a pair of structures in order to create a new structure.

A very preliminary version of this work was presented in Ref. 11. We extend that work by introducing the notion of a motif group in order to simulate an alignment of motifs that differ only in the number of unpaired bases they contain. In addition to using the support to evaluate motifs, we use minimum free energy to identify good candidates. Unlike the classic a priori, where the support value is applied at the candidate (motif) level, in IncMD the support is applied at the group level. This means that all motifs contained in a group contribute to the group support. Only groups satisfying a given support threshold are allowed to proceed to the next iteration of the algorithm. However, not all member motifs of a group survive with their group. We apply a clustered beam approach in order to select the member motifs that continue to the next iteration. To do so, motifs inside each group are clustered based on the sequence they appear in. Each cluster is then sorted by minimum free energy. Then, for each cluster, only a predefined number of motifs, the beam width k, is selected.

Results using more datasets from our previously published benchmark12 indicate that IncMD performs better than some of the available motif discovery algorithms. Based on empirical results, the main advantage of our algorithm is that it runs faster than all of the compared algorithms. The results also confirm that running time and space requirements increase linearly with respect to sequence length. Increasing the length of the sequences has a negligible effect on the accuracy of predictions. In contrast to other algorithms, IncMD is able to handle long input sequences. Compared to other pattern-driven algorithms, our algorithm is able to discover complex structures and avoid discovering redundant motifs.

2. Background

2.1. Structural motifs discovery algorithms

Many algorithms have been published to solve the structural motif discovery problem. Based on how they explore the search space, they are classified into two main classes: enumerative approaches and heuristic approaches.12 In enumerative approaches, the search space is exhaustively explored in order to discover overrepresented motifs. Some algorithms in this class, such as FOLDALIGN13–15 and SLASH,16 are based on dynamic programming. They basically use Sankoff's algorithm17 for simultaneously folding and aligning RNA sequences. Other algorithms rely on data structures to facilitate fast enumeration of motifs. For example, affix trees7 and suffix arrays18 are used to accelerate the access and retrieval of words.
In other algorithms, the motif discovery problem is formulated as a clique finding problem and graph mining algorithms are used, as in comRNA19 and RNAmine.20 In heuristic approaches, only promising regions of the search space are explored. Examples of algorithms in this class include CMfinder21 and RNApromo,22 which are based on expectation maximization. There are also algorithms that use evolutionary algorithms, such as RNAGA,23 GPRM,24 and GeRNAMo.25 In addition, there are other heuristics specifically designed to tackle the motif discovery problem, such as RNAProfile.26 Motifs are scored using many objective functions, including thermodynamic stability, alignment score, and probabilistic measures. However, no single objective function is known to capture all the hidden features of biologically relevant structural motifs.

2.2. Data structures in motif discovery

The use of data structures for motif discovery was first proposed by Sagot27 for the discovery of sequential motifs. The algorithm works by traversing a suffix tree in a recursive depth-first manner. A suffix tree28 is a data structure that stores all suffixes of the given sequence(s). The Weeder algorithm29 also uses a suffix tree in a similar way to discover sequential motifs. The algorithm employs a dynamically changing threshold on mismatches to ignore paths that are very unlikely to be valid motifs. Suffix-tree-based algorithms were also proposed to discover more complex models of sequential motifs.30,31

In structural motif discovery, Mauri and Pavesi7 used affix tree data structures to discover hairpins, bulges, and internal loops in RNA. An affix tree32 is a data structure that captures more information than a suffix tree does. It stores all the information about a given string and its reverse. The data structure is used to identify all substrings of a given length. The substrings appearing in at least a predefined number of sequences are kept for further processing. Surviving substrings are expanded by adding a base pair or an unpaired base, one at a time. When the algorithm reaches the point where motifs cannot be expanded, the motifs are ranked based on their free energy. Only the motifs satisfying a given threshold are reported. Since the approach can only discover hairpin-loops, a post-processing step is required to generate structures with multi-loops.

Suffix arrays were used in Seed18 for the discovery of structural motifs. A suffix array33 is a space-efficient alternative to suffix trees. It is an array of integers specifying the starting positions of suffixes in lexicographical order. The algorithm lists all stems in one of the input sequences, called the seed. It only keeps stems appearing in at least a predefined number of sequences. In the next step, base pairs are replaced with the actual base pairs occurring in the seed sequence. Then, pairs of motifs are combined to construct multi-stem motifs. The resulting motifs capture both sequence and structure information. Finally, motifs are evaluated based on free energy before they are reported.
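To make the suffix array definition above concrete, here is a deliberately naive Python sketch. It sorts the suffixes directly, which is simple but far slower than the constructions used in practice, and it is not taken from Seed; the function name `suffix_array` is only illustrative.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes of text in lexicographical order."""
    # Naive construction: compare whole suffixes. Fine for short illustrative strings.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("GUAGC")
for start in sa:
    print(start, "GUAGC"[start:])
```

For the string GUAGC the suffixes in sorted order are AGC, C, GC, GUAGC, UAGC, so the array holds the positions 2, 4, 3, 0, 1.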
2.3. Data mining in motif discovery

Data mining is an important field that has many applications in bioinformatics. The data mining field offers efficient and scalable algorithms to tackle challenging bioinformatics problems.34 Data mining techniques are used in many tasks, such as the analysis of microarray data, clustering and classification of sequences and structures, and modeling biological networks.

The sequential motif discovery problem can be mapped to the frequent pattern mining problem. The a priori algorithm8 is a well-known algorithm for mining frequent patterns. A modified version of a priori was proposed to extract hidden features in protein family alignments.35 A cutoff on information content was used instead of minimum support to evaluate candidate motifs. In Ref. 36, an algorithm called SPACE was proposed to find spaced sequential motifs. The algorithm generates candidates that are then refined into item sets before being evaluated.

Data mining techniques can be combined with data structures in order to discover motifs.9,37 A priori-Motif37 is an algorithm proposed to solve the Planted (l, d) Motif Problem.38 It is inspired by the classic a priori but is also data structure-based. Two data structures are used in the algorithm: a suffix tree to store input sequences and to count the support of candidates, and a tree-like structure to generate and store motifs. The TrieAMD algorithm9 tackles the motif discovery problem by combining a priori with a single data structure, a Trie. The Trie is used to store input sequences, to extract frequent patterns, and to generate candidate motifs. The candidates are generated using a concatenation operation instead of a join operation. A new measure based on information content is then used in addition to the minimum support to evaluate candidate motifs.

Sequential pattern mining was used to discover highly conserved single-stranded regions in ribosomal RNA secondary structures.39 These are short conserved regions, 4–6 nucleotides long, that lie within secondary structure loops. The algorithm begins with all the secondary structure elements (loops) found in a given rRNA dataset. An a priori-like approach is applied where the elements are regarded as items. In each iteration, candidates are pruned based on the support, which is the number of organisms they appear in. After the conserved single-stranded regions are found, a machine learning approach based on decision trees is applied to identify significant discriminating patterns from groups of organisms.

RNA secondary structure can be represented as a graph, which makes it possible to apply graph mining techniques. In Ref. 40, an RNA secondary structure is represented as an ordered, labeled tree. The nodes of the tree are the secondary structure elements: stems, hairpins, bulges, internal loops, and multi-loops. The algorithm starts by finding the common motifs in a randomly selected sample of trees. This is done by considering every pair of trees in the sample and finding the largest approximately common motif between the two trees.
The candidate motifs (trees) are stored in a data structure similar to a suffix tree of trees. In the subsequent steps, the count of each motif is determined and the motifs are filtered using statistical measures.

RNAmine20 also uses a graph mining algorithm to discover motifs shared by a subset of different RNA sequences. An RNA sequence and its secondary structure are represented as a directed, labeled graph, called a stem graph. In a stem graph, a node denotes a stem and an edge denotes the relative positions of two stems. A probabilistic approach is used to derive initial candidate stems.41,42 Short stems are discarded and the algorithm exhaustively enumerates stem patterns using a branch and bound algorithm.

3. Trie-Based LLP Data Structure

A Trie43 is a data structure used to store and retrieve a finite set of words. It is a rooted, directed tree where each node, except the root, is labeled with one character from the alphabet $\Sigma$. A word is represented by concatenating the characters on the path from the root to a leaf node. If a set of words share a common prefix, then their corresponding paths branch off from a common node. Figure 3(a) shows a Trie constructed for five words: tea, ten, ban, back, and inn. The main advantage of using a Trie data structure is that it allows inserting, searching, and deleting a given word in time that is linear in the length of the word. The time is therefore independent of the total size of the stored words and of the size of the Trie. The Trie data structure is widely used in many applications for efficient indexing, matching, and retrieval of a set of words over a given alphabet. Many exact and approximate matching algorithms were proposed in which Tries were used to store strings.10 A Trie can be constructed in $O(n)$ time and requires $O(n)$ space, where $n$ is the total length of the words inserted in the Trie.44

A variation of the Trie data structure, called the linked lists of prefixes (LLP), was proposed in Ref. 10. The LLP data structure connects the nodes at each level of the Trie together in a doubly linked list. This makes the LLP a set of doubly linked lists, one for each Trie level. This implementation facilitates a level-by-level traversal of the Trie. Nodes in the LLP store useful information that helps accelerate the search and retrieval of words. The stored information includes the character represented by the node, a link to the parent node, and a link to the last child. Figure 3(b) shows the LLP data structure corresponding to the Trie in Fig. 3(a).

Fig. 3. A Trie constructed from the words tea, ten, ban, back, and inn (a) and its corresponding LLP (b).

4. Proposed Algorithm

We propose IncMD, an algorithm that is inspired by a priori,8 a well-known data mining algorithm for frequent pattern mining. As in our previous work on sequential motif discovery, the TrieAMD algorithm,9 IncMD modifies a priori and uses the Trie data structure. A window-based Trie data structure is used to store input sequences. In Ref. 10, many exact and approximate matching algorithms were proposed in which Tries were used to store databases of strings. In IncMD, the Trie is also used for searching, retrieving, and generating candidate motifs with the help of a hash structure.
The hash structure stores pointers to the nodes of the Trie that represent the 5' end (H5) as well as the 3' end (H3) of the candidate structures. IncMD starts by extracting frequent hairpin-loops, which are then nested and combined iteratively to generate new motifs incrementally. Each application of the nest or join operations gives rise to a new motif group. A motif group is a group that contains motifs that differ only in the number of unpaired bases they have. Motifs are evaluated and ranked at the group level based on group support. In addition, each motif is evaluated individually based on minimum free energy. Only groups that satisfy a given support value are chosen for further processing in the next iterations. For a surviving group, only a selected set of its member motifs is chosen to survive with the group. This set is called the cluster beam. The algorithm continues until no more motif groups can be generated. In the following subsections, each step of the algorithm is explained in more detail.
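The Trie and the level-wise linking described in Section 3 can be sketched in a few lines of Python. This is a simplified illustration rather than the LLP of Ref. 10: the class and attribute names are hypothetical, children are kept in a dictionary instead of a last-child link, and nodes are appended to their level list in insertion order, so siblings are not guaranteed to be adjacent on a level.

```python
class Node:
    def __init__(self, char, parent):
        self.char = char       # character labelling this node (None for the root)
        self.parent = parent   # link to the parent node
        self.children = {}     # char -> Node (assumption: a dict, not a last-child link)
        self.prev = None       # left neighbour on the same level
        self.next = None       # right neighbour on the same level


class LevelLinkedTrie:
    def __init__(self):
        self.root = Node(None, None)
        self.head = {}         # depth -> first node of that level
        self.tail = {}         # depth -> last node of that level

    def insert(self, word):
        """Insert word in time linear in its length."""
        node = self.root
        for depth, ch in enumerate(word, start=1):
            child = node.children.get(ch)
            if child is None:
                child = Node(ch, node)
                node.children[ch] = child
                self._append(depth, child)
            node = child

    def _append(self, depth, node):
        """Append node to the doubly linked list of its level."""
        last = self.tail.get(depth)
        if last is None:
            self.head[depth] = node
        else:
            last.next = node
            node.prev = last
        self.tail[depth] = node

    def level(self, depth):
        """Iterate over one level without touching the levels above it."""
        node = self.head.get(depth)
        while node is not None:
            yield node
            node = node.next


def word_of(node):
    """Rebuild the prefix spelled by the path from the root to node."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


trie = LevelLinkedTrie()
for w in ["tea", "ten", "ban", "back", "inn"]:
    trie.insert(w)
print([word_of(n) for n in trie.level(3)])
```

The point of the level lists is direct access to a given depth, which is what the hairpin-loop extraction step relies on.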


4.1. Trie construction

IncMD begins with the construction of a Trie data structure from the input sequences. Starting at each position in a given sequence, fixed-size words are inserted into the Trie. The depth of the Trie is the word length, and it corresponds to the maximum length of a motif. Each node in the Trie has a bit string, $bit$, whose length is equal to the number of input sequences. If $bit[i] = 1$ for a given node, then the word corresponding to the path ending at this node is present in sequence $i$. The sum $\sum_{i=1}^{N} bit[i]$ indicates in how many sequences the word is present. Storing windows of the input sequences in the Trie makes it possible to simultaneously search all possible starting points in all sequences for any occurrence of a given pattern.

4.2. Extraction of frequent hairpin-loops

The next step after the construction of the Trie is the extraction of frequent hairpin-loops. A hairpin-loop is a simple secondary structure component that contains exactly one base pair and a number, $u$, of unpaired bases inside the base pair. An example of a hairpin-loop with $u = 3$ is shown in Fig. 2. Extracting hairpin-loops is done by processing only two levels of the Trie: the first level and level $u + 2$. Direct access to the designated levels is possible because the Trie is implemented as an LLP data structure.2 The LLP enables a level-by-level traversal of the Trie. For each node at the first level, if the node's base can be paired with the base of any of its descendants at level $u + 2$, then a hairpin-loop is found. A structure $S$, a hairpin-loop in this case, is stored as a pair of pointers (H5, H3), where H5 and H3 point to the leftmost and rightmost bases of the hairpin-loop structure, respectively. Hairpin-loops with exactly $u$ unpaired bases form a single motif. In order to allow the search to start with a collection of different hairpin-loops instead of one, it is possible to use a range of unpaired bases, $u = u_{min}$ to $u_{max}$.
Thus, the resulting hairpin-loops in this step share the same number of base pairs but may differ in the number of unpaired bases. All the hairpin-loops form the basis for the first group of motifs. The group support is defined as the number of sequences in which any of the member motifs appears. To calculate the group support, the bit strings $bit$ of all the H3 nodes of the member motifs are ORed. The group support is then the number of ones in the OR result.
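The three steps just described (window-based Trie construction, hairpin-loop extraction, and group support) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: children are stored in a dictionary, the bit string is a Python integer used as a bit mask, windows that run past the end of a sequence are inserted in truncated form, the Trie is descended recursively instead of jumping to a level through the LLP, and the pairing table covers only Watson–Crick and G-U wobble pairs. All function names and the two toy sequences are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


class Node:
    def __init__(self):
        self.children = {}   # base -> Node
        self.bits = 0        # bit i is set when the word ending here occurs in sequence i


def build_window_trie(seqs, window):
    """Insert the word of length `window` starting at every position of every sequence."""
    root = Node()
    for idx, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            for base in seq[start:start + window]:
                node = node.children.setdefault(base, Node())
                node.bits |= 1 << idx
    return root


def hairpin_loops(root, u):
    """Return (word, bits) for every path of length u + 2 whose end bases can pair."""
    found = []

    def descend(node, word):
        if len(word) == u + 2:
            if (word[0], word[-1]) in PAIRS:
                found.append((word, node.bits))   # node plays the role of H3
            return
        for base, child in sorted(node.children.items()):
            descend(child, word + base)

    descend(root, "")
    return found


def group_support(motifs):
    """OR the bit strings of all member motifs and count the ones."""
    combined = 0
    for _, bits in motifs:
        combined |= bits
    return bin(combined).count("1")


seqs = ["GGACUAUCC", "AAGUAGCAA"]      # toy input, not the sequences of Fig. 1
trie = build_window_trie(seqs, window=7)
group = hairpin_loops(trie, u=3)
print(group, group_support(group))
```

With these toy sequences no single hairpin-loop occurs in both inputs, yet the group support is 2 because the group as a whole is represented in both sequences. This is the difference between motif-level and group-level support.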


Fig. 4. All the hairpin-loops that appear in the input sequences in Fig. 1, when $u = 3$. Note that for clarity, only seven levels of the Trie are shown.

A frequent group corresponds to a group of motifs of length $u + 2$ (or from $u_{min} + 2$ to $u_{max} + 2$) that is present in at least $q$ input sequences. $q$ is called the minimum support and it is defined by the user. Figure 4 shows all the hairpin-loops of size three that are present in the sequences in Fig. 1. The figure shows four hairpin-loops, AACGU, ACUAU, GUAGC, and UGGUA, which all share the same dot-bracket notation: (...).

4.3. Motif group generation

After all frequent hairpin-loops are extracted, motifs are generated from the hairpin-loops group by repeatedly applying two types of operations: nesting and joining. Each operation gives rise to a different group of motifs. The starting point for generating motifs is the hairpin-loop, since it is the basic element of any RNA secondary structure. The nesting operation is used to create new structures by nesting existing ones inside a base pair, a bulge, or an internal loop. The join operation exploits adjacency relationships between a pair of structures. If two structures are adjacent, then they can be combined to create a new structure.

4.3.1. Group nesting

The basic idea of this operation is to expand a group by adding a base pair, a bulge, or an internal loop. The elements are added to each structure of the group's member motifs, as shown in Fig. 5. Adding a base pair to a single structure $S$ requires searching for the sequence of $S$ in all paths that start at the second level of the Trie. When a match is found, it is stored as a pair (H5, H3), where H5 and H3 are pointers to the leftmost and rightmost nodes of the match. H5 is located at the second level, while H3 is located at level $|S| + 1$. For each match, the next step is to check whether the base at the parent of the H5 node can be paired with the base of any of the direct children of the H3 node. If this is possible, then $S$ is said to be successfully nested in a base pair.
The new structure is also stored as a pair of pointers ðH5new ; H3new Þ, where H5new points to the parent of H5 and H3new points to the 1450027-10

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by UNIVERSITY OF DAYTON on 12/29/14. For personal use only.

IncMD: Incremental trie-based structural motif discovery algorithm

Fig. 5. The nesting operation.

pairing child node of H3. Figure 6 shows an example of adding a base pair to the hairpin-loop ACUAU on the Trie. Adding a bulge is done in a similar way to adding a base pair. However, the paths that need to be searched for the sequence of S and the nodes that need to be checked for pairing are di®erent in this case. In order to expand S with a bulge of size left on the left hand side, the search for the sequence of S considers paths starting at level

Fig. 6. An example of adding a base pair to the hairpin-loop ACUAU. 1450027-11

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by UNIVERSITY OF DAYTON on 12/29/14. For personal use only.

G. Badr et al.

Fig. 7. An example of expanding the hairpin-loop GUCUAC with a left bulge of size 2.

left þ 2. Once a match is found, it is stored as a pair ðH5; H3Þ, where H5 points to the left most node of the match at level left þ 2 and H3 points to the right most node at level left þ jSj þ 1. Let p1 be the parent of H5 that is located at the ¯rst level of the Trie. If p1 can be paired with any of the children of H3, then the structure is said to be successfully nested in a left bulge and the new structure is stored. In this case, H5new points to p1 , while H3new points to the paired child of H3. Figure 7 shows an example of adding a left bulge of size 2 to the hairpin-loop GUCUAC. To add a right bulge of size right, the sequence of S needs to be located in the paths starting at the second level. The base at the parent of H5, p1 is checked for matching with any of the children of H3 at level right þ jSj þ 2. If pairing is possible, then the structure is said to expand successfully with a right bulge. The H5new points to p1 , while H3new points to the paired node. Figure 8 shows an example of adding a right bulge of size 2 to the hairpin-loop GUCUAC. Adding an internal loop is considered as a special case of adding a bulge where a structure is expanded by adding a left and a right bulge. As in hairpin-loop extraction, left and right can be replaced with ranges. 4.3.2. Groups joining In this operation, a new group is created by joining motifs from two existing groups. This operation is necessary to create structures with multi-loops as shown in the example of Fig. 9. Let S1 ðH51 ; H31 Þ and S2 ðH52 ; H32 Þ be two existing structures on the Trie. Let j be the level where H31 is located. Then, in order to create a new structure, the sequence of S2 is searched for in the subtree rooted at H31 . All paths starting at level dist þ j þ 1 are searched for the sequence of S2 . If a match is found, 1450027-12

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by UNIVERSITY OF DAYTON on 12/29/14. For personal use only.

IncMD: Incremental trie-based structural motif discovery algorithm

Fig. 8. An example of expanding the hairpin-loop GUCUAC with a right bulge of size 2.

the join is said to be successful. Figure 10 shows an example of joining UCUAA and GUAGC with an allowed distance of three bases. The main role of the nesting and joining operations is to put together basic building blocks in order to create RNA structures. The two operations are used in the iterations of the proposed algorithm, IncMD, to generate all possible structures in a

Fig. 9. The join operation. 1450027-13

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by UNIVERSITY OF DAYTON on 12/29/14. For personal use only.

G. Badr et al.

Fig. 10. An example of applying a join operation between two structure to create a new structure. The allowed number of bases between the joined structures is three, dist ¼ 3.


search space that is restricted by the parameters: loop size (u, between u_min and u_max), join distance (dist), and the sizes of left and right bulges (left, right). An RNA structure falling outside of the defined search space might not be discovered by the algorithm.
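The nesting and joining operations both rely on locating a sequence in Trie paths that start at a given level. The following Python is a minimal sketch of that lookup, not the authors' implementation. The names TrieNode, build_window_trie, and search_from_level are hypothetical, the per-level node lists are only a stand-in for the LLP structure, and keeping the shorter windows at the end of each sequence is an assumption.

```python
class TrieNode:
    """One node of the window-based trie (hypothetical layout)."""

    def __init__(self, ch, level, parent=None):
        self.ch = ch            # base stored at this node
        self.level = level      # depth in the trie; the root is level 0
        self.parent = parent
        self.children = {}      # base -> TrieNode


def build_window_trie(sequences, window):
    """Insert every window of at most `window` bases, from every start position.

    Returns the root and a dict mapping level -> list of nodes at that level,
    which plays the role of the per-level linked lists in this sketch.
    """
    root = TrieNode("", 0)
    levels = {}
    for seq in sequences:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + window]:
                child = node.children.get(ch)
                if child is None:
                    child = TrieNode(ch, node.level + 1, node)
                    node.children[ch] = child
                    levels.setdefault(child.level, []).append(child)
                node = child
    return root, levels


def search_from_level(levels, start_level, pattern):
    """Find `pattern` on paths whose first matched node is at `start_level`.

    Returns (first_node, last_node) pairs, analogous to the (H5, H3) pairs
    described in the text. The cost depends on the pattern length and on the
    number of nodes at `start_level`, not on the sequence length.
    """
    matches = []
    for first in levels.get(start_level, []):
        if first.ch != pattern[0]:
            continue
        node = first
        for ch in pattern[1:]:
            node = node.children.get(ch)
            if node is None:
                break
        else:
            matches.append((first, node))
    return matches
```

For a left bulge of size 1 around a three-base pattern, the call would be search_from_level(levels, 3, pattern), mirroring the level left + 2 used in the text.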


Algorithm 1 IncMD
 1: Construct Trie
 2: create a new group firstGroup
 3: firstGroup.motifs = extractHairpinLoops(u)
 4: Candidates.Add(firstGroup)
 5: while Candidates ≠ ∅ do
 6:     pool.AddAll(Candidates) // accumulates results from all iterations
 7:     prevCandidates.clear()
 8:     prevCandidates.AddAll(Candidates)
 9:     Candidates.clear()
10:     for all groups gi ∈ prevCandidates do
11:         gnew ← nestGroup(gi, left, right)
12:         if gnew.support ≥ minimumSupport then
13:             Candidates.Add(gnew)
14:         end if
15:         for all groups gj ∈ prevCandidates do
16:             gnew ← joinGroups(gi, gj, dist)
17:             if gnew.support ≥ minimumSupport then
18:                 Candidates.Add(gnew)
19:             end if
20:         end for
21:         for all groups gj ∈ pool do
22:             gnew ← joinGroups(gi, gj, dist)
23:             if gnew.support ≥ minimumSupport then
24:                 Candidates.Add(gnew)
25:             end if
26:         end for
27:     end for
28: end while
29: Report groups from pool

4.4. Motif evaluation

In each iteration of the algorithm, motifs are evaluated using two measures: support and minimum free energy. In contrast to TrieAMD, where the support is applied at motif level, the support here is applied at group level. Only groups satisfying a given support threshold are selected for the next iteration. However, not all member motifs survive with their groups. Instead, inside each group, motifs are evaluated, clustered


based on the sequence in which they appear, and sorted according to their minimum free energy. From each cluster, a beam width of motifs is selected and the rest are discarded.

The free energy measures RNA secondary structure stability. Stable structures are usually associated with the lowest free energy values. The free energy of a given structure can be calculated using the nearest neighbor model [45]. This model takes into account the free energy of the different types of loops, including the free energy of externally interacting loops. Under this model, the free energy of a secondary structure is the sum of the free energies of all of its component loops. We use the RNAeval program from the Vienna RNA Package [42] to calculate the energy of RNA secondary structures. This objective function can be easily replaced with any suitable function.

4.5. Algorithm

The main steps of IncMD are shown in Algorithm 1. A window-based Trie is constructed in Line 1. Then, the hairpin-loops are extracted from the Trie in Line 3 by a call to the method extractHairpinLoops, which is shown in Algorithm 2 (Hairpin Loops Extraction). The method only requires the number u of unpaired bases (the hairpin loop size). The extracted hairpin loops form the basis for the first group, as shown in Line 4. The iterations of nest and join operations begin in Line 5. The nesting part is shown in Line 11. The joining part takes place twice: first, in the loop starting at Line 15, to join groups with the same number of base pairs; second, in Line 22, where groups from previous iterations, which are accumulated in pool, are joined with groups from the current iteration. The groups in the pool have different numbers of base pairs. This is necessary to ensure that all possible structures are explored. After the creation of a new group, either by nesting or joining, the group support is checked against the minimum support, as shown in Lines 12, 17, and 23.
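The level-wise loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the authors' code: incmd_loop and its callable parameters (nest_group, join_groups, support) are hypothetical names, a failed nest or join is represented by None, and the second join loop uses only groups from earlier iterations so that the current candidates are not joined twice.

```python
def incmd_loop(first_group, nest_group, join_groups, support, min_support):
    """Grow groups iteration by iteration until no candidate survives.

    nest_group(g) and join_groups(g1, g2) return a new group or None;
    support(g) returns the support of a group. All three are supplied by
    the caller, so this function only captures the control flow.
    """
    pool = []                      # groups accepted in all iterations
    candidates = [first_group]
    while candidates:
        older = list(pool)         # groups from earlier iterations only
        pool.extend(candidates)
        prev = candidates
        candidates = []
        for gi in prev:
            produced = [nest_group(gi)]
            produced += [join_groups(gi, gj) for gj in prev]
            produced += [join_groups(gi, gj) for gj in older]
            for g_new in produced:
                # keep a new group only if it meets the support threshold
                if g_new is not None and support(g_new) >= min_support:
                    candidates.append(g_new)
    return pool
```

The loop terminates only when every newly produced group eventually falls below the support threshold, which is the same stopping condition as Line 5 of Algorithm 1.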
Algorithm 2 Hairpin Loops Extraction
create a new motif mnew
H3Level = u + 2
H5 = LLP.get(1)
H3 = LLP.get(H3Level)
while (H5 != null) do
    while ((H3 != null) AND (H5 ∈ H3.parents)) do
        if isPair(H5.ch, H3.ch) then
            mnew.instances.Add(H5, H3)
        end if
        H3 = H3.next
    end while
    H5 = H5.next
end while
return mnew
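To make the idea behind Algorithm 2 concrete, the sketch below performs the same extraction directly on a plain string instead of on the Trie. It is a simplification for illustration only: extract_hairpin_loops and is_pair are hypothetical names, positions are 0-based, and treating G-U wobble pairs as valid alongside Watson-Crick pairs is an assumption about isPair.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_pair(a, b):
    """Return True if bases a and b can pair under the assumed rules."""
    return (a, b) in PAIRS


def extract_hairpin_loops(seq, u):
    """Return (i, j) positions of base pairs that close a loop of u unpaired bases.

    The closing pair sits at positions i and i + u + 1, which corresponds to
    the first level and level u + 2 of a Trie path starting at position i.
    """
    hits = []
    for i in range(len(seq) - u - 1):
        j = i + u + 1
        if is_pair(seq[i], seq[j]):
            hits.append((i, j))
    return hits
```

With the hairpin-loop GUCUAC used in the figures and u = 4, the only closing pair is the G at position 0 with the C at position 5.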


Algorithm 3 nestGroup
1: create a new group gnew
2: for all motifs mi ∈ g.motifs do
3:     mnew = nestMotif(mi, left, right)
4:     if mnew ≠ null then
5:         gnew.motifs.Add(mnew)
6:     end if
7: end for
8: return gnew.ClusterBeam()

The nestGroup method is shown in Algorithm 3. It is given a group g and two integers, left and right, indicating the allowed sizes of the left and right bulges. It is also possible to have ranges of bulge sizes. In Lines 2 to 7, the method adds a base pair, a bulge, or an internal loop to all member motifs of the given group. For each member motif, a call is made to the nestMotif method, which is shown in Algorithm 4. The nestMotif method is a general implementation of all the cases described in the section on group nesting, selected by the values of left and right. Each individual structure is processed as explained previously. If the group is successfully nested, a clustered beam is selected from the newly created group in Line 8 before it is returned. The ClusterBeam method is explained in more detail below.

Algorithm 4 nestMotif
for all instances Ii ∈ motif.instances do
    searchLevel = left + 2
    levelStart = LLP.get(searchLevel)
    (H5, H3) = search for the motif in paths rooted at searchLevel
    if (H5, H3) ≠ null then
        H5new = LLP.get(1)
        H3new = last child of H3 at level(H3) + right
        while ((H3new ≠ null) AND (H5new ∈ H3new.parents)) do
            if isPair(H5new.ch, H3new.ch) then
                newMotif.instances.Add(H5new, H3new)
            end if
            H3new = H3new.next
        end while
    end if
end for
return newMotif
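The effect of nestMotif on a single sequence can be sketched with string positions in place of Trie nodes. This is a hedged illustration, not the authors' method: nest_instances and is_pair are hypothetical names, instances are (i, j) pairs of 0-based positions of the outermost base pair, and the pairing rules (Watson-Crick plus G-U) are an assumption.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_pair(a, b):
    """Return True if bases a and b can pair under the assumed rules."""
    return (a, b) in PAIRS


def nest_instances(seq, instances, left=0, right=0):
    """Wrap each instance in a new base pair, skipping bulge bases.

    left and right are the numbers of unpaired bases inserted on the 5' and
    3' sides. left = right = 0 adds a stacked base pair, one non-zero value
    adds a bulge, and two non-zero values add an internal loop.
    """
    nested = []
    for i, j in instances:
        i_new = i - left - 1       # new 5' partner, before the left bulge
        j_new = j + right + 1      # new 3' partner, after the right bulge
        inside = i_new >= 0 and j_new < len(seq)
        if inside and is_pair(seq[i_new], seq[j_new]):
            nested.append((i_new, j_new))
    return nested
```

For example, in the made-up sequence CAAGUCUACG the hairpin GUCUAC occupies positions 3 to 8, and nesting it with left = 2 pairs the C at position 0 with the G at position 9.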


Algorithm 5 joinGroups
create a new group gnew
for all mi ∈ g1.motifs do
    for all mj ∈ g2.motifs do
        mnew = joinMotifs(mi, mj)
        if mnew ≠ null then
            gnew.motifs.Add(mnew)
        end if
    end for
end for
return gnew.ClusterBeam()

The joinGroups method is shown in Algorithm 5. It takes two groups, g1 and g2, along with the allowed join distance, dist, expressed as a number of unpaired bases. It is also possible to allow dist to be a range from dist_min to dist_max. For each pair of motifs, one from each group, the joinMotifs method shown in Algorithm 6 is called to join individual structures from the two motifs, as explained in the section on groups joining. Finally, the ClusterBeam method is presented in Algorithm 7. The motifs in each group are clustered based on the sequence in which they appear. Then, the motifs in each cluster are sorted based on their free energy. The best k motifs are kept for further processing, where k corresponds to the beam width.

Algorithm 6 joinMotifs
for all (H5_1, H3_1) ∈ motif1.instances do
    for all (H5_2, H3_2) ∈ motif2.instances do
        searchLevel = level(H3_1) + dist + 1
        levelStart = LLP.get(searchLevel)
        (tempH5, tempH3) = search for the label of (H5_2, H3_2) in paths rooted at searchLevel
        if (tempH5, tempH3) ≠ null then
            H5new = H5_1
            H3new = tempH3
            newMotif.instances.Add((H5new, H3new))
        end if
    end for
end for
return newMotif
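The distance rule used by joinMotifs can be illustrated on plain positions. The sketch below is an assumption-laden simplification rather than the Trie-based search of Algorithm 6: join_instances is a hypothetical name, both instance lists are taken to come from the same sequence, and each instance is a (start, end) pair of 0-based positions.

```python
def join_instances(instances1, instances2, dist):
    """Join structures that are separated by exactly `dist` unpaired bases.

    A structure from the second list is joined to one from the first when it
    starts dist + 1 positions after the first one ends, matching
    searchLevel = level(H3_1) + dist + 1 in Algorithm 6. The joined instance
    spans from the start of the first to the end of the second.
    """
    joined = []
    for start1, end1 in instances1:
        for start2, end2 in instances2:
            if start2 == end1 + dist + 1:
                joined.append((start1, end2))
    return joined
```

In the made-up sequence UCUAACCCGUAGC, the structure UCUAA occupies positions 0 to 4 and GUAGC occupies positions 8 to 12, so a join with dist = 3 succeeds and yields the span (0, 12). Supporting a range of distances would only require testing start2 against each allowed value.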


Algorithm 7 ClusterBeam
N ← number of input sequences
cluster motifs based on the sequence they appear in
for i = 1 → N do
    sort(cluster[i]) // based on minimum free energy
    cluster[i] = best k motifs
end for
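A minimal Python sketch of the cluster beam selection is given below. It assumes, for illustration only, that each motif is a dict with the hypothetical keys "seq_id" (the sequence it occurs in) and "energy" (its free energy, where lower means more stable); cluster_beam is likewise a hypothetical name.

```python
from collections import defaultdict


def cluster_beam(motifs, k):
    """Keep the k lowest-energy motifs from each per-sequence cluster."""
    clusters = defaultdict(list)
    for motif in motifs:
        clusters[motif["seq_id"]].append(motif)   # cluster by source sequence

    kept = []
    for seq_id in sorted(clusters):
        # sort by free energy, most stable first, and keep the beam
        ranked = sorted(clusters[seq_id], key=lambda m: m["energy"])
        kept.extend(ranked[:k])
    return kept
```

Because the selection is made per cluster, a sequence with many low-energy candidates cannot crowd out the candidates of another sequence, which is what lets group support be preserved while the search space is pruned.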

5. Experimental Setup

To evaluate the performance of IncMD, we use six datasets from our previously published benchmark [12]. IncMD is tested on two categories of the benchmark. The first is Set 1, which is composed of datasets with simple structures: the Iron response element I (IRE) dataset, the Histone 3' UTR stem-loop (Histone3) dataset, and the Selenocysteine insertion sequence 1 (SECIS I) dataset. The second is Set 2, which is composed of datasets with more complex structures: the Flavin mononucleotide (FMN) riboswitch dataset, the glmS ribozyme dataset, and the Lysine riboswitch dataset. The datasets in Set 3 are not used in the evaluation since they contain pseudoknots, a case that is not addressed in IncMD. Results are obtained for sensitivity (Sn), positive predictive value (PPV), and specificity (Sp). In addition, we test the scalability of IncMD using versions of the IRE family generated by increasing the flanking region lengths. We measure the running time and memory usage. We also study the impact of increasing the flanking region length on the running time and accuracy of predictions. The obtained results are compared with results obtained from Seed and CMfinder. We use a 64-bit Linux-based operating system with 16 GB of RAM and a 400 GB hard disk.

6. Experimental Results

6.1. Prediction performance

In this section, we show the results obtained using the six datasets from our previously published benchmark [12]. We focus on the predicted motifs with the best Sn value. We compare the results with those obtained with six other motif discovery algorithms: CMfinder [21], RNAProfile [26], RNAmine [20], comRNA [19], RNApromo [22], and Seed [18]. The values for PPV and Sp are also discussed.

Figure 13 shows the Sn, PPV, and Sp values averaged over all datasets in Set 1 and Set 2. The values are shown for motifs with the best Sn value for each algorithm. Among the tested datasets, comRNA [19] was able to produce predictions for only one dataset, the glmS ribozyme dataset.
In terms of Sn, IncMD performs slightly better than RNApromo. CMfinder shows the highest Sn values. The PPV values indicate that IncMD is able to discover more motifs than RNAProfile, RNAmine, and


RNApromo. In terms of Sp, IncMD performs as well as CMfinder and Seed and is better than the rest of the tools.

We also look at the performance of IncMD with respect to each set of datasets. Figure 14 shows the measurements averaged over all datasets in Set 1, which mainly contains datasets with simple structures. IncMD shows better Sn values than RNAProfile and RNApromo. As for the PPV values, our algorithm performs almost as well as Seed and better than all the other tools except CMfinder. Figure 15 shows the measurements averaged over all datasets in Set 2, which mainly contains datasets with more complex structures. As shown in the figure, IncMD shows the lowest Sn values. As for the PPV values, IncMD is better than RNApromo. In terms of Sp, IncMD performs as well as Seed and is better than the rest of the tools.

In general, CMfinder [21] and Seed [18] perform better than IncMD. This can be due to the smart initialization of the two algorithms, which heavily reduces the search space for potential motif seeds. In addition, while our algorithm captures structure information only, CMfinder and Seed are able to take into account both sequence and structure.

There are mainly two reasons why IncMD could not outperform all the tested approaches. The first reason is the use of the minimum free energy [18] as a scoring function to evaluate the motifs. IncMD generates all possible motifs; however, we discovered that a real motif may be eliminated from the search space in early iterations because of this function. A motif is dropped from the search space if it cannot survive within the beam, where the best motifs are selected based on the value of their free energy. According to Ref. 18, the structure with the lowest free energy may not coincide with the real structure. The second reason for missing motifs is the presence of non-canonical base pairs in the structure.
The algorithm can be easily modified to handle non-canonical base pairs at the expense of expanding the search space.

Using a window-based Trie data structure allows all possible starting positions in the sequence to be searched simultaneously for any occurrence of a given motif. The cost of the search is proportional to the motif length rather than the sequence length, so it does not depend on the sequence size. The window-based Trie eliminates the need to store information on very short and very long suffixes that are less likely to lead to potential motifs. Figure 16 shows the running time in minutes averaged over all datasets. IncMD shows considerably lower running time than all the other tools.

The parameters dist and beam width, k, are central to the performance of the algorithm. They are the key to managing the size of the search space. Increasing their values enlarges the search space and thus increases the running time. Experiments are conducted with different k values for the IRE family dataset. They show an increase in the Sn values, as shown in Fig. 11. The increase in Sn stops at the point where the beam width, k, is larger than the number of motifs to select from. The observed steep degradation in the PPV is due to the fact that more instances are discovered per motif. One instance is the true occurrence of the motif and the rest are considered false positives.
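The PPV behavior described above follows directly from the usual definitions of the three measures. The sketch below uses the standard confusion-matrix formulas, which is an assumption here: the benchmark of Ref. 12 may count true and false positives at a different granularity. All function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN); returns 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def ppv(tp, fp):
    """PPV = TP / (TP + FP); returns 0.0 when nothing is predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def specificity(tn, fp):
    """Sp = TN / (TN + FP); returns 0.0 when there are no negatives."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def ppv_single_true_instance(n_instances):
    """PPV when exactly one of n reported instances is the true occurrence."""
    return ppv(tp=1, fp=n_instances - 1)
```

Under these definitions, a motif reported with four instances of which one is real has a PPV of 1/4, so widening the beam can raise Sn while lowering PPV.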


Fig. 11. The effect of increasing beam width k for the IRE family dataset (a) and increasing join distance dist for the FMN riboswitch dataset (b) on Sn and PPV values. [Plots omitted: Sn and PPV versus beam width (k) in (a) and versus join dist in (b).]

Fig. 12. The effect of increasing beam width k for the IRE family dataset (a) and increasing join distance dist for the FMN riboswitch dataset (b) on Sp values. [Plots omitted: Sp versus beam width (k) in (a) and versus join dist in (b).]

As for the parameter dist, experiments have been conducted with different values. Figure 11(b) shows the effects of increasing dist values on Sn and PPV for the FMN riboswitch dataset. We observe improvement in both Sn and PPV values. The improvement in accuracy measures in both cases is not always guaranteed, as the selection of a motif is heavily dependent on the value of its free energy. In terms of Sp values, we observed similar behavior of the algorithm when increasing k and dist, as shown in Fig. 12.

6.2. Scalability

To show the scalability of IncMD, we use datasets with longer sequences. Different datasets are generated for the IRE family by varying the length of the flanking regions around the known motif. For each generated dataset, we measure the time and memory requirements of IncMD. In addition, we highlight the impact of increasing the size of the dataset on the accuracy of predictions. We compare the results to the predictions obtained using Seed and CMfinder. We chose these two tools because they showed the best performance among all the tools.


Fig. 13. Sn, PPV, and Sp averaged over all datasets.

Fig. 14. Sn, PPV, and Sp averaged over all simple datasets (Set 1).

Run Time and Memory Usage: IncMD is tested using eight versions of the IRE family. The flanking region lengths around the motif are scaled as follows: 300, 400, 500, 600, 700, 800, 900, and 1000 bp. Figure 17 shows the running time and memory usage of IncMD using the eight datasets. According to the parameters that are used


Fig. 15. Sn, PPV, and Sp averaged over all more complex datasets (Set 2).

Fig. 16. Running time in minutes averaged over all datasets.

in our experiments, the execution time and memory increase linearly with the increase in sequence length. This highlights the benefit of using the Trie data structure in our algorithm.

Comparison with Seed and CMfinder: We use the same eight versions of the IRE family with flanking regions 300, 400, 500, 600, 700, 800, 900, and 1000 bp to run Seed and CMfinder. Figure 18(a) shows the running time of the three algorithms. As shown in the figure, our algorithm takes considerably less time than Seed and CMfinder.


Fig. 17. Scalability of IncMD in terms of execution time in seconds (a) and memory usage in MB (b) for the datasets with flanking region lengths: 300, 400, 500, 600, 700, 800, 900, and 1000 bp.


Fig. 18. The effect of increasing sequence length on the accuracy: Execution time using log scale (a) and Sn (b) compared to Seed and CMfinder for the datasets with flanking regions 300, 400, 500, 600, 700, 800, 900, and 1000 bp.


Fig. 19. The effect of increasing sequence length on the accuracy: PPV (a) and Sp (b) of IncMD compared to Seed and CMfinder for the datasets with flanking regions 300, 400, 500, 600, 700, 800, 900, and 1000 bp.


Table 1. Running time in seconds for the five generated versions of the IRE family dataset with flanking region lengths: 1000, 2000, 3000, 4000, and 5000 bp. #: stopped after 5 days, X: failed.

Flanking length (bp)   1000     2000      3000   4000   5000
Seed                   63,000   438,809   #      #      #
CMfinder               2196     2936      X      X      X
IncMD                  43       73        85     96     116
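As a rough check on the IncMD row of Table 1, the sketch below fits a least-squares line to the five reported running times. It is only an illustration of how the growth rate can be read from the table; least_squares_line is a hypothetical helper, and a five-point fit says nothing about behavior beyond 5000 bp.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Flanking region lengths in units of 1000 bp and IncMD times from Table 1.
flank_kbp = [1, 2, 3, 4, 5]
incmd_seconds = [43, 73, 85, 96, 116]

slope, intercept = least_squares_line(flank_kbp, incmd_seconds)
print(f"about {slope:.1f} s per additional 1000 bp (intercept {intercept:.1f} s)")
```

The fitted slope is about 17 seconds per additional 1000 bp of flanking region, which is in line with the modest growth visible in the table.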

We also highlight the effect of increasing the length of the input sequences on the accuracy of the results. Figures 18(b) and 19 show the Sn, PPV, and Sp of IncMD as compared to Seed and CMfinder. We observe that increasing the sequence length has a minor effect on the prediction accuracy of our algorithm and CMfinder. In contrast, degradation is observed in the performance of Seed as the data size increases. In terms of Sn, our algorithm performs better than Seed on the datasets with longer flanking regions: 700, 800, 900, and 1000 bp. As for PPV and Sp, it performs better than Seed and CMfinder.

Handling Longer Sequences: We run IncMD using additional versions of the IRE family. Here, we generate five more datasets with longer flanking region lengths: 1000, 2000, 3000, 4000, and 5000 bp. Table 1 shows the running time in seconds for IncMD, Seed, and CMfinder. IncMD produces results for all five datasets, taking at most 2 min. CMfinder produces results for only two datasets, those with 1000 and 2000 bp flanking regions. For the remaining datasets, CMfinder fails to handle the input sequences and exits with no results. The problem could be a memory issue that occurs somewhere before entering the expectation maximization phase. For Seed, the program finishes and produces results for the 1000 and 2000 bp datasets. However, the results for the 2000 bp dataset fill the entire hard disk. For the remaining datasets, Seed ran for more than 5 days before we had to end the process. We did not obtain results from Seed and CMfinder for the larger datasets. For this reason, accuracy is not measured at this stage.

7. Conclusion

Structural motif discovery is one of the challenging problems in bioinformatics. Discovering structural motifs is more difficult than discovering sequential motifs, since sequences may share structural patterns even if they have different nucleotide sequences. In this paper, we proposed an incremental algorithm for structural motif discovery, IncMD.
It is based on the well-known a priori data mining algorithm and uses a Trie data structure. We introduced the notion of a motif group to denote motifs that differ only in the number of unpaired bases. This allows IncMD to simulate alignments by discovering motif groups. The performance of IncMD was compared to six other structural motif discovery algorithms using our previously


proposed benchmark dataset. The results indicate that our algorithm performs better than some of the tools. We believe that the use of minimum free energy to evaluate motifs limited the performance of our algorithm. The empirical results showed that IncMD is faster than the other compared tools. Although other algorithms are more accurate, they fail to handle larger datasets. Results also confirmed that IncMD is scalable in terms of running time and memory usage and is able to handle long input sequences.

IncMD can be further enhanced in many ways. A smarter initialization can be used by starting from stems predicted using RNAsubopt from the Vienna RNA Package. In addition, member motifs in each group can be further clustered based on edit distance. It would also be interesting to find a more suitable objective function to evaluate RNA secondary structure motifs [46]. This is a research topic in its own right. We can consider it in future work, since IncMD provides an easy framework in which any newly discovered scoring function for RNA secondary structure motifs can be used. Finally, IncMD can be modified to account for sequence conservation, and thus discover structural motifs with some sequential characteristics.

Acknowledgments

This research has been supported by the National Plan for Sciences and Technology, King Saud University, Riyadh, Saudi Arabia (Project No. 12-BIO2605-02). We would like to thank the BioInformatics Research Group (BioInG) http://bioinformaticksurg.blogspot.ca at King Saud University for creating a rich environment for conducting research and for providing valuable discussions and events.

References

1. Sung W, RNA secondary structure prediction, in Wong L (ed.), The Practical Bioinformatician, World Scientific, pp. 167–192, 2004.
2. Badr G, Turcotte M, Component-based matching for multiple interacting RNA sequences, Proc 7th Int Conf Bioinformatics Research and Applications, Springer-Verlag, Berlin, Heidelberg, pp. 73–86, 2011.
3. Carvalho AM, Freitas AT, Oliveira AL, Sagot M, An efficient algorithm for the identification of structured motifs in DNA promoter sequences, IEEE/ACM Trans Comput Biol Bioinform 3(2):126–140, 2006.
4. Westhof E, The amazing world of bacterial structured RNAs, Genome Biol 11(3):108+, 2010.
5. Meyer IM, A practical guide to the art of RNA gene prediction, Brief Bioinform 8(6):396–414, 2007.
6. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P, Fast folding and comparison of RNA secondary structures, Monatshefte für Chemie Chemical Monthly 125(2):167–188, 1994.
7. Mauri G, Pavesi G, Algorithms for pattern matching and discovery in RNA secondary structure, Theor Comput Sci 335(1):29–51, 2005.


8. Agrawal R, Srikant R, Fast algorithms for mining association rules in large databases, Proc 20th Int Conf Very Large Data Bases, San Francisco, CA, USA, pp. 487–499, 1994.
9. Al-Turaiki I, Badr G, Mathkour H, Trie-based a priori motif discovery approach, Proc 8th Int Conf Bioinformatics Research and Applications, Springer-Verlag, pp. 1–12, 2012.
10. Badr G, Tries in information retrieval and syntactic pattern recognition, PhD Thesis, School of Computer Science, Carleton University, 2006.
11. Al-Turaiki I, Badr G, Turcotte M, Mathkour H, Incremental structural motif discovery, short abstract presented at the 9th Int Symp Bioinformatics Research and Applications ISBRA 2013, Charlotte, NC, USA, May 20–22.
12. Badr G, Al-Turaiki I, Mathkour H, Classification and assessment tools for structural motif discovery algorithms, BMC Bioinform 14(Suppl 9):S4, 2013.
13. Gorodkin J, Heyer LJ, Stormo GD, Finding the most significant common sequence and structure motifs in a set of RNA sequences, Nucleic Acids Res 25(18):3724–3732, 1997.
14. Havgaard JH, Lyngsø RB, Stormo GD, Gorodkin J, Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%, Bioinformatics 21(9):1815–1824, 2005.
15. Havgaard JH, Torarinsson E, Gorodkin J, Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix, PLoS Comput Biol 3(10):e193, 2007.
16. Gorodkin J, Stricklin SL, Stormo GD, Discovering common stem-loop motifs in unaligned RNA sequences, Nucleic Acids Res 29(10):2135–2144, 2001.
17. Sankoff D, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J Appl Math 45(5):810, 1985.
18. Anwar M, Nguyen T, Turcotte M, Identification of consensus RNA secondary structures using suffix arrays, BMC Bioinform 7(1):244, 2006.
19. Ji Y, Xu X, Stormo GD, A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences, Bioinformatics (Oxford, England), 20(10):1591–1602, 2004.
20. Hamada M, Tsuda K, Kudo T, Kin T, Asai K, Mining frequent stem patterns from unaligned RNA sequences, Bioinformatics 22(20):2480–2487, 2006.
21. Yao Z, Weinberg Z, Ruzzo WL, CMfinder: a covariance model based RNA motif finding algorithm, Bioinformatics 22(4):445–452, 2006.
22. Rabani M, Kertesz M, Segal E, Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes, Proc Nat Acad Sci 105(39):14885–14890, 2008.
23. Chen J, Le S, Maizel JV, Prediction of common secondary structures of RNAs: A genetic algorithm approach, Nucleic Acids Res 28(4):991–999, 2000.
24. Hu Y, Prediction of consensus structural motifs in a family of coregulated RNA sequences, Nucleic Acids Res 30(17):3886–3893, 2002.
25. Michal S, Ivry T, Cohen O, Sipper M, Barash D, Finding a common motif of RNA sequences using genetic programming: The GeRNAMo system, IEEE/ACM Trans Comput Biol Bioinform 4:596–610, 2007.
26. Pavesi G, Mauri G, Stefani M, Pesole G, RNAProfile: An algorithm for finding conserved secondary structure motifs in unaligned RNA sequences, Nucleic Acids Res 32(10):3258–3269, 2004.
27. Sagot MF, Spelling approximate repeated or common motifs using a suffix tree, in Lucchesi CL, Moura AV (eds.), LATIN'98: Theoretical Informatics, Springer-Verlag, Berlin/Heidelberg, pp. 374–390, 1998.
28. Weiner P, Linear pattern matching algorithms, IEEE Conf Record of 14th Annual Symp Switching and Automata Theory, 1973. SWAT '08, IEEE, pp. 1–11, 1973.


29. Pavesi G, Mauri G, Pesole G, An algorithm for finding signals of unknown length in DNA sequences, Bioinformatics (Oxford, England), 17(Suppl 1):S207–S214, 2001.
30. Marschall T, Rahmann S, Efficient exact motif discovery, Bioinformatics 25(7):i356–i364, 2009.
31. Jiang H, Zhao Y, Chen W, Zheng W, Searching maximal degenerate motifs guided by a compact suffix tree, Adv Exp Med Biol 680:19–26, 2010.
32. Stoye J, Affix trees, Technical Report 2000–2014, University of Bielefeld, 2000.
33. Manber U, Myers G, Suffix arrays: A new method for on-line string searches, Proc First Annual ACM-SIAM Symp Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 319–327, 1990.
34. Bajcsy P, Han J, Liu L, Yang J, Survey of biodata analysis from a data mining perspective, in Wu X, Jain L, Wang JT, Zaki MJ, Toivonen HT, Shasha D (eds.), Data Mining in Bioinformatics, Springer-Verlag, London, pp. 9–39, 2005.
35. Ozer HG, Ray WC, Informative motifs in protein family alignments, in Giancarlo R, Hannenhalli S (eds.), Algorithms in Bioinformatics, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 161–170, 2007.
36. Wijaya E, Rajaraman K, Yiu S, Sung W, Detection of generic spaced motifs using submotif pattern mining, Bioinformatics 23(12):1476–1485, 2007.
37. Yu J, Greco S, Lingras P, Wang G, Skowron A, A frequent pattern mining method for finding planted (l, d)-motifs of unknown length, Rough Set and Knowledge Technology Proc 5th Int Conf, RSKT 2010, pp. 240–248, 2010.
38. Pevzner PA, Sze SH, Combinatorial approaches to finding subtle signals in DNA sequences, Proc Eighth Int Conf Intelligent Systems for Molecular Biology, AAAI Press, pp. 269–278, 2000.
39. Huang H, Horng J, Wu L, Fang S, Discovering common structural motifs of ribosomal RNA secondary structures in prokaryotes, Int J Artif Intel Tools 14(4):621–639, 2005.
40. Wang J, Shapiro B, Shasha D, Zhang K, Chang C, Automated discovery of active motifs in multiple RNA secondary structures, Proc 2nd Int Conf Knowledge Discovery and Data Mining, pp. 70–75, 1996.
41. McCaskill JS, The equilibrium partition function and base pair binding probabilities for RNA secondary structure, Biopolymers 29(6–7):1105–1119, 1990.
42. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer SL, Tacker M, Schuster P, Fast folding and comparison of RNA secondary structures, Monatsh Chem 125:167–188, 1994.
43. Briandais RDL, File searching using variable length keys, Proc IRE-AIEE-ACM '59 (Western) Papers, pp. 295–298, 1959.
44. Gusfield D, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, 1st edition, Cambridge University Press, New York, USA, 1997. ISBN 9780521585194.
45. Turner DH, Mathews DH, NNDB: The nearest neighbor parameter database for predicting stability of nucleic acid secondary structure, Nucleic Acids Res 38(Database issue):D280–D282, 2010.
46. Rivas E, Lang R, Eddy SR, A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more, RNA 18(2):193–212, 2012.


Ghada Badr completed her Ph.D. in 2006 in Computer Science at Carleton University, School of Computer Science, Ottawa, Canada, in Information Retrieval and Syntactic Pattern Recognition. She was the winner of the Senate Medal for outstanding research achievements in her Ph.D. studies. In 2007, she worked as a research associate at the National Research Council of Canada, Gatineau, Canada. From 2008, she worked for three years as a Postdoctoral Fellow at the University of Ottawa, Ottawa, Canada. She is currently an Assistant Professor at the College of Computer and Information Sciences, King Saud University. At KSU, she established the Bioinformatics Research Group (BioInG), which she has coordinated since Fall 2012. Her research interests include bioinformatics, data mining, pattern recognition, advanced data structures, information retrieval, and machine learning.

Isra Al-Turaiki received her Ph.D. in 2014 in Computer Science from King Saud University, Riyadh, Saudi Arabia. In 2013, she was the winner of the King Saud University Award for Scientific Excellence, Graduate Students Outstanding Research in Science and Engineering. Currently, she is an Assistant Professor at the College of Computer and Information Sciences, King Saud University. She is a member of the Bioinformatics Research Group (BioInG). Her research interests include bioinformatics, data mining, data structures, and artificial intelligence.

Marcel Turcotte is an Associate Professor at the University of Ottawa, Canada. His research interests include Bioinformatics, Computational Biology, Machine Learning Applications, Algorithm Design and Data Structures. He completed his Ph.D. at the Université de Montréal, Canada, in 1995, under the supervision of Guy Lapalme and Robert J. Cedergren. He then completed his postdoctoral studies at the University of Florida, USA, where he worked with Steven A. Benner. Following this, he moved to the United Kingdom to work with Michael J.E. Sternberg at the Imperial Cancer Research Fund. Since 2000, he has worked at the School of Electrical Engineering and Computer Science.

Hassan Mathkour received his Ph.D. degree in Computer Science from the University of Iowa, Iowa City, Iowa, USA. Currently, he is a professor of computer science at King Saud University. His current research interests include artificial intelligence, robot navigation, and bioinformatics.
