Journal of Bioinformatics and Computational Biology Vol. 14, No. 3 (2016) 1650009 (23 pages) © Imperial College Press DOI: 10.1142/S0219720016500098

An efficient algorithm for planar drawing of RNA structures with pseudoknots of any type

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by NEW YORK UNIVERSITY on 03/07/16. For personal use only.

Yanga Byun* and Kyungsook Han†
Department of Computer Science and Engineering, Inha University, Incheon 402-751, Korea
*[email protected][email protected]

Received 29 July 2015
Revised 3 November 2015
Accepted 1 December 2015
Published 23 February 2016

An RNA pseudoknot is a tertiary structural element in which bases of a loop pair with complementary bases outside the loop. A drawing of RNA secondary structures is a tree, but a drawing of RNA pseudoknots is a graph that has an inner cycle within a pseudoknot and possibly outer cycles formed between the pseudoknot and other structural elements. Visualizing a large-scale RNA structure with pseudoknots as a planar drawing is challenging because a planar drawing of an RNA structure requires both the pseudoknots and the entire structure enclosing them to be embedded into a plane without overlapping or crossing. This paper presents an efficient heuristic algorithm for visualizing a pseudoknotted RNA structure as a planar drawing. The algorithm consists of several parts: finding crossing stems and mapping the stems to pages, laying out stem-loops and pseudoknots, and detecting and resolving overlaps between structural elements. Unlike previous algorithms, our algorithm generates a planar drawing for a large RNA structure with pseudoknots of any type and provides a bracket view of the structure. It generates a compact and aesthetic structure graph for a large pseudoknotted RNA structure in $O(n^2)$ time, where $n$ is the number of stems of the RNA structure.

Keywords: RNA secondary structure; pseudoknot; visualization; planar drawing; algorithm.

† Corresponding author.

1. Introduction

In the sense of graph theory, a drawing of an RNA secondary structure can be considered as a tree in which the root node is the loop with the smallest starting base number. If there is a dangling end, artificial bases, which are not actually shown in the final drawing, can be introduced to pair the first and last bases. In contrast to an RNA secondary structure, a drawing of an RNA pseudoknot is not a tree but a graph with inner cycles within a pseudoknot as well as possible outer cycles formed between a pseudoknot and other structural elements [1]. Thus, visualizing RNA pseudoknot

Y. Byun & K. Han

structures is computationally more difficult than depicting RNA secondary structures. A graph of an RNA structure is planar if no two structural elements overlap or intersect. Visualizing a pseudoknotted RNA structure as a planar graph is complicated because it requires both the pseudoknots and the entire structure enclosing the pseudoknots to be planar. Overlaps in structure graphs may be removed with a program's editing facility, but an editing facility cannot be used in some systems such as web services. It is therefore preferable to obtain an overlap-free graph in one shot.

While several programs have been developed for visualizing RNA secondary structures [2–4], few programs exist for visualizing pseudoknotted RNA secondary structures. jViz.Rna [5] draws pseudoknotted RNA structures using a spring force model, but often generates a drawing with overlap between structural elements in a loop with a small number of bases. Moreover, it is too slow for visualizing large RNA structures.

We previously developed PseudoViewer for visualizing RNA pseudoknots as planar graphs. PseudoViewer1 is the first program that automatically visualizes RNA secondary structures with H-type pseudoknots [1]. PseudoViewer2 [6] visualizes RNA secondary structures with all known types of pseudoknots in PseudoBase [7]. PseudoViewer3 [8] visualizes RNA structures with pseudoknots of any type, both known and hypothetical, as planar graphs. The web application program and web service of PseudoViewer3 were made available to the public, but its internal algorithm and method have not been published so far.

In this paper, we present the algorithm of PseudoViewer3 with examples. The top-level algorithm of PseudoViewer3 consists of three parts. In the first part, it finds crossing stems and clusters stems into groups of non-crossing stems. In the second part, it determines the layout of stem-loops and pseudoknots. In the third part, it detects overlaps between structural elements and resolves the overlaps.
The first and second parts can be used to represent a complicated structure in the bracket view with a minimal character set. The bracket view is widely used to represent RNA structures with base pairings. One type of parenthesis, ( and ), is sufficient to represent a pseudoknot-free structure. A simple pseudoknot such as the H-type pseudoknot can be represented using two types of parentheses, typically ( ) and [ ]. The bracket view of a complicated pseudoknot requires more than two types of parentheses; more significantly, it is difficult to determine the total number of parenthesis types required to represent it in the bracket view. There is no clear method for generating the bracket view with a minimal set of parentheses.

The new algorithm efficiently generates planar graphs for large-scale pseudoknotted RNA structures. It is about 50 times faster than PseudoViewer2 [6]. For example, a pseudoknotted RNA structure with more than 4000 bases can be visualized within one second by PseudoViewer3. The rest of this paper describes the algorithms in detail.

2. Terminology and Notation

An RNA secondary structure is a graph in which a vertex represents a base and an edge represents a base pairing or the connection between adjacent bases through the backbone. A pseudoknot-free RNA secondary structure can be treated as a tree.

Algorithm for Visualizing Pseudoknotted RNA Structures

A node in the tree represents either a stem or a sequence segment, and the sequence segment of a child node is nested in that of its parent node. A base pairing of bases $i$ and $j$ ($i < j$) is represented by $(i, j)$. A stem is a stack of contiguous base pairs. A stem is denoted by $(i, j)_k = \{i, j, k \in \mathbb{N} \mid i + k \le j - k + 1\}$, which is a stack of base pairs $(i', j')$ that satisfy $0 \le i' - i = j - j' < k$. The opening stem of $(i, j)_k$ is composed of bases $op$ that satisfy $i \le op < i + k$. The closing stem of $(i, j)_k$ is composed of bases $cl$ that satisfy $j - k < cl \le j$. In the bracket view of the RNA structure, the opening stems and closing stems are represented by left parentheses and right parentheses, respectively. A half stem indicates either an opening or closing stem.

The relation of two stems $s = (i, j)_k$ and $s' = (i', j')_{k'}$ is either crossed, nested (included or includes), or separated. Stems $s$ and $s'$ are crossed when a pseudoknot is formed by the stems. Stem $s = (i, j)_k$ is left crossing of $s' = (i', j')_{k'}$ and $s'$ is right crossing of $s$ when $i + k \le i' \le i' + k' \le j - k < j \le j' - k'$. $s$ includes $s'$ when $i + k \le i' < j' \le j - k$. When $s$ includes $s'$ (or, $s'$ is included by $s$), $s$ is an ancestor of $s'$ and $s'$ is a descendant of $s$. Among the ancestors of $s$, the ancestor with the maximum starting base number is the parent of $s$. Stems $s = (i, j)_k$ and $s' = (i', j')_{k'}$ are separated when $j < i'$ or $j' < i$, and they are siblings if they have the same parent in the tree.

Consider the RNA structure in Fig. 1 as an example. Stems $(b, b')$, $(c, c')$, $(d, d')$, $(e, e')$, $(f, f')$, $(g, g')$, and $(h, h')$ are all descendants of stem $(a, a')$. Stem $(b, b')$ includes stems $(c, c')$ and $(d, d')$. Stems $(c, c')$ and $(d, d')$ are separated from each other. Stems $(b, b')$ and $(e, e')$ and a pseudoknot $(i, i')$ are children of stem $(a, a')$. Stem $(g, g')$ is left-crossing of $(h, h')$ and stem $(h, h')$ is right-crossing of $(g, g')$.

Fig. 1. An example of an RNA secondary structure with a pseudoknot. (a) Topological relation of the stems. (b) Tree structure of the stems. (c) Planar graph of the RNA structure.
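The stem relations defined in this section can be checked mechanically. The following is a minimal sketch, assuming a stem $(i, j)_k$ is stored as a small record with fields `i`, `j`, and `k`; the names `Stem` and `relation` are hypothetical and are not taken from PseudoViewer.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    i: int  # first base of the opening half stem
    j: int  # last base of the closing half stem
    k: int  # number of stacked base pairs


def relation(s: Stem, t: Stem) -> str:
    """Classify two stems as separated, nested, or crossed (hypothetical helper)."""
    # Separated: one stem ends before the other begins.
    if s.j < t.i or t.j < s.i:
        return "separated"
    # Nested: one stem lies entirely between the half stems of the other.
    if s.i + s.k <= t.i and t.j <= s.j - s.k:
        return "s includes t"
    if t.i + t.k <= s.i and s.j <= t.j - t.k:
        return "t includes s"
    # Crossed: the opening half of one stem lies inside the other stem,
    # and its closing half lies outside.
    if s.i + s.k <= t.i and t.i + t.k <= s.j - s.k and s.j <= t.j - t.k:
        return "s left-crossing t"
    if t.i + t.k <= s.i and s.i + s.k <= t.j - t.k and t.j <= s.j - s.k:
        return "t left-crossing s"
    raise ValueError("the two stems share bases")
```

For example, `relation(Stem(1, 20, 3), Stem(10, 30, 3))` reports a crossing, which corresponds to an H-type pseudoknot in which the second stem opens inside the first and closes after it.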

3. Algorithms


3.1. Finding crossing stems and page mapping of stems

The planarity of the RNA structure can be determined from a conflict graph, a dual graph of the RNA structure. In a conflict graph $G = (V, E)$, a vertex represents a stem and two vertices are connected by an edge if and only if they are in conflict (i.e. they cross each other). In this representation, determining the number of pages for a graph is equivalent to finding a minimal vertex coloring of the graph. This problem is in general NP-complete. A conflict graph with a cycle of odd length cannot be 2-colored, which implies that the RNA structure cannot be drawn as a planar graph.

For a planar graph of the RNA structure, we adopt the definition and verification of book embedding of Haslinger and Stadler [9]. An embedding of a graph into a book consists of an ordering of the vertices along the spine of the book (i.e. the edge of the book where the pages are gathered and bound) and an assignment of every edge to a page of the book, in which edges assigned to the same page do not cross [10]. When the first base is connected to the last base by an edge, the RNA backbone forms a Hamiltonian cycle in which each base appears exactly once. The number of pages for a planar graph with a Hamiltonian cycle is a maximum of two [9,10].

We assign every stem of the RNA structure to a page based on the following rules:

(1) Stems in the same page do not cross.
(2) Stems with no crossing stem are assigned to page 0. $sp_0$ denotes the set of stems assigned to page 0, and can be subsequently grouped with the stems in other pages.
(3) $sp_k$ ($k \ge 1$) is a set of stems with a crossing stem. Every stem in $sp_k$ ($k \ge 2$) has at least one crossing stem in each $sp_i$ ($i = 1, 2, \ldots, k - 1$).
(4) Planar embedding of $sp_0$ and any two of $sp_k$ ($k > 0$) is possible.

When all the stems of an RNA structure are mapped to $sp_0$ only, the structure is pseudoknot-free and the number of pages for the structure is one.
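The odd-cycle criterion stated above can be tested with a standard breadth-first 2-coloring of the conflict graph. The sketch below is a generic bipartiteness check and is not the page assignment procedure of PseudoViewer3; the function name `two_color` and the edge-list input are assumptions made for illustration. Vertices without edges (stems with no crossing stem, which the rules place on page 0) simply receive color 0 here.

```python
from collections import deque


def two_color(num_stems, conflicts):
    """Return one color (0 or 1) per stem, or None if the conflict graph has an odd cycle.

    conflicts is a list of (u, v) pairs of stem indices that cross each other.
    """
    adjacency = [[] for _ in range(num_stems)]
    for u, v in conflicts:
        adjacency[u].append(v)
        adjacency[v].append(u)

    color = [None] * num_stems
    for start in range(num_stems):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if color[v] is None:
                    # Crossing stems must go to opposite colors.
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Two crossing stems forced onto the same color: odd cycle.
                    return None
    return color
```

With three mutually crossing stems, the function returns `None`, which matches the statement that such a structure cannot be drawn on two pages.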
When an RNA structure has a stem in $sp_k$ ($k \ge 1$), the total number of pages for the RNA structure is the maximum value of $k$, which is at least 2 because each stem in $sp_k$ ($k \ge 1$) has a crossing stem in another page. Stems in the same page can be represented by the same type of parenthesis in the bracket view, and thus the number of pages indicates the number of parenthesis types required for the bracket view.

We find all crossing relations of stems using an array of half stems, which are either opening or closing stems. The array of half stems is sorted with respect to base numbers (starting base numbers for opening stems, and ending base numbers for closing stems). Each element of the array stores the base number and the pairing partner of the half stem. A stem can be represented by its indices in the array. For example, stem $s$ can be represented by $s = (o, c)$, where $o$ and $c$ are the indices of the opening and closing stems in the array, respectively. Algorithm 1 describes the process of finding all crossing stems, and Fig. 2 shows an example of finding crossing stems. In Algorithm 1, partner(S) denotes the partner of a half stem S; partner(S) indicates an opening half stem if S is a closing half stem, and a closing half stem if S is an opening half stem. crossing(S) is the list of crossing stems of stem S.

Algorithm 1 Finding crossing stems
Input: array A of half stems
Output: crossing stems of every stem
 1. Initialize a list L to be empty.
 2. For i ← 0 to length(A) − 1 Do
 3.     If A[i] is an opening stem Then
 4.         For each opening stem p in L Do
 5.             If partner(p) < partner(A[i]) Then
 6.                 Add A[i] to crossing(p)
 7.                 Add p to crossing(A[i])
 8.             End if
 9.         End for
10.         Add A[i] to L
11.     Else
12.         Remove partner(A[i]) from L
13.     End if
14. End for

After we find crossing stems, we assign stems to pages using Algorithm 2. Page assignment is done for the stems in page 1 or higher, which have at least one crossing stem. The stem with the smallest base number is assigned to page 1 and added to the list for traversal. Stems are traversed in the order that they are inserted into the list. Since the initial page number of every stem is −1 (line 2 of Algorithm 2), a negative page number of a stem indicates that the stem has not been traversed yet. If a stem $ts$ that is being traversed is assigned to page 1, its crossing stems $cs$ are mapped to page 2 (refer to arrows 1, 2, and 3 of Fig. 3) and subsequent pages. Similarly, if $ts$ is assigned to page 2, its crossing stems ($cs$) are assigned to page 1 (refer to arrows 4 and 5 in Fig. 3). If a stem being traversed ($ts$) and its crossing stem ($cs_1$) were assigned to the same page, then $cs_1$ is removed from the list (refer to arrows 6 and 6' in Fig. 3). Neither $ts$ nor $cs_1$ can be assigned to the same page as their crossing stem, so $cs_1$ is assigned to page 3 or higher. When a stem is assigned to page 3 or higher, page assignment is continued with the stem in the same manner as assignment of stems to pages 1 and 2. With $sp_0$ and any two $sp_k$'s, we can generate a subplanar graph of the structure.
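Algorithm 1 translates almost line by line into code. The following is a minimal sketch, assuming each stem is given as a pair of base numbers (the start of its opening half stem and the end of its closing half stem); the function name `find_crossing_stems` and this input format are illustrative choices, not the data structures of PseudoViewer3.

```python
def find_crossing_stems(stems):
    """Return crossing[s], the list of stem indices that cross stem s.

    stems is a list of (start, end) base numbers, one pair per stem.
    """
    # Sorted array of half stems: (base number, stem index, is_opening).
    half_stems = sorted(
        [(start, s, True) for s, (start, _end) in enumerate(stems)]
        + [(end, s, False) for s, (_start, end) in enumerate(stems)]
    )

    crossing = [[] for _ in stems]
    open_stems = []  # list L of Algorithm 1: stems opened but not yet closed

    for _base, s, is_opening in half_stems:
        if is_opening:
            for p in open_stems:
                # p opened before s; they cross if p also closes before s.
                if stems[p][1] < stems[s][1]:
                    crossing[p].append(s)
                    crossing[s].append(p)
            open_stems.append(s)
        else:
            open_stems.remove(s)
    return crossing
```

For instance, `find_crossing_stems([(1, 12), (6, 18), (20, 30)])` marks the first two stems as crossing each other and leaves the third without crossings, so under the page rules the third stem would belong to page 0.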
Since Algorithm 2 puts as many stems as possible into low-numbered pages, pages 1 and 2 are usually selected for a subplanar graph if they are available. When all stems of a single pseudoknot are assigned to two pages, they are traversed in a row with no intervening stems between them. To simplify the hierarchical relation of stems in the entire RNA structure, Algorithm 2 groups the stems of a pseudoknot and generates a

[Fig. 2 (figure not recovered): example of finding crossing stems. Surviving labels: array of half stems "A B C D a E c e d F b f"; "Step 1"; "Step 2"; "Array of half stems"; "opening stem?"; "List".]

a 0) of a pseudoknot is a region formed by half stems, which are located on a page higher than 0 but are in different pages from each other. We determine the loop region $l_k$ ($k > 0$) of a pseudoknot by the convex hull of the structural elements of the pseudoknot. All children must be positioned before their parents, so the layout process starts with the last stem and ends with the first stem. After the shape and position of all layout elements are determined, the bases are positioned along the element to which they belong (Fig. 4(d)). Algorithm 3 describes the layout process of stem-loops.

Fig. 3. Example of page assignment of stems. The arrow numbers indicate the order of page assignment. Stems A, F, and E are assigned to page 1, stems B and C are assigned to page 2, and stem D is assigned to page 3.

Fig. 4. Example of layout of stem-loops. (a) Simplify the RNA structure using line segments and circles. (b) Detect overlap between structural elements. (c) Compute the new radius of the parent circle. (d) Arrange the bases along the line segments and circles.

3.3. Layout of pseudoknots

In the typical arc representation of an RNA structure with pseudoknots, a base pair is represented by an arc. In the simplified arc representation, an arc represents a stem (Fig. 5(a)). We convert the arc representation into a planar graph by putting opening stems next to their closing half stems, as shown in Fig. 5(d). Two pages $k_1$ and $k_2$ ($0 < k_1 < k_2$), typically pages 1 and 2, are chosen to draw a planar graph. Since every stem in $sp_{k_2}$ has a crossing stem in $sp_{k_1}$, a stem in $sp_{k_2}$ and another in $sp_{k_1}$ form a pseudoknot.

The layout of a pseudoknot starts with the arc representation of the pseudoknot. The arcs representing the stems in two pages can be put into two halfplanes separated by the backbone (Fig. 5(a)). In the visualization of the RNA structure, the backbone direction (either up or down) at base $i$ is opposite to that of base $j$ for each base pair $(i, j)$ (Figs. 5(b) and 5(c)). Note that the backbone of stems in the lower halfplane ($sp_{k_1}$) changes direction from up to down, whereas the backbone of stems in the upper halfplane ($sp_{k_2}$) changes direction from down to up (Fig. 5(d)). In the final drawing, the horizontal positions of stems are adjusted to have a uniform distance between opening and closing half stems (Fig. 5(e)).

Fig. 5. Representation of a planar graph for the pseudoknotted RNA structure. (a) Simplified arc representation, in which an arc represents a stem. (b) Layout of the stems in the lower halfplane to ensure that their backbone direction changes from up to down. The stems in the upper halfplane are simply marked with arcs without considering the backbone direction. (c) Layout of the stems in the upper halfplane to ensure that their backbone direction changes from down to up. (d) Layout after adjusting the vertical positions of the stems. (e) Final drawing after adjusting the horizontal positions of the stems to have a uniform distance between opening and closing half stems.

Stems in a pseudoknot are sorted to ensure that the bottommost stem at the 5' end appears first, and the sorted stems are positioned using the following rules (see Fig. 6):

R1: A stem in $sp_{k_1}$ is positioned above its parent stem (if any) and below its crossing stem.
R2: A stem in $sp_{k_2}$ is positioned below its parent stem (if any) and above its crossing stem.
R3: If stems $(o_{k_1}, c_{k_1}) \in sp_{k_1}$ and $(o_{k_2}, c_{k_2}) \in sp_{k_2}$ do not cross each other and $c_{k_1} = o_{k_2} - 1$ or $c_{k_2} = o_{k_1} - 1$, then $(o_{k_1}, c_{k_1})$ is positioned above $(o_{k_2}, c_{k_2})$.

Fig. 6. Layout rules of stems in a pseudoknot.

Half stems in the stem array are sorted in ascending order of their starting base numbers and then examined in the sorted order. If a stem has no other stem to be positioned below it, the stem is added to a list. When stems are added to the list, the stems that satisfy any of R1–R3 are added to the list before others are added. By rule R1, a stem of $sp_{k_1}$ is checked when its opening stem is visited. Likewise, by rule R2, a stem of $sp_{k_2}$ is checked when its closing stem is visited. Any descendant $ds$ of a stem $s_{k_1} \in sp_{k_1}$ cannot appear before $s_{k_1}$ in the sorted list. Thus, when $s_{k_1}$ is not added to the sorted list, $ds$ is skipped and the closing stem of $s_{k_1}$ is checked. If $s_{k_1}$ is not added to the list, this is because of R3. After all stems in other pages $sp_k$ ($k > 1$) that do not cross the stem and satisfy R3 are added to the list, $s_{k_1}$ can also be added to the list. Only in this situation is $s_{k_1}$ rechecked. Algorithm 4 summarizes the layout algorithm of pseudoknots.

Figure 7 shows an example of layout of the stems in a pseudoknot using Algorithm 4. Lines 6–19 of Algorithm 4 determine the x-coordinates of layout elements and lines 20–41 determine their y-coordinates. Algorithm 4 removes both $s$ and partner($s$) from L and inserts $s$ into sortL (lines 24–26, lines 32–34). Since the y-coordinates of stems are determined in their order in sortL (line 41), the order of removing a half stem is important. A half stem $s$ is removed from L only after all stem-loops below $s$ are removed. Line 21 of Algorithm 4 examines whether direction($s$) is up because an opening stem in page 1 and a closing stem in page 2 must be removed from L. If stem $s_1$ in page 1 and stem $s_2$ in page 2 do not cross each other but are next to each other, then $s_2$ is positioned below $s_1$ by rule R3.

Fig. 7. Example of layout of stems in a pseudoknot. During the layout process, the x- and y-coordinates of crossing stems are determined. In the arc representation, stems of $sp_{k_1}$ are represented in red thick lines and stems of $sp_{k_2}$ are represented in blue thin lines. In the stem array, stems of $sp_{k_2}$ are shown in reversed color. Opening and closing half stems are represented in uppercase and lowercase letters, respectively. Numbers in parentheses next to a stem represent the order of determining the y-coordinates of the stem.

The descendants of stem-loops in page 1 are always positioned above their parents, so the descendants cannot be positioned if their parents are not positioned. Thus, when a stem-loop in page 1 cannot be positioned due to R3 and remains in L, the algorithm skips all its descendants and checks the next stem-loop. Lines 23–26 of Algorithm 4 examine whether the parent of a descendant of a stem-loop in page 1 is positioned. Line 28 skips all the descendants and checks a closing stem in case a stem-loop in page 1 cannot be positioned due to R3.

In summary, the y-coordinates of stems in page 1 are determined in the order of their opening half stems based on rules R1 and R2, but the y-coordinates of stems in page 2 are determined in the order of their closing half stems. If a stem in page 1 cannot be positioned based on R3 (i.e. there is a stem in page 2 that does not cross the stem in page 1 but is next to it), the algorithm skips the stem in page 1 and all its descendants and determines the y-coordinate of the stem in page 2 before the stem in page 1. Thus, lines 28–36 of Algorithm 4 describe the process of skipping such stems in page 1 and returning to them to determine their y-coordinates based on rule R3.

3.4. Overlap detection

A child is positioned in its parent's loop region along a circle or line segment. The layout element of the parent's loop region is denoted by pl. The circular layout element of the parent's loop region is denoted by plC and the linear layout element is denoted by plL. When we move a child, we move its descendants (denoted by dl) together. The distance between children is initially the same as the distance between bases, which may cause overlaps between the layout elements of their descendants. Both parents and descendants have two types of layout elements: linear elements for stems and circular elements for loops.
Hence, there are three types of overlaps: overlaps between line segments, overlaps between a line segment and a circle, and overlaps between circles. We developed a set of equations to detect overlaps and compute overlap-free angles and overlap-free distances of layout elements using basic geometric and trigonometric formulas [11]. These equations enable PseudoViewer to quickly generate a planar graph of a large-scale RNA structure with pseudoknots. This section assumes that circle O is the parent circle of the layout elements being examined.

3.4.1. Overlap-free angle between line segments

In this section, we present a set of equations for computing overlap-free angles between line segments. In the equations, all angles must be positive. When the inverse sine (arcsin) has a negative value, $\pi$ must be added to the value to ensure that it is positive. Consider two line segments (AB and DE) and their parent loop region plC with center O and radius $R$ in Fig. 8(a). The lengths of BC and EF ($l_1$ and $l_3$) are independent of $R$, and the two line segments pass through the circle center O. The minimum overlap-free angle is $\angle DOE + \angle DOB$. DO can be calculated by

$$DO = \sqrt{l_4^2 + (l_3 + R)^2 - 2 l_4 (l_3 + R) \cos(a_2)}.$$

Since all the sides of $\triangle DOE$ and $\angle OED$ ($a_2$) are known, $\angle DOE$ can be derived (the first term in the right-hand side of Eq. (1)). Similarly, $\angle BDO$ can be derived from $\angle OBD$ ($a_1$), DO and BO, which in turn enables us to compute $\angle BOD$ from $\angle OBD$ ($a_1$) and $\angle BDO$. In summary, Eq. (1) represents the relation of the overlap-free angle and radius of a parent loop (plC) for the two line segments in Fig. 8(a):

$$\theta = f(R) = \arcsin\left(\frac{l_4}{DO} \sin a_2\right) - \arcsin\left(\frac{l_1 + R}{DO} \sin(a_1)\right) + \pi - a_1. \tag{1}$$

Fig. 8. Overlap-free angles and positions. (a) Overlap-free angle between line segments. (b) Overlap-free angle between circles. (c) Overlap-free position between a line segment and a circle. (d) Overlap-free angle between a circle and line segment that is not tangent to the circle. (e) Overlap-free angle between a circle and a tangent line segment that is above the circle. (f) Overlap-free angle between a circle and a tangent line segment that is below the circle.

3.4.2. Overlap-free angle between circles

The overlap-free angle between circles ensures that the distance between the circle centers is equal to the sum of their radii. Consider the two circles with radii $r_1$ and $r_2$ in Fig. 8(b). Since $l_1$ and $l_2$ are known, the minimum overlap-free angle $\theta$ between two circles can be obtained by Eq. (2). $\theta$ in Eq. (2) is in the range $[0, \pi]$.

$$\theta = f(R) = \arccos\left(\frac{(l_1 + R)^2 + (l_2 + R)^2 - (r_1 + r_2)^2}{2 (l_1 + R) (l_2 + R)}\right). \tag{2}$$
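Eq. (2) is the law of cosines applied to the triangle formed by the parent center O and the two child circle centers. The sketch below evaluates it numerically, assuming that $l_1 + R$ and $l_2 + R$ are the distances from O to the two child centers (a reading of Fig. 8(b), which is not reproduced here). The function name and the clamping of the cosine are illustrative additions.

```python
import math


def circle_overlap_free_angle(R, l1, l2, r1, r2):
    """Minimum angle at O that keeps two child circles from overlapping (Eq. (2))."""
    d1 = l1 + R  # assumed distance from O to the first circle center
    d2 = l2 + R  # assumed distance from O to the second circle center
    cos_theta = (d1 ** 2 + d2 ** 2 - (r1 + r2) ** 2) / (2 * d1 * d2)
    # Clamp against rounding errors; a value at -1 means the circles
    # cannot be separated at this radius even at the widest angle.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)
```

With `R = 1`, `l1 = l2 = 0`, and `r1 = r2 = 0.5`, the three points form an equilateral triangle and the angle is $\pi/3$. Increasing `R` lowers the angle, which is the behavior the radius calculation in Sec. 3.4.4 relies on.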


3.4.3. Overlap-free angle between a line segment and a circle

Supplementary Fig. S1 shows several overlap relations between line segments and circles. Two line segments might intersect when the ranges of their x- and y-coordinates overlap (line segments AB and CD in Supplementary Fig. S1). When one or more ends of a line segment are inside the circle, the line segment intersects the circle (line segments GH and KL and circle F in Supplementary Fig. S1). Two circles intersect when the distance between their centers is smaller than the sum of their radii (circles E and F in Supplementary Fig. S1). A line segment may intersect a circle even if both ends of the line segment are outside the circle. Consider line segment IJ and circle F in Supplementary Fig. S1. In this case, the perpendicular distance from the circle center to the line segment is smaller than the circle radius, and the intersection point of the perpendicular line and the extended line of the line segment lies on the line segment. In contrast, line segment CD does not intersect circle F, since the intersection point of the perpendicular line and extended line lies outside the line segment.

The computation of the overlap-free angle between a line segment and a circle differs depending on whether the line segment is tangent to the circle or not. If a line segment is tangent to a circle at a point, the line segment is perpendicular to the radius line at that point on the circle. Figure 8(c) shows three different positions of a line segment and circle. Both $A_1B$ and $A_2B$ are tangent to the circle because they are perpendicular to the radius line $CA_2$, whereas $A_3B$ is not. When the length difference between line segments $CO$ and $A_iO$ exceeds $CA_i$ (the radius of the circle centered at C), $CO$ and $CA_i$ cannot form a triangle with $A_iO$, which is tangent to the circle (in $\triangle A_1CO$, $CA_1$ is greater than the radius of the circle). Though $CO$, $CA_2$, and $OA_2$ can form a triangle, $A_2B$ is also tangent to the circle because the sum of $\angle BA_2O$ and $\angle OA_2C$ is a right angle.

The equation for a circle and a line segment that is not tangent to the circle is similar to that for line segments (Fig. 8(d)). The difference is that the length of AB is fixed and $\angle DEF$ is unknown. AO in $\triangle AOB$ can be derived. Since all the sides of $\triangle DEO$ are known, the angle $\angle DOE$ can be derived from the sides of $\triangle DEO$. The angle $\angle DOE$ is the last term in the right-hand side of Eq. (3). $\theta$ between a circle and a line segment that is not tangent to the circle can be obtained by Eq. (3):

$$\theta = f(R) = \arcsin\left(\frac{l_2}{AO} \sin(a)\right) + \arccos\left(\frac{AO^2 + (l_3 + R)^2 - r^2}{2 AO \cdot (l_3 + R)}\right). \tag{3}$$
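The intersection tests described at the start of this section can be written compactly with coordinates. The following is a minimal sketch, assuming points are `(x, y)` tuples; the function names are hypothetical and the code illustrates the stated criteria rather than the equations of PseudoViewer3 [11].

```python
import math


def circles_intersect(center1, radius1, center2, radius2):
    """Two circles intersect when their centers are closer than the sum of the radii."""
    distance = math.hypot(center1[0] - center2[0], center1[1] - center2[1])
    return distance < radius1 + radius2


def segment_intersects_circle(p, q, center, radius):
    """Segment pq intersects the circle if an end lies inside it, or if the
    foot of the perpendicular from the center lies on the segment and is
    closer to the center than the radius."""
    (px, py), (qx, qy), (cx, cy) = p, q, center
    if math.hypot(px - cx, py - cy) < radius or math.hypot(qx - cx, qy - cy) < radius:
        return True
    dx, dy = qx - px, qy - py
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return False  # degenerate segment whose single point is outside the circle
    # Position of the perpendicular foot along the segment (0 at p, 1 at q).
    t = ((cx - px) * dx + (cy - py) * dy) / length_sq
    if t < 0.0 or t > 1.0:
        return False  # the foot falls on the extended line, outside the segment
    foot_x, foot_y = px + t * dx, py + t * dy
    return math.hypot(foot_x - cx, foot_y - cy) < radius
```

The second branch covers the case of segment IJ in Supplementary Fig. S1, where both ends are outside the circle but the segment still passes through it.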


Figures 8(e) and 8(f) illustrate two cases in which a line segment AB is tangent to circle E. To derive the minimum overlap-free angle, we introduce a new support point $A'$ on the extended line of AB in both cases, and denote the contact point of the tangent line and circle E by point D. In Fig. 8(e), the segment OD is longer than OE ($OD > OE$), and the angle $\angle A'BO = a_1 > \pi/2$. But in Fig. 8(f), the segment OD is shorter than OE ($OD < OE$), and $\angle A'BO = a_1 \le \pi/2$. In both cases, $\triangle A'DE$ is a right triangle with a side DE, which is equal to the radius $r$ of circle E. Thus, we can easily derive $A'E = r / \sin a_2$. Using the law of sines, Eq. (4) holds in $\triangle A'BO$ of Fig. 8(e). Then, Eq. (5) is derived from Eq. (4).

$$\frac{\sin a_2}{l_1 + R} = \frac{\sin a_1}{l_2 + R + A'E} = \frac{\sin a_1}{l_2 + R + r / \sin a_2}, \tag{4}$$

$$(l_2 + R) \sin a_2 + r = (l_1 + R) \sin a_1 \;\rightarrow\; \sin a_2 = \frac{(l_1 + R) \sin a_1 - r}{l_2 + R}. \tag{5}$$

Using Eq. (5), we can compute the minimum overlap-free angle $\angle A'OB = \theta$ in Fig. 8(e).

$$\theta = f(R) = \pi - a_1 - a_2 = \pi - a_1 - \arcsin\left(\frac{(l_1 + R) \cdot \sin(a_1) - r}{l_2 + R}\right), \quad \frac{\pi}{2} < a_1 \le \pi. \tag{6}$$

In a similar way, Eq. (7) holds in $\triangle A'BO$ of Fig. 8(f) by the law of sines, and Eq. (8) is derived from Eq. (7).

$$\frac{\sin(\pi - a_2)}{l_1 + R} = \frac{\sin a_1}{l_2 + R - A'E} \;\rightarrow\; \frac{\sin a_2}{l_1 + R} = \frac{\sin a_1}{l_2 + R - r / \sin a_2}, \tag{7}$$

$$(l_2 + R) \sin a_2 - r = (l_1 + R) \sin a_1 \;\rightarrow\; \sin a_2 = \frac{(l_1 + R) \sin a_1 + r}{l_2 + R}. \tag{8}$$

Using Eq. (8), we can derive the minimum overlap-free angle $\angle A'OB = \theta$ in Fig. 8(f).

$$\theta = f(R) = \pi - a_1 - a_2 = \pi - a_1 - \arcsin\left(\frac{(l_1 + R) \cdot \sin(a_1) + r}{l_2 + R}\right), \quad 0 \le a_1 \le \frac{\pi}{2}. \tag{9}$$

The two cases of Figs. 8(e) and 8(f) have similar equations for the minimum overlap-free angle $\theta$; the $r$ term is subtracted when computing the inverse sine in Eq. (6), whereas the $r$ term is added in Eq. (9).

3.4.4. Calculation of the radius

When overlap is detected, the parent's loop region pl must be expanded. The linear layout element plL of pl can easily be made larger, because the children do not need to

be repositioned. For the circular layout element plC of pl, the radius is increased. When the radius $R$ of the parent loop is increased, the minimum overlap-free angle between the perpendicular lines decreases. At each child node, we choose the maximum overlap-free angle $\theta$ for all the descendant nodes. An approximation of the new radius $R'$ can be obtained by Eq. (10).

$$R' = R + \frac{1}{2\pi}\left(\sum_{i = \mathrm{child}} \max(\theta_i(R)) - 2\pi\right) R. \tag{10}$$
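The radius update can be sketched as a short fixed-point loop. The code below assumes that the update enlarges $R$ in proportion to how far the summed overlap-free angles of the children exceed $2\pi$, and that the caller supplies a function `total_angle(R)` returning that sum; both the function name `expand_radius` and the stopping tolerance are assumptions made for illustration, not details of PseudoViewer3.

```python
import math


def expand_radius(R, total_angle, tol=1e-6, max_iter=100):
    """Enlarge R until the children's overlap-free angles fit into 2*pi.

    total_angle(R) must return the sum over children of the maximum
    overlap-free angle at radius R (assumed to decrease as R grows).
    """
    for _ in range(max_iter):
        excess = total_angle(R) - 2 * math.pi
        if excess <= tol:
            return R  # the children already fit around the loop
        # Assumed update rule: scale the radius by the relative excess angle.
        R_new = R + excess / (2 * math.pi) * R
        if abs(R_new - R) < tol:
            return R_new
        R = R_new
    return R
```

As a toy usage example, eight children that each need a chord of width 2 can be modeled with `total_angle = lambda R: 8 * 2 * math.asin(min(1.0, 1.0 / R))`. Starting from `R = 1.0`, the loop stops after a single enlargement, at which point the summed angles no longer exceed $2\pi$.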

If not all overlaps are resolved by the new radius $R'$, we compute it repeatedly by replacing $R$ with $R'$ in Eq. (10) until $R'$ no longer changes. In most cases, after two or three iterations we obtain the final radius that avoids overlaps between children.

4. Complexity Analysis

4.1. Finding crossing stems, page assignment, and layout

Algorithm 1 finds all crossing stems. The time required by Algorithm 1 for examining the crossing relation of a stem with others depends on the number of its left crossings and ancestors. For a stem $s = (o, c)$, the maximum number of left crossings max(lc) is $c - o - 1$. When every previous stem is the ancestor of a stem, the stem has the maximum number of ancestors, which is $\max(a) = o - 1$. max(a) includes max(lc) because left crossing stems also appear before $o$. Therefore, the maximum comparison time of $s$ is $o - 1$ and the total comparison time is $\left(\sum_{i=1}^{n} o_i\right) - n$, where $n$ is the total number of stems. The worst case occurs when every stem crosses others or every stem is either included in or includes others. Therefore, the comparison time is $n(n - 1)/2 = (n^2 - n)/2$. The traversal time required by Algorithm 2 is exactly twice the total number of crossings. This is less than or equal to twice the comparison time. Thus, page assignment of stems can be performed in $O(n^2)$ time. The best time complexity $O(n)$ is achieved when every stem is either separated or crosses a maximum of one left crossing stem and a maximum of one right crossing stem.

Lines 1–25 of Algorithm 3 describe a simple layout part, which does not require overlap detection. Starting with the last stem, it first examines whether the layout element is a half stem of a pseudoknot. If so, it calls Algorithm 4 for the layout of the pseudoknot (lines 3–4). If the layout element is a stem-loop with no child, it determines its position (lines 5–7). If the layout element has a child, Algorithm 3 considers the layout elements of its descendants (lines 9–25). Thus, the time complexity for lines 1–25 of Algorithm 3 does not exceed $O(n^2)$, where $n$ is the total number of stems.

Sorting of stems in $sp_{k_1}$ and $sp_{k_2}$ ($0 < k_1 < k_2$) is performed using the stem array, and backtracking is performed when a stem of $sp_{k_2}$ is added to the sorted list. Thus, sorting the stems requires less time than $\mathrm{size}(sp_{k_1}) \times 2 + \mathrm{size}(sp_{k_2}) \times 4$, and its time complexity is $O(n)$. The layout of all pseudoknots also requires $O(n)$ time because Algorithm 4 simply positions the stems in the sorted order.

4.2. Overlap detection


This section discusses the time complexity for overlap detection. The do-while loop in lines 26–52 of Algorithm 3 examines the overlap for the layout of stem-loops. Before analyzing the total time for overlap detection, consider a single stem-loop with $m$ children. Every child $ch_i$ ($1 < i \le m$) is examined for possible overlaps with $ch_j$ ($1 \le j < i$). Let $d_i$ denote the number of layout elements of $ch_i$ and its descendants. Then, the comparison time $T_1$ for a single stem-loop can be computed as follows.

$$T_1 = \{d_2 \cdot d_1\} + \{d_3 (d_1 + d_2)\} + \{d_4 (d_1 + d_2 + d_3)\} + \cdots + \{d_m (d_1 + d_2 + \cdots + d_{m-1})\} = \sum_{1 \le i < j \le m} d_i d_j. \tag{11}$$
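The two forms of Eq. (11) can be compared on a small example. The sketch below accumulates the comparison count child by child, as in the expanded sum, and can be checked against the pairwise sum; the function name `comparison_time` and the sample values are illustrative only.

```python
from itertools import combinations


def comparison_time(d):
    """Accumulate T1 child by child: each child is compared with all layout
    elements of the children placed before it."""
    total = 0
    elements_so_far = 0
    for d_i in d:
        total += d_i * elements_so_far
        elements_so_far += d_i
    return total


# Pairwise form of Eq. (11) for a stem-loop with four children.
sizes = [3, 1, 4, 2]
pairwise = sum(a * b for a, b in combinations(sizes, 2))
```

For `sizes = [3, 1, 4, 2]` both forms give 35, which also equals $((\sum d_i)^2 - \sum d_i^2)/2 = (100 - 30)/2$.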
