This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS


Efficient Algorithms for Knowledge-Enhanced Supertree and Supermatrix Phylogenetic Problems

André Wehe1,2, J. Gordon Burleigh1, and Oliver Eulenstein2

1 Department of Biology, University of Florida, Gainesville, FL 32609, USA
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA

Abstract Phylogenetic inference is a computationally difficult problem, and constructing high-quality phylogenies that can build upon existing phylogenetic knowledge and synthesize insights from new data remains a major challenge. We introduce knowledge-enhanced phylogenetic problems for both supertree and supermatrix phylogenetic analyses. These problems seek an optimal phylogenetic tree that can only be assembled from a user-supplied set of possibly incompatible phylogenetic relationships. We describe exact polynomial-time algorithms for the knowledge-enhanced versions of the NP-hard Robinson-Foulds, gene duplication, duplication and loss, and deep coalescence supertree problems. Further, we demonstrate that our algorithms can rapidly improve upon results of local search heuristics for these problems. Finally, we introduce a knowledge-enhanced search heuristic that can be applied to any discrete character data set using the maximum parsimony (MP) phylogenetic problem. Although this approach is not guaranteed to find exact solutions, we show that it can also improve upon parsimony solutions from commonly used MP heuristics.

Digital Object Identifier 10.1109/TCBB.2012.162

1545-5963/12/$31.00 © 2012 IEEE


Index Terms phylogenetics, supertree, supermatrix

1 INTRODUCTION

Phylogenetic trees describe the evolutionary relationships among species and can be powerful tools for examining important biological questions. The wealth of new genomic data, especially from next-generation sequencing technologies, has produced many new insights into phylogenetic relationships throughout the Tree of Life (e.g., [1]). Yet phylogenetic inference is an extremely difficult computational problem, and constructing high-quality, comprehensive phylogenies that can synthesize insights from the ever-expanding amount of new data and build upon previous phylogenetic knowledge remains a major challenge. There are two main approaches to building large-scale trees. First, supertree problems take input trees with partially overlapping taxon sets and seek a tree (the supertree) containing all of the taxa in the input trees (see [2] for an overview, as well as [3]). In contrast, supermatrix (or total evidence) methods combine partially overlapping character alignments (usually of nucleotides or amino acids) for phylogenetic analyses using maximum parsimony, maximum likelihood, or Bayesian methods (e.g., [4]). Most commonly used supertree and supermatrix problems are NP-hard [5], [6], [7], [8]. As with NP-hard problems in general, exact solutions can be obtained by complete enumeration of the solution space, but run-times typically become prohibitive for even very small instances. Heuristics, often based on local search algorithms, are necessary to estimate solutions for larger data sets [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. One approach to improve the efficiency of searches is to provide a set of possible phylogenetic relationships from which the supertree or


supermatrix has to be assembled. These phylogenetic relationships might be provided by existing phylogenetic hypotheses, and can be incompatible. Focusing the search on these candidate relationships can tremendously reduce the search space for candidate tree solutions. In addition, focusing the search provides a way to leverage the wealth of existing phylogenetic knowledge for solving phylogenetic problems, while eliminating the possibility of including undesired relationships in the resulting phylogenetic trees, since such relationships can simply be excluded from the set of candidates. We call this general approach “knowledge-enhanced” phylogenetic inference.

Here we introduce knowledge-enhanced supertree problems and a knowledge-enhanced supermatrix problem. Given a collection of input trees and a set of possibly contradictory phylogenetic relationships, a knowledge-enhanced supertree problem seeks an optimal supertree only in the tree space described by the phylogenetic relationships. Similarly, the supermatrix problem takes a phylogenetic character matrix and a set of possibly contradictory phylogenetic relationships, and seeks the optimal tree only in the tree space described by the phylogenetic relationships. We describe exact polynomial-time algorithms for the knowledge-enhanced supertree problems and an effective heuristic algorithm for the knowledge-enhanced supermatrix problem based on maximum parsimony. Note that our efficient algorithms for the knowledge-enhanced versions of NP-hard supertree problems provide a solution for the original supertree problem when all possible phylogenetic relationships are part of the input. However, this does not yield an efficient solution for an NP-hard supertree problem, because the number of possible relationships grows exponentially with the number of taxa. We demonstrate that our knowledge-enhanced algorithms can improve upon commonly used heuristics for Robinson-Foulds supertrees [21], gene tree parsimony using duplication [17], duplication and loss [14], or deep coalescence cost functions [14], and maximum parsimony [18].


Related work. Knowledge-enhanced phylogenetic inference is similar to using topological constraints in phylogenetic inference. In typical phylogenetic constraint analyses, which can be applied to any supertree or supermatrix heuristic, partially resolved trees, clades, or fully resolved subtrees are passed as constraints to the heuristic’s local search method (e.g., RAxML [22], PAUP [19], and DupTree [17]). This forces the solution tree to agree with the constraints. Although such constraints can be effective, they may be inefficient: the constrained search must consider all possible topologies consistent with the constraints, rather than a specific set of possibly incompatible topologies. Furthermore, these constrained approaches are still heuristics and therefore may find suboptimal solutions.

In this work, we introduce the general concept of knowledge-enhanced versions of supertree problems and focus on four supertree problems: the Robinson-Foulds (RF), gene duplication, duplication and loss, and deep coalescence supertree problems. The RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances between every input tree and the supertree. This problem is NP-hard [5], and therefore it has been addressed by local search heuristics [21], [23]. Although the RF supertree heuristics allow for time-efficient estimation of RF supertrees for data sets with hundreds of taxa, experiments suggest that they often become stuck at locally optimal solutions [21], [23]. The gene tree parsimony supertree problems (gene duplication, duplication and loss, and deep coalescence problems) take a collection of gene trees and infer a species tree that implies the minimum number of conflict-causing events among gene trees [24], [25], [26], [27]. These problems are also NP-hard [6], but there exist effective and efficient heuristics to infer the species tree for the gene duplication [9], [28], duplication and loss [11], and deep coalescence [11] problems.


Our Contribution. We formally define the knowledge-enhanced RF, gene duplication, duplication and loss, and deep coalescence supertree problems and describe efficient algorithms to solve them exactly. Knowledge-enhanced supertree problems seek an optimal supertree for a given collection of input trees that can only be assembled from a given set of possibly contradictory phylogenetic relationships. These relationships can be described either as clusters or as sibling-clusters. Sibling-clusters are two disjoint clusters whose parent in a rooted phylogenetic tree is the union of these clusters. Such phylogenetic relationships could be obtained from previous phylogenetic analyses. We describe efficient algorithms that solve the knowledge-enhanced supertree problems for a collection of input trees over $n$ taxa with $m$ vertices in total, given either $c$ clusters or $t$ sibling-clusters, in $O(c^2 nm)$ and $O(tnm)$ time, respectively. We verify the performance of these algorithms on large empirical collections of trees and demonstrate how the algorithms can be used to improve on supertree estimates for these collections that were computed by existing heuristics. We also define a knowledge-enhanced version of the supermatrix problem based on parsimony and describe a heuristic algorithm to address this problem. Our heuristic may be used to obtain estimates for the MRP supertree problem [29], [30], [31], [32] or for maximum parsimony. We demonstrate that the heuristic can improve upon solutions obtained from commonly used maximum parsimony heuristics.

2 BASIC NOTATION AND PRELIMINARIES

Let $T$ be a rooted tree. We denote the vertex set, edge set, and leaf set of $T$ by $V(T)$, $E(T)$, and $Le(T)$, respectively. The root of $T$ is denoted by $Rt(T)$. Given a vertex $v \in V(T)$, we denote the children of $v$ by $Ch(v)$ and the parent of $v$ by $Pa(v)$. Two vertices in $T$ are called siblings (of each other) if they have the same parent. We write $(u, v)$ to denote the


edge $\{u, v\} \in E(T)$ where $u = Pa(v)$. The set of internal vertices of $T$, denoted by $I(T)$, is defined to be the set $V(T) \setminus Le(T)$. We call $T$ full binary if each vertex $v \in I(T)$ has exactly two children. Given a vertex set $U \subseteq V(T)$, we denote by $T(U)$ the unique subtree of $T$ that spans $U$ with the minimum number of vertices. Furthermore, the restriction of $T$ to $U$, denoted by $T|_U$, is the tree that is obtained from $T(U)$ by suppressing all non-root vertices of degree two. The subtree of $T$ rooted at $v \in V(T)$, denoted by $T_v$, is defined to be $T|_U$ for $U := \{u \in Le(T) \mid v \text{ is on the shortest path between } Rt(T) \text{ and } u\}$. Given a vertex $v \in V(T)$, we define the cluster of $v$ to be $C_T(v) := Le(T_v)$, and the cluster-presentation of $T$ is defined as $C(T) := \{C_T(v) \mid v \in V(T)\}$.

Let $X$ be a label set. A tree $T$ is called an $X$-tree if $Le(T) = X$, and a partial $X$-tree if $Le(T) \subseteq X$. The set of all $X$-trees is denoted by $\mathcal{T}(X)$, and the set of all $X$-trees that are full binary is denoted by $B(X)$.

The following conventions are used for convenience throughout the manuscript. We write the set operation $\cup$ as $\dot\cup$ when the sets involved are disjoint. Unless noted otherwise, the term tree refers to a rooted tree that has no vertices of degree two other than its root.

The Robinson-Foulds (RF) metric [33] is defined for two rooted trees over the same taxon set as the cardinality of the symmetric difference of their cluster-presentations. Given a collection of rooted trees, the RF supertree problem seeks a median tree under the RF measure. However, the RF metric is not defined for input trees with taxon sets that are properly contained in the taxon set of the resulting median tree. Therefore the minus RF distance is used, which is the RF distance when the trees involved are restricted to their shared taxon sets (e.g., [21], [23]).

Definition 2.1. [minus RF distance]


Let $S$ be an $X$-tree, and $T$ be a $Y$-tree where $Y \subseteq X$. We define $RF(T, S) := |C(T) \,\Delta\, C(S|_Y)|$, where the operator $\Delta$ denotes the symmetric difference. Given a set $P$ of partial $X$-trees, we define the minus RF distance from $P$ to $S$ as $RF(P, S) := \sum_{T \in P} RF(T, S)$, and the minus RF distance of $P$ in the scope of $B(X)$ as $RF(P) := \min_{S \in B(X)} RF(P, S)$.

Problem 2.2. [RF supertree (NP-hard [5])]
Instance: A set $P$ of partial $X$-trees.
Find: The dissimilarity score $RF(P)$ and a tree $S^* \in B(X)$ such that $RF(P) = RF(P, S^*)$.

2.1 The cluster-similarity problem

For convenience we introduce an equivalent maximization version of the RF supertree problem, called the cluster-similarity problem, which was implicitly used in [21]. This problem is based on the cluster-similarity measure, which we introduce first.

Definition 2.3. [c(luster)-similarity] Let $S$ be an $X$-tree and $T$ a $Z$-tree such that $Z \subseteq X$. We define the c(luster)-similarity from $T$ to $S$ as $R(T, S) := |C(T) \cap C(S|_Z)|$. Let $P$ be a set of full binary trees. We define (i) the c-similarity from $P$ to $S$ as $R(P, S) := \sum_{T \in P} R(T, S)$, and (ii) the c-similarity of $P$ in the scope of $B(X)$ for some label set $X$ as $R_X(P) := \max_{S \in B(X)} R(P, S)$.

Proposition 2.4. [conversion between c-similarity and RF] For $X$-trees $T$ and $S$, the RF measure and the c-similarity are related by $RF(T, S) = |C(T)| + |C(S)| - 2R(T, S)$.

Problem 2.5. [c-similarity]
Instance: A set $P$ of partial $X$-trees.
Find: The similarity score $R_X(P)$ and a tree $S^* \in B(X)$ such that $R_X(P) = R(P, S^*)$.
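The definitions of the minus RF distance and the c-similarity can be illustrated with a minimal Python sketch. The nested-tuple tree encoding and the helper names (`leaves`, `clusters`, `restrict`, `c_similarity`, `rf_minus`) are assumptions made for this illustration and are not taken from the authors' software.

```python
# Illustrative sketch only: a rooted binary tree is a nested tuple, a leaf is a string.

def leaves(t):
    """Le(T): the leaf set of tree t."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(child) for child in t))

def clusters(t):
    """C(T): the leaf set below every vertex of t, including leaves and the root."""
    if isinstance(t, str):
        return {frozenset([t])}
    out = {leaves(t)}
    for child in t:
        out |= clusters(child)
    return out

def restrict(t, keep):
    """T|_keep: drop leaves outside `keep` and suppress vertices left with one child."""
    if isinstance(t, str):
        return t if t in keep else None
    kids = [r for r in (restrict(child, keep) for child in t) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def c_similarity(T, S):
    """R(T, S) = |C(T) ∩ C(S|_Z)| with Z = Le(T)."""
    return len(clusters(T) & clusters(restrict(S, leaves(T))))

def rf_minus(T, S):
    """Minus RF distance |C(T) Δ C(S|_Z)| with Z = Le(T)."""
    return len(clusters(T) ^ clusters(restrict(S, leaves(T))))

T = (("a", "b"), "c")                # input tree over {a, b, c}
S = ((("a", "c"), "d"), "b")         # candidate supertree over {a, b, c, d}
S_restricted = restrict(S, leaves(T))
print(c_similarity(T, S), rf_minus(T, S))
```

On this small example the conversion of Proposition 2.4 can be checked directly by comparing `rf_minus(T, S)` with `len(clusters(T)) + len(clusters(S_restricted)) - 2 * c_similarity(T, S)`.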


[Figure 2.1: the trees $T$ and $S|_{\{a,b,c\}}$, with vertices labeled by their clusters.]

Figure 2.1: This example shows the c-similarity for the $Y$-tree $T$ and $X$-tree $S$ restricted to the set $Z := \{a, b, c\}$. The blue nodes indicate 4 common clusters $\{a\}$, $\{b\}$, $\{c\}$, and $\{a, b\}$, hence the restricted c-similarity is $R_Z(T, S) = 4$. These clusters imply an upper bound on the Robinson-Foulds distance for every $X$-tree $S$ containing the subtree $S|_Z$.

2.2 Refinements for the cluster-similarity problem

We introduce two problems that refine the solution space of the c-similarity problem based on some additional input. First we define the cluster-refined c-similarity problem, which considers only candidate trees with a cluster-presentation that is a subset of a given refining cluster set. Then we define the sibling-refined c-similarity problem, which further narrows down the refinement to candidate trees whose sibling-clusters all belong to a given set of sibling-clusters. We require that at least one candidate tree can be represented by the given refining sets, and call such refining sets complete. For each refined problem, we first introduce the definitions necessary to state the problem, and then define the problem.

Definition 2.6. [cluster-refined candidate trees] Let $Y$ be a label set, and $K$ be a set of subsets of $Y$. We define $BC(Y, K) := \{T \in B(Y) \mid C(T) \subseteq K\}$ and call the set $K$ cluster-complete when $BC(Y, K) \neq \emptyset$. Given a set $P$ of partial $X$-trees, we define the c-similarity of $P$ refined by $K$ with the scope of $B(Y)$ as $RC_Y(P, K) := \max_{S \in BC(Y, K)} R(P, S)$.


Problem 2.7. [cluster-refined c-similarity]
Instance: A set $P$ of partial $X$-trees and a cluster-complete set $K$ of subsets of $X$.
Find: The similarity score $RC_X(P, K)$ and a tree $S^* \in BC(X, K)$ where $RC_X(P, K) = R(P, S^*)$.

Definition 2.8. [sibling-refined candidate trees] The sibling-presentation of a full binary tree $T$ is defined as $Sb(T) := \{(C_T(u), C_T(v)) \mid u \text{ and } v \text{ are siblings in } T\}$. Let $Y$ be a label set and $L$ be a set of bi-partitions of subsets of $Y$. We define $BS(Y, L) := \{T \in B(Y) \mid Sb(T) \subseteq L\}$, and call $L$ sibling-complete when $BS(Y, L) \neq \emptyset$. Given a set $P$ of partial $X$-trees, we define the c-similarity of $P$ refined by $L$ with the scope of $B(Y)$ as $RS_Y(P, L) := \max_{T \in BS(Y, L)} R(P, T)$.

Problem 2.9. [sibling-refined c-similarity]
Instance: A set $P$ of partial $X$-trees and a sibling-complete set $L$ of bi-partitions of subsets of $X$.
Find: The similarity score $RS_X(P, L)$ and a tree $S \in BS(X, L)$ where $RS_X(P, L) = R(P, S)$.
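To make the candidate sets $BC(Y, K)$ and $BS(Y, L)$ concrete, the following Python sketch tests whether a given full binary tree is a cluster-refined or sibling-refined candidate. The nested-tuple tree encoding, the function names, and the choice to store each sibling-cluster as an unordered pair (a frozenset of two clusters) are assumptions of this illustration.

```python
# Illustrative sketch only: a rooted binary tree is a nested tuple, a leaf is a string.

def leaves(t):
    """Le(T): the leaf set of tree t."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(child) for child in t))

def clusters(t):
    """C(T): the leaf set below every vertex of t."""
    if isinstance(t, str):
        return {frozenset([t])}
    out = {leaves(t)}
    for child in t:
        out |= clusters(child)
    return out

def sibling_presentation(t):
    """Sb(T): one unordered pair of clusters for every pair of sibling vertices."""
    if isinstance(t, str):
        return set()
    left, right = t
    pair = frozenset([leaves(left), leaves(right)])
    return {pair} | sibling_presentation(left) | sibling_presentation(right)

def in_BC(t, K):
    """True if every cluster of t belongs to the refining cluster set K."""
    return clusters(t) <= set(K)

def in_BS(t, L):
    """True if every sibling-cluster of t belongs to the refining set L."""
    return sibling_presentation(t) <= set(L)

balanced = (("a", "b"), ("c", "d"))
K = clusters(balanced) | {frozenset("ac")}   # clusters of one tree plus an extra cluster
L = sibling_presentation(balanced)
print(in_BC(balanced, K), in_BS(balanced, L))
```

In this example `K` is cluster-complete and `L` is sibling-complete because the tree `balanced` itself is a candidate; a tree such as `(("a", "c"), ("b", "d"))` is rejected by `in_BC` because the cluster {b, d} is missing from `K`.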

3 STRUCTURAL PROPERTIES AND RECURRENCES OF THE C-SIMILARITY PROBLEMS

The c-similarity problem and the refined c-similarity problems share an optimal substructure. Similar substructures have been shown previously for the gene tree parsimony problems [34], [35], [36]. Here, we first phrase this optimal substructure for the c-similarity measure, and then we follow up with the recurrences for each of the c-similarity problems.

Definition 3.1. Let $T$ be a $Z$-tree and $Y$ be a label set. We define $\Gamma(T, Y) := 1$ if $(Z \cap Y) \in C(T)$, and $\Gamma(T, Y) := 0$ otherwise. For a set $P$ of partial $X$-trees we define


$\Gamma(P, Y) := \sum_{T \in P} \Gamma(T, Y)$. The optimal substructure follows the branching of a tree: a sub-problem can be defined for each subtree.

Proposition 3.2. [optimal substructure] Let $S^* \in B(X)$ and $P$ be a set of partial $X$-trees such that $R_X(P) = R(P, S^*)$. Then $R_Y(P) = R(P, S^*_y)$ for each $y \in Ch(Rt(S^*))$, where $Y := C_{S^*}(y)$.

Proof: Let $z$ be the sibling of $y$. Then we have $R(P, S^*) = R(P, S^*_y) + R(P, S^*_z) + \Gamma(P, X)$, where the last term accounts for the root cluster $X$ of $S^*$. For the purpose of a contradiction we assume that $R_Y(P) \neq R(P, S^*_y)$, from which $R_Y(P) > R(P, S^*_y)$ follows directly because $R_Y(P)$ is a maximum. Thus there is a tree $U \in B(Y)$ such that $R_Y(P) = R(P, U)$. Now we construct a new tree $S' \in B(X)$ by replacing in $S^*$ the subtree $S^*_y$ with the tree $U$. Since $R_Y(P) > R(P, S^*_y)$, we have $R(P, S^*) < R(P, S')$, from which it follows that $R_X(P) > R(P, S^*)$. This contradicts our assumption that $R_X(P) = R(P, S^*)$.

Now we describe the recurrences for the c-similarity problems introduced in Sections 2.1 and 2.2. The recurrences in this section follow from the optimal substructure given in Proposition 3.2. Let $P$ be a set of partial $X$-trees, $Y \subseteq X$, and $Y \neq \emptyset$.

\[
R_Y(P) =
\begin{cases}
\Gamma(P, Y), & \text{if } |Y| = 1; \\
\omega + \Gamma(P, Y), & \text{otherwise,}
\end{cases}
\]

where $\omega := \max_{(A,B) \in \Pi(Y)} \left( R_A(P) + R_B(P) \right)$ and $\Pi(Y)$ is the set of all non-trivial bi-partitions of $Y$. For a cluster-complete set $K$ we have the following recurrence.

\[
RC_Y(P, K) =
\begin{cases}
\Gamma(P, Y), & \text{if } |Y| = 1; \\
\omega + \Gamma(P, Y), & \text{otherwise,}
\end{cases}
\]

where $\omega := \max_{(A,B) \in \Pi(Y) \,:\, \{A,B\} \subseteq K} \left( RC_A(P, K) + RC_B(P, K) \right)$. For a sibling-complete set $L$ of bi-partitions of subsets of $Y$ we have the following recurrence.

\[
RS_Y(P, L) =
\begin{cases}
\Gamma(P, Y), & \text{if } |Y| = 1; \\
\omega + \Gamma(P, Y), & \text{otherwise,}
\end{cases}
\]

where $\omega := \max_{(A,B) \in L \,:\, A \dot\cup B = Y} \left( RS_A(P, L) + RS_B(P, L) \right)$.
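The cluster-refined recurrence can be evaluated literally with a memoized top-down recursion, as in the following Python sketch. The tree encoding, the function names (`make_rc`, `gamma`, `rc`), and the example data are assumptions of this illustration; the sketch evaluates the recurrence as written above and is not the authors' implementation.

```python
from functools import lru_cache
from math import inf

# Illustrative sketch only: a rooted binary tree is a nested tuple, a leaf is a string.

def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(child) for child in t))

def clusters(t):
    if isinstance(t, str):
        return {frozenset([t])}
    out = {leaves(t)}
    for child in t:
        out |= clusters(child)
    return out

def make_rc(P, K):
    """Return a function rc(Y) that evaluates the cluster-refined recurrence RC_Y(P, K)."""
    profile = [(leaves(T), clusters(T)) for T in P]
    K = frozenset(K)

    def gamma(Y):
        # Gamma(P, Y): number of input trees T (with leaf set Z) such that Z ∩ Y is in C(T).
        return sum(1 for Z, C in profile if (Z & Y) in C)

    @lru_cache(maxsize=None)
    def rc(Y):
        if len(Y) == 1:
            return gamma(Y)
        best = -inf                       # stays -inf if Y cannot be split inside K
        for A in K:
            B = Y - A
            if A and A < Y and B in K:    # (A, B) is a bi-partition of Y with both parts in K
                best = max(best, rc(A) + rc(B))
        return best + gamma(Y)

    return rc

P = [(("a", "b"), "c"), (("a", "b"), "d"), (("c", "d"), "a")]
K = {frozenset(s) for s in ("a", "b", "c", "d", "ab", "cd", "ac", "abc", "abcd")}
rc = make_rc(P, K)
print(rc(frozenset("abcd")))
```

A cluster that cannot be assembled from `K` keeps the score `-inf`, which mirrors the requirement that the refining set be cluster-complete for the full label set.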

4 DYNAMIC PROGRAMMING FOR RF AND C-SIMILARITY SUPERTREE PROBLEMS

By Proposition 2.4, a solution for the c-similarity supertree problems also provides a solution for the equivalent RF supertree problems. Here we present a dynamic programming (DP) solution for the c-similarity problem and the refined c-similarity problems introduced in Sections 2.1 and 2.2. First, we describe the algorithm for the recurrence of the sibling-refined c-similarity problem, and then, based on this algorithm, we derive the solutions for the other c-similarity problems. Algorithm 1 computes the recurrence of the sibling-refined c-similarity problem.

Theorem 4.1. [sibling-refined c-similarity] Let $P$ be a set of partial $X$-trees and $L$ be a sibling-complete set of bi-partitions of subsets of $X$. The sibling-refined c-similarity problem, Problem 2.9, can be solved in $O(tnm)$ time, where $t = |L|$, $n = |X|$, and $m = |V(P)|$.

Proof: Let $S$ be the sibling-refined RF supertree that is constructed by following the recursive DP structure of the sibling-refined c-similarity recurrence. Then $S$ is a solution for Problem 2.9.


Algorithm 1 This dynamic program (DP) solves the sibling-refined c-similarity problem, Problem 2.9, by computing the DP table $R$. The table functions as a lookup table and stores, for all clusters $Y \in \{A \dot\cup B \mid (A, B) \in L\}$ of the input, the restricted c-similarity score $RS_Y(P, L)$.

Input:
    Set $P$ of partial $X$-trees
    Sibling-complete set $L$ of bi-partitions of subsets of $X$
Output:
    The table $R$ of restricted c-similarity scores $RS_Y(P, L)$, where $Y$ is an element of the $X$-complete set $\{A \dot\cup B \mid (A, B) \in L\}$
Algorithm:
    Part 1. Initialize the DP table $R$ for trivial clusters.
        $R \leftarrow \emptyset$ with the default value of $-\infty$ for unassigned rows
        for $v \in X$ do
            $R[\{v\}] \leftarrow \Gamma(P, \{v\})$
        end for
    Part 2. Compute the DP table $R$ for non-trivial clusters.
        $(A_1, B_1), (A_2, B_2), \ldots, (A_t, B_t)$ is a sorted list of $L$
            such that $|A_i \dot\cup B_i| \le |A_{i+1} \dot\cup B_{i+1}|$ for $1 \le i < t$    // (1)
        for $i := 1$ to $t$ do
            $R[A_i \dot\cup B_i] \leftarrow \max(R[A_i \dot\cup B_i],\; R[A_i] + R[B_i] + \Gamma(P, A_i \dot\cup B_i))$    // (2)
        end for
        return $R$

Complexity: The computational complexity of Algorithm 1 is dominated by computing the restricted c-similarity for non-trivial clusters. Part 1: Computing the restricted c-similarity for trivial clusters takes $O(m)$ time, by counting the different leaf nodes in $P$. Part 2: Sorting $L$ takes $O(tn)$ time. The access time for an element in $R$ is $O(n)$, and $\Gamma(P, A \dot\cup B)$ is computed in $O(nm)$ time. So computing the restricted c-similarity for a non-trivial cluster in the line marked (2) takes $O(nm)$ time in a loop of $O(t)$ iterations. Consequently the time complexity of Algorithm 1 is $O(tnm)$.

The DP algorithm for Theorem 4.1 can be adapted to also solve Problem 2.7 and Problem 2.5 by constructing a suitable sibling-complete set for each problem. Table 1 summarizes the different inputs.
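Algorithm 1 can be sketched in Python as a bottom-up table fill over the sibling-clusters sorted by the size of their union. The tree encoding and all function names are assumptions of this illustration, and the back-pointer table used to recover a supertree is an addition that does not appear in the pseudocode. The sketch recomputes $\Gamma$ naively and therefore does not aim to match the stated time bounds.

```python
from math import inf

# Illustrative sketch only: a rooted binary tree is a nested tuple, a leaf is a string.

def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(child) for child in t))

def clusters(t):
    if isinstance(t, str):
        return {frozenset([t])}
    out = {leaves(t)}
    for child in t:
        out |= clusters(child)
    return out

def gamma(P, Y):
    """Gamma(P, Y): number of input trees T (leaf set Z) with Z ∩ Y in C(T)."""
    return sum(1 for T in P if (leaves(T) & Y) in clusters(T))

def sibling_refined_dp(P, X, L):
    """Fill the DP table R of Algorithm 1; `back` remembers the best split per cluster."""
    R = {frozenset([v]): gamma(P, frozenset([v])) for v in X}      # Part 1
    back = {}
    for A, B in sorted(L, key=lambda ab: len(ab[0] | ab[1])):      # Part 2, step (1)
        Y = A | B
        score = R.get(A, -inf) + R.get(B, -inf) + gamma(P, Y)      # step (2)
        if score > R.get(Y, -inf):
            R[Y] = score
            back[Y] = (A, B)
    return R, back

def build(Y, back):
    """Recover a tree for cluster Y from the back-pointers (assumed helper)."""
    if len(Y) == 1:
        return next(iter(Y))
    A, B = back[Y]
    return (build(A, back), build(B, back))

f = frozenset
P = [(("a", "b"), "c"), (("a", "b"), "d"), (("c", "d"), "a")]
L = [(f("a"), f("b")), (f("c"), f("d")), (f("ab"), f("c")),
     (f("ab"), f("cd")), (f("abc"), f("d"))]
R, back = sibling_refined_dp(P, "abcd", L)
print(R[f("abcd")], build(f("abcd"), back))
```

Sorting by union size is sufficient because both parts of a sibling-cluster are strictly smaller than their union, so their table entries are final before the union is scored.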


Table 1: The input for Algorithm 1 for the exact and refined c-similarity supertree problems: (a) Problem 2.5, (b) Problem 2.7, and (c) Problem 2.9. $X$ is the taxon set, $\Pi(Y)$ is the set of all non-trivial bi-partitions of $Y$, $\mathcal{P}(X)$ is the power set of $X$, $K$ is a cluster-complete set of subsets of $X$, and $L$ is a sibling-complete set of bi-partitions of subsets of $X$.

supertree problem      sibling-complete set                                          size
(a) exact              $\{\Pi(Y) \mid Y \in \mathcal{P}(X), Y \neq \emptyset\}$      $O(3^{|X|})$
(b) cluster-refined    $\{(A, B) \mid A, B \in K, A \cap B = \emptyset\}$            $O(|K|^2)$
(c) sibling-refined    $L$                                                           $|L|$

Theorem 4.2. [cluster-refined c-similarity] Let $P$ be a set of partial $X$-trees and $K$ be a cluster-complete set of subsets of $X$. The cluster-refined c-similarity problem can be solved in $O(c^2 nm)$ time, where $c = |K|$, $n = |X|$, and $m = |V(P)|$.

Proof: Let $S$ be a solution $X$-tree for Problem 2.7, i.e., a full binary tree $S \in BC(X, K)$ such that $RC_X(P, K) = R(P, S)$. It follows that $C(S) \subseteq K$, so for each pair of siblings $u, v \in V(S)$ there exists $(A, B)$ such that $A, B \in K$, $C_S(u) = A$, and $C_S(v) = B$. Construct $L := \{(A, B) \mid A, B \in K, A \cap B = \emptyset\}$ as the input of Algorithm 1; it follows that $Sb(S) \subseteq L$. Complexity: The size of $L$ is $O(c^2)$. It follows that the computational complexity for Problem 2.7 is bounded by $O(c^2 nm)$ time.

Theorem 4.3. [c-similarity] Let $P$ be a set of partial $X$-trees. The c-similarity problem can be solved in $O(3^n nm)$ time, where $n = |X|$ and $m = |V(P)|$.

Proof: Let $S$ be a solution $X$-tree for Problem 2.5, i.e., a full binary tree $S \in B(X)$ such that $R_X(P) = R(P, S)$. Let $\mathcal{P}(X)$ be the power set of $X$. Construct $L := \{\Pi(Y) \mid Y \in \mathcal{P}(X), Y \neq \emptyset\}$ as the input of Algorithm 1. For all $T \in B(X)$, $Sb(T) \subseteq L$.


It follows that $Sb(S) \subseteq L$. Complexity: There are $\frac{1}{2}(3^n - 2^{n+1} + 1) = O(3^n)$ unique sibling-clusters in $L$. It follows that the computational complexity for Problem 2.5 is bounded by $O(3^n nm)$ time.

To our knowledge, the best-known (naive) exhaustive search algorithm for the RF supertree problem requires searching through all $(2n - 3)!!$ possible rooted supertrees. Thus, the naive approach solves the RF supertree problem in $O((2n - 3)!!\, nm)$ time. We obtain a speed-up of $\Theta\left(\sqrt{n^n}/3^n\right)$ as follows:

\[
\frac{(2n-3)!!\, nm}{3^n\, nm} = \frac{(2n-3)!!}{3^n} \ge \frac{n^{n/2} - 4}{3^n} = \Theta\!\left(\frac{\sqrt{n^n}}{3^n}\right), \quad \text{for } n > 0.
\]
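The two counting claims above can be checked numerically for small $n$ with the following Python sketch. The function names are assumptions of this illustration; the brute-force enumeration assigns each taxon to part $A$, part $B$, or neither, and counts unordered pairs of non-empty disjoint parts.

```python
from itertools import product

def count_sibling_clusters(n):
    """Brute-force count of unordered pairs {A, B} of disjoint, non-empty subsets of n taxa."""
    ordered = 0
    for assignment in product((0, 1, 2), repeat=n):   # 0: unused, 1: in A, 2: in B
        if 1 in assignment and 2 in assignment:
            ordered += 1
    return ordered // 2                               # (A, B) and (B, A) are the same pair

def closed_form(n):
    """The closed form (3^n - 2^(n+1) + 1) / 2 stated in the text."""
    return (3 ** n - 2 ** (n + 1) + 1) // 2

def double_factorial(k):
    """k!! for odd k, with the convention (-1)!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

for n in range(1, 8):
    print(n, count_sibling_clusters(n), closed_form(n), double_factorial(2 * n - 3))
```

The printed rows let the enumeration be compared with the closed form, and the last column shows how quickly $(2n-3)!!$ outgrows $3^n$; the inequality $n^{n/2} - 4 \le (2n-3)!!$ can be checked in the same loop.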

Different supertree objectives. Our DP algorithms for the efficient RF supertree search can also be applied directly to the gene duplication, gene duplication and loss, and deep coalescence cost models, as well as to other supertree objectives. To apply our DP approach, only the objective-specific definition of the $Y$-restricted score for the subtree $S|_Y$ has to be modified. For gene duplication, this restricted score is the number of duplications induced by $S|_Y$; for gene duplication and loss, it is the number of duplications plus losses induced by $S|_Y$; and for deep coalescence, it is the number of embedded lineages induced by $S|_Y$. For the recurrence we now define $\Gamma_D$ (for gene duplication), $\Gamma_{DL}$ (for gene duplication and loss), and $\Gamma_{DC}$ (for deep coalescence). Let $T$ be an $X$-tree, and let $Y \subseteq X$ be a label set; then

\[
\Gamma(T, Y) = \sum_{(X_1, X_2) \in Sb(T),\; A \dot\cup B = Y} \Gamma_{D/DL/DC}((X_1, X_2), A \dot\cup B),
\]

where

$\Gamma_D((X_1, X_2), A \dot\cup B) := 1$ if $A_1 \cup A_2 \subseteq X_1 \cup X_2 \wedge \exists i : A_i \not\subseteq X_1 \wedge A_i \not\subseteq X_2$,

$\Gamma_L((X_1, X_2), A \dot\cup B) := 1$ if $\exists i, j : A_i \subseteq X_j \wedge A_{i+1} \subseteq X_1 \cup X_2 \Rightarrow (A_{i+1} \cap X_1 = \emptyset = A_{i+1} \cap X_2)$,

$\Gamma_{DL}((X_1, X_2), A \dot\cup B) := \Gamma_D((X_1, X_2), A \dot\cup B) + \Gamma_L((X_1, X_2), A \dot\cup B)$,

$\Gamma_{DC}((X_1, X_2), A \dot\cup B) := 1$ if $\exists i, j : (A_i \not\subseteq X_j \Rightarrow A_i \subseteq X_1 \cup X_2) \wedge A_{i+1} \not\subseteq X_1 \cup X_2$, and

$\Gamma_{D/DL/DC}((X_1, X_2), A \dot\cup B) := 0$ otherwise.

For simplicity, $i + 1$ (and similarly $j + 1$) is 1 if $i = 2$. Software that implements our algorithm for these objectives is available from the authors.

5 MAXIMUM PARSIMONY

We introduce the knowledge-enhanced maximum parsimony problem, which is similar to the knowledge-enhanced RF supertree problem in Section 2. However, in contrast to the RF supertree problem, we describe a heuristic instead of an exact algorithm, since the optimal substructure for supermatrix-based problems does not directly lead to efficient and exact knowledge-enhanced versions of this problem. Our algorithm is built upon the optimal substructure provided by Sankoff [37], [38]. Maximum parsimony counts the minimum number of evolutionary changes implied by a phylogenetic tree for a specific site in a phylogenetic character matrix. For a set of character states (e.g., 4 nucleotides), a cost matrix $Q(i, j)$ defines the transition cost from state $i$ to state $j$.

Definition 5.1. [parsimony] Let $S$ be an $X$-tree, $Q$ a cost matrix, and $A(v)$ the state for $v \in V(S)$. The parsimony

score for $A$ and $S$ is defined as $PA(A, S) := \sum_{(u,v) \in E(S)} Q(A(u), A(v))$. Let $E$ be the states for the label set $X$, and let $\mathcal{A}(S)$ be the collection of all sets of states on $V(S)$. Maximum parsimony for $E$ and $S$ is defined as $PA(E, S) := \min_{A \in \mathcal{A}(S) : E \subseteq A} PA(A, S)$, and the maximum parsimony for $E$ in the scope of $B(X)$ is defined as $PA(E) := \min_{S \in B(X)} PA(E, S)$. Given the sequences $\mathcal{E}$ of states for the label set $X$, we define maximum parsimony for $\mathcal{E}$ and $S$ as $PA(\mathcal{E}, S) := \sum_{E \in \mathcal{E}} PA(E, S)$, and the maximum parsimony for $\mathcal{E}$ in the scope of $B(X)$ is defined as $PA(\mathcal{E}) := \min_{S \in B(X)} PA(\mathcal{E}, S)$.

Problem 5.2. [single site maximum parsimony]
Instance: A cost matrix $Q$ and the states $E$ for the label set $X$.
Find: The parsimony score $PA(E, S)$ and a full binary tree $S \in B(X)$ such that $PA(E) = PA(E, S)$.

Problem 5.3. [maximum parsimony for sequences]
Instance: A cost matrix $Q$ and sequences $\mathcal{E}$ of states for the label set $X$.
Find: The parsimony score $PA(\mathcal{E}, S)$ and a full binary tree $S \in B(X)$ such that $PA(\mathcal{E}) = PA(\mathcal{E}, S)$.

Next, we provide the optimal substructure for the maximum parsimony problems on a single character (a column in a character matrix), i.e., Problem 5.2 and its refined variations.

5.1 Solution for the single site maximum parsimony problem

Refinements for the single site maximum parsimony problem

Problem 5.4. [sibling-refined maximum parsimony]
Instance: A cost matrix $Q$, the states $E$ for the label set $X$, and a sibling-complete set $L$ of bi-partitions of subsets of $X$.
Find: The parsimony score $PA(E, S)$ and a full binary tree $S \in BS(X, L)$ such that $PA(E) = PA(E, S)$.


Problem 5.5. [cluster-refined maximum parsimony]
Instance: A cost matrix $Q$, the states $E$ for the label set $X$, and a cluster-complete set $K$ of subsets of $X$.
Find: The parsimony score $PA(E, S)$ and a full binary tree $S \in BC(X, K)$ such that $PA(E) = PA(E, S)$.

Structural properties and recurrences of the single site maximum parsimony problems

Definition 5.6. [restricted parsimony] Let $S$ be an $X$-tree, $E$ be the states for $X$, and $Z \subseteq X$. The maximum parsimony for $E$ and $S$ restricted by $Z$ is defined as $PA_Z(E, S) := PA(E, S|_Z)$. The maximum parsimony for $E$ restricted by $Z$ in the scope of $B(X)$ is defined as $PA_Z(E) := \min_{S \in B(X)} PA_Z(E, S|_Z)$.

Proposition 5.7. [optimal substructure] Let $E$ be the states for the label set $X$, and $S$ be a full binary $X$-tree such that $PA_X(E) = PA_X(E, S)$. Then $PA_Z(E) = PA_Z(E, S_v)$ where $Z = Le(S_v)$ for any vertex $v \in V(S)$.

Recurrences for the maximum parsimony problems

The recurrences for the maximum parsimony problems follow from the optimal substructure given in Proposition 5.7 and the Sankoff parsimony algorithm [37], [38]. The original Sankoff parsimony algorithm counts the number of evolutionary state changes for a specific site in a phylogenetic tree. A cost vector stores the minimum number of evolutionary state changes up to a node for all possible state outcomes. The Sankoff algorithm calculates the cost vectors at each node, moving from the leaves towards the root.

Proposition 5.8. [Sankoff’s parsimony recurrence] We define the initial cost vector as $\mathbf{0}(j)^{[i]} := 0$ if $j = i$, and $\mathbf{0}(j)^{[i]} := \infty$ otherwise. Given the states $E$ for the label set $X$, the cost vector of a leaf $v \in Le(S)$ is initially $A(v) := \mathbf{0}(E(v))$. The cost vector $A(v)$ of an internal node $v \in V(S) \setminus Le(S)$ is then computed for each character state $i = 1, \ldots, c$ recursively as follows:

\[
A(v)^{[i]} = \min_{1 \le j \le c} \left( Q(j, i) + A(u)^{[j]} \right) + \min_{1 \le j \le c} \left( Q(j, i) + A(w)^{[j]} \right),
\]

where $\{u, w\} = Ch_S(v)$. Similar to Sankoff’s parsimony recurrence, we can define the recurrences for the maximum parsimony problem and its cluster- and sibling-refinements.

Proposition 5.9. [maximum parsimony recurrence for a single site]

\[
R_Y(P) =
\begin{cases}
\mathbf{0}(E(v)), & \text{if } |Y| = 1,\\
\omega, & \text{otherwise,}
\end{cases}
\]

where

\[
\omega[i] = \min_{(A,B) \in \Pi(Y)} \left( \min_{1 \le j \le c} \bigl( Q(j, i) + PA_A(E)[j] \bigr) + \min_{1 \le j \le c} \bigl( Q(j, i) + PA_B(E)[j] \bigr) \right)
\]

is the i-th state of the Sankoff cost-vector, $\Pi(Y)$ is the set of all non-trivial bi-partitions for $Y$, and $c$ is the number of states. We have the following cluster-refined recurrence (PAC) for a cluster-complete set $K$:

\[
PAC_Y(E, K) =
\begin{cases}
\mathbf{0}(E(v)), & \text{if } |Y| = 1,\\
\omega, & \text{otherwise,}
\end{cases}
\]

where

\[
\omega[i] = \min_{(A,B) \in \Pi(Y) \,:\, \{A,B\} \subseteq K} \left( \min_{1 \le j \le c} \bigl( Q(j, i) + PAC_A(E, K)[j] \bigr) + \min_{1 \le j \le c} \bigl( Q(j, i) + PAC_B(E, K)[j] \bigr) \right)
\]

is the i-th state of the Sankoff cost-vector, and $c$ is the number of states. And, we have the following sibling-refined recurrence (PAS) for a sibling-complete set $L$:

\[
PAS_Y(E, L) =
\begin{cases}
\mathbf{0}(E(v)), & \text{if } |Y| = 1,\\
\omega, & \text{otherwise,}
\end{cases}
\]

where

\[
\omega[i] = \min_{(A,B) \in L \,:\, A \,\dot{\cup}\, B = Y} \left( \min_{1 \le j \le c} \bigl( Q(j, i) + PAS_A(E, L)[j] \bigr) + \min_{1 \le j \le c} \bigl( Q(j, i) + PAS_B(E, L)[j] \bigr) \right)
\]

is the i-th state of the Sankoff cost-vector, and $c$ is the number of states.

5.2 Heuristic for the maximum parsimony problem for sequences

In this section we provide a heuristic for Problem 5.3 and the following refined variations.

Refinements for the maximum parsimony problem

Problem 5.10. [sibling-refined maximum parsimony for sequences]
Instance: A cost matrix $Q$, sequences $E$ of states for the label set $X$, and a sibling-complete set $L$ of bi-partitions of subsets of $X$.
Find: The parsimony score $PA(E, S)$ and a full binary tree $S \in B_S(X, L)$, such that $PA(E) = PA(E, S)$.

Problem 5.11. [cluster-refined maximum parsimony for sequences]
Instance: A cost matrix $Q$, sequences $E$ of states for the label set $X$, and a cluster-complete set $K$ of subsets of $X$.
Find: The parsimony score $PA(E, S)$ and a full binary tree $S \in B_C(X, K)$, such that $PA(E) = PA(E, S)$.
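Both refinements build on the single-site recurrences of Proposition 5.9. As a rough illustration, the cluster-refined recurrence PAC for one site can be sketched in Python as follows. The sketch assumes that states are encoded as integers 0, ..., c-1, that Q is a nested list with Q[j][i] the cost of a change from state j to state i, and that K is a collection of non-empty frozensets. The function names (leaf_vector, merge, pac) are illustrative and are not taken from the authors' implementation.

```python
from functools import lru_cache

INF = float("inf")


def leaf_vector(state, c):
    """Initial cost vector: 0 for the observed state, infinity otherwise."""
    return tuple(0 if i == state else INF for i in range(c))


def merge(Q, a_u, a_w):
    """Sankoff step: cost vector of a parent from the vectors of its two children."""
    c = len(Q)
    return tuple(
        min(Q[j][i] + a_u[j] for j in range(c))
        + min(Q[j][i] + a_w[j] for j in range(c))
        for i in range(c)
    )


def pac(Q, E, K, X):
    """Cluster-refined cost vector for the label set X at a single site.

    E maps each label to its state index; K is a collection of frozensets.
    """
    c = len(Q)
    clusters = set(K)

    @lru_cache(maxsize=None)
    def solve(Y):
        if len(Y) == 1:
            (v,) = Y
            return leaf_vector(E[v], c)
        best = [INF] * c
        for A in clusters:
            B = Y - A
            # (A, B) must be a non-trivial bi-partition of Y with both parts in K
            if A and A < Y and B in clusters:
                vec = merge(Q, solve(A), solve(B))
                best = [min(b, x) for b, x in zip(best, vec)]
        return tuple(best)

    return solve(frozenset(X))
```

For a single site, the parsimony score is the smallest entry of the returned root cost vector. Memoizing on the cluster Y means that each cluster is solved only once, however many bi-partitions refer to it.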


Heuristic

We apply our standard DP Algorithm 1 to the maximum parsimony problems, but the table R stores restricted maximum parsimony scores for clusters. A cluster with a low, or even the lowest, restricted score might not be the one we should remember in R, but we assume that a cluster with a low score is part of the optimal solution. For this reason, we keep a record of the t clusters with the lowest restricted scores for each cluster.
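The text does not spell out the exact bookkeeping, so the following Python sketch shows only one possible reading: a beam-style variant of the cluster DP in which the table keeps, for every cluster, the t candidate subtrees with the lowest restricted scores. It assumes that trees are nested tuples of label strings, that seqs maps each label to a list of integer state indices, and that Q[j][i] is the cost of a change from state j to state i. All names (site_vector, restricted_score, top_t_search) are hypothetical.

```python
import heapq

INF = float("inf")


def site_vector(tree, site, seqs, Q):
    """Sankoff cost vector at the root of `tree` for one site."""
    c = len(Q)
    if isinstance(tree, str):  # leaf: observed state costs 0, all others infinity
        return [0 if i == seqs[tree][site] else INF for i in range(c)]
    a = site_vector(tree[0], site, seqs, Q)
    b = site_vector(tree[1], site, seqs, Q)
    return [
        min(Q[j][i] + a[j] for j in range(c)) + min(Q[j][i] + b[j] for j in range(c))
        for i in range(c)
    ]


def restricted_score(tree, seqs, Q):
    """Parsimony score of a (sub)tree: sum over sites of the best root state."""
    n_sites = len(next(iter(seqs.values())))
    return sum(min(site_vector(tree, s, seqs, Q)) for s in range(n_sites))


def top_t_search(X, K, seqs, Q, t):
    """Return up to t (score, tree) pairs for the label set X, best first."""
    clusters = set(K)
    table = {}  # plays the role of R: cluster -> list of (score, subtree)

    def solve(Y):
        if Y not in table:
            if len(Y) == 1:
                (v,) = Y
                table[Y] = [(0, v)]
            else:
                scored = {}
                for A in clusters:
                    B = Y - A
                    # min(A) < min(B) visits each unordered bi-partition once
                    if A and A < Y and B in clusters and min(A) < min(B):
                        for _, left in solve(A):
                            for _, right in solve(B):
                                tree = (left, right)
                                scored[tree] = restricted_score(tree, seqs, Q)
                table[Y] = heapq.nsmallest(
                    t, ((s, tr) for tr, s in scored.items()), key=lambda p: p[0]
                )
        return table[Y]

    return solve(frozenset(X))
```

With t = 1 this reduces to a greedy DP that remembers only the best subtree per cluster. Larger values of t trade running time for a wider set of retained candidates. Re-scoring each candidate subtree from scratch keeps the sketch short but is not efficient.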

6 EXPERIMENTS

There are numerous potential applications of knowledge-enhanced supertrees. The initial source tree data can come from any type of analysis of any data. We focus our experiments on one specific application: using knowledge-enhanced algorithms to improve upon existing heuristics. In short, we use existing heuristics to create the input source trees for knowledge-enhanced algorithms. Existing heuristics for many supertree and supermatrix phylogenetic analyses often appear to find credible solutions, but they are not guaranteed to find an optimal tree. The only way to demonstrate that the results from a heuristic are suboptimal is to find a better tree. In our supertree experiments we demonstrate that implementations of our knowledge-enhanced algorithms can rapidly improve upon heuristics for the RF, gene duplication, duplication and loss, and deep coalescence problems. Furthermore, our heuristic algorithm for the knowledge-enhanced MP problem can improve upon results from commonly used MP supermatrix heuristics. The experiments were performed on an Intel® Core™ i7 CPU with 2.80 GHz.

6.1 Supertree experiments

To test our algorithms for the knowledge-enhanced supertree problems, we used one published data set from marsupials [39] and a larger data set of trees from the gymnosperm plant clade. The input trees for the gymnosperm analysis were made from maximum likelihood analyses of the gene alignments found in [40], implemented in RAxML [20]. The maximum likelihood analyses in RAxML used the GTRCAT model. All the input tree data sets are available from the Dryad data repository (http://datadryad.org).

First, we had to build the source trees for the knowledge-enhanced supertree analysis using existing heuristics for the RF, gene duplication, gene duplication and loss, and deep coalescence supertree methods. In all the experiments, for each data set we first built 10 different starting trees using random stepwise addition. We then inferred 25 supertrees from each starting tree using a local search heuristic. This produced 250 supertrees for each supertree problem (RF, gene duplication, gene duplication and losses, and deep coalescence). We used the software RF-SPR [21] for RF supertree searches, DupTree [28], [17] for gene duplications, and iGTP [14] for gene duplication and losses and for deep coalescence.

We implemented algorithms to obtain the exact solutions for the knowledge-enhanced RF, gene duplication, duplication and loss, and deep coalescence supertree problems. For each supertree problem, we ran the knowledge-enhanced algorithms using the 250 supertrees built by the corresponding supertree method as the sibling-constraint set. We compared the optimality score of the knowledge-enhanced supertrees to that of the original supertrees built by the local search heuristics.

RF experiment

RF-SPR computes 25 RF supertrees using a ratchet heuristic (see [21], [41]). Building 250 supertrees took 6.5 hours for the marsupial data set and 10 days for the gymnosperm data set. For both the marsupial and gymnosperm data sets, the knowledge-enhanced supertree had a better RF score than the original supertrees built by the local search heuristic (Table 2). The sibling-refinement method runs took 18 seconds for the marsupial data set and 1.5 minutes for the gymnosperm data set.

Gene duplication experiment

For the gene duplication analyses we used the enhanced version of DupTree [28], [17] to obtain the sibling-constraint supertrees. Building the 250 initial supertrees took 1 minute for the marsupial data set and 9 minutes for the gymnosperm data set. Again, for both data sets the knowledge-enhanced algorithm produced a supertree with a better score than the best supertrees resulting from the original local search (Table 2). The sibling-refinement method runs took 17 seconds for the marsupial analysis and 27 seconds minutes for the gymnosperm data set.

Gene duplication and losses experiment

Due to run-time limitations, we could only run the local SPR search heuristic for the gene duplication and loss problem implemented in iGTP for the marsupial data set. Still, the analyses of 250 supertrees for gene duplication and losses took 26 hours. In contrast, the knowledge-enhanced algorithm runs took 8 seconds. The supertree resulting from the knowledge-enhanced analysis had a better score than the supertree from the local search heuristic, but the improvement was not large (Table 2).

Deep coalescence

As in the previous experiment, we could run the local search heuristic for the deep coalescence supertree problem using iGTP only for the marsupial data set. The analyses of 250 supertrees for deep coalescence took 26 hours, and the knowledge-enhanced algorithm also took 8 seconds. The knowledge-enhanced algorithm produced a supertree with a slightly better score than the local search heuristic (Table 2).
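The step of turning a collection of source supertrees into a constraint set can be illustrated with a short Python sketch. It assumes that each rooted binary tree is given as a nested tuple of label strings, and it collects both the sibling bi-partitions (the two child clusters at every internal node) and the clusters themselves. The representation and the function names are assumptions made for illustration; the formal definitions of sibling-complete and cluster-complete sets are given earlier in the paper and are not re-checked here.

```python
def leaf_set(tree):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def constraint_sets(trees):
    """Collect sibling bi-partitions and clusters from a list of binary trees."""
    siblings = set()  # each entry is an unordered pair {A, B} of child clusters
    clusters = set()  # each entry is the leaf set below some node
    for tree in trees:
        stack = [tree]
        while stack:
            node = stack.pop()
            clusters.add(leaf_set(node))
            if not isinstance(node, str):
                left, right = node
                siblings.add(frozenset([leaf_set(left), leaf_set(right)]))
                stack.extend([left, right])
    return siblings, clusters
```

Because the sets are unions over all source trees, relationships from different, possibly incompatible, trees end up side by side in the same constraint set, and duplicates across trees collapse automatically.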


Table 2: Comparison of the optimality scores from local search heuristics and the knowledge-enhanced algorithms. 1st-row value: the best score of the knowledge-enhanced algorithm and the improvement (in % relative to the score of heuristic search). 2nd-row value: the best score of 250 heuristic local searches. The scores represent the RF distance, or minimum number of implied duplications, duplications and losses, or deep coalescence events respectively. In all cases, lower scores denote better trees.

Data set                       Marsupial                Gymnosperm
                               (272 taxa; 158 trees)    (950 taxa; 78 trees)
rooted RF                      1492 (1.19%)             4213 (2.99%)
                               1510                     4343
gene duplications              623 (0.16%)              1468 (1.28%)
                               624                      1487
gene duplications and losses   3309 (0.21%)             n/a
                               3316
deep coalescence               1345 (0.07%)             n/a
                               1346

6.2 Supermatrix experiments

We next tested whether our heuristic for the knowledge-enhanced parsimony problem can improve upon results from commonly used maximum parsimony (MP) heuristics. We used a large supermatrix from the Saxifragales plant clade (Soltis et al., in review). The supermatrix was made by concatenating 51 gene alignments with a total of 2762 sequences, all obtained from GenBank. In total, the supermatrix has 950 species and is 48,465 characters in length. The supermatrix is 5.17% filled, or has 94.83% missing data.

First, we computed an input set of MP trees using the parsimony heuristics implemented in RAxML [20] and TNT [18]. Note that for RAxML this is the parsimony starting tree; we did not perform a maximum likelihood analysis on this data set. For RAxML, we performed 100 MP tree searches, which took only 10 minutes. The tree with the best parsimony score across all runs had a parsimony score of 36779. For TNT, we performed 10 runs using the standard parameters provided by TNT. The analyses took 1.6 hours. Each TNT run returned between 2 and 4 trees, for a total of 26 MP trees in the 10 runs. The best tree had a score of 36810.

Table 3: Maximum parsimony scores for the Saxifragales supermatrix experiment. The numbers represent the parsimony score (minimum number of character changes), and a smaller score is better. 1st-row value: the best score of the knowledge-enhanced search and the improvement (in % relative to best tree from RAxML or TNT). 2nd-row value: the best score from the TNT searches. 3rd-row value: the best score from RAxML analyses.

Data set              Saxifragales
                      (959 taxa; 48465 alignment length)
Knowledge-enhanced    36681 (0.27%)
TNT                   36810
RAxML MP              36779

As input for the knowledge-enhanced heuristic, we formed a population of the best MP trees. We initially used the 100 best MP trees (without duplication) from the TNT or RAxML runs as the sibling/cluster source pool. In each search step, 5 trees were randomly picked as input for the sibling-refinement method with t := 3, and the resulting tree was added to the source pool. When the population of the source pool exceeded 100 trees, the trees with the worst parsimony scores were discarded, leaving 100 trees. After 200 replicates of this process, which took 49 minutes, our heuristic had improved the parsimony score by 118 relative to the best trees found by RAxML or TNT (Table 3).
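The pool-based search loop described above can be sketched in Python as follows. The callables refine (standing in for the sibling-refinement method applied to the picked trees) and score (the parsimony score of a tree) are placeholders supplied by the caller, and the function name, the fixed random seed, and the choice to skip candidates already present in the pool are assumptions of this sketch rather than details taken from the paper.

```python
import random


def knowledge_enhanced_search(pool, refine, score, steps=200,
                              sample_size=5, pool_size=100, seed=0):
    """Repeatedly refine random samples of the pool and keep the best trees."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(steps):
        # pick a few trees at random as the source of sibling relationships
        picked = rng.sample(pool, min(sample_size, len(pool)))
        candidate = refine(picked)
        if candidate not in pool:  # assumption: keep the pool free of duplicates
            pool.append(candidate)
        # discard the worst-scoring trees once the pool is too large
        pool.sort(key=score)
        del pool[pool_size:]
    return pool[0]
```

A toy call with integers standing in for trees shows the control flow: `knowledge_enhanced_search([10, 12, 15, 20, 30], refine=lambda p: max(min(p) - 1, 0), score=lambda x: x, steps=20)`. Because the pool is sorted and truncated after every step, the best score in the pool can never get worse from one replicate to the next.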

7 CONCLUSION

A knowledge-enhanced phylogenetic search strategy can rapidly construct high quality phylogenetic trees from sets of candidate trees containing possibly contradictory phylogenetic relationships. In the supertree context, this approach can be used to obtain exact solutions for the RF, gene duplication, duplication and loss, or deep coalescence problems. While this approach allows scientists to improve upon previous phylogenetic knowledge and the results of existing heuristics (Table 2), it is limited to the supertree problem. That is, the input data is a collection of phylogenetic trees, and it does not consider the primary phylogenetic data, like nucleotides, amino acids, or morphological characters.

However, we also describe how a knowledge-enhanced search strategy can be applied to any discrete character data using maximum parsimony. Given an alignment of discrete phylogenetic characters and a pool of possible phylogenetic trees, this problem seeks the most parsimonious phylogenetic tree that contains only relationships found in the pool. Although, unlike the supertree problems, this approach is not guaranteed to find the optimal tree, we demonstrate that it can quickly and easily improve upon the results of commonly used maximum parsimony heuristics. Both the supertree and supermatrix analyses also guarantee that the resulting trees will not contain any undesired relationships. The user can control the range of possible relationships by the composition of the pool of trees.

Our experiments demonstrate that even extremely sophisticated heuristic approaches may often fail to find optimal solutions. Knowledge-enhanced approaches provide a fast and simple way to improve upon suboptimal solutions for supertree and supermatrix problems. In most cases, the observed improvement was small (Tables 2 and 3). First, this is likely due to the quality of existing heuristics for these phylogenetic problems. There has been a tremendous amount of research on phylogenetic heuristics, and if they are identifying high quality solutions, large improvements may not be possible. Second, the knowledge-enhanced supertree problem is designed to find solutions similar to the input. Since the solution must contain relationships found in the source trees, the solution will by definition resemble the input, and we would not expect to find large changes in the optimality score of the knowledge-enhanced supertree compared to the input trees. While the improvements may be relatively minor, even small changes in topologies can alter the interpretations of evolutionary questions.

Ultimately, the performance of the knowledge-enhanced methods and the quality of the resulting trees depend on the user-supplied phylogenetic relationships. If these user-supplied relationships are all erroneous, the knowledge-enhanced tree will also be erroneous. Thus, the strengths and weaknesses of this approach are entirely dependent on how well the user-supplied relationships represent actual "knowledge". In our experiments, acquiring the initial set of phylogenetic relationships took far longer than the knowledge-enhanced searches. Thus, the key to using knowledge-enhanced searches to obtain high-quality phylogenetic solutions is the ability to rapidly build initial, likely suboptimal, input trees that contain high quality relationships.

The knowledge-enhanced search strategies also potentially enable new approaches to phylogenetic synthesis. For example, the input data (trees or character matrices) need not match the data used to build the pool of source trees used to obtain the sibling-relationships. Thus, we could find the supertree that is consistent with relationships found in a supermatrix analysis, or the maximum parsimony tree that is consistent with relationships found in a supertree (i.e., gene tree parsimony) analysis. We could find the maximum parsimony tree from a data set of morphological characters that is consistent with the relationships found in an analysis of molecular characters, or a species tree that is consistent with the results of single gene tree analyses. Future work will expand upon the applications of the knowledge-enhanced phylogenetic strategies.

REFERENCES

[1] N. Goldman and Z. Yang, “Introduction. Statistical and computational challenges in molecular phylogenetics and evolution,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 363, no. 1512, pp. 3889–3892, 2008.
[2] O. Bininda-Emonds, Phylogenetic supertrees: Combining information to reveal the Tree of Life. Springer, 2004, vol. 4.
[3] O. Bininda-Emonds, J. Gittleman, and M. Steel, “The (super) tree of life: procedures, problems, and prospects,” Annual Review of Ecology and Systematics, pp. 265–289, 2002.
[4] A. de Queiroz and J. Gatesy, “The supermatrix approach to systematics,” Trends in Ecology & Evolution, vol. 22, no. 1, pp. 34–41, 2007.
[5] J. Barthélemy and F. McMorris, “The median procedure for n-trees,” Journal of Classification, vol. 3, no. 2, pp. 329–334, 1986.
[6] B. Ma, M. Li, and L. Zhang, “From gene trees to species trees,” SIAM Journal on Computing, vol. 30, no. 3, pp. 729–752, 2001.
[7] L. Zhang, “From gene trees to species trees II: Species tree inference in the deep coalescence model,” arXiv preprint arXiv:1003.1204, 2010.
[8] M. Bansal and R. Shamir, “A note on the fixed parameter tractability of the gene-duplication problem,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 848–850, 2011.
[9] M. Bansal, J. Burleigh, O. Eulenstein, and A. Wehe, “Heuristics for the gene-duplication problem: a Θ(n) speedup for the local search,” in Research in Computational Molecular Biology. Springer, 2007, pp. 238–252.
[10] M. Bansal and R. Shamir, “A note on the fixed parameter tractability of the gene-duplication problem,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, no. 99, pp. 1–1, 2010.
[11] M. Bansal, J. Burleigh, and O. Eulenstein, “Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models,” BMC Bioinformatics, vol. 11, no. Suppl 1, p. S42, 2010.
[12] M. Bansal and O. Eulenstein, “An Ω(n^2/log n) speed-up of TBR heuristics for the gene-duplication problem,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 5, no. 4, pp. 514–524, 2008.
[13] M. Bansal, O. Eulenstein, and A. Wehe, “The gene-duplication problem: Near-linear time algorithms for NNI-based local searches,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 2, pp. 221–231, 2009.
[14] R. Chaudhary, M. Bansal, A. Wehe, D. Fernández-Baca, and O. Eulenstein, “iGTP: A software package for large-scale gene tree parsimony analysis,” BMC Bioinformatics, vol. 11, no. 1, p. 574, 2010.
[15] P. Górecki, J. Burleigh, and O. Eulenstein, “GTP supertrees from unrooted gene trees: linear time algorithms for NNI based local searches,” Bioinformatics Research and Applications, pp. 102–114, 2012.
[16] W. Maddison and L. Knowles, “Inferring phylogeny despite incomplete lineage sorting,” Systematic Biology, vol. 55, no. 1, pp. 21–30, 2006.
[17] A. Wehe, M. Bansal, J. Burleigh, and O. Eulenstein, “DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony,” Bioinformatics, vol. 24, no. 13, pp. 1540–1541, 2008.


[18] G. Giribet, “TNT: Tree analysis using new technology,” Systematic Biology, vol. 54, no. 1, pp. 176–178, 2005.
[19] D. Swofford, “PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4,” 2003.
[20] A. Stamatakis, “RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models,” Bioinformatics, vol. 22, no. 21, pp. 2688–2690, 2006.
[21] M. Bansal, J. Burleigh, O. Eulenstein, and D. Fernández-Baca, “Robinson-Foulds supertrees,” Algorithms for Molecular Biology, vol. 5, no. 1, p. 18, 2010.
[22] A. Stamatakis, P. Hoover, and J. Rougemont, “A rapid bootstrap algorithm for the RAxML web servers,” Systematic Biology, vol. 57, no. 5, pp. 758–771, 2008.
[23] R. Chaudhary, J. Burleigh, and D. Fernández-Baca, “Fast local search for unrooted Robinson-Foulds supertrees,” Bioinformatics Research and Applications, pp. 184–196, 2011.
[24] M. Goodman, J. Czelusniak, G. Moore, A. Romero-Herrera, and G. Matsuda, “Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences,” Systematic Biology, vol. 28, no. 2, pp. 132–163, 1979.
[25] R. Guigo, I. Muchnik, and T. Smith, “Reconstruction of ancient molecular phylogeny,” Molecular Phylogenetics and Evolution, vol. 6, no. 2, pp. 189–213, 1996.
[26] R. Page and M. Charleston, “From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem,” Molecular Phylogenetics and Evolution, vol. 7, no. 2, pp. 231–240, 1997.
[27] J. Slowinski and R. Page, “How should species phylogenies be inferred from sequence data?” Systematic Biology, vol. 48, no. 4, pp. 814–825, 1999.
[28] A. Wehe and J. G. Burleigh, “Scaling the gene duplication problem towards the tree of life,” in BICoB, H. Al-Mubaid, Ed. ISCA, 2010, pp. 133–138.
[29] B. Baum, “Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees,” Taxon, pp. 3–10, 1992.
[30] M. Ragan, “Phylogenetic inference based on matrix representation of trees,” Molecular Phylogenetics and Evolution, vol. 1, no. 1, pp. 53–58, 1992.
[31] M. Sanderson, A. Purvis, and C. Henze, “Phylogenetic supertrees: assembling the trees of life,” Trends in Ecology & Evolution, vol. 13, no. 3, pp. 105–109, 1998.
[32] O. Bininda-Emonds, M. Cardillo, K. Jones, R. MacPhee, R. Beck, R. Grenyer, S. Price, R. Vos, J. Gittleman, and A. Purvis, “The delayed rise of present-day mammals,” Nature, vol. 446, no. 7135, pp. 507–512, 2007.
[33] D. Robinson and L. Foulds, “Comparison of phylogenetic trees,” Mathematical Biosciences, vol. 53, no. 1-2, pp. 131–147, 1981.


[34] C. Than and L. Nakhleh, “Species tree inference by minimizing deep coalescences,” PLoS Computational Biology, vol. 5, no. 9, p. e1000501, 2009.
[35] Y. Yu, T. Warnow, and L. Nakhleh, “Algorithms for MDC-based multi-locus phylogeny inference: Beyond rooted binary gene trees on single alleles,” Journal of Computational Biology, 2011.
[36] M. Hallett and J. Lagergren, “New algorithms for the duplication-loss model,” in Proceedings of the fourth annual international conference on Computational molecular biology. ACM, 2000, pp. 138–146.
[37] D. Sankoff, “Minimal mutation trees of sequences,” SIAM Journal on Applied Mathematics, pp. 35–42, 1975.
[38] D. Sankoff and P. Rousseau, “Locating the vertices of a Steiner tree in an arbitrary metric space,” Mathematical Programming, vol. 9, no. 1, pp. 240–246, 1975.
[39] M. Cardillo, R. Bininda-Emonds, E. Boakes, and A. Purvis, “A species-level phylogenetic supertree of marsupials,” Journal of Zoology, vol. 264, no. 1, pp. 11–31, 2004.
[40] J. Burleigh, W. Barbazuk, J. Davis, A. Morse, and P. Soltis, “Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis,” Journal of Botany, vol. 2012, 2012.
[41] K. Nixon, “The parsimony ratchet, a new method for rapid parsimony analysis,” Cladistics, vol. 15, no. 4, pp. 407–414, 1999.

André Wehe received his Ph.D. in Computer Engineering and Computer Science at Iowa State University in 2012, and a German Diploma in Information Technology at the German University of Applied Science of Bielefeld in 2004. He was a postdoctoral fellow at University of Florida, and is currently a Senior Software Engineer at the Atmospheric and Environmental Research in Massachusetts. His interests include researching algorithms for computational phylogenetics.

J. Gordon Burleigh received the BA degree in biology from Reed College in 1995, and the MA and PhD degrees in biology from the University of Missouri in 1999 and 2002, respectively. He is currently an assistant professor in the Department of Biology at the University of Florida. His research interests include computational evolutionary biology.

Oliver Eulenstein received his PhD in Computational Biology from the University of Bonn in 1998, and was a postdoctoral fellow at the University of California Davis. Currently, he is an associate professor of Computer Science at Iowa State University. His research is focusing on designing solutions for challenging computational problems in molecular biology; especially problems that arise when studying evolution.
