Journal of Bioinformatics and Computational Biology, Vol. 13, No. 2 (2015) 1550003 (14 pages) © Imperial College Press DOI: 10.1142/S0219720015500031

An efficient algorithm for pairwise local alignment of protein interaction networks*

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by GEORGE MASON UNIVERSITY on 01/08/15. For personal use only.

Wenbin Chen†,‡,§,††, Matthew Schmidt¶,||, Wenhong Tian**, Nagiza F. Samatova¶,|| and Shaohong Zhang†

†Department of Computer Science, Guangzhou University, 230 Wai Huan Xi Road, Guangzhou Higher Education Mega Center, Guangzhou, 510006, P. R. China

‡Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, 220 Handan Road, Yangpu District, Shanghai, 200433, P. R. China

§State Key Laboratory for Novel Software Technology, Nanjing University, 22 Hankou Road, Nanjing, Jiangsu, 210093, P. R. China

¶Computer Science Department, North Carolina State University, Raleigh, NC 27695, USA

||Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831

**Department of Computer Science, University of Electronic Science and Technology of China, North Jianshe Road, Chengdu, Sichuan, 610054, P. R. China

††[email protected]

Received 10 May 2014
Revised 29 October 2014
Accepted 30 October 2014
Published 4 December 2014

Recently, researchers seeking to understand, modify, and create beneficial traits in organisms have looked for evolutionarily conserved patterns of protein interactions. Their conservation likely means that the proteins of these conserved functional modules are important to the trait's expression. In this paper, we formulate the problem of identifying these conserved patterns as a graph optimization problem, and develop a fast heuristic algorithm for this problem. We compare the performance of our network alignment algorithm to that of the MaWISh algorithm [Koyutürk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A, Pairwise alignment of protein interaction networks, J Comput Biol 13(2):182–199, 2006], which bases its search algorithm on a related decision problem formulation. We find that our algorithm discovers conserved modules with a larger number of proteins in an order of magnitude less time. The protein sets found by our algorithm correspond to known conserved functional modules at precision and recall rates comparable to those produced by the MaWISh algorithm.

Keywords: Network alignment; conserved functional modules; graph optimization; graph theory.

*The material in this paper was presented in part at the 2009 International Conference on Bioinformatics and Computational Biology [1], Las Vegas, Nevada, USA, July 13–16, 2009.

††Corresponding author.


1. Introduction

In 1999, Hartwell et al. introduced the concept of functional modules as a way of attributing a specific biological function to a set of interacting proteins as opposed to a single protein [2]. Early approaches to detecting these functional modules [3–6] relied on the fact that functional modules tend to form statistically significant, densely connected subnetworks in protein interaction networks [6]. Recently, this model has been augmented by the hypothesis that sets of evolutionarily conserved proteins with conserved patterns of interactions are likely to correspond to conserved functional modules [7]. This has led to the development of network alignment algorithms to find these evolutionarily conserved functional modules.

Network alignment algorithms such as PathBLAST [7], NetworkBLAST [8], NetworkBLAST-M [9], Graemlin [10,11], IsoRank [12], MaWISh [13], NetAligner [14], AlignNemo [15], and AlignMCL [16] identify sets of proteins such that the interactions between the proteins in two different sets are likely to be conserved across two or more species. Among the local alignment algorithms, PathBLAST tries to uncover conserved paths, and NetworkBLAST tries to uncover conserved subgraphs. NetAligner starts from seed alignment solutions and extends them by connecting vertices of different seeds through gap or mismatch edges. AlignMCL [16] uses the Markov cluster algorithm MCL [17] to extract the conserved subnetworks, replacing the motif-based method of AlignNemo [15]. The sets of proteins found by network alignment algorithms are called protein alignments, while a network alignment refers to the set of all of the protein alignments.

There are a number of ways in which network alignment algorithms can differ. Some alignment algorithms are limited to aligning proteins from two species, while others can align more than two.
Given a set of proteins from multiple species, an alignment algorithm can either attempt to identify protein alignments for all of the proteins (global network aligners), or attempt to find protein alignments for only a subset of the proteins (local network aligners) [12]. To identify the candidate protein alignments, an alignment algorithm can either use a brute force method and consider every possible set of proteins as a candidate protein alignment, or use tools such as BLAST [18] to limit candidate protein alignments to those sets of proteins that are functionally similar or homologous. An alignment algorithm can also choose whether or not to explicitly represent the degree of interaction conservation as a global alignment network (GAN). A GAN is a network whose nodes are candidate protein alignments, and whose edges represent the degree of interaction conservation between the proteins in the candidate protein alignments. Table 1 gives a brief overview of the characteristics of some previous network alignment algorithms.

Table 1. Comparison of some previous network alignment algorithms.

Algorithm            Number of species   Network alignment   Protein alignment generation   Uses GAN?
Our Algorithm        2                   Local               BLAST E-values                 Yes
MaWISh [13]          2                   Local               BLAST E-values                 Yes
IsoRank [12]         2                   Global              Brute force                    No
PathBLAST [7]        2                   Local               BLAST E-values                 Yes
NetworkBLAST [8]     2–3                 Local               BLAST E-values                 Yes
NetworkBLAST-M [9]   2                   Local               BLAST E-values                 No
Graemlin [10]        2                   Global and Local    BLAST E-values                 No
Graemlin 2.0 [11]    2                   Global              BLAST E-values                 No
NetAligner [14]      2                   Local               BLAST E-values                 Yes
AlignNemo [15]       2                   Local               BLAST E-values                 Yes
AlignMCL [16]        2                   Local               BLAST E-values                 Yes

If a GAN is used by a network alignment algorithm, the algorithm identifies conserved functional modules through the identification of specific graph structures in the GAN. For instance, PathBLAST searches for paths in the network, and MaWISh searches for subgraphs with sufficiently large edge weight sums [7,13]. The choice of the target graph structures determines the type of functional modules the alignment algorithm will find. Since network alignment algorithms such as MaWISh tend to identify functional modules consisting of only a few proteins (Table 2), larger functional modules may be missed by these alignment algorithms. As the graph in Fig. 1 shows, there are a large number of functional modules whose size may preclude algorithms such as MaWISh from identifying them. The graph shows the likelihood that a known functional module, in this case defined as a generic Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway [19], contains at least the number of proteins given on the x-axis. Over 50% of the generic KEGG pathways contain at least 30 proteins. It remains a goal of network alignment algorithms to identify larger functional modules like these.

[Figure 1: plot of cumulative probability (y-axis) against size of pathway (x-axis).]

Fig. 1. The cumulative probability distribution of the likelihood that a KEGG pathway is at least a given size.


1.1. Contribution


In this paper, we propose a new formulation for the problem of finding significant subnetworks in a pairwise GAN. We also introduce an algorithm that heuristically attempts to solve this new optimization problem. We evaluate the performance of our algorithm using real-world protein interaction data. We find that our algorithm identifies functional modules with precision and recall rates comparable to those of the MaWISh algorithm. However, our algorithm has a greater chance of finding larger local alignment networks (LANs) that correspond to large functional modules not identified by previous search methods.

2. Previous Methods of Finding High Scoring LANs

Koyutürk et al. developed a heuristic algorithm to find conserved functional protein modules by identifying locally maximal alignment networks in the GAN [13]. The motivation for the algorithm comes from the fact that a protein in a given functional module interacts with all other proteins in the module either directly or through a module hub. If a functional module is conserved between two species, then edges connecting pairs of proteins from this conserved functional module will have a large, non-negative weight. This happens because the weight of the edge connecting two pairs of proteins is positive when the interaction is conserved between the two pairs, and its magnitude is proportional to the combined similarity of the two pairs of proteins [13].

In Ref. 13, Koyutürk et al. develop a framework for aligning PPI networks to discover subsets of proteins in a pair of species. Their framework relies on theoretical models that focus on understanding the evolution of protein interaction networks. In the following, we give some formulations and definitions that come from Ref. 13.

Definition 2.1 (Local Alignment of PPI Networks [13]). A PPI network is generally modeled by an undirected graph $G(U, E)$, where $U$ denotes the set of proteins and $uu' \in E$ denotes an interaction between proteins $u \in U$ and $u' \in U$. Given protein interaction networks $G(U, E)$, $H(V, F)$, and a pairwise similarity function $S$ defined over the union of their protein sets $U \cup V$, any protein subset pair $P = (\tilde{U}, \tilde{V})$ induces a local alignment $A(G, H, S, P) = \{\mathcal{M}, \mathcal{N}, \mathcal{D}\}$, where

$$\mathcal{M} = \{u, u' \in U;\ v, v' \in V : S(u, v) > 0,\ S(u', v') > 0,\ uu' \in E,\ vv' \in F\},$$

$$\mathcal{N} = \{u, u' \in U;\ v, v' \in V : S(u, v) > 0,\ S(u', v') > 0,\ uu' \in E,\ vv' \notin F\} \cup \{u, u' \in U;\ v, v' \in V : S(u, v) > 0,\ S(u', v') > 0,\ uu' \notin E,\ vv' \in F\},$$

$$\mathcal{D} = \{u, u' \in U : S(u, u') > 0\}.$$

Each match $M \in \mathcal{M}$ is associated with a score $\mu(M)$. Each mismatch $N \in \mathcal{N}$ and each duplication $D \in \mathcal{D}$ are associated with penalties $\nu(N)$ and $\delta(D)$, respectively. A duplication is biologically analogous to the duplication of a gene in the course of evolution. A match represents a conserved interaction between two orthologous protein pairs. A mismatch denotes an interaction that is absent from the PPI network of one species although the orthologous proteins interact in the other species.

The score of the alignment $A(G, H, S, P) = \{\mathcal{M}, \mathcal{N}, \mathcal{D}\}$ is defined as

$$\sigma(A) = \sum_{M \in \mathcal{M}} \mu(M) - \sum_{N \in \mathcal{N}} \nu(N) - \sum_{D \in \mathcal{D}} \delta(D).$$


The goal is to find local alignments with a high score. The local alignment problem can be reduced to an optimization problem on an alignment graph, which is defined as follows [13].

Definition 2.2 (Alignment Graph [13]). For a pair of PPI networks $G(U, E)$, $H(V, F)$, and protein similarity function $S$, the corresponding weighted alignment graph $\mathbf{G}(\mathbf{V}, \mathbf{E})$ is computed as follows:

$$\mathbf{V} = \{\mathbf{v} = \{u, v\} : u \in U,\ v \in V \text{ and } S(u, v) > 0\}.$$

In other words, we have a node in the alignment graph for each pair of orthologous proteins. Each edge $\mathbf{v}\mathbf{v}' \in \mathbf{E}$, where $\mathbf{v} = \{u, v\}$ and $\mathbf{v}' = \{u', v'\}$, is assigned the weight

$$w(\mathbf{v}\mathbf{v}') = \mu(uu', vv') - \nu(uu', vv') - \delta(u, u') - \delta(v, v').$$

Here, $\mu(uu', vv') = 0$ if $(uu', vv') \notin \mathcal{M}$, and similarly for the mismatch and duplication penalties.

Finding subgraphs of the alignment graph that have a large sum of edge-weights is equivalent to solving the Maximum Weight Induced Subgraph Problem given in Definition 2.3.

Definition 2.3. Given a graph $G(V, E)$ and a constant $\epsilon$, find a subset of nodes $\tilde{V} \subseteq V$ such that the sum of the weights of the edges in the subgraph induced by $\tilde{V}$ is at least $\epsilon$, i.e. $W(\tilde{V}) = \sum_{v, v' \in \tilde{V}} w(vv') \geq \epsilon$.

In Ref. 13, the following theorem is proven.

Theorem 2.4. Given PPI networks $G(U, E)$, $H(V, F)$ and a protein similarity function $S$, let $\mathbf{G}(\mathbf{V}, \mathbf{E})$ be the corresponding alignment graph. If $\tilde{\mathbf{V}}$ is a solution to the maximum weight induced subgraph problem on $\mathbf{G}(\mathbf{V}, \mathbf{E})$, then $P = \{\tilde{U}, \tilde{V}\}$ induces an alignment $A(G, H, S, P)$ with $\sigma(A) = W(\tilde{\mathbf{V}})$, where $\tilde{U} = \{u \in U : \exists\, v \in V \text{ s.t. } \{u, v\} \in \tilde{\mathbf{V}}\}$ and $\tilde{V} = \{v \in V : \exists\, u \in U \text{ s.t. } \{u, v\} \in \tilde{\mathbf{V}}\}$.

The algorithm of Koyutürk et al. heuristically finds the subgraphs defined by Definition 2.3. The algorithm "grows" a locally maximal subgraph. The subgraph is seeded with a vertex in the GAN that has a large number of non-negative weight edges. A large number of non-negative weight edges incident to a vertex in the GAN means that the aligned pair of proteins represented by the vertex has a large number of conserved interactions. This means that the vertex is likely to be a module hub. Once the subgraph is seeded, the algorithm proceeds by iteratively adding to the subgraph the vertex with the largest sum of weights over the edges connecting it to the subgraph. This process stops once there is no vertex for which the sum of the weights of the edges connecting it to the subgraph is greater than zero. This stopping criterion is based on the assumption that proteins in a functional module only loosely interact with proteins not in the functional module [13]. Thus, aligned proteins in a conserved functional module are unlikely to have a large number of conserved interactions with aligned proteins that are not in the conserved functional module.
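The seed-and-grow procedure described above can be illustrated with a minimal Python sketch. This is a simplified reading of the prose, not the MaWISh implementation from Ref. 13: the function name `grow_subgraph`, the dictionary-based edge representation, and the rule of counting only explicitly listed edges when choosing the seed are all assumptions made for illustration.

```python
def grow_subgraph(weights, nodes):
    """Greedy seed-and-grow sketch on a weighted alignment graph.

    weights: dict mapping frozenset({a, b}) -> edge weight
             (pairs missing from the dict are treated as weight 0).
    nodes:   iterable of hashable node identifiers.
    """
    nodes = list(nodes)

    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    # Seed: the node with the most non-negative-weight incident edges.
    # Assumption: only edges explicitly present in `weights` are counted.
    def seed_score(a):
        return sum(
            1
            for b in nodes
            if b != a and frozenset((a, b)) in weights and w(a, b) >= 0
        )

    seed = max(nodes, key=seed_score)
    subgraph = {seed}
    # gain[x] = total weight of the edges joining x to the current subgraph.
    gain = {x: w(x, seed) for x in nodes if x != seed}

    while gain:
        best = max(gain, key=gain.get)
        if gain[best] <= 0:
            break  # no remaining node is positively connected to the subgraph
        subgraph.add(best)
        del gain[best]
        for x in gain:
            gain[x] += w(x, best)
    return subgraph
```

On a toy graph with edge weights ab = 2, ac = 1, bc = 1, ad = -3, bd = -1 and cd = 0.5, the sketch seeds at c and then adds a and b, but rejects d because its connection to the subgraph sums to a negative value.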


The heuristic algorithm presented in Ref. 13 tends to produce relatively small LANs. The bottom-up method of growing subgraphs utilized by the algorithm makes it difficult to grow larger subgraphs. The criterion that the vertex being added to the subgraph must be connected to the current subgraph by edges whose weights sum to more than zero is a tight restriction. It requires the aligned pair of proteins represented by a vertex in the subgraph to have a large number of conserved interactions. According to Ref. 13, a conserved interaction happens only when an interaction exists in both species' protein interaction networks. This means that proteins in the conserved functional module found by this algorithm must interact with the vast majority of other proteins in the functional module. If the functional module contains a module hub, or if the protein–protein interaction data contains a great deal of noise, then the size of the LANs will be limited to relatively small numbers of proteins.

3. Our Approach to Finding High Scoring LANs

Here, we present a heuristic algorithm for finding LANs in a GAN by heuristically attempting to find solutions for the optimization version of the maximum weight induced subgraph problem given in Definition 3.1. The solution to the optimization version of the problem will, by definition, have a total edge weight at least as large as that of any solution to the search problem given in Definition 2.3. Therefore, the LAN determined by the solution will be more likely to correspond to an actual conserved functional module.

Definition 3.1. Given a graph $G(V, E)$, find a subset of nodes $\tilde{V} \subseteq V$ such that the sum of the weights of the edges in the subgraph induced by $\tilde{V}$ is at least as large as the sum of the weights of the edges in the subgraph induced by any other subset of nodes $S \subseteq V$.

The search version of the problem is NP-complete, and the optimization version is correspondingly NP-hard. Therefore, an efficient algorithm for finding exact solutions to the optimization problem is unlikely to exist. Our algorithm instead finds heuristic solutions to the optimization problem in Definition 3.1. Related algorithms in Refs. 20 and 21 have been shown to work well for finding solutions to the densest subgraph and dense k-subgraph problems.

The pseudocode for our algorithm is given in Algorithm 1. The algorithm finds subgraphs with large sums of edge weights by initially considering the entire graph. It calculates a weight for the graph as the sum of the edge-weights in the graph. The algorithm then iteratively removes the vertex that has the lowest sum of edge-weights for its incident edges and records the sum of the edge-weights remaining in the graph as the weight of the new graph. Once every vertex has been removed, it finds the subgraph that had the maximum weight and outputs it as the maximum weight induced subgraph for the given graph.

The runtime complexity of our algorithm is as follows. Lines 1 and 5 must sum over all of the edges in the graph. Therefore, the complexity of lines 1–5 is $O(|E|)$.
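The vertex-removal procedure described above can be sketched in Python as follows. This is an illustrative reading of the prose rather than a transcription of Algorithm 1: the function name `peel_max_weight_subgraph`, the dictionary-based edge representation, and the decision to stop once a single vertex remains (a one-vertex graph has weight zero) are assumptions. The sketch returns only one subgraph; how repeated runs are combined to extract all LANs from a GAN is not shown.

```python
def peel_max_weight_subgraph(weights, nodes):
    """Greedy peeling heuristic for the maximum weight induced subgraph.

    weights: dict mapping frozenset({a, b}) -> edge weight
             (pairs missing from the dict are treated as weight 0).
    nodes:   iterable of hashable node identifiers.
    Returns (best_nodes, best_weight).
    """
    remaining = set(nodes)

    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    # Weighted degree of each vertex within the current graph.
    degree = {a: sum(w(a, b) for b in remaining if b != a) for a in remaining}
    # Every edge contributes to two degrees, so halve the degree sum.
    total = sum(degree.values()) / 2.0

    best_weight, best_nodes = total, set(remaining)
    while len(remaining) > 1:
        # Remove the vertex whose incident edges have the lowest weight sum.
        worst = min(remaining, key=degree.get)
        remaining.remove(worst)
        total -= degree.pop(worst)
        # Update the weighted degrees of the vertices that are left.
        for b in remaining:
            degree[b] -= w(worst, b)
        # Remember the heaviest intermediate subgraph seen so far.
        if total > best_weight:
            best_weight, best_nodes = total, set(remaining)
    return best_nodes, best_weight
```

On a toy graph with edge weights ab = 2, ac = 1, bc = 1, ad = -3, bd = -1 and cd = 0.5, the full graph has weight 0.5. Removing d, the vertex with the lowest incident weight sum, raises the weight to 4, and every later removal lowers it, so the sketch returns the vertex set {a, b, c} with weight 4.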


Every iteration of lines 6–13 of our algorithm involves finding and removing the minimum weight vertex as well as updating the weights of the remaining vertices. This can be done in $O(|V|)$ time. Since the loop on line 9 must run $|V|$ times, the complexity of lines 6–13 is $O(|V|^2)$. The complexity of line 14 is $O(|V|)$. Therefore, because $|E| \leq |V|^2$, our algorithm will run in $O(|E| + |V|^2) = O(|V|^2)$ time. Since we want to find all the LANs from an input GAN, the total time complexity becomes $O(|V|^3)$. The previous MaWISh algorithm had a worst-case runtime of $O(|V||E|)$ [13], which is $O(|V|^3)$ when the GAN contains an edge between every pair of vertices. Thus, our proposed algorithm has the same worst-case time complexity as MaWISh. However, the experimental results in the next section will show that our proposed algorithm generally runs much faster than MaWISh in practice.

4. Experimental Results

In this section, the benefits of our algorithm for finding LANs will be examined by experimentally comparing our algorithm's performance to that of the original MaWISh algorithm. Both algorithms take a GAN as input. The method for constructing the GAN is presented in full in Ref. 13. However, a brief overview of the method is given here for clarity.

Two protein interaction networks, $G_1$ and $G_2$, are needed to construct a GAN. In a protein interaction network, the vertices represent proteins in an organism and an edge between two vertices represents an interaction between the two proteins. Each vertex in the GAN represents one protein from $G_1$ and one protein from $G_2$. In addition, the vertices of the GAN are restricted to only those pairs of proteins across the two PPI networks that have significantly similar amino acid sequences. This is determined by the BLAST E-value of the two proteins being less than a given threshold, $T$. Edges exist between every possible pair of vertices in the GAN. The weights of these edges are determined by the likelihood that the two pairs of proteins conserve an interaction across the two species. For an edge between the vertices $\{u_1, u_2\}$ and $\{v_1, v_2\}$, the edge-weight is positive if there is a likely interaction both between $u_1$ and $v_1$ in $G_1$ and between $u_2$ and $v_2$ in $G_2$. If there is a likely interaction between the two proteins in $G_1$ but not in $G_2$, or vice versa, then the edge-weight is negative. The magnitude of the edge-weight is determined with respect to the sequence similarity between $u_1$, $u_2$ and $v_1$, $v_2$. The more similar these two pairs of proteins are, the greater the magnitude of the edge weight is.

For the results in this section, we used protein interaction data from seven different species (Saccharomyces cerevisiae, Caenorhabditis elegans, Homo sapiens, Drosophila melanogaster, Escherichia coli K-12, Vibrio cholerae, and Caulobacter crescentus) to generate seven different GANs. The protein interaction network data was obtained from the Database of Interacting Proteins (DIP) [22], and the GANs were constructed by the previously mentioned method. The two algorithms were run on these GANs to generate the LANs.

Figure 2 shows that the LANs generated by our algorithm tended to be larger than the alignments generated by the MaWISh algorithm. This is in line with our hypothesis, and is likely due to the fact that our algorithm heuristically finds the maximum weight induced subgraph in the GAN (Definition 3.1), whereas the MaWISh algorithm heuristically finds induced subgraphs whose weights are greater than some threshold (Definition 2.3).
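The GAN construction summarized earlier in this section can be sketched in Python as below. The sketch follows only the overview given here, not the full scoring model of Ref. 13. The function name `build_gan`, the input dictionaries, the use of a product of similarity scores as the edge-weight magnitude, and the choice of weight 0 when neither species shows the interaction are all hypothetical; duplication penalties are omitted.

```python
from itertools import combinations


def build_gan(edges1, edges2, evalue, similarity, threshold):
    """Sketch of GAN construction from two PPI networks G1 and G2.

    edges1, edges2: sets of frozenset({p, q}) interactions in G1 and G2.
    evalue:         dict mapping (p1, p2) -> BLAST E-value, p1 in G1, p2 in G2.
    similarity:     dict mapping (p1, p2) -> non-negative similarity score
                    (hypothetical input; must cover every retained pair).
    threshold:      E-value cutoff T.
    Returns (vertices, weights), with weights keyed by frozenset of two vertices.
    """
    # One GAN vertex per cross-species protein pair with E-value below T.
    vertices = sorted(pair for pair, e in evalue.items() if e < threshold)

    weights = {}
    # Edges exist between every possible pair of GAN vertices.
    for (u1, u2), (v1, v2) in combinations(vertices, 2):
        in_g1 = frozenset((u1, v1)) in edges1
        in_g2 = frozenset((u2, v2)) in edges2
        # Assumed magnitude: product of the two pairs' similarity scores.
        magnitude = similarity[(u1, u2)] * similarity[(v1, v2)]
        if in_g1 and in_g2:
            weight = magnitude    # interaction present in both species
        elif in_g1 or in_g2:
            weight = -magnitude   # interaction present in only one species
        else:
            weight = 0.0          # assumption: no interaction in either species
        weights[frozenset(((u1, u2), (v1, v2)))] = weight
    return vertices, weights
```

The returned `weights` dictionary uses a frozenset of two GAN vertices as each key, so it can be indexed by vertex pair regardless of order.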

Fig. 2. The cumulative probability distribution of the likelihood that a LAN found by an algorithm is at least a given size.

Table 2. Comparison of LANs produced and runtimes for seven pairs of PPI networks.

PPI Pair                            Algorithm       LANs   Median size   Maximum weight   Runtime
S. cerevisiae versus C. elegans     Our Algorithm   13     24            31.01            0.02 s ± 0.01 s
                                    MaWISh          111    3             36.14            0.08 s ± 0.01 s
S. cerevisiae versus H. sapiens     Our Algorithm   32     35            81.26            0.91 s ± 0.01 s
                                    MaWISh          459    3             15.29            5.10 s ± 0.03 s
C. elegans versus D. melanogaster   Our Algorithm   14     35            81.76            0.29 s ± 0.01 s
                                    MaWISh          183    3             26.75            1.21 s ± 0.01 s
C. elegans versus H. sapiens        Our Algorithm   23     30            42.57            0.89 s ± 0.01 s
                                    MaWISh          262    3             14.27            3.98 s ± 0.03 s
D. melanogaster versus H. sapiens   Our Algorithm   24     43            180.04           1.88 s ± 0.01 s
                                    MaWISh          385    3             22.73            12.47 s ± 0.05 s
E. coli versus V. cholerae          Our Algorithm   25     28            546              0.02 s ± 0.01 s
                                    MaWISh          228    4             169              0.15 s ± 0.01 s
E. coli versus C. crescentus        Our Algorithm   14     15            45               0.00 s ± 0.01 s
                                    MaWISh          93     2             19               0.02 s ± 0.01 s

The effectiveness of the two algorithms was measured in two ways. First, we compared the two algorithms on a strictly computational basis. This included comparing the number of LANs found, the size of the different LANs, the weight of the maximum weight subgraph found, and the total runtime of the algorithms. The results of this comparison are shown in Table 2. Our algorithm tended to find fewer LANs, but the size and weight of the LANs tended to be larger than those of MaWISh. Our algorithm also tended to have shorter runtimes than the MaWISh algorithm.

We have also evaluated the alignment algorithms according to the biological relevance of the LANs produced. Our assumption is that the functional modules identified by our algorithm are functionally homogeneous. We evaluate each alignment's functional homogeneity as follows. First, we followed the approach outlined by Flannick et al. in Ref. 10. We defined as a true positive any alignment that could be annotated with a statistically significant Gene Ontology (GO) term [23]. This was determined by using the online GO::TermFinder application [24] to annotate the LANs and to determine the statistical significance of the possible annotations through the calculation of each annotation's associated p-value. If the GO::TermFinder application annotated the LAN with any GO term with a p-value of less than 0.05, then we counted the LAN as a true positive. To use the GO::TermFinder application, we had to filter out any protein in the LANs that did not have an NCBI-GI number. Under this evaluation method, the precision of the algorithm is the number of true positives divided by the total number of LANs that did not have all of their proteins filtered out. The recall of the algorithm is the number of true positives divided by the total number of true positives and false negatives.

We developed our own related method of evaluating an alignment's functional homogeneity using KEGG pathways [19] to avoid any bias introduced from using GO terms in the analysis. In this method, we defined as a true positive any alignment that could be annotated with a statistically significant KEGG pathway ID. The statistical significance of annotating a LAN with a KEGG pathway ID was calculated as its hypergeometric probability, which is the probability that a random sample of the same size as the LAN would have included at least as many proteins in the annotated KEGG pathway. This is similar to the way that GO::TermFinder calculates the p-value of a given GO term annotation [24]. To calculate the hypergeometric probability of a KEGG pathway ID, we first had to remove any protein from the LAN that did not have a KEGG gene ID and at least one associated KEGG pathway. A LAN was determined to be a true positive if it had at least one KEGG pathway annotation with a hypergeometric probability of less than 0.05.

We also measured each algorithm's ability to find LANs that cover (contain the proteins of) as many functional modules as possible. Ideally, if a conserved functional module exists across two species, we would like the alignment algorithm to find a LAN that covers the module. We once again adopted a technique from the authors of Ref. 10 to evaluate each algorithm's ability to identify conserved modules of functional homogeneity using KEGG pathways [19]. A pathway in a given species was defined to be "covered" if at least three of its proteins belonged to the same LAN found by the algorithm. If a conserved KEGG pathway was covered in both species by the same alignment, then it was counted as a covered pathway. We report the number of covered pathways, in addition to the two measures of precision, in Table 3. We were unable to use KEGG data to evaluate the biological relevance of the LANs produced when using the E. coli PPI network as input because of an inconsistency between our PPI network data and the KEGG pathway data for E. coli.

Table 3. Comparison of precision-recall results for seven pairs of PPI networks.

PPI Pair          Algorithm       Precision: GO terms (%)   Precision: KEGG pathways (%)   Recall
S. cerevisiae     Our Algorithm   73                        92                             4
                  MaWISh          89                        83                             4
C. elegans        Our Algorithm   82                        93                             4
                  MaWISh          61                        90                             4

S. cerevisiae     Our Algorithm   100                       93                             10
                  MaWISh          100                       66                             8
H. sapiens        Our Algorithm   92                        78                             10
                  MaWISh          75                        88                             8

C. elegans        Our Algorithm   60                        90                             3
                  MaWISh          45                        83                             0
D. melanogaster   Our Algorithm   100                       100                            3
                  MaWISh          77                        87                             0

C. elegans        Our Algorithm   70                        94                             3
                  MaWISh          42                        90                             0
H. sapiens        Our Algorithm   85                        95                             3
                  MaWISh          64                        88                             0

D. melanogaster   Our Algorithm   100                       100                            7
                  MaWISh          87                        95                             0
H. sapiens        Our Algorithm   96                        95                             7
                  MaWISh          68                        87                             0

E. coli           Our Algorithm   84                        N/A                            N/A
                  MaWISh          74                        N/A                            N/A
V. cholerae       Our Algorithm   58                        N/A                            N/A
                  MaWISh          71                        N/A                            N/A

E. coli           Our Algorithm   67                        N/A                            N/A
                  MaWISh          41                        N/A                            N/A
C. crescentus     Our Algorithm   67                        N/A                            N/A
                  MaWISh          48                        N/A                            N/A

5. Concluding Remarks

We have presented a novel approach to finding LANs that correspond to conserved functional modules between two species. The proposed formulation of the maximum weight induced subgraph optimization problem allows us to design an algorithm that can find larger LANs by iteratively removing the minimum weight vertex from a GAN. Our proposed algorithm has the same worst-case time complexity as MaWISh. However, experimental observations have shown that our algorithm has an order of magnitude runtime improvement over MaWISh for several real-world examples. The LANs found by our algorithm also offer precision and recall values that are comparable to, and in most cases higher than, those of the previous MaWISh algorithm.

Acknowledgments

This research has been supported by the "Exploratory Data Intensive Computing for Complex Biological Systems" project from U.S. Department of Energy (Office of Advanced Scientific Computing Research, Office of Science). The work of NFS was also sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. D.O.E. under contract no. DE-AC05-00OR22725.
Wenbin Chen's research has been supported by the National Natural Science Foundation of China (NSFC) under Grant No. 11271097, the project IIPL-2011-001 from Shanghai Key Laboratory of Intelligent Information Processing, the project KFKT2012B01 from State Key Laboratory for Novel Software Technology, Nanjing University, and the research projects of Guangzhou Education Bureau under Grant No. 2012A074. Shaohong Zhang's research has been supported by the National Natural Science Foundation of China under Grant No. 61202273 and a grant from the Department of Education in Guangdong province under project No. 2013KJCX0144.

References

1. Chen W, Schmidt MC, Tian W, Samatova NF, A fast, accurate algorithm for identifying functional modules through pairwise local alignment of protein interaction networks, Int Conf Bioinformatics and Computational Biology, 816–821, 2009.


2. Hartwell LH, Hopfield JJ, Leibler S, Murray AW, From molecular to modular cell biology, Nature 402(6761):47–52, 1999.
3. Dunn R, Dudbridge F, Sanderson C, The use of edge-betweenness clustering to investigate biological function in protein interaction networks, BMC Bioinformatics 6(1):39, 2005.
4. Pereira-Leal JB, Enright AJ, Ouzounis CA, Detection of functional modules from protein interaction networks, Proteins: Struct, Funct, and Bioinform 54(1):49–57, 2004.
5. Snel B, Bork P, Huynen MA, The identification of functional modules from the genomic association of genes, Proc Nat Acad Sci USA 99(9):5890–5895, 2002.
6. Spirin V, Mirny LA, Protein complexes and functional modules in molecular networks, Proc Nat Acad Sci USA 100(21):12123–12128, 2003.
7. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T, Conserved pathways within bacteria and yeast as revealed by global protein network alignment, Proc Nat Acad Sci USA 100(20):11394–11399, 2003.
8. Sharan R, Ideker T, Kelley B, Shamir R, Karp RM, Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data, J Comput Biol 12(6):835–846, 2005.
9. Kalaev M, Bafna V, Sharan R, Fast and accurate alignment of multiple protein networks, Research in Computational Molecular Biology, Springer, Berlin-Heidelberg, pp. 246–256, 2008.
10. Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S, Graemlin: General and robust alignment of multiple large interaction networks, Genome Res 16(9):1169–1181, 2006.
11. Flannick J, Novak A, Do C, Srinivasan B, Batzoglou S, Automatic parameter learning for multiple network alignment, Res Comput Mole Biol, 214–231, 2008.
12. Singh R, Xu J, Berger B, Pairwise global alignment of protein interaction networks by matching neighborhood topology, Res Comput Mole Biol, 16–31, 2007.
13. Koyutürk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A, Pairwise alignment of protein interaction networks, J Comput Biol 13(2):182–199, 2006.
14. Pache RA, Aloy P, A novel framework for the comparative analysis of biological networks, PLoS ONE 7(2):e31220, 2012.
15. Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C, AlignNemo: A local network alignment method to integrate homology and topology, PLoS ONE 7(6):e38107, doi: 10.1371/journal.pone.0038107, 2012.
16. Mina M, Guzzi PH, AlignMCL: Comparative analysis of protein interaction networks through Markov clustering, IEEE Int Conf Bioinformatics and Biomedicine Workshops (BIBMW), 174–181, 2012.
17. Enright AJ, van Dongen S, Ouzounis CA, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res 30(7):1575–1584, 2002.
18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, Basic local alignment search tool, J Mole Biol 215(3):403–410, 1990, PMID: 2231712.
19. Kanehisa M, Goto S, KEGG: Kyoto encyclopedia of genes and genomes, Nucl Acids Res 28(1):27–30, 2000.
20. Charikar M, Greedy approximation algorithms for finding dense components in a graph, Proc Third Int Workshop on Approximation Algorithms for Combinatorial Optimization, Springer-Verlag, pp. 84–95, 2000.
21. Feige U, Peleg D, Kortsarz G, The dense k-subgraph problem, Algorithmica 29(3):410–421, 2001.

22. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S, Eisenberg D, DIP, the database of interacting proteins: A research tool for studying cellular networks of protein interactions, Nucl Acids Res 30(1):303–305, 2002.
23. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G, Gene ontology: Tool for the unification of biology, Nat Genet 25(1):25–29, 2000.
24. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G, GO::TermFinder - open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes, Bioinformatics 20(18):3710–3715, 2004.

Wenbin Chen received his MS degree in mathematics from the Institute of Software, Chinese Academy of Sciences in 2003, and his PhD in Computer Science from North Carolina State University, USA in 2010. He is currently an Associate Professor at the College of Computer Science, Guangzhou University. His research interests include algorithm design and analysis, bioinformatics algorithms, graph algorithms, graph mining, and computational complexity.

Matthew Schmidt received his BS degree in Computer Science and Mathematics, with a minor in Physics, from Purdue University in 2004. After graduating, he worked for Delphi Automotive as a Test Engineer in 2004–2005. In 2005, he enrolled at North Carolina State University, where he received his PhD in Computer Science in 2010. He is currently a researcher at MIT Lincoln Laboratory. His research interests include graph algorithms, bioinformatics, and parallel computing.

Wenhong Tian is an Associate Professor at the College of Computer Science, University of Electronic Science and Technology of China. He received his PhD from the Department of Computer Science, North Carolina State University, USA. His research interests include algorithm design, data mining, bioinformatics, and cloud computing.

Nagiza F. Samatova received a BS degree in Applied Mathematics from Tashkent State University, Uzbekistan in 1991; a PhD in Applied Mathematics from the Computational Center of the Russian Academy of Sciences, Moscow, Russia in 1993; and an MS degree in Computer Science from the University of Tennessee, USA in 1998. Samatova is currently an Associate Professor in the Department of Computer Science at North Carolina State University and a Senior Research Scientist in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. Samatova's research interests include algorithm design and analysis, bioinformatics algorithms, graph algorithms, graph mining, and high performance computing.

Shaohong Zhang is an Associate Professor at the College of Computer Science, Guangzhou University, China. He received his PhD from the Department of Computer Science, City University of Hong Kong. His research interests include machine learning, data mining, and bioinformatics.
