PicXAA: a probabilistic scheme for finding the maximum expected accuracy alignment of multiple biological sequences.

Chapter 13

PicXAA: A Probabilistic Scheme for Finding the Maximum Expected Accuracy Alignment of Multiple Biological Sequences

Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon

Abstract

PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA consistently yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.

Key words Multiple sequence alignment, Nonprogressive alignment, Maximum expected accuracy (MEA), Probabilistic consistency transformation, PicXAA

1 Introduction

Multiple sequence alignment (MSA) is an indispensable tool in comparative studies of biological sequences, and it plays a prominent role in many applications such as phylogenetic analysis, structure prediction, function prediction, motif discovery, and modeling sequence homology [1–7]. The mathematically optimal MSA can be found using dynamic programming. However, the dynamic programming approach has a high computational cost that renders it impractical for aligning more than a few sequences. For this reason, the progressive alignment scheme, which successively aligns pairs of sequences (or sequence profiles) along a phylogenetic tree of the given sequences, has gained popularity as a practical alternative [8–16]. In fact, the progressive alignment technique is surprisingly effective for closely related sequences and it yields

David J. Russell (ed.), Multiple Sequence Alignment Methods, Methods in Molecular Biology, vol. 1079, DOI 10.1007/978-1-62703-646-7_13, © Springer Science+Business Media, LLC 2014


accurate alignment results despite the low computational overhead. However, it does not work as well when applied to a set of divergent sequences that share only local similarities. The progressive scheme tends to propagate early-stage errors throughout the entire alignment process, which is problematic when aligning sequences that share prominent local similarities but also differ across many sequence regions. In such a case, it may be difficult to build up the MSA through progressive alignment, as it may dilute local similarities and propagate errors that arise in divergent sequence regions. Several techniques have been developed to address this shortcoming of progressive alignment and alleviate these undesirable effects [17–20].

Recently, a novel alignment algorithm called PicXAA [20] was proposed to address this problem by adopting a computationally efficient nonprogressive scheme. Based on the maximum expected accuracy (MEA) principle, PicXAA aims to find the optimal alignment that maximizes the expected number of correctly aligned symbols (i.e., amino acids or nucleotides). Towards this goal, PicXAA first computes the posterior pairwise symbol alignment probability for all pairs of symbol locations for every sequence pair. Next, it updates the estimated probabilities through an improved probabilistic consistency transformation, which refines the symbol alignment probabilities of a given sequence pair by incorporating the information from other sequences. Using an efficient graph-based technique, PicXAA greedily builds up the alignment based on the updated probabilities, starting from confidently alignable regions with high local similarities. Once the initial alignment is constructed, PicXAA goes through an iterative refinement process to further improve the alignment quality in divergent sequence regions that cannot be confidently aligned.
In summary, PicXAA can accurately predict the global alignment of multiple biological sequences, in which local homologies are effectively captured. Experimental results confirm that PicXAA consistently yields accurate alignment results in various benchmarks, where the improvements are especially significant on reference sets that consist of sequences with only local similarities [20].

2 Methods

PicXAA [20] aims to find the multiple sequence alignment with the maximum expected accuracy, i.e., the maximum expected number of correctly aligned residue pairs. Through a greedy approach, PicXAA probabilistically builds up the MSA by starting from high similarity regions and proceeding towards more divergent regions that bear less similarity. In this way, PicXAA effectively avoids the error-propagation problem that many of the current progressive alignment techniques suffer from. In the following, we give a brief summary of the alignment steps that are involved in PicXAA. Fig. 1 provides a diagrammatic overview of the alignment process of PicXAA.

Fig. 1 Diagrammatic overview of the alignment steps in PicXAA

2.1 Improved Probabilistic Consistency Transformation

Suppose we have estimated the posterior pairwise alignment probabilities for all residue pairs in every possible pair of sequences $x, y$ in a given sequence set $S$. We denote this probability as $P(x_i \sim y_j \in a^* \mid x, y)$, where $x_i \in x$ is a residue in sequence $x$, $y_j \in y$ is a residue in sequence $y$, and $x_i \sim y_j \in a^*$ means that the residues $x_i$ and $y_j$ are aligned in the true (unknown) alignment $a^*$. These probabilities can be computed using various approaches, such as pair-HMMs (hidden Markov models) [10], partition function based methods (see Note 8 for parameters used in this scheme) [21], and structural pair-HMMs [15] (see Notes 1–3, 6 and 7 for more details on these methods). Given these pairwise alignment probabilities $P(x_i \sim y_j \in a^* \mid x, y)$, PicXAA updates them using an improved probabilistic consistency transformation.

The probabilistic consistency transformation (PCT) attempts to enhance the reliability of the estimated pairwise residue alignment probabilities by incorporating the information from other sequences in the set $S$. The idea of PCT was originally proposed in [10] and has since been widely adopted by many alignment algorithms. It has been shown that such a transformation can ultimately lead to a more consistent and accurate MSA. The improved PCT, first proposed and adopted in PicXAA [20], improves the original PCT by considering the relative significance of each intermediate sequence $z \in S \setminus \{x, y\}$ while transforming the pairwise alignment probabilities, originally estimated using pair-HMMs, the partition function method, or structural pair-HMMs (see ref. 20 for details). The improved transformation is defined as:

$$
P'(x_i \sim y_j \in a^* \mid S) = \frac{\sum_{z \in S} P(x \sim z)\, P(z \sim y) \sum_{z_k \in z} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y)}{\sum_{z \in S} P(x \sim z)\, P(z \sim y)}
$$

where $P(x \sim z)$ is the probability that sequences $x$ and $z$ are homologous to each other. This probability $P(x \sim z)$ is estimated by computing the average residue alignment probability in the optimal pairwise alignment between $x$ and $z$. This transformation can be applied for more than one round of iterations (see Note 8).

2.2 Construction of the Alignment Graph
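The graph construction described next takes the consistency-transformed probabilities of Subheading 2.1 as its input. As a minimal sketch of that transformation (an illustration, not the PicXAA implementation), the following Python code applies one round of the improved PCT to a single sequence pair. The function names and data layout are hypothetical. The sketch assumes that the homology probabilities $P(x \sim z)$ have already been estimated and, following the usual consistency-transformation convention, that a sequence paired with itself has an identity posterior matrix and homology probability 1.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lookup_post(post, lengths, a, b):
    """Posterior matrix for the ordered sequence pair (a, b)."""
    if a == b:  # assumption: a sequence aligns to itself residue by residue
        n = lengths[a]
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    if (a, b) in post:
        return post[(a, b)]
    return [list(col) for col in zip(*post[(b, a)])]  # stored as (b, a): transpose


def lookup_sim(sim, a, b):
    """P(a ~ b); assumed to be 1 for a sequence paired with itself."""
    if a == b:
        return 1.0
    return sim[(a, b)] if (a, b) in sim else sim[(b, a)]


def improved_pct(post, sim, lengths, x, y):
    """One round of the weighted consistency transformation for the pair (x, y)."""
    num = [[0.0] * lengths[y] for _ in range(lengths[x])]
    den = 0.0
    for z in lengths:  # every sequence in S acts as an intermediate
        w = lookup_sim(sim, x, z) * lookup_sim(sim, z, y)  # P(x~z) P(z~y)
        via_z = matmul(lookup_post(post, lengths, x, z),
                       lookup_post(post, lengths, z, y))   # sum over z_k
        for i in range(lengths[x]):
            for j in range(lengths[y]):
                num[i][j] += w * via_z[i][j]
        den += w
    return [[v / den for v in row] for row in num]


# Toy input: three sequences of two residues each (made-up numbers).
lengths = {"x": 2, "y": 2, "z": 2}
post = {
    ("x", "y"): [[0.6, 0.1], [0.1, 0.6]],
    ("x", "z"): [[0.9, 0.0], [0.0, 0.9]],
    ("z", "y"): [[0.8, 0.0], [0.0, 0.8]],
}
sim = {("x", "y"): 0.6, ("x", "z"): 0.9, ("z", "y"): 0.8}
updated = improved_pct(post, sim, lengths, "x", "y")
```

In this toy input the intermediate sequence z supports the diagonal pairing, so the diagonal entries rise from 0.6 to 0.645 while the off-diagonal entries fall from 0.1 to 0.0625. Applying the function again to the updated matrices would correspond to a second round of the transformation.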

To find the alignment that maximizes the number of correctly aligned residues and effectively captures the local similarities between the given sequences, PicXAA constructs the MSA by adding one aligned residue pair at a time, starting from the most confidently alignable regions (i.e., residue pairs with high alignment probabilities) and progressing towards less confident regions (i.e., residue pairs with relatively low alignment probabilities) (see Note 5). During this process, PicXAA preserves the internal consistency of the alignment by avoiding any conflicts between the current alignment and the potential residue pair to be added to the alignment. In order to verify this compatibility in an efficient manner, PicXAA adopts a graph-based strategy for building up the alignment. In this approach, the MSA is represented as a directed acyclic graph G, where the nodes in G correspond to the columns in the alignment and the directed edges between nodes reflect the relative order of the corresponding columns in the final sequence alignment.

To construct the alignment graph G, PicXAA first sorts all possible residue pairs for all pairs of sequences in S according to the consistency transformed posterior alignment probabilities, in a descending order, to get an ordered set P. Starting from the most probable residue pair, we successively add residue pairs in P to the alignment graph G one pair at a time, provided that the pair being added to the alignment is compatible with the current alignment graph. This compatibility can be easily verified by finding out whether the graph remains acyclic after adding the new residue pair. To further improve the overall speed of this graph-based alignment process, the alignment graph is pruned after each update by removing any redundant edges, which makes the compatibility verification step more efficient.

2.3 Mapping the Graph to a Multiple Sequence Alignment

Except for possibly a few nodes whose relative order is not constrained in G, there is a one-to-one relationship between the final alignment graph G and the multiple sequence alignment. To find the final MSA, we only have to arrange the columns (represented as nodes in G) such that the relative order of the corresponding nodes in this linear arrangement does not conflict with that in the final alignment graph G. This can be easily achieved by using a depth-first search algorithm to arrange the nodes in a linear directed path P, according to their topological ordering.

2.4 Improving the Alignment Quality in Low Confidence Regions

The alignment quality of the regions that mainly consist of residue pairs with low alignment probabilities can be further improved by performing selective profile-profile alignments. Rather than taking a random split and realignment strategy as in [21], which may break the confidently aligned residue pairs that have high alignment probabilities, PicXAA adopts an iterative refinement technique, which first aligns each sequence with a set of highly similar sequences in S, and then aligns the resulting sequence profile with the profile that consists of the remaining sequences (see Note 8). In this way, PicXAA takes advantage of both intra-family and inter-family similarity, thereby improving the overall quality of the MSA in low similarity regions without disrupting the residue alignments in high confidence regions (see Note 4).

2.5 Other Relevant Versions of PicXAA

A similar approach can also be used for the structural alignment of noncoding RNAs (ncRNAs). PicXAA-R [23] extends the basic idea of PicXAA by additionally incorporating RNA folding information to predict accurate multiple RNA sequence alignments. There is also a Web-based platform called PicXAA-Web [24], which is designed to integrate PicXAA and PicXAA-R in a user-friendly Web environment for accurate alignment and analysis of multiple protein, DNA, and RNA sequences. PicXAA-Web can be freely accessed at: http://gsp.tamu.edu/picxaa

3 Notes

1. Generally, PicXAA can be used with any estimation scheme for computing the pairwise residue alignment probabilities. Currently, PicXAA allows the user to choose from three different methods for computing the alignment probabilities: (a) the pair-HMM approach implemented in ref. 10, (b) the structural pair-HMM approach used in ref. 15, and (c) the partition function-based method adopted in ref. 21. These schemes are called PicXAA-PHMM, PicXAA-SPHMM, and PicXAA-PF, respectively. A detailed description of each posterior probability computation scheme can be found in ref. 20.

2. PicXAA-PF and PicXAA-PHMM have comparable computational cost, which is considerably lower than that of PicXAA-SPHMM. The increased computational cost of PicXAA-SPHMM mainly arises from its computationally intensive probability estimation step, which uses a complicated structural pair-HMM.

3. PicXAA-PF and PicXAA-PHMM can be used for aligning both protein sequences and nucleotide sequences, while PicXAA-SPHMM can only be used for multiple protein sequence alignment.

4. Although the main focus of PicXAA lies in effectively capturing the local similarities across sequences while predicting the global alignment of multiple sequences, it consistently yields accurate alignment results for various reference sets with diverse characteristics. In fact, PicXAA can accurately predict the alignment of sequences that belong to closely related sequence families (thus bearing strong global similarities) as well as those that belong to distant families (thus sharing only local similarities).

5. For distantly related sequences that share local similarities that are limited to relatively short subsequences, PicXAA has a clear advantage over progressive alignment techniques in terms of alignment accuracy. This is a direct effect of the probabilistic greedy alignment approach adopted by PicXAA, which first builds up the MSA from sequence regions that can be aligned with high confidence.

6. Typically, PicXAA-PF outperforms PicXAA-PHMM on many datasets, while PicXAA-PHMM yields better alignment results for locally similar sequences.

7. Incorporating structural similarities can be advantageous for aligning protein sequences that share many structural similarities in addition to sequence similarities. PicXAA-SPHMM uses the SPHMM implemented in [15] to estimate the pairwise residue alignment probabilities by incorporating such structural information. As a result, PicXAA-SPHMM often yields improved alignment results for structurally similar proteins, but at the price of increased computational overhead.

8. Parameters used in PicXAA:

   (a) Number of iterations for the probabilistic consistency transformation (PCT): In general, increasing this parameter will improve the consistency of the predicted alignment while reducing the specificity of the predicted result. The default value of this parameter is two.

   (b) Number of iterations for the refinement step: This is the number of times the refinement steps are applied to the sequence set. Experiments show that, typically, 100 iterations are sufficient to obtain an accurate and consistent multiple sequence alignment.

   (c) Scoring matrix for PicXAA-PF: This parameter specifies the scoring matrix that will be used for computing the posterior pairwise residue alignment probabilities in the PicXAA-PF scheme. The default is the Gonnet 160 scoring matrix [22] for protein sequences and the identity nucleotide scoring matrix for DNA sequences.

   (d) Gap open and gap extension penalties for PicXAA-PF: These parameters control the affine gap penalties that will be used to compute the posterior pairwise residue alignment probabilities in the PicXAA-PF scheme. In general, a higher gap penalty results in a higher alignment probability for mismatching (i.e., nonidentical) residues. For protein sequences, the default gap open and gap extension penalties are 22 and 1, respectively (for the Gonnet 160 scoring matrix). For nucleotide sequences, the default gap open and gap extension penalties are 4 and 0.25, respectively.
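The values listed in Note 8 can be collected in a small configuration object, for example when scripting runs or recording the settings used for an experiment. The sketch below is illustrative only: the class, field, and function names are hypothetical and are not PicXAA command-line options, and the refinement count of 100 is the value Note 8(b) reports as typically sufficient rather than a documented default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PicXAAParams:
    """Hypothetical container for the parameters described in Note 8."""
    pct_iterations: int = 2           # Note 8(a): default is two
    refinement_iterations: int = 100  # Note 8(b): typically sufficient
    scoring_matrix: str = "gonnet160" # Note 8(c): protein default (PicXAA-PF)
    gap_open: float = 22.0            # Note 8(d): protein default
    gap_extension: float = 1.0        # Note 8(d): protein default


def default_params(alphabet: str) -> PicXAAParams:
    """Return the Note 8 values for 'protein' or 'dna' input."""
    if alphabet == "protein":
        return PicXAAParams()
    if alphabet == "dna":
        # Identity nucleotide matrix with gap penalties 4 and 0.25.
        return PicXAAParams(scoring_matrix="identity",
                            gap_open=4.0, gap_extension=0.25)
    raise ValueError(f"unknown alphabet: {alphabet!r}")


protein = default_params("protein")
dna = default_params("dna")
```

The scoring matrix and gap penalties apply only to the PicXAA-PF scheme, so a fuller version of this sketch would probably separate them from the two iteration counts.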

Acknowledgment

This work was supported in part by the National Science Foundation through NSF Award CCF-1149544.

References

1. Phillips A, Janies D, Wheeler W (2000) Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol 16:317–330
2. Wong KM, Suchard MA, Huelsenbeck JP (2008) Alignment uncertainty and genomic analysis. Science 319:473–476
3. Cuff JA, Barton GJ (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40:502–511
4. Kemena C, Notredame C (2009) Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25:2455–2465
5. Edgar R, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16:368–373
6. Pei J (2008) Multiple protein sequence alignment. Curr Opin Struct Biol 18:382–386
7. Kumar S, Filipski A (2007) Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res 17:127–135
8. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
9. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
10. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340
11. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
12. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
13. Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518
14. Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 9:286–298
15. Pei J, Grishin NV (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34:4364–4374
16. Paten B, Herrero J, Beal K, Birney E (2009) Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 25:295–301
17. Subramanian AR, Kaufmann M, Morgenstern B (2008) DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol 3:6
18. Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23:e24–e29
19. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast statistical alignment. PLoS Comput Biol 5:e1000392
20. Sahraeian SM, Yoon BJ (2010) PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 38:4917–4928
21. Roshan U, Livesay DR (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22:2715–2721
22. Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
23. Sahraeian SM, Yoon BJ (2010) PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach. BMC Bioinformatics 11(Suppl 1):S38
24. Sahraeian SM, Yoon BJ (2011) PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences. Nucleic Acids Res 39:W8–W12
