This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

1

A Novel Approach to Multiple Sequence Alignment Using Multi-objective Evolutionary Algorithm Based on Decomposition Huazheng Zhu, Zhongshi He, and Yuanyuan Jia

 Abstract—Multiple sequence alignment (MSA) is a fundamental and key step for implementing other tasks in bioinformatics, such as phylogenetic analyses, identification of conserved motifs and domains, structure prediction. Despite the fact that there are many methods to implement MSA, biologically perfect alignment approaches are not found hitherto. This paper proposes a novel idea to perform MSA, where multiple sequence alignment is treated as a multi-objective optimization problem. A famous multi-objective evolutionary algorithm framework based on decomposition (MOEA/D) is applied for solving MSA, named MOMSA. In the MOMSA algorithm, we develop a new population initialization method and a novel mutation operator. We compare the performance of MOMSA with several alignment methods based on evolutionary algorithms, including VDGA, GAPAM, and IMSA, and also with state-of-the-art progressive alignment approaches, such as MSAprobs, Probalign, MAFFT, Procons, Clustal omega, T-Coffee, Kalign2, MUSCLE, FSA, Dialign, PRANK, and CLUSTALW. These alignment algorithms are tested on benchmark datasets BAliBASE 2.0 and BAliBASE 3.0. Experimental results show that MOMSA can obtain the significantly better alignments than VDGA, GAPAM on the most of test cases by statistical analyses, produce better alignments than IMSA in terms of TC scores, and also indicate that MOMSA is comparable with the leading progressive alignment approaches in terms of quality of alignments.

and proteins is improved accordingly with the increase of sequence alignment accuracy [2]. Therefore, developing biologically perfect alignment approaches is a promising research field. Multiple sequence alignment is mainly focused on discovering biological relations among different sequences, including nucleotide or amino acid, to investigate their underlying main characteristics or functions. In other words, if there are more similarities among the aligned sequences, they may have more biological relationships. We also can predict the functions or characteristics for the aligned sequences. In general, MSA can be formulated mathematically in the following: Given a set of N strings S= {S1, S2,…, SN} over the alphabet set  ,which consists of 4 nucleotides or 20 amino acids respectively, we need to find an alignment S such that the number of the aligned symbol is as much as possible by deleting or inserting gaps, as illustrated in Fig. 1. The found alignment S is subject to the following constraints [3]: (a) S i  S i for all i=1, 2,…, N, if all gaps are removed from S i .

Index Terms— Multiple sequence alignment, Multi-objective optimization, MOEA/D.

Score( S )    sim ( S i , S j )   g (S i ) is maximized where

(b) S1  S 2    S N  , where Si denotes the length of S i . (c)The

alignment i

j

score

i

sim( Si , S j ) denotes the evaluation of similarity between S i I. INTRODUCTION ultiple sequence alignment (MSA) is the first and one of the most important steps in bioinformatics because it has a variety of uses, such as phylogenetic analyses, identification of conserved motifs and domains, structure prediction and so on [1]. Researchers have demonstrated that the secondary or tertiary structure prediction accuracy of RNA

M

This work was supported in part by the China Science and Technology Project of Ministry of Transport under Grant No. 201 1318740240,and the Chongqing keyscientific and technological project under the No. CSTC 2010AB2001. The authors are with College of Computer Science, Chongqing University, Chongqing, China (email: [email protected], [email protected], [email protected]).

and S j , and g ( Si ) is penalty function of gaps in Si . From the above definition of MSA, it can be seen that MSA is actually viewed as a single-objective optimization problem. Additionally, it is worth noting that MSA is proved to be a NP-complete problem [4]. MSA has been paid a lot of attention from many researchers since Needleman and Wunsch proposed firstly dynamic programming for pairwise sequence alignment in 1970 [5]. So far, many approaches to MSA have been developed, which fall into three broad categories: exact, progressive and iterative. Dynamic programming technique is an exact algorithm, which can obtain the mathematically perfect alignments. However, as mentioned above, computation complexity is too high to reach for more than three sequences in reality [4].In order to overcome this drawback, progressive alignment

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

2

Fig. 1.An illustration of multiple sequence alignment

method is proposed by Feng and Doolittle. They utilized the Needleman and Wunsch pairwise alignment algorithm iteratively to achieve the simultaneous alignment of multiple sequences [6].This approach has the great advantage of speed and simplicity combined with reasonable sensitivity. Therefore, it is applied widely for routine applications. One of most known progressive alignment programs is CLUSTALW, which uses the weighting method, position-specific gap penalties and weight matrix to increase its robust and sensitivity [7]. The authors of PRANK proposed a progressive alignment algorithm with the phylogeny-aware gap placement method, because traditional alignment methods lead to excessive penalization of single insertion events [8]. However, the progressive alignment methods are essentially heuristic, and follow the rule of “once a gap, always a gap”, which may lead to the main disadvantage that once a mistake is made in the initial alignment, it will be impossible to be modified. In order to overcome the limitations of the progressive algorithms, mixture of iterative and progressive algorithms is recommended by researchers, for instance, MUSCLE is a representative hybrid algorithm [9], [10]. One alternative algorithm is based on Hidden Markov Model (HMM) [11], [12], [13] or Fast Fourier Transform [14]. Alignment algorithms based on HMM are popular with researchers currently. Another efficient MSA program is Kalign, which is based on the Wu-Manber string match algorithm [15]. Kalign2 is the newest version that is a fast and accurate aligner [16]. As the definition of multiple sequence alignment mentioned above, MSA can be considered as an optimization problem. So, stochastic optimization algorithms for multiple sequence alignment are natural and promising alternatives, which belong to the iterative alignment algorithms. Almost all stochastic optimization methods have been used for multiple sequence alignment, such as simulated annealing[17], genetic algorithms [18], particle swarm optimization [19], and ant colony optimization [20],[21]and so on. Furthermore, it should be pointed out that genetic algorithms are most widely used, like SAGA [18], MSA-GA [22], RBT-GA [3], GAPAM [23] and VDGA [24], perhaps because genetic algorithms essentially stem from Darwin’s principles of natural selection. These approaches iteratively improve the quality of multiple sequence alignment until the alignment cannot be better by designing the alignment updating method. This strategy can overcome the drawback of progressive alignment that the error you make in the initial alignment cannot be changed later, but it is computationally expensive and does not guarantee the optimal

Fig. 2. Illustrations of the gap opening and the gap extension

alignment can be reached each time. From the above literature review, existing alignment algorithms treat MSA as a single-objective optimization problem. For example, procedure VDGA uses some mechanisms iteratively to make the sequences aligned as much as possible. In VDGA, Weighted Sum of Pairs (WSPs) scoring function is used to measure the quality of the sequence alignment. The alignment with the best WSPs is the objective of VDGA. However, the objectives of multiple sequence alignment are to maximize the number of matching base pairs among the sequences and minimize the gap score by inserting or deleting gaps. Therefore, MSA is actually a bi-objective optimization problem. In this paper, a novel approach to multiple sequence alignment using multi-objective optimization algorithm based on decomposition is proposed, which is called MOMSA, where a new population initialization method and a novel mutation operator are developed. The rest of this paper is structured as follows. Multiple sequence alignment problem modeling is given in Section II. Section III presents MOMSA method. Experiment results and analysis are provided in the Section IV, and then in the Section V, the conclusions are made.

II. MULTIPLE SEQUENCE ALIGNMENT PROBLEM MODELING MSA is a complicated problem, especially, the choice of an objective function is very important for finding a mathematically optimal alignment. Over the recent years, several new objective functions were proposed, such as WSPs, Coffee, T-Coffee and 3D-Coffee and so on [7], [25],[26],[27]. Coffee score reflects the level of consistency between a MSA and a library containing pairwise alignments of the same sequences. It can increase the accuracy of multiple sequence alignments, but the computation time goes up sharply [25]. WSPs is simple and easy to use for evaluating an alignment, so it is widely used by researchers. Because we will use WSPs as a metric, its specific procedure is introduced in detail. A. Weighted Sum of Pairs (WSPs) WSPs is widely applied for evaluating an alignment [7], [23], [24], which can be formulated in the following. L

score( S )   Sl

(1)

l 1

N 1

Sl  

N

w

ij

cost( Ai , A j )

i 1 j  i 1

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

(2)

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

3 Where 0 if A =A ='-'  i j  if (A  '-' and A  '-' ) or  gapopen i j  (A  '-' and A  '-') and  i j   it is a gap opening    cost( A , A )   gapextend if (A  '-' and A  '-' ) or i j i j   (A  '-' and A  '-') and i j   it is a gap extension    Blosum( A , A ) if A  '-' and A  '-' i j i j  or PAM (A , A )  i j

Fig. 3. The substitution cost and gap cost of two alignments

(3)

score( S ) denotes the score of sequence alignment S . L is the length of the aligned sequences S . Sl is the cost of the l -th column of the aligned sequences S . N is the number of the aligned sequences S . wij is the weight of sequence i and j , which is calculated according to the distances among the unaligned sequences. cost( Ai , A j ) is the alignment score between two residues Ai and A j . gapopen and gapextend are the penalty cost of a gap opening and a gap extension , respectively. Fig. 2 illustrates the gap opening and the gap extension. In CLUSTALW, the initial default gapopen , gapextend are 10 and 0.2,respectively,which are adjusted dynamically in the alignment process[7]. Blosum( Ai , A j ) and PAM (Ai , Aj ) are the score of matched residues Ai , A j according to the blocks substitution matrix (Blosum)[28]and percent accepted mutation matrix(PAM) [29], respectively. WSPs scoring function has suggested that MSA problem can be addressed as a single-objective optimization problem. However, as mentioned by authors of CLUSTALW,the choice of alignment parameters is one of major problems in sequence alignment, such as gapopen , gapextend ,and w ij . Although a detailed study on these parameters is proposed, parameter choice problem is still worth researching. B. Our new proposed model of MSA Generally speaking, the simultaneous alignment of many nucleotide or amino acid sequences is to make these sequences aligned as possible as we can. In other words, we need to maximize the number of matching base pairs among the sequences and minimize the score of gap penalty by inserting or deleting gaps. Therefore, it is obvious that MSA can be regarded as a bi-objective optimization problem. In this paper, MSA is formulated mathematically as follows:

minimize F ( S )  ( f1 ( S ), f 2 ( S ))T f1 ( S )  scoregap(S)

(4)

f 2 ( S )   SP(S) S  Where scoregap(S)  x * gapopen  y * gapextend L

N 1

SP(S)   Sl , Sl   l 1

(5)

N

 cost( A , A ) i

j

(6)

i 1 j  i 1

Where the specific formula of cost( Ai , A j ) can be seen in (3).

 is the sequences alignment space. x is the number of the gap opening. y is the number of the gap extension. scoregap(S) is the score of gap penalty. SP(S) denotes the sum of score of the matched pairs in the sequence alignment S . For instance, in Fig.3, there are two alignments S a , S b . The first objective function value scoregap ( S a ) and scoregap ( S b ) are 70.2 and 120.4 respectively, if gapopen is set to 10 and gapextend is 0.2. The second objective function value SP ( S a ) and SP ( S b ) are 55 and 79 respectively, if we choose PAM250 as the substitution matrix. Obviously, the alignments S a , S b are incomparable if the two objectives are considered. In other words, S a , S b are Pareto optimal alignments, which provide the decision makers to choose. If an alignment algorithm by optimizing a single objective is used, just only one of them can be obtained. That means some good alignments are omitted. It can be seen that multi-objective optimization concept can help us find more meaningful alignments and better understand the relationships among these biological sequences. III. MOMSA: THE PROPOSED METHOD A. Motivations There are two drawbacks when MSA is viewed as optimizing the single objective problem, for instance WSPs. Firstly, WSPs scoring function has several parameters to select, however, it is not a trivial but complicated task. Because these parameters are related with the number of sequences, the length of sequences, and the similarity of the unaligned sequences, they will vary with the difference of the selected sequences. Secondly, even if the parameters of WSPs are given properly, the biologically optimal alignment may not be obtained [30], [31].In [30], the authors found that there exists two different alignments may have same value of objective function, and the

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

4 biological optimum may not be the mathematically optimal alignment. Based on the second drawback, a hybrid immune inspired MSA procedure (IMSA) has been proposed, which have the merit producing several optimal or suboptimal alignments [32]. However, because the parameters of objective function are not determined, multiple alignment process may have some probability of misleading by the unsuitable objective function parameters. Therefore, Multi-objective optimization algorithm for multiple sequence alignment (MOMSA) is developed in this paper. To explain the motivation further, there is a simple instance to illustrate as follows.

maximize F  ( f1 , f 2 )T f1 ( x)  x

(7)

f 2 ( x)  10  x x0 (7) is a bi-objective optimization problem. If the maximization of function F is considered as a single-objective function to be optimized, then the weight should be used for balancing the two objectives. It is assumed that two choices of weight vectors are as follows: i. If the weight vector W  (0.5, 0.5) , then

F  0.5*( f1  f 2 )  0.5*( x  10  x)  5 . ii.

If

the

weight

vector W  (0.1, 0.9)

,

then

F  0.1* f1  0.9* f 2  0.1* x  0.9*(10  x)  9  0.8* x . So, the maximum of F is 9. From the assumption above, the choice of weight W is highly related with the single-objective function value. So, it shows the importance of correct selection of parameters when you view (7) as a single function optimization problem. If situation (i) happened, then the maximum of F should be 5. When you are confronted with this problem, you may select the weight vector like situation (ii), and if you use IMSA, which can obtain optimal and suboptimal value, so you may get the maximum 5. In other words, you may reach the right answer. But if the situation is like situation (ii) in real, then maximum of F should be 9. So, while you choose the weight vector like situation (i), it is impossible to arrive the real maximum 9, even if you use an excellent optimization algorithm, because the maximum is at most 5 when we choose the weight=(0.5,0.5). Therefore, IMSA has still some difficulties when the parameters are not selected correctly. While (7) is solved by a multi-objective optimization algorithm, it does not need to select these parameters, such as weight W . So, There are no the difficulties from the choice of parameters. This is major motivation of MOMSA, which uses a recent proposed multi-objective optimization framework MOEA/D [33]. B. MOEA/D framework MOEA/D framework was proposed by Zhang and Li in 2007.The idea of MOEA/D is that it uses the weighting method to convert a multi-objective optimization problem into a

number of scalar optimization problems. MOEA/D is selected as the algorithm framework because it has the following merits: i. It does not have the non-dominate sorting procedure, so it is simpler than nondecomposition MOEAs, such as NSGAII. It is more efficient than other algorithms. ii. It can obtain the uniform solutions without complicate methods, like crowding distance method. iii. Traditional single objective optimization algorithm can be directly applied for multi-objective optimization problems. The specific algorithm can be seen in the literature [33]. In MOEA/D,the authors introduced three methods to decompose the multi-objective optimization problem into a number of scalar optimization problems. They are respectively Weighted Sum approach, Tchebycheff approach, and Penalty-based Boundary Intersection (PBI) approach. In this paper, the Tchebycheff approach is adopted because it is simple and does not need the additional penalty factor as PBI approach. In the Tchebycheff approach, the scalar optimization subproblem j can be formulated in the following: (8) minimize g te ( s |  j , z*)  max{ i j | f i ( s )  zi* |} 1 i  m

s.t. s  

where z*  ( z1 , z2 ,...zm )T is a reference point,which can be computed by formula (9). (9) zi*  min{ f i ( s ) | s  }, i  1, 2,..., m

 j  (1j , mj ,..., mj )T is a weight vector of the jth subproblem, m

i.e. i j  0 for all i =1,...,m, and



i

j

 1, m is the number of

i 1

objectives. More precisely, in this paper, m is 2. Let NP be the number of decomposed subproblems, then the weight vector  j is (j/NP,1- j/NP). It is noted that the ideal point z * is not calculated by (9), but is defined as minimum in the whole current population since searching the exact ideal point consumes large computational resources. C. MOMSA approach Algorithm 1 MOMSA approach Input: NP: the number of the subproblems considered in MOMSA, i.e. the size of the population Nnei: the number of the weight vectors in the neighborhood for each weight vector j  ,j=1,2…,Np Tmax: the maximum number of iteration Output:

S * : the best sequence alignment in the final population Step 1 Initialization Step1.1 Randomly generate an initial population P0 of size NP, the specific procedure is seen in Fig.4. Step1.2 Generate NP weight vectors  j , j=1,2,…,NP, where  j  (1j , 2j ,..., mj ) ,m is the number of objectives. Step1.3 Compute the Euclidean distances dij between each

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

5 two weight vectors, and then based on the Euclidean distances matrix d, select Nnei closest individuals to form their neighborhoods’ set Bi for each individual Si , i=1,2,…,NP, Bi={Si1, Si2,…, SiNnei}, Nnei denotes the size of Bi. Step1.4 For each individual Si, compute objective function value f1(Si), f2(Si) ,i=1,2,…,NP. Let F1= (f1(S1), f1(S2),…, f1(SNp)), F2= (f2(S1), f2(S2),…, f2(SNp)), set reference point Z=(min(F1),min(F2)), where function min means that getting the minimum. Step1.5 t= 1. Step2 Evolution For i=1,2,…,NP, Step2.1 Reproduction: firstly, randomly select two individuals from the neighborhood set Bi of Si, and then implement the operation of single point crossover or multiple points crossover at a defined probability on the two selected individuals to generate a new individual Si , and then mutate Si . Subsequently, S i is obtained by the operations of vertical decomposition with guide tree (VDGT) for the individual Si , i.e. Si  VDGT ( Si ) . If Si dominates S i , then Si  Si . Lastly, add S i to the Population Pt. Step 2.2 Updating the reference point Z: For each

r=1,2,…, m, if Z r  f r ( S i ) ,then Z r  f r ( S i ) . m is the number of objectives. Step2.3 Updating the neighboring solutions of Si: For each neighborhood Sik of Si, k=1,2,…, Nnei, if k

k

k

g ( Si  Si , Z )  g ( Sk  Si , Z ) , S k  S i .  Si is

the weight vector of Sik . Endfor Step 3 Stopping criterion: If t>Tmax, then stop and output the best individual S * evaluated by a criterion, which is determined by decision-maker in the final population Pt, otherwise the counter t=t+1, and go to step 2. The framework of MOMSA can be seen in the algorithm 1. The steps of our proposed method are broadly population initialization, population evolution, and the stopping criteria. For explaining comprehensively the specific procedure, these steps will be described in details below. 1) Population initialization To produce the initial population P0 with Np candidate alignments, several approaches have been presented by researchers. In SAGA, a random initialization method is used, which choose a random offset for all the sequences, and then move them to the right, lastly, make these sequences have the same length by padding with null signs. In MSA-GA, it firstly uses the Needleman-Wunsch algorithm to align globally these

pairwise sequences, and then insert random gaps to make the length of all sequences same. In IMSA, the candidate alignments are mostly created by the CLUSTALW algorithm, and the remaining candidate alignments are produced by random initialization as the initialization method in SAGA. In GAPAM and VDGA, their initial generations are also based on tree-based method, but they do not use the random offset. GAPAM firstly generates two alignments by performing the CLUSTALW algorithm with two different distances calculating method, and then two mechanisms are implemented to generate many different trees. The first mechanism is randomly selected sequences make a new sub-tree, and the non-selected sequences develop another new sub-tree. Lastly, they are merged together to form a new tree. The second one is that select two sequences and then exchange their positions to produce a new tree. Even though it is a good population initialization method, it is time-consuming. In this paper, a new and simple population initialization method is presented as shown in the Fig. 4, which is inspired by gap-insertion operation in SAGA. The specific procedure of population initialization is described below. Firstly, we generate an alignment S by tree-based method, such as CLUSTALW etc. And then, the sequences of alignment S are divided randomly into two groups, in which we randomly select a position and insert a user-defined number Nins of gaps in the same columns, respectively. Lastly, the two groups of sequences are composed as a complete alignment. 2) Genetic operators (a) Crossover operators In SAGA, there are 22 operators introduced for searching the best alignment. The effect of these operators for performance of SAGA is investigated, and the observed experimental results shows that the crossover operators that are composed of single point crossover and multiple points crossover seem to be not helpful for SAGA’s success because they are very disruptive, especially at the junction point [34]. However, the crossover operators are good at maintaining diversity of population, which is vital for evolutionary algorithm. Therefore, these two crossover operators are used in the proposed MOMSA algorithm. Single point crossover is illustrated in Fig. 5[23]. Initially, two individuals (parent a, b) is chosen with probability from the whole population or its set of neighbors, one of which is then divided vertically at a random column into two parts (a1, a2), and another one of which is cut to two pieces (b1, b2), in which each piece should contain the same number of residues or bases as the counterpart of first individuals (parent a). At last, these pieces are merged to two new individuals, in which a fitter individual would survive. Likewise, multiple points crossover operator needs two parent individuals, and then divides vertically them at two random positions. It is noted that when dividing another parent individual according to the first one, the point is that the number of residues contained in each part is the same as the counterpart of the first individual, which is illustrated in Fig. 6[23].

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

6

Fig. 4. The process of population initialization

Fig. 6. The example of multiple points crossover

Fig. 5. The example of single point crossover

Fig. 7. An example of mutation operator

(b) Mutation operators In VDGA and GAPAM algorithm, the authors employed the mutation operator to keep the diversity of population abundant and to improve their convergence. The mutation operator is that shuffling the two selected sequences in the guide tree and then generating a new guide tree to find a new better multiple sequence alignment. By experimental observation, we found that the mutation is too computational expensive, and usually cannot find the better alignment. That is why GAPAM’s developer found that any best solution did not be achieved when only implementing mutation operation. Therefore, the mutation operation is not employed in the proposed MOMSA algorithm. Here, we propose a new mutation operation that is simple and efficient. The proposed mutation operation’s motive is that some local optimal solutions need to change a little in order to find a better solution (i.e. a better alignment).Fig. 7 illustrates our proposed mutation operator, which randomly choose a piece in the candidate alignment, and then remove all the gaps in the piece. Subsequently, a tree-based method is used to align the piece. Lastly, the new piece replaces the counterpart of the candidate alignment.

Fig.8. An example of vertical division

3) Vertical division operator In the VDGA, the vertical division strategy can make the sequences increasingly aligned, the idea of which is to divide the longer sequences into smaller parts, align them respectively, and then combine them to generate a complete sequences alignment. This paper adopts the vertical division technique as a local search operator to improve the performance of the MOMSA algorithm. Fig. 8 illustrates this operator how to work. A candidate multiple sequence alignment is firstly decomposed randomly to three parts, and then a tree-based method is applied to align the three parts of sequences separately to form three smaller alignments. Finally, we merge the three alignments into a new whole alignment as shown in Fig. 8(e) [24]. 4) Stop condition In MOMSA, the maximal iterations Tmax is applied for stopping this procedure. Based on experimental observations, MOMSA algorithm is set to terminate when the iterations is 100, because the non-dominated solutions are not almost changed, and we also have tested MOMSA algorithm for up to 200 iterations, but the solutions we obtained did not significantly changed.

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

7 5) Criterions for the decision-maker At the last stage of the MOMSA algorithm, a set of non-dominated solutions will be obtained, among which the decision-maker selects a solution according to their preferences in reality. In this paper, we use weighted sum of pairs (WSPs), which is used widely for evaluating solutions and whose formula can be seen in section 2.1, to measure these non-dominated solutions and choose a best one (i.e. a multiple sequence alignment).

TABLE Ⅰ THE RESULTS OF MOMSA, VDGA, GAPAM ON THE SELECTED TEST CASES GAPAM VDGA MOMSA-W a b name SN avgSL (SP) (SP) (SP)

IV. EXPERIMENTAL STUDY In this section, firstly, we compare MOMSA with the recently proposed multiple sequence alignment algorithms based on evolutionary algorithms, including VDGA, GAPAM, IMSA, to demonstrate its superiority. After that, we evaluate the performance of the proposed MOMSA with many popular aligners. In this paper, MOMSA is coded in MATLAB and implemented on the personal computer with four Intel Core i5 3.1GHz processors with 3072 MB RAM under WIN7 platform.

ref1

A. Datasets BAliBASE(Benchmark Alignment database), which is a well-known benchmark alignments dataset for evaluating multiple sequence alignment programs, was developed by manually aligning based on those known three-dimensional structures of proteins. In this paper, we choose BAliBASE 2.0 and 3.0 to test our proposed MOMSA algorithm. There are five reference sets of unaligned sequences with different characters that contain the length and the identity among the unaligned sequences in the original BAliBASE. Based on it, BAliBASE 2.0 is developed by correcting some reference alignments. Besides that, three new alignments references sets are added for identifying the strength and weakness of alignment algorithms [35]. BAliBASE 3.0 is the latest version of the multiple sequence alignment benchmark, which has been widely used by many researchers [36]. In BAliBASE 3.0, there are 6255 sequences and 218 test cases. They are divided to five test sets. Two measures of accuracy of our test alignments are introduced in BAliBASE. The one is the sum of pairs (SP) score, the fraction of aligned residue pairs in the reference alignment that are correctly found in the tested alignment, and the other one is the total column (TC) score, the fraction of aligned columns that are correctly found. The source code of calculating both measures can be downloaded from http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE. B. Parameters setting In the proposed MOMSA algorithm, population size NP is set to 100 as in the MOEA/D. That is because we observed that the obtained solutions with the population size 100 are better than that with 50, but almost same to that with 200, by a number of experiments. The number of the weight vectors Nnei is 10. The maximum number of iteration Tmax is 100. We adopt two kinds of crossover operation: single crossover and multiple points crossover, and use a new proposed mutation operator. The probabilities of each operator are set to 0.5, 0.5,

ref2

a.

1idy

50

58

0.5650

0.5730

0.2154

1tvxA 1uky

4

69

0.3160

0.2670

0.0526

4

220

0.4020

0.4490

0.5148

kinase

5

276

0.4870

0.5450

0.8496

1ped

3

374

0.4980

0.4820

0.7389

2myr

4

474

0.3170

0.3590

0.4372

1ycc

4

116

0.8450

0.7550

0.9345

3cyr

4

109

0.9110

0.8210

0.8154

1ad2

4

213

0.9560

0.9410

0.9562

1ldg

4

675

0.9630

0.9060

0.9886

1fieA

4

442

0.9630

0.9300

0.9820

1sesA

5

63

0.9820

0.9620

0.9583

1krn

5

82

0.9600

0.9600

1.0000

2fxb

5

63

0.9700

0.9780

0.9357

1amk

5

258

0.9980

0.9840

0.9947

1ar5A

4

203

0.9740

0.9380

0.9604

1gpb

5

828

0.9830

0.9840

0.9862

1taq

5

928

0.9450

0.9590

0.9477

1aboA

15

80

0.7960

0.6910

0.8398

1idy

19

60

0.9890

0.9920

0.9743

1csy

19

99

0.7640

0.8850

0.8536

1r69

20

76

0.9650

0.8340

0.9450

1tvxA

16

69

0.9200

0.9740

0.9365

1tgxA

19

71

0.8780

0.8780

0.9522

1ubi

15

60

0.7670

0.7780

0.9211

1wit

20

106

0.8510

0.8150

0.9203

2trx

18

94

0.9860

0.9860

0.9863

1sbp

16

262

0.7650

0.7720

0.8808

1havA

26

242

0.8790

0.8460

0.8969

1uky

23

225

0.8080

0.8910

0.9404

2hsdA

20

255

0.7960

0.8290

0.9192

2pia

16

294

0.8280

0.8500

0.9733

3grs

15

237

0.7460

0.7510

0.8492

kinase

18

287

0.7990

0.8880

0.9397

1ajsA

18

389

0.8990

0.9050

0.9015

1cpt

15

434

0.8750

0.8120

0.8862

1lvl

23

473

0.7810

0.8190

0.9462

1pamA

18

511

0.8600

0.8630

0.9581

1ped

18

388

0.9120

0.9470

0.9717

2myr

17

482

0.8220

0.8300

0.9659

4enl

17

440

0.8960

0.8890

0.9151 SN: Sequence Number. b. avgSL: average Sequence Length

and 1, respectively. A new population initialization method is developed, in which a random number of gaps will be inserted into a seed alignment. In this paper, we set the number Nins of gaps inserted to 5, if the maximum length of the unaligned sequences is less than 100. If not, Nins is set to 1/20 of the

2168-2194 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2015.2403397, IEEE Journal of Biomedical and Health Informatics

8

C. Experimental results and analysis 1) Comparison of MOMSA with VDGA, GAPAM In order to judge the performance of our proposed MOMSA algorithm, we compare it with VDGA and GAPAM, which is recently proposed multiple sequence alignment algorithms. For comparing the solutions obtained by each algorithm conveniently, we have selected test data according to the GAPAM’s. The authors chose 56 test cases in BAliBASE 2.0, which contains 18 test cases from Reference 1, 23 test cases from Reference 2, 11 test cases from Reference 3, and two test cases from Reference 4 and 5 respectively. As mentioned in Section Ⅲ-A-5, in this paper, we adopt WSPs as the measure to select the best alignment from the obtained non-dominated solutions. The corresponding SP is recorded. We denote the MOMSA algorithm with the best WSPs by MOMSA-W. MOMSA-W is performed for 5 times and the median of their results are recorded. Table Ⅰ shows the results of MOMSA-W, VDGA, and GAPAM. For analyzing effectively the results showed in Table Ⅰ, we have performed a non-parametric test, Wilcoxon signed rank test by the software SPSS, which can judge the difference between the paired algorithms. In other word, Wilcoxon signed rank test can decide which algorithm is better. It is noted that Bonferroni correction is also used for Wilcoxon signed rank test to decrease the global typeⅠerror. We assume that there is no significant difference every two algorithms as a null hypothesis. Confidence level is set to 95%. Hence, if the asymptotic significance P

A Novel Approach to Multiple Sequence Alignment Using Multiobjective Evolutionary Algorithm Based on Decomposition.

Multiple sequence alignment (MSA) is a fundamental and key step for implementing other tasks in bioinformatics, such as phylogenetic analyses, identif...
1MB Sizes 1 Downloads 10 Views