Interdiscip Sci Comput Life Sci (2015) 7: 1–10 DOI: 10.1007/s12539-014-0238-5

MSuPDA: A Memory Efficient Algorithm for Sequence Alignment Mohammad Ibrahim Khan, Md. Sarwar kamal, Linkon Chowdhury∗ (Department of Computer Science & Engineering, Chittagong University of Engineering and Technology, Bangladesh, Cuet Road, Zip Code-4349)

Received 29 June 2014 / Revised 7 September 2014/ Accepted 3 November 2014

Abstract: Space complexity is a million dollar question in DNA sequence alignments. In this regards, MSuPDA (Memory Saving under Pushdown Automata) can help to reduce the occupied spaces in computer memory. Our proposed process is that Anchor Seed (AS) will be selected from given data set of Nucleotides base pairs for local sequence alignment. Quick Splitting (QS) techniques will separate the Anchor Seed from all the DNA genome segments. Selected Anchor Seed will be placed to pushdown Automata’s (PDA) input unit. Whole DNA genome segments will be placed into PDA’s stack. Anchor Seed from input unit will be matched with the DNA genome segments from stack of PDA. Whatever matches, mismatches or Indel, of Nucleotides will be POP from the stack under the control of control unit of Pushdown Automata. During the POP operation on stack it will free the memory cell occupied by the Nucleotide base pair. Key words: MSuPDA, Anchor Seed, Quick Splitting, POP.

1 Introduction DNA sequences of data continue have a profound effect on biology. Sequence comparison and alignment compute intensive task in memory are used to handle large volume of data set. Sequence alignment rates the difference and similarity between two sequences contained billions of base pairs (bp). During the alignment under computing tool, less memory is used or reused of the same memory. Sequence alignments are categorized as pair wise and multiple sequence alignments. Pair wise alignments might be global or local. For each alignment, a lot of woks have been done since 1980s.Though there are a lot of works and researches in this field, sequence alignment is still a fundamental problem of computational biology. It has various positive impacts on functional identification of newly sequences genes and buildings of reshape DNA sequences. Local sequence alignment locates best approximate subsequence match within two given subsequences. Sequences might be DNA, RNA and Protein sequences in biological sequences. Among these, DNA consist of four alphabets (A, C, G and T), in the case of RNA there are U(Uralic) instead of T) and Protein sequence consists of twenty alphabets. Local sequence alignment finds the similarities at sub-regions and indicates the best fit locations. ∗

Corresponding author. E-mail: [email protected]

Good alignment and similarities have come from solution of an optimization problem. Problem formulation defines including match, mismatch, insert and delete and assigns them value and cost in a distance function. Optimization maximizes the total match value [15] and minimizes the total cost of mismatch value and insert or delete gaps. The design of sequences started in 1995 by Fleischmann [8] to Pagani [19] 2012. In this work the authors have considered the biological sequences composed by alphabets. T-coffee [16] is a new progressive algorithm for sequencing alignment which combines signal from heterogeneous source into consensus multiple sequence alignment. In T-Coffee a substitution matric and a position scoring scheme are used for combining local and global sequence alignment. ProbCons [6] is a consistency based progressive protein multiple alignment tool for all suboptimal alignments with posteriorprobability–based scoring. ProbCons includes posterior probability scoring for suboptimal alignments, insertion state pairs for separation of short and long insertions and guide tree construction using expected accuracies. Newberg [17] proposed an optimal back trace within limited memory using dynamic programming. This algorithm improves the simplicity and speed of the calculation for the index of the optimal checkpoint. Dynamic programming [3] is a popular algorithm for determining the similarity between two sequences. However, its complexities are O (MN), where M and N represent the lengths of two DNA sequences been compared. The standard dynamic programming alignment

Input tape b

c

r

q

p

a

l

s

Control unit

n

algorithms produce alignment that maximizes the specific objective function. The Objective function or score is the sum of weighted matches or mismatches including negative weights for insertions and deletions. The resulting alignments give explicit relationships among the sequences; the substitutions, insertions and deletions required to transform one sequence into another. Dynamic programming has some drawbacks. One is the limitation of the evolutionary transformations to substitutions, insertions and deletions. Dynamic programming is costly in terms of computer time and space. Besides, local sequence alignments gapless alignment as BLAST [1] or FASTA [22] is examined [2], [5], [10] properly. We have noticed that some alignments with gap also play important role to decide the proper match for the exact sequence matching. We have examined the alignments with gap are Needleman-Wunsch [14] or Smith-Waterman [24] algorithms. We have introduced new algorithm for DNA sequence alignment that memory saving under Push Down Automata (MSuPDA).We have used it instead of dynamic programming for sequence alignment. MSuPDA reduces space complexity for global and local alignments. Pushdown Automat (PDA) is very useful tool in computer science to free occupied memory space. Push Down Automaton (PDA) is similar to a finite automaton that has a single stack. Data item can be retrieved from storage, it is also known as push down stack or push down list.PDA can write symbol in the stack and read back from later one. A read-write head executes the POP operation by reading topmost symbols. PUSH operation executes simply writing operation in stack. Stack offers the infinite memory and PUSH-POP the symbol in “Last-In-First-Out” fashion. Push Down Automata contains 3 elements: 1. Input Tape 2. Stack Input 3. Control Unit PDA starts from initial state and reads the symbol from input tape. It switches to next state, where control unit reads symbol from input tape and matches with the stack symbol for each input tape symbol. When control unit matches with stack symbol, the symbol on the top of the stack removes and the remaining symbol moves up; Removing symbol from the stack referrers to as popping symbol. When the top of the stack symbol does not match with the input tape symbol or the input also not belong to the languages or string, PDA blocks or to reject the input. If it reaches to the end of the input tape and stack becomes empty, then the input belongs to string and accepts it. In PDA (Fig. 1), control unit synchronizes between stack and input unit. In our proposed algorithm, PDA releases assigned memory space after completing of matching or mismatching operations. The numbers of release cells are proportional to the execution time. If the execution time increases then the system releases more space. MSuPDA reduces the space of DNA sequence for both

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

Stack input

2

Fig. 1

Schematic presentation of PDA

local and global alignment.

2 Motivation Human genome contains huge data set and cumbersome to use this technique. Some other researches show the same as they do not use the memory reuse or memory free methods. Ning [13], Kent [11] Schwartz [23] and Watanabe [24] examined the accuracy of the sequencing. The algorithms above do not consider the situations regarding the repetitions of the patterns, mutations and incorrect mapping. Here we have noticed that system rejects the data sets, it makes the area small and as a consequence the calculations become complicated as well as wrong. In our previous work [12] we have checked the time complexity, accuracy, efficiency, speed and sensitivity. We have had reduced the time complexity to O (n) from the complexity of O (n2 ) from the work [25]. The pivotal factor that reduces the complexity was Bounding Box algorithm. Bounding Box technique helps to split the DNA segments into subsections until it reaches to the unique Nucleotide. But this process cost the spaces and creates overhead for machine. Since Human genome contains huge data set it is cumbersome to use this technique. Ewing and Green [7] proposed a solution to overcome such ambiguity but this method is low-quality regions. In the field of sequencing, there are a lot of software helping to detect the sequences, SNP detector [21], and novoSNP [25]. But these three systems detect genotype sample. They are unable to solve the dynamic patterns and sequences. The software that is used for sequencing is mainly two categories: Training based and detection based. The detection based depends on homolog finding but the homolog finding mainly depends on the database finding [4]. But the database finding is not always efficient due to the size and cost of the equipments. Training based sequencing is not able to identify the proper coding regions due to the lack of generalizations [18]. Besides, predictions are

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

3

can be possible to take as input as character set of DNA nucleotide data stream. Here input file can read data from proper file. Then one important step of the process is to select Maximum Matching Sub Sequences (MMSS), which refers the matching sections of given to input data set. Using Quick Splitting technique, MMSS is calculated for both input sequence and target sequences. Both input and target sequences are divided into more substructures based on target position and preferable length. Left and right recursion operation is chosen on target position. Recursion operation (Fig. 3 and Fig. 4) is used to spilt the DNA sequences. After quick splitting, substructure pattern of input and target sequences are organized by Sequence Alignment by Shifting (SAS). SAS operation selects anchor from input and target sequence. The anchor tells the matching seeds which are repeated to the sequences.SAS find out the matching barrier and divided into two subsequences. From subsequences of input and target, SAS operation finds out the anchor lists. (Fig. 5). Anchor lists and MMSS are placed in input and stack tape of push down automata respectively. In memory mapping stage, anchors are matched with MMSS. In every matching or mismatching found data sets are released from stack. Then stack operation is performed as memory mapping for space utilization. Finally, the desired local sequences are illustrated as output.

vulnerable with many false positives identifications and sensitivities [26].

3 Proposed System Architecture The structure of this work is built based on DNA local sequences (Fig. 2). The input sequences are the data set of DNA Nucleotides of human, animal, plants, insects or micro-organs. The data set of various species

Input sequence & target sequence

Quick splitting (MMSs selection)

Sequence alignment by shifting (anchor selection)

Memory mapping

Anchor seed (output)

Fig. 2

The complete blueprint of the research work where the input sequences are the DNA data set as nucleotides base bases of A,G,C and T.

Start

DNA←AAA...TTTT last←max_length[DNA] left←DNA[0] right←DNA[last] mid←int[(left+right)/2] length←25 n←target_position

Y

N

left←mid right←right mid←int[(left+right)/2]

left←left right←mid mid←int[(left+right)/2]

total_distance (|left-mid|>length)

n≤mid

N

Y

total_distance (|right-mid|>length)

N

Y

Target_MMSS←MMSS with desired length

Target_MMSS←MMSS with desired length

End

End

Fig. 3

Complete flowchart for MMSS selection. MMSS selection is an iterative division process.

4

Interdiscip Sci Comput Life Sci (2015) 7: 1–10 Input sequence A G C C G A T T C G A G G A A A G G C C A A C C n

sio

Ri

ur

1st iteration

eft

c re

gh

tr

ec

L

sio

n

A G C C G A T T C G A G

G A A A G G C C A A C C

ion ecu rs

ion ecu rs

tr

ion

ion

Lef

tr

s cur t re

s cur t re

Lef

h Rig

h Rig

2nd iteration

ur

A G C C G A

T T C G A G

G A A A G G

C C A A C C

MMSS 1

MMSS 2

MMSS 3

MMSS N

nth iteration

Fig. 4

Graphical presentation of MMSS selections. Input DNA local sequence iteratively divided until to find the desirable MMSS.

SI

T

A

G

T

T

G

A G

ST

A

G

T

A

T

A

G T

(a) Matching barrier

T

A

G

T

G

A

G

A

G

T

T

A

G

T

(b) Sub sequence 1

Sub sequence 2

T

A

G

T





T

A

G

T

G

A

G





G

A

G



A

G

T

T

A

G

T

T





A

G

T

A

G

T



Left shifting

Right shifting

Left shifting

Right shifting

(c)

Fig. 5

Sequence Alignment Shifting for finding target anchor. (a) A matching latter considering barrier (b) two independent subsequence (c) Shifting operation.

The base pairs enable the long input sequence. Maximum Matching Sub Sequences is an important part

of the DNA alignment. Quick Splitting helps to find out the exact MMSS. The Anchor of the MMSS is the

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

specific part of the DNA segments which will be the common part of the data set.SAS operation finds out the repeated patterns. Repeated patterns are considered as anchor and placed in PDA for memory mapping. The fourth step is Memory Mapping, which will be executed by Pushdown Automata. Finally the desired seed is depicted. 3.1

Quick Splitting

Here we have proposed an algorithm based on Quick short for defining Maximum Matching Sub-Sequences (MMSS). Quick Splitting (QS) is efficient because it reduce the size in every subdivision in such a way that it is continuously subdividing data set recursively (Fig. 4). Consequently the time complexity of QS is O(nlogn) on an average. The algorithm shows that it takes the data set of DNA sequences and performs sub-division on that nucleotide data set. The process starts as last, right and middle position of the input data and these positions changing is a continuous process until it finds the desired MMSS. QS technique subdivided a single DNA data set into multiple subdivided portion and every subdivided consider as a single MMSS. The algorithm is as follows: Algorithm 1: Quick Splitting (DNA,left,mid,right,length,s,Target MMSS) 1: DNA←AAA. . . TTTT 2: last←max length(DNA) 3: left←DNA[0] 4: right←DNA[last] 5: mid←INT[(left+right)/2] 6. n←target position 6: while (total distance (|left-mid|or |right-mid| >length)) 7: If left recursion is chosen (nmid) 14:

left←mid

15:

right←right

16:

mid←INT[(left+right)/2]

17:

Target MMSS←MMSS with desired length

18: End if 19: End loop

Due to the recursive subdivision to reduce data size we have imposed while loop, however indexing is used for direct search. Indexing reduced computational time but not perform recursive operations. Here while looping at step 6, it ensures the repetition of the process. Flow chart (Fig. 3) shows the complete illustration of the MMSS finding process. Left and right recursion

5

executes on target position. Target position is a pivot point to divide a sequence into more MMSS. If the target position is less than the mid-position of sequence then left portion is chosen. Chosen left portion is spilt into more substructures and each substructure contains the desired length. If target position is greater than mid position then right recursion is chosen. These substructures reduce the operation complexity and anchor lists are easily selected. Recursion method can be defined as:  Target position ≤ mid; left recursion Target position > mid; right recursion Here, at the beginning of the process, all variables are being initialized. Based on the data value, left and right segments of the DNA sequences will be selected. Based on the algorithm Quick Splitting algorithm, again initialize the segmented part of the sequences. Here left part will be the starting point of the segments right and parameter will be the middle value of the segments. The process is the same for both parts of the sequences. The constant process based on the algorithm will move until the Maximum Matching Sub Subsequences is found (Fig. 4). The given input sets are first sub-divided into two sections: the left section and right section. As the process is demonstrated flow chart, the left and right recursion is selected. For the same reasons the subdivided section is also portioned more based on the conditions at algorithm. The iterative divisions determine the ultimate MMSS. 3.2

Sequence Alignment by Shifting

Sequence Alignment by Shifting (SAS) is used for finding anchor.SAS starts at the beginning of the sequence and searches the matching nucleotide portion from both input sequence (SI ) and target sequence (ST ). Matching nucleotide is considered as a matching barrier. The matching position between SI and ST creates two splits among the sequences. Every spilt sequence is considered as a subsequence and these are independent. SAS starts from the first position and continue searching of nucleotide till the end. If there found a matching position between SI and ST then SAS splits the sequence into two sequences and align the sequence by using shifting operation. In shifting operation, SI and ST both are shifted from right and left side (Fig. 4). Find a matching latter at position 5 (Fig. 5(a)) and considered as barrier. The sequences SI and ST are split into two subsequences. These two subsequences are independent (Fig. 5(b)). Now the subsequences are shifted from both left and right. There found three matching nucleotide in left shifting operation and in right shifting have no matching nucleotide (Fig. 5(c)). So matching nucleotide of a subsequence is considered as an anchor. In our experiment,

6

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

if the length of anchor is less than 1 then the matching could not be considered as an anchor. We consider the matching unit length minimum 2 or more. If the length of any subsequence matching unit is less than 2, then these subsequences are abandoned from MMSS. 3.3

Memory Mapping for Anchor Selection

The complete view of the memory saving process is illustrated in Fig. 6 below. Here we see that there are

three key units that manage memory efficiently. These units are Control Unit, Matching Unit and Count Unit. Along with these units two tables are available. Stack Table and Input data Table. Stack Table contains the whole data set of proteins or DNA or RNA. Control unit restricts the matching with desired input from whole data set. When matching or mismatching occurs it frees the stack. At the end of the input table, there is an end symbol that determines the last terminal of the input data position. It refers that a matching is occurred.

Control Input tape (anchor) unit T

Fig. 6

T T C

Matching anchor

Count uint

A G

A

C

Matching uint

$

Stack tape (MMSs)

G A

A G

Position of matching seed

The Memory Mapping for Anchor Selection using PDA. Nucleotide which consists in stack tape are popped after matching unit find matching with anchor.

While a complete Anchor is matched from given data set then Count Unit counts the total value of these matches and returns the matching seeds. When a mismatch is occurred Control Unit pops the data set from stack but does not count the anchor found. The overall view of these processes is mathematically narrated at algorithm 2 below. Rudiments of Memory Mapping Method: 1. Control Unit: Control unit is used to match with desired input sequence (input tape) from Data set which is enclosed in Stack tape. After completing matching process data are popped from stack tape. Input head and stack head positions are controlled by control unit. 2. Input Tape: In Input tape consists of Anchor which is matched with data set. Input tape is a dynamic length which varies corresponding Anchor length. In Input tape, “null symbol ()” indicates the end of Anchor. 3. Stack Tape: Stack tape holds data set (local gene sequence) with 500 base pair (bp). Every nucleotide base pair is popped after data set are accepted by matching unit. In stack tape, stack symbol ($)” indicates the end of data set. 4. Matching Unit: Matching unit is a significant

component in memory mapping. Every nucleotide in Anchor and local sequence is processed in matching unit. If both nucleotide bases are matched and no base pair is popped from stack then stack head go ahead. Memory unit gives the facilities to release the memory from stack tape. But Input tape head goes next position when matching unit finds match among the stack and input tape. Input tape goes to the end position with stack tape data set, and then increases the value of the count unit. 5. Count Unit: Control unit increases the value itself when matching unit sends a respond to count unit. In initial stage, Count Unit assigns value 0 and increases the value when matching unit sends response. When Input tape goes to null then resign the count unit value as 0. Finally Count Unit notifies numbers of Anchor seeds present in local gene sequence. When first nucleotide of anchor will match with stack tape (MMSS) and match the complete anchor then matching unit returns first nucleotide the addresses of each and every match position of Stack tape (MMSS). Finally count unit results position of all matched anchors along with total number of reputed anchor. So it is possible to count the position of anchors. MMSS and Anchor are placed into stack tape and input tape

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

7

Input sequence A

G

C

T

G

A

T

T

C

... ... ...

A

G

G

C

C

A

A

Target sequence C

G

C

T

G

A

G

G



... ... ...

T

T

G

C

C

G

G

Fig. 7

Alignment process of input sequence with target sequence.

respectively. After completing memory mapping Anchor position finds out as output. Anchor position of input sequence then map with target position of target sequence. Input sequence aligns with target sequence from first Anchor seed to last Anchor position. Algorithm-2 proposed the memory efficient for finding anchor position from a local DNA sequence. Anchor in input sequence aligns to the anchor of target sequence. This algorithm can handle more than 500 base pair of gene sequence

Data Bank of Japan (DDBJ). The accession number of the sequences which are used for our evaluation (Table 1). Table 1

Accession number of different length of DNA sequences

Length 50

Length 100

Length 200

GX716623

GX687420

GX689434

GX390496

GX680445

GX536532

GX390497

GX871623

GX650226

Algorithm 2:

GX713601

GX672374

GU811161

Memory Mapping (θ, Γ, ϕ, max length)

GX715614

GX285798

GX405587

1: Initialize the input symbol (anchor) in input tape (Θ) θ ←Anchor 2: Initialize MMSS in stack tape (Γ) Γ ←MMSS 3: Initialize count unit (ϕ) ϕ ←0 4: i ←0 and j ←0 5: while (Γ[j]! = $)

//loop until end of stack.

6: If (θ[i] == τ [j]) 7:

i ← i + 1 and j ← j + 1 // increment of input tape and stack tape.

8:

if (θ[i] ==  θ[i] ← θ[1]

10: 11:

// indicates last position on anchor.

ϕ ←ϕ+1

9:

//start first position of anchor.

else j ← j + 1

Table 2

12: End if 13:

else go to step 6.

14:

if (τ [j] == $)

//indicates end of the stack tape.

print ϕ.

15:

We compare the memory requirement between Needleman-Wunsch algorithm [15], FASTA [22] and MSuPDA which is shown in Table 2. In NeedlemanWunsch Algorithm [15], suppose m, n are the lengths of the query sequence and the database sequence respectively. As the Needleman-Wunsch Algorithm complexity is O(m × n). So its memory requirement is (m×n) In FASTA algorithm [21] memory complexity is O(w×(m+n)).w denoted the width of the restricted area in alignment. To apply the dynamic programming on the restricted area, we need (w×(m+n)) cells. Therefore, the memory requirements of the FASTA Algorithm is (m + n + (w × (m + n))).

16:

End if

17:

else go to step 6.

Required memory allocation for stack operations

Operation

Total

Require space

Top

1

Sizeof StackElement

Push

N

n*Sizeof StackElement

POP

N

n*Sizeof StackElement

18: End if 19: End loop

4 Complexity Analysis and Experiment Result 4.1

Complexity Analysis

For evaluation of our algorithm, we conducted different length of sequences in human division from DNA

Total Space=(2n+1)* Sizeof StackElement Complexity of MSuPDA is O(n).

In our algorithm Maximum Matching Subsequences (MMSS) is pushed in stack and pop from stack after matching. Push down Automata performs based on stack operation. In stack operation Insertion and deletion follow the Last-In-First-out schemes. In stack, following main operations are performed. 1. Push: Inserted an element in the top position in

8

Interdiscip Sci Comput Life Sci (2015) 7: 1–10

the stack. A simple pseudo code is shown for push operation. If N numbers of element are pushed in stack then it needs N memory cell.In push operation complexity is n. Algorithm push(o) if n = S.length-1 then

Table 3

Algorithm

Comparison of memory complexity and memory requirement between Needleman-Wunsch, FASTA and MSuPDA Complexity

Memory requirement

Needleman-Wunsch O(m× n)

(m× n)

else

FASTA

O(w×(m+n))

(m + n + (w × (m + n)))

n←n+1

MSuPDA

O(m+n)

(m+n)

RSAM[12]

O(w× (m×n)) (m×n+(b×(m×n)))

throw FullStackException

S[n] ← o Complexity O(n)

2. POP: Objects removed from top position and returned the last inserted element. By the POP operation, objects are released and freed memories which were assigned previously. If stack holds n elements then n numbers of memory cells are free by POP. Algorithm pop() if isEmpty() then throw EmptyStackException else t ←t−1 return S[t + 1] Complexity O(n)

3. Top: Top returns the elements from the top the stack but doesn’t remove the top element. But when stack is empty then an exception occurs. Therefore, top returns always top element so complexity is O(1). Algorithm top() if isEmpty() then throw EmptyStackException return S[t ] Complexity O(1)

MSuPDA performed based on stack operation (Push, POP and Top).MMSS are pushed as an element in the stack. After matching, base pairs are popped from stack. Mainly Space Complexity in stack operation is based on push operation. Total length of nucleotide is occupying same number of memory cell. If subsequences are stored in all possible offset [n], we need to 2n+1 memory cell. MSuPDA follows the recursive technique that’s why for our algorithm memory requirements are (2n+1). Therefore, our algorithm’s memory complexity is O (n). Assigned memory cell is released after PDA operation. In memory requirement, our proposed algorithm is efficient than NeedlemanWunsch and FASTA. Here m is input sequence; n is target sequence; w is restricted area and b is bounding box.

4.2

Comparing MSuPDA with other sequence alignment

For memory comparison our machine was 145MB memory and Dual core 1.6 GHz. We compare between our memory and non memory efficient algorithm (FASTA and Needlman-Wunch). For evaluations we used different length sequences of HUMAN division downloaded from the well known DNA sequences of the database (Table 1). For example, considering m and n are equal 500 and restricted area w=10, then complexity of Needleman-Wunsch algorithm O (250000) and Fasta algorithm (10000). As both sequence have the same length then our MSuPDA algorithm will be O (500). For pair wise local sequence alignment we applied our algorithm with 3683 nucleotides. For the large scale alignments, we used 145 M memory limit for total memory. For optimization computation, we used N of computation stages occupy with M number of memory stages. If N≤M, there is available space, If M=1 indicates only single number of stage can compute. When computation stage computed then assign memory stages are released. After T second, if n number of stages are computed then n number of stages will be reassessed and number of unreeled memory stage is M-(N-n). We restricted ourselves, M=332 stages memory for N=3683 stages. For first level, N=3683 stages and M=332, total number of memory use M/N=332/3683=9%. MSuPDA more accurate than RSAM [12] and Newberg [16] proposed algorithm. In most local sequence alignment, Fasta algorithm [20] is second best algorithm and Needleman-Wunsch algorithm [15] slightly outperforming. We use a metric for sequence alignment quality by testing Alignment Error Rate (AER). AER requires an annotated set of “Sure” alignment and “Possible” alignment. “Sure” alignment is used for recall and “Possible” use for precession. Given a hypothesis alignment consists of base pair set A as input sequence alignment and R is consist target base pair alignment, three mea-

Interdiscip Sci Comput Life Sci (2015) 7: 1–10 Table 4

9

Comparison between different types of alignment algorithms (memory usage) Usage of Memory (%)

MSuPDA Level 1

RSAM [12]

Fasta [22]

Memory

Computation

Memory

Computation

Memory

Computation

Memory

Computation

Stage (M)

Stage (N)

Stage (M)

Stage (N)

Stage (M)

Stage (N)

Stage (M)

Stage (N)

332

3683

486

3683

389

3683

500

3683

332/3683=9% Level 2

Newberg [17]

486/3683=12.7%

389/3683=10.56%

Memory

Computation

Memory

Computation

Memory

Computation

Stage (M)

Stage (N)

Stage (M)

Stage (N)

Stage (M)

Stage (N)

132

10000

138

10000

135

10000

132/10000=1.32%

138/10000=1.4%

135/10000=1.35%

50

sures as defined:

45

|A ∩ R| Precession(A, R) = |A| |A ∩ R| Recall(A, R) = |R| AER(A, P, S)

40 35 AER

30

2 × P recession(|A ∩ R|) × Recall(|A ∩ R|) |precession(|A ∩ R|) + Recall(|A ∩ R|) 2 × (|A ∩ R|) =1− |A| + |R|

25 20

=1−

MSuPDA

15

Newbreg

10

FASTA 5

RSAM 0 108 124 140 156 172 188 204 220 236 252 284 Base pair (in thousand)

If hypothesis alignment, |A| = 350 and |R| = 250, In first case, |A ∪ R| = 175, Precession = 0.50, Recall = 0.70 and

3683/(500+3683)=88%

Fig. 8

AER = 1 − [2 ∗ 175/(350 + 250)] = 0.42 = 42%

Graph showing AER metric analysis. Alignment Error rate of MSuPDA is lower than other algorithms.

35

5 Conclusion In summary, here we have proposed a memory efficient DNA alignment algorithm, called MSuPDA that can handle the data set dynamically. The algorithm estimates the pair wise matching, mismatching and inserts or deletes using alignment by scanning process.

30 CPU excuation time/ns

Thereby, a recall error only occurs when hypothesis (target) alignment is not found and precession occurs when hypothesis alignment is found in input sequences. In our graph, we will present AER and base pair (in thousands) for accuracy measure. We also compare computational time (speed) between our algorithm, RSAM [12] and FASTA [22] for finding matching anchor. We use one hundred thousand to four thundered thousand sequence length with exons and introns in the sequence. We checked the sensitivity of the algorithm for the randomly generated sequences and examined the percentage of base pair coverage in the alignment with respect to RSAM [12] and FASTA [22]. In increase anchor our algorithm makes better speed up than other two algorithms.

25 20 15 MSuPDA

10

FASTA 5 0 108

Fig. 9

RSAM 140

172 204 236 284 348 Sequence length×103/bp

404

468

Speed comparison between our algorithm, RSAM and FASTA.

Pushdown Automata is a robust tool that continuously vacant the memory storage, especially the proposed algorithm truly stands out when the data set of DNA segments are increased. In fact, our simulation results indicate that MSuPDA delivers consistently high performance whatever the length of data set is. We also

10

have compared the results with BLASTZ and noticed that our algorithm outperforms BLASTZ in all respects of the data set. For any given length, MSuPDA requires very less time than that of BLASTZ. These results reflect the effectiveness of our system that can dynamically handle the data set from memory. MSuPDA is also highly usable and scalable and it can simply align any DNA networks with billions of base pair within few nanoseconds on a personal computer. In future we will implement this concept for real word data set with DNA and Protein sequences and alignment.

References [1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. 1990. Basic local alignment search tool, J. Mol. Biol. 215 403-410. [2] Arratia, R, Morris, P., Waterman, M.S. 1988. Stochastic scrabbles: a law of large numbers for sequence matching with scores, J. Appl. Probab. 25, 106-119. [3] Batzoglou, S., Pachter, L., Mesirov, J.P., Berger, B., Lander, E.S. 2000. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res. 10, 950-958. [4] Claverie, JM, Poirot, O., Lopez, F. 1997. The difficulty of identifying genes in anonymous vertebrate sequences. Comput.Chem.21, 203-214. [5] Dembo, A., Karlin, S.1991. Strong limit theorems of empirical functional for large exceedances of partial sums of id variables. Ann. Probab. 19, 1737-1755. [6] Do, C.B., Mahabhashyam, M.S., Brudno, M., Batzoglou, S.. 2005. ProbCons: Probabilistic consistencybased multiple sequence alignment. Genome research 15.2: 330-340. [7] Ewing, B., Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res.8, 186-194. [8] Fleischmann, R.D, Adams, M, D, White, O, Clayton, R.A, Kirkness, E.F, Kerlavage, A.R. 1995. Whole-genome random sequencing and assembly of Haemophilus influenza. Rd. Science. 269, 496-512. [9] Karlin, S., Altschu, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA.87, 2264-2268. [10] Karlin, S., Altschu, S.F. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences.Proc. Natl. Acad. Sci. USA.90, 5873-5877. [11] Kent, W.J., Sugnet, C., Furey, T., Roskin, K., Pringle, T., Zahler, A., Haussler, D. 2002. The human genome browser at UCSC.Genome Res.12, 996-1006. [12] Khan, M.I., Kamal, M.S. 2013. RSAM: An Integrated Algorithm for Local Sequence Alignment, Archives Des Sciences, 5, 395-412.

Interdiscip Sci Comput Life Sci (2015) 7: 1–10 [13] Lipman, D.J., Pearson, W.R. 1985. Rapid and sensitive protein similarity searches.Science, 227, 14351441. [14] Ning, Z., Cox, A.J., Mullikin, J.C. 2001. A fast search method for large DNA databases.Genome Res. 11, 1725-1729. [15] Needleman, S.B., Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins.J. Mol. Biol. 48, 443-453. [16] Notredame, C., Higgins, D.G. Heringa, J. 2000. TCoffee: A novel method for fast and accurate multiple sequence alignment. Journal of molecular biology 302.1: 205-217. [17] Newberg, L.A. 2008. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics 24.16: 1772-1778. [18] Pagani, I., Konstantinos, L., Jansson. J., Chen, A., Smirnova, T., Bahador. N. 2012. The Genomes OnLine Database (GOLD) v.4: status of genomic and meta genomic projects and their associated metadata. Nucleic Acids Res, 40, 571-579. [19] Lipman, D.J., Pearson, W.R. 1988. Improved tools for biological sequence comparison.Proc. Natl Acad. Sci. USA. 85, 2444-2448. [20] Pati, A., Ivanova, N.N., Mikhailova, N., Ovchinnikova, G., Hooper, S.D., Lykidis. A. 2010. GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat. Methods. 7, 455-457. [21] Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., Nickerson, D.A. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet. 38, 375-38. [22] Smith, T.F, Waterman, M.S. 1981. Comparison of biosequences. Adv. Appl. Math.2, 482-489. [23] Schwarz, D.S., Hutvagner, G., Du, T., Xu, Z., Aronin, N., Zamore, P.D. 2003. Asymmetry in the assembly of the RNAi enzyme complex, Cell. 115, 199-208 [24] Watanabe, T., Takeda, A., Mise, K., Okuno, T., Suzuki, T., Minami, N., Imai, H. 2005. Stage-specific expression of microRNAs during Xenopus, development. FEBS Lett. 579: 318. [25] Weckx, S., Favero, J., Rademakers, R., Claes, L., Cruts, M., De, J.P., Van, B.C., De, R.P. 2005. A novel computational tool for sequence variation discovery. Genome Res. 15, 436-442. [26] Yok, N.G. Rosen, G.L. 2011. Combining gene prediction methods to improve meta genomic gene annotation. BMC Bioinformatics. 12, 20.

MSuPDA: A memory efficient algorithm for sequence alignment.

Space complexity is a million dollar question in DNA sequence alignments. In this regards, MSuPDA (Memory Saving under Pushdown Automata) can help to ...
655KB Sizes 6 Downloads 4 Views