Journal of Bioinformatics and Computational Biology Vol. 12, No. 4 (2014) 1450020 (16 pages) # .c Imperial College Press DOI: 10.1142/S0219720014500206

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

An improved preprocessing algorithm for haplotype inference by pure parsimony

Mun-Ho Choi*,‡, Seung-Ho Kang†,§ and Hyeong-Seok Lim*,¶ *School

of Electronics & Computer Engineering Chonnam National University 77 Yongbong-ro, Buk-gu, Gwangju 500-757, Korea †

Department of Information Security, Dongshin University Daeho-dong, Naju-si, Jeollanam-do 520-714, Korea ‡ [email protected] § [email protected][email protected] Received 22 April 2013 Revised 4 July 2014 Accepted 5 July 2014 Published 1 August 2014

The identi¯cation of haplotypes, which encode SNPs in a single chromosome, makes it possible to perform a haplotype-based association test with disease. Given a set of genotypes from a population, the process of recovering the haplotypes, which explain the genotypes, is called haplotype inference (HI). We propose an improved preprocessing method for solving the haplotype inference by pure parsimony (HIPP), which excludes a large amount of redundant haplotypes by detecting some groups of haplotypes that are dispensable for optimal solutions. The method uses only inclusion relations between groups of haplotypes but dramatically reduces the number of candidate haplotypes; therefore, it causes the computational time and memory reduction of real HIPP solvers. The proposed method can be easily coupled with a wide range of optimization methods which consider a set of candidate haplotypes explicitly. For the simulated and well-known benchmark datasets, the experimental results show that our method coupled with a classical exact HIPP solver run much faster than the state-of-the-art solver and can solve a large number of instances that were so far una®ordable in a reasonable time. Keywords: Haplotype inference; SNP; preprocessing; pure parsimony.

1. Introduction A haplotype in genetics is a combination of alleles at adjacent locations on the chromosome that are transmitted together.1 The most common type of genetic variation is the single nucleotide polymorphism (SNP), a point mutation found in the population, most often with only two possible values, and tracking their inheritance. ¶ Corresponding

author. 1450020-1

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

So, in a second meaning, a haplotype is a set of SNPs on a single chromosome that are statistically associated.1 The genotype is the con°ated data of the paternal and maternal haplotypes of a homologous haplotype pair in an individual. When assessing the genetic contribution to a trait or disease, it may be much more informative to have haplotype data than to have only genotype data.2 However, due to the technology limitations, haplotype information is more di±cult to obtain than genotype data.3 Hence, computational methods have been designed to obtain haplotype data from genotype data. The problem of obtaining the haplotypes from genotypes of some individuals is known as haplotype inference (HI). A variety of computational methods have been developed to solve the problem (see review Refs. 4–9). Some of these methods provide very accurate results in some circumstances, however, research on HI continues because no single method is considered fully adequate in all applications.5 This paper focuses on one of the most popular approaches to the HI problem which is known as haplotype inference by pure parsimony (HIPP).10 Its goal is to ¯nd the minimum set of haplotypes that explains a given set of genotypes. The motivation for the pure parsimony principle is that the number of haplotypes existing in a large population is actually very small whereas genotypes derived from these limited number of haplotypes behave a great diversity.11 Since the problem is computationally di±cult,12 there have been signi¯cant e®orts toward producing e±cient solvers, using a distinct type of methodologies such as integer linear programming, branch and bound, Boolean satis¯ability, answer set programming, pseudo-Boolean optimization, and some heuristics (see review Refs. 7–14). The survey papers indicate that the methods based on Boolean satis¯ability are significantly more e±cient.7,9 Many of the previously cited methods carry out searches over the spaces of possible pairs of haplotypes for given genotypes. Unfortunately, the number of possible pairs for a genotype grows exponentially with respect to the number of heterozygous sites that the genotype has. Therefore, before running the problem solving algorithm, it is necessary to reduce the number of candidate pairs of haplotypes by excluding pairs that are not likely included in the optimal solution. Recently, a set-based preprocessing method which signi¯cantly reduces the search space of candidate haplotypes was proposed by Irurozki et al.15 The preprocessing method deals with groups of haplotypes rather than individual haplotypes, and iteratively removes groups of haplotypes that are not helpful in searching the optimal solution. We propose an improved method of the set-based preprocessing method. Our method utilizes the fact that the redundancy of groups of haplotypes directly depends on the inclusion relations of the groups themselves. For the moderate size datasets, our preprocessor runs much faster than the set-based preprocessor. Moreover, the experimental results indicate that our method coupled with a classical HIPP solver outperform RPoly, a state of the art solver for HIPP. 1450020-2

An improved preprocessing algorithm

2. Materials and Methods

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

2.1. Basic notations and problem de¯nition Assuming an in¯nite sites model,4 there are only two alleles at a SNP site, and in computational approaches haplotypes are represented as binary vectors of f0; 1g, where an allele of wild type is encoded as a value of 0 and an allele of mutant type is encoded as a value of 1. Likewise, genotypes are represented ternary vectors of f0; 1; 2g, where the value of 0 (respectively value of 1) is used to denote a homozygous site of wild type (respectively a homozygous site of mutant type), and the value of 2 is used to denote a heterozygous site. For the given set G of genotypes, the ith genotype in G is represented by G½i and each site in a genotype g 2 G is represented by gj . The kth haplotype in a set H of haplotypes are represented by H½k and jth site in a haplotype h 2 H is represented by hj , as expected. For the convenience of notation, a vector ha1 ; a2 ; . . . ; am i will be denoted shortly as a1 a2    am . A genotype g is explained or resolved by a non-ordered pair of haplotypes (h; h 0 ), which is represented by g ¼ h  h 0 ; if gj ¼ 0 then hj ¼ h 0j ¼ 0, if gj ¼ 1 then hj ¼ h 0j ¼ 1, and if gj ¼ 2 then hj ¼ 0 and h 0j ¼ 1 or hj ¼ 1 and h 0j ¼ 0. If g ¼ h  h 0 , then the haplotypes h and h 0 are declared conjugates with respect to g. Given a set G of genotypes, the HIPP problem consists in ¯nding a minimumcardinality set H of haplotypes that explain all genotypes in G. Problem (Preprocessing for HIPP). For the given set of genotypes, generate a reduced space of candidate haplotype ensuring that the optimal resolving pairs of haplotypes will be generated from it. 2.2. A condensed representation of group of haplotypes In order to deal with a large number of haplotypes, the concept of grouping haplotypes in ternary vectors has recently appeared in the literature.15 To increase the computing speed of operations on groups of haplotypes, we develop a modi¯ed vector representation, called condensed base, over  ¼ f1; 2; 3g. We use the same notation for genotypes and haplotypes, in which the value of 1 represents a wild type allele or a homozygous site of wild type, the value of 2 represents a mutant type allele or a homozygous site of mutant type, and the value of 3, for genotypes, represents a heterozygous site. For a group of haplotypes, every site with a value of 1 (respectively value of 2) means that any haplotypes in the group should have the same value at that site, and for each site with a value of 3, called ambiguous site, every combination with regard to the values of 1 and 2 should be included. For example, for a group of haplotypes denoted by the condensed base 1323, the haplotypes belonging to it are 1121, 1122, 1221, and 1222. Thus, the number of haplotypes included in a base b is jbj ¼ 2 l , where l is the number of ambiguous sites in b. In the proposed representation, a genotype can be considered as a group of haplotypes which can explain the genotype, and a haplotype is a special case of a base whose cardinality is 1, i.e. a base 1450020-3

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

without ambiguous sites. By using this representation, we can deal with a group of haplotypes as one vector and avoid enumerating every haplotype in it. Since bases represent sets of haplotypes, we can de¯ne inclusion, intersection, and complementary relations between bases. Given two bases a and b of same size m, we say that a is a subset, or sub-base, of b if all the haplotypes belonging to a also belong to b. We can easily obtain their inclusion relation as follows:  bj 6¼ 3 ) aj ¼ bj a  b () 81jm ð1Þ b j ¼ 3 ) aj  b j : Given two bases a and b of same size m, the haplotypes in the intersection of them are those that belong to both bases. We can, by representing each digit in alphabet  as two binary digits, easily obtain their intersection as follows: a \ b  haj & bj ; 1  j  mi;

ð2Þ

where & denotes bitwise-and. If there exists a j such that aj & bj ¼ 0, then a \ b ¼ . For example, let a ¼ 1323 representing f1121; 1122; 1221; 1222g and b ¼ 1332 representing f1112; 1122; 1212; 1222g, then a \ b ¼ 1322 which represents f1122; 1222g. Much like haplotypes, bases can be paired in order to resolve genotypes. Two bases b and b 0 are complementary with respect to a genotype g if and only if there exists a bijection between b and b 0 such that, for each haplotype h 2 b there exists a haplotype h 0 2 b 0 such that g ¼ h  h 0 and vice versa. We also g ¼ b  b 0 to denote the complementary relation of bases. Note that if bases b and b 0 are complementary for a genotype g, then both b and b 0 are sub-bases of g. Furthermore, given the genotype g of size m and base b which is the sub-base of g, the complement of b with respect to g, denoted as bg , can be easily built as follows:   bj if bj ¼ gj bg  ; 1jm ; ð3Þ gj ^ bj otherwise where ^ denotes bitwise-xor. For example, given genotype g ¼ 13332 and base b ¼ 12312, the complementary base of b with respect to g is bg ¼ 11322. The complementary relation has the following property: Given a genotype g and two bases a and b, a  b  g () a g  bg  g:

ð4Þ

2.3. Compatibilities and cover relations Concepts that have been widely used in the HI literature are the ones called compatibility8 and cover.12 These terms associate haplotypes to the set of genotypes, and can be de¯ned over the condensed bases in a similar way. A haplotype h is compatible with a genotype g if h can participate in any resolving pair of g. A base b is compatible with a genotype g if all haplotypes belonging to b are compatible with g. Two genotypes g and g 0 are compatible if they share at least one 1450020-4

An improved preprocessing algorithm

compatible haplotype. In the condensed base notation, a haplotype h (respectively a base b) is compatible with a genotype g if and only if h  g (respectively b  g). Given a set G of genotypes, the cover of a haplotype h, denoted as cover(h), is a subset fg 2 G j h is compatible to gg. The cover of a base b is de¯ned as follows: coverðbÞ ¼

\

coverðhÞ:

ð5Þ

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

h2b

From the de¯nition of cover, we can ¯nd the following properties: Given a set G of genotypes and two bases a and b, a  b ) coverðaÞ  coverðbÞ;

ð6Þ

coverða \ bÞ  coverðaÞ [ coverðbÞ:

ð7Þ

2.4. The proposed preprocessing method Consider a set G of genotypes and a set B of bases. For every b 2 B, if there exist a bg with respect to every g in cover(b), then B is called closed over G under the complementary operation. If every genotype in G has at least one compatible base in B and B is closed over G under the complementary operation, then, for every g in G, we can make a pair (b; bg ) explaining g. In such a case, a base b 2 B is redundant if there is a base b 0 2 B such that b 0  b. Since coverðb 0 Þ  coverðbÞ and there exist bg and b0g for every g in cover(b) and cover(b 0 ) respectively [Eqs. (4) and (6)], ðb; bg Þ can be replaced with ðb 0 ; b0g Þ. If two or more bases share one sub-base, then all of them can be replaced by that sub-base [Eq. (7)]. This fact shows that the redundancy of bases is not relevant to the given genotypes; it directly depends on the inclusion relations of bases themselves. Algorithm 1 presents a high-level pseudo-code of the proposed preprocessing method, RCH, which consists of three steps: initialization, expansion, and elimination steps. From the given set G of genotypes, the initialization step constructs the initial set B of bases by calculating the intersection of every pair of genotypes in G. The expansion step iteratively generates all complements of the given bases and all intersections of any combinations of given bases and newly generated bases. From set B 0 that is generated in expansion step, the elimination step removes all, so called, superior bases from B 0 which have sub-bases in B 0 . Example. Refer Fig. 1 for an example. Consider the set of genotypes G ¼ f332233; 232323; 313323; 322122; 231323g. We can obtain the initial bases B ¼ f232223; 312223; 212323; 222122; 211323g by pairwisely intersecting the bases in G. In the expansion step, from the given initial bases, the new complements (gray box) 212221, 122122, 132213, 232123, 222323, 111323, 112323, 221323, 322213, 311123, 122212, and 111122 are collected. And then the new intersections (white box) 212223, 122213, 22223, 212123, 222123, 112223, 111123, and 211123 are collected. After then, the new complements (white dotted box) 221223, 112213, 222213, and 111223 are collected. In the elimination step, as depicted in Fig. 1, 16 bases are 1450020-5

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

1450020-6

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

An improved preprocessing algorithm

Fig. 1. An example of how the algorithm RCH reduces the space of candidate haplotypes. Inclusion relations of bases are superposed as dashed arrows. The bases removed in the elimination step are marked by symbol .

removed. The ¯nal result is B 00 ¼ f111223; 221223; 112213; 112223; 212221; 122212; 222213; 222223; 111122; 211123; 212123; 122122; 222122g. From the processed result, we can infer the resolution pairs as follows: G½1 ¼ G½2 ¼ G½3 ¼ G½4 ¼ G½5 ¼

B 00 ½3  B 00 ½8 or B 00 ½4  B 00 ½7 or B 00 ½5  B 00 ½6 B 00 ½5  B 00 ½13 or B 00 ½8  B 00 ½11 B 00 ½1  B 00 ½11 or B 00 ½4  B 00 ½10 or B 00 ½5  B 00 ½9 B 00 ½12  B 00 ½13 B 00 ½2  B 00 ½10

Given a set G of genotypes, Algorithm 1 generates the set of bases B 0 containing all bases two or more genotypes can share and all complementary bases. Then, all redundant bases are removed from B 0 . The result is closed over G under the complementary operation, because if base b has a sub-base b 0 and is removed from B 0 , then complementary base bg with respect to g 2 coverðbÞ has at least one sub-base b0g [Eq. (4)], and consequently is removed from B 0 . Notice that, in the proposed algorithm, the information of given genotypes is only used in generating initial bases and complementary bases. The set-based preprocessing method of Ref. 15 iterates the four phases: (a) construct the pair graph which represents the complementary relations among the bases obtained so far, (b) choose the subgroup of the bases which is the connected component of the set of maximal bases in the pair graph, (c) enumerate all intersections 1450020-7

M.-H. Choi, S.-H. Kang & H.-S. Lim

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

between the chosen subgroup and maximal bases, and all complementary bases of them, (d) remove the dispensable bases in the chosen subgroup. Compared to the set-based method, our method considers the whole set (instead of all the chosen subgroups) of bases that are newly generated (not generated so far) in the previous iteration. So we can remove the steps (a) and (b) of the set-based method and cut down the number of iterations. It makes a considerable computing time savings as shown in Figs. 3(c), 3(d), 4(c) and 4(d). 2.5. Implementation remarks Before preprocessing, a structural simpli¯cation8 is applied to simplify the problem instance. The technique eliminates duplicated genotypes, duplicated sites, and complemented sites. There are matters that require attention. If there is a g in the given set G of genotypes that is not compatible to any other genotypes in G, then we should add g to the set B 00 of the preprocessed result. If there are g 2 G and b 2 B 00 such that g ¼ b, then the pair (b; b) if b has no ambiguous site, otherwise, any one of (b 0 ; b0g ), where b 0 b, can be a resolution pair of g. In order to speed up the elimination step, following property is taken into account to avoid unnecessary inclusion relation tests: jbj > jb 0 j ) b ½ 6 b 0. 3. Results and Discussion 3.1. Experimental setup In order to evaluate the performance of our preprocessing algorithm, the simulated datasets and the benchmark datasets are considered (refer Fig. 2). For the simulated

(a)

(b) Fig. 2. Summary of the experimental datasets. (a) The simulated datasets generated by ms program. (b) The classical datasets having been used to benchmark HI solvers. 1450020-8

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

An improved preprocessing algorithm

datasets, ms program was used to generate samples of haplotypes.16 In this program, di®erent parameters such as the number of haplotypes generated (from 20 to 1000), the mutation rates (10 9 , 10 8:5 , or 10 8 ) and the recombination rates (no cross-over or 10 8 ) are speci¯ed in order to produce di®erent sets of haplotypes under di®erent scenarios. For each of the con¯gurations, 100 instances are generated. To generate genotype instances, we have chosen the haplotype pairs uniformly from generated haplotypes. The optimization method which is combined with our algorithm in the experiments is Gus¯eld's method,10 called RTIP, which is the most widely referenced method for the HIPP problem. Gurobi17 is used as a linear programming solver to solve equations of RTIP. To link preprocessed result to RTIP, Boolean variables are associated with each bases of the preprocessed result B and other Boolean variables are associated with each resolution pairs (b; b 0 ), where b, b 0 2 B, capable of explaining a genotype of the given set of genotypes. The proposed preprocessing algorithm was implemented in Java. All experiments were run on an Intel Core i7-3770 (3.4 GHz, 8 GB Memory). 3.2. Experimental results A set-based preprocessor, which is similar to our algorithm, has recently been developed by Irurozki et al. Our method generates the reduced space of candidate haplotypes as the same space that of the set-based preprocessor generates. To evaluate the performance of our preprocessing algorithm with respect to the exact solver, we have compared the running times of our algorithm coupled with RTIP to the running times of RPoly,18 which is known as the most e±cient exact solver for HIPP problem as far as we are aware. Figure 3 presents the experimental results of the simulated datasets without recombination. As can be observed, all the methods gave good results; only 6 instances of 4000 instances are aborted by RPoly with 1000 s time limit. Table 1 and Fig. 4 provide the experimental results of simulated datasets with recombination. The scatter plots on Figs. 4(c) and 4(d) show the preprocessing time of the set-based preprocessor and RCH. The instances that cannot be processed in the given time limit are dotted at the top line. For the instances preprocessed in between 2 and 1000 s, RCH runs 71 times faster than the set-based preprocessor on average. RCH processes 3972 out of 4000 instances within 1000 s, which represents 99.3% instances. The set-based preprocessor can solve 88.8% instances. The scatter plots on Figs. 4(e) and 4(f) show the total execution times of RCH coupled with RTIP (called RCH þ RTIP) and RPoly. For the instances solved in between 2 and 1000 s, RCH þ RTIP run 37 times faster than RPoly on average. RCH þ RTIP solve 3972 out of 4000 instances within 1000 s, which represents 99.3% instances. RPoly can solve 84.1% instances. There is no instance that is solved by RPoly but unsolved by RCH þ RTIP. 1450020-9

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

(a)

(b)

(c)

(d)

(e)

(f)

Fig. 3. Experimental results of the simulated datasets without recombination. (a) Total numbers of SNPs in the instances. (b) Average numbers of ambiguous sites in the structural simpli¯ed instances. Preprocessing times of the set-based (c) and RCH (d). Total execution times of RCH þ RTIP (e) and RPoly (f).

The numbers of bases contained in the preprocessed results are plotted in Fig. 4(g). Because both preprocessor dramatically reduce the number of haplotype candidates, the execution time of the real optimizer such as RTIP can be very short; RTIP takes only 0.5 s on average. As can be observed from the Table 1 and Fig. 4, 1450020-10

An improved preprocessing algorithm Table 1. Experimental results of the simulated datasets with recombination. Size of datasets Small Medium Large Very large

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

Number of instances Number of genotypes of each instances Number of SNPs of genotype sequences

Number of ambiguous sites (after structural simpli¯cation) Number of preprocessed or solved instances (1000 s time limit)

Average preprocessing or execution times (second) of succeeded cases

Average speedup ratio of preprocessing (RCH to set-based, between 2 and 1000 s time limit) Average speedup ratio of total execution time (RCH þ RTIP to RPoly, between 2 and 1000 s time limit) Number of bases in the preprocessed results

18,000 100 min 4 max 224 avr 49 min 8 max 879 avr 266 set-based 1795 RCH 1800 RCH þ RTIP 1800 RPoly 1737 set-based 7.5 RCH 0.8 RCH þ RTIP 0.9 RPoly 16.7 37.8

min max avr

Average accuracy of preprocessing results

Total

900 100 12 293 84 59 1742 771 843 898 898 714 80.0 6.0 6.3 72.8 60.7

600 100 14 279 97 64 2752 1164 457 587 587 391 71.7 23.6 24.6 54.5 68.2

700 100 19 319 184 165 6182 2304 456 687 687 424 138 32.1 33.5 64.3 96.1

4000 100 4 319 150 8 6182 871 3551 3972 3972 3266 49.8 10.9 11.3 39.7 71.0

35.3

48.1

30.8

29.8

37.2

7 15,940 895 77.3%

44 54,200 3904 77.0%

57 98,050 7725 88.0%

122 180,400 10710 88.0%

7 180,400 4257 80.4%

RCH þ RTIP solve a large number of instances that were so far una®ordable in a reasonable time. In order to have practical impact in genetic studies, the preprocessing method must be accurate with respect to the reality. The accuracy of preprocessed results can be measured by the correct association between genotypes and explaining haplotypes.4 We measured the accuracy of the preprocessed results as follows: jfh j h 2 B ^ h 2 H0 gj ; jH0 j where B is the set of preprocessed result and H0 is the set of haplotypes that are used in generating input genotypes of the simulated datasets. The accuracies of preprocessed results are depicted at Fig. 4(h). The average accuracy of preprocessed results is 80.4%. The experimental results of benchmark datasets are shown in Fig. 5. Similar to the simulated datasets, our method outperforms the set-based preprocessor 1450020-11

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

(a)

(b)

(c)

(d)

(e)

(f)

Fig. 4. Experimental results of the simulated datasets with recombination. (a) Total numbers of SNPs in the instances. (b) Average numbers of ambiguous sites in the structural simpli¯ed instances. Preprocessing times of the set-based (c) and RCH (d). Total execution times of RCH þ RTIP (e) and RPoly (f). (g) The numbers of bases in preprocessed results. (h) The accuracies of preprocessed results.

1450020-12

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

An improved preprocessing algorithm

(g)

(h) Fig. 4. (Continued )

and RPoly. Only 1 instance of 4129 instances is unsolvable in the given time limit with RCH þ RTIP, but the set-based preprocessor failed to preprocess 34 instances and RPoly failed to solve 41 instances. 3.3. Discussion In this paper, we proposed an improved method of the recent set-based preprocessor of Irurozki et al.15 The proposed preprocessing method reduces the search space of the haplotype candidates, so the HIPP solver can solve the resulting problem instances e±ciently. From the fact that the redundancy of the bases (i.e. groups of haplotypes) directly depends on the inclusion relations of themselves, we can remove the process of building pair graphs and avoid the process of choosing the suitable subgroup of bases in the pair graph that are performed in each iterations. And we can cut down the number of iterations by considering the whole set (instead of all the chosen subgroups) of bases that are newly generated in the previous iteration. The proposed method, to enumerate all bases to be considered, iteratively uses only the intersections among the newly generated bases and the complement operations between the newly generated bases and the given genotypes. And then, to identify the redundant bases for the optimal solution, the method uses only the inclusion relations of the bases themselves. The proposed algorithm is simple and, moreover, very fast, so the computational time overload introduced by the preprocessing method is compensated by the reduction on the time required by real solver. The proposed method can be coupled with a wide range of optimization methods which consider sets of candidate haplotypes explicitly. To increase the computing speed of operations on the group of haplotypes, modi¯ed ternary vector representation was developed. By using the proposed representation, the operations such as intersections, complements, and inclusion relation tests can be done by logic bitwise operations. 1450020-13

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

M.-H. Choi, S.-H. Kang & H.-S. Lim

(a)

(b)

(c)

(d)

(e)

(f)

Fig. 5. Experimental results of the benchmark datasets. (a) Average numbers of SNPs in the instances. (b) Average numbers of ambiguous sites in the structural simpli¯ed instances. Preprocessing times of the setbased (c) and RCH (d). Total execution times of RCH þ RTIP (e) and RPoly (f).

The experimental results indicate that the proposed method runs much faster than the set-based preprocessor, and our method coupled with a classical HIPP solver outperform RPoly, a state-of-the-art solver for HIPP. Our method (RCH) runs 71 times faster than the set-based preprocessor on average, and RCH coupled 1450020-14

An improved preprocessing algorithm

with RTIP run 37 times faster than RPoly on average. As can be observed from the experimental results, our method coupled with RTIP can solve a large number of instances that were so far una®ordable in a reasonable time.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

Acknowledgments The authors thank Ms. Ekhine Irurozki and Ms. Ana So¯a Graça for kindly providing us sample data and solvers. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2012890).

References 1. Wikipedia, Webpage, http://en.wikipedia.org/wiki/Haplotype. 2. Crawford DC, Nickerson DA, De¯nition and clinical importance of haplotypes, Annu Rev Med 56:303–320, 2005. 3. Burgtorf C, Kepper P, Hoehe M, Schmitt C, Reinhardt R, Lehrach H, Sauer S, Clonebased systematic haplotyping (CSH): A procedure for physical haplotyping of whole genomes, Genome Res 13:2717–2724, 2003. 4. Salem RM, Wessel J, Schork NJ, A comprehensive literature review of haplotyping software and methods for use with unrelated individuals, Hum Genomics 2:39–66, 2005. 5. Gus¯eld D, Orzack SH, Haplotype inference, In Aluru S (ed.), Handbook on Bioinformatics, CRC Press, Boca Raton, pp. 1–28, 2005. 6. Catanzaro D, Labbe M, The pure parsimony haplotyping problem: Overview and computational advances, Int Trans Oper Res 16:561–584, 2009. 7. Graça A, Lynce I, Marques-Silva J, Oliveira AL, Haplotype inference by Pure Parsimony: A survey, J Comput Biol 17:969–992, 2010. 8. Graça A, Satis¯ability-based Algorithms for Haplotype inference, Ph.D. Thesis, Technical University of Lisbon, 2011. 9. Browning SR, Browning BL, Haplotype phasing: Existing methods and new developments, Nat Rev Genet 12:703–714, 2011. 10. Gus¯eld D, Haplotype inference by pure parsimony, Combinatorial Pattern Matching, LNCS 2676:144–155, 2003. 11. Wang L, Xu Y, Haplotype inference by maximum parsimony, Bioinform 19(14):1773– 1780, 2003. 12. Lancia G, Pinotti MC, Rizzi R, Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms, Informs J Comput 16:348–359, 2004. 13. Ramadhani S, Mousavi SR, Talebi M, An improved heuristic for haplotype inference, Gene 507:177–182, 2012. 14. Li X, Li J, An almost linear time algorithm for a general haplotype solution on tree pedigrees with no recombination and its extensions, J Bioinform Comput Biol 7(3):521– 545, 2009. 15. Irurozki E, Calvo B, Lozano JA, A preprocessing procedure for haplotype inference by pure parsimony, IEEE/ACM Trans Comput Biol Bioinform 80:1183–1195, 2011. 16. Hudson R, Generating samples under a wright–¯sher neutral model of genetic variation, Bioinform 18(2):337–338, 2002. 17. Gurobi Opimization Inc., Gurobi Optimizer, Software, 2013, http://www.gurobi.com/. 1450020-15

M.-H. Choi, S.-H. Kang & H.-S. Lim

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by STATE UNIVERSITY OF NEW YORK @ BINGHAMTON on 03/07/15. For personal use only.

18. Graça A, Marques-Silva J, Lynce I, Oliveira AL, Haplotype inference with pseudoBoolean optimization, Ann Oper Res 184:137–162, 2011.

Mun-Ho Choi received BS and MS degrees in computer science from Chonnam National University, Gwangju, Korea, in 1993 and 1995, respectively. During 2000–2004, he was a research professor in the Technology Innovation Center for Digital Multimedia Materials of Dongshin University, Naju, Korea. He is currently a PhD candidate in the School of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea. His research interests include bioinformatics, digital media, and virtual machines.

Seung-Ho Kang received BS, MS, and PhD in computer science from Chonnam National University, Gwangju, Korea, in 1994, 2003, and 2009, respectively. During 2010–2013, he was a postdoctoral researcher in the National Institute for Mathematical Sciences (NIMS), Daejeon, Korea. Since 2013, he has been working for the Department of Information Security, Dongshin University, Naju, Korea, as an assistant professor. His research interests include intrusion detection system, sensor network, and metaheuristic algorithm.

Hyeong-Seok Lim received BS degree in computer engineering from Seoul National University in 1983 and the MS and PhD in computer science from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1985 and 1993, respectively. During 1996–1997, he was a visiting scholar in the Department of Computer Science, Purdue University. He is currently a professor in the School of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea. His research interests include graph theory, interconnection networks, and bioinformatics.

1450020-16

An improved preprocessing algorithm for haplotype inference by pure parsimony.

The identification of haplotypes, which encode SNPs in a single chromosome, makes it possible to perform a haplotype-based association test with disea...
2MB Sizes 2 Downloads 8 Views