Novel and efficient tag SNPs selection algorithms.

Bio-Medical Materials and Engineering 24 (2014) 1383–1389 DOI 10.3233/BME-130942 IOS Press


Wen-Pei Chen a,*, Che-Lun Hung b,*, Suh-Jen Jane Tsai a, and Yaw-Ling Lin c,**

a Department of Applied Chemistry, Providence University, Taiwan. E-mail: [email protected]
b Department of Computer Science and Communication Engineering, Providence University, Taiwan. E-mail: [email protected]
c Department of Computer Science and Information Engineering, Providence University, Taiwan. E-mail: [email protected]

Abstract. SNPs are the most abundant form of genetic variation amongst species, and association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, that represent the rest of the SNPs. Existing tag SNP selection algorithms are notoriously time-consuming. This paper presents an efficient algorithm for tag SNP selection and applies it to the HapMap YRI data. The experimental results show that the proposed algorithm achieves better performance than the existing tag SNP selection algorithms; in most cases, it is at least ten times faster than the existing methods. In many cases, when the redundant ratio of the block is high, the proposed algorithm can even be thousands of times faster than the previously known methods. Tools and web services for haplotype block analysis, integrated with the Hadoop MapReduce framework, have also been developed using the proposed algorithm as the computation kernel.

Keywords: SNP, haplotype block, tag SNP selection, Hadoop, non-redundant site, redundant ratio

1. Introduction

Single nucleotide polymorphisms (SNPs) are promising markers for disease association research because of their high abundance along the human genome. Haplotype analysis has been successfully applied to identify DNA variations that are relevant to several common and complex diseases [1,9,15,16,18]. Many studies suggest that the human genome may be arranged into block structures in which SNPs are correlated, so that a small number of SNPs, called tag SNPs, is sufficient to capture most of the haplotype structure [2,6,17,3]. Several approaches have been suggested for defining block structures, some of which are more commonly used. Four main criteria for haplotype block partitioning are based on haplotype diversity [17,10,23], LD [6,7], four gamete tests [22,8] and information complexity.

Genotyping all existing SNPs for a large number of samples is still challenging even though SNP arrays have been developed to facilitate the task. Therefore, it is essential to select only informative SNPs that represent the original block structures in the genome for genome-wide association studies. Previous work on tag SNP selection has investigated both exact and approximate methods. Exact methods select a set of tag SNPs that accounts for the variations in the sample population [17,20], while approximate methods typically select fewer tags than exact methods at the cost of losing some information [5]. In this paper, an efficient algorithm is presented for the exact tag SNP selection problem. As the number of genotyped individuals and of SNPs in databases grows, tag SNP selection takes too much time to compute; speeding it up is therefore an important issue. The algorithm proposed in this work compresses the set of SNPs without losing the information it contains.

Hadoop is a software framework intended to support data-intensive distributed applications. It supports the MapReduce programming model [21] for writing applications that process large data sets in parallel in a cloud computing environment. The advantage of MapReduce is that it allows the map and reduce operations to be distributed. Recently, Hadoop has been applied in various domains in bioinformatics [4,19,13,12,11]. Integrated with Hadoop, the proposed algorithm provides more efficient and reliable solutions. The performance of the algorithm is evaluated on HapMap data. The experimental results indicate that the proposed algorithm is significantly faster than the other algorithms when the redundant ratio within a block is high.

* These authors contributed equally to this work.
** Corresponding author. E-mail: [email protected]. Department of Computer Science and Information Engineering, Providence University, 200, Sec. 7, Taiwan Boulevard, Taichung, Taiwan. Tel: +886-4-2632-8001 ext 18021; this work is supported in part by the National Science Council, Taiwan, R.O.C, grant NSC 99-2632-E-126-001-MY3 and NSC 100-2221-E-126-007-MY3.

0959-2989/14/$27.50 © 2014 – IOS Press and the authors. All rights reserved

2. Methods

2.1. Algorithm

Given a set of haplotypes H = {h1, h2, ..., hm} belonging to an arbitrary population, where each haplotype has n SNPs, the objective is to find a minimum set of tag SNPs T = {s1, s2, ..., sk}, which consists of selected SNPs of the haplotypes and can identify all of the common haplotypes [17,24] in a block. As in previous studies, this paper assumes that all SNPs are biallelic; the assumption can be relaxed with a minor modification of the algorithm. Under this assumption, ai,j ∈ {0, 1} denotes the allele of the ith haplotype at the jth SNP site of the matrix A. The input of the algorithm is m haplotypes, each with n SNPs, presented as a matrix A. H and S denote the set of haplotypes and the set of SNP sites, respectively. The output is a set of tag SNPs.

In a biallelic SNP haplotype matrix, each site s partitions the haplotypes into two groups A1, A2: A1 consists of the haplotypes that have the major allele 0 at site s, and A2 consists of the haplotypes that have the minor allele 1. Site s thus defines a partition of H, denoted by πs = {A1, A2}. This idea extends to a k-allelic SNP haplotype matrix: each SNP site s defines a partition of H, πs = {A1, A2, ..., Ak}, Ai ⊂ H. These subsets are pairwise disjoint, and their union is H. Two haplotypes belong to the same subset if they share the same allele symbol. This gives the following observation:

Observation 2.1 Given a haplotype matrix, each site s partitions the haplotypes into k groups, πs = {A1, A2, ..., Ak}, where A1, A2, ..., Ak ⊂ H, A1 ∪ A2 ∪ ... ∪ Ak = H, and Ai ∩ Aj = ∅ for all Ai ≠ Aj.
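As a minimal sketch of Observation 2.1, the Python below labels every haplotype with the group that its allele places it in at a single site. The function name `site_partition` and the 6 × 3 matrix are illustrative assumptions and do not come from the paper; groups are numbered from 1 in order of first appearance.

```python
from typing import Dict, List, Sequence


def site_partition(matrix: Sequence[Sequence[int]], site: int) -> List[int]:
    """Return, for each haplotype (row), the 1-based group index at `site`.

    Haplotypes that carry the same allele at `site` receive the same index;
    indices are assigned in order of first appearance.
    """
    index_of_allele: Dict[int, int] = {}
    indices: List[int] = []
    for row in matrix:
        allele = row[site]
        if allele not in index_of_allele:
            index_of_allele[allele] = len(index_of_allele) + 1
        indices.append(index_of_allele[allele])
    return indices


# Hypothetical biallelic haplotype matrix: rows h1..h6, columns s1..s3.
A = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

print(site_partition(A, 0))  # [1, 2, 1, 2, 1, 1]
print(site_partition(A, 1))  # [1, 2, 1, 1, 2, 1]
print(site_partition(A, 2))  # [1, 2, 1, 2, 1, 1]
```

In this toy matrix the first and third columns carry complementary alleles but induce the same partition of the haplotypes, so they convey the same information for distinguishing haplotypes.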



From this observation, the problem of tag SNP selection can be considered as finding the minimum number of SNP sites s1, s2, ..., sk ∈ S such that the partition defined by s1, s2, ..., sk can distinguish all common haplotypes in the matrix. When specific SNP markers are selected from S, these markers partition the set H into a collection of subsets; each subset contains elements that are equivalent to each other. The partition π induces an equivalence relation on H, since the corresponding relation is reflexive, symmetric, and transitive.

Observation 2.2 A partition π = {A1, A2, ..., A|π|}, Ai ⊂ H, induces an equivalence relation Rπ on the set of haplotypes H, such that s Rπ t if and only if there exists A ∈ π such that s, t ∈ A. The partition of H consisting of the collection of all equivalence classes is called the partition of H by R and is denoted by H/R.

Definition 2.3 (Join) Given two partitions π1, π2 over H, the joint relation of Rπ1 and Rπ2 is defined to be R(π1, π2) = {(h1, h2) ∈ H² | h1 Rπ1 h2 and h1 Rπ2 h2}; the joint relation is also an equivalence relation, whose set of equivalence classes is defined as π1 π2 = H/R(π1, π2), namely the joint partition of π1 and π2.

Given π, a partition of H, note that each element A ∈ π is a subset of H, A ⊂ H, and a unique index π → {1, ..., |π|} can be assigned to each A ∈ π. The same index is also assigned to the haplotypes within the same subset. Thus, the partition π can be represented by a list of m elements, namely List[π]. Here π[i] is defined as the index of the haplotype hi in the partition π, and the partition index sequence List[π] = ⟨π[1], π[2], ..., π[m]⟩ is the sequence of the indices of the haplotypes partitioned by π. As an example, the two partitions π1 = {{h1, h3, h5, h6}, {h2, h4}} and π2 = {{h1, h3}, {h2, h5}, {h4, h6}} can be represented as List[π1] = ⟨1, 2, 1, 2, 1, 1⟩ and List[π2] = ⟨1, 2, 1, 3, 2, 3⟩ respectively, with |π1| = 2 and |π2| = 3.
Note that a partition index sequence L naturally defines a partition over H. Let L(π1, π2) be a list representing the partition π1 π2; then L(π1, π2) can be computed from the following equation:

L(π1, π2)[i] = π2[i] + π1[i] · |π2|    (1)
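Eq. (1) can be checked on the two example partitions with a short Python sketch. The helper names `join_raw` and `normalize` are assumptions made for illustration; `normalize` relabels the raw values as 1, 2, ... in order of first appearance, which is one simple way to turn the output of Eq. (1) back into a partition index list.

```python
from typing import Dict, List


def join_raw(p1: List[int], p2: List[int]) -> List[int]:
    """Apply Eq. (1): L[i] = p2[i] + p1[i] * |p2|, with 1-based group indices.

    Assumes the indices in p2 run over 1..|p2|, so max(p2) equals |p2|.
    """
    size2 = max(p2)
    return [b + a * size2 for a, b in zip(p1, p2)]


def normalize(labels: List[int]) -> List[int]:
    """Relabel arbitrary group labels as 1..k in order of first appearance."""
    seen: Dict[int, int] = {}
    return [seen.setdefault(x, len(seen) + 1) for x in labels]


list_pi1 = [1, 2, 1, 2, 1, 1]  # List[pi_1] from the text
list_pi2 = [1, 2, 1, 3, 2, 3]  # List[pi_2] from the text

raw = join_raw(list_pi1, list_pi2)
print(raw)             # [4, 8, 4, 9, 5, 6]
print(normalize(raw))  # [1, 2, 1, 3, 4, 5]
```

Because every index in π2 lies between 1 and |π2|, two haplotypes receive the same value from Eq. (1) exactly when they agree in both π1 and π2, which is the condition in Definition 2.3.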

In this example, L(π1, π2) = ⟨4, 8, 4, 9, 5, 6⟩, and L(π1, π2) defines the partition πL(π1,π2) = {{h1, h3}, {h2}, {h4}, {h5}, {h6}}, |πL(π1,π2)| = 5. On one hand, πL(π1,π2) can be represented as List[πL(π1,π2)] = ⟨1, 2, 1, 3, 4, 5⟩. On the other hand, Definition 2.3 gives the partition π1 π2 = {{h1, h3}, {h2}, {h4}, {h5}, {h6}}, which can be represented as List[π1 π2] = ⟨1, 2, 1, 3, 4, 5⟩, |π1 π2| = 5. As a result, List[πL(π1,π2)] = List[π1 π2] and πL(π1,π2) = π1 π2.

Theorem 2.4 Given a partition index sequence L(π1, π2) which represents the joint partition of π1 and π2 over H, the partition defined by L(π1, π2) is equivalent to the partition π1 π2. In other words, L(π1, π2) is equivalent, up to relabeling of the indices, to the index list of the joint partition π1 π2. Namely, List[π1 π2] can be calculated correctly from Eq. (1).

To improve the efficiency of the tag SNP selection method, the algorithm first compresses the length (number of SNPs) of the haplotype matrix by grouping the SNP sites that carry the same information. SNP sites that partition the haplotypes into the same groups are called redundant sites, and such sites can be treated as one group. The first site of each group, which carries distinct information within the block, is called a non-redundant site (NRS). The idea of compression is stated in the following observation:

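Observation 2.5 Let S′ = {πs | s ∈ S} be the set of distinct partitions defined by the sites in S. Then ∏_{π∈S′} π = ∏_{π∈S} π, where ∏ denotes the joint partition of all listed partitions; that is, S′ is a compression of S and |S′| ≤ |S|.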
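The compression step can be sketched in Python by grouping columns that induce the same partition. This is an illustrative sketch rather than the authors' implementation: the names `canonical`, `compress` and `redundant_ratio` are assumptions, and the formula RR = 1 − NRS/n is inferred from the block reported in Section 3.2 (70 SNPs, 26 NRS, RR = 62.9%), since the text does not state it explicitly.

```python
from typing import Dict, List, Sequence, Tuple


def canonical(column: Sequence[int]) -> Tuple[int, ...]:
    """Partition index list of one column, labeled by first appearance."""
    seen: Dict[int, int] = {}
    return tuple(seen.setdefault(a, len(seen) + 1) for a in column)


def compress(matrix: Sequence[Sequence[int]]) -> Dict[Tuple[int, ...], List[int]]:
    """Map each distinct partition to the list of sites that define it."""
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for j in range(len(matrix[0])):
        key = canonical([row[j] for row in matrix])
        groups.setdefault(key, []).append(j)
    return groups


def redundant_ratio(n_sites: int, n_nrs: int) -> float:
    """Assumed definition: fraction of sites that are not NRS."""
    return 1 - n_nrs / n_sites


# Hypothetical 6 x 3 matrix; columns 0 and 2 induce the same partition.
A = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

groups = compress(A)
nrs = [sites[0] for sites in groups.values()]  # first site of each group
print(nrs)                                     # [0, 1]
print(round(100 * redundant_ratio(70, 26), 1)) # 62.9
```

Only one representative column per group needs to be kept for the later search, so the size of the search space depends on the number of NRS rather than on the raw number of SNPs in the block.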

To further compress the haplotype matrix, the algorithm must find the tag SNPs T = {s1, s2, ..., sk} ⊂ S′ such that all haplotypes of the matrix can be distinguished by T. In other words, let πS′ be the partition defined by S′, and let g = |πS′| be the number of groups identified by πS′. Considering all L ⊂ S′ in increasing order of |L|, the algorithm looks for the first partition πL such that |πL| = g. That is, πL = ∏_{π∈T} π = ∏_{π∈S′} π = ∏_{π∈S} π. By using the idea of joint partition, an efficient tag SNP selection algorithm is obtained.

Theorem 2.6 Given an m × n haplotype block B, the minimum number of tag SNPs can be found in O(n^t m)/n! time, where t denotes the minimum number of tag SNPs required by B.

2.2. Web service

Based on the algorithm, a web application has been developed to provide useful tools for haplotype block analysis and tag SNP selection. Compared with other similar applications, the web service in this work has the following three distinct features.

First, the proposed algorithm selects the tag SNPs efficiently even when the haplotype block contains a relatively large number of SNP sites. Our analysis shows that, in real data, a large portion of the SNP sites within a block are redundant, that is, they carry exactly the same information. For example, in the HapMap phasing data of chromosome 20 from the Yoruba in Ibadan, Nigeria (abbreviated as YRI), the longest haplotype blocks starting at each SNP site have on average 28 non-redundant sites and a redundant ratio (abbreviated as RR) of 43%; each block requires 9.3 tag SNPs on average. The proposed algorithm identifies the required tag SNPs very quickly when NRS is small. In most cases, the algorithm finds the optimal solution in a few minutes, which is considerably faster than other methods.

Second, the web service provides a haplotype block analyzer that can be used to analyze the phasing data from HapMap. After the user chooses the population and chromosome of interest and the desired haplotype block characteristics, the analyzer calculates the longest haplotype block starting at each SNP site. The system reports the number of blocks for various block sizes, together with analysis profiles of the non-redundant sites, redundant ratios and common haplotype groups of each block. After the user chooses a haplotype block for tag SNP selection, the web service predicts and reports the expected execution time, and then invokes the corresponding tag SNP selection algorithm.

Fig. 1. Haplotype block partitioning and selection on MapReduce framework.



Third, the algorithm improves our previous work [14] and adopts the Hadoop MapReduce framework to compute the longest block starting at each SNP site. Figure 1 illustrates the MapReduce framework for the block computation and tag SNP selection scheme. Assuming that the number of map operations is n and the number of SNPs is ℓ, the input m · ℓ haplotype sample is split into ℓ/n chunks. Each map calculates the diversity scores of the blocks within the chunk for which it is responsible. Thus the output ⟨key, value⟩ pairs of each map are ⟨(block start number, block end number), diversity score⟩ pairs. Here the diversity score of a haplotype block is defined as the coverage of common haplotypes within the block. The map_i calculates the diversity scores of the blocks δ(i · n/ℓ, i · n/ℓ), δ(i · n/ℓ, i · n/ℓ + 1), ..., δ(i · n/ℓ + n/ℓ, i · n/ℓ + n/ℓ). Therefore, each map has (n/ℓ)² scores. The reduce stage performs the haplotype block selection algorithm. In our algorithm, only one reduce operation is needed in the reduce stage. The reduce operation finds the longest block starting at each SNP site by merging blocks with the diversity scores of interest. The proposed web service is accessible at http://bioinfo.cs.pu.edu.tw/~hap/lbpcdt.html.

3. Experimental results on HapMap YRI data

Simulated data sets and real data sets were used to measure the performance of our algorithm. The criterion of 80% coverage of common haplotypes in blocks was used, and the tag SNPs were defined as the minimum set of SNPs that could distinguish all common haplotypes within a block. The experiment was conducted on 10 virtual machines, each equipped with a 2.27 GHz Pentium 4 CPU and 2 GB RAM. In total, 2,040 simulated samples and 500 randomly sampled HapMap blocks were analyzed on these machines; the total computation time was about 20 days, or equivalently 200 machine-days.

3.1. YRI data characteristics

The proposed algorithm was applied to a biological data set from chromosome 20 of the HapMap YRI data. The data set contained 120 individuals and 71,539 SNPs. For each marker site i, its corresponding farthest right marker j was calculated so that [i, j] satisfied 80% coverage of common haplotypes. Figure 2 shows the analysis result for all blocks. There are 10,745 blocks (15%) whose size is between 31 and 40, and 38,844 blocks (54.3%) whose size is between 21 and 60. Both the number of non-redundant sites and the redundant ratio increased as the length of the haplotype block increased. The average number of non-redundant sites over all blocks is 28, and the total redundant ratio is 43%.

3.2. Performance evaluation

Figure 3 shows the performance on 500 randomly sampled haplotype blocks from HapMap YRI chromosome 20 that require 9 or 10 tag SNPs. Regardless of the number of SNPs within the block, for a fixed number of tag SNPs the computation time increased with NRS. These samples were also processed with Zhang et al.'s algorithm. The proposed algorithm always consumed much less time than Zhang et al.'s method, while the speed-up varied depending on the characteristics of the input blocks. In most cases, the algorithm was at least ten times faster than Zhang et al.'s. In many cases, when the redundant ratio of the block was high, the proposed algorithm was even thousands of times faster than Zhang et al.'s. For example, in the HapMap YRI chromosome 20 data, the algorithm found the tag SNPs of block (58025, 58094) in 2 seconds, whereas Zhang et al.'s algorithm took 877,562 seconds (or 10 machine-days). The block had 70 SNPs with 26 NRS (RR = 62.9%); 8 tag SNPs were required to identify all common haplotypes.
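The dependence of the running time on NRS and on the number of tag SNPs follows from the shape of the search in Section 2.1, which tries subsets of non-redundant sites in increasing order of size. The Python below is a plain brute-force sketch of that idea under stated assumptions, not the authors' optimized algorithm: the names `join`, `joint_partition` and `select_tag_snps` are hypothetical, the input is a non-empty list of partition index lists (one per NRS, with indices 1..k), and the rows are assumed to be exactly the haplotypes that must be distinguished.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Tuple


def join(p1: Sequence[int], p2: Sequence[int]) -> List[int]:
    """Joint partition via Eq. (1), relabeled as 1..k by first appearance."""
    size2 = max(p2)
    seen: Dict[int, int] = {}
    return [seen.setdefault(b + a * size2, len(seen) + 1) for a, b in zip(p1, p2)]


def joint_partition(partitions: Sequence[Sequence[int]],
                    sites: Iterable[int], m: int) -> List[int]:
    """Join the partitions of the chosen sites, starting from one big group."""
    joint = [1] * m
    for j in sites:
        joint = join(joint, partitions[j])
    return joint


def select_tag_snps(partitions: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Smallest set of sites whose joint partition has as many groups as all sites."""
    m, n = len(partitions[0]), len(partitions)
    target = max(joint_partition(partitions, range(n), m))  # g in the text
    for k in range(n + 1):                     # increasing subset size
        for combo in combinations(range(n), k):
            if max(joint_partition(partitions, combo, m)) == target:
                return combo
    return tuple(range(n))  # not reached: the full set always meets the target


# Hypothetical partitions of four haplotypes by three non-redundant sites.
nrs_partitions = [
    [1, 1, 2, 2],
    [1, 2, 1, 2],
    [1, 2, 2, 1],
]
print(select_tag_snps(nrs_partitions))  # (0, 1)
```

With n non-redundant sites and t tags, this enumeration examines on the order of C(n, t) subsets before it succeeds, which is why reducing n through compression has such a large effect on the running time.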


Fig. 2. Number of blocks, common haplotype groups, non-redundant sites and redundant ratio of haplotype blocks of Chromosome 20 on YRI population.

[Figure 3 plot: CPU time in seconds (log scale) versus NRS (31 to 45) for tagSNP = 9 and tagSNP = 10, with fitted curves T9(n) ≈ 48.9 · (1.31)^n and T10(n) ≈ 113.3 · (1.36)^n.]

Average time (in sec)

NRS    tag SNPs = 9    tag SNPs = 10
31     58.5            143
32     85.8            216
33     109.4           306
34     154.0           392
35     194.2           501
36     240.6           747
37     320.2           1,046
38     408.9           1,378
39     556.5           1,886
40     682.2           2,331

Fig. 3. Average computation time of the algorithm on YRI haplotype blocks with 9 and 10 tag SNPs.
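The tabulated averages can be used to sanity-check the growth bases of the fitted curves in Fig. 3. The Python below is a rough check, not the authors' fitting procedure: the function name `growth_factor` is an assumption, and the estimate uses only the first and last table entries (the geometric mean of the step-to-step ratios), so it does not reproduce the prefactors of the fits.

```python
from typing import List

# Average times (in seconds) from the table, for NRS = 31..40.
t9: List[float] = [58.5, 85.8, 109.4, 154.0, 194.2, 240.6, 320.2, 408.9, 556.5, 682.2]
t10: List[float] = [143, 216, 306, 392, 501, 747, 1046, 1378, 1886, 2331]


def growth_factor(times: List[float]) -> float:
    """Average multiplicative increase in time per additional NRS."""
    steps = len(times) - 1
    return (times[-1] / times[0]) ** (1 / steps)


print(round(growth_factor(t9), 2))   # 1.31
print(round(growth_factor(t10), 2))  # 1.36
```

Both estimates agree with the bases 1.31 and 1.36 shown in the figure: over this range, each additional non-redundant site multiplies the average running time by roughly one third.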

4. Conclusions

In this paper, a new method for tag SNP selection was presented. The results show that the performance of the algorithm is sensitive to the number of non-redundant sites and to the number of tag SNPs required. Analysis of the HapMap YRI chromosome 20 data shows that the blocks have on average 28 non-redundant sites and a redundant ratio of 43%. Most blocks are small, and their tag SNPs can be identified quickly by our algorithm. When the redundant ratio is large, our algorithm decreases the computational cost significantly. Although the algorithm is more efficient than previous methods, it is still computationally expensive for blocks that contain a large number of non-redundant SNP sites. In particular, when NRS is greater than 60 and more than 12 tag SNPs are required, the computation takes over 10 days. Given the abundance of bioinformatic data now available, traditional time-consuming sequential methods need the assistance of emerging parallel processing methodology. A parallelized framework based on Hadoop MapReduce has been developed to enhance the performance of our algorithm.



References

[1] Penelope E. Bonnen, Peggy J. Wang, Marek Kimmel, Ranajit Chakraborty, and David L. Nelson. Haplotype and linkage disequilibrium architecture for human cancer-associated genes. Genome Research, 12(12):1846–1853, 2002.
[2] M. Daly, J. Rioux, S. Schaffner, T. Hudson, and E. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001.
[3] E. Dawson, G. Abecasis, et al. A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418:544–548, 2002.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72–77, January 2010.
[5] Keyue Ding, Jing Zhang, Kaixin Zhou, Yan Shen, and Xuegong Zhang. htSNPer1.0: software for haplotype block partition and htSNPs selection. BMC Bioinformatics, 6(1):38, 2005.
[6] S. B. Gabriel, S. F. Schaffner, H. Nguyen, et al. The structure of haplotype blocks in the human genome. Science, 296(5576):2225–2229, 2002.
[7] G. Greenspan and D. Geiger. High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics, 20(suppl 1):i137–i144, 2004.
[8] R. R. Hudson and N. L. Kaplan. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147–164, 1985.
[9] Amit Indap, Gabor Marth, Craig Struble, Peter Tonellato, and Michael Olivier. Analysis of concordance of different haplotype block partitioning algorithms. BMC Bioinformatics, 6(1):303, 2005.
[10] G. C. Johnson, L. Esposito, B. J. Barratt, et al. Haplotype tagging for the identification of common disease genes. Nat Genet., 29(2):233–237, Oct 2001.
[11] Hung C. L. and Hua G. J. Cloud computing for protein-ligand binding site comparison. Biomed Research International, 2013.
[12] Hung C. L. and Lin Y. L. Implementation of a parallel protein structure alignment service on cloud. International Journal of Genomics, 2013.
[13] Hung C. L. and Lin C. Y. Open reading frame phylogenetic analysis on the cloud. International Journal of Genomics, 2013.
[14] Yaw-Ling Lin. Efficient algorithms for SNP haplotype block selection problems. In Proceedings of the 14th annual international conference on Computing and Combinatorics, COCOON '08, pages 309–318, Berlin, Heidelberg, 2008. Springer-Verlag.
[15] Alfonso Mas, E. Blanco, G. Monux, E. Urcelay, F.J. Serrano, E.G. de la Concha, and A. Martinez. DRB1-TNF-α-TNF-β haplotype is strongly associated with severe aortoiliac occlusive disease, a clinical form of atherosclerosis. Human Immunology, 66(10):1062–1067, 2005.
[16] Petra Nowotny, Jennifer M Kwon, and Alison M Goate. SNP analysis to dissect human traits. Current Opinion in Neurobiology, 11(5):637–641, 2001.
[17] N. Patil, A. J. Berno, D. A. Hinds, et al. Blocks of limited haplotype diversity revealed by high resolution scanning of human chromosome 21. Science, 294:1719–1723, 2001.
[18] A Reif, S Herterich, A Strobel, A-C Ehlis, D Saur, C P Jacob, T Wienker, T Topner, S Fritzen, U Walter, A Schmitt, A J Fallgatter, and K-P Lesch. A neuronal nitric oxide synthase NOS-I haplotype associated with schizophrenia modifies prefrontal cortex function. Mol Psychiatry, 11(3):286–300, 2006.
[19] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, 2009.
[20] Paola Sebastiani, Ross Lazarus, Scott T. Weiss, Louis M. Kunkel, Isaac S. Kohane, and Marco F. Ramoni. Minimal haplotype tagging. Proceedings of the National Academy of Sciences, 100(17):9900–9905, 2003.
[21] Ronald Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.
[22] N. Wang, J.M. Akey, K. Zhang, R. Chakraborty, and L. Jin. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Human Genetics, 71:1227–1234, 2002.
[23] J. Zahiri, G. Mahdevar, A. Nowzari-dalini, H. Ahrabian, and M. Sadeghi. A novel efficient dynamic programming algorithm for haplotype block partitioning. Journal of Theoretical Biology, 267(2):164–170, 2010.
[24] Kui Zhang, Fengzhu Sun, Michael S. Waterman, and Ting Chen. Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. In RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology, pages 332–340, New York, NY, USA, 2003. ACM Press.

