This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

1

hc-OTU: A Fast and Accurate Method for Clustering Operational Taxonomic Unit based on Homopolymer Compaction Seunghyun Park, Hyun-soo Choi, Byunghan Lee, Jongsik Chun, Joong-Ho Won, and Sungroh Yoon∗ , Senior Member, IEEE Abstract—To assess the genetic diversity of an environmental sample in metagenomics studies, the amplicon sequences of 16s rRNA genes need to be clustered into operational taxonomic units (OTUs). Many existing tools for OTU clustering trade off between accuracy and computational efficiency. We propose a novel OTU clustering algorithm, hc-OTU, which achieves high accuracy and fast runtime by exploiting homopolymer compaction and k-mer profiling, to significantly reduce computing time for pairwise distances of amplicon sequences. We compare the proposed method with other widely used methods, including UCLUST, CD-HIT, MOTHUR, ESPRIT, ESPRIT-TREE, and CLUSTOM comprehensively, using nine different experimental datasets and many evaluation metrics, such as normalized mutual information, adjusted rand index, measure of concordance and F-score. Our evaluation reveals that the proposed method achieves accuracy comparable to those of MOTHUR and ESPRIT-TREE, two widely used OTU clustering methods, with orders of magnitude speed-up. Index Terms—clustering algorithm, operational taxonomic unit (OTU), pyrosequencing, metagenomics, 16s rRNA.

!

1

I NTRODUCTION

I

T is widely accepted that the species richness of microbial communities known thus far is only a part of the entire array of microbial species, due to the difficulties in establishing culture environments for microbes [1]. For this reason, conventional microbiology conducted by Sanger sequencing has been almost thoroughly culture-dependent. Due to the advance of the next-generation sequencing (NGS) [2], which can be used for sequencing 16s rRNA [3], [4] genes of microbiomes, it has become possible to study uncultured microbial communities. This metagenomics approach typically relies on bioinformatics techniques and high-throughput data. The two major approaches in metagenomics utilize 16s rRNA genes and shotgun sequencing [5]. The approach based on utilizing 16s rRNA (which is highly conserved between microbial species) provides species-specific signatures and is cost-effective, and is thus widely used for studying microbiomes. The analytical procedure for exploiting 16s rRNA se-



• •

• •

S. Park is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea and is also with the School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea. H. Choi, and B. Lee are with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea. J. Chun is with School of Biological Sciences, Seoul National University, Seoul 08826, Republic of Korea and is also with Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Republic of Korea. J. Won is with the Department of Statistics, Seoul National University, Seoul 151-747, Republic of Korea. *S. Yoon is the corresponding author with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea and is also affiliated with ASRI, BDI, INMC, and Bioinformatics Institute, Seoul National University, Seoul 151-744, Republic of Korea (e-mail: [email protected]).

quences in metagenomics is as follows: (1) The sequence information from environmental samples is obtained using NGS; (2) Obtained sequences are pre-processed to remove those that contain errors [6], [7]; (3) The sequences are clustered into operational taxonomic units (OTUs); (4) The OTUs are identified by comparing with 16s rRNA databases such as SILVA [8] and GreenGenes [9]; Finally, (5) αdiversity, β -diversity, and rarefaction curves that characterize species diversity are computed. One of the most important and influential steps above is the OTU clustering process, since the final part of the analysis, namely species diversity determination and OTU identification, is affected by the distribution of OTUs. The OTU is employed for species-wide classification and generally accompanies a threshold for species assignment. Traditionally, a match of 97% is considered sufficient for species assignment and 95% for genus-level assignment [10]. Many OTU clustering methods, e.g., MOTHUR [11], UCLUST [12], CD-HIT [13], ESPRIT [14], ESPRIT-TREE [15], CLUSTOM [16], have been developed and used for microbiome studies. According to many related studies, MOTHUR tends to have higher clustering accuracy, while CD-HIT and UCLUST both have an advantage in speed. ESPRIT-TREE is known to be comparable to MOTHUR on accuracy while also being faster. A recent study [17] suggests that homopolymer compaction may be employed to achieve accuracy comparable to MOTHUR, while significantly reducing computation efforts. Inspired by this result, we propose hc-OTU, an OTU clustering method that combines homopolymer compaction and k-mer profiling. The proposed method is highly paralellizable, achieving about 7,000 times as fast speed as MOTHUR and about 6 times as ESPRIT-TREE, another

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

widely used k-mer based method. Furthermore, hc-OTU has accuracy comparable to MOTHUR in all the performance evaluation metrics considered. The remainder of this paper is organized as follows. Section II describes the features of existing OTU clustering methods and the performance evaluation metrics used in this study. Details of the proposed algorithm can be found in Section III. The utilized datasets and our experimental setup are presented in Section IV. Sections V and VI provide our experimental results and conclusion, respectively.

2

BACKGROUND

In this section, we review existing OTU clustering methods and evaluation metrics for quantifying the performance of these methods. 2.1

Conventional OTU binning tools

CD-HIT [13] is a widely used program due to its high speed. To cluster large amounts of DNA sequence data, it is essential to compare all sequence pairs, which requires enormous amounts of computation; CD-HIT resolves this issue by using a short-word filter that exploits a threshold number of identical polypeptides in two sequences at a certain level of identity. Most pairwise alignments can be skipped by using the short-word filter procedure. UCLUST [12] is a greedy heuristic clustering algorithm, similar to CD-HIT. While CD-HIT uses k-mer-based filtering, UCLUST employs USEARCH [12], a word-based detection algorithm, which is known to be several times faster than BLAST. While ultrafast, CD-HIT and UCLUST suffer from diminished clustering accuracy. ESPRIT [14] adopts a hierarchical clustering method. Distances between all pairs of aligned sequences must be computed, but the computational procedure is highly complex in both time and space. ESPRIT applies k-merbased filtering to improve computation speed and memory efficiency. The time complexity of ESPRIT is O(n2 ). ESPRITTREE [15] is an improved version that reduces the time complexity to O(n1.17 ) by using a pseudometric-based partition tree. MOTHUR [11] is an accelerated version of DOTUR [10] and SONS [18] and is a combination of several analytical tools; it has become a widely used program for analyzing 16s rRNA sequence data. MOTHUR provides various tools for analyzing both pyrosequencing data and Sanger sequencing data. In order to cluster OTUs using MOTHUR, similarities for all pairs of sequences need to be initially computed, followed by hierarchical clustering via linkages. Although MOTHUR provides an option that only computes the similarities larger than a predefined value, the time and space complexity is still large and results in an often painful computation time; For this reason, the most recent version provides multithreading options to improve the speed of similarity computation. CLUSTOM [16] also uses k-mer filtering to skip repetitive similarity comparisons. After random sampling of a predefined number of sequences among the input, a k-mer profile similarity corresponding to the specific alignmentbased similarity is computed for the sampled sequences. A

2

sequence is then considered as a vertex of a graph whose edges are created based on the computed k-mer profile threshold. After the generation of a graph, CLUSTOM sequentially builds a primary cluster by setting the highestdegree vertex as a seed. In the next refinement step, the remaining singleton sequences are integrated into the existing clusters via alignment-based similarity comparison to the seed. In the final recovery step, the final OTU is generated by including the other sequences connected to the existing seed in the cluster composed of seed and singleton. TBC [17] imitates the microbial differentiation process, and uses homopolymer compaction to remove homopolymeric errors in 454 pyrosequencing results. In the first step, after compressing raw data by homopolymer compaction, a primal local cluster is formed by word-based clustering. In the second step, a consensus sequence is generated by multiple sequence alignment on each local cluster. In the final step, similarity metrics between consensus sequences are computed by CLUSTALW [19] or BLAST [20], and the final OTU is built by integrating clusters whose similarity metrics are larger than a user-specified threshold. TBC and the 454 pyrosequencing sample dataset (Kimchi, Soil, and Water) is available on the EzBioCloud webpage (http://sw.ezbiocloud.net/tbc). 2.2

Evaluation metrics

2.2.1 Normalized mutual information Normalized mutual information (NMI) [21] measures how closely the OTU clusters are assigned with a golden reference. NMI takes a value between 0 and 1, where 1 means that the generated cluster is the same as the reference, and 0 implies that the OTU cluster is randomly generated. Denote an OTU cluster by C = {c1 , c2 , . . . , cJ }, and golden reference by Ω = {ω1 , ω2 , . . . , ωK }. Then the NMI is computed as follows:

2 · I(Ω|C) , H(Ω) + H(C)

NMI(Ω, C) = where

I(Ω|C) =

  |ωk ∩ Cj | k

N

j

log

N |ωk ∩ Cj | |ωk ||Cj |

is the mutual information between C and Ω,

H(Ω)=−

 |ωk | k

N

log

|ωk | N

log

|cj | N

is the entropy of Ω, and

H(C)=−

 |cj | j

N

is the entropy of C . Here, N is the number of samples in the data.

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

2.2.2 Adjusted rand index

3

: Data

Raw Sequences

The adjusted rand index (ARI) [22] is a widely used objective metric for clustering performance evaluation. By repeatedly extracting pairs from the whole data, ARI measures the frequency of inclusion, i.e., whether or not the extracted pair belongs to the same cluster. Assume that P = {p1 , p2 , . . . , pN } and Q = {q1 , q2 , . . . , qM } are two ways of clustering n samples. Let fij stand for the number of pairs that belong to the i-th cluster of P and j -th cluster of Q. ARI is defined as

: Process

Preprocessing (Removing Chimeric Sequences)

Generating Representative Sequences

Homopolymer Compaction

Cluster Integration

(Section 3.3)

(Section 3.4)

(Section 3.1)

  ( ( ) j (|q2j |) /(n2 ) i       ARI(P, Q) = 1  |pii,j ( 2 |)+ j (|q2j |) − (|p2i |) j (|q2j |) /(n2 ) 2 i i 

fij 2

)−



Clustering using k-mer profiling

  where |pi | = k fik and |qj | = k fkj . ARI has zero expected value for independent clusterings and a maximum of 1 for identical clusterings. It is known that some pairs of clusterings may result in negative values [23].

Partial Clusters

OTUs

Figure 1: Flow chart of proposed approach.

Measure of concordance (MoC) [24] is a pair-counting measure, similar to ARI. After aggregating the counts of members that are common to the two given clustering results, the MoC is computed as



Calculating Genetic Distance

(Section 3.2)

2.2.3 Measure of concordance

MoC(P, Q) = √

Filtering using kmer Profiling

|pi | 2

TP =



I(d(si , sj ) ≤ γ)

k=1 si ,sj ∈ck

M N  2  fij 1 ⎝ − 1⎠ N M − 1 i=1 j=1 |pi ||qj |

where P = {p1 , p2 , . . . , pN } and Q = {q1 , q2 , . . . , qM } are two clusterings, and N and M are the number of clusters in P and Q, respectively. The number of members that belong to both pi and qj is denoted by fij . MoC takes a value between 0 and 1, where the value of 1 means that two clusterings are identical, while 0 means they are independent.

M  

FP =

M  

I(d(si , sj ) > γ)

k=1 si ,sj ∈ck

TN =

  

I(d(si , sj ) > γ)

k=l si ∈ck sj ∈cl

FN =

  

I(d(si , sj ) ≤ γ),

k=l si ∈ck sj ∈cl

where γ is the genetic distance threshold. Typically, 0.03 or 0.05 is chosen for OTU clustering.

2.2.4 Fβ -score

3

The Fβ -score is defined as the harmonic mean of precision and recall. The parameter β is a positive real number typically chosen among 0.5, 1, and 2. The precise definition is as follows.

hc-OTU is comprised of four main steps illustrated in Fig. 1. First, a compact form of each sequence reads is generated using homopolymer compaction (Section 3.1). Next, compacted sequences are grouped into partial clusters (Section 3.2), and then the consensus of each partial cluster is chosen as the representative sequence (Section 3.3). Finally, partial clusters with similar representative sequences are merged into a single cluster (Section 3.4). Details of each step and the pseudocode are presented in Fig. 2 and Algorithm 1, respectively.

Fβ = (1 + β 2 )

precision · recall (β 2 · precision) + recall

TP TP + FP TP recall = TP + FN

precision =

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. For the sequence S = {s1 , s2 , . . . , sN } and the OTU cluster C = {c1 , c2 , . . . , cM }, TP, FP, TN, and FN are computed as follows.

3.1

M ETHODOLOGY

Homopolymer compaction

A homopolymer is a consecutive sequence of the same type of nucleotide (Fig. 2A). Homopolymers are converted to a single nucleotide of identical type by homopolymer compaction. For example, the sequence AAT CAAAT T T T CCGT can be transformed to AT CAT CGT . Homopolymer compaction is performed on

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

$+RPRSRO\PHU&RPSDFWLRQ (Section 3.1) Seq  CATTTACAGGGACGTTCGTAA

4

'&OXVWHU,QWHJUDWLRQ

(Section 3.4)

K-mer boundary

CATACAGACGTCGTA

SeqH 

P12

%&OXVWHULQJXVLQJ.PHUSURILOLQJ (Section 3.2)

P1

P3

P4

P7



P7

P2

P6 P9 P8

P7

P2

OTU boundary

P2

(Section 3.3)

CCATTTACGGGGAGTTCCGT ACATTTTACGGGGGAGTCG GCATTTACGGGAGTTCGTT

 -CCATTT-ACGGGG-AGTTCCGT       | ||||  |||||| ||| | |   S  AC-ATTTTACGGGGGAGT-C-G(update) S  ACCATTTTACGGGGGAGTTCCGT 

P8

P6

P3

P4

&([WUDFWLQJ5HSUHVHQWDWLYH6HTXHQFHV Srep= S1    S2 S3

P9

P13

P9 P8

P5

P5 P13

P6

P13

P1

P12

P11

P12

P11

P10

K-mer distance threshold

P10

P5

C1

Srep

C3

C1

P5

C1

2

rep

P12

alignment

consensus

P13 P6

 ACCATTTTACGGGGGAGTTCCGT      | ||||  ||||| ||||| ||   S  GC-ATTT-ACGGG--AGTTC-GTT (update) S  ACCATTTTACGGGGGAGTTCCGTT 



Srep

alignment

C4

3

rep

C2

consensus

P8 C2

Figure 2: hc-OTU algorithm overview.

Algorithm 1 hc-OTU algorithm

S = {s1 , s2 , . . . , sN }  Dataset containing N sequence reads 2: Param: θh , θk , θg  θh : threshold for k-mer distance between homopolymer compacted sequences, θk : k-mer distance threshold for cluster integration, θg : Genetic distance threshold 3: Output: C = {c1 , c2 , . . . , cM }  A set of OTU cluster 1: Input:

4: Step 1: Homopolymer compaction (Section 3.1) 5: si = ti,1 , ti,2 , . . . , ti,|si | 

6:

7: 8: 9: 10: 11: 12: 13: 14:

 ti,j : The j -th nucleotide in the sequence si . ti,j ∈ {A, C, G, T } H = {h1 , h2 , . . . , hN } ← {∅}  hi : A compacted sequence of si by using homopolymer compaction procedure for i ← 1 to |S| do currentNucleotide ← ti,1 hi ← ti,1  for j ← 2 to |si | do if ti,j = currentNucleotide then hi ← h i ti,j  currentNucleotide ← ti,j

15: Step 2: Clustering using k-mer profiling (Section 3.2) 16: l ← 0 17:

18:

19: 20: 21: 22: 23: 24: 25: 26:

27:

 l: The number of partial clusters H = {h1 , h2 , . . . , hN }  hi : A compacted sequence of si by using homopolymer compaction U = {u1 , u2 , . . . , uN } ← {0, 0, . . . , 0}  ui : An index of the partial cluster assigned by sequence si procedure for i ← 1 to N do if ui = 0 then continue l←l+1 pl ← {si }  pl : l-th partial cluster for j ← i + 1 to N do if uj = 0 then continue if Dk (hi , hj ) < θh then  Dk : A function to compute a distance between two k-mer profiles (Eq. (1)) pl ← pl ∪ {sj }

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

28: Step 3: Extracting representative sequences (Section 29: 30: 31: 32: 33:

34: 35: 36:

3.3) P = {p1 , p2 , . . . , pl }  P : A set of partial cluster derived from Step 2 R = {r1 , r2 , . . . , rl }  rl : The representative sequence of l-th partial cluster procedure for i ← 1 to |P | do sort(pi )  sort: A function that sorts sequences by descending order of sequence length ri ← pi,1  pi,1 : First element in the i-th partial cluster for j ← 2 to |pi | do ri ← getAlign(ri , pi,j ))  getAlign: A function that returns the merged sequence of two aligned sequences

37: Step 4: Cluster integration (Section 3.4) 38: V = {v1 , v2 , . . . , vl } ← {0, 0, . . . , 0}

39: 40: 41: 42: 43: 44: 45: 46:

 vl : An index of partial cluster that l-th partial cluster was merged into P = {p1 , p2 , . . . , pl }  P : A set of partial cluster derived from Step 2 procedure for i ← 1 to |P | do if vi = 0 then continue c i ← pi vi ← i for j ← i + 1 to |P | do if vj = 0 then continue

if Dk (ri , rj ) > θk then continue else if Dg (ri , rj ) > θg then continue  Dg : A function to compute a genetic distance between two sequences (Eq. (2)) 49: else 50: c i ← c i ∪ pj 47: 48:

all input sequences. The major errors in 454 pyrosequencing data are the homopolymeric errors, i.e., the difference between the lengths of the genuine sequences and the pyrosequencing sequences. Therefore, the effect of indel errors can be minimized by using homopolymer compaction. 3.2

Clustering using k-mer profiling

Clustering refers to a data analysis technique that aims at grouping similar objects together while separating dissimilar ones [25], and clustering has found a variety of applications in bioinformatics [26], [27]. After homopolymer compaction, the resulting compacted sequences are clustered into partial clusters using a greedy clustering method (Fig. 2B). First, a base-compacted sequence is selected, followed by measuring the k-mer distance with another compacted sequence. The compacted sequences that have a distance less than θh are grouped into a single cluster. Next, compacted sequences that do not belong to any cluster are selected and clustered in the same way.

5

Although there are many definitions for the distance between two k-mer profiles [28], [29], we use the following definition to define the k-mer distance between the reads of si and sj [30].



Dk (si , sj ) = 1 −

min(si (τ ), sj (τ )) , min(|si |, |sj |) − k + 1 τ

(1)

where si (τ ) is the k-mer profile of i-th sequence read si that has a read length of |si |. Calculating the genetic distance using, e.g., the Needleman-Wunsch algorithm, is considered computationally expensive compared to k-mer profiling, and highly correlated to the k-mer distance [14]; hence, a filtering step using the k-mer distance can effectively reduce the running time. Also in practice, there may be many OTUs with low abundance, such as singleton and doubleton. These noisy OTUs can increase computation time not only for the pre-clustering but also for subsequent procedures. Preclustering using k-mer profiling decreases the number of partial clusters and is also easy to parallelize. The parameter θh should be chosen carefully, since it affects the accuracy and computing time significantly. We suggest an empirical optimal value for θh in Section 5.4. 3.3

Extracting representative sequences

Representative sequences are generated by using multiple sequence alignment (Fig. 2C). The original input sequences, not the compacted sequences, are used in this step. The longest sequence is selected for the first representative sequence, and then sequence alignments are repeatedly performed with the other sequences in the partial cluster, until only one representative sequence remains. All partial clusters are constructed in the same way. If there are N sequences in a partial cluster and S1 , S2 , . . . , SN are the sequences in descending order of length, the length Li of Si satisfies Li ≥ Li+1 , for i = 1, . . . , N . The first representative sequence Srep is S1 , and then the representative sequence is updated with the consensus of aligned sequences of Srep and S2 . The remaining N − 1 representative sequences are generated in the same way. The rule for calculating consensus sequences is as follows. (1) If any nucleotides in a specific position are identical, choose that nucleotide; (2) If there is an indel error, choose the nucleotide that is inserted; (3) If there is mismatch, choose the nucleotide of Srep . Merging of the partial clusters is determined by the pairwise distance between the representative sequences. Thus the hypothesis to be tested is that the representative sequence should be close to the remaining sequences in the partial cluster. Fig. 3 shows a boxplot of genetic distances between the representative sequences and other sequences in each partial clusters. The result shows that the majority is distributed from zero to 0.005 and the value of 2.7σ is less than 0.01 for all datasets. This means that the representative sequence is close to the centroid of each partial cluster. 3.4

Cluster integration

In the cluster integration procedure, the representative sequences from the partial clusters are merged into a new

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

6

Table 1: Datasets used in this study 0.045

Dataset

0.04

Genetic Distance

0.035

Titanium Artificial Divergent Even1 Even2 Even3 Uneven1 Uneven2 Uneven3

0.03 0.025 0.02 0.015 0.01 0.005

a b

0 Artificial

Divergent

Even1

Even2

Even3

Titanium

Uneven1

Uneven2

Uneven3

Figure 3: Genetic distance between representative sequence and other sequences in partial clusters. The whiskers represent ±2.7σ (99.3% coverage). cluster (Fig. 2D). The representative sequence of a partial cluster containing the largest number of member sequences is selected as a base cluster, and then the other representative sequences that have distances below a certain threshold are consecutively merged into a new cluster. Suppose that P1 , P2 , . . . , PN are partial clusters that are ordered by decreasing number of the included members. First, the kmer distances (1) between representative sequences of P1 and P2 , . . . , PN are computed, and then the representative sequences that have k-mer distances below θk are aligned to the representative sequence of P1 . The members of each partial clusters are merged with P1 into cluster C1 if their genetic distances are below θg . Next, the same procedure is repeated with one of the remaining partial clusters that has the largest number of members, until all partial clusters are merged. For the genetic distance between two aligned sequences, we use a widely-used definition based on the match-tomismatch ratio

Dg (si , sj ) = 1 −

M M +N

,

(2)

where M is the number of correctly matching nucleotides and N is the number of mismatching nucleotides [31]. The worst-case time complexity of the hc-OTU algorithm is O(n2 ), due to the pre-clustering step for all pairs of sequences. However, since the members grouped into a partial cluster are excluded from the objects to search, we expect the average-case time complexity to be better than this; we provide experimental evidence in Section 5.5.

4 4.1

E XPERIMENTAL S ETUP Datasets and preprocessing

Nine datasets published by Quince et al. [32] were used for our analysis. The details of these datasets are described in Table 1. The “titanium” dataset, generated by using the Roche GS FLX Titanium system, contains about 60,000 reads with an average read length of about 400bp. The remaining 8 datasets were generated using the Roche GS FLX system. The number of reads and average read length of each dataset are 40,000 to 60,000 and 250bp, respectively.

#Seq. (Be)a

#Seq. (Af)b

Mean Length

#Ref.

True OTUs

62,873 46,341 57,902 63,780 53,763 67,182 54,099 51,439 60,976

19,791 31,479 33,667 45,598 38,340 46,329 40,189 40,003 46,131

406 269 268 250 248 245 247 254 243

89 90 23 87 87 87 87 87 87

74 30 23 59 59 59 59 58 59

Be: Before preprocessing Af: After preprocessing

The mean error rate of the 454 Life Sciences GS FLX system is about 1% [33]. Although the error rate is low, this would lead to many erroneous bases, considering that the number of reads are from tens of thousands to hundreds of thousands. Since the sequence similarity of each read is sensitive to the sequencing error, it is essential to remove the reads containing these errors. The whole datasets were pre-processed as follows. (1) Reads containing more than 1 erroneous nucleotide in its barcode and primer region were removed; (2) Reads containing the nucleotide ‘N ’, which is an ambiguous nucleotide, were removed; (3) Reads with length < μ − σ are removed, where μ and σ are the mean and the standard deviation of read length; (4) Chimeric sequence reads were removed by using UCHIME [34]. The variation in the number of sequence reads during the pre-processing procedure is presented in Table 1. In many cases, there is no golden reference related to the microbiome dataset, as it is difficult to know the distributions of species in the sample a priori. Fortunately, the Quince datasets have golden reference sequences. However, the assignment of each read to the reference sequence is not available. In many studies, sequence reads are mapped to reference sequences by using an alignment tool such as BLAST [20], and then these labels are used for validation. There are many redundant sequences in Quince’s reference. Therefore, we grouped reference sequences using hierarchical agglomerative clustering (HAC) with average linkage and used these labels as golden references [35]. The number of golden reference in each dataset is referred to as “true OTUs” in Table 1. 4.2

Parameters and computer settings

All of the OTU clustering tools were set to 97% identity threshold and all the parameters were set to default values. The tools that have a multi-threading option, such as hcOTU, CD-HIT, and CLUSTOM, used 8 threads. MOTHUR supports the multi-threading option only for computing pairwise sequence distance, which is very computationally intensive and is relatively slow; thus, we used 32 threads. hc-OTU was compiled using Intel icc 14.0.0, and the benchmark was performed on a Linux server consisting of an Intel Xeon E5-2650 and 256GB RAM. 4.3

Random sampling

It is generally necessary to calculate all pairwise alignments of sequence reads for F-score measurement; however, this is

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

0.984 0.947

0.983

0.979

0.982

0.957 0.931

0.95

0.9

0.85

8

5

M

7+ 0 2

ST

(6

35

C

,7

LU

í7

5

O

((

IT PR ES

& ' í+ ,7

U C LU ST

TB

C

0.8

Comparing the number of OTUs found

Adjusted Rand Index 1

0.975

0.967

0.974

0.978

0.974

0.927 0.866

0.95 0.882 0.9

0.85

0 2 7+ 8 5

M ST

(6

35

C

,7

LU

í7

5

O

((

IT PR ES

& ' í+ ,7

U C LU ST

0.8 C

The number of OTUs examined by eight OTU clustering tools is shown in Table 2. CLUSTOM has the most comparable number of OTUs, followed by hc-OTU, UCLUST, CDHIT; MOTHUR has much larger numbers of OTUs than the above two. Most OTU clustering methods employ a postprocessing procedure that removes reads with singletons or doubletons. However, all input reads should retain their class label in order to evaluate performance metrics such as NMI, ARI, and MoC. Thus, post-processing was not applied, and the resulting number of OTUs is larger than the true OTUs in most cases. In the datasets, the number of OTUs in “artificial” is underestimated by most methods; conversely, “divergent” and “titanium” OTUs numbers were comparable to the number of true ones. “even”s and “uneven”s are overestimated by 2 to 5 fold. Most overestimated OTUs were composed of singletons. It can be assumed that these singleton reads are chimeric or contaminant reads that were not removed during pre-processing.

0.983

TB

5.1

R ESULTS AND D ISCUSSION

1

(p hc ro -O po TU se d)

5

Normalized Mutual Information

(p hc ro -O po TU se d)

not feasible for large datasets. We randomly sampled 1000 sequences from each dataset to address this problem. The sampling rate r is set to 2 and the number of samples K follows the rule K = (M/N ) ∗ r, where M is the number of total sequences and N is the size of the subsample. For example, the “artificial” dataset has 31,479 sequences; therefore, the sample size is (31479/1000) ∗ 2 = 64.

7

Measure of Concordance

1 0.9

0.678

0.652 0.8 0.566 0.520

0.7

0.530 0.474

0.493

0.485

0.6 0.5 0.4

5.3

0 2 7+ 8 5

M ST LU

,7 35 (6

C

í7

5

O

((

IT PR ES

& ' í+ ,7

C

U C LU ST

Fig. 4 shows the accuracy of the eight clustering methods. NMI, ARI, and MoC were computed for nine datasets, and then the average values were acquired. Higher NMI values were obtained by ESPRIT (0.984), MOTHUR (0.979), ESPRIT-TREE (0.983), CLUSTOM (0.982), and hcOTU (0.983). UCLUST (0.931) had the lowest NMI values. The highest ARI values were obtained by ESPRIT (0.978), hc-OTU (0.975), ESPRIT-TREE (0.974) and MOTHUR (0.974), followed by CLUSTOM (0.967). TBC (0.882) and UCLUST (0.866) had relatively low ARI values. By comparison with NMI and ARI, MoC values are distributed over a relatively narrow range. MoC is more sensitive to differences in the number of clusters than other evaluation metrics; in our study, most datasets have greater numbers of predicted OTUs than true OTUs. The highest MoC value was retrieved by CLUSTOM (0.678), followed by hc-OTU (0.652) and ESPRIT (0.566). TBC (0.520), ESPRIT-TREE (0.530), UCLUST (0.474), and CD-HIT (0.485) had relatively low MoC values. hc-OTU achieves overall accuracy comparable to some of the best clustering algorithms such as ESPRIT, ESPRITTREE, MOTHUR, and CLUSTOM.

0.3 TB

Accuracy comparison: NMI, ARI, and MoC

(p hc ro -O po TU se d)

5.2

Figure 4: Normalized mutual information (NMI), Adjusted rand index (ARI), and Measure of concordance (MoC) of eight different clustering methods. Average scores of the nine datasets are reported together with the error bars. Precision and Recall 1

0.988 0.984

0.994 0.949

0.997

0.996

0.995 0.963

0.995 0.964

Precision

0.977 0.986

Recall

0.997 0.954

0.909 0.893

0.95

0.9

0.85

0.8

hc-OTU (proposed)

TBC

UCLUST

&'í+,7

(635,7

(635,7í75(( CLUSTOM

027+85

Figure 6: Recall (Sensitivity) and Precision (Positive Predictive Value). Average scores of the randomly sampled datasets are reported together with the error bars.

Accuracy comparison: Fβ -score

The mean Fβ -scores of eight clustering tools are shown in Fig. 5. hc-OTU showed the best performance among all of the clustering tools tested in this study. F-scores for hcOTU were 0.984 (β =1), 0.985 (β =2), and 0.988 (β =0.5). hcOTU had consistent F-scores compared to the other tools whose F2 -scores and F0.5 -scores were considerably different.

This discrepancy is due to the definition of Fβ -scores: as the harmonic mean of precision and recall, in general, the Fscore is higher when the difference between precision and recall is lower. Fig. 6 shows that the recall and precision values for hc-OTU are 0.984 and 0.988, respectively, and the difference between two values is relatively small as

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

8

Table 2: Number of predicted OTUs Dataset

True OTUs

hc-OTU*

TBC

UCLUST

CD-HIT

ESPRIT

ESPRIT-TREE

CLUSTOM

MOTHUR

30 23 59 59 59 74 59 58 59

37 26 285 240 246 82 165 114 120

52 55 413 360 386 96 250 184 221

59 59 477 421 493 154 298 205 268

59 71 464 435 449 109 263 201 258

45 38 360 323 344 92 215 157 189

46 50 378 344 359 116 235 164 208

34 24 239 229 209 72 135 91 95

58 49 457 421 444 102 273 410 235

Artificial Divergent Even1 Even2 Even3 Titanium Uneven1 Uneven2 Uneven3 *

Proposed method

F ȕíVFRUH 1

0.984

0.985 0.988

0.970 0.958 0.984

0.978 0.969 0.988

0.977

0.979

0.970 0.988

0.981 0.984

0.973

0.949 0.925

0.979

0.974 0.962 0.987

0.940 0.911

0.95

0.9

hc-OTU (proposed)

TBC

&'í+,7

UCLUST

F1íVFRUH

(635,7 F2íVFRUH

(635,7í75((

027+85

CLUSTOM

F0.5íVFRUH

Figure 5: Fβ -score of eight different clustering methods. Average scores of the randomly sampled datasets are reported together with the error bars.

1

2h 33m

4

10

15m 32s

3

$FFXUDF\ 7LPH

0.99

DYHUDJH10,

9h 51m (Titanium dataset)

60

'HIDXOWșh 

17h 49m

50

0.98

40

0.97

30

0.96

20

0.95

10

Time (sec)

10

4m 40s

2

0.94 0.02

54.639s

10

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

DYHUDJHHODSVHGWLPH VHF

elapsed time

5

10

0 0.2

NíPHUWKUHVKROG șh

Figure 8: The average NMI and the elapsed time with a varying threshold θh . 0.06 is chosen for the default parameter.

8.776s

1

10

1.505s 0.781s

5

Figure 7: Average elapsed times over the nine datasets; hc-OTU (8 cores), CLUSTOM (8 cores), CD-HIT (8 cores), MOTHUR (32 cores). The execution time of ESPRIT for the “titanium” dataset is marked as the plus sign.

compared to the other tools.

Optimal parameter for k-mer distance threshold

8 7+ 0 2

35 (6

O ST LU C

,7

M

C TB

(p hc ro -O po TU se d) ,7 í7 5 (( (6 35

U C LU ST

5.4 & ' í+ ,7

0

10

The parameter θh in the pre-clustering step is the threshold of k-mer distance between two sequences. If θh is very low, the number of partial clusters would be large; hence, it would take a longer computation time. On the other hand, higher clustering accuracy would be expected since the distance between each member of a partial cluster is small. If θh is very high, the opposite would happen. Fig. 8 presents the average NMI and average elapsed time with varying θh . Considering both NMI and time, hc-OTU has the best performance from the range of 0.06 to 0.12. We

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

the average-case time complexity could be a more accurate measure of performance. To see this, we used large datasets for the scalability test. Fig. 9 shows the execution time of UCLUST, CD-HIT, ESPRIT-TREE, and hc-OTU on the Quince dataset and the Human Body dataset [36]. Sequences were randomly chosen from each dataset. The number of sequences varied from 1, 000 to 300, 000 for the Quince dataset, and 1, 000 to 1, 000, 000 for the Human Body dataset. The approximate number of OTUs of Quince and the Human Body dataset is about 500 and 10, 000, respectively. In many cases of clustering algorithms, the complexity is sensitive to the number of clusters. Therefore, the Quince dataset can be regarded as a clean dataset and Human Body dataset as a noisy one. The results show that hc-OTU has better performance than ESPRIT-TREE for both of the clean and the noisy datasets. The number of reads from most of amplicon datasets is within the range of 106 [37]; therefore, hc-OTU tends to be faster than ESPRIT-TREE in the practically large cases.

Quince dataset (~340,000 reads, ~500 OTUs) 3

average elapsed time (sec)

10



10

1

10

0

10

í

10

í

10

3

4

10

5

10

10

# of the sequence reads

Human body dataset (~1,070,000 reads, ~10,000 OTUs) 5

10

average elapsed time (sec)

9

4

10

3

10



10

1

10

0

10

í

10

3

4

10

5

10

6

10

10

# of the sequence reads hc-OTU

(635,7í75((

UCLUST

CDHIT

Figure 9: Execution time of the UCLUST, CD-HIT, ESPRITTREE, and hc-OTU. The number of sequences varies from 1, 000 to 300, 000 (Quince dataset) and 1, 000 to 1, 000, 000 (Human Body dataset). chose 0.06 as the default value, taking into account the accuracy. hc-OTU also supports a user specified parameter value. The parameter values k =5 in (1) and θk =0.3 were determined empirically in this study. hc-OTU also supports a user specified parameter value. 5.5

Time demand and complexity analysis

The running time of each clustering tool is described in Fig. 7. The y-axis logarithmically represents the average elapsed time of the nine datasets. In the case of MOTHUR, even though 32 cores were used for computing, all pairwise distances, the total analysis time, was more than 17 hours on average. ESPRIT employs a filtering step that uses the k-mer distance to increase the speed; however, it requires more than two hours. UCLUST (0.78s) and CD-HIT (1.51s) had the fastest running times, followed by hc-OTU (8.78s), ESPRIT-TREE (54.64s), TBC (4m 40s), and CLUSTOM (15m 32s). ESPRIT-TREE is also known as a fast algorithm, but it took more time than hc-OTU when processing hundreds of thousands of sequences.The execution times for the “titanium” dataset using ESPRIT-TREE and ESPRIT are 131.83s and 9h 51m, respectively. The read length of the “titanium” dataset is about 400bp and is longer than the others in the Quince dataset. It might be that ESPRIT and ESPRIT-TREE are sensitive to the length of the sequence read. This result suggests that hc-OTU may have a good average-case time complexity. Note that pre-clustering step (Section 3.2) has the worst-case time complexity of O(n2 ), so this becomes a bottleneck when dealing with a very large-scale dataset. However, the inputs that elicit this behavior may occur infrequently in practice; therefore,

5.6

Additional discussion

When comparing OTU clustering results, the simplest way is to compare the number of OTUs in each cluster. In this type of evaluation, all OTUs are given the same weight, and the result is very sensitive to singleton OTUs. We consider these types of methods to be inappropriate. A more precise way is to use F-scores [16], [38] determining whether the genetic distance of each OTUs meets the threshold θg , by exploiting all pairwise sequence alignments. Generally, when golden reference labels are unknown, F-score-based methods are appropriate. However, if the golden references are known, evaluation metrics for labeled datasets such as NMI, ARI, and MOC are more appropriate than F-scores, which require excessive computation for pairwise alignments. If reference sequences are given but read-wise labels are unknown, it is necessary to map each sequence read to the reference label. In the present study, BLAST was used to measure genetic distance for more accurate mapping. So-called ultrafast clustering algorithms, namely UCLUST and CD-HIT, are faster than hc-OTU. These two tools use greedy algorithms of complexity O(n1.2 ) [15]. On the other hand, hc-OTU includes a step for computing the all pairwise k-mer distances, and its worstcase computational complexity is O(n2 ). In general, due to the tradeoff between speed and accuracy, fast algorithms exhibit low accuracy. Since OTU clustering is a batch process and in general not performed repeatedly, tools such as hc-OTU or ESPRIT-TREE, which have a high level of accuracy and runtimes of about one minute, could be a reasonable alternative to UCLUST or CD-HIT, when higher accuracy is needed. Although the 454 pyrosequencing platform is still popular at the time of writing, it has been announced that this platform will be discontinued by mid-2016. The homopolymer compaction technique presented in this paper is effective not only for the data from the 454 platform but also for the data from other types of NGS platforms that produce homopolymer indel errors (such as Ion Torrent).

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

6

C ONCLUSION

We have presented hc-OTU, a novel OTU clustering method for 16s rRNA sequences, which is based on homopolymer compaction and k-mer profiling, and achieved outstanding performance in terms of accuracy and speed. A homopolymer compaction procedure is used to generate partial clusters, which denoises sequences and reduces computation time for pairwise alignments by using k-mer profiling. According to the evaluation metrics NMI, ARI, and MoC, hc-OTU achieves overall accuracy comparable to that of some of the best clustering algorithms, such as MOTHUR and ESPRIT-TREE, outperforming UCLUST and CD-HIT. It is faster than most of the clustering tools used in comparison. In particular, the average running time of hc-OTU is about nine seconds for tens of thousands sequences. This is approximately six times faster than ESPRIT-TREE, which has a similarly good accuracy performance.

ACKNOWLEDGMENTS This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT and Future Planning) [No. 2011-0009963, and No. 2014M3C9A3063541], in part by the Industrial Core Technology Development Program 10040176, “Development of Various Bioinformatics Software using Next Generation Bio-data” funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea), and in part by a grant from Samsung Electronics Co., Ltd. Joong-Ho Won’s research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2013R1A1A1057949 and No. 2014R1A4A1007895).

R EFERENCES [1] [2] [3]

[4]

[5] [6]

[7]

[8]

[9]

J. C. Clemente, L. K. Ursell, L. W. Parfrey, and R. Knight, “The impact of the gut microbiota on human health: an integrative view,” Cell, vol. 148, no. 6, pp. 1258–1270, 2012. M. L. Metzker, “Sequencing technologies–the next generation,” Nature reviews genetics, vol. 11, no. 1, pp. 31–46, 2010. E. Stackebrandt and B. Goebel, “Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology,” International Journal of Systematic Bacteriology, vol. 44, no. 4, pp. 846–849, 1994. J. Kim, S. Yu, B. Shim, H. Kim, H. Min, E.-Y. Chung, R. Das, and S. Yoon, “A robust peak detection method for RNA structure inference by high-throughput contact mapping,” Bioinformatics, vol. 25, no. 9, pp. 1137–1144, 2009. X. C. Morgan and C. Huttenhower, “Human microbiome analysis,” PLoS computational biology, vol. 8, no. 12, p. e1002808, 2012. B. Lee, J. Park, and S. Yoon, “Rapid and robust denoising of pyrosequenced amplicons for metagenomics,” in Data Mining (ICDM), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 954–959. S. Kwon, S. Park, B. Lee, and S. Yoon, “In-depth analysis of interrelation between quality scores and real errors in Illumina reads,” in Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013, pp. 635–638. C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, ¨ J. Peplies, and F. O. Glockner, “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools,” Nucleic acids research, p. gks1219, 2012. T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, “Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB,” Applied and environmental microbiology, vol. 72, no. 7, pp. 5069–5072, 2006.

10

[10] P. D. Schloss and J. Handelsman, “Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness,” Applied and environmental microbiology, vol. 71, no. 3, pp. 1501–1506, 2005. [11] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson et al., “Introducing mothur: open-source, platformindependent, community-supported software for describing and comparing microbial communities,” Applied and environmental microbiology, vol. 75, no. 23, pp. 7537–7541, 2009. [12] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010. [13] W. Li and A. Godzik, “CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006. [14] Y. Sun, Y. Cai, L. Liu, F. Yu, M. L. Farrell, W. McKendree, and W. Farmerie, “ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences,” Nucleic acids research, vol. 37, no. 10, pp. e76–e76, 2009. [15] Y. Cai and Y. Sun, “ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time,” Nucleic acids research, vol. 39, no. 14, pp. e95–e95, 2011. [16] K. Hwang, J. Oh, T.-K. Kim, B. K. Kim, D. S. Yu, B. K. Hou, G. Caetano-Anoll´es, S. G. Hong, and K. M. Kim, “CLUSTOM: a novel method for clustering 16S rRNA next generation sequences by overlap minimization,” PloS one, vol. 8, no. 5, p. e62623, 2013. [17] J.-H. Lee, H. Yi, Y.-S. Jeon, S. Won, and J. Chun, “TBC: a clustering algorithm based on prokaryotic taxonomy,” The Journal of Microbiology, vol. 50, no. 2, pp. 181–185, 2012. [18] P. D. Schloss and J. Handelsman, “Introducing SONS, a tool for operational taxonomic unit-based comparisons of microbial community memberships and structures,” Applied and environmental microbiology, vol. 72, no. 10, pp. 6773–6779, 2006. [19] J. D. Thompson, T. Gibson, D. G. Higgins et al., “Multiple sequence alignment using ClustalW and ClustalX,” Current protocols in bioinformatics, pp. 2–3, 2002. [20] S. F. Altschul, T. L. Madden, A. A. Sch¨affer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic acids research, vol. 25, no. 17, pp. 3389–3402, 1997. [21] W. H. Press, Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007. [22] L. Hubert and P. Arabie, “Comparing partitions,” Journal of classification, vol. 2, no. 1, pp. 193–218, 1985. [23] M. Meil˘a, “Comparing clusterings by the variation of information,” in Learning theory and kernel machines. Springer, 2003, pp. 173–187. [24] D. Pfitzner, R. Leibbrandt, and D. Powers, “Characterization and evaluation of similarity measures for pairs of clusterings,” Knowledge and Information Systems, vol. 19, no. 3, pp. 361–394, 2009. [25] A. K. Jain and R. C. Dubes, Algorithms for clustering data. PrenticeHall, Inc., 1988. [26] M. J. de Hoon, S. Imoto, J. Nolan, and S. Miyano, “Open source clustering software,” Bioinformatics, vol. 20, no. 9, pp. 1453–1454, 2004. [27] S. Yoon, L. Benini, and G. De Micheli, “Co-clustering: a versatile tool for data analysis in biomedical informatics,” IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 4, pp. 493–494, 2007. [28] S. Vinga and J. Almeida, “Alignment-free sequence comparison?a review,” Bioinformatics, vol. 19, no. 4, pp. 513–523, 2003. [29] O. Bonham-Carter, J. Steele, and D. Bastola, “Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis,” Briefings in bioinformatics, p. bbt052, 2013. [30] M. L. Sogin, H. G. Morrison, J. A. Huber, D. M. Welch, S. M. Huse, P. R. Neal, J. M. Arrieta, and G. J. Herndl, “Microbial diversity in the deep sea and the underexplored rare biosphere,” Proceedings of the National Academy of Sciences, vol. 103, no. 32, pp. 12 115–12 120, 2006. [31] P. D. Schloss, “The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16s rrna gene-based studies,” PLoS computational biology, vol. 6, no. 7, p. e1000844, 2010. [32] C. Quince, A. Lanzen, R. J. Davenport, and P. J. Turnbaugh, “Removing noise from pyrosequenced amplicons,” BMC bioinformatics, vol. 12, no. 1, p. 38, 2011.

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2535326, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

[33] A. Gilles, E. Megl´ecz, N. Pech, S. Ferreira, T. Malausa, and J.-F. Martin, “Accuracy and quality assessment of 454 GS-FLX titanium pyrosequencing,” BMC genomics, vol. 12, no. 1, p. 245, 2011. [34] R. C. Edgar, B. J. Haas, J. C. Clemente, C. Quince, and R. Knight, “UCHIME improves sensitivity and speed of chimera detection,” Bioinformatics, vol. 27, no. 16, pp. 2194–2200, 2011. [35] W. Li, L. Fu, B. Niu, S. Wu, and J. Wooley, “Ultrafast clustering algorithms for metagenomic sequence analysis,” Briefings in bioinformatics, p. bbs035, 2012. [36] E. K. Costello, C. L. Lauber, M. Hamady, N. Fierer, J. I. Gordon, and R. Knight, “Bacterial community variation in human body habitats across space and time,” Science, vol. 326, no. 5960, pp. 1694–1697, 2009. [37] S. Kwon, B. Lee, and S. Yoon, “CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing,” BMC bioinformatics, vol. 15, no. Suppl 9, p. S10, 2014. [38] Y. Sun, Y. Cai, S. M. Huse, R. Knight, W. G. Farmerie, X. Wang, and V. Mai, “A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis,” Briefings in bioinformatics, p. bbr009, 2011.

Seunghyun Park is a researcher at the Department of Electrical and Computer Engineering at Seoul National University. He received the B.S. degree in electrical engineering from Korea University, Seoul, Korea in 2009 and is pursuing the integrated M.S./Ph.D. degree in electrical engineering. His current research interests include high-performance bioinformatics and GWAS analysis.

Hyun-soo Choi received his B.S. degrees in computer and communication engineering as first major and brain and cognitive sciences as second major from Korea University, Korea, in 2013. Presently, he is pursuing an integrated M.S./Ph.D. degree in electrical and computer engineering at Seoul National University, Korea. His research interests include biomedical signal processing and machine learning.

Byunghan Lee is a Ph.D. student at the Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea. His research interests include high-performance bioinformatics, machine learning for biomedical big data, and data mining.

11

Jongsik Chun received his B.S. degree from Seoul National University and completed his Ph.D. at Newcastle University School of Medicine and completed his postdoctoral work at the Research Center for Molecular Microbiology at Seoul National University. He is the Associate Editor of the International Journal of Systematic and Evolutionary Microbiology. He serves as an editorial board member for Antonie van Leeuwenhoek (Dutch Kluwer four issues) and Microbes and Environments. He is associate professor of biology, Seoul National University. He serves on Seoul National University’s Interdisciplinary Program in Bioinformatics Steering Committee. He is associated with a number of institutes at Seoul National University, including the Institute of Microbiology Research, the Genetic Engineering Combustion Institute, and the International Vaccine Institute. He also assists the Rural Development Administration and the Institute of Agriculture and Life Sciences. He has previously served as a research associate at the Center of Marine Biotechnology at the University of Maryland and as senior researcher at the Korea Research Institute of Bioscience and Biotechnology.

Joong-Ho Won Joong-Ho Won (S02-M09) received the B.S. degree in electrical engineering from Seoul National University, Korea, in 1998 and the MS and the Ph.D. degrees in electrical engineering from Stanford University, CA, in 2003 and 2009, respectively. He also received the M.S. degree in statistics from the same institution. Dr. Won was a research scientist at the VA Cooperative Studies Program, Mountain View, CA from 2011 to 2012, and taught at the School of Industrial Management Engineering, Korea University, Seoul, Korea from 2012 until 2014. Currently he is an assistant professor at the Department of Statistics, Seoul National University. His research interests are largescale data analysis and visualization, and computational methods in statistics and medical informatics. He is a member of the IEEE.

Sungroh Yoon received his B.S. degree in electrical engineering from Seoul National University, Korea, in 1996, and his M.S. and Ph.D. degrees in electrical engineering from Stanford University, CA, in 2002 and 2006, respectively. From 2006 to 2007, he was with Intel Corporation, Santa Clara, CA. Previously, he held research positions with Stanford University, CA, and Synopsys, Inc., Mountain View, CA. He was an assistant professor with the School of Electrical Engineering, Korea University from 2007 to 2012. Currently, he is an associate professor with the Department of Electrical and Computer Engineering, Seoul National University, Korea. He is the recipient of the 2013 IEEE/IEEK Joint Award for Young IT Engineers. His research interests include highperformance bioinformatics, machine learning, big-data analytics, and parallel processing. He is a senior member of the IEEE.

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

hc-OTU: A Fast and Accurate Method for Clustering Operational Taxonomic Unit based on Homopolymer Compaction.

To assess the genetic diversity of an environmental sample in metagenomics studies, the amplicon sequences of 16s rRNA genes need to be clustered into...
563B Sizes 0 Downloads 9 Views