Subgrouping Automata: automatic sequence subgrouping using phylogenetic tree-based optimum subgrouping algorithm.

Computational Biology and Chemistry 48 (2014) 64–70

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Research article

Subgrouping Automata: Automatic sequence subgrouping using phylogenetic tree-based optimum subgrouping algorithm Joo-Hyun Seo a,c,1 , Jihyang Park a,2 , Eun-Mi Kim a,3 , Juhan Kim a,4 , Keehyoung Joo b , Jooyoung Lee c , Byung-Gee Kim a,∗ a

School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 130-722, Republic of Korea c School of Computational Sciences, Korea Institute of Advanced Study, Seoul 130-722, Republic of Korea b

a r t i c l e

i n f o

Article history: Received 10 June 2013 Received in revised form 12 October 2013 Accepted 23 November 2013 Keywords: Subgrouping Protein family discrimination Optimum subgrouping node Phylogenetic tree Statistical analysis

a b s t r a c t Sequence subgrouping for a given sequence set can enable various informative tasks such as the functional discrimination of sequence subsets and the functional inference of unknown sequences. Because an identity threshold for sequence subgrouping may vary according to the given sequence set, it is highly desirable to construct a robust subgrouping algorithm which automatically identifies an optimal identity threshold and generates subgroups for a given sequence set. To meet this end, an automatic sequence subgrouping method, named ‘Subgrouping Automata’ was constructed. Firstly, tree analysis module analyzes the structure of tree and calculates the all possible subgroups in each node. Sequence similarity analysis module calculates average sequence similarity for all subgroups in each node. Representative sequence generation module finds a representative sequence using profile analysis and self-scoring for each subgroup. For all nodes, average sequence similarities are calculated and ‘Subgrouping Automata’ searches a node showing statistically maximum sequence similarity increase using Student’s t-value. A node showing the maximum t-value, which gives the most significant differences in average sequence similarity between two adjacent nodes, is determined as an optimum subgrouping node in the phylogenetic tree. Further analysis showed that the optimum subgrouping node from SA prevents under-subgrouping and over-subgrouping. © 2013 Published by Elsevier Ltd.

1. Introduction The generation of subgroups containing functionally relevant sequences based on sequence similarity or identity can give good information for functional discrimination of sequences in a given set of sequences (Heger and Holm, 2000). Through clustering or subgrouping, functions of anonymous protein sequences can be easily inferred from the functions of other sequences in the same cluster (or subgroup) or the sequences in neighbor subgroups. A homologous sequence set for clustering can be prepared by text mining or several search methods such as BLAST (Altschul et al., 1990) or profile analysis (Altschul et al., 1997; Eddy, 1998). When we assume that the functions of homologous proteins are

∗ Corresponding author. Tel.: +82 2 880 6774; fax: +82 2 874 1206. E-mail address: [email protected] (B.-G. Kim). 1 Current address: Samsung Advanced Institute of Technology, Nongseo-dong, Giheung-gu, Yongin-si, Gyoenggi-do 446-712, Republic of Korea. 2 Current address: Samsung Corning Precision Materials, Yongdu-ri, Tangjeongmyeon, Asan-si, Chungcheongnam-do 336-841, Republic of Korea. 3 Current address: Joint BioEnergy Institute, Emeryville, CA 94608, USA. 4 Current address: CIRES, University of Colorado, Boulder, CO 80309, USA. 1476-9271/$ – see front matter © 2013 Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.compbiolchem.2013.11.004

diversified along time owing to point mutation, deletion, insertion, multimerization and duplication, the function of the collected anonymous sequences in a given sequence set can be predicted more in detail by the functional inference based on phylogenomic analysis method (Eisen, 1998). When the functions of the several members in all subgroups in the phylogenetic tree are revealed by experiments, we can roughly identify the function of anonymous sequences in a given sequence set. Then the next remaining question is how we can make functionally meaningful subgroups from large set of sequences (Eisen, 1998; Eisen and Fraser, 2003; Krause et al., 2002). As summarized by Lee et al. (2010), developed algorithms can be categorized into three main types, i.e. phylogenomics, pattern recognition and clustering. According to Lee et al., SCI-PHY, which utilize pattern recognition and clustering, gives better results in protein function prediction than sequence-only methods such as Secator (Wicker et al., 2001), Ncut (Abascal and Valencia, 2002) and CD-HIT (Li and Godzik, 2006). However, these methods need conserved domains (for pattern recognition) or pre-determined sequence identity cut-off (for clustering). Some methods use mathematical models. Brown introduced model-based sequence clustering method utilizing Dirichlet process (Brown, 2008). This method is reported to show

J.-H. Seo et al. / Computational Biology and Chemistry 48 (2014) 64–70

good group purity and VI score, which divides the given sequence set into functionally different subgroups. However, according to Andreopoulos et al. (2009), most models used in model-based clustering methods are often oversimplified, so that leading to inaccurate result. In addition, another disadvantage of model-based method is slow processing time for large sequence sets. From the standpoint of enzymologists, bioinformatical methods should be feasible regardless of the level of input sequence set. Input sequence set could consist of sequences in superfamily, family or subfamily level. If the sequence set consists of superfamily level, sequences may be very diverse to draw conserved domains. In addition, sequence identity cut-off for sequence subgrouping could be never known for sequence set collected from database by utilizing any homology search method. Therefore, sequence clustering algorithm should produce family or subfamily sequence sets from superfamily or family-level sequence set without prior knowledge of sequence identity cutoff. In this work, we designed a versatile algorithm for subclassification of input sequence set, which uses sequence comparison and clustering method. Because there are many outperforming sequence alignment methods, we focused on the development of algorithm to generate subgroups utilizing the sequence alignment result produced from other multiple sequence alignment methods.

65

were retrieved. Redundant sequences were removed by sequence clustering with 100% threshold using CD-HIT. Final 99 sequences were remained and used for subgrouping. The sequences for the subgrouping analysis of aminotransferase group I, II and all the other aminotransferase groups were manually collected from the NR database. Putative sequences and fragment sequences were manually removed. BcAT sequences were re-collected manually with the same method for sequences of other aminotransferase subgroups. Sequence set of all aminotransferases consists of 11 sequences of alanine aminotransferase, 152 sequences of aspartate aminotransferase, 39 sequences of aromatic aminotransferase, 22 sequences of histidinol-phosphate aminotransferase, 16 sequences of ␻-aminotransferase, 10 sequences of l-ornithine aminotransferase, 55 sequences of N-acetyl-l-ornithine aminotransferase, 129 sequences of 7,8-diaminopelargonate aminotransferase, 48 sequences of ␥-aminobutyrate aminotransferase, 10 sequences of d-alanine aminotransferase, 53 sequences of branched-chain aminotransferase, 22 sequences of phosphoserine aminotransferase and 50 sequences of serine aminotransferase. Aminotransferase group I consists of alanine aminotransferase, aspartate aminotransferase, aromatic aminotransferase and histidinol-phosphate aminotransferase. Aminotransferase group II consists of ␻-aminotransferase, l-ornithine aminotransferase, N-acetyl-l-ornithine aminotransferase, 7,8-diaminopelargonate aminotransferase and ␥-aminobutyrate aminotransferase. Classification of aminotransferase in the work of Mehta et al. (Mehta et al., 1993) was adopted.

2. Materials and methods 2.2. Algorithm and used programs 2.1. Data set Sequences of ␤-alanine:pyruvate aminotransferase, ␥aminobutyrate aminotransferase, l-ornithine aminotransferase and lysine decarboxylase were searched with EC number and retrieved from BRENDA. Redundant sequences were removed by sequence clustering with 100% threshold using CD-HIT. 97 sequences of ␤-alanine:pyruvate aminotransferase, 272 sequences of ␥-aminobutyrate aminotransferase, 72 sequences of l-ornithine aminotransferase and 111 sequences of lysine decarboxylase were used for subgrouping. In the case of branched-chain aminotransferase (bcAT), sequences of bcATs were searched using the EC number of bcAT (2.6.1.42) at RefSeq database in NCBI. Putative sequences and fragment sequences were removed from the search result. To remove the redundant sequences, sequence clustering with the sequence identity threshold of 100% was performed using CDHIT. A total of 691 sequences were gathered. The sequence subgrouping using the developed subgrouping algorithm was performed for the final 691 sequences. For aspartate ammonia-lyase, sequence sets were searched with EC number, and retrieved from BRENDA. Redundant sequences were removed by sequence clustering with 100% threshold using CD-HIT. HMM profile was built using ‘hmmbuild’ and ‘hmmcalibrate’ program in HMMER package. 843 microbial genomes (as of March 2009) were searched using the generated profile. From the result of the search for each microbial genome, hit sequences scoring over default E-value of 10 were collected. Sequence clustering was performed to remove redundant sequences using CD-HIT with 100% threshold. Final 379 sequences were used for subgrouping. In the case of (S)-2-aminoadipate semialdehyde dehydrogenase (aasDH) sequence sets, there are two kinds of sequences in BRENDA database. The sequences of approximately 400 amino acids were chosen, but the number of the sequences was too small. Therefore, simple BLASTP search was performed with aasDH from Stenotrophomonas sp. SKA14 in NCBI and top 100 sequences

ClustalW 1.83 (Thompson et al., 1994) was used for the multiple alignment of input sequence sets. In this work, we used ClustalW because some input sequence sets have large number of sequences. Although we used ClustalW for multiple sequence alignment, any alignment programs can be used to generate multiple sequence alignment. “hmmbuild”, “hmmcalibrate” and “hmmsearch” in HMMER package (Eddy, 1998) were used for profile building and profile search. Subgrouping algorithm and parsing algorithm for the result of ClustalW and programs in HMMER package were coded using Python programming language. Two assumptions were made to construct the algorithm. (1) If sequence sets consist of a certain level (e.g. family level), subgrouping should be performed to discriminate sub-level (e.g. subfamily level); (2) at the optimum subgrouping node in phylogenetic tree, an average sequence similarity shows a maximum increase. Program starts with the ‘PHYLIP’ format of the tree and ‘clustal’ format of the alignment result. Therefore, if a user can make these two files, any alignment program can be used. In phylogenetic tree recognition and analysis, node number is calculated from ‘PHYLIP’ format of the phylogenetic tree. Starting from node index of zero, +1 is added when an opening parenthesis appears and −1 is added when a closing parenthesis appears. Therefore, the origin of the rooted tree is designated as node 1. After the node number assignment, the subgroup at node i is picked by selecting sequences between an opening parenthesis with node number (i + 1) and a closing parenthesis with node number i. The node number and the subgroup tag for each sequence can be assigned as in Fig. 1(a). For example, because node number of tree origin is 1, subgrouping at node 1 results in one subgroup and it contains all the sequences. Next, subgroups at each node were identified. For instance, three subgroups can be identified at node 2 and four subgroups at node 3 (Fig. 1(a)). Therefore, sequence a in Fig. 1(a) can have the multiple subgroup tags of 3-1, 4-2 and so on. For sequences in each subgroup, to calculate the average sequence similarity, pairwise sequence similarities were calculated for all the sequence

66


Fig. 1. (a) A tree example for the explanation of node and subgroup tag assignment for each sequence. Tree is appeared as cladogram. (b) Schematic for self-scoring method to select representative sequence of subgroup.

pairs using the multiple alignment result. Sequence similarity was calculated on the assumption that the aligning of similar amino acids is matched. Groups of similar amino acid are as follows: ‘aa1’:(G, A, V, L, I), ‘aa2’:(F, Y, W), ‘aa3’:(C, M), ‘aa4’:(S, T), ‘aa5’:(K, R, H), ‘aa6’:(D, E, N, Q), ‘aa7’:(P). Average sequence similarity for a subgroup, which is a total sequence similarity of a subgroup, can be calculated by averaging all the pairwise sequence similarities from all the sequence pairs in the subgroup. The average sequence similarity (ASS) for all of the possible subgroups in a certain node was calculated for all the nodes. Let the sequence similarity be SS, and the average sequence similarity of subgroup k at node i be ASS(i)k . Then, ASS(i)k =

SS for all sequence pairs inkth subgroup at nodei

number of all sequence pairs inkth subgroup at nodei

Let the naverage of all the ASS(i)k at node i be ASS(i), then ASS(i) = ASS(i)k /n, where n is the number of subgroups at k=1 node i. For example, in Fig. 1(a), we can find three subgroups at node 2 and ASS(2) is (ASS(2)1 + ASS(2)2 + ASS(2)3 )/3. Our program calculates all the ASS(i) (i = 1, 2, . . ., n) for all the nodes. Accounting the assumption 2, student’s t-value indicating normalized average sequence similarity differences between ASS(i + 1) and ASS(i) at node i and i + 1 can be adopted and calculated. The t-value is defined as follows: t(i + 1, i) =

ASS(i + 1) − ASS(i)

(var(i + 1)/n(i + 1)) + (var(i)/n(i))

,

where var(i) is the variance of SSs of all sequence pairs in all subgroups at node i and n is the number of all the sequence pairs in all subgroups at node i. A simple ASS difference can give information for ASS change. However, t-value was adopted to normalize the ASS difference and calculate statistically most meaningful ASS increase. Sequence similarity values generally follow extreme value distribution and t-test is used on the sample set following the normal distribution. However, when the sample size is sufficiently large, we are able to assume that sequence similarity values approximately follow the normal distribution. With this assumption, the t-test is adopted to normalize ASS difference. When t(i + 1, i) shows a maximum value at a certain node, program performs subgrouping at the node, which is very likely to be the optimum node. For example, if t(3,2)

shows a maximum value among the t values, the program performs subgrouping at node 3. Since some sequences branched out before the optimum node, there can be sequences that are not included in the generated subgroups (i.e. dropouts). These dropouts were also gathered and stored for the selection of the representative sequences. Representative sequences were chosen by a self-scoring method to represent the average sequence characteristics of subgroups. The self-scoring method is proceeded as following (Fig. 1(b)): (1) multiple alignment for the sequences in subgroup is performed, (2) HMM profile is generated from the multiple alignment, (3) this HMM profile is used for scoring the member sequences in that subgroup, (4) the top scoring sequence is chosen as the representative sequence. Representative sequences are selected from the dropout sequences generated before the optimum node as well as from the subgroups at optimum node. When representative sequences were determined from the dropout sequences, evolutionary history was considered using the following procedure. After the collection of dropout sequences, subgroups in dropout sequences were determined according to branch points and nodes. Using this algorithm, the program selects additional representative sequences from the dropout sequences. For example, if ‘Subgrouping Automata’ (SA) performs subgrouping at node 4 in Fig. 1(a), there are 6 dropout sequences – i.e. sequences b, c, d, e, f and g. Because the sequences b, c, d, e, f and g have different evolutionary history, four different cases should be considered. Sequences e and f are branched out at node 2. Therefore, one representative sequence should be selected from the sequence set of e and f. Next, sequences b, c, d and g, which are dropped out at node 3, have three different evolutionary histories. Therefore, selection should be performed to pick one from sequence b itself, one from g itself and one from the sequence set of c and d. If the number of sequences of a given sequence set is larger than two (e.g. (c, d) set and (e, f) set), same self-scoring method was applied to select representative sequence. Using this algorithm, program selects additional representative sequences from dropout sequences. Therefore, in case of Fig. 1(a) with subgrouping at node 4, total nine representative sequences could be selected. Using these nine representative sequences, ASS(i)rs can be made by the calculation of average of 36 sequence similarities from 36 sequence pairs generated from nine representative sequences. ASS(i)rs calculates the average sequence similarity among the representative sequences at each node, and it is compared to ASS(i) to determine whether the subgrouping is either overdone (i.e. too


specific subgrouping), underdone (i.e. too broad subgrouping) or optimally done.

67

Table 1 Member specification in the subgrouping of aminotransferase group I and aminotransferase group II. Input sequence set

Subgroup number

3. Results and discussion Subgroup 1

3.1. Subgrouping for the sequence sets of the same EC number The sequence sets of various enzyme families were subjected to automatic subgrouping using SA. SA basically calculates the difference of the average sequence similarities between the subgroups at two adjacent nodes, i.e. t(i + 1, i), and determines the subgrouping node by searching the node showing maximum t-value. To prove a hypothesis such that optimum subgrouping roughly occurs at the node showing maximum t-value, we compared average sequence similarity at node i, i.e. (ASS(i)), with average sequence similarity among representative sequences (ASS(i)rs ) at node i. The two values have special characteristics. ASS(i) will not change greatly in early stages, but beyond certain node, it tends to increase sharply because subgroups tends to be comprised of highly homologous sequences more and more. In contrast, ASS(i)rs increases sharply in early stages, then stays similar, then decreases beyond certain node. In starting stages, the number of representative sequences is small. But, as subgrouping nodes increases, more similar sequences are selected from more subgroups. In other words, the number of subgroups increases as subgrouping nodes increases. However, beyond the certain node, as the number of subgroups increases, less similar sequences are collected as representative sequences. Therefore, ASS(i)rs decreases. Plot of these two values can suggest optimum subgrouping node where these two values are equal. Here, we can imagine two cases. If ASS(i)rs > ASS(i), all the sequences are still at under-subgrouping (i.e. too broad subgrouping) and subgroups at node i tends to contain diverse sequences which should not be included in the same subgroup. Not only because one representative sequence is selected from a subgroup containing many non-homologous sequences, but also because the effect of lowering ASS(i) by the presence of diverse sequences in the same subgroups is absent in the set of representative sequences, ASS(i)rs is higher than ASS(i) in the case of under-subgrouping. If ASS(i)rs < ASS(i), over-subgrouping (i.e. too specific subgrouping) occurs and the subgroups at node i are comprised of highly similar sequences to result in high ASS(i). In this case, since too many representative sequences are selected from many subgroups containing highly homologous sequences and the representative sequence set is diversified, ASS(i)rs becomes lower than ASS(i). Therefore, if the optimum subgrouping for the input sequence set is performed, ASS(i)rs would be the same or at least similar to ASS(i). In addition, when ASS(i) = ASS(i)rs , the sequence similarity value (SS(i)) would become a threshold value for the optimum subgrouping of a given sequence set. Sequence set of ␤-alanine:pyruvate aminotransferase, ␥aminobutyrate aminotransferase, l-ornithine aminotransferase, aspartate ammonia-lyase, (S)-2-aminoadipate semialdehyde dehydrogenase and lysine decarboxylase were investigated using SA. To investigate the value profile of ASS(i) or ASS(i)rs , subgrouping was performed for each node, and ASS(i) and ASS(i)rs are calculated at each node. The results are shown in Fig. 2. When SA achieved subgrouping at the node where t-value is the maximum, subgrouping node is the same or very close to the optimum node where ASS(i) and ASS(i)rs curves are intersecting. ASS(i) shows large increase at the node where t-value is maximum. Although there is little deviation between subgrouping node (where t-value is maximum) and optimum node (where ASS(i) = ASS(i)rs ), only 0.5–2 node difference was observed. These results suggest that assumption 2 (an average sequence similarity shows a maximum increase at the optimum subgrouping node in phylogenetic tree) is powerful to estimate

Aminotransferase group I

Subgroup 2

Aspartate aminotransferase 72 Aromatic aminotransferase 31 Histidinol-phosphate aminotransferase 22 Alanine aminotransferase 10

Subgroup 3

Aspartate aminotransferase 3

Subgroup 1

␥-Aminobutyrate aminotransferase 183

Subgroup 2

Aminotransferase group II

Members Aspartate aminotransferae 77 Aromatic aminotransferase 8 Alanine aminotransferase 1

Subgroup 3

Subgroup 4 Subgroup 5 Dropouts

␥-Aminobutyrate aminotransferase 7 L-Ornithine aminotransferase 52 ␥-Aminobutyrate aminotransferase 11 ␤-Alanine:pyruvate aminotransferase 73 ␥-Aminobutyrate aminotransferase 4 ␥-Aminobutyrate aminotransferase 10 ␥-Aminobutyrate aminotransferase 1

the sequence similarity threshold for the subgrouping of a given sequence set. In general, ASS(i) tends to increase while ASS(i)rs decreases, as the node number increases. If subgrouping is performed at the optimum node where ASS and ASSrs curves intersect, such subgrouping is likely to prevent the diversity and high similarity in each subgroup. Under this circumstance, representative sequences are able to fully represent their input sequence set. From these results, it could be concluded that subgrouping at the node showing the maximum t-value is able to result in optimum subgrouping, where ASS(i) is very close to ASS(i)rs . 3.2. Subgrouping for the sequence sets of multiple EC number To carry out subgrouping of the sequence sets having multiple enzyme families, SA was applied to aminotransferase group I enzymes and aminotransferase group II enzymes. Aminotransferase group I sequence set consists of 11 sequences of alanine aminotransferase (alaAT), 152 sequences of aspartate aminotransferase (aspAT), 39 sequences of aromatic aminotransfserase (aroAT) and 22 sequences of histidinol-phosphate aminotransferase (hpAT). Aminotransferase group II sequence set consists of 73 sequences of ␤-alanine:pyruvate aminotransferase (␤-alaAT), 216 sequences of ␥-aminobutyrate aminotransferase (gabaAT) and 52 sequences of l-ornithine aminotransferase (orAT). To calculate the value profiles of ASS(i) and ASS(i)rs , subgrouping was performed at each node, and ASS(i) and ASS(i)rs were calculated at each node. Results are shown in Fig. 3(a) and (b). In the result of the subgrouping for aminotransferase group I (Fig. 3(a)), subgrouping was achieved at the node 2 which is close to the point where ASS(i) and

68


(a)

(c)

100

100

8

80 2

70

t-value

4

0

15

80 10

70

t-value

6

90

Sequence similarity (%)


90

60 5 50

40 0

-2 30

60 5

10

20

0

25

5

Node number

ASS(i) t-value ASS(i)rs

(b)

15

10

100

30

(d)

15

20

25

Node number

ASS(i) t-value ASS(i)rs 100

16 14

10

80

70

0


12

20

90

t-value


95

90 10 85

8

80

6 4

75 2 70

60 10

20

40

-2 0

50

Node number


(e)

30

0

65

-10 0

2

4 ASS(i) t-value ASS(i)rs

100

100

t-value

0

(f)

6

8

10

12

14

16

18

Node number

100

8

90

6

80

4

70

2

60

0

60

40

50 20


60 70

t-value


80 80

t-value

90

40 0 30 0


10

15

20

25

Node number

50

-2 0


10

15

20

Node number

Fig. 2. Value profiles of ASS(i), t-value and ASS(i)rs : (a) ␤-alanine:pyruvate aminotransferase, (b) aspartate ammonia-lyase, (c) lysine decarboxylase, (d) (S)-2-aminoadipate semialdehyde dehydrogenase, (e) branched-chain amino acid aminotransferase, (f) l-ornithine aminotransferase.

ASS(i)rs intersect. In addition, ASS(i) (37%) and ASS(i)rs (43%) have relatively similar values compared to values at other nodes. Member specification generated by subgrouping at node 2 is appeared in Table 1. If the subgrouping is performed at node 3, five groups are generated with the more separated subgroups only for aspATs. At node 3, member specification is as following; group 1 (75 aspATs, 8 aroATs, 1 alaAT), group 2 (2 aspATs), group 3 (57 aspATs, 31 aroATs,

22 hpATs, 10 alaATs), group 4 (15 aspATs), group 5 (2 aspATs) and dropout sequence of one aspAT. This result suggests that subgrouping at node 2 is the best for the sequence sets of aminotransferase group I. However, although SA found the best subgrouping result, subgrouping was not performed according to their function. This result is caused from little correlation between sequence similarity and their functions in aminotransferase group I. For example,


(a)

100

20

80

15 60 10

40

t-value


25

5

0 20 0

5

10

(b)

15

20

Node number

ASS(i) t-value ASS(i)rs 100

80 40

t-value


60

60 20

40 0 0

5

10


15

20

25

Node number

(c) 100 80

60 60 40

t-value


80

40 20 20 0 0 0


20

30

Node number

Fig. 3. Value profiles of ASS(i), t-value and ASS(i)rs : (a) aminotransferase group I, (b) aminotransferase group II, (c) all aminotransferase groups.

according to the Oue et al. (1997), aroAT from Paracoccus denitrificans has 37.6% sequence homology compared to the aroAT from Escherichia coli whereas it shows 44.7% sequence homology to the aspAT from E. coli. In addition, almost all the residues in the active site of aspAT are conserved in primary structure of aroAT (Okamoto et al., 1998). Due to the little correlation between sequence similarity and function, SA could not discriminate the aminotransferase group I sequences and did not generate the subgroups according to their functions. SA could only and approximately differentiate hpAT

69

and alaAT from aspAT and aroAT. This seems to be the limitation of the sequence-based subgrouping method. In the case of aminotransferase group II, somewhat different ASS(i) and ASS(i)rs value profiles were resulted in the subgrouping (Fig. 3(b)). First, ASS(i) starts with the value higher than ASS(i)rs from the starting point. Secondly, ASS(i)rs does not increase as the node number increases. ASS(i)rs was remained at similar values until the final node. However, the second phenomenon was the same in the subgrouping of aminotransferase group I. This can be explained as follows. Representative sequences are selected from different enzyme families. So, the proportions of sequences from each family are maintained from node 2 to last node. This makes ASS(i)rs value similar throughout the subgrouping at every node. In the subgrouping of aminotransferase group II, SA successfully determined the optimum subgrouping (at node 3) to discriminate the functional difference. At node 2, three subgroups could be generated; (1) group 1: 190 gabaATs and 52 orATs, (2) group 2: 73 ␤-alaATs and 12 gabaATs, (3) group 3: 14 gabaATs. So subgroups of aminotransferase group II are not still discriminated at node 2. At node 4, seven subgroups could be generated; (1) group 1: 182 gabaATs, (2) group 2: 9 gabaATs, (3) group 3: 7 gabaATs, (4) group 4: 7 gabaATs, (5) group 5: 3 gabaATs, (6) group 6: 52 orATs, (7) group 7: 73 ␤-alaATs and 4 gabaATs. Too specific subgroups were generated for gabaAT at node 4. Therefore, the selection of node 3 as optimum subgrouping node was successful (Table 1). The main difference between subgrouping of the sequence sets of the same EC number and subgrouping of the sequence set of the multiple EC number is the position of the subgrouping node. In the case of the sequences of the multiple EC number, there is a tendency of subgrouping at the node near the tree root. Table 1 shows the subgrouping result of aminotransferase group I and aminotransferase group II. Although aminotransferase group I was not subgrouped according to their functions, the functional discrimination was possible in the subgrouping of aminotransferase group II. l-Ornithine aminotransferase and ␤-alanine:pyruvate aminotransferase were clearly discriminated. However, some ␥-aminobutyrate aminotransferase sequences were in the l-ornithine aminotransferase subgroup and ␤-alanine:pyruvate aminotransferase subgroup. This inaccuracy might be caused by the mis-annotation of the sequences or more close relationship of some ␥-aminobutyrate aminotransferase sequences with their neighbor subgroups. In addition, ␥-aminobutyrate aminotransferases split into several subgroups. This seems to be from the characteristics of phylogenetic tree of aminotransferase group II. Phylogenetic tree was predicted to have more duplication events in the ␥-aminobutyrate aminotransferase subgroup. Next, subgrouping was carried out for the sequences from all the groups of aminotransferase (617 sequences). It took 6.8 min with one Xeon® 2 GHz CPU except the time for multiple sequence alignment. Subgrouping was performed at the node 14 next to the node 13 where ASS(i) and ASS(i)rs intersect (Fig. 3(c)) At node 14, ASS(i) and ASS(i)rs were 40% and 36%. At node 13, ASS(i) and ASS(i)rs were 32% and 34%. The subgroups generated by SA could represent functionally different subgroups (Table 2). Although group I sequences, especially for aspATs and hpATs, split into too many specific subgroups, group II, III and IV sequences were clearly separated. In the phylogenetic tree of all the aminotransferases, the sequences in group I were predicted to have much more duplication events in the early stage of evolutionary history compared to the sequences in group II, III or IV. In the subgrouping of all aminotransferase sequences, subgrouping was performed at node 14. If the subgrouping is performed at node 13, sequences in subgroup 15 and 16 (Table 2), which represent aminotransferase group II and III, respectively, is still in one subgroup. If the subgrouping is performed at node 15, sequences in aminotransferase group II, III and IV (Table 2)

70


Table 2 Member specification in the subgrouping of all aminotransferase groups. Subgroup number

Members

Aminotransferase group

Subgroup 1 Subgroup 2 Subgroup 3 Subgroup 4 Subgroup 5 Subgroup 6 Subgroup 7 Subgroup 8 Subgroup 9 Subgroup 10 Subgroup 11 Subgroup 12 Subgroup 13 Subgroup 14 Subgroup 15

aspAT 22, aroAT 4 aspAT 2, aroAT 24 aspAT 4 aspAT 2 aspAT 5 aspAT 2 aspAT 2 aspAT 2 hpAT 4 hpAT 3 hpAT 5 hpAT 5 hpAT 2 alaAT 2 gabaAT 48, ␻-AT 16, orAT 10, dapaAT 129, aoAT 55 DAT 10, bcAT 53 psAT 1, seAT 50, aspAT 8, alaAT 4 psAT 21 aspAT 103, hpAT 3, alaAT 5, aroAT 11

I I I I I I I I I I I I I I II

Subgroup 16 Subgroup 17 Subgroup 18 Dropouts

III IV, I IV I

Abbreviations: aspAT, aspartate aminotransferase; aroAT, aromatic aminotransferase; hpAT, histidinol-phosphate aminotransfearse; alaAT, alanine aminotransferase; seAT, serine aminotransferase; psAT, phosphoserine aminotransferase; aoAT, N-acetyl-l-ornithine aminotransferase; ␻-AT, ␻-aminotransferase; orAT, lornithine aminotransferase; gabaAT, ␥-aminobutyrate aminotransferase; dapaAT, 7,8-diaminopelagonate aminotransferase; bcAT, branched chain amino acid aminotransferase; DAT, d-amino acid aminotransferase.

starts to split into too specific subgroups. These result shows that SA can perform optimum subgrouping and determine the family level subgroups from the input sequence set in superfamily level. If we focus on the subgrouping result of the sequences in group II, III and IV, SA clearly discriminates families. This result shows that SA discriminate families from superfamily level sequence set if families in input sequence set have similar evolutionary history. 4. Conclusions SA can perform a sequence subgrouping in an automated manner. It does not need a pre-determined sequence threshold for sequence clustering. Using the stochastic calculation of ASS(i) at each node and statistical calculation (t-value), it automatically determines the subgrouping node. From the analysis of various sequence sets, the followings could be addressed. (1) Subgrouping node determined by the maximum t-value is the same or very close to the node where ASS(i) and ASS(i)rs are similar or same; (2) Subgrouping by SA functionally discriminates the sequences in a given sequence set with a high probability by using the simple calculation of sequence similarity; (3) If the subgroups have a similar evolutionary history in phylogenetic tree , SA clearly discriminate the subgroups as shown in Section 3.2, which could be found in the clear discrimination of aminotransferase group II, III and IV in the subgrouping of all aminotransferase groups. If we define that optimum subgrouping is the one escaping from under-subgrouping or over-subgrouping, SA could successfully achieve optimum subgrouping without prior knowledge about the input sequence set. In addition, SA can suggest the optimum sequence similarity threshold which is value of ASS(i) where ASS(i) = ASS(i)rs . This is the major difference from CD-HIT or DivergentSet (Widmann et al., 2006). CD-HIT solely depends on the pre-defined sequence identity to perform the sequence clustering and DivergentSet refers to the phylogenetic tree and performs the

sequence clustering at the pre-defined sequence identity threshold. In contrast to these methods, SA refers to the phylogenetic tree and suggests the optimum sequence similarity threshold for subgrouping by the stochastic calculation. In addition, although SA only utilizes multiple sequence alignment and sequence similarity calculation, it gives good subgrouping result for the superfamily-level sequence set. SA can automatically find the optimum subgrouping node. In addition, it can split an input sequence set into functionally different subgroups. Once the input sequence set is given, the next task is to perform the subgrouping efficiently according to their function, which can be done by SA. Representative sequences selected by SA could fully represent the given sequence set, which efficiently reduces a search space for the further study such as the selection of subgroup having activity on the target enzymatic reaction. Acknowledgements This research was supported by WCU (World Class University) program through the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education, Science and Technology (R32-2008-000-10213-0) and Creative Research Initiatives (Center for in silico Protein Science, No. 2008-0061987) of MEST. References Abascal, F., Valencia, A., 2002. Clustering of proximal sequence space for the identification of protein families. Bioinformatics 18, 908–921. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402. Andreopoulos, B., An, A., Wang, X., Schroeder, M., 2009. A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics 10, 297–314. Brown, D.P., 2008. Efficient functional clustering of protein sequences using the Dirichlet process. Bioinformatics 24, 1765–1771. Eddy, S.R., 1998. Profile hidden Markov models. Bioinformatics 14, 755–763. Eisen, J.A., 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research 8, 163–167. Eisen, J.A., Fraser, C.M., 2003. Phylogenomics: intersection of evolution and genomics. Science 300, 1706–1707. Heger, A., Holm, L., 2000. Towards a covering set of protein family profiles. Progress in Biophysics and Molecular Biology 73, 321–337. Krause, A., Haas, S.A., Coward, E., Vingron, M., 2002. SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucleic Acids Research 30, 299–300. Lee, D.A., Rentzsch, R., Orengo, C., 2010. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Research 38, 720–737. Li, W., Godzik, A., 2006. CD-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658– 1659. Mehta, P.K., Hale, T.I., Christen, P., 1993. Aminotransferases: demonstration of homology and division into evolutionary subgroups. European Journal of Biochemistry 214, 549–561. Okamoto, A., Nakai, Y., Hayashi, H., Hirotsu, K., Kagamiyama, H., 1998. Crystal structures of Paracoccus denitrificans aromatic amino acid aminotransferase: a substrate recognition site constructed by rearrangement of hydrogen bond network. Journal of Molecular Biology 280, 443–461. Oue, S., Okamoto, A., Nakai, Y., Nakahira, M., Shibatani, T., Hayashi, H., Kagamiyama, H., 1997. Paracoccus denitrificans aromatic amino acid aminotransferase: a model enzyme for the study of dual substrate recognition mechanism. Journal of Biochemistry 121, 161–171. Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680. Wicker, N., Perrin, G.R., Thierry, J.C., Poch, O., 2001. Secator: a program for inferring protein subfamilies from phylogenetic trees. Molecular Biology and Evolution 18, 1435–1441. Widmann, J., Hamady, M., Knight, R., 2006. DivergentSet, a tool for picking nonredundant sequences from large sequence collections. Molecular & Cellular Proteomics 5, 1520–1532.

An empirical subgrouping of Finnish learning-disabled children.

Subgrouping of DR4 alleles by DNA heteroduplex analysis.

The role of anger in psychosocial subgrouping for patients with low back pain.

Legionella detection and subgrouping in water air-conditioning cooling tower systems in Kuwait.

Focusing on drug versus disease mechanisms and on clinical subgrouping to advance personalised medicine in psychiatry.

Subgrouping siblings of people with autism: Identifying the broader autism phenotype.

D Transcription Factors Reveals Subgrouping and Association with CTCF.

Classification, epidemiology and clinical subgrouping of antineutrophil cytoplasmic antibody (ANCA)-associated vasculitis.

Subgrouping of patients with oral lichen planus according to cytochrome P450 enzyme phenotype and genotype.

Subgrouping of patients with rheumatoid arthritis based on pain, fatigue, inflammation, and psychosocial factors.

Subgrouping of fast twitch fibres in skeletal muscles of man. A critical appraisal.

Subgrouping of Japanese middle-aged women attending a menopause clinic using physical and psychological symptom profiles: a cross-sectional study.

Subgrouping of low back pain patients for targeting treatments: evidence from genetic, psychological, and activity-related behavioral approaches.

Stage IIIC endometrial cancer: the need for novel subgrouping according to the ratio of metastatic lymph nodes.

Symptom patterns and phenotypic subgrouping of women with polycystic ovary syndrome: association between endocrine characteristics and metabolic aberrations.

Keele Aches and Pains Study protocol: validity, acceptability, and feasibility of the Keele STarT MSK tool for subgrouping musculoskeletal patients in primary care.

Subgrouping of intermediate-stage (BCLC stage B) hepatocellular carcinoma based on tumor number and size and Child-Pugh grade correlated with prognosis after transarterial chemoembolization.

PCR-Based Simple Subgrouping Is Validated for Classification of Gliomas and Defines Negative Prognostic Copy Number Aberrations in IDH Mutant Gliomas.

Design of an optimum computer vision-based automatic abalone (Haliotis discus hannai) grading algorithm.

Automatic feature selection of motor imagery EEG signals using differential evolution and learning automata.

Hologram authentication based on a secure watermarking algorithm using cellular automata.

Breast Density Analysis Using an Automatic Density Segmentation Algorithm.

Increased intraepithelial CD3+ T-lymphocytes and high PD-L1 expression on tumor cells are associated with a favorable prognosis in esophageal squamous cell carcinoma and allow prognostic immunogenic subgrouping.

Exercise only, exercise with mechanical traction, or exercise with over-door traction for patients with cervical radiculopathy, with or without consideration of status on a previously described subgrouping rule: a randomized clinical trial.