Journal of Bioinformatics and Computational Biology Vol. 13, No. 4 (2015) 1550012 (19 pages) # .c Imperial College Press DOI: 10.1142/S0219720015500122

Optimized leaf ordering with class labels for hierarchical clustering

Natalia Novoselova*,§, Junxi Wang†,¶ and Frank Klawonn†,‡,||

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

*Department

of Bioinformatics, United Institute of Informatics Problems Surganova Str. 6, Minsk 220012, Belarus



Biostatistics, Helmholtz Centre for Infection Research Inho®enstraße 7, 38124, Braunschweig, Germany



Department of Computer Science, Ostfalia University of Applied Sciences Salzdahlumer Straße 46/48, 38302, Wolfenbuttel, € Germany § [email protected][email protected] ||[email protected], [email protected] Received 17 February 2015 Accepted 22 February 2015 Published 2 April 2015

Hierarchical clustering is extensively used in the bioinformatics community to analyze biomedical data. These data are often tagged with class labels, as e.g. disease subtypes or gene ontology (GO) terms. Heatmaps in connection with dendrograms are the common standard to visualize results of hierarchical clustering. The heatmap can be enriched by an additional color bar at the side, indicating for each instance in the data set to which class it belongs. In the ideal case, when the clustering matches perfectly with the classes, one would expect that instances from the same class cluster together and the color bar consists of well-separated color blocks without frequent alteration of colors (classes). But even in the case when instances from the same class cluster perfectly together, the dendrogram might not re°ect this important aspect due to the fact that its representation is not unique. In this paper, we propose a leaf ordering algorithm for the dendrogram that preserving the hierarchical clustering result tries to group instances from the same class together. It is based on the concept of dynamic programming which can e±ciently compute the optimal or nearly optimal order, consistent with the structure of the tree. Keywords: Hierarchical clustering; dendrogram; leaf ordering; dynamic programming; biomedical data.

1. Introduction Hierarchical clustering is widely used in the bioinformatics community to analyze data and to reveal hidden structures that do not become obvious with simple visualization strategies. The advantage of hierarchical clustering consists in making it possible to visualize the results in the form of a binary tree diagram or dendrogram that groups similar instances together. As a rule of thumb, the Euclidean distance or one minus a suitable correlation coe±cient are used as the distance measures that is 1550012-1

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

N. Novoselova, J. Wang & F. Klawonn

required for hierarchical clustering. The leaves of the dendrogram derived from hierarchical clustering correspond to the instances or objects, e.g. genes, proteins or patients. For the visualization of the dendrogram a linear order of the leaves is required that is consistent with the clustering result. The clustering itself yields only a binary tree without specifying which of the subtrees should be placed on the left or right. Switching the two subtrees of any node will lead to a di®erent dendrogram although it still re°ects the same hierarchical clustering result. There exist the 2 n1 possible orderings of a binary tree with n leaf nodes, since in any of the n  1 non-leaf nodes we have two choices. The chosen leaf order will highly in°uence the visualization and therefore the interpretation of the data based on the corresponding dendrogram and heatmap. This observation is not new at all and various approaches have been proposed to rearrange the leaf order of a dendrogram. However, all these methods are based exclusively on the distance matrix, but do not exploit additional information provided by class labels as in our approach. Before we describe the principles of our approach, we brie°y review the strategies from the literature that do not incorporate class labels, since we borrow some algorithmic ideas from them. The approaches1–4 try to rearrange the linear ordering of the leaf nodes in the dendrogram such that the sum of the distances of adjacent leaves is minimized. Eisen et al.1 proposed a simple heuristic strategy that switches subtrees according to weights that can be assigned to the leaves (for example the average gene expression over all experiments or the position of a gene in the chromosome). An implementation of this method can be found in Ref. 2. After that several authors proposed di®erent computationally feasible algorithms to optimize the leaf ordering.3,4 The optimization of the ordering used by Bar-Joseph et al.3 is based on a recursive process, which estimates the optimal ordering of subtrees in a bottom-up way. It resembles a dynamic programming approach and includes a backtracking phase to improve orderings made in previous steps of the algorithm. The algorithm's complexity is Oðn 4 Þ, where n is the number of instances in the data set. The authors proposed also an improvement of the algorithm that allows the earlier search termination and can dramatically decrease the computation time in average. The algorithm in Ref. 4 is also based on dynamic programming using a three-dimensional table that stores the costs of the best linear ordering of the leaves in the subtree, starting from the bottom. The algorithm complexity is proclaimed to be Oðn 3 Þ when a multiplication instead of sum is used in computing the table elements. Other approaches to optimal leaf ordering can be found in Refs. 5–8. But they also do not consider the class labels in the optimization process. Buchta and Hahsler7 adopted the ordering concept, described in Ref. 3, where the minimal path length is de¯ned to be an optimal ordering. Hahsler et al.8 implemented several seriation/sequencing techniques to reorder matrices, dissimilarity matrices and dendrograms. Sakai9 proposes the R package for sorting the dendrogram based on the average distance of subtrees. In many studies in the ¯eld of medicine and biology, the elements of the data set, characterized by the number of features (e.g. gene expression levels in microarray), 1550012-2

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Optimized leaf ordering

are also tagged by some nominal class labels, as for example gene ontology (GO) terms or disease subtypes. In such cases, it is often of interest to see whether the clustering result is more or less coherent with the class labels, i.e. instances with the same label also cluster together. If this is the case, the distance measure can be considered suitable to classify unlabeled instances in nearest neighbor fashion. Figure 1(a) shows a dendrogram for microarray data with three di®erent tissue types,10 obtained by applying agglomerative hierarchical clustering to leukemia data set. The Euclidean distance measure was used to compute the distance matrix. The branches with the same color correspond to the same tissue type. It can be seen that most of the tissue types cluster together, but some instances are apart from their proper group of tissue type. The ordering of the leaf nodes in Fig. 1(b) obtained by simply °ipping some of the inner nodes results in dendrogram that is even more consistent with class labels (tissue types). With such an optimized dendrogram it is easier to interpret the interrelation between the available class labels and clusters. Moreover, it is possible to see whether one can predict the class for unlabeled instances based on the distance that was used for clustering. In our study, we optimized the leaf ordering according to the class labels. To our knowledge it is the ¯rst time to provide a leaf ordering algorithm based on the class

(a)

(b) Fig. 1. Hierarchical clustering of three tissue types: 25 acute myeloid leukemia (AML) samples; nine T-lineage acute lymphoblastic leukemia (ALL) samples; and 38 B-lineage ALL from acute leukemia patients.10 (a) Dendrogram obtained from clustering. (b) Dendrogram after rearrangement of the leaf nodes. 1550012-3

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

label information. We reformulated the standard problem of optimal leaf ordering and developed a dynamic programming algorithm to ¯nd the optimal or a near optimal solution. We implemented our dynamic programming algorithm together with an alternative approach using a genetic algorithm (GA) to rearrange the leaves in a dendrogram. We performed several experiments on simulated and real data sets and compared the results to the heuristic approach.1 Our algorithm together with all the necessary functions is available as a R package at http://cran.r-project.org/web/ packages/ReorderCluster.

2. Leaf Ordering with Class Labels In this section, we will formalize the problem of optimal leaf ordering of instances according to their class labels. Then we provide our dynamic programming algorithm to solve this problem and some modi¯cations which help to reduce its time complexity. After that we will brie°y describe our approach to optimize the leaf ordering with class labels using a GA. 2.1. Leaf ordering with dynamic programming 2.1.1. Problem de¯nition In our study, we assume that a distance measure between the genes is given and dendrogram, i.e. the binary tree is constructed by the method in Ref. 1 or by another technique where the leaves in the tree correspond to the instances in the data set. Let T be the resulting tree with n leaves. x1; x2 ; . . . ; xn denote the individual leaves of T and v1; v2 ; . . . ; vn1 the inner nodes. Let c1; c2 ; . . . ; cn denote the class labels of the corresponding leaves. According to Ref. 3, the linear ordering which preserves the structure of T is de¯ned to be an ordering of the leaves of T generated by °ipping internal nodes or changing the order between the two subtrees rooted at vi 2 T . Since T has n  1 inner nodes, there are 2 n1 possible linear orderings that do not change the structure of T . The task is to ¯nd an ordering of leaves of the hierarchical tree such that the instances (leaf nodes) with the same class labels will be as close together as possible. To formalize this task we need to introduce an objective function to judge the quality of a particular linear ordering of instances, taking into account the class labels and the constraints given by the binary tree. A particular leaf ordering  2  where  is the set of all possible permutations of f1; . . . ; ng induces a permuted vector c ¼ ðcð1Þ;...; cðnÞ Þ of class labels corresponding to the permuted leaves. Let ni;j be the length of the jth sequence of elements ðcðkÞ;...; cðkþni;j Þ Þ from c in which only the ith class label occurs without being interrupted by the elements of other class labels. For  2 , we de¯ne the following function F 1 ðT Þ ¼

pi K X X i¼1

j¼1

1550012-4

ðni;j Þ coef ;

ð1Þ

Optimized leaf ordering

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

where pi is number of individual sequences of elements with the ith class label when T is ordered according t o , K is the number of di®erent class labels, coef 2 1; 2 is the parameter to be chosen that reinforces the in°uence of longer sequences. The value of coef is equal to 1.5 in our experiments. The coef values in the interval 1; 2 give the similar results. Our goal is to ¯nd such an ordering  that the F 1 ðT Þ is maximized, in other words, in the optimal ordering the sequences of identical class labels should be as long as possible. Note that not all permutations  2  are admitted but only those that are consistent with the given dendrogram. Otherwise, one could easily optimize the objective function (1) by simply enumerating the instances by the classes. 2.1.2. Algorithm description We ¯nd an optimized ordering that conforms to the given dendrogram using dynamic programming approach.3 The algorithm is recursive and proceeds in a bottom up fashion from the leaf nodes of the tree T to its root. The optimized ordering of each subtree v is determined only after computing the optimized leaf orderings of its child nodes vl and vr where vl and vr are the left and the right subtrees of v. To save the optimized ordering and the corresponding value of the function F 1 ðvÞ for each internal node v of T we de¯ne the matrix A of size n  n. We also de¯ne three auxiliary matrices maxI; maxJ, and R of size n  n. Each cell ðu; wÞ; u ¼ 1; . . . ; n; w ¼ 1; . . . ; n in these matrices corresponds to a pair of leaf nodes. The element ðu; wÞ of matrix A stores the function value corresponding to the optimized leaf ordering of node v, when u is the leftmost element of v and w is the rightmost element of v. For each pair of leaves ðu; wÞ there exists only one subtree v for which the value of A½u; w is computed. This is the subtree rooted at the least common ancestor of u and w. In Fig. 2, node v is the least common ancestor of u and w. For any other subtree vr1 (see Fig. 2) that is not the least common ancestor of u and w both leaves then belong to its left v or right vr2 subtree and therefore do not take part in the computing of the optimized leaf ordering for the node vr1 . Therefore, we do not need to use an additional dimension for matrix A in order to store the node for which the optimized leaf ordering with the elements ðu; wÞ is computed.

1

2

,

u

,

,

1

1

,

2

w

1

Fig. 2. Binary tree with a root node vr1 , where node v is the least common ancestor of elements u and w. 1550012-5

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

N. Novoselova, J. Wang & F. Klawonn

The computation of the optimized function value for the internal node v is based on the known optimized function values of child nodes vl and vr . To compute the value of the cell ðu; wÞ of matrix A for node v we must check all feasible leaves i1 ; j1 , where i1 is the rightmost leaf of vl and j1 is the leftmost leaf of vr . Then the optimized function value can be de¯ned as 8 max ðA ½u; i1  þ Avr ½j1 ; wÞ; if classði1 Þ! ¼ classðj1 Þ > > >i1 2vl;r ;j1 2vr;l vl > > > < max A½u; w ¼ i1 2vl;r ;j1 2vr;l ! > > > Avl ½u; i1  þ Avr ½j1 ; w  R½u; i1  coef  > > ; if classði1 Þ ¼ classðj1 Þ; > : R½w; j1  coef þ ðR½u; i1  þ R½w; j1 Þ coef ð2Þ where classðiÞ is the class label of the ith element. The cell ðu; wÞ of matrix R stores the length of the sequences of leaves of the internal node v (starting from the rightmost element w to the leftmost element uÞ which have the same class labels and are not interrupted by elements of other class labels. The cell ðu; wÞ of matrices maxI, maxJ stores the leaves i1 2 vl;r and j1 2 vr;l , respectively, which give the optimal value for A½u; w. After computing the function value and the corresponding optimized ordering for all pairs ðu; wÞ, u 2 vl , w 2 vr for node v, we can de¯ne the optimal ordering for node v as the pair ðuopt ; wopt Þ that yields the maximal function value F 1 ðvÞ ¼ A½uopt ; wopt  ¼ maxu2vl ;w2vr A½u; w. At each stage of this bottom up recursive process of computing the optimized function value for each inner node v we update the matrices maxI; maxJ in order to store the intermediate leaves that provide the optimized value for each pair of ðu; wÞ elements of this node. At the end of the recursive process we obtain the optimized function value for the whole tree T as the best value F 1 ðvroot Þ of the root node vroot . After that, using the matrices maxI; maxJ, the sequence of leaves which gives the best function value and corresponds to the optimized leaf ordering can be recovered using a backtracking scheme. The main scheme of this approach is sketched in Algorithm 1. In lines 11–53 of the algorithm the optimized leaf ordering is computed for all possible leftmost elements of the left subtree vl of node v. In lines 17–53, of the algorithm the optimized leaf ordering is computed for all possible rightmost elements of the right subtree vr of node v with ¯xed u 2 vl . In lines 25–37, the maximal value of the optimization function and the corresponding leaf ordering is computed for the ¯xed elements u 2 vl and w 2 vr searching through all feasible intermediate leaves i1 2 vl and j1 2 vr . The matrix element ðu; wÞ is calculated in lines 27–30. Since the matrix A is symmetric, in lines 38 and 39 the maximal value of optimized function is stored as A½u; w ¼ A½w; u. In lines 33 and 34, the intermediate leaves i1 , j1 that yield the maximum value of A½u; w are updated in the matrices maxI; maxJ. The elements of the matrix R are calculated in lines 42–51. They are used for evaluating the 1550012-6

Optimized leaf ordering

u 1

1

1

2

2

[ , ]

1

1

w

[ , ]

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Fig. 3. The elements of the matrix R for subtree v.

objective function for the internal nodes and to store the lengths of sequences of leaves with the same class label from the left and right side of each subtree v. For example, in Fig. 3 for the subtree v with leaves having class labels 1 or 2, R½u; w ¼ R½j1 ; w ¼ 2 and R½w; u ¼ R½i1 ; u ¼ 3 holds. As a result of the algorithm we obtain the matrix A with elements ðu; wÞ corresponding to the function values for the optimized leaf ordering from the leftmost element u to the rightmost element w of the corresponding internal node v. The maximal element ðuopt ; wopt Þ of the matrix A corresponds to the optimized ordering of the tree T . Using the matrices maxI; maxJ we can recover the sequence of elements in the optimized leaf ordering of the whole tree T . 2.1.3. Improvements and variation of the original algorithm As the time complexity of our algorithm is Oðn 4 Þ because it is based on similar principles as the one in Ref. 3, where n is the number of leaves in the tree, we propose some modi¯cation to decrease the computation time. The ¯rst modi¯cation consists in merging the elements of the node v when all of them have the same class label. Flipping the subtrees of such a node does not change the value of the objective function. Therefore, we have developed Algorithm 2 that rearranges the original tree and merges the subtrees with the same class label into one element. Algorithm 1 is then applied to the simpli¯ed tree. The simpli¯ed tree preserve the sequence lengths of the original subtrees and also the initial function values for the merged subtrees are taken into account. Figure 4 shows the result of applying Algorithm 2 to the subtree v. The proposed Algorithms 1 and 2 are implemented in the R programming language.11 Pure R code is relatively slow compared to C or Cþþ. To reduce the computation time the second modi¯cation consists in converting Algorithm 1 to Cþþ and call it then from the R environment. For this purpose, we used the Rcpp R package,12 that makes it simple to connect Cþþ to R. The experiments with di®erent data sets have shown a considerable speed up of the algorithm (see Table 1). 1550012-7

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Inputs: is the root node of the tree , the matrix to store the values of the optimized function , the matrices to store the intermediate leaves , the matrix to store the lengths of sequences with the same class labels for internal subtrees , the vector of class labels for each leaf node Outputs: matrices 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36:

OrderingClass( if size( )=1

)

return( else =OrderingClass ( = OrderingClass (

( ( for all leaves if = else = end if for all leaves if

// vremI is the number of intermediate leaves of the left // subtree to search for the optimal value of

=

//vremJ is the number of intermediate leaves of the //right subtree to search for the optimal

else = end if max=-1 for all leaves for all leaves

+ end if if

) )

>max max=

end if end if Algorithm 1. Optimized leaf ordering with class labels. 1550012-8

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Optimized leaf ordering

37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48: 49: 50: 51: 52: 53: 54: 55: 56:

end if =max = = if((all class( ) are equal) and )=class( (class(

)

else end if if((all class( ) are equal) and )=class( (class(

)

else end if end if end if end if return( end Algorithm 1. (Continued )

2.2. Leaf ordering with GA We also developed an approach to optimize the leaf ordering with respect to the class labels based on a di®erent objective function. Dynamic programming is not suitable for this objective function. Instead, we used a GA. GA is a powerful stochastic optimization technique, inspired by natural evolutionary processes to evolve a population of individual task solutions to reach the global optimum of the objective function or at least a good solution. GA is especially appropriate to solve optimization problems with complex solution spaces and having many parameters.13 The optimization task in this case of searching the optimal leaf ordering with class labels can be formulated as follows. Let  2 , where jj ¼ 2 n1 be the set of all possible leaf orderings of the previously constructed hierarchical tree T with n leaves. Each ordering is the result of leaf permutations or °ipping the subtrees of the internal nodes of the tree. Each leaf i corresponds to the particular class label ci , therefore the resulting ordering can be considered as the vector of class labels of the permuted sequence of leaves, c ¼ ðcð1Þ ; cð2Þ ; . . . ; cðnÞ Þ, where  is a permutation of f1; . . . ; ng. It is required to ¯nd such an ordering  of the leaves of the tree that is consistent with the tree and which 1550012-9

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Input: is the root node of the tree , the matrix to store the values of the optimized function , the matrix to store the lengths of sequences with the same class labels for internal subtrees. Output: the simplified tree . 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26:

SubTree( {

) if all the leaves of have the same class label ) return(the leftmost element leaf of end if if is not a singleton ) =SubTree( , (leaf, if(leaf!=NULL) //number of elements of subtree len=size( leaf // subtree is a singleton , = , = len end if end if is not a singleton if ) (leaf, =SubTree( , if(leaf!=NULL) //number of elements of subtree len=size( leaf // subtree is a singleton , = , =len end if end if return( , leaf=NULL, )

end

Algorithm 2

minimizes the objective function F 2 ðT Þ ¼ 

K X i¼1

ni =n

pi X

ni;j =ni  lnðni;j =ni Þ;

ð3Þ

j¼1

where F 2 ðT Þ is the partition entropy value for elements in c, K is the number of class labels, ni is the number of elements of the ith class label, pi is the number of continuous sequences of elements of class i, ni;j is the number of elements in the jth sequence with the ith class label. In order to ¯nd the optimal leaf ordering the function F 2 ðT Þ must be minimized. 1550012-10

Optimized leaf ordering

,

,

1

1 ,

1

2

2

2

=

,

1

1

1

2

2

1

=

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Fig. 4. Merging of subtrees vl;l and vr;l having the same class labels.

A binary coding of the solutions is an obvious choice in this case and suits very well to the concept of GA. Since the permutation of the leaf nodes is induced by either switching or not switching the subtrees of any of the n  1 inner nodes, a solution can be coded as a binary vector of length n  1. Such a binary vector induces a unique permutation of the leaf nodes. 0 stands for not switching the subtrees of the corresponding node, whereas 1 stands for switching them. In Fig. 5, we can see an example of the correspondence between the individual of GA and the resulting tree. The value of the ¯tness function for an individual of the GA is computed according to (3) for the decoded tree. The ¯tness function indicates the e±ciency of the solution for the optimization task, encoded in the GA individual. The initial population is generated at random. For the realization of the GA optimization scheme the standard crossover and mutation operations are applied to the population of GA individuals. The optimization scheme consists of repeated execution of several steps: decoding of individuals, calculation of the ¯tness function, selection of individuals according to the function values, recombination, and elitism operations. At each generation a new population of the GA is formed. The iteration process

Hierarchical trees

1

1

0

0

1

1

2

0

0

2

1

0

2

1

0

0

GA individual Fig. 5. The encoding of the GA individuals. 1550012-11

2

1

1

1

1

1

N. Novoselova, J. Wang & F. Klawonn

continues until the GA stopping condition will be satis¯ed. As a stopping condition either a pre-de¯ned, maximal number of generations can be used or the algorithm is terminated when no improvement of the best solutions is achieved anymore. In our study, we used the standard binary GA, implemented in the R package genalg.14 The package genalg includes R-based GAs for binary and °oating point chromosomes. We used a population size of 200 individuals and a maximum number of iterations equal to 100.

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

3. Experiments We have performed several experiments to test our approach of reordering the leaves in dendrogram in order to make the same class labels to be located as close as possible. The resulting ordered collection of data cases (e.g. genes in microarray analysis) is displayed in the form of a heatmap where rows represent cases and columns represent single features (e.g. experiments in microarray analysis). Each data point corresponds to a colored box, where the color represents a feature value. The optional vertical and horizontal side bars can be used to annotate the rows and columns of the heatmap. The hierarchy of the tree is also drawn on the left side of the heatmap. 3.1. Methods We have used two arti¯cial data sets and one biological data set to test the optimal leaf ordering algorithm and its e®ect on the visualization of the results of hierarchical clustering. The ¯rst simulated data set consists of 90 cases, characterized by the values of 90 features. For each subsequent subset of 15 cases we set 40 consecutive features to 1 (with 10 features shift), and then °ip them to 0 or 1 with probability p ¼ 0:01. The rest of the features in the data set were generated at random. As a result the six distinctive clusters are formed. The second arti¯cial data set is more complex and consists of 400 data cases, each with 100 feature values. The similar generation scheme as for the ¯rst data set is applied and eight distinctive clusters are formed. For both data sets we have manually de¯ned the vector of class labels c ¼ ðc1 ; c2 ; . . . ; cn Þ; ci 2 f1; 2; 3g, which does not conform to the generated clusters. For the ¯rst data set each subsequent subset of 15 cases was di®erently labeled and the labels of the ¯rst and the second half of data set were the same. For the second data set the ¯rst 125 cases was attributed to class 1, the following 150 cases to class 2 and the last 125 cases to class 3. The real biological data set presents cell cycle data.15 The cell cycle or cell-division cycle is the series of events that take place in a cell leading to its division and duplication (replication). The data set presents the results of microarray analysis and consists of expression pro¯les of 800 genes, which are cell cycle regulated in Saccharomyces cerevisiae.3 These genes were assigned to ¯ve groups termed G1, S, S/G2, G2/M, and M/G1, which present distinct phases of the cell cycle. The data set 1550012-12

Optimized leaf ordering

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

is a combination of three di®erent experiments (cdc15, cdc28, and  factor) that together constitute 59 measurements of gene expression. In our experiment, we have used the 24 cdc15 measurements. We have applied hierarchical clustering with heuristic leaf ordering1 to all data sets and then used the proposed optimal leaf ordering algorithm and the GA-based approach to improve the orderings with respect to the class labels. In GA experiments, the results were slightly better for the value of objective function in (1) when the objective in (3) was optimized. All the algorithms were implemented in R environment with the R language. In order to compare the results for each leaf ordering we have computed the value of the objective function in (1). 3.2. Results The results of applying the optimal leaf ordering and the hierarchical clustering algorithm with heuristic ordering1 to the ¯rst simulated data set are shown in Fig. 6. The additional color bar at the side of the heatmap presents the class label annotations. For the ¯rst simulated data set the optimal leaf orderings with class labels using the dynamic programming algorithm and the GA yielded similar results (Fig. 6) with less dispersed cluster labels than in hierarchical clustering alone. The evaluation function for the optimal leaf ordering has higher values than for the unordered hierarchical clustering (Table 1). For the second simulated data set, the optimal leaf orderings are presented in Fig. 7. As the de¯ned class labels do not conform to the generated clusters they are more dispersed for pure hierarchical clustering. The optimal leaf orderings better preserve the similarity of data pro¯les (Fig. 7). The value of the evaluation function is slightly better for the optimal leaf ordering algorithm with dynamic programming (Table 1). We compared the results of the optimal leaf ordering and heuristic ordering for the biological data set, considering the cell cycle groups as the class labels. In Fig. 8, we can see that the optimal ordering with classes can perfectly locate similar class

(a)

(b)

(c)

Fig. 6. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for the ¯rst simulated dataset: (a) hierarchical clustering,1 (b) optimal leaf ordering (dynamic programming), and (c) optimal leaf ordering (GA). 1550012-13

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

(a)

(b)

(c)

Fig. 7. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for the second simulated dataset: (a) hierarchical clustering,1 (b) optimal leaf ordering (dynamic programming), and (c) optimal leaf ordering (GA).

Fig. 8. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for biological data set.

labels close to each other. Unlike hierarchical clustering, the cell cycle order is more preserved. Despite the optimization function of the optimal leaf ordering algorithm does not take into account the distance similarities between gene expression pro¯les they are better arranged according to similarity then the hierarchical clustering with heuristic ordering. The peaks of gene expression for optimal leaf ordering change continuously from the bottom to the top of the heatmap plot (Fig. 8). Therefore, the optimal ordering with classes correctly identi¯ed the order of the groups in the real cell cycle process. The results of the GA approach are slightly worse with a smaller value of the objective function (Table 1). We have compared the computation time of the di®erent approaches, including the dynamic programming algorithm for leaf ordering with and without using the Cþþ function, and the approach using the GA (Table 1). The results were obtained on a BENQ laptop with a Microsoft Windows 7 (Service Pack 1) operating system and an Intel(R) Core(TM)2 Duo CPU T8100 @2.10 GHz (2.00 GB RAM) processor. As can be seen the computation time of the dynamic programming approach with modi¯cations is the best. 1550012-14

Simulated data 1 Simulated data 2 Spellman data set

Data set

0.85 s

7.34 s

60.0 min

90

400

800

Dynamic Number programming of cases algorithm (original)

1550012-15 28.0 s

1.07 s

0.014 s

Dynamic programming algorithm (with modi¯cations)

Time complexity

30.0 min

7.34 min

1.27 min

Approach with GA

3245.0

3527.0

444.82

3245.0

3527.0

444.82

Dynamic Dynamic programming programming algorithm algorithm (original) (with modi¯cations)

2960.8

3406.0

444.82

2355.0

2900.0

396.69

Approach Hierarchical with GA clustering

Evaluation function (to be maximized in (1))

Table 1. Time complexity and values of the optimization function for optimal leaf ordering algorithms subject to class labels.

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Optimized leaf ordering

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

3.3. Discussion Our optimized leaf ordering algorithm can work with any binary tree to order the leaves according to the class labels, preserving the clustering structure. We have made improvements to the original version of our algorithm that decreases the computational time especially for large data sets. For all analyzed data sets the resulting orderings with the proposed dynamic programming algorithm and GA-based approach were almost the same with only small variations in the resulting heatmaps. As the GA approach is as a rule more time consuming (see Table 1) its use for large data sets is not recommended. The proposed dynamic programming algorithm with modi¯cations is the most e±cient among all the examined approaches and is recommended for practical applications. Optimized ordering with classes helps to determine the relationship between the known class labels and the clustering structure of a data set, allowing to relate the classes to the speci¯ed clusters and to determine the relationship between di®erent classes. By applying the optimized ordering algorithm to a biological data set we have not only obtained a closer grouping of similar class labels, but also a better ordering of genes according to the similarity of their expression pro¯les, that corresponds to the time order of the real cell cycle process.

4. Conclusion In this paper, we have studied the problem of ordering leaves of a tree representing hierarchical clustering of data cases with class labels. We have presented an e±cient algorithm that allows to ¯nd such a leaf ordering that the data cases with the same class labels are located as close as possible in the reordered hierarchical tree. We have reformulated the standard problem of optimal leaf ordering, which optimizes the function of similarities of data elements. In our case, we try to optimize the function which is computed on the basis of class labels of individual data elements in order to place the same class labels together. Several modi¯cations to the original version of the algorithm have been proposed in order to make it more time e±cient. Apart from the dynamic programming approach we have also developed an approach to optimal leaf ordering with classes using a GA. We have made several experiments in order to compare the proposed approaches and standard hierarchical clustering with heuristic ordering.1 According to our results the dynamic programming leaf ordering algorithm with modi¯cations has both the best value of the objective function and the smallest computation time. Therefore, it can be recommended for practical applications. For the biological data set the dynamic programming algorithm lead to a better ordering according to similarities of gene pro¯les. Continuous drift of the peak expression values of genes in the heatmap can be clearly seen (Fig. 8). Currently we are investigating the possibility to take into account data cases with missing class labels for the leaf ordering. We expect that the

1550012-16

Optimized leaf ordering

location of such data cases in the resulting tree can facilitate the prediction of their class label. The implementation of our algorithms with all the necessary functions is available as a R package at http://cran.r-project.org/web/packages/ReorderCluster.

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Acknowledgments We thank the colleagues from the Cellular Proteomics Research Group at the Helmholtz Centre for Infection Research for many useful comments on this problem. Natalia Novoselova acknowledges the support by a research grant A/13/00004 from the German Academic Exchange Service (DAAD). This study was co-¯nanced by the European Union (European Regional Development Fund) under the Regional Competitiveness and Employment objective and within the framework of the Bi2SON Project Einsatz von Informations-und Kommunikationstechnologien zur Optimierung der biomedizinischen Forschung in Südost-Niedersachsen.

References 1. Eisen MB, Spellman PT, Brown PO, Botstein D, Cluster analysis and display of genomewide expression patterns, Proc Natl Acad Sci USA 95(25):14863–14868, 1998. 2. Eisen MB, Cluster v. 2.11 and TreeView v. 1.5, 2000, Available at http://rana.lbl.gov/ EisenSoftware.html. 3. Bar-Joseph Z, Gi®ord DK, Jaakkola TS, Fast optimal leaf ordering for hierarchical clustering, Bioinformatics 17:22–29, 2001. 4. Biedl T, Brejova B, Demaine ED, Hamel AM, Vinar T, Optimal arrangement of leaves in the tree representing hierarchical clustering of gene expression data, Technical Report CS-2001-14, Department of Computer Science, University of Waterloo, 2001. 5. Chae M, Chen JJ, Reordering hierarchical tree based on bilateral symmetric distance, PLoS One 6(8):e22546, 2011, doi: 10.1371/journal.pone.0022546. 6. Brandes U, Optimal leaf ordering of complete binary trees, J Discrete Algorithms 5(3):546–552, 2007. 7. Buchta C, Hahsler M, cba: Clustering for business analytics, R package version 0.2-14, 2014, Available at http://CRAN.R-project.org/package¼cba. 8. Hahsler M, Buchta C, Hornik K, Infrastructure for seriation, R package version 1.0-14, 2014, Available at http://CRAN.R-project.org/package¼seriation. 9. Sakai R, Dendsort: Modular leaf ordering methods for dendrogram nodes, R package version 0.3.2, 2014, Available at http://CRAN.R-project.org/package¼dendsort. 10. Handl J, Knowles J, Kell DB, Computational cluster validation in post-genomic data analysis, Bioinformatics 21:3201–3212, 2005, doi: 10.1093/bioinformatics/bti517. 11. R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2014, Available at URL http://www. R-project.org/. 12. Eddelbuettel D, Francois R, Rcpp: Seamless R and Cþþ integration, J Stat Soft 40 (8):1–18, 2011. 13. Goldberg DE, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, Inc., USA, 1989.

1550012-17

N. Novoselova, J. Wang & F. Klawonn

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

14. Willighagen E, genalg: R based genetic algorithm, R package version 0.1.1.1, 2014, Available at http://CRAN.R-project.org/package¼genalg. 15. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B, Comprehensive identi¯cation of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell 9:3273–3297, 1998.

Natalia Novoselova received the Ph.D. degree in computer sciences from United Institute of Informatics Problems (UIIP), National Academy of Sciences of Belarus (NASB) in 2008. She was a scienti¯c researcher from 1987 to 2000 in the Laboratory of Ergonomics, Institute of Engineering Cybernetics NASB, and has been a senior scienti¯c researcher since 2000 in the Laboratory of Bioinformatics, United Institute of Informatics Problems NASB in Minsk, Belarus. Her research interests include data mining methods, notably the neural network, genetic algorithms and fuzzy systems and their application to analysis of medical and biological data. She is an author of more than 20 research publications in these areas. In years 2008 and 2010, she was a recipient of the two-month scienti¯c grants from German Academic Exchange Service, to conduct research in the ¯eld of bioinformatics at the Ostfalia University of Applied Sciences, Wolfenbuettel, Gemany.

Junxi Wang received B.Sc. degree in Biotechnology/Bioinformatics from the University of Emden/Leer (Germany) and the Zhejiang University of Science and Technology (China) in 2014. She works as a research assistant at the Helmholtz Centre for Infection Research and is also pursuing her M.Sc. degree at Goethe University in Frankfurt.

1550012-18

J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Optimized leaf ordering

Frank Klawonn received his M.Sc. and Ph.D. in mathematics and computer science from the University of Braunschweig in 1988 and 1992, respectively. Before he became the head of the Lab for Data Analysis and Pattern Recognition (DAPaR) at Ostfalia University of Applied Sciences in Wolfenbuettel (Germany), he was a Visiting Professor at Johannes Kepler University in Linz, Austria (1996), at Rhodes University in Grahamstown, South Africa (1997), a research fellow with British Telecom (2001) and an Associate Professor at Emden/Leer University (until 2002). He is now also the head of the Biostatistics Group at the Helmholtz Centre for Infection Research in Braunschweig (Germany). His main research interests focus on techniques for intelligent data analysis, especially clustering, classi¯cation and robust statistical methods. He is a member of the editorial board of international journals like Data Mining, Modelling & Management (IJDMMM), Intelligent Data Analysis (IDA), Evolving Systems (ES) and Fuzzy Sets and Systems (FSS).

1550012-19

Optimized leaf ordering with class labels for hierarchical clustering.

Hierarchical clustering is extensively used in the bioinformatics community to analyze biomedical data. These data are often tagged with class labels,...
767KB Sizes 4 Downloads 6 Views