Journal of Theoretical Biology 377 (2015) 10–24


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

VR-BFDT: A variance reduction based binary fuzzy decision tree induction method for protein function prediction

Fahimeh Golzari, Saeed Jalili*

SCS Lab, Computer Engineering Department, Tarbiat Modares University, Tehran, Iran

Highlights

• Protein multi-function prediction.
• Decision boundary fuzzification.
• Label variance reduction as splitting criterion.
• Hierarchical multi-label classification.
• Variance reduction based binary fuzzy decision tree induction.

Article info

Article history: Received 23 October 2014; received in revised form 11 March 2015; accepted 20 March 2015; available online 10 April 2015

Keywords: Machine learning; Hierarchical multi-label classification; Protein function prediction; Consistency preserving

Abstract

In the protein function prediction (PFP) problem, the goal is to predict the functions of the many well-sequenced proteins whose functions are not yet precisely known. PFP is a particularly complex machine learning problem in which a protein (regarded as an instance) may have more than one function simultaneously. Furthermore, the functions (regarded as classes) are interdependent and are organized in a hierarchical structure in the form of a tree or a directed acyclic graph. Decision trees are a common learning method for this problem. Because they partition the data into sets with sharp boundaries, however, a small change in the attribute values of a new instance may change its predicted label incorrectly and lead to misclassification. In this paper, a Variance Reduction based Binary Fuzzy Decision Tree (VR-BFDT) algorithm is proposed to predict the functions of proteins. The algorithm fuzzifies only the decision boundaries instead of converting the numeric attributes into fuzzy linguistic terms. It can assign multiple functions to each protein simultaneously and preserves the hierarchy consistency between functional classes. It uses label variance reduction as the splitting criterion to select the best “attribute-value” at each node of the decision tree. The experimental results show that the overall performance of the proposed algorithm is promising. © 2015 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +98 21 8288 3374. E-mail addresses: [email protected] (F. Golzari), [email protected] (S. Jalili). http://dx.doi.org/10.1016/j.jtbi.2015.03.023 0022-5193/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

In the last decade, many protein sequences have been identified, but a large fraction of these proteins remain uncharacterized (Nabieva et al., 2005; Nguyen et al., 2011; Moosavi et al., 2013; Jiang and McQuay, 2012). Because proteins play a critical role in many activities of organisms, discovering the functions of uncharacterized proteins is very important in fields such as the discovery of new drugs and the detection of disease mechanisms (Luscombe et al., 2001). Many experimental methods have been used successfully to solve this problem, but since experimental function annotation of proteins is expensive and time consuming, the need for computational methods for automatic function prediction has been growing (Yang et al., 2012; Lee et al., 2007). Among the existing computational methods, machine learning techniques have a special place in the function annotation of proteins. In these techniques, protein function prediction (PFP) is usually treated as a classification task in which each function is considered a class label (Cerri and de Carvalho, 2011). Since each protein may be involved in more than one biological activity, it can be associated with multiple class labels at the same time. Moreover, the functional classes are organized in a hierarchical structure, so that a protein that belongs to some class c automatically belongs to all the ancestors of c. These two characteristics of the function prediction

F. Golzari, S. Jalili / Journal of Theoretical Biology 377 (2015) 10–24

task place it among the hierarchical multi-label classification (HMC) problems (Cerri and de Carvalho, 2011). HMC problems differ from standard classification problems in two ways: 1) each instance can be a member of more than one class at the same time and 2) the classes are not independent of each other but are related in a tree or graph hierarchy (Schietgat et al., 2010; Cerri et al., 2012; Cerri and de Carvalho, 2010; Chen and Hu, 2012; Kocev et al., 2013). Examples of HMC problems are found in many real-world application domains such as image annotation (Feng and Xu, 2010), text classification (Chang et al., 2008), music categorization (Trohidis et al., 2008), scene classification (Dimou et al., 2009) and functional genomics.

Many schemes have been proposed to standardize the categorization of known protein functions. The Functional Catalogue (FunCat) (Ruepp et al., 2004) and the Gene Ontology (GO) (Botstein et al., 2000) are the two best-known schemes. These two online databases not only provide functional annotations for genes and gene products such as proteins, but also represent the hierarchical relationships between the functional classes. The GO includes thousands of protein functions structured in a directed acyclic graph, so a functional class may have more than one parent. A small sample of the GO hierarchical structure is shown in Fig. 1. The FunCat is a tree-structured class hierarchy that organizes hundreds of functional classes in six levels. Fig. 2a shows a small portion of the FunCat taxonomy, and Fig. 2b shows the corresponding tree representation of the classes in Fig. 2a. Many researchers have focused on either GO or FunCat alone in their proposed methods. Our method supports both the GO and FunCat taxonomies.

A solution to an HMC problem should preserve the consistency between classes according to their parent–child relations in the hierarchical structure.
This is called the hierarchy constraint (Schietgat et al., 2010; Chen and Hu, 2012; Kocev et al., 2013; Vens et al., 2008; Vens et al., 2010), also known as the True Path Rule (TPR) (Valentini, 2011). More precisely, for the protein function prediction problem, the hierarchy constraint or TPR can be summarized as follows: “annotating a protein to a class in the hierarchy is automatically transferred to its ancestors, while proteins unannotated with a class cannot be annotated with its descendants” (Valentini, 2011). For example, in the tree hierarchy shown in Fig. 3, if a protein belongs to functional class 3.1, it should also belong to functional classes 3 and 0 (Fig. 3a). Also, if a protein does not belong to class 3, it should not belong to classes 3.1 and 3.2 (Fig. 3b).

Fig. 1. Relationships between GO terms are structured according to a directed acyclic graph.

There are two main approaches to the PFP task: 1) the local approach, in which the original problem is transformed into a set of binary problems and a separate classifier is learned for each class (Chen and Hu, 2012; Valentini, 2011; Barutcuoglu et al., 2006; Alvares-Cherman et al., 2012) or each parent in the hierarchy (Vens et al., 2008; Cerri and de Carvalho, 2009) or each level of the hierarchy (Cerri and de Carvalho, 2011; Clare and King, 2003), and 2) the global approach, in which a single global classifier is trained to predict all the functional classes of a protein at once (Schietgat et al., 2010; Cerri et al., 2012; Kocev et al., 2013; Vens et al., 2008; Stojanova, 2013; Otero et al., 2010). Because the number of functional classes is usually large, learning a separate model for each functional class would not be cost effective. Typically, a learned global model is smaller than the total size of the models built for each functional class. In this paper, we propose a global algorithm that predicts all classes of an instance at once.

Protein function prediction is one of the most complex and challenging problems in machine learning for several reasons: 1) the large number of discovered functional classes (Valentini, 2011; Cesa-Bianchi et al., 2011), 2) the possibility of multiple annotations for each protein (Valentini, 2011; Cesa-Bianchi et al., 2011), and 3) the hierarchical relationships between functional classes (Valentini, 2011; Cesa-Bianchi et al., 2011), which cause two further challenges: 3.1) the hierarchy constraint and 3.2) the imbalance between positive and negative examples of functional classes, because the number of positive instances decreases when moving down the hierarchy.
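As an illustration of challenges 3.1 and 3.2, the following minimal Python sketch uses the toy hierarchy of Fig. 3 (classes 0, 3, 3.1 and 3.2). The names `PARENT`, `ancestors`, `close_upward` and `is_consistent`, as well as the four example annotations, are hypothetical and are not part of the proposed method; the sketch only shows how annotations propagate to ancestors and why deeper classes end up with fewer positive examples.

```python
from collections import Counter

# Toy tree hierarchy from Fig. 3: child -> parent (the root "0" has no parent).
PARENT = {"3": "0", "3.1": "3", "3.2": "3"}


def ancestors(cls):
    """Return all ancestors of cls, walking up to the root."""
    result = []
    while cls in PARENT:
        cls = PARENT[cls]
        result.append(cls)
    return result


def close_upward(labels):
    """Add every ancestor of every annotated class (hierarchy constraint)."""
    closed = set(labels)
    for cls in labels:
        closed.update(ancestors(cls))
    return closed


def is_consistent(labels):
    """True if each annotated class has all of its ancestors annotated."""
    return all(a in labels for c in labels for a in ancestors(c))


# Hypothetical annotations for four proteins (illustrative only).
annotations = [{"3.1"}, {"3.2"}, {"3"}, {"3.1", "3.2"}]
closed = [close_upward(y) for y in annotations]

# Number of positive examples per class after applying the constraint.
positives = Counter(c for y in closed for c in y)
```

In this toy example every protein is a positive example for classes 0 and 3, whereas classes 3.1 and 3.2 each have only two positives, which is the kind of imbalance described in challenge 3.2.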
Due to these challenges, there is a need for an algorithm that: 1) can handle a large number of functional classes in reasonable time, 2) can identify all of the functions of an uncharacterized protein, 3) preserves the consistency between the functional classes by respecting the hierarchy constraint and 4) can learn a model with acceptable performance even when the functional classes are unbalanced. In response to this need, we propose in this paper an algorithm with the capabilities listed above.

Some of the existing global methods (Schietgat et al., 2010; Kocev et al., 2013; Vens et al., 2008; Stojanova, 2013) use the predictive clustering tree (PCT) framework (Blockeel et al., 2000), which generalizes standard (crisp) decision trees to the prediction of structured classes. The PCT framework views a decision tree as a hierarchy of clusters: the top node corresponds to one cluster containing all instances, which is recursively split into smaller clusters from the root down to the leaves of the tree. The heuristic used in PCTs for selecting the best “attribute-value” is the reduction of variance between the classes of instances. Maximizing the variance reduction maximizes the node/cluster homogeneity and improves the predictive performance. In addition, the prototype function that computes the label for each leaf can be instantiated for a given learning task. So far, PCTs have been instantiated for hierarchical multi-label problems (Vens et al., 2008), multi-target problems (Struyf and Džeroski, 2006; Kocev et al., 2007) and time-series problems (Džeroski et al., 2007; Slavkov et al., 2010). We have also used this framework to induce a decision tree that supports multiple labels.

In crisp decision trees, a set of instances is partitioned into distinct sets with sharp decision boundaries, and this partitioning continues until a stopping condition is met.
In these decision trees, because the data are partitioned into sets with sharp boundaries, small changes in the attribute values of a new instance may result in unexpected and incorrect changes in the predicted label of the instance and, finally, misclassification. To overcome this problem, we can apply fuzzy logic (Lowen, 1996) and build a fuzzy decision tree. In a fuzzy decision tree, each instance belongs to every set with a membership degree between 0 and 1. Thus, the sets do not have clearly defined boundaries, and small changes in the attribute values of an instance cause far fewer classification errors.

01 METABOLISM
  01.01 amino acid metabolism
    01.01.03 assimilation of ammonia, metabolism of the glutamate group
    01.01.05 metabolism of urea cycle, creatine and polyamines
    …
  01.02 nitrogen, sulfur and selenium metabolism
    01.02.02 nitrogen metabolism
    01.02.03 sulfur metabolism
    …
02 ENERGY
  02.01 glycolysis and gluconeogenesis
…

Fig. 2. (a) A small portion of the hierarchical structure of FunCat. (b) Tree structure corresponding to the functional classes in (a).

Fig. 3. An example that clarifies the concept of the hierarchy constraint. The nodes show classes and the edges represent their relationships in the hierarchy. (a) If a protein is predicted to belong to class 3.1, it should also be predicted to belong to classes 3 and 0. (b) If a protein is predicted not to belong to class 3, it should not be predicted to belong to classes 3.1 and 3.2.

In this paper, a Variance Reduction based Binary Fuzzy Decision Tree induction (VR-BFDT) algorithm is proposed to predict the functions of proteins. This algorithm:

1. Fuzzifies only the decision boundaries in the decision tree induction process.
2. Uses label variance reduction as the splitting criterion to select the best “attribute-value” at each node of the decision tree.
3. Can assign multiple labels (more than one functional class) to each protein and predicts all these functional classes at once (there is no need to learn a separate model for each functional class).
4. Preserves the consistency between functional classes according to the hierarchy constraint.
5. Can be applied to both tree and graph structures of functional classes.
6. Has higher efficiency and accuracy than other methods proposed for the protein function prediction problem.
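The difference between a crisp and a fuzzy decision boundary can be sketched as follows. The sketch assumes a plain logistic sigmoid with a hypothetical `steepness` parameter and a hypothetical split point of 5.0; it is meant only to illustrate the idea and is not necessarily the membership function used by VR-BFDT.

```python
import math


def crisp_membership(value, split_point):
    """Crisp split: the instance goes entirely to one child."""
    return 1.0 if value >= split_point else 0.0


def fuzzy_membership(value, split_point, steepness=2.0):
    """Degree of membership in the 'value >= split_point' child.

    Assumption: a plain logistic sigmoid; `steepness` is a hypothetical
    parameter that is not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (value - split_point)))


split = 5.0
for v in (4.9, 5.1):
    right = fuzzy_membership(v, split)
    # Columns: value, crisp degree, fuzzy degree (right child), fuzzy degree (left child)
    print(v, crisp_membership(v, split), round(right, 3), round(1.0 - right, 3))
```

With the crisp split, the two nearby values 4.9 and 5.1 fall entirely into different children, while under these assumptions their fuzzy membership degrees differ by only about 0.1.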

As demonstrated by a series of recent publications (Chen et al., 2013, 2014; Qiu et al., 2014; Lin et al., 2014; Liu et al., 2014; Xu et al., 2014) in response to the call (Chou, 2011), to establish and report a truly useful statistical predictor for a biological system, the following steps should be made very clear: (a) construct or select a valid benchmark dataset to train and test the predictor; (b) formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm (or engine) to operate the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (e) establish a user-friendly web-server for the predictor that is accessible to the public.

The rest of this paper is organized as follows: Section 2 reviews previous work on protein function prediction. Section 3 presents the VR-BFDT method. The datasets, the evaluation measures and the experimental results are described in Section 4. Finally, Section 5 concludes the paper.

2. Related works

In this section, we briefly review recent work on the protein function prediction task and other related efforts in biological systems. Cerri and de Carvalho (2011) proposed a local method, HMC-LMLP, in which a three-layer artificial neural network is trained with the backpropagation algorithm for each hierarchical level of the FunCat structure. To correct inconsistencies that may occur during classification, a post-processing phase is applied after new instances are classified. Vens et al. (2008) compared three algorithms: CLUS-HMC, CLUS-HSC and CLUS-SC. CLUS-HMC is a global algorithm based on the PCT framework that learns a single decision tree to predict all classes of a protein at once. In this algorithm, maximization of the variance reduction between the labels of instances is used as the heuristic function for finding the best “attribute-value” at each node of the decision tree. The prototype function used to label a leaf node is the mean of the class labels of the instances that have reached that leaf node. CLUS-HSC is a local method that learns a decision tree for each functional class and takes the hierarchical relationships between functional classes into account when learning the decision trees. CLUS-SC is also a local method like CLUS-HSC, except that it ignores the dependencies between functional classes. CLUS-HMC and CLUS-HSC can be used for both FunCat and GO structures.
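The variance reduction heuristic and the mean prototype described above for CLUS-HMC can be sketched in a few lines of Python. The sketch assumes that the variance of a set of label vectors is the mean squared weighted Euclidean distance to the mean label vector; the function names, the toy data and the class weights are hypothetical and serve only as an illustration, not as the implementation used in the cited work.

```python
def mean_vector(labels):
    """Per-class mean of a non-empty list of binary label vectors."""
    n = len(labels)
    return [sum(col) / n for col in zip(*labels)]


def weighted_variance(labels, weights):
    """Mean squared weighted Euclidean distance of label vectors to their mean."""
    if not labels:
        return 0.0
    mean = mean_vector(labels)
    total = 0.0
    for vec in labels:
        total += sum(w * (v - m) ** 2 for w, v, m in zip(weights, vec, mean))
    return total / len(labels)


def variance_reduction(values, labels, weights, split_point):
    """Variance reduction of the binary split `value >= split_point`."""
    left = [lab for val, lab in zip(values, labels) if val < split_point]
    right = [lab for val, lab in zip(values, labels) if val >= split_point]
    n = len(labels)
    children = sum(len(part) / n * weighted_variance(part, weights)
                   for part in (left, right))
    return weighted_variance(labels, weights) - children


# Hypothetical toy data: one numeric attribute and three classes.
values = [1.0, 2.0, 8.0, 9.0]
labels = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
weights = [1.0, 0.5, 0.5]  # assumed class weights (illustrative only)

# Pick the candidate split point with the largest variance reduction.
best = max((2.0, 8.0, 9.0),
           key=lambda s: variance_reduction(values, labels, weights, s))

# Prototype of the right-hand leaf: mean label vector of its instances.
leaf_prototype = mean_vector(
    [lab for val, lab in zip(values, labels) if val >= best])
```

In this toy example the split at 8.0 separates the two groups of identical label vectors, so both children have zero variance and the reduction equals the variance of the parent node.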


Schietgat et al. (2010) presented CLUS-HMC-ENS, an ensemble of decision trees constructed by the CLUS-HMC algorithm. The training instances for each decision tree are selected by bagging (Breiman, 1996). The number of trees in the ensemble is given as a parameter to the algorithm. Kocev et al. (2013) also proposed ensembles of predictive clustering trees for predicting multiple target variables and for hierarchical multi-label classification, especially the PFP task. They used bagging and random forests as ensemble learning methods. Cerri et al. (2012) employed a genetic algorithm as a global method. Their algorithm, called HMC-GA, evolves individuals that represent classification rules in order to optimize the level of coverage of each individual. The set of optimized individuals is then used to determine the set of classes to be predicted. Otero et al. (2010) proposed a novel ant colony algorithm named hmAnt-Miner that discovers an ordered list of IF-THEN classification rules. Stojanova (2013) improved the heuristic function used in the CLUS-HMC algorithm by dropping the assumption that instances are independent and identically distributed, that is, by taking the correlation between instances into account. In her algorithm, called NHMC, the heuristic function is defined as a linear combination of the variance reduction of the label vectors of instances and the correlation between the class labels of instances, both of which are maximized. Sokolov and Ben-Hur (2010) proposed a global method that models the structure of the GO hierarchy using kernel methods for structured-output spaces. Valentini (2011) proposed a TPR-inspired method that preserves the consistency between the functional classes predicted by binary SVM classifiers in both the GO and FunCat structures.
Chen and Hu (2012) used a local method in which a hierarchical oversampling approach balances the skewed training subsets before an SVM model is learned for each functional class in the FunCat structure. They also improved the TPR consistency approach (Valentini, 2011) for combining the probability outputs of the SVM models. Moreover, in the field of multi-label classification, some efforts have addressed the prediction of the subcellular location of proteins using a novel classifier called the multi-label KNN classifier (Lin et al., 2013; Wu et al., 2012, 2011; Xiao et al., 2011). Chou (2013) presents a comprehensive review of recent work in this area, covering both conceptual aspects and detailed mathematical formulations. This review also justifies the use of the GO approach in developing predictive methods to identify protein subcellular localization and functions, which convinces us that the GO approach can significantly improve the quality of prediction. Some fuzzy approaches have also been used to deal with various biological problems (Ding et al., 2007; Shen et al., 2005), but so far fuzzy approaches have not been used to solve the PFP problem.

3. Proposed method

In this section, we explain our proposed method in five subsections. Subsection 3.1 defines the PFP problem in mathematical form, and subsection 3.2 gives the definitions used by our method. Subsections 3.3 and 3.4 present the pseudo-code of the learning and prediction phases of our method, and finally the consistency preservation of our method is proven. Fig. 4 illustrates our proposed method schematically.

3.1. Notations

In our proposed global method, multiple functional classes from the set $C = \{c_1, c_2, \ldots, c_m\}$ can be annotated for each single protein $x_i \in X = \{x_1, x_2, \ldots, x_n\}$ simultaneously. Each functional class $c_j$ has a weight $W(c_j)$. The set $y_i \in Y = \{y_1, y_2, \ldots, y_n\}$ stores the functional classes that protein $x_i$ belongs to; therefore $y_i \subseteq C$. Each protein $x_i \in X$ has $d$ numeric attributes $A = \{a_1, a_2, \ldots, a_d\}$ that describe its specification. The index $1 \le i \le n$ ranges over the $n$ proteins in the set $X$, and the index $1 \le j \le m$ ranges over the $m$ functional classes in the set $C$. The goal is to learn a binary fuzzy decision tree that predicts a set of functional classes for any given unknown protein. To learn this model, the ordered pair $\langle X, Y \rangle$, which represents the set of training proteins and their related classes, the functional classes $C$ and the weights of the functional classes ($W(C)$) are given to the VR-BFDT method as input. When the learning phase is finished, the learned binary fuzzy decision tree is returned as output.

3.2. Definitions

Definition 1 (protein label). To facilitate working with the set of functional classes, a binary label vector $V_i$ is assigned to each protein $x_i \in X$, such that the $j$-th component of the vector is 1 if the protein belongs to functional class $c_j$ and 0 otherwise. (In the rest of this paper, the phrase “protein label” in the training phase refers to the vector $V_i$ corresponding to the protein.)

Definition 2 (membership function). Some fuzzy decision tree induction algorithms have been suggested for standard classification problems in which the decision boundaries at the tree nodes are fuzzified (Chandra and Varghese, 2009; Qiu, 2011). These algorithms assume that the dataset consists only of numerical attributes; therefore, the fuzzy decision tree built by these algorithms is a binary fuzzy decision tree. Our proposed method, VR-BFDT, finds fuzzy decision boundaries. More precisely, under the assumption that all attributes are numerical, once the best split attribute and best split point have been selected at a node S of the fuzzy decision tree, the instances are divided into two child nodes, or two fuzzy sets, each with a membership degree between 0 and 1: the set of instances whose value for the best attribute is greater than or equal to the split point, and the set of instances whose value for the best attribute is less than the split point. Thus, a fuzzy decision boundary is created at the split point. To create fuzzy decision boundaries we use a sigmoid membership function (Chandra and Varghese, 2009; Qiu, 2011), which is defined by the following equation:

$\mu(\mathrm{val}(x_i, a)) = \ldots$
