Journal of Theoretical Biology 362 (2014) 3–8


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Multiclass classification of sarcomas using pathway based feature selection method

Jian-lei Gu a,b,1, Yao Lu a,b,1, Cong Liu c, Hui Lu a,b,c,*

a Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, Shanghai 200040, China
b Key Laboratory of Molecular Embryology, Ministry of Health & Shanghai Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, China
c Department of Bioengineering, Bioinformatics Program, University of Illinois at Chicago, Chicago, IL 60607, USA

Article info

Article history: Received 1 February 2014; received in revised form 3 June 2014; accepted 28 June 2014; available online 8 July 2014

Keywords: Pathway activity

Abstract

Feature selection is an important research topic in bioinformatics, and a large number of methods have been developed to date. Recently, several pathway based feature selection protocols, such as the condition-responsive genes method, have been proposed for better classification performance. However, these conventional pathway based methods may select relevant but redundant genes in a given pathway while missing other useful genes. These methods were also limited to binary classification, whereas many clinical problems, such as the classification of sarcomas, call for a multiclass protocol. Here, we propose a new pathway based feature selection method, named the Redundancy Removable Pathway based feature selection method (RRP), for binary and multiclass classification problems. Three classifiers were implemented to compare the performance and gene functions of the gene based, conventional pathway based, and RRP methods. The validation results suggest that the RRP method is a feasible and robust feature selection method for multiclass prediction problems. © 2014 Elsevier Ltd. All rights reserved.

* Corresponding author at: Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, Shanghai 200040, China. E-mail address: [email protected] (H. Lu).
1 Equally contributing authors.

http://dx.doi.org/10.1016/j.jtbi.2014.06.038
0022-5193/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Identification of biomarkers is very important for characterizing and predicting the progress and prognosis of many complex diseases (Alon et al., 1999; Annest et al., 2009; Fan et al., 2011; Ross et al., 2000; Saeys et al., 2007; Somorjai et al., 2003). Machine learning methods have been widely used to improve prediction performance (Bhardwaj et al., 2005; Langlois and Lu, 2010). Biomarker discovery methods fall into two distinct strategies: selection and extraction (Li et al., 2008). Selection methods choose a subset of the original features according to their discriminative power, for example by statistical tests (t-test, Fisher test, etc.) (Saeys et al., 2007). Extraction methods project the whole data set into a lower dimensional space and construct latent variables as new features. Linear transformation, principal component analysis (PCA) and partial least squares (PLS) are the most frequently used methods for feature extraction (Boulesteix, 2004; Li et al., 2008). However, both strategies are limited in that they treat the genes (features) as independent variables and ignore the hidden relations between them (Ding and Peng, 2005; Gentleman et al., 2004; Saeys et al., 2007). Accordingly, some relevant features might be overlooked during feature selection because of over-represented protein complexes, signal transduction cascades and other hidden relations (Lee et al., 2008). In many cases those less significant features also represent truly relevant genes.

One way to address this issue is to integrate secondary information (such as metabolic pathway or protein–protein interaction information (Chuang et al., 2007)) to restrict the feature search space and to reduce the hidden molecular interactions among selected features, which are sometimes called ‘redundant’ genes (Ding and Peng, 2005). Among these approaches the pathway based method is dominant (Bild et al., 2006; Chuang et al., 2007; Guo et al., 2005; Lee et al., 2008; Pitak et al., 2011). Guo et al. used the mean and median expression of genes to represent the markers for a given pathway (Guo et al., 2005), while Bild et al. used feature extraction based on the first component of a PCA (Bild et al., 2006). In 2008, Lee et al. introduced the Condition Responsive Genes (CORGs) method, which selects a subset of genes in a given pathway to represent the pathway markers and may lead to better classification accuracy and robustness (Lee et al., 2008; Pitak et al., 2011). Moreover, pathway based feature selection methods are not limited to gene expression data. It has also been reported that a pathway based classifier obtained biomarkers for autism diagnosis and identified a set of cellular processes common to autism across different populations (Skafidas et al., 2012).


Fig. 1. Dashed box 1 indicates the microarray data processing module. The log-transformed expression levels are converted to z-scores for further analysis. The genes (probes) are mapped to the corresponding pathways according to their pathway annotation information (dashed box 2). For each pathway, the member genes are ranked by their RRP score. A greedy search algorithm (dashed box 3) is performed to identify the subset of genes with the locally maximal RRP scores. The gene expressions of the feature set are then aggregated and represented as the expression level of the pathway. Finally, the pathways are ranked by their discriminative power, and the top-n pathways are selected as the pathway biomarkers for a given class. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

These pathway based methods aim to remove potentially redundant genes by allowing only a limited number of features from each pathway; however, they are not designed to remove the ‘redundant’ marker genes within a given pathway. Genes from the same pathway are expected to be more likely to co-express. To address this within-pathway redundancy, we propose a new pathway based feature selection algorithm named the Redundancy Removable Pathway based feature selection method (RRP). Unlike existing pathway based methods, our new scoring and search algorithm incorporates the correlation coefficients between candidate features and selected features during feature identification. Relevance and expression correlation are optimized simultaneously in the RRP approach, so that the ‘redundant’ genes of a given pathway can be removed from the feature set. In addition, we have extended the RRP method beyond binary classification to multiclass classification, and compared its performance with CORGs and a conventional gene based method on three public sarcoma microarray datasets.

Sarcomas are a group of malignant tumors that occur in extraskeletal non-epithelial tissues (Goldberg, 2007). Based on the type of tissue from which they arise, sarcomas can be divided into as many as 50 different subtypes (Coindre et al., 2001). To date, the diagnosis and classification of sarcomas remain a challenge because of their heterogeneity and the lack of effective markers for detecting their boundaries (Osuna and de Alava, 2009). Previous work on subtype classification of sarcomas aimed to identify maximally relevant markers; it therefore concentrated mostly on tissue-specific genes and might have overlooked some important but less predictive signals (Konstantinopoulos et al., 2010). Most sarcomas exhibit features similar to those of healthy tissue (Osuna and de Alava, 2009), but their subtype-specific markers are hard to identify. Sarcomas are therefore an ideal class of disease for our feature selection method, which is designed to detect such signals.

In the following, we first describe the protocol of our RRP method, then present the results on sarcoma subtype classification, and finally discuss the biological relevance of the selected markers (Fig. 1).

2. Materials and methodology

Our method searches for the set of genes (features) that has the most discriminative power in a given pathway. An overview of the novel feature selection algorithm is shown in Fig. 2. In Section 2.1, we define the scoring functions. In Section 2.2, we present the feature selection algorithm that chooses a set of informative genes from given pathways for better classification. We then describe the classification procedure in Section 2.3.

2.1. Definition

2.1.1. Class-specific F-score

The F-score is designed to measure the ratio of inter-class to intra-class gene expression distances. A larger absolute value of the F-score usually indicates stronger discriminative power. To avoid the bias caused by imbalanced class sizes, a modified class-specific F-score is defined as follows (Rajapakse and Mundra, 2013):

\[ F_{il} = G(\bar{x}_{il} - \bar{x}_i)\,\frac{m_l\,(\bar{x}_{il} - \bar{x}_i)^2}{\sum_{j=1}^{m_l} (x_{ijl} - \bar{x}_{il})^2} \tag{1} \]

\[ G(x) = \begin{cases} 1, & \text{if } x > 0 \\ -1, & \text{otherwise} \end{cases} \tag{2} \]

where $\bar{x}_i$ is the average expression level of gene i across all the samples; $\bar{x}_{il}$ is the average expression level of gene i across the samples in class l; $m_l$ is the number of samples in class l; and $x_{ijl}$ is the expression level of gene i in sample j belonging to class l.
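As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes the class-specific F-score of a single gene. The function name and the toy expression values are hypothetical and are not taken from the original implementation, which was written in R.

```python
from statistics import mean


def class_specific_f_score(values, labels, target):
    """Signed class-specific F-score of one gene for class `target`.

    values: expression levels of the gene, one per sample
    labels: class label of each sample
    """
    overall_mean = mean(values)                        # average over all samples
    in_class = [v for v, lab in zip(values, labels) if lab == target]
    class_mean = mean(in_class)                        # average within class l
    m_l = len(in_class)
    within = sum((v - class_mean) ** 2 for v in in_class)
    if within == 0:
        raise ValueError("zero within-class scatter: F-score is undefined")
    sign = 1 if class_mean - overall_mean > 0 else -1  # indicator G(x), Eq. (2)
    return sign * m_l * (class_mean - overall_mean) ** 2 / within


# Toy data (hypothetical): three samples of class A, three of class B.
expression = [2.0, 3.0, 4.0, 0.0, 1.0, 2.0]
classes = ["A", "A", "A", "B", "B", "B"]
print(class_specific_f_score(expression, classes, "A"))  # 1.5
print(class_specific_f_score(expression, classes, "B"))  # -1.5
```

The sign carries the direction of regulation relative to the overall mean, which is what later allows up-regulated and down-regulated genes to be handled separately.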


Input: a set S of member genes of pathway a for class l
For each gene i in S
    Calculate F-score F_il
    If F_il for gene i > 0
        add gene i to set S_positive
    Else
        add gene i to set S_negative
End For
For each of S_positive and S_negative do
    Let S' = S_positive or S_negative
    Sort S' by F_il in descending order for the S_positive set and ascending otherwise
    Add top-ranked gene j to feature set P
    Calculate pathway activity of feature set P
    Initialize discriminative score (F-score) DS for pathway activity
    While |P| < |S'|
        Rank the remaining genes of S' by RRP score and add the top-ranked gene x to feature set P
        Calculate discriminative score DS_new for the pathway activity of feature set P
        If DS_new >= DS
            Update discriminative score DS with DS_new
        Else
            remove last gene x from feature set P and stop
    End While
End For
Output: feature set P of pathway a for class l

Fig. 2. Pseudo code of the greedy search algorithm.

$G(x)$ is an indicator function that equals 1 if the condition $\bar{x}_{il} - \bar{x}_i > 0$ is true and $-1$ otherwise.

2.1.2. Correlation assessment and RRP score

The major difference between our RRP method and conventional pathway based ones is that it incorporates the correlation coefficients between candidate features and selected features into the scoring. It has frequently been noted that expression correlations among features affect classification performance (Ding and Peng, 2005; Xiong et al., 2001). Here, the average correlation coefficient between a candidate gene i and the current gene set {S} is defined as

\[ R_{is} = \frac{\sum_{j=1}^{|S|} |\mathrm{cor}(i,j)|}{|S|} \tag{3} \]

where $\mathrm{cor}(i,j)$ is the gene expression correlation between candidate gene i and a selected gene j in the current feature set {S}. The absolute value is taken to measure the strength of the relationship between two genes, and |S| is the number of genes in the current feature set. Next, we design a novel RRP scoring system to identify key genes that yield the most discriminative activities (i.e. the largest $|F_{il}|$) and have the weakest relationship with the current feature set (i.e. the smallest $R_{is}$). In this study, two RRP scores are adopted:

\[ \mathrm{RRP.Q}_{il} = \frac{|F_{il}|}{R_{is}} \tag{4} \]

\[ \mathrm{RRP.D}_{il} = |F_{il}| - R_{is} \tag{5} \]
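Eqs. (3) to (5) can be sketched in a few lines of Python. The sketch assumes the Pearson correlation coefficient (the measure named in the caption of Fig. 3); the text of Eq. (3) does not state which correlation is used, and all function names here are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors.

    Assumes both vectors have non-zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_abs_correlation(candidate, selected):
    """R_is of Eq. (3): mean |cor| between a candidate and the selected genes."""
    return sum(abs(pearson(candidate, s)) for s in selected) / len(selected)


def rrp_q(f_score, r_is):
    """Quotient score of Eq. (4); undefined when r_is is exactly zero."""
    return abs(f_score) / r_is


def rrp_d(f_score, r_is):
    """Difference score of Eq. (5)."""
    return abs(f_score) - r_is


# Toy data (hypothetical): one candidate gene and two already selected genes.
candidate = [1.0, 2.0, 3.0, 4.0]
selected = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
r = mean_abs_correlation(candidate, selected)  # both |cor| values are 1
print(rrp_q(-3.0, r), rrp_d(-3.0, r))
```

Because the absolute correlation is used, a perfectly anti-correlated gene is penalized as strongly as a perfectly correlated one: both carry the same redundant signal.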

2.1.3. Pathway activity inference and discriminative score

Our goal is to classify disease subtypes using pathway activity based information. Lee et al. and Sootanan et al. defined the activity of a pathway for a given sample j using a weighted z-score:

\[ PAC_j = \sum_{i=1}^{k} \frac{Z_{ij}}{\sqrt{k}} \]

where $Z_{ij}$ is the standardized expression score (z-score of the log-normalized expression) of gene i in sample j, and k is the number of genes contributing to the activity. Instead of using all genes in the pathway to infer its activity, our method searches for a set of “representative” genes within it, so k here is the number of selected genes. In this work, up-regulated and down-regulated genes in a given pathway are assessed and selected separately, and two pathway activity values are generated correspondingly: $PAC^{+}$ for up-regulated genes with F-score > 0 and $PAC^{-}$ otherwise. To rank pathways by their discriminative power, a discriminative score based on the F-score of the pathway activity is defined.
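The pathway activity formula above amounts to summing the z-scores of the k selected genes in each sample and dividing by the square root of k. A minimal Python sketch, with a hypothetical function name and toy z-scores, is:

```python
from math import sqrt


def pathway_activity(z_scores):
    """Pathway activity of every sample.

    z_scores: one list per selected gene, holding that gene's z-scores
    across the samples (all lists have the same length).
    """
    k = len(z_scores)                 # number of selected genes
    n_samples = len(z_scores[0])
    return [sum(gene[j] for gene in z_scores) / sqrt(k) for j in range(n_samples)]


# Toy data (hypothetical): four selected genes measured in two samples.
z = [[1.0, -1.0], [3.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
print(pathway_activity(z))  # [2.0, 1.0]
```

Dividing by the square root of k rather than by k keeps the variance of the activity comparable across feature sets of different sizes when the z-scores are uncorrelated.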

2.2. Feature selection algorithm

For a given pathway, a greedy search is performed to identify a subset of member genes for which the discriminative score is locally maximal. Pseudo code depicting our feature selection algorithm is shown in Fig. 2. In brief, the feature set is initialized with the gene that has the top-ranked F-score and then grows iteratively. In each iteration, the remaining genes in the pathway are ranked by their RRP scores, the top-ranked candidate is added to the feature set, and the discriminative score of the new feature set is calculated. The iteration terminates when the new discriminative score becomes smaller than the previous one, and the last added gene is then removed. In the end, a subset of member genes with locally maximal RRP scores for this pathway is obtained.
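The greedy search of Section 2.2 can be sketched as follows. This is an illustrative Python reading of the description, not the authors' R code. It makes three assumptions: Pearson correlation is used, the discriminative score is the absolute F-score of the pathway activity, and the caller passes only the genes of one direction (up- or down-regulated). All names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean


def f_score(values, labels, target):
    """Signed class-specific F-score (Eq. (1)) of one feature vector."""
    in_class = [v for v, lab in zip(values, labels) if lab == target]
    diff = mean(in_class) - mean(values)
    within = sum((v - mean(in_class)) ** 2 for v in in_class)
    return (1 if diff > 0 else -1) * len(in_class) * diff ** 2 / within


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def activity(genes, expr):
    """Per-sample pathway activity: sum of z-scores divided by sqrt(k)."""
    n = len(expr[genes[0]])
    return [sum(expr[g][j] for g in genes) / sqrt(len(genes)) for j in range(n)]


def greedy_select(expr, labels, target, score="Q"):
    """Greedy RRP search; expr maps gene name -> z-scores across samples."""
    f = {g: f_score(v, labels, target) for g, v in expr.items()}
    remaining = sorted(expr, key=lambda g: abs(f[g]), reverse=True)
    selected = [remaining.pop(0)]                 # seed: top F-score gene
    best = abs(f_score(activity(selected, expr), labels, target))
    while remaining:
        def rrp(g):
            r = mean(abs(pearson(expr[g], expr[s])) for s in selected)
            return abs(f[g]) / r if score == "Q" else abs(f[g]) - r
        cand = max(remaining, key=rrp)            # top-ranked candidate
        remaining.remove(cand)
        new = abs(f_score(activity(selected + [cand], expr), labels, target))
        if new < best:                            # score dropped: discard, stop
            break
        selected.append(cand)
        best = new
    return selected, best


# Toy data (hypothetical): g2 nearly duplicates g1, g3 complements it.
expr = {
    "g1": [1.0, 1.2, 0.8, -1.0, -1.2, -0.8],
    "g2": [0.7, 1.5, 0.8, -1.2, -1.0, -0.8],
    "g3": [1.0, 0.7, 1.3, -1.0, -0.7, -1.3],
}
labels = ["A", "A", "A", "B", "B", "B"]
print(greedy_select(expr, labels, "A")[0])  # ['g1', 'g3']
```

In the toy data, g3 is added because its within-class fluctuations offset those of g1, which raises the discriminative score of the combined activity. Adding g2 afterwards lowers the score, so the search stops.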


2.3. Classifiers

For multiclass classification, a simple non-parametric k-nearest neighbor classifier (kNN, k = 15), a naive Bayes classifier (hereafter Bayes) and a Random Forest classifier are used to make the predictions. In this work, the R (version 3.0.2) packages 'class', 'e1071' and 'randomForest' are used for kNN, Bayes and Random Forest, respectively. The input variables of the classifiers are a set of pathway expression values. Training samples of a given class L are grouped together and then subjected to the feature search algorithm described above. The top n pathways with the highest discriminative scores are retained. The gene subsets identified in those n pathways are then grouped together to generate a class-specific pathway feature set. Finally, the feature sets of the different classes are merged to generate the final feature set.
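For readers unfamiliar with kNN, the prediction step on pathway activity vectors can be sketched in plain Python. This is only an illustration, not the 'class' package used in the study: it assumes Euclidean distance and breaks voting ties by first occurrence, whereas an R implementation may handle ties differently. The function name and toy data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=15):
    """Majority vote among the k training samples nearest to `query`.

    train_x: pathway activity vectors of the training samples
    train_y: class labels of the training samples
    """
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two pathway activities per sample, two classes.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["LIPO", "LIPO", "LIPO", "SYN", "SYN", "SYN"]
print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))  # LIPO
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))  # SYN
```

With k = 15, as in the study, each class needs enough training samples for the vote to be meaningful, which is worth keeping in mind for the smaller classes in Table 1.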

Table 1
Sarcoma dataset information.

Dataset      Platform     Type    # Samples
E-MEXP-353   Affy U133A   FIBRO   10
                          LEIO    16
                          LIPO    26
                          MPNST   8
                          SYN     20
GDS1209      Affy U133A   FIBRO   7
                          LEIO    6
                          LIPO    7
                          MPNST   0
                          SYN     4
GDS2736      Affy U133A   FIBRO   4
                          LEIO    6
                          LIPO    40
                          MPNST   3
                          SYN     16
Total                             173

2.4. Gene based and CORGs feature selection method

For the gene based feature selection method, we identified the differentially expressed genes for each class of sarcomas using the Limma package in R (Smyth, 2004). After the differential expression analysis, we ranked the candidate genes by p-value and used the top-n genes as the feature set for the specific class. The class-specific feature sets were then merged for multiclass classification. To extend the conventional CORGs method to multiclass classification, a simple one-versus-the-rest strategy was applied to the binary CORGs method (with the NCFS-c pathway activity calculation method) (Lee et al., 2008; Pitak et al., 2011). The top-n pathways with the highest discriminative scores were selected as pathway biomarkers for the corresponding class of sarcomas. As in the gene based method, these class-specific feature sets were merged for multiclass classification.
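The bookkeeping of the gene based procedure in Section 2.4 (one-versus-the-rest labels, top-n ranking by p-value, and merging of the class-specific sets) can be sketched as below. The p-values are made-up toy numbers standing in for the output of a differential expression analysis such as Limma, which is not reimplemented here, and all function names are hypothetical.

```python
def one_vs_rest(labels, target):
    """Binary labels for class `target` against all other classes."""
    return [lab == target for lab in labels]


def top_n(p_values, n):
    """Names of the n features with the smallest p-values."""
    return sorted(p_values, key=p_values.get)[:n]


def merge_class_features(p_values_by_class, n):
    """Union of the class-specific top-n sets, in first-seen order."""
    merged = []
    for p_values in p_values_by_class.values():
        for feature in top_n(p_values, n):
            if feature not in merged:
                merged.append(feature)
    return merged


# Toy p-values (hypothetical), one dictionary per class.
p_by_class = {
    "LIPO": {"g1": 0.001, "g2": 0.2, "g3": 0.01},
    "SYN": {"g1": 0.5, "g2": 0.002, "g3": 0.003},
}
print(one_vs_rest(["LIPO", "SYN", "LIPO"], "LIPO"))  # [True, False, True]
print(merge_class_features(p_by_class, n=2))         # ['g1', 'g3', 'g2']
```

A feature selected for several classes appears only once in the merged set, so the final feature count can be smaller than n times the number of classes.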

3. Results and discussion

In this section, we compare our RRP method with the gene based method and the CORGs method (Lee et al., 2008; Pitak et al., 2011) on the sarcoma classification problem. Comparisons are made on three public sarcoma microarray data sets (Detwiller et al., 2005; Nakayama et al., 2007). We first discuss the correlations of expression values among genes in the same pathway. We then examine whether RRP can increase the performance and robustness of classification. Finally, we illustrate some useful biological information obtained from the features generated by RRP.

Fig. 3. Correlation profiles of member genes from different pathways. The relationship between any pair of genes within the same pathway was measured using the Pearson correlation coefficient. The Gaussian-smoothed density plot (one line per pathway) shows that the expressions of member genes within a pathway are highly correlated. The different colors of the profiles denote the results for different pathways.

3.1. Datasets and pre-process

Three public sarcoma microarray datasets (NCBI GDS1209, GDS2736 and EBI E-MEXP-353) with 173 tumor samples in 5 classes, namely liposarcoma (LIPO), leiomyosarcoma (LEIO), malignant peripheral nerve sheath tumor (MPNST), synovial sarcoma (SYN) and fibrosarcoma (FIBRO), were used in our study (Table 1) (Detwiller et al., 2005; Nakayama et al., 2007). All these datasets were produced on the Affymetrix U133A platform. The Robust Multi-Array Average (RMA) algorithm (Bioconductor package ‘affy’) was used to normalize the raw data of each dataset independently (Gautier et al., 2004; Gentleman et al., 2004; Irizarry et al., 2003). All probes except the quality control ones were retained for further analysis. In our RRP method, the normalized (log-transformed) gene expression values were standardized by z-score across all samples. The pathway information was extracted and integrated from the current Affymetrix U133A annotation file (updated 2012-10-29). Probes without pathway annotation were assigned to an artificial pathway “unknown”. Overall, 59 distinct pathways were obtained for further analysis.

3.2. The expression correlations among member genes of pathway

For a given pathway, pathway based methods typically select marker genes independently, although genes are known to function coordinately within gene families, signaling cascades and protein complexes. Our method takes into account the correlation profiles among genes within the same pathway. As shown in Fig. 3, the expressions of genes from the same pathway were highly correlated with each other (correlation coefficients over 0.6, with the peak centered at around 0.7). However, it is believed that simply combining highly relevant but co-expressed genes may not yield better performance (Ding and Peng, 2005; Xiong et al., 2001). Moreover, when analyzing the data with the CORGs method (all 5 classes), we found that the marker genes of SYN sarcomas were over-represented in the annexin gene family. All 4 marker genes of the prostaglandin synthesis regulation pathway (down-regulated) belong to the annexin gene family; similar results were observed for the metalloproteinase gene family in FIBRO sarcomas and the integrin gene family in MPNST sarcomas. Thus, a new scoring function and search algorithm for the pathway based feature selection method need to be developed to reduce the redundancy.

3.3. The RRP approach

The CORGs method maps the transcriptome profile onto pathways and then applies a search algorithm to identify the feature set for each pathway. Although the CORGs method can lead to more relevant pathways, the selected markers within the same pathway may be redundant. We use the correlation coefficient to identify and remove the redundant marker genes. As shown in Fig. 1, our RRP method first takes the gene expression matrix and the pathway annotation information as input; then, similar to the conventional CORGs method, it maps the gene expression matrix onto the biological pathways; finally, it applies an improved scoring and greedy search algorithm to each pathway to identify the subset of member genes that best represents the pathway activity. The major improvement of our method is the incorporation of the correlation coefficients between candidate features and selected features into the scoring. We used two straightforward ways to integrate relevance and the correlation coefficient: their difference (RRP.D) and their quotient (RRP.Q). To increase the robustness of the pathway markers, we used pathway activity, aggregating the expression of the marker genes in the feature set into a single pathway activity value. The top n most significant pathways were selected as the biomarkers for a given class.

3.4. The experimental design and performance analysis

To validate the performance of the RRP method in multiclass classification, we used 3 independent sarcoma datasets (Table 1) and 5-fold cross-validation. For comparison, the gene based method (top-5 and top-10 most differentially expressed genes), the conventional CORGs method (top-5 and top-10 pathways) and the two scoring variants of RRP (RRP.D and RRP.Q; top-5 and top-10 pathways) were included in our study. Three independent experiments, with 3 classes (liposarcoma, leiomyosarcoma and synovial sarcoma), 4 classes (liposarcoma, leiomyosarcoma, fibrosarcoma and a combined class of malignant peripheral nerve sheath tumor and synovial sarcoma) and 5 classes (liposarcoma, leiomyosarcoma, fibrosarcoma, malignant peripheral nerve sheath tumor and synovial sarcoma), were carried out in our cross-validation analysis (100 random splits of the merged three datasets). As shown in Fig. 4, the accuracies of the pathway based methods were better than those of the gene based method in the 3-class classification, but the accuracies of the CORGs method appear lower than those of the gene based method as the number of classes increases. The Bayes and Random Forest classifiers showed a similar trend. Compared with the conventional CORGs method, our proposed RRP methods (RRP.D and RRP.Q) in general have better accuracies, especially in the 3-class classification (Fig. 4).

Fig. 4. The box plot of prediction accuracies via 5-fold cross-validation. The box plot shows the accuracy distribution of the kNN classifier (k = 15) under 100 random splits of the merged three datasets. The grey, yellow, blue, purple, red, pink, orange and green boxes (left to right) denote the gene based method (top-5 genes), gene based method (top-10 genes), CORGs (top-5 pathways), CORGs (top-10 pathways), RRP.D (top-5 pathways), RRP.D (top-10 pathways), RRP.Q (top-5 pathways) and RRP.Q (top-10 pathways), respectively. The box plot shows that the RRP method achieved higher accuracy compared with the conventional CORGs method.

3.5. Biological relevance of marker genes

The redundancy-reduced class-specific marker genes identified as characteristic for each sarcoma class can provide insights into the biology of sarcomas. Our proposed method recovers relevant biomarkers that might be overlooked by the general pathway based method. In the cross-validation analysis of the 3-class classification (liposarcoma, leiomyosarcoma and synovial sarcoma; Fig. 5), we identified 24 marker genes unique to the RRP method compared with the conventional CORGs method, of which 11 genes are shared by the RRP.D and RRP.Q methods. We found several important cancer-related genes among these 11 marker genes. The DLC1 gene has been identified as a tumor suppressor in many cancers (Lahoz and Hall, 2008). The CTAG1A/CTAG1B gene encodes a cancer/testis antigen, and about 76% of synovial sarcomas were reported to over-express CTAG1A/CTAG1B (Lai et al., 2012). Moreover, our RRP method identified SSX1/SSX2 as a marker gene (overlooked by the conventional CORGs method); it is a partner in the well-known SYT-SSX fusion gene in synovial sarcoma (Crew et al., 1995; Ladanyi, 2001). These observations indicate that the RRP method is able to detect biologically meaningful signals.

Fig. 5. Biological relevance of marker genes: (A) the Venn diagram of the 30 most significant marker genes for the three pathway based methods in the cross-validation analysis of the 3-class classification. (B) The 11 marker genes common to the two RRP methods. The genes marked with a red background are known cancer-related genes, which were overlooked by the conventional CORGs method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Conclusion

In summary, we proposed a new pathway based feature selection method, RRP, in which relevance and expression correlation are optimized simultaneously, because both affect the classification performance of the selected feature set. The approach addresses the previously ignored expression correlations among marker genes in pathway based feature selection. We conducted three data analyses to compare the classification accuracy of our proposed RRP method with that of the conventional methods. In general, the proposed RRP method performs better than the CORGs method in multiclass classification. Moreover, our RRP method identifies important biologically relevant genes. Looking forward, there are several ways to improve upon the basic concept of the RRP method. For example, a recent study (Staiger et al., 2012) showed that the performance of secondary information based classifiers does not rely on the source of the secondary information. Integrating multiple layers of functional information might further extend our RRP method.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China (No. 31071167) and the Medical Engineering Cross Research Foundation of Shanghai Jiao Tong University.

References

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.
Annest, A., Bumgarner, R.E., Raftery, A.E., Yeung, K.Y., 2009. Iterative Bayesian model averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinform. 10, 72, http://dx.doi.org/10.1186/1471-2105-10-72.
Bhardwaj, N., Langlois, R.E., Zhao, G., Lu, H., 2005. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucl. Acids Res. 33, 6486–6493, http://dx.doi.org/10.1093/nar/gki949.
Bild, A.H., Yao, G., Chang, J.T., Wang, Q., Potti, A., Chasse, D., Joshi, M.B., Harpole, D., Lancaster, J.M., Berchuck, A., Olson Jr., J.A., Marks, J.R., Dressman, H.K., West, M., Nevins, J.R., 2006. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439, 353–357, http://dx.doi.org/10.1038/nature04296.
Boulesteix, A.L., 2004. PLS dimension reduction for classification with microarray data. Stat. Appl. Genet. Mol. Biol. 3, 1–32, http://dx.doi.org/10.2202/1544-6115.1075 (Article33).
Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D., Ideker, T., 2007. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140, http://dx.doi.org/10.1038/msb4100180.
Coindre, J.M., Terrier, P., Guillou, L., Le Doussal, V., Collin, F., Ranchere, D., Sastre, X., Vilain, M.O., Bonichon, F., N’Guyen Bui, B., 2001. Predictive value of grade for metastasis development in the main histologic types of adult soft tissue sarcomas: a study of 1240 patients from the French federation of cancer centers sarcoma group. Cancer 91, 1914–1926, http://dx.doi.org/10.1002/1097-0142(20010515)91:10<1914::AID-CNCR1214>3.0.CO;2-3.
Crew, A.J., Clark, J., Fisher, C., Gill, S., Grimer, R., Chand, A., Shipley, J., Gusterson, B.A., Cooper, C.S., 1995. Fusion of SYT to two genes, SSX1 and SSX2, encoding proteins with homology to the Kruppel-associated box in human synovial sarcoma. EMBO J. 14, 2333–2340.
Detwiller, K.Y., Fernando, N.T., Segal, N.H., Ryeom, S.W., D'Amore, P.A., Yoon, S.S., 2005. Analysis of hypoxia-related gene expression in sarcomas and effect of hypoxia on RNA interference of vascular endothelial cell growth factor A. Cancer Res. 65, 5881–5889, http://dx.doi.org/10.1158/0008-5472.CAN-04-4078.
Ding, C., Peng, H., 2005. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3, 185–205 (pii: S0219720005001004).
Fan, X., Shao, L., Fang, H., Tong, W., Cheng, Y., 2011. Cross-platform comparison of microarray-based multiple-class prediction. PLoS One 6, e16067, http://dx.doi.org/10.1371/journal.pone.0016067.
Gautier, L., Cope, L., Bolstad, B.M., Irizarry, R.A., 2004. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315, http://dx.doi.org/10.1093/bioinformatics/btg405.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y., Zhang, J., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80, http://dx.doi.org/10.1186/gb-2004-5-10-r80.
Goldberg, B.R., 2007. Soft tissue sarcoma: an overview. Orthop. Nurs. 26, 4–11; quiz 12–3 (pii: 00006416-200701000-00003).
Guo, Z., Zhang, T., Li, X., Wang, Q., Xu, J., Yu, H., Zhu, J., Wang, H., Wang, C., Topol, E.J., Rao, S., 2005. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinform. 6, 58, http://dx.doi.org/10.1186/1471-2105-6-58.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P., 2003. Summaries of Affymetrix GeneChip probe level data. Nucl. Acids Res. 31, e15.
Konstantinopoulos, P.A., Fountzilas, E., Goldsmith, J.D., Bhasin, M., Pillay, K., Francoeur, N., Libermann, T.A., Gebhardt, M.C., Spentzos, D., 2010. Analysis of multiple sarcoma expression datasets: implications for classification, oncogenic pathway activation and chemotherapy resistance. PLoS One 5, e9747, http://dx.doi.org/10.1371/journal.pone.0009747.
Ladanyi, M., 2001. Fusions of the SYT and SSX genes in synovial sarcoma. Oncogene 20, 5755–5762, http://dx.doi.org/10.1038/sj.onc.1204601.
Lahoz, A., Hall, A., 2008. DLC1: a significant GAP in the cancer genome. Genes Dev. 22, 1724–1730, http://dx.doi.org/10.1101/gad.1691408.
Lai, J.P., Robbins, P.F., Raffeld, M., Aung, P.P., Tsokos, M., Rosenberg, S.A., Miettinen, M.M., Lee, C.C., 2012. NY-ESO-1 expression in synovial sarcoma and other mesenchymal tumors: significance for NY-ESO-1-based targeted therapy and differential diagnosis. Mod. Pathol. 25, 854–858, http://dx.doi.org/10.1038/modpathol.2012.31.
Langlois, R.E., Lu, H., 2010. Boosting the prediction and understanding of DNA-binding domains from sequence. Nucl. Acids Res. 38, 3149–3158, http://dx.doi.org/10.1093/nar/gkq061.
Lee, E., Chuang, H.Y., Kim, J.W., Ideker, T., Lee, D., 2008. Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 4, e1000217, http://dx.doi.org/10.1371/journal.pcbi.1000217.
Li, G.Z., Bu, H.L., Yang, M.Q., Zeng, X.Q., Yang, J.Y., 2008. Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis. BMC Genomics 9 (Suppl 2), S24, http://dx.doi.org/10.1186/1471-2164-9-S2-S24.
Nakayama, R., Nemoto, T., Takahashi, H., Ohta, T., Kawai, A., Seki, K., Yoshida, T., Toyama, Y., Ichikawa, H., Hasegawa, T., 2007. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod. Pathol. 20, 749–759, http://dx.doi.org/10.1038/modpathol.3800794.
Osuna, D., de Alava, E., 2009. Molecular pathology of sarcomas. Rev. Recent Clin. Trials 4, 12–26.
Pitak, S., Santitham, P.-O., Asawin, M., Jonathan, H.C., 2011. Pathway-based microarray analysis for robust disease classification. Neural Comput. Appl.
Rajapakse, J.C., Mundra, P.A., 2013. Multiclass gene selection using Pareto-fronts. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 87–97, http://dx.doi.org/10.1109/TCBB.2013.1.
Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C., Lashkari, D., Shalon, D., Myers, T.G., Weinstein, J.N., Botstein, D., Brown, P.O., 2000. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24, 227–235, http://dx.doi.org/10.1038/73432.
Saeys, Y., Inza, I., Larranaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517, http://dx.doi.org/10.1093/bioinformatics/btm344.
Skafidas, E., Testa, R., Zantomio, D., Chana, G., Everall, I.P., Pantelis, C., 2012. Predicting the diagnosis of autism spectrum disorder using gene pathway analysis. Mol. Psychiatry, http://dx.doi.org/10.1038/mp.2012.126.
Smyth, G.K., 2004. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, http://dx.doi.org/10.2202/1544-6115.1027 (Article3).
Somorjai, R.L., Dolenko, B., Baumgartner, R., 2003. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 19, 1484–1491.
Staiger, C., Cadot, S., Kooter, R., Dittrich, M., Muller, T., Klau, G.W., Wessels, L.F., 2012. A critical evaluation of network and pathway-based classifiers for outcome prediction in breast cancer. PLoS One 7, e34796, http://dx.doi.org/10.1371/journal.pone.0034796.
Xiong, M., Fang, X., Zhao, J., 2001. Biomarker identification by feature wrappers. Genome Res. 11, 1878–1887, http://dx.doi.org/10.1101/gr.190001.
