MCentridFS: a tool for identifying module biomarkers for multi-phenotypes from high-throughput data.

Molecular BioSystems

Published on 30 July 2014.

PAPER

Cite this: Mol. BioSyst., 2014, 10, 2870


Zhenshu Wen (a), Wanwei Zhang (b), Tao Zeng (b) and Luonan Chen* (b, c)

Systematically identifying biomarkers, in particular network biomarkers, from high-throughput data is an important and challenging task, and many methods for two-class comparison have been developed to exploit the information in such data. However, as high-throughput data with multiple phenotypes become available, there is a great need for effective multi-classification models. In this study, we propose a novel approach, called MCentridFS (Multi-class Centroid Feature Selection), to systematically identify responsive modules or network biomarkers for classifying multi-phenotypes from high-throughput data. MCentridFS formulates the multi-classification model by network modules as a binary integer linear programming problem, which can be solved efficiently and accurately. The approach is evaluated on two diseases, i.e., multi-stage HCV-induced dysplasia and hepatocellular carcinoma, and multi-tissue breast cancer, both of which demonstrate the high classification rate and cross-validation rate of the approach. The results of five-fold cross-validation on the two datasets show that MCentridFS outperforms state-of-the-art multi-classification methods. We further verified the effectiveness of MCentridFS in characterizing multi-phenotype processes with module biomarkers on two independent datasets. In addition, functional enrichment analysis revealed that the identified network modules are strongly related to the corresponding biological processes and pathways. All these results suggest that MCentridFS can serve as a useful tool for module biomarker detection in multiple biological processes or multi-classification problems by exploring both big biological data and network information. The Matlab code for MCentridFS is freely available from http://www.sysbio.ac.cn/cb/chenlab/images/MCentridFS.rar.

Received 30th May 2014, Accepted 25th July 2014
DOI: 10.1039/c4mb00325j

www.rsc.org/molecularbiosystems

(a) School of Mathematical Sciences, Huaqiao University, Quanzhou 362021, China
(b) Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China. E-mail: [email protected]
(c) Collaborative Research Center for Innovative Mathematical Modelling, Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c4mb00325j

Introduction

As high-throughput technologies, such as microarrays and next-generation sequencing, mature and become prevalent in biological research, more and more datasets have accumulated in public databases, such as the NCBI Gene Expression Omnibus (GEO) [1], the EBI ArrayExpress [2] and the Stanford Microarray Database [3]. High-throughput experiments can monitor thousands of genes simultaneously and thus offer an unprecedented opportunity to fully characterize biological processes [4]. However, extracting meaningful biomarkers from this huge amount of information is a significant challenge that scientists and clinicians often face [5].

2870 | Mol. BioSyst., 2014, 10, 2870--2875

Considerable efforts have been devoted to developing methods to identify individual gene biomarkers [6–8]. However, it is well accepted that genes or proteins within a cell do not function alone but interact with each other to form networks or pathways so as to carry out biological functions [9–11]. Therefore, pathway-based approaches have been developed to identify network biomarkers or even dynamical network biomarkers (DNBs) from a system perspective, which not only reliably characterize diseases but also play a crucial role in revealing the essential biological mechanisms [12–16] at the network level. However, most methods for dissecting biomarkers are designed for two-class samples, e.g., normal and disease samples, and identify biomarkers one by one. Such limitations highlight the need for new approaches that identify network or module biomarkers simultaneously by classifying multi-phenotypes [17,18].

In this paper, we propose a novel multi-classification model, called MCentridFS, to systematically identify network or module biomarkers from high-throughput data with multi-phenotypes. MCentridFS formulates the multi-classification model as a binary integer linear programming problem, which can be solved efficiently and accurately. To evaluate the approach, we applied it to two multi-class gene expression datasets, namely multi-stage hepatocellular carcinoma (HCC) [19] and multi-tissue breast cancer [20]. The classification accuracy rates of our module biomarkers on the two datasets are 90.67% and 87.88%, respectively, and the five-fold cross-validation rates are 84.07% and 82.83%, respectively. In addition, the classification accuracy rates on two independent datasets reach 86.25% and 91.74%, respectively (see Table 1). The results suggest that MCentridFS can serve as a useful tool for module biomarker detection in multiple biological processes or multi-phenotypes by exploiting both large data and network information.

Table 1 Accuracy rates for HCC data and breast cancer data by MCentridFS

| Data | Classification accuracy rate using evaluated dataset (%) | Classification accuracy rate using independent dataset (%) |
|---|---|---|
| HCC data | 90.67 [19] | 86.25 [30] |
| Breast cancer data | 87.88 [20] | 91.74 [31] |

This journal is © The Royal Society of Chemistry 2014

Materials and methods

Fig. 1 illustrates the schematic flowchart of MCentridFS. The details of the procedure are described in the following subsections and in Results.

Fig. 1 Schematic flowchart of MCentridFS. First, we exploited expression data and the PPI network to construct a differential PPI network, which is used to cluster network modules. Second, the activity matrix of modules is obtained by defining the activity M_ij of each module. By introducing the indicator variable x_i and designing a classifier, we formulate the identification of responsive modules as an integer linear programming problem. Finally, by solving it, we identify among all modules the module biomarkers that characterize multiple phenotypes.

Datasets

We evaluated the approach on two different cancers, HCC and breast cancer, the data for which were downloaded from the NCBI GEO database with accession numbers GSE6764 [19] and GSE10797 [20], respectively. The HCC gene expression dataset contains 75 tissue samples covering the normal stage and four neoplastic stages (very early HCC to metastatic tumors), while the breast cancer dataset contains 66 tissue samples: 5 normal stromal samples, 5 normal epithelial samples, 28 stromal samples of breast cancer and 28 epithelial samples of breast cancer. The preprocessing of probe-level data was the same as that used in the original ref. 19 and 20. If multiple probes corresponded to the same gene, we used the averaged intensity of these probes as the expression value of the gene.

Construction of a comprehensive human PPI network and the differential PPI network

We applied a voting method to construct an ensemble protein–protein interaction (PPI) network by integrating five curated human PPI databases, i.e., HPRD [21], BioGrid [22], IntAct [23], MINT [24] and Reactome [25]. Only interactions found in at least three of these databases were selected. The comprehensive PPI network contains 7001 nodes with 19 188 edges [16]. We combined the constructed PPI network and the multi-phenotype gene expression data to obtain the differential PPI network. Specifically, we first applied the ANOVA method [26] to identify genes that were differentially expressed in one or more of the K clusters (phenotypes) relative to the others (p-value < 0.05). Then, if at least one of the two nodes of an edge in the PPI network was differentially expressed, we kept the edge in the final differential PPI network. In this way, the differential PPI network is defined [27].

Identification of responsive modules for a multi-classification model by a binary integer linear programming model

First, we exploit gene expression data and the PPI network to construct the differential PPI network (the process is described in the above subsection). Second, we decompose the differential PPI network into modules by the Markov Clustering (MCL) algorithm [28] and only consider those n modules with more than 3 edges. Third, normalized gene expression data are employed to define the activity of module M_i in sample S_j as

$$M_{ij} = \frac{\sum_{g_k \in M_i} z_{kj}}{\sqrt{|M_i|}},$$

where |M_i| is the number of nodes in module M_i and z_kj is the normalized expression value of gene g_k in sample S_j. In this way, we obtain the activity matrix, whose element M_ij represents the activity of module M_i in sample S_j. An indicator variable is defined to indicate whether a module is selected or not:

$$x_i = \begin{cases} 1, & M_i \text{ is selected} \\ 0, & \text{otherwise} \end{cases}$$
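The module activity defined above can be illustrated with a minimal Python sketch. The data layout (a dict mapping each gene to its list of normalized per-sample values) and the names `module_activity`, `z` and `modules` are assumptions made for illustration; they are not taken from the published Matlab code, and the sketch assumes every module gene is present in the expression table.

```python
import math

def module_activity(z, modules):
    """Return the activity matrix M, where M[i][j] is the activity of
    module i in sample j: the sum of the normalized expression values of
    the module's genes divided by sqrt(number of genes in the module).

    z       -- dict: gene -> list of normalized expression values, one per sample
    modules -- list of gene lists, one list per module (hypothetical layout)
    """
    n_samples = len(next(iter(z.values())))
    activity = []
    for genes in modules:
        scale = math.sqrt(len(genes))
        row = [sum(z[g][j] for g in genes) / scale for j in range(n_samples)]
        activity.append(row)
    return activity

# Toy data: 4 genes, 2 samples, 2 modules.
z = {"g1": [1.0, -1.0], "g2": [3.0, 1.0], "g3": [0.5, 0.5], "g4": [-0.5, 1.5]}
modules = [["g1", "g2", "g3", "g4"], ["g3"]]
M = module_activity(z, modules)
print(M)  # [[2.0, 1.0], [0.5, 0.5]]
```

Dividing by the square root of the module size, rather than by the size itself, keeps the activity of large and small modules on a comparable scale when the gene-level values are standardized.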

Then, we design a classifier to select responsive modules, taking a three-classification model (see Fig. 2) as an example:

$$
\begin{aligned}
\|S - \bar{S}_1\|_2^2 - \|S - \bar{S}_2\|_2^2 &< 0, \quad \text{for } S \in S_1,\\
\|S - \bar{S}_1\|_2^2 - \|S - \bar{S}_3\|_2^2 &< 0, \quad \text{for } S \in S_1,\\
\|S - \bar{S}_2\|_2^2 - \|S - \bar{S}_1\|_2^2 &< 0, \quad \text{for } S \in S_2,\\
\|S - \bar{S}_2\|_2^2 - \|S - \bar{S}_3\|_2^2 &< 0, \quad \text{for } S \in S_2,\\
\|S - \bar{S}_3\|_2^2 - \|S - \bar{S}_1\|_2^2 &< 0, \quad \text{for } S \in S_3,\\
\|S - \bar{S}_3\|_2^2 - \|S - \bar{S}_2\|_2^2 &< 0, \quad \text{for } S \in S_3,
\end{aligned}
\tag{1}
$$

where $S$, $S_i$ and $\bar{S}_i$ (i = 1, 2, 3) represent a sample, the i-th cluster of samples and the Euclidean center of the i-th cluster of samples, respectively. For a K-classification model, the classifier can be defined similarly. Through simple calculation [14,16], we can further express the conditions of the classifier as

$$C(x_1, x_2, \ldots, x_n)^t \le 0, \tag{2}$$

where C is a matrix function of M_ij with each element C_ij representing the j-th module's contribution to the i-th condition of the classifier, and t denotes the transpose of the vector x. Note that the above system of inequalities is linear in (x_1, x_2, ..., x_n). The terms on the left side of eqn (1) represent the classification ability of the modules, i.e., the more negative they are, the more clearly the modules distinguish the multi-class samples. We aim not only to classify the multi-class samples based on the designed classifier, but also to identify the minimal number of modules in this classification process. Therefore, by combining the two objectives, we formulate the module-identification problem as the following binary integer linear programming problem:

$$
\begin{aligned}
\min_{x_1, x_2, \ldots, x_n} \quad & \sum_{j=1}^{n} x_j + \lambda \sum_{i=1}^{s} \sum_{j=1}^{n} C_{ij} x_j \\
\text{s.t.} \quad & C(x_1, x_2, \ldots, x_n)^t \le 0, \\
& \sum_{i=1}^{n} x_i \ge 1, \\
& \sum_{i=1}^{n} x_i \le m, \\
& x_i = 0, 1; \quad i \in \{1, 2, \ldots, n\},
\end{aligned}
\tag{3}
$$

where s is the number of samples and m is the threshold restricting the number of modules. The first term in the objective function expresses that we intend to minimize the number of modules, while the second term characterizes the classification ability of these modules, i.e., we intend to maximize their classification ability (or minimize $C(x_1, x_2, \ldots, x_n)^t$). λ is a positive penalty parameter that controls the trade-off between the number of modules and their classification ability.

Fig. 2 Schematic diagram of the classifier for a three-classification model.

Algorithm for solving the binary integer linear programming problem

Binary integer linear programming is NP-hard in general. We therefore relax the binary constraints $x_i \in \{0, 1\}$ to continuous constraints $x_i \in [0, 1]$. With this relaxation, we can adopt a linear programming algorithm to solve the problem efficiently. The experimental results show that the relaxation is both efficient and effective: we almost always obtain integral solutions, although this is not theoretically guaranteed.

Here, we describe a rule for choosing λ and m [29]. Given λ and m, we define the classification accuracy rate of the modules identified from the constraints $C(x_1, x_2, \ldots, x_n)^t \le 0$ as follows:

$$CA = \frac{m_0}{m},$$

where m and m_0 are the number of samples to be classified and the number of samples that are correctly classified, respectively (in this ratio only, m denotes the sample count rather than the module threshold of eqn (3)). We choose the λ and threshold m for which CA attains its maximum value, and the resulting modules are regarded as the putative responsive modules. Specifically, we test λ and m over certain intervals and choose the pair corresponding to the maximum CA.
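The selection procedure described above can be made concrete on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes that distances are measured only over the selected modules, so that each row of C holds, for one (sample, rival class) pair, every module's squared distance to the sample's own class center minus its squared distance to the rival center. It also replaces the linear programming relaxation with exhaustive enumeration of binary vectors, which is only feasible for a handful of modules. All function names and the toy data are hypothetical.

```python
from itertools import product

def centroids(M, labels):
    """Mean activity of every module within each class.
    M[j][s] is the activity of module j in sample s."""
    cents = {}
    for c in sorted(set(labels)):
        idx = [s for s, lab in enumerate(labels) if lab == c]
        cents[c] = [sum(row[s] for s in idx) / len(idx) for row in M]
    return cents

def build_C(M, labels):
    """One row per (sample, rival class) pair. Entry j is module j's squared
    distance to the sample's own center minus that to the rival center."""
    cents = centroids(M, labels)
    rows = []
    for s, own in enumerate(labels):
        for rival in cents:
            if rival != own:
                rows.append([(M[j][s] - cents[own][j]) ** 2
                             - (M[j][s] - cents[rival][j]) ** 2
                             for j in range(len(M))])
    return rows

def solve_bruteforce(C, lam, m):
    """Exhaustive search over binary x with 1 <= sum(x) <= m and C x <= 0.
    Returns the x minimizing sum(x) + lam * sum_ij C_ij x_j, or None."""
    n = len(C[0])
    col_sum = [sum(row[j] for row in C) for j in range(n)]
    best, best_val = None, None
    for x in product((0, 1), repeat=n):
        if not 1 <= sum(x) <= m:
            continue
        if any(sum(c * xj for c, xj in zip(row, x)) > 0 for row in C):
            continue
        val = sum(x) + lam * sum(cs * xj for cs, xj in zip(col_sum, x))
        if best_val is None or val < best_val:
            best, best_val = x, val
    return best

def accuracy(M, labels, x):
    """CA: fraction of samples whose nearest class center, measured over
    the selected modules only, is the center of their own class."""
    cents = centroids(M, labels)
    selected = [j for j, xj in enumerate(x) if xj]
    correct = 0
    for s, own in enumerate(labels):
        dist = {c: sum((M[j][s] - cents[c][j]) ** 2 for j in selected)
                for c in cents}
        correct += min(dist, key=dist.get) == own
    return correct / len(labels)

def grid_search(M, labels, lams, ms):
    """Try every (lam, m) pair and keep the first one with the highest CA."""
    C = build_C(M, labels)
    best = None
    for lam in lams:
        for m in ms:
            x = solve_bruteforce(C, lam, m)
            if x is None:
                continue
            ca = accuracy(M, labels, x)
            if best is None or ca > best[0]:
                best = (ca, lam, m, x)
    return best

labels = ["A", "A", "B", "B", "C", "C"]
M = [
    [0, 0, 4, 4, 8, 8],     # separates all three classes
    [1, -1, 1, -1, 1, -1],  # uninformative: identical class centers
    [0, 6, 6, 6, 0, 0],     # noisy within class A
]
C = build_C(M, labels)
print(solve_bruteforce(C, lam=0.001, m=2))  # (1, 0, 0)
print(solve_bruteforce(C, lam=0.01, m=2))   # (1, 0, 1)
print(grid_search(M, labels, lams=[0.001, 0.01], ms=[1, 2]))
```

On this toy input a small λ favors the single cleanly separating module, while a larger λ rewards adding the third module because its summed contribution is negative. This is the trade-off between module count and classification ability that λ controls.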

Table 2 Comparisons of five-fold cross-validation rates of MCentridFS with other state-of-the-art multi-classification methods

| Data | Decision trees (%) | kNN (%) | svm (%) | MCentridFS (%) |
|---|---|---|---|---|
| HCC data | 62.13 | 78.72 | 83.76 | 84.07 |
| Breast cancer data | 53.50 | 74.14 | 76.79 | 82.83 |

Results

Overview of module prediction

The flowchart for identifying responsive modules for a multi-classification model is illustrated in Fig. 1. First, we exploited the high-throughput data with multi-phenotypes and the constructed PPI network to obtain the differential PPI network (see Materials and methods). Second, we obtained n modules, each containing more than three genes, by decomposing the differential PPI network with the Markov Clustering algorithm [28]. Then, following the flowchart in Fig. 1, we identified p responsive modules from all n modules by adjusting λ and m (see Materials and methods) so as to maximize the classification accuracy rate CA in the binary integer linear programming problem.
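The first step of this overview, building the consensus PPI network by voting and then restricting it to differential edges (see Materials and methods), can be sketched as follows. This is a minimal illustration: the function names, the edge-list input format and the toy gene names are assumptions, the set of differentially expressed genes is taken as given (the ANOVA step is not reproduced), and the subsequent MCL clustering is not shown.

```python
from collections import Counter

def consensus_ppi(databases, min_votes=3):
    """Keep undirected interactions reported by at least `min_votes` databases.

    databases -- list of edge lists, one per database; each edge is a
                 (gene_a, gene_b) pair in either orientation
    """
    votes = Counter()
    for edges in databases:
        # Normalize each edge to an unordered pair; a set ensures that a
        # database votes at most once per interaction.
        votes.update({frozenset(e) for e in edges})
    return {edge for edge, n in votes.items() if n >= min_votes}

def differential_ppi(edges, de_genes):
    """Keep edges with at least one differentially expressed endpoint."""
    return {edge for edge in edges if edge & de_genes}

# Toy example with five hypothetical databases.
db1 = [("A", "B"), ("B", "C")]
db2 = [("B", "A"), ("C", "D")]
db3 = [("A", "B"), ("B", "C"), ("C", "D")]
db4 = [("C", "B")]
db5 = [("D", "E")]
consensus = consensus_ppi([db1, db2, db3, db4, db5])
differential = differential_ppi(consensus, de_genes={"A"})
print(sorted(sorted(e) for e in consensus))     # [['A', 'B'], ['B', 'C']]
print(sorted(sorted(e) for e in differential))  # [['A', 'B']]
```

Representing each interaction as a `frozenset` makes (A, B) and (B, A) count as the same edge, which matters when databases record interactions in different orientations.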

Application to HCC data

The HCC dataset consists of 21 327 genes and 75 tissue samples, including 10 normal samples and four neoplastic stages of HCC (i.e., 13 cirrhotic liver tissue samples, 17 high-grade dysplastic liver tissue samples, 18 early HCC samples, and 17 very advanced HCC samples). We applied the formulated multi-classification model to the HCC data and identified 12 modules from 267 modules to discriminate the five-class samples. To evaluate how well the identified modules distinguish the five classes, we used the module activity matrix (modules vs. samples) to check whether each sample is assigned to its own class, and we found that these modules achieved a high classification accuracy rate of CA = 90.67% (see Table 1).

Application to breast cancer data

We further evaluated the approach on a breast cancer dataset with 66 tissue samples and 22 277 probes. The 66 samples consist of 5 normal stromal samples, 5 normal epithelial samples, 28 stromal samples of breast cancer and 28 epithelial samples of breast cancer. We applied our multi-classification model to the breast cancer data and identified 31 modules from 406 modules to discriminate the four-class samples. We again evaluated how well the identified modules distinguish the four classes, and they achieved a high classification accuracy rate of CA = 87.88% (see Table 1).

Comparisons with the state-of-the-art multi-classification methods

We conducted five-fold cross-validation on the HCC and breast cancer data to evaluate the performance of MCentridFS and to compare it with other state-of-the-art multi-classification methods, such as decision trees [32], k-nearest neighbors (kNN) [33], and support vector machines (svm) [34]. For each dataset and method, we performed five-fold cross-validation 100 times using 12 selected features and took the average accuracy rate; the results are shown in Table 2. Table 2 suggests that MCentridFS outperforms decision trees, kNN, and svm. Additionally, the high five-fold cross-validation classification rates of the four multi-classification methods indicated that the features (modules) identified by MCentridFS are effective. All these results showed that MCentridFS can serve as a useful tool for module biomarker detection in multiple biological processes or multi-classification problems.
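The evaluation protocol described above, five-fold cross-validation repeated 100 times with the average accuracy reported, can be sketched in a few lines. This is an illustration only: a plain nearest-centroid rule stands in for the classifiers that were actually compared, the selected features are treated as fixed as in the text, folds are drawn by simple shuffling (the paper does not say whether they were stratified), and all names and the toy data are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Per-class mean feature vector of the training samples."""
    cents = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Class whose centroid has the smallest squared Euclidean distance to x."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def repeated_kfold_accuracy(X, y, k=5, repeats=100, seed=0):
    """Average accuracy over `repeats` random k-fold cross-validations."""
    rng = random.Random(seed)
    n = len(X)
    scores = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k disjoint test folds
        correct = 0
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            cents = fit_centroids([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
            correct += sum(predict(cents, X[i]) == y[i] for i in test_idx)
        scores.append(correct / n)
    return sum(scores) / len(scores)

# Toy data: three well-separated classes, ten samples each, one feature.
X = [[10.0 * c + 0.1 * i] for c in range(3) for i in range(10)]
y = [c for c in range(3) for _ in range(10)]
avg_acc = repeated_kfold_accuracy(X, y)
print(avg_acc)  # 1.0
```

Averaging over many random partitions reduces the dependence of the reported rate on any single split, which matters for datasets with only 66 to 75 samples.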



Functional enrichment analysis of the identified modules

We further analyzed the functional enrichment of the identified modules through a hypergeometric test using g:Profiler [35]. The representative enriched GO terms in each module are presented in Tables S1 and S2 (ESI†) for the HCC data and the breast cancer data, respectively. Tables S1 and S2 (ESI†) show that the identified modules are mainly enriched in processes that are highly correlated with the hallmarks of cancer and contribute to the major progression of cancer [36,37], such as the immune process, signaling pathways, cell communication, apoptosis and the inflammatory process. In addition, specific processes, i.e., hepatic immune response, hepatitis B, and heparin binding, are also found in the progression of HCC. Moreover, the inflammatory process and immune response, which play an important role during breast cancer progression [38], are also enriched in the identified modules.

Validation of the identified modules using other independent datasets

We further evaluated the effectiveness of the identified modules using gene expression data from two other independent datasets. For the HCC data, GSE50579 [30], which includes 70 HCC samples and 10 normal samples, was used to verify the effectiveness of the 12 identified modules, and it achieved an 86.25% accuracy rate (see Table 1). Similarly, we obtained a 91.74% accuracy rate (see Table 1) for the breast cancer dataset GSE42568 [31], which includes 104 breast cancer samples and 17 normal samples. These independent results provide additional evidence that the module-based biomarkers identified in this study could be used for disease analysis and prediction, and further suggest the effectiveness of MCentridFS.
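For readers unfamiliar with the hypergeometric test used in the functional enrichment analysis, the upper-tail probability for a single GO term can be computed as in the sketch below. It shows the textbook formula only; it does not reproduce g:Profiler's background definition or its multiple-testing correction, and the function and parameter names are chosen here for illustration.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """Probability of observing at least k annotated genes in a module.

    N -- number of genes in the background
    K -- number of background genes annotated with the term
    n -- number of genes in the module
    k -- number of module genes annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Tiny worked case: 4 background genes, 2 annotated, module of size 2.
p_two = hypergeom_enrichment_p(N=4, K=2, n=2, k=2)   # 1/6
p_one = hypergeom_enrichment_p(N=4, K=2, n=2, k=1)   # 5/6
print(p_two, p_one)
```

In the tiny case, only one of the six possible two-gene modules contains both annotated genes, which gives 1/6; five of the six contain at least one, which gives 5/6.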

Discussion and conclusion

Our method is novel in several respects compared with previous methods. First, in contrast to conventional methods, our method is able to reveal responsive relations between module biomarkers and multiple phenotypes by integrating gene expression data and the PPI network. Second, instead of heuristically selecting biomarkers one by one, our method identifies module biomarkers accurately by formulating a binary integer linear programming problem that can be solved efficiently, although it requires more settings and parameters than some simpler methods. Third, our method achieves high classification accuracy and a high cross-validation rate for multi-class comparison by exploiting both high-throughput data and network information.

In this paper, we focus on developing a computational approach, MCentridFS, to identify network or module biomarkers for classifying multi-phenotypes from high-throughput data. Its application to two datasets achieved promising classification accuracy rates (90.67% and 87.88%) and cross-validation rates (84.07% and 82.83%), which demonstrates the effectiveness of MCentridFS. The five-fold cross-validation classification rates of MCentridFS exceed those of the other multi-classification methods, which also indicates that MCentridFS can serve as a useful tool for module biomarker detection in multiple biological processes or multi-classification problems.

However, several important topics remain to be studied. Our method may incorporate more of the available information, such as copy number aberrations (CNA) and epigenomic data (e.g., DNA methylation data), into the multi-classification model [16]. In addition, the five stages of the HCC data progress step by step, so our multi-classification model may be improved to take such ordinal information into account and obtain more refined results. We may also extend our framework to detect dynamical network biomarkers (DNBs) for early diagnosis of complex diseases [15], which can elucidate more details of cancer pathogenesis. Finally, our approach may help identify new oncogenes and tumor suppressor genes of various cancers through further analysis of the identified module biomarkers [16].

In contrast to traditional molecular biomarkers or node biomarkers, detecting edge biomarkers [39,40] is also an active topic. As future work, our method can be extended to identify edge biomarkers for classifying biological samples. In addition to the complex diseases analyzed in this manuscript, our method can be used in a similar manner to study phenotype evolution in general dynamical biological processes.

Funding

This research was supported by the foundation of Huaqiao University (No. 12BS223) and the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-PY119). The work was also supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB13040700), the National Program on Key Basic Research Project (No. 2014CB910504), the National Natural Science Foundation of China (No. 61134013, 91029301, 91130033, 31200987), the Knowledge Innovation Program of CAS (No. KSCX2EW-R-01, 2013KIP218), and the 863 project (No. 2012AA020406).

Competing interests

None.

Contributors

ZSW wrote the paper, designed and performed the experiments, and analyzed the data. WWZ and TZ helped perform the experiments. TZ helped with software preparation and publication. LNC conceived the experiments and contributed materials.

References

1 R. Edgar, M. Domrachev and A. E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res., 2002, 30, 207–210.
2 H. Parkinson, M. Kapushesky, M. Shojatalab, N. Abeygunawardena, R. Coulson, A. Farne, E. Holloway, N. Kolesnykov, P. Lilja and M. Lukk, ArrayExpress–a public database of microarray experiments and gene expression profiles, Nucleic Acids Res., 2007, 35, D747–D750.
3 G. Sherlock, T. Hernandez-Boussard, A. Kasarskis, G. Binkley, J. C. Matese, S. S. Dwight, M. Kaloper, S. Weng, H. Jin and C. A. Ball, The Stanford microarray database, Nucleic Acids Res., 2001, 29, 152–155.
4 R. K. Curtis, M. Oresic and A. Vidal-Puig, Pathways to the analysis of microarray data, Trends Biotechnol., 2005, 23, 429–435.
5 D. Cavalieri and C. De Filippo, Bioinformatic methods for integrating whole-genome expression results into cellular networks, Drug Discovery Today, 2005, 10, 727–734.
6 A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran and X. Yu, et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 2000, 403, 503–511.
7 T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing and M. A. Caligiuri, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 1999, 286, 531–537.
8 S. Ramaswamy, K. N. Ross, E. S. Lander and T. R. Golub, A molecular signature of metastasis in primary solid tumors, Nat. Genet., 2003, 33, 49–54.
9 A. L. Barabasi and Z. N. Oltvai, Network biology: understanding the cell's functional organization, Nat. Rev. Genet., 2004, 5, 101–113.
10 L. Chen, R. S. Wang and X. S. Zhang, Biomolecular networks: methods and applications in systems biology, John Wiley & Sons Inc, Hoboken, New Jersey, 2009.
11 L. Chen, R. Wang, C. Li and K. Aihara, Modeling biomolecular networks in cells: structures and dynamics, Springer Verlag, London, 2010.
12 E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller and N. Friedman, Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Nat. Genet., 2003, 34, 166–176.
13 A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub and E. S. Lander, et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 15545–15550.
14 Z. Wen, Z. P. Liu, Y. Yan, G. Piao, Z. Liu, J. Wu and L. Chen, Identifying Responsive Modules by Mathematical Programming: An Application to Budding Yeast Cell Cycle, PLoS One, 2012, 7, e41854.
15 L. Chen, R. Liu, Z. P. Liu, M. Li and K. Aihara, Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers, Sci. Rep., 2012, 2, 342.
16 Z. Wen, Z. P. Liu, Z. Liu, Y. Zhang and L. Chen, An integrated approach to identify causal network modules of complex diseases with application to colorectal cancer, J. Am. Med. Inform. Assoc., 2013, 20, 659–667.
17 S. Lu, J. Li, C. Song, K. Shen and G. C. Tseng, Biomarker detection in the integration of multiple multi-class genomic studies, Bioinformatics, 2010, 26, 333–340.
18 X. Ren, Y. Wang, L. Chen, X.-S. Zhang and Q. Jin, ellipsoidFN: a tool for identifying a heterogeneous set of cancer biomarkers based on gene expressions, Nucleic Acids Res., 2013, 41, e53.
19 E. Wurmbach, Y. B. Chen, G. Khitrov, W. Zhang, S. Roayaie, M. Schwartz, I. Fiel, S. Thung, V. Mazzaferro and J. Bruix, Genome-wide molecular profiles of HCV-induced dysplasia and hepatocellular carcinoma, Hepatology, 2007, 45, 938–947.
20 T. Casey, J. Bond, S. Tighe, T. Hunter, L. Lintault, O. Patel, J. Eneman, A. Crocker, J. White and J. Tessitore, Molecular signatures suggest a major role for stromal cells in development of invasive breast cancer, Breast Cancer Res. Treat., 2009, 114, 47–62.
21 S. Peri, J. D. Navarro, T. Z. Kristiansen, R. Amanchy, V. Surendranath, B. Muthusamy, T. K. Gandhi, K. N. Chandrika, N. Deshpande and S. Suresh, et al., Human protein reference database as a discovery resource for proteomics, Nucleic Acids Res., 2004, 32, D497–D501.
22 C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers, BioGRID: a general repository for interaction datasets, Nucleic Acids Res., 2006, 34, D535–D539.
23 H. Hermjakob, L. Montecchi-Palazzi, C. Lewington, S. Mudali, S. Kerrien, S. Orchard, M. Vingron, B. Roechert, P. Roepstorff and A. Valencia, et al., IntAct: an open source molecular interaction database, Nucleic Acids Res., 2004, 32, D452–D455.
24 A. Ceol, A. Chatr Aryamontri, L. Licata, D. Peluso, L. Briganti, L. Perfetto, L. Castagnoli and G. Cesareni, MINT, the molecular interaction database: 2009 update, Nucleic Acids Res., 2010, 38, D532–D539.
25 L. Matthews, G. Gopinath, M. Gillespie, M. Caudy, D. Croft, B. de Bono, P. Garapati, J. Hemish, H. Hermjakob and B. Jassal, et al., Reactome knowledgebase of human biological pathways and processes, Nucleic Acids Res., 2009, 37, D619–D622.
26 R. V. Hogg and J. Ledolter, Engineering statistics, MacMillan, New York, 1987.
27 X. Liu, Z. P. Liu, X. M. Zhao and L. Chen, Identifying disease genes and module biomarkers by differential interactions, J. Am. Med. Inform. Assoc., 2012, 19, 241–248.
28 A. J. Enright, S. Van Dongen and C. A. Ouzounis, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res., 2002, 30, 1575–1584.
29 X. M. Zhao, R. S. Wang, L. Chen and K. Aihara, Uncovering signal transduction networks from high-throughput data by integer linear programming, Nucleic Acids Res., 2008, 36, e48.
30 O. Neumann, M. Kesselmeier, R. Geffers, R. Pellegrino, B. Radlwimmer, K. Hoffmann, V. Ehemann, P. Schemmer, P. Schirmacher and J. Lorenzo Bermejo, Methylome analysis and integrative profiling of human HCCs identify novel protumorigenic factors, Hepatology, 2012, 56, 1817–1827.
31 C. Clarke, S. F. Madden, P. Doolan, S. T. Aherne, H. Joyce, L. O'Driscoll, W. M. Gallagher, B. T. Hennessy, M. Moriarty and J. Crown, Correlating transcriptional networks to breast cancer survival: A large-scale coexpression analysis, Carcinogenesis, 2013, 34, 2300–2308.
32 Y. Lee, Y. Lin and G. Wahba, Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data, J. Am. Stat. Assoc., 2004, 99, 67–81.
33 S. D. Bay, ICML, Citeseer, 1998, vol. 98, pp. 37–45.
34 C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, ACM TIST, 2011, 2, 27.
35 J. Reimand, M. Kull, H. Peterson, J. Hansen and J. Vilo, g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments, Nucleic Acids Res., 2007, 35, W193–W200.
36 D. Hanahan and R. A. Weinberg, Hallmarks of cancer: the next generation, Cell, 2011, 144, 646–674.
37 F. Cavallo, C. De Giovanni, P. Nanni, G. Forni and P. L. Lollini, 2011: the immune hallmarks of cancer, Cancer Immunol. Immunother., 2011, 60, 319–326.
38 D. G. DeNardo and L. M. Coussens, Balancing immune response: crosstalk between adaptive and innate immune cells during breast cancer progression, Breast Cancer Res., 2007, 9, 212.
39 X. Yu, G. Li and L. Chen, Prediction and early diagnosis of complex diseases by edge-network, Bioinformatics, 2014, 30, 852–859, DOI: 10.1093/bioinformatics/btt620.
40 W. Zhang, T. Zeng and L. Chen, EdgeMarker: identifying differentially correlated molecule pairs as edge-biomarkers, J. Theor. Biol., 2014, DOI: 10.1016/j.jtbi.2014.05.041.

