Clustering-based gene-subnetwork biomarker identification using gene expression data
Narumol Doungpan, Worrawat Engchuan, Asawin Meechai and Jonathan H. Chan
Department of Biological Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
[email protected]
complex disease with robustness and specificity is an ongoing
. As each cellular component
functions through interactions with others, thus a biological
challenge. Gene expressions provide information on how the cell
subnetwork is supposed to represent a functional module in the
reacts to a particular state and the relationship of genes may lead
cell. The link between nodes implies that the impact of a
to novel information. A network-based approach
expression data with protein-protein interaction network can be used to identify gene-subnetwork biomarkers for a particular disease.
However, cancer datasets are heterogeneous in nature
containing unknown or undefined subtypes of cancers. In this study, we propose a gene-subnetwork biomarker identification approach by implementing an Expectation-Maximization
clustering technique to homogenize the dataset. To validate our proposed method, lung cancer expression datasets are used to identify gene-subnetwork biomarkers. The gene-subnetwork biomarkers are evaluated by 5-fold cross-validation on an independent dataset. The comparison between non-clustering and clustering-based gene-subnetwork identification showed that clustering produced improved classification performance at a statistically
functional analysis results showed more significant subnetworks were identified using the proposed approach.
representation of the network, each node represents gene, protein, or metabolite while the link between nodes represents the
978-1-4799-1959-8/15/$31.00 ©2015 IEEE
pathological processes that interact in a complex network instead . The protein-protein interaction (PPI) network is one of the biological
information at the systems-level. The PPI network has also been used to further study molecular evolution to gain insight into the robustness of cells to perturbation and to characterize protein functions. Furthermore, PPI plays an important role in unravelling the molecular basis of a disease and understanding and identifying disease pathogenesis genes, disease-related
In a genome-wide association study, microarray data is one
a single effector gene product, but would rather reflect various
of the high throughput technologies used to inspect the gene
research for a decade to study the biological meaning of components
of gene products that carry no defects through the link. Thus a disease phenotype is rarely a consequence of an abnormality in
based approach .
The network approach has been applied in biological cellular
the single gene product that carries it, but can alter the activity
subnetworks, and classifying diseases by using a network
Keywords-gene-subnetwork; protein-protein interaction; clustering; classification; gene expression; lung cancer I.
specific genetic abnormality is not restricted to the activity of
product in a particular stage or environment . Expression levels of thousands of genes are measured simultaneously and used as the profile of a sample. The differentially expressed gene (DEG) analysis of those profiles between patient and healthy control groups could reveal candidate genes or gene markers involved in the disease development that can be used for disease prediction .
Disease biomarker with high specificity and sensitivity
II. MATERIALS AND METHOD
when applying to real use is important. The gene marker identification
different levels of data provides more reliable biomarkers for both disease prediction and classification . Various
Each dataset downloaded from a database needs to be preprocessed prior to further analysis. The clustering method was applied to cluster the expression data of the disease samples into groups (or subgroups). Each disease group and
subnetworks. For example, protein-protein interaction (PPI)
corresponding protein in the PPI network, significant gene
data may be used as a network scaffold to be overlaid with
gene expression data to determine a group of genes (called
discriminative score of a particular candidate subnetwork.
subnetwork) that is significant to a disease. Expression profiles PPI
subnetwork identification. These subnetworks are notable and enriched for a number of biological processes . Chuang et al.  proposed the "PinnacleZ" method to identify subnetworks as biomarkers instead of a single gene by integrating PPI network with expression profiles of breast cancer to identify marker for classification of breast cancer metastasis.
corresponding proteins in the PPI network. The significance of each subnetwork is determined by calculating a discriminative score. The differentially expressed genes from the disease
approach is a common method for identification of significant
are informative data. The integration of this data with curated
subnetworks from gene expression data -. A score-based
A. Protein-protein Interaction Network
Protein-protein interaction (PPI) data were downloaded from BioGRID (http://thebiogrid.org/) (version 3.2.95). The 151,895 PPI records comprise human (NCBI Taxonomy ID: 9606) and non-human protein data. Before the PPI data were used, the non-human interactions were removed. After preprocessing, a network of 13,586 proteins with 82,571 edges was formed. Such networks can be visualized conveniently using tools such as Cytoscape (http://www.cytoscape.org/) .
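The filtering step can be illustrated with a minimal sketch. The tuple layout `(gene_a, gene_b, taxid_a, taxid_b)` and the function name `build_human_ppi` are hypothetical and only loosely mirror a BioGRID export; dropping self-interactions and duplicate edges is an assumption, not something the text states.

```python
HUMAN_TAXID = 9606  # NCBI Taxonomy ID for Homo sapiens


def build_human_ppi(interactions):
    """Keep interactions whose two partners are both human.

    `interactions` is an iterable of (gene_a, gene_b, taxid_a, taxid_b)
    tuples (hypothetical layout). Returns (nodes, edges).
    """
    edges = set()
    for gene_a, gene_b, taxid_a, taxid_b in interactions:
        if taxid_a != HUMAN_TAXID or taxid_b != HUMAN_TAXID:
            continue  # drop any interaction involving a non-human protein
        if gene_a == gene_b:
            continue  # assumption: self-interactions are not kept as edges
        edges.add(frozenset((gene_a, gene_b)))  # undirected, de-duplicated
    nodes = set()
    for edge in edges:
        nodes.update(edge)
    return nodes, edges


toy = [
    ("TP53", "MDM2", 9606, 9606),
    ("MDM2", "TP53", 9606, 9606),   # same edge in the opposite direction
    ("EGFR", "GRB2", 9606, 9606),
    ("TP53", "Mdm2", 9606, 10090),  # mouse partner, removed
]
nodes, edges = build_human_ppi(toy)
print(len(nodes), len(edges))
```

On real data the same filter would be applied to the organism columns of the downloaded file before the network is loaded into a viewer.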
B. Gene Expression Data The three lung cancer datasets used in this study were
(tumor) and normal cells can be linked by genes that are not
(http://www.ncbi.nlm.nih.gov/geo/) . The first expression
data was published by Landi et al.  and used to identify the
subnetwork. In addition, these subnetwork biomarkers can
genes that correspond to lung cancer from smoking. This gene
imply the mechanism or pathway that plays a role in disease
The concept of using PPI network and expression data to identify the biomarker of disease has often been applied in prior research , , -. Accordingly, the challenge is to
biomarkers to provide more informative and effective markers for complex diseases.
noninvolved lung tissue from 28 current, 26 former and 20 non-smokers,
information. There are a total of 107 samples composed of 58 adenocarcinoma and 49 non-tumor samples as control. The second one was provided by Sanchez-Palencia et al.  and used to determine whether the phenotypic heterogeneity and genetic diversity in NSCLC are correlated. This expression
Complex diseases tend to be heterogeneous in nature. For
data (GSE18842) comprises 91 samples, of which 46 are
example, cancer originating from a single cell can proliferate
and mutate into heterogeneity which leads to non-coincidence
samples and 45 non-tumor samples are the control samples.
The last expression data was provided by Lu et al.  and
heterogeneity occur among cancer cells within the same tumor
used to screen for the transcriptional modulation which causes
lung cancer among non-smoking females in Taiwan. The
differences and reversible changes in cell properties . The
expression data (GSE19804) comprises 120 samples, of
expression profile from tumor samples may provide different
which 60 are lung cancer and the others are control samples.
types of genetic variation which needs to be considered in the analysis. Ribeiro et al.  proposed a semi-supervised method based on clustering approach for finding sub-classes of cancer using gene expression data.
expression data by removing those probe sets representing more than one gene or an unknown gene. For multiple probe sets which refer to the same gene, the probe set with the highest
This study aims to improve the method in subnetwork
variance was chosen. Moreover, an additional file that is
biomarker identification by applying a clustering technique to
needed in the subnetwork identification is the class file. The
reduce the heterogeneity of gene expression profile in patient
class file is a text file which indicates the class (case or control)
samples. The gene-subnetwork biomarkers were identified
of each sample in an expression matrix. The format should be
using three expression datasets of subjects with lung tumors
in two columns; the first column is name of each sample in the
and controls as well as PPI network from BioGRID . The
expression, and the second column is the positive number
proposed clustering-based method was compared with the
specifying class of each particular sample.
traditional gene-subnetwork biomarker identification using 5-fold cross-validation.
further. For clustering, only the subset of genes was used as
Gene-subnetwork Biomarker Identification In
identified using the PinnacleZ plugin for Cytoscape . The
features to minimize the effect of noise in using experimental microarray data.
procedure is described as follows.
The "HugeNavigator", which is an online tool for querying
The expression data was overlaid on the corresponding proteins in PPI networks during the identification process. The network modules were identified by expanding from a starting node to its neighbor nodes using the Greedy search algorithm and then determining the discriminative potential of each
disease-related genes from collected publications, was used to select gene subsets . The GeneProspector function of HugeNavigator was applied with the search term "Lung cancer" to retrieve the list of lung cancer related genes. A total of 1,328 genes was obtained.
subnetwork by using mutual information (MI).
In this step, only cases were clustered into subgroups.
MI measures the interdependence between two random variables based on their joint distribution. If the two random variables are independent, one variable carries no information about the other and the MI is zero. Here, one variable is the discretized form of a vector summing the expression values of the genes in a particular subnetwork, and the other is a vector labeling each sample as tumor or non-tumor.
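A minimal sketch of this score follows, assuming equal-width binning of the summed expression vector. The helper names and the bin count used in the example are illustrative assumptions; the discretization actually used by PinnacleZ is not specified in this text.

```python
import math
from collections import Counter


def discretize(values, n_bins):
    """Equal-width binning of floats into integer bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy subnetwork: summed expression per sample, and tumor (1) / non-tumor (0).
summed = [2.1, 1.8, 2.5, 1.9, -1.7, -2.2, -1.9, -2.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
score = mutual_information(discretize(summed, n_bins=4), labels)
print(round(score, 3))  # 1.0: the binned activity fully determines the class
```

A subnetwork whose summed expression is unrelated to the class labels would score close to zero under the same function.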
controls. The gene-subnetwork biomarker identification step was applied to these combined datasets using PinnacleZ with the same set of parameters as described in the previous section.
Evaluation of gene-subnetwork biomarkers The use of classification for biomarker evaluation is a
In the step of determination of the significant subnetworks through
Then the dataset of each subgroup was combined with all
common practice in microarray analysis , -. In this work,
performed. The p-value thresholds for these three tests were
features for building a classification model in two different
set as default (tl
ways. The first is to use the gene expression of the gene
subnetworks having those p-values smaller than the default
member in the identified subnetwork as the feature for
thresholds were carried on for further analysis. Since the
evaluation by classification so-called gene-level subnetwork
approach of PinnacleZ is stochastic, the identification of the
biomarkers. The second way is to transform the expressions of
subnetworks was repeated three times and the consensus
gene members in the identified subnetwork into activity scores
subnetworks chosen were those overlapping among the three
subnetwork-level. The activity score was calculated using the
a clustering technique biomarker
was applied prior to gene
patient dataset. Fig. 1 shows the workflow of the proposed clustering-based method. The gene expression datasets retrieved from GEO were first grouped into binary classes of case and control. In this study, "case" represents the state of samples having disease (patients), while "control" represents the healthy samples. The Expectation-Maximization (EM) clustering algorithm was used in this work. EM algorithm has been applied in many fields of study when the data can be assumed to be a mixture of Gaussian distributions
. Furthermore, EM
clustering allows the cross-validation process to determine the most appropriate number of clusters . The EM clusterer was applied using the WEKA library version 3.7-12 . EM works by assigning to each sample in the case data a probability distribution indicating the probability that it belongs to each cluster. In WEKA, cross-validation was used to find the number of clusters. Initially, the number of clusters was set to 1. The dataset was then split into 10 folds, and 9 folds were used to build a clustering
In order to reduce the heterogeneity of the gene expression data,
following equation (1):
D. Clustering-based Gene-subnetwork Biomarker Identification
likelihood was calculated and averaged over all 10 iterations of cross-validation. In the case when the log likelihood increased, the number of clusters was increased by 1. Then the process iterated until the log likelihood did not increase any
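The cluster-number search described here can be sketched in plain Python. This is a simplified illustration, not WEKA's implementation: it fits a one-dimensional Gaussian mixture, whereas the study clustered samples on many gene features, and the quantile-based initialization, the variance floor `min_var` and all function names are assumptions.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm(xs, k, n_iter=50, min_var=1e-3):
    """EM for a 1-D Gaussian mixture; returns a list of (weight, mean, var)."""
    srt = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [srt[(2 * c + 1) * len(srt) // (2 * k)] for c in range(k)]
    m = sum(xs) / len(xs)
    var0 = max(sum((x - m) ** 2 for x in xs) / len(xs), min_var)
    params = [(1.0 / k, mu, var0) for mu in means]
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, mu, var) for w, mu, var in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each cluster.
        new_params = []
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc < 1e-8:  # effectively empty cluster: keep old parameters
                new_params.append(params[c])
                continue
            mu = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nc
            new_params.append((nc / len(xs), mu, max(var, min_var)))
        params = new_params
    return params


def log_likelihood(xs, params):
    return sum(
        math.log(max(sum(w * normal_pdf(x, mu, var) for w, mu, var in params), 1e-300))
        for x in xs
    )


def cv_log_likelihood(xs, k, n_folds=10, seed=0):
    """Held-out log likelihood averaged over the folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for f in range(n_folds):
        test = [xs[i] for j, i in enumerate(idx) if j % n_folds == f]
        train = [xs[i] for j, i in enumerate(idx) if j % n_folds != f]
        total += log_likelihood(test, fit_gmm(train, k))
    return total / n_folds


def select_n_clusters(xs, max_k=6):
    """Start at 1 cluster and add one while the CV log likelihood increases."""
    best_k, best = 1, cv_log_likelihood(xs, 1)
    for k in range(2, max_k + 1):
        score = cv_log_likelihood(xs, k)
        if score <= best:
            break
        best_k, best = k, score
    return best_k


rng = random.Random(1)
toy = [rng.gauss(0, 1) for _ in range(40)] + [rng.gauss(10, 1) for _ in range(40)]
print(select_n_clusters(toy))
```

The stopping rule mirrors the text: the search halts at the first cluster count that fails to improve the averaged held-out log likelihood.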
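$$a_{kj} = \frac{\sum_{i \in k} z_{ij}}{\sqrt{n_k}} \qquad (1)$$

where $a_{kj}$ is the activity level of subject $j$ in subnetwork $k$, $z_{ij}$ is the normalized expression (mean = 0, standard deviation = 1) of gene $i$ in subject $j$, and $n_k$ is the number of members in subnetwork $k$.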
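As a minimal sketch, assuming the activity score follows the PinnacleZ-style definition (the sum of a subject's z-normalized member-gene expressions divided by the square root of the number of members), the subnetwork-level transformation could look like this. The function names, the dictionary layout and the use of the population standard deviation are illustrative assumptions.

```python
import math
import statistics


def z_normalize(values):
    """Scale one gene's expression across subjects to mean 0 and SD 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


def activity_scores(expression, members):
    """Return one activity score per subject for a subnetwork.

    `expression` maps gene name -> list of expression values across subjects
    (hypothetical layout); `members` lists the genes in the subnetwork.
    """
    z = {gene: z_normalize(expression[gene]) for gene in members}
    n_subjects = len(expression[members[0]])
    scale = math.sqrt(len(members))
    return [sum(z[gene][j] for gene in members) / scale for j in range(n_subjects)]


expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.0, 6.0],
    "G3": [5.0, 5.0, 8.0],  # not a member of the toy subnetwork
}
scores = activity_scores(expression, ["G1", "G2"])
print([round(s, 3) for s in scores])  # [-1.732, 0.0, 1.732]
```

Each subnetwork then contributes a single feature per subject, instead of one feature per member gene as in the gene-level representation.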
The gene-subnetwork biomarkers were evaluated in terms
of classification performance by applying both types of markers
identified from one dataset on another independent dataset. Five-fold cross-validation was used to assess the area under the receiver operating characteristic curve (AUC) and Recall as measures of the classification performance of biomarkers on an independent dataset. AUC is a measure of classification performance that is robust to class imbalance . Recall is an important measure that is commonly used to evaluate the accuracy of medical screening . The procedure of 5-fold cross-validation is described as follows. The dataset was randomly divided into 5 equal subsets while keeping the ratio between case and control approximately the same for all subsets. For each iteration, 4 subsets were used to train the model while the remaining subset was used to test it, until all 5 subsets had been used as a test set. For the classifier, the Support Vector Machine (SVM)  was used because it can be applied to both binary and multi-class classification problems. In addition, SVM is commonly used in microarray analysis to analyze and recognize patterns by constructing hyperplanes to separate the different classes of interest , .
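The evaluation procedure can be sketched without committing to a particular classifier. The helpers below are illustrative assumptions (names, seeding and tie handling are not from the study): a stratified split that keeps the case/control ratio similar across folds, plus Recall and a rank-based AUC computed from predicted labels or scores. The SVM itself is left out.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of sample indices with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


def auc(y_true, scores, positive=1):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 58 cases and 49 controls, matching the sample counts of the first dataset.
labels = [1] * 58 + [0] * 49
folds = stratified_folds(labels)
for test_idx in folds:
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    # A classifier would be trained on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

Averaging the per-fold AUC and Recall, and repeating the whole split with different seeds, gives the kind of summary the evaluation describes.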
Fig. 1. Overview of clustering-based gene-subnetwork identification
Finally, gene-subnetwork identification using the proposed
In this study, SVM with default parameters (i.e. type of SVM was C-SVC and kernel
function was radial basis
clustering-based method could identify more significant gene
function with the degree equal to 3) was used. Then the
subnetworks. That is, by summing up the gene subnetworks in
classification performance was calculated by averaging those
all clusters in each dataset, the total is greater than that of the
measurements in the 5-fold procedure. To reduce the effect of
whole dataset without clustering.
stochasticity on the validation results, cross-validation was repeated
Significance of gene members As a rough measure of the gene-level biomarkers, we
repetitions were reported.
simply evaluated the significance of those gene members III. RESULTS AND DISCUSSION
found in gene-subnetworks by comparing with the list of
In this study, the proposed clustering-based method was compared with the non-clustering method to evaluate the significance of a gene-subnetwork for disease classification. Three lung cancer datasets were used to identify the sets of markers and each set of markers were applied using cross validation of independent datasets; e.g. identify markers using
reported disease-related genes for lung
B. Significance of subnetwork biomarkers
clustering results shown in Table I.
From the preliminary functional analysis, it can be seen that the datasets GSE10072 and GSE18842 were categorized
clustering-based approach did better in the other two datasets (see Fig. 2).
The results show that the non-clustering method can identify
the Landi et al. dataset and use those markers in cross the
genes with evidence supported in the gene-subnetwork better
GeneProspector function of the HugeNavigator (1,328 genes).
assessed using cross-validation. In particular, one dataset was
into three subtypes while GSE19804 was divided into five
used for training, with the other two independent datasets used
subtypes. However, cluster 2 from the latter dataset was not
for testing. The gene-subnetwork biomarkers were separated
associated with any subnetworks,
indicating that it may
into gene-level subnetwork biomarkers and subnetwork-level
possibly be a novel subtype, or that it can be merged with one
subnetwork biomarkers, according to the classification model
of the other four clusters instead. More detailed analysis in
built. The cross-validation results for the gene-level and
consultation with domain experts will be undertaken in the
subnetwork-level are shown in Tables II and III, respectively.
In summary, the cross-validation results indicate that the
From the cross-validation results, the clustering-based method performed generally better than the non-clustering
method. When combining the results from both types of
subnetwork biomarkers identification. Also, it provides a
biomarkers, the improvement is statistically significant from a
better classification model for medical screening purposes,
paired t-test, with both higher AUC (p-value
especially when considering the significant improvement in
terms of Recall.
subnetwork-level biomarkers alone also yielded statistically significant improvements by clustering.
W. Pan, "A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments," Bioinformatics, vol. 18(4), pp. 546-554, 2001.
0.957± 0.990± 0.978± GSE GSE 1±0.0 10072 19804 0.005 0.009 0.004 The best performance of each cross-validation is highlighted in bold.
J. Su, B. J. Yoon, and E. R. Dougherty, "Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network," BMC Bioinformatics, vol. 11, p. 6:S8, 2010.
S. Prom-On, A. Chanthaphan, J. H. Chan, and A. Meechai, "Enhancing biological relevance of a weighted gene co-expression network for functional module identification," J. Bioinform. Comput. Biol., vol. 9(1), pp. 111-129, 2011.
J. Chen and B. Yuan, "Detecting Functional Modules in the Yeast Protein-Protein Interaction Network," Bioinformatics, vol. 22, pp. 2283-2290, 2006.
H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker, "Network-based classification of breast cancer metastasis," J. Mol. Syst. Biol., vol. 3, p. 140, 2007.
CROSS-VALIDATION RESULT USING SUBNETWORK-LEVEL SUBNETWORK BIOMARKERS Recall
0.959± 0.984± 0.985± GSE GSE 1±0.0 10072 19804 0.009 0.013 0.005 The best performance of each cross-validation is highlighted in bold.
Note that the results show that there is no significant difference between the use of gene-level and subnetwork-level subnetwork
combining both types of biomarkers produced p-value for AUC and p-value
0.109 for Recall.
M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, "Identifying functional modules in protein-protein interaction networks: an integrated exact approach," Bioinformatics, vol. 24(13), pp. i223-31, 2008.
E. B. van den Akker, B. Verbruggen, B. T. Heijmans, M. Beekman, J. N. Kok, P. E. Slagboom, and M. J. Reinders, "Integrating protein-protein interaction networks with gene-gene co-expression networks improves gene signatures for classifying breast cancer metastasis," J. Integr. Bioinform., vol. 8(2), pp. 188, 2011.
C. Wu, J. Zhu, and X. Zhang, "Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes," BMC Bioinformatics, vol. 13, pp. 182, 2012.
M. Li, X. Wu, J. Wang, and Y. Pan, "Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data," BMC Bioinformatics, vol. 13, pp. 109, 2012.
L. Zhang, S. Li, C. Hao, G. Hong, J. Zou, Y. Zhang, P. Li, and Z. Guo, "Extracting a few functionally reproducible biomarkers to build robust subnetwork-based classifiers for the diagnosis of cancer," Gene, vol. 526, pp. 232-8, 2013.
H. Rakshit, N. Rathi, and D. Roy, "Construction and Analysis of the Protein-Protein Interaction Networks Based on Gene Expression Profiles of Parkinson's Disease," PLoS ONE, vol. 9(8), pp. e103047, 2014.
C. E. Meacham and S. J. Morrison, "Tumour heterogeneity and cancer cell plasticity," Nature, vol. 501(7467), pp. 328-37, 2013.
C. Ribeiro, F. de Assis T. de Carvalho, and I. G. Costa, "Semi-supervised Approach for Finding Cancer Sub-Classes on Gene Expression Data," Advances in Computational Biology, Lecture Notes in Bioinformatics. Berlin: Springer Verlag, vol. 6268, pp. 24-34, 2010.
A. Chatr-Aryamontri et al., "The BioGRID interaction database: 2015 update," Nucleic Acids Research, vol. 43, pp. D470-8, 2014.
P. Shannon et al., "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Research, vol. 13, pp. 2498-2504, 2003.
T. Barrett et al., "NCBI GEO: mining millions of expression profiles database and tools," Nucleic Acids Research, vol. 33, pp. D562-D566, 2005.
M. T. Landi et al., "Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival," PLoS One, vol. 3, pp. e1651, 2008.
A. Sanchez-Palencia et al., "Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer," Int. J. Cancer, vol. 129, pp. 355-364, 2011.
T. P. Lu et al., "Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women," Cancer Epidemiol. Biomarkers Prev., vol. 19(10), pp. 2590-7, 2010.
M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, "A comparative study of different machine learning methods on microarray gene expression data," BMC Genomics, vol. 9, pp. 13, 2008.
A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 34, pp. 1-38, 1977.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, issue 1, 2009.
W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. L. Khoury, "A navigator for human genome epidemiology," Nature Genetics, vol. 40, pp. 124-125, 2008.
T. Li, C. Zhang, and M. Ogihara, "A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression," Bioinformatics, vol. 20, pp. 2429-2437, 2004.
J. H. Chan, P. Sootanan, and P. Larpeampaisarl, "Feature selection of pathway markers for microarray-based disease classification using negatively correlated feature sets," in Proc. International Joint Conference on Neural Networks (IJCNN 2011), San Jose, CA, 2011, pp. 3293-3299.
P. Sootanan, S. Prom-on, A. Meechai, and J. H. Chan, "Pathway-based microarray analysis for robust disease classification," Neural Comput. & Appl., vol. 21, pp. 649-660, 2012.
W. Engchuan and J. H. Chan, "Pathway activity transformation for multi-class classification of lung cancer datasets," Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.08.096.
S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Handling imbalanced dataset: A review," GESTS Int. Trans. ComSci. & Eng., vol. 30, pp. 25-36, 2006.
C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, Recall and F-score, with implication for evaluation," in Advances in Information Retrieval, pp. 345-359, 2005.
C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.