Clustering-based gene-subnetwork biomarker identification using gene expression data

Narumol Doungpan 1, Worrawat Engchuan 2, Asawin Meechai 3 and Jonathan H. Chan 2,§

1 Department of Biological Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
[email protected]
2 School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
[email protected] [email protected]
3 Department of Chemical Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
[email protected]
§ Corresponding author
Abstract-The identification of predictive biomarkers of complex disease with robustness and specificity is an ongoing challenge. Gene expressions provide information on how the cell reacts to a particular state, and the relationship of genes may lead to novel information. A network-based approach integrating expression data with a protein-protein interaction network can be used to identify gene-subnetwork biomarkers for a particular disease. However, cancer datasets are heterogeneous in nature, containing unknown or undefined subtypes of cancers. In this study, we propose a gene-subnetwork biomarker identification approach that implements an Expectation-Maximization (EM) clustering technique to homogenize the dataset. To validate the proposed method, lung cancer expression datasets are used to identify gene-subnetwork biomarkers. The gene-subnetwork biomarkers are evaluated by 5-fold cross-validation on an independent dataset. The comparison between non-clustering and clustering-based gene-subnetwork identification showed that clustering produced improved classification performance at a statistically significant level. Furthermore, preliminary functional analysis results showed that more significant subnetworks were identified using the proposed approach.

Keywords-gene-subnetwork; protein-protein interaction; clustering; classification; gene expression; lung cancer

978-1-4799-1959-8/15/$31.00 ©2015 IEEE

I. INTRODUCTION

The network approach has been applied in biological cellular research for a decade to study the biological meaning of components at the systematic level. For the representation of the network, each node represents a gene, protein, or metabolite, while the link between nodes represents the interaction or relationship according to the network construction assumption [1]. As each cellular component functions through interactions with others, a biological subnetwork is supposed to represent a functional module in the cell. The link between nodes implies that the impact of a specific genetic abnormality is not restricted to the activity of the single gene product that carries it, but can alter the activity of gene products that carry no defects through the link. Thus a disease phenotype is rarely a consequence of an abnormality in a single effector gene product, but would rather reflect various pathological processes that interact in a complex network instead [2]. The protein-protein interaction (PPI) network is one of the biological networks which offers a valuable source of information at the systems level. The PPI network has also been used to further study molecular evolution, to gain insight into the robustness of cells to perturbation, and to characterize protein functions. Furthermore, PPI plays an important role in unravelling the molecular basis of a disease and in understanding and identifying disease pathogenesis genes and disease-related subnetworks, and in classifying diseases by using a network-based approach [3].

In a genome-wide association study, microarray data is one of the high throughput technologies used to inspect the gene product in a particular stage or environment [4]. Expression levels of thousands of genes are measured simultaneously and used as the profile of a sample. The differentially expressed gene (DEG) analysis of those profiles between patient and healthy control groups could reveal candidate genes or gene markers involved in the disease development that can be used for disease prediction [5].
A disease biomarker with high specificity and sensitivity is important in real-world use, and gene marker identification is a challenging task. Integrating different levels of data provides more reliable biomarkers for both disease prediction and classification [6]. Various techniques have been proposed to infer subnetworks from gene expression data [6]-[15]. A score-based approach is a common method for identification of significant subnetworks. For example, protein-protein interaction (PPI) data may be used as a network scaffold to be overlaid with gene expression data to determine a group of genes (called a subnetwork) that is significant to a disease. Expression profiles are informative data. The integration of these data with curated PPI data highlights the expression of each protein for subnetwork identification. These subnetworks are notable and enriched for a number of biological processes [8]. Chuang et al. [9] proposed the "PinnacleZ" method to identify subnetworks, instead of single genes, as biomarkers by integrating a PPI network with expression profiles of breast cancer to identify markers for classification of breast cancer metastasis. The expression data are overlaid onto the corresponding proteins in the PPI network. The significance of each subnetwork is determined by calculating a discriminative score. The differentially expressed genes from the disease (tumor) and normal cells can be linked by genes that are not significantly differentially expressed but play a role in interconnecting the differentially expressed genes in the subnetwork. In addition, these subnetwork biomarkers can imply the mechanism or pathway that plays a role in disease pathogenesis.

The concept of using a PPI network and expression data to identify the biomarker of disease has often been applied in prior research [1], [5], [10]-[15]. Accordingly, the challenge is to identify more reproducible and robust subnetwork biomarkers to provide more informative and effective markers for complex diseases.

Complex diseases tend to be heterogeneous in nature. For example, cancer originating from a single cell can proliferate and mutate into heterogeneity, which leads to non-coincidence of gene marker identification. Phenotypic and functional heterogeneity occur among cancer cells within the same tumor as a consequence of genetic change, environmental differences and reversible changes in cell properties [16]. The expression profile from tumor samples may provide different types of genetic variation which need to be considered in the analysis. Ribeiro et al. [17] proposed a semi-supervised method based on a clustering approach for finding sub-classes of cancer using gene expression data.

This study aims to improve the method in subnetwork biomarker identification by applying a clustering technique to reduce the heterogeneity of gene expression profiles in patient samples. The gene-subnetwork biomarkers were identified using three expression datasets of subjects with lung tumors and controls as well as the PPI network from BioGRID [18]. The proposed clustering-based method was compared with the traditional gene-subnetwork biomarker identification using 5-fold cross-validation.

II. MATERIALS AND METHOD

Each dataset downloaded from a database needs to be preprocessed prior to further analysis. The clustering method was applied to cluster the expression data of the disease samples into groups (or subgroups). Each disease group and the control samples were used to identify gene-subnetwork biomarkers by integrating the expression data with the PPI network. By overlaying the expression value onto the corresponding protein in the PPI network, significant gene subnetworks were then searched and determined by the discriminative score of a particular candidate subnetwork.

A. Protein-protein Interaction Network

Protein-protein interaction (PPI) data were downloaded from BioGRID (http://thebiogrid.org/) (version 3.2.95). The 151,895 PPI records comprised human (Entrez ID: 9606) and non-human protein data. Before utilizing this PPI data, the non-human proteins, self-interactions and redundant interactions were removed. After the preprocessing of PPI data, a network of 13,586 proteins with 82,571 edges was formed. Such networks can be visualized conveniently using tools such as Cytoscape (http://www.cytoscape.org/) [19].

B. Gene Expression Data

The three lung cancer datasets used in this study were downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/) [20]. The first expression data was published by Landi et al. [21] and used to identify the genes that correspond to lung cancer from smoking. This gene expression analysis (GSE10072) was performed on fresh frozen tissue samples of adenocarcinoma and paired noninvolved lung tissue from 28 current, 26 former and 20 non-smokers, with biochemically validated smoking information. There are a total of 107 samples composed of 58 adenocarcinoma and 49 non-tumor samples as control. The second one was provided by Sanchez-Palencia et al. [22] and used to determine whether the phenotypic heterogeneity and genetic diversity in NSCLC are correlated. This expression data (GSE18842) is composed of 91 samples, of which 46 are primary adenocarcinoma and squamous-cell carcinoma samples and 45 non-tumor samples are the control samples. The last expression data was provided by Lu et al. [23] and used to screen for the transcriptional modulation which causes lung cancer among non-smoking females in Taiwan. The expression data (GSE19804) is composed of 120 samples, of which 60 are lung cancer and the others are control samples.

To avoid ambiguous results, we preprocessed the expression data by removing those probe sets representing more than one gene or unknown genes. For multiple probe sets which refer to the same gene, the probe set with the highest variance was chosen. Moreover, an additional file that is needed in the subnetwork identification is the class file. The class file is a text file which indicates the class (case or control) of each sample in an expression matrix. The format should be in two columns; the first column is the name of each sample in the expression matrix, and the second column is the positive number specifying the class of each particular sample.
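The probe-set preprocessing described for the expression data (drop probe sets that map to several genes or to no known gene, then keep the highest-variance probe set per gene) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name, the dictionary-based input layout, and the probe and gene identifiers are all hypothetical, and population variance is an assumed choice since the paper does not say which variance estimator was used.

```python
from statistics import pvariance


def collapse_probes(expression, probe_genes):
    """Keep one probe set per gene: the one with the highest variance.

    expression:  {probe_id: [expression value per sample]}
    probe_genes: {probe_id: [gene symbols annotated to that probe set]}
    Probe sets annotated to no gene (unknown) or to several genes are dropped.
    """
    best = {}  # gene -> (variance, probe_id) of the best probe set seen so far
    for probe_id, values in expression.items():
        genes = probe_genes.get(probe_id, [])
        if len(genes) != 1:
            continue  # ambiguous or unannotated probe set
        gene = genes[0]
        variance = pvariance(values)  # assumed estimator (population variance)
        if gene not in best or variance > best[gene][0]:
            best[gene] = (variance, probe_id)
    return {gene: expression[probe_id] for gene, (_, probe_id) in best.items()}


# Toy input with made-up probe IDs and gene names (not from the datasets).
expression = {
    "p1": [5.0, 5.1, 4.9],
    "p2": [2.0, 6.0, 9.0],
    "p3": [1.0, 1.0, 1.0],
    "p4": [3.0, 4.0, 5.0],
}
probe_genes = {
    "p1": ["GENE_A"],
    "p2": ["GENE_A"],            # same gene as p1, but more variable
    "p3": ["GENE_B", "GENE_C"],  # maps to more than one gene
    "p4": [],                    # unknown gene
}
gene_matrix = collapse_probes(expression, probe_genes)
```

In the toy input only `GENE_A` survives, represented by probe set `p2`, because `p3` and `p4` are filtered out and `p2` varies more than `p1`.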
C. Gene-subnetwork Biomarker Identification

In this study, the gene-subnetwork biomarkers were identified using the PinnacleZ plugin for Cytoscape [9]. The procedure is described as follows.

The expression data was overlaid on the corresponding proteins in PPI networks during the identification process. The network modules were identified by expanding from a starting node to its neighbor nodes using the Greedy search algorithm and then determining the discriminative potential of each subnetwork by using mutual information (MI).

MI measures the interdependence between two random variables through their joint distribution. If the two random variables are independent, one variable gives no information about the other. Here, one variable is the discretized form of a vector summing up the expression values of the genes in a particular subnetwork, and the other is a vector defining tumor and non-tumor.

In the step of determining the significant subnetworks through statistical filtering, three statistical tests were performed. The p-value thresholds for these three tests were set as default (t1 = 0.05, t2 = 0.05, t3 = 5E-6). Only subnetworks having p-values smaller than the default thresholds were carried on for further analysis. Since the approach of PinnacleZ is stochastic, the identification of the subnetworks was repeated three times and the consensus subnetworks chosen were those overlapping among the three runs.

D. Clustering-based Gene-subnetwork Biomarker Identification

In order to reduce the heterogeneity of the gene expression data, a clustering technique was applied prior to gene biomarker identification to homogenize the patient dataset. Fig. 1 shows the workflow of the proposed clustering-based method. The gene expression datasets retrieved from GEO were first grouped into binary classes of case and control. In this study, "case" represents the state of samples having disease (patients), while "control" represents the healthy samples. The Expectation-Maximization (EM) clustering algorithm was used in this work. The EM algorithm has been applied in many fields of study when the data can be assumed to be a mixture of Gaussian distributions [24]. Furthermore, EM clustering allows the cross-validation process to determine the most appropriate number of clusters [25]. The EM Clusterer was applied using the WEKA library version 3.7-12 [26]. EM works by assigning a probability distribution to each sample in the case data to indicate the probability of belonging to each cluster. In WEKA, cross-validation was used to find the number of clusters. Initially, the number of clusters was set as 1. The dataset was then split into 10 folds and 9 folds of data were used to build a clustering model. Applying the clustering model, the log likelihood was calculated and averaged over all 10 iterations of cross-validation. When the log likelihood increased, the number of clusters was increased by 1. The process then iterated until the log likelihood did not increase any further. For clustering, only a subset of genes was used as features to minimize the effect of noise in using experimental microarray data.

The "HugeNavigator", which is an online tool for querying disease-related genes from collected publications, was used to select gene subsets [27]. The GeneProspector function of HugeNavigator was applied with the search term "Lung cancer" to retrieve the list of lung cancer related genes. A total of 1,328 genes was obtained.

In this step, only cases were clustered into subgroups. Then the dataset of each subgroup was combined with all controls. The gene-subnetwork biomarker identification step was applied to these combined datasets using PinnacleZ with the same set of parameters as described in the previous section.

E. Evaluation of gene-subnetwork biomarkers

The use of classification for biomarker evaluation is a common practice in microarray analysis [9], [28]-[31]. In this work, the gene-subnetwork biomarkers were applied as features for building a classification model in two different ways. The first is to use the expression of each gene member in the identified subnetwork as a feature for classification; these are the so-called gene-level subnetwork biomarkers. The second way is to transform the expressions of gene members in the identified subnetwork into activity scores to be the features for classification. This is termed subnetwork-level. The activity score was calculated using the following equation (1):

$$a_{kj} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} z_{ij} \qquad (1)$$

where $a_{kj}$ is the activity level of subject $j$ in subnetwork $k$, $z_{ij}$ is the normalized expression (mean = 0, standard deviation = 1) of gene $i$ in subject $j$, and $n$ is the number of members in subnetwork $k$.
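As an illustration of the subnetwork-level transformation, the sketch below computes one activity score per subject. It assumes the score is the sum of the z-scored expressions of the member genes divided by the square root of the number of members, which is the form used by Chuang et al. [9]; the function names, the dictionary layout and the gene labels are hypothetical, and normalizing each gene across subjects with the population standard deviation is also an assumption.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore(values):
    """Normalize one gene across subjects to mean 0 and standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)  # assumed: population standard deviation
    if sd == 0:
        return [0.0 for _ in values]  # a constant gene carries no signal
    return [(v - mu) / sd for v in values]


def activity_scores(expression, members):
    """Return the activity of one subnetwork for every subject.

    expression: {gene: [expression value per subject]}
    members:    genes belonging to the subnetwork
    """
    z = {gene: zscore(expression[gene]) for gene in members if gene in expression}
    n = len(z)
    if n == 0:
        raise ValueError("no subnetwork member has expression data")
    n_subjects = len(next(iter(z.values())))
    # Sum the normalized expressions over the n members, scaled by sqrt(n).
    return [sum(z[gene][j] for gene in z) / sqrt(n) for j in range(n_subjects)]


# Toy example with two made-up genes measured in three subjects.
toy_expression = {"G1": [1.0, 2.0, 3.0], "G2": [2.0, 4.0, 6.0]}
toy_activity = activity_scores(toy_expression, ["G1", "G2"])
```

Each subnetwork then contributes a single feature per subject, instead of one feature per member gene as in the gene-level case.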
The gene-subnetwork biomarkers were evaluated in terms of classification performance by applying both types of markers (gene expression-based and gene activity-based) identified from one dataset on another independent dataset. Five-fold cross-validation was used to assess the area under the receiver operating characteristic curve (AUC) and Recall as measures of the classification performance of biomarkers on an independent dataset. AUC is a known unbiased measure for classification used with class-imbalanced issues [32]. Recall is an important measure that is commonly used to evaluate the accuracy of medical screening [33]. The procedure of 5-fold cross-validation is described as follows. The dataset was randomly divided into 5 equal subsets while keeping the ratio between case and control approximately the same for all subsets. For each iteration, 4 subsets were used to train while the remaining subset was used to test the model, until all 5 subsets had been used as a test set. For the classifier, the Support Vector Machine (SVM) [34] was used because it can be applied to both binary and multi-class classification problems. In addition, SVM is commonly used in microarray analysis to analyze and recognize patterns by constructing hyperplanes to separate the different classes of interest [28], [31].
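The stratified split used in the 5-fold procedure can be sketched as follows. This is only an outline of the splitting step under stated assumptions: the helper names are hypothetical, samples of each class are shuffled and dealt round-robin into folds so that the case/control ratio stays approximately constant, and the classifier itself is left out.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Class sizes taken from GSE19804: 60 cases (1) and 60 controls (0).
labels = [1] * 60 + [0] * 60
splits = list(train_test_splits(labels, k=5, seed=1))
```

With 60 cases and 60 controls, every test fold holds 24 samples, 12 from each class. Repeating the procedure with different seeds corresponds to repeating the cross-validation to average out the randomness of the split.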
In this study, SVM with default parameters (i.e. type of SVM was C-SVC and kernel function was radial basis function with the degree equal to 3) was used. The classification performance was then calculated by averaging those measurements in the 5-fold procedure. To reduce the effect of stochasticity on the validation results, cross-validation was repeated 10 times and the average results from these repetitions were reported.

Fig. 1. Overview of clustering-based gene-subnetwork identification

III. RESULTS AND DISCUSSION

In this study, the proposed clustering-based method was compared with the non-clustering method to evaluate the significance of a gene-subnetwork for disease classification. Three lung cancer datasets were used to identify the sets of markers, and each set of markers was applied using cross-validation of independent datasets; e.g. identify markers using the Landi et al. dataset and use those markers in cross-validation of the Sanchez-Palencia et al. dataset. The clustering results are shown in Table I.

From the preliminary functional analysis, it can be seen that the datasets GSE10072 and GSE18842 were categorized into three subtypes while GSE19804 was divided into five subtypes. However, cluster 2 from the latter dataset was not associated with any subnetworks, indicating that it may possibly be a novel subtype, or that it can be merged with one of the other four clusters instead. More detailed analysis in consultation with domain experts will be undertaken in the future.

In summary, the cross-validation results indicate that the use of a clustering technique helps to improve gene subnetwork biomarker identification. Also, it provides a better classification model for medical screening purposes, especially when considering the significant improvement in terms of Recall.

Finally, gene-subnetwork identification using the proposed clustering-based method could identify more significant gene subnetworks. That is, by summing up the gene subnetworks in all clusters in each dataset, the total is greater than that of the whole dataset without clustering.

A. Significance of gene members

As a rough measure of the gene-level biomarkers, we simply evaluated the significance of those gene members found in gene-subnetworks by comparing with the list of reported disease-related genes for lung cancer from the GeneProspector function of the HugeNavigator (1,328 genes). The results show that the non-clustering method identified more genes with supporting evidence in the gene-subnetworks than the clustering-based method for the GSE18842 dataset, while the clustering-based approach did better in the other two datasets (see Fig. 2).

B. Significance of subnetwork biomarkers

The significance of subnetwork biomarkers for the clustering-based and non-clustering-based cases was assessed using cross-validation. In particular, one dataset was used for training, with the other two independent datasets used for testing. The gene-subnetwork biomarkers were separated into gene-level subnetwork biomarkers and subnetwork-level subnetwork biomarkers, according to the classification model built. The cross-validation results for the gene-level and subnetwork-level are shown in Tables II and III, respectively.

From the cross-validation results, the clustering-based method performed generally better than the non-clustering method. When combining the results from both types of biomarkers, the improvement is statistically significant according to a paired t-test, with both higher AUC (p-value = 0.01) and higher Recall (p-value = 0.008). Moreover, the subnetwork-level biomarkers alone also yielded statistically significant improvements by clustering.
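For readers who want to see what the paired comparison involves, the following sketch computes the paired t statistic and its degrees of freedom from matched results of two methods. It is a generic textbook formulation, not the authors' script: the function name is hypothetical, the numbers in the example are made up rather than taken from Tables II and III, and turning the statistic into a p-value additionally requires the Student t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for paired measurements.

    baseline, treated: equally long lists where position i in both lists
    refers to the same experiment (e.g. the same train/test dataset pair).
    """
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have the same length")
    differences = [t - b for b, t in zip(baseline, treated)]
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two pairs")
    spread = stdev(differences)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_value = mean(differences) / (spread / sqrt(n))
    return t_value, n - 1


# Made-up scores for four matched experiments (illustration only).
without_clustering = [1.0, 2.0, 3.0, 4.0]
with_clustering = [2.0, 4.0, 5.0, 4.0]
t_value, dof = paired_t_statistic(without_clustering, with_clustering)
```

A positive t value indicates that the second method scored higher on average across the matched experiments; the larger the value relative to the degrees of freedom, the smaller the resulting p-value.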
TABLE I. CLUSTERING AND SUBNETWORK IDENTIFICATION RESULTS

Accession Number | Dataset | Number of Case samples | Number of Subnetworks | Genes in Subnetworks
GSE10072 | Whole set | 58 | 8 | CALM1, CAND1, CAV1, CBL, CD81, CDC37, CRYAB, CRYBA1, DES, DOCK4, EDNRB, FBXO9, GRK5, HDGF, HIST1H2BG, HIST1H2BK, HLA-F, HSPB8, HTR1A, KAT2A, KCNA3, KCNN3, LILRB2, MAPT, MMS19, NRGN, PIAS1, PIK3R1, POP7, PTPN6, RBBP6, RBP4, RPP30, RPP38, RPP40, RUNX1T1, S1PR1, S1PR3, SCAPER, SH3GL1, SNCA, SPTBN1, TCF3, TCF4, TTN, UBQLN4, ZBTB16, ZBTB38, ZNF667
GSE10072 | Cluster 1 | 13 | 4 | ANGPT1, CAND1, CORT, KHDRBS1, KIAA0101, MPP6, NCKIPSD, PMM2, SEMA5A, SRC, SSTR3, SSTR5, STAT1, SUMO2
GSE10072 | Cluster 2 | 16 | 3 | ANGPT4, APCS, C4BPA, F10, F8, LRP1, PROS1, TEK, TIE1
GSE10072 | Cluster 3 | 29 | 7 | ACSL4, ARHGEF6, ARRB2, ATXN2, C20orf20, CDK17, CLASP1, CLASP2, CLIP2, CSDA, DAB2, DDX24, DDX56, DOCK4, ENG, EPAS1, FEZ1, FGFR2, GNAQ, GRB2, GRK5, HDGF, HIST1H2BC, HIST1H3J, HTT, KLRAP1, KPNA2, MYH10, PAFAH1B1, PCNP, PIK3R1, PRX, RPL37, SH3GL3, SPG20, SPTAN1, SPTBN1, SYNM, TACC1, TDRD7, TERF2IP, TGFB3, TGFBR2, TGFBR3, TJP1, TREM1, TSC22D1, TUBA1A, TYROBP, VIM, XBP1, ZAP70
GSE18842 | Whole set | 46 | 4 | MMP1, CAV1, MMP13, GRK5, MAT2A, ITGA2, FBXO38, CCNA2, PPP2R2A, STAT1, PTK2, PPARG, KLF7, SATB1, NR2F1, PIAS1, IRAK1, ATXN7, IL1RAP, SNRNP200, BCAN, PSMD4, USP7, MAT2B, KAT5, MMP8, HDAC7, ULBP1
GSE18842 | Cluster 1 | 19 | 26 | ALDH3B2, ATP8A1, BSCL2, C12orf48, C16orf88, CACNA2D2, CAND1, CANX, COL12A1, CPOX, CXCL16, EEF1A2, ELAVL1, GCLC, GIMAP5, GIMAP7, HIGD1B, KIAA0101, NDUFAB1, ODZ4, OLR1, RCCD1, RPL10AP3, RPS15AP25, SLC2A1, SLC6A8, SNORA7A, TCF3, TMEM194A, TMEM48, TTC34, USE1, VAMP5
GSE18842 | Cluster 2 | 14 | 1 | BSCL2, CANX, EEF1A2, SLC2A1, TCF3, USE1
GSE18842 | Cluster 3 | 13 | 9 | ALDH3B2, CAND1, CPOX, ELAVL1, GCLC, KIAA0101, ODZ4, RPL10AP3, RPS15AP25, SLC6A8, SNORA7A, TTC34
GSE19804 | Whole set | 60 | 6 | SPP1, KIAA0101, SPOCK2, ATP2A2, ALDH18A1, PRIM2, STK39, RAE1, CYB5R3, PRDX4, NME4, TADA1, ICT1, LDHA, MYCBP2, KAT2A, MRPL9, SUPV3L1, MPPED1, MRPS11, C7orfJO, GRIPAP1, HMGN1, GRIP1, MRM1, COX2
GSE19804 | Cluster 1 | 21 | 10 | AFG3L2, ALDH18A1, ATP5C1, BMX, CARD11, CAV1, CBLC, CCTJ, CD2AP, CHFR, CLK3, COX2, CYB5B, ELAVL1, FOXM1, FZD6, HARS2, HDAC4, HLA-F, ICT1, IKBKG, INO80D, IRF6, ISG15, KRT6B, KRT80, LILRB2, MC1R, MGRN1, MRPL1, MRPL41, MRPL49, MRPS11, MRPS28, MYOF, NEDD4, PAAF1, PABPC1, PARS2, PDGFRB, PDPK1, PPARG, PRDX4, PSMB2, PSMD3, PTEN, PTPN6, SENP1, SFRP1, SIRT7, SREK1, SRPK1, THNSL1, TKT, TSG101, TUFM, UBE2L3, UBE2T, UCHL5, VDAC1, WNT2, XRCC6, ZBTB3, ZEB2, ZNF556
GSE19804 | Cluster 2 | 11 | 0 | -
GSE19804 | Cluster 3 | 9 | 29 | AAGAB, ABCC4, ABLIM1, ACTB, AGR2, ALS2CR11, ANXA7, APC, APC2, AR, ARAP3, ARHGEF6, ARRB1, ATF4, ATP1A1, ATR, AXIN1, BANF1, BARD1, BCL6, BCR, BMS1, CALM2, CASP3, CBX1, CCDC107, CCDC134, CD19, CDH15, CHAF1A, CHEK1, COMT, COPS6, COPS7A, CRMP1, CTNNB1, CTTN, CUL4A, CUL4B, DACH1, DAP3, DAZAP2, DCAF12, DCAF5, DFFA, DFFB, DGKE, DLC1, DNAJA1, DOCK7, DUSP5, EFHC1, EIF3A, EIF3C, F8A1, FAF1, FBXO25, FSHR, FTH1, FZD8, GAB2, GABBR1, GIT1, GNAI2, GRIPAP1, GRK5, H2AFV, H3F3B, HAND1, HAP1, HAUS5, HAUS6, HAX1, HDAC3, HEY2, HEYL, HIST1H2AG, HIST1H2AL, HIST1H2AM, HIST1H3A, HIST1H3E, HIST1H4F, HMGB2, HNRNPU, HOOK2, HOOK3, HSP90AA1, HSPA13, HSPA5, HTT, IFITM1, IKBKB, IRAK1, IRAK1BP1, IRAK4, IRF7, ITSN2, JUP, KAT2B, KAT5, KCNK3, KIAA0101, KPNA2, KRT14, LEF1, LRP6, LRRK2, LUC7L2, MAGED1, MAGEF1, MAP1B, MAP3K3, MAPK3, MAPK6, MBD4, MDK, MDM4, MEOX2, MLLT4, MSH3, NCKAP5, NEDD8, NFKBIA, NOSTRIN, NPHS2, NR3C1, NUDT16L1, PABPC4, PLA2G4C, PLG, PML, PPP2CA, PRDX3, PTPN1, PTPRE, PVRL1, PWP1, RASSF1, RBBP5, RBBP6, RBM15, REEP5, RGS1, RGS2, RHOT1, RPL18, RPL22, RPP14, RPP30, RPS14, RPS16, RPS24, S100A10, SAMM50, SH3GL3, SH3KBP1, SIN3A, SIRT5, SIRT7, SLC25A5, SMAD4, SMARCA2, SNAPIN, SNRNP70, SNX7, SPECC1L, SPP1, SREBF2, SRGN, STMN2, SYK, TAB1, TEX11, THOC2, TIMELESS, TOP2A, TP73, TRIM27, TRIM32, TRIP6, UBC, UBE2E3, UBE2H, UBE2T, UBQLN4, UBR2, UBR4, UCHL1, VIP, WNT1, YWHAZ, ZBTB38, ZDHHC17, ZNF382, ZNF691, ZYG11B
GSE19804 | Cluster 4 | 8 | 4 | AKT1, BRCA1, BSCL2, CANX, CFTR, CHMP1B, CUL3, CXorf56, GRIA1, GSK3B, HNRNPA1, KIAA0101, LUC7L, MYC, NFKBIL1, NPTX1, PLXNA1, PTEN, RAB11A, RHOD, RNPS1, SMARCA4, SNAP23, SUCLG1, TCERG1, USE1
GSE19804 | Cluster 5 | 11 | 5 | LMX1B, SSBP3, TAL1, TRIM33, CNNM4, UBC, PTP4A2, USF2, S100A9, ARRB1, TTN, SPIN1
Fig. 2. Overlap with reported susceptibility genes result. [Bar chart titled "Genes overlap with reported susceptibility genes"; x-axis: datasets (GSE10072, GSE18842, GSE19804); y-axis ticks from 0 to 0.35; series: Non-cluster, Cluster.]

TABLE II. CROSS-VALIDATION RESULT USING GENE-LEVEL SUBNETWORK BIOMARKERS

Train | Test | AUC (Non-cluster) | AUC (Cluster) | Recall (Non-cluster) | Recall (Cluster)
GSE18842 | GSE10072 | 0.966±0.005 | 0.999±0.004 | 0.976±0.004 | 1±0.0
GSE19804 | GSE10072 | 0.922±0.014 | 0.935±0.004 | 0.925±0.025 | 0.953±0.007
GSE19804 | GSE18842 | 0.930±0.008 | 0.929±0.006 | 0.925±0.014 | 0.948±0.005
GSE10072 | GSE18842 | 0.979±0.003 | 0.966±0.007 | 0.998±0.005 | 0.979±0.007
GSE18842 | GSE19804 | 0.989±0.0 | 1±0.0 | 1±0.0 | 1±0.0
GSE10072 | GSE19804 | 0.957±0.005 | 0.990±0.009 | 0.978±0.004 | 1±0.0

TABLE III. CROSS-VALIDATION RESULT USING SUBNETWORK-LEVEL SUBNETWORK BIOMARKERS

Train | Test | AUC (Non-cluster) | AUC (Cluster) | Recall (Non-cluster) | Recall (Cluster)
GSE18842 | GSE10072 | 0.965±0.007 | 0.974±0.009 | 0.978±0.001 | 0.985±0.015
GSE19804 | GSE10072 | 0.935±0.007 | 0.950±0.008 | 0.937±0.007 | 0.952±0.012
GSE19804 | GSE18842 | 0.897±0.015 | 0.930±0.009 | 0.897±0.015 | 0.957±0.009
GSE10072 | GSE18842 | 0.950±0.009 | 0.959±0.006 | 0.953±0.012 | 0.978±0.008
GSE18842 | GSE19804 | 1±0.0 | 0.988±0.004 | 1±0.0 | 1±0.0
GSE10072 | GSE19804 | 0.959±0.009 | 0.984±0.013 | 0.985±0.005 | 1±0.0

Note that the results show that there is no significant difference between the use of gene-level and subnetwork-level subnetwork biomarkers. The paired t-test results when combining both types of biomarkers produced p-value = 0.168 for AUC and p-value = 0.109 for Recall.

IV. CONCLUSIONS

This study has extended existing gene-subnetwork biomarker identification methods by applying a clustering technique to homogenize gene expression datasets. From the comparative results, the clustering-based method performed statistically significantly better than the non-clustering method, with higher AUC and Recall, and also resulted in an increased number of significant subnetworks being identified. This work illustrates the benefits of using a clustering approach to preprocess data, which helps to identify better gene-subnetwork biomarkers of a complex disease like lung cancer.

ACKNOWLEDGMENT

This work is supported by a National Research Council of Thailand (NRCT) sub-grant.

REFERENCES

[1] L. Chen, J. Xuan, R. B. Riggins, R. Clarke, and Y. Wang, "Identifying cancer biomarkers by network-constrained support vector machines," BMC Syst. Biol., vol. 5, p. 161, 2011.
[2] A. L. Barabasi, N. Gulbahce, J. Loscalzo, "Network medicine: a network-based approach to human disease," Nat. Rev. Genet., vol. 12(1), p. 56, 2011.
[3] T. Ideker, R. Sharan, "Protein Networks in Disease," Genome Res., vol. 18, pp. 644-652, 2008.
[4] J. Quackenbush, "Computational analysis of microarray data," Nat. Rev. Genet., vol. 2(6), pp. 418-427, 2001.
[5] W. Pan, "A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments," Bioinformatics, vol. 18(4), pp. 546-554, 2001.
[6] J. Su, B. J. Yoon, and E. R. Dougherty, "Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network," BMC Bioinformatics, vol. 11, p. 6:S8, 2010.
[7] S. Prom-On, A. Chanthaphan, J. H. Chan, and A. Meechai, "Enhancing biological relevance of a weighted gene co-expression network for functional module identification," J. Bioinform. Comput. Biol., vol. 9(1), pp. 111-129, 2011.
[8] J. Chen and B. Yuan, "Detecting Functional Modules in the Yeast Protein-Protein Interaction Network," Bioinformatics, vol. 22, pp. 2283-2290, 2006.
[9] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker, "Network-based classification of breast cancer metastasis," J. Mol. Syst. Biol., vol. 3, p. 140, 2007.
[10] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, "Identifying functional modules in protein-protein interaction networks: an integrated exact approach," Bioinformatics, vol. 24(13), pp. i223-31, 2008.
[11] E. B. van den Akker, B. Verbruggen, B. T. Heijmans, M. Beekman, J. N. Kok, P. E. Slagboom, and M. J. Reinders, "Integrating protein-protein interaction networks with gene-gene co-expression networks improves gene signatures for classifying breast cancer metastasis," J. Integr. Bioinform., vol. 8(2), pp. 188, 2011.
[12] C. Wu, J. Zhu, and X. Zhang, "Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes," BMC Bioinformatics, vol. 13, pp. 182, 2012.
[13] M. Li, X. Wu, J. Wang, and Y. Pan, "Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data," BMC Bioinformatics, vol. 13, pp. 109, 2012.
[14] L. Zhang, S. Li, C. Hao, G. Hong, J. Zou, Y. Zhang, P. Li, and Z. Guo, "Extracting a few functionally reproducible biomarkers to build robust subnetwork-based classifiers for the diagnosis of cancer," Gene, vol. 526, pp. 232-8, 2013.
[15] H. Rakshit, N. Rathi, and D. Roy, "Construction and Analysis of the Protein-Protein Interaction Networks Based on Gene Expression Profiles of Parkinson's Disease," PLoS ONE, vol. 9(8), pp. e103047, 2014.
[16] C. E. Meacham and S. J. Morrison, "Tumour heterogeneity and cancer cell plasticity," Nature, vol. 501(7467), pp. 328-37, 2013.
[17] C. Ribeiro, F. de Assis T. de Carvalho, and I. G. Costa, "Semi-supervised Approach for Finding Cancer Sub-Classes on Gene Expression Data," Advances in Computational Biology, Lecture Notes in Bioinformatics. Berlin: Springer Verlag, vol. 6268, pp. 24-34, 2010.
[18] A. Chatr-Aryamontri et al., "The BioGRID interaction database: 2015 update," Nucleic Acids Research, vol. 43, pp. D470-8, 2014.
[19] P. Shannon et al., "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Research, vol. 13, pp. 2498-2504, 2003.
[20] T. Barrett et al., "NCBI GEO: mining millions of expression profiles database and tools," Nucleic Acids Research, vol. 33, pp. D562-D566, 2005.
[21] M. T. Landi et al., "Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival," PLoS One, vol. 3, pp. e1651, 2008.
[22] A. Sanchez-Palencia et al., "Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer," Int. J. Cancer, vol. 129, pp. 355-364, 2011.
[23] T. P. Lu et al., "Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women," Cancer Epidemiol. Biomarkers Prev., vol. 19(10), pp. 2590-7, 2010.
[24] M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, "A comparative study of different machine learning methods on microarray gene expression data," BMC Genomics, vol. 9, pp. 13, 2008.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 34, pp. 1-38, 1977.
[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, issue 1, 2009.
[27] W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. L. Khoury, "A navigator for human genome epidemiology," Nature Genetics, vol. 40, pp. 124-125, 2008.
[28] T. Li, C. Zhang and M. Ogihara, "A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression," Bioinformatics, vol. 20, pp. 2429-2437, 2004.
[29] J. H. Chan, P. Sootanan, and P. Larpeampaisarl, "Feature selection of pathway markers for microarray-based disease classification using negatively correlated feature sets," in Proc. International Joint Conference on Neural Networks (IJCNN 2011), San Jose, CA, 2011, pp. 3293-3299.
[30] P. Sootanan, S. Prom-on, A. Meechai, and J. H. Chan, "Pathway-based microarray analysis for robust disease classification," Neural Comput. & Appl., vol. 21, pp. 649-660, 2012.
[31] W. Engchuan and J. H. Chan, "Pathway activity transformation for multi-class classification of lung cancer datasets," Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.08.096.
[32] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Handling imbalanced dataset: A review," GESTS Int. Trans. Com. Sci. & Eng., vol. 30, pp. 25-36, 2006.
[33] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, Recall and F-score, with implication for evaluation," in Advances in Information Retrieval, pp. 345-359, 2005.
[34] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.