Clustering-based gene-subnetwork biomarker identification using gene expression data

Narumol Doungpan¹, Worrawat Engchuan², Asawin Meechai³ and Jonathan H. Chan²,§

¹ Department of Biological Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
² School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
³ Department of Chemical Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
§ Corresponding author

Abstract-The identification of predictive biomarkers of complex disease with robustness and specificity is an ongoing challenge. Gene expression data provide information on how the cell reacts to a particular state, and the relationships between genes may lead to novel information. A network-based approach integrating expression data with a protein-protein interaction network can be used to identify gene-subnetwork biomarkers for a particular disease. However, cancer datasets are heterogeneous in nature, containing unknown or undefined subtypes of cancers. In this study, we propose a gene-subnetwork biomarker identification approach that implements an Expectation-Maximization (EM) clustering technique to homogenize the dataset. To validate the proposed method, lung cancer expression datasets are used to identify gene-subnetwork biomarkers. The gene-subnetwork biomarkers are evaluated by 5-fold cross-validation on an independent dataset. The comparison between non-clustering and clustering-based gene-subnetwork identification showed that clustering produced improved classification performance at a statistically significant level. Furthermore, preliminary functional analysis results showed that more significant subnetworks were identified using the proposed approach.

Keywords-gene-subnetwork; protein-protein interaction; clustering; classification; gene expression; lung cancer

978-1-4799-1959-8/15/$31.00 ©2015 IEEE

I. INTRODUCTION

The network approach has been applied in biological cellular research for a decade to study the biological meaning of components at the systems level. In the network representation, each node represents a gene, protein, or metabolite, while the link between nodes represents the interaction or relationship according to the network construction assumption [1]. Because each cellular component functions through interactions with others, a biological subnetwork is supposed to represent a functional module in the cell. The link between nodes implies that the impact of a specific genetic abnormality is not restricted to the activity of the single gene product that carries it, but can alter the activity of gene products that carry no defects through the link. Thus a disease phenotype is rarely a consequence of an abnormality in a single effector gene product, but would rather reflect various pathological processes that interact in a complex network [2]. The protein-protein interaction (PPI) network is one of the biological networks which offers a valuable source of information at the systems level. The PPI network has also been used to study molecular evolution, to gain insight into the robustness of cells to perturbation, and to characterize protein functions. Furthermore, PPI plays an important role in unravelling the molecular basis of a disease, in understanding and identifying disease pathogenesis genes and disease-related subnetworks, and in classifying diseases by using a network-based approach [3].

In a genome-wide association study, microarray data is one of the high-throughput technologies used to inspect the gene product in a particular stage or environment [4]. Expression levels of thousands of genes are measured simultaneously and used as the profile of a sample. The differentially expressed gene (DEG) analysis of those profiles between patient and healthy control groups could reveal candidate genes or gene markers involved in the disease development that can be used for disease prediction [5].

Disease biomarkers with high specificity and sensitivity in real use are important, and the gene marker identification problem is a challenging task. Integrating different levels of data provides more reliable biomarkers for both disease prediction and classification [6]. Various techniques have been proposed to infer subnetworks from gene expression data [6]-[15]. A score-based approach is a common method for identification of significant subnetworks. For example, protein-protein interaction (PPI) data may be used as a network scaffold to be overlaid with gene expression data to determine a group of genes (called a subnetwork) that is significant to a disease. Expression profiles are informative data. The integration of this data with curated PPI data highlights the expression of each protein for subnetwork identification. These subnetworks are notable and enriched for a number of biological processes [8]. Chuang et al. [9] proposed the "PinnacleZ" method to identify subnetworks as biomarkers instead of a single gene by integrating a PPI network with expression profiles of breast cancer to identify markers for classification of breast cancer metastasis. The expression data are overlaid on the corresponding proteins in the PPI network. The significance of each subnetwork is determined by calculating a discriminative score. The differentially expressed genes from the disease (tumor) and normal cells can be linked by genes that are not differentially expressed but play a role in interconnecting the significantly differentially expressed genes in the subnetwork. In addition, these subnetwork biomarkers can imply the mechanism or pathway that plays a role in disease pathogenesis.

The concept of using a PPI network and expression data to identify the biomarker of disease has often been applied in prior research [1], [5], [10]-[15]. Accordingly, the challenge is to identify more reproducible and robust subnetwork biomarkers to provide more informative and effective markers for complex diseases.

Complex diseases tend to be heterogeneous in nature. For example, cancer originating from a single cell can proliferate and mutate into heterogeneity, which leads to non-coincidence of gene marker identification. Phenotypic and functional heterogeneity occur among cancer cells within the same tumor as a consequence of genetic change, environmental differences and reversible changes in cell properties [16]. The expression profile from tumor samples may provide different types of genetic variation, which needs to be considered in the analysis. Ribeiro et al. [17] proposed a semi-supervised method based on a clustering approach for finding sub-classes of cancer using gene expression data.

This study aims to improve the method of subnetwork biomarker identification by applying a clustering technique to reduce the heterogeneity of gene expression profiles in patient samples. The gene-subnetwork biomarkers were identified using three expression datasets of subjects with lung tumors and controls as well as the PPI network from BioGRID [18]. The proposed clustering-based method was compared with the traditional gene-subnetwork biomarker identification using 5-fold cross-validation.

II. MATERIALS AND METHOD

Each dataset downloaded from a database needs to be preprocessed prior to further analysis. The clustering method was applied to cluster the expression data of the disease samples into groups (or subgroups). Each disease group and the control samples were then used to identify gene-subnetwork biomarkers by integrating the expression data with the PPI network. By overlaying the expression value onto the corresponding protein in the PPI network, significant gene subnetworks were searched and determined by the discriminative score of a particular candidate subnetwork.

A. Protein-protein Interaction Network

Protein-protein interaction (PPI) data were downloaded from BioGRID (http://thebiogrid.org/) (version 3.2.95). The 151,895 PPI records comprised human (NCBI Taxonomy ID: 9606) and non-human protein data. Before utilizing this PPI data, non-human proteins, self-interactions and redundant interactions were removed. After the preprocessing of PPI data, a network of 13,586 proteins with 82,571 edges was formed. Such networks can be visualized conveniently using tools such as Cytoscape (http://www.cytoscape.org/) [19].

B. Gene Expression Data

The three lung cancer datasets used in this study were downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/) [20]. The first expression dataset was published by Landi et al. [21] and used to identify the genes that correspond to lung cancer from smoking. This gene expression analysis (GSE10072) was performed on fresh frozen tissue samples of adenocarcinoma and paired noninvolved lung tissue from 28 current, 26 former and 20 non-smokers, with biochemically validated smoking information. There are a total of 107 samples composed of 58 adenocarcinoma and 49 non-tumor samples as control. The second dataset was provided by Sanchez-Palencia et al. [22] and used to determine whether the phenotypic heterogeneity and genetic diversity in NSCLC are correlated. This expression dataset (GSE18842) is composed of 91 samples, of which 46 are primary adenocarcinoma and squamous-cell carcinoma samples and 45 are non-tumor control samples. The last expression dataset was provided by Lu et al. [23] and used to screen for the transcriptional modulation which causes lung cancer among non-smoking females in Taiwan. This expression dataset (GSE19804) is composed of 120 samples, of which 60 are lung cancer and the others are control samples.

To avoid ambiguous results, we preprocessed the expression data by removing those probe sets representing more than one gene or unknown genes. For multiple probe sets which refer to the same gene, the probe set with the highest variance was chosen. Moreover, an additional file that is needed in the subnetwork identification is the class file. The class file is a text file which indicates the class (case or control) of each sample in an expression matrix. The format should be in two columns; the first column is the name of each sample in the expression matrix, and the second column is a positive number specifying the class of each particular sample.
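The expression preprocessing described in Section II-B (dropping ambiguous probe sets, keeping the highest-variance probe set per gene, and writing a two-column class file) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy probe and gene identifiers, the tab separator, and the class labels 1 (case) and 2 (control) are all hypothetical.

```python
from statistics import pvariance


def collapse_probes(expr, probe_to_genes):
    """Return one expression vector per gene.

    expr: dict probe_id -> list of expression values (one per sample).
    probe_to_genes: dict probe_id -> list of gene symbols for that probe set.
    Probe sets mapped to no gene or to several genes are dropped; when several
    probe sets map to the same gene, the one with the highest variance is kept.
    """
    best = {}  # gene -> (variance, expression values)
    for probe, values in expr.items():
        genes = probe_to_genes.get(probe, [])
        if len(genes) != 1:
            continue  # unknown gene or more than one gene: ambiguous probe set
        gene = genes[0]
        var = pvariance(values)
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, values)
    return {gene: values for gene, (_, values) in best.items()}


def class_file_lines(sample_names, is_case, case_label=1, control_label=2):
    """Build the two-column class file: sample name, then a positive class number.

    The labels 1 and 2 and the tab separator are assumptions for illustration.
    """
    return [f"{name}\t{case_label if case else control_label}"
            for name, case in zip(sample_names, is_case)]


# Hypothetical toy data: three samples, four probe sets.
expr = {
    "probe_1": [1.0, 2.0, 3.0],   # maps to GENE_A, lower variance
    "probe_2": [1.0, 5.0, 9.0],   # maps to GENE_A, higher variance (kept)
    "probe_3": [0.0, 0.0, 1.0],   # maps to two genes (dropped)
    "probe_4": [4.0, 4.0, 4.0],   # unmapped (dropped)
}
probe_to_genes = {
    "probe_1": ["GENE_A"],
    "probe_2": ["GENE_A"],
    "probe_3": ["GENE_B", "GENE_C"],
}
genes = collapse_probes(expr, probe_to_genes)
lines = class_file_lines(["S1", "S2", "S3"], [True, True, False])
```

Ranking probe sets by population or sample variance gives the same choice, so the simpler population variance is used here.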

C. Gene-subnetwork Biomarker Identification

In this study, the gene-subnetwork biomarkers were identified using the PinnacleZ plugin for Cytoscape [9]. The procedure is described as follows.

The expression data was overlaid on the corresponding proteins in the PPI network during the identification process. The network modules were identified by expanding from a starting node to its neighbor nodes using a greedy search algorithm and then determining the discriminative potential of each subnetwork by using mutual information (MI).

MI measures the interdependence between two random variables based on their joint distribution. If the two variables are independent, one carries no information about the other. Here, one variable is the discretized form of a vector summing up the gene expression values of a particular subnetwork, and the other is a vector defining tumor and non-tumor.

In the step of determining the significant subnetworks through statistical filtering, three statistical tests were performed. The p-value thresholds for these three tests were set as default (t1 = 0.05, t2 = 0.05, t3 = 5E-6). Only subnetworks having p-values smaller than the default thresholds were carried on for further analysis. Since the approach of PinnacleZ is stochastic, the identification of the subnetworks was repeated three times and the consensus subnetworks chosen were those overlapping among the three runs.

D. Clustering-based Gene-subnetwork Biomarker Identification

In order to reduce the heterogeneity of the gene expression data, a clustering technique was applied prior to gene biomarker identification to homogenize the patient dataset. Fig. 1 shows the workflow of the proposed clustering-based method. The gene expression datasets retrieved from GEO were first grouped into binary classes of case and control. In this study, "case" represents the state of samples having disease (patients), while "control" represents the healthy samples. The Expectation-Maximization (EM) clustering algorithm was used in this work. The EM algorithm has been applied in many fields of study when the data can be assumed to be a mixture of Gaussian distributions [24]. Furthermore, EM clustering allows the cross-validation process to determine the most appropriate number of clusters [25]. The EM clusterer was applied using the WEKA library version 3.7.12 [26]. EM assigns each sample in the case data a probability distribution indicating its probability of belonging to each cluster. In WEKA, cross-validation was used to find the number of clusters. Initially, the number of clusters was set as 1. The dataset was then split into 10 folds and 9 folds of data were used to build a clustering model. Applying the clustering model, the log likelihood was calculated and averaged over all 10 iterations of cross-validation. When the log likelihood increased, the number of clusters was increased by 1. The process then iterated until the log likelihood did not increase any further. For clustering, only a subset of genes was used as features to minimize the effect of noise in using experimental microarray data.

The "HugeNavigator", which is an online tool for querying disease-related genes from collected publications, was used to select gene subsets [27]. The GeneProspector function of HugeNavigator was applied with the search term "Lung cancer" to retrieve the list of lung cancer related genes. A total of 1,328 genes was obtained.

In this step, only cases were clustered into subgroups. Then the dataset of each subgroup was combined with all controls. The gene-subnetwork biomarker identification step was applied to these combined datasets using PinnacleZ with the same set of parameters as described in the previous section.

E. Evaluation of gene-subnetwork biomarkers

The use of classification for biomarker evaluation is a common practice in microarray analysis [9], [28]-[31]. In this work, the gene-subnetwork biomarkers were applied as features for building a classification model in two different ways. The first is to use the gene expression of each gene member in the identified subnetwork as a feature for evaluation by classification; these are called gene-level subnetwork biomarkers. The second way is to transform the expressions of gene members in the identified subnetwork into activity scores to be the features for classification. This is termed subnetwork-level. The activity score was calculated using the following equation (1):

$$a_{kj} = \frac{\sum_{i \in k} z_{ij}}{\sqrt{n}} \qquad (1)$$

where $a_{kj}$ is the activity level of subnetwork $k$ in subject $j$, $z_{ij}$ is the normalized expression (mean = 0, standard deviation = 1) of gene $i$ in subject $j$, and $n$ is the number of members in subnetwork $k$.
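The subnetwork activity score and the mutual-information (MI) scoring described in Sections II-C and II-E can be sketched together as follows. This is a hedged illustration rather than the PinnacleZ implementation: it assumes the activity score is the sum of member z-scores divided by the square root of the subnetwork size, as in the PinnacleZ definition of Chuang et al. [9], and it uses equal-width binning with a caller-chosen number of bins. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def activity_scores(z, members):
    """Activity of one subnetwork in every subject.

    z: dict gene -> list of z-scored expression values (one per subject).
    members: list of genes in the subnetwork.
    Assumes a_kj = sum_i z_ij / sqrt(n), following Chuang et al. [9].
    """
    n = len(members)
    n_subjects = len(z[members[0]])
    return [sum(z[g][j] for g in members) / math.sqrt(n)
            for j in range(n_subjects)]


def discretize(values, bins):
    """Equal-width binning of real values into integer levels 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """MI in bits between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Hypothetical toy subnetwork of two genes measured in four subjects.
z = {
    "GENE_A": [1.0, 1.2, -1.0, -1.2],
    "GENE_B": [0.8, 1.0, -0.9, -0.9],
}
labels = [1, 1, 0, 0]  # 1 = tumor, 0 = non-tumor
scores = activity_scores(z, ["GENE_A", "GENE_B"])
mi = mutual_information(discretize(scores, bins=2), labels)
```

In this toy case the discretized activity separates the two classes perfectly, so the MI equals the entropy of the class labels; two independent sequences give an MI of zero.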

The gene-subnetwork biomarkers were evaluated in terms of classification performance by applying both types of markers (gene expression-based and gene activity-based) identified from one dataset on another independent dataset. Five-fold cross-validation was used to assess the area under the receiver operating characteristic curve (AUC) and Recall as measures of the classification performance of biomarkers on an independent dataset. AUC is a known unbiased measure for classification used with class-imbalanced issues [32]. Recall is an important measure that is commonly used to evaluate the accuracy of medical screening [33]. The procedure of 5-fold cross-validation is described as follows. The dataset was randomly divided into 5 equal subsets while keeping the ratio between case and control approximately the same for all subsets. For each iteration, 4 subsets were used to train while the remaining subset was used to test the model, until all 5 subsets had been used as a test set. For the classifier, the Support Vector Machine (SVM) [34] was used because it can be applied to both binary and multi-class classification problems. In addition, SVM is commonly used in microarray analysis to analyze and recognize patterns by constructing hyperplanes to separate the different classes of interest [28], [31].
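The stratified 5-fold split described above can be sketched as follows. This is an illustrative sketch only: the function names and the round-robin assignment are assumptions, and the example labels simply reuse the case and control counts reported for GSE10072 in Section II-B (58 cases, 49 controls).

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, keeping the class ratio roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled samples of this class to the folds in turn.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f, fold in enumerate(folds) if f != t for i in fold)
        yield train, test


# 1 = case, 0 = control, using the GSE10072 sample counts from Section II-B.
labels = [1] * 58 + [0] * 49
splits = list(train_test_splits(labels))
```

Repeating the procedure with different seeds and averaging the measurements corresponds to the repeated cross-validation used to reduce the effect of the random split.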

Fig. 1. Overview of clustering-based gene-subnetwork identification

In this study, SVM with default parameters was used (i.e. the type of SVM was C-SVC and the kernel function was the radial basis function, with the degree parameter left at its default of 3, which the radial basis kernel does not use). The classification performance was then calculated by averaging the measurements in the 5-fold procedure. To reduce the effect of stochasticity on the validation results, cross-validation was repeated 10 times and the average results from these repetitions were reported.

III. RESULTS AND DISCUSSION

In this study, the proposed clustering-based method was compared with the non-clustering method to evaluate the significance of a gene-subnetwork for disease classification. Three lung cancer datasets were used to identify the sets of markers, and each set of markers was applied using cross-validation of independent datasets; e.g. identify markers using the Landi et al. dataset and use those markers in cross-validation of the Sanchez-Palencia et al. dataset. The clustering results are shown in Table I.

From the preliminary functional analysis, it can be seen that the datasets GSE10072 and GSE18842 were categorized into three subtypes while GSE19804 was divided into five subtypes. However, cluster 2 from the latter dataset was not associated with any subnetworks, indicating that it may possibly be a novel subtype, or that it can be merged with one of the other four clusters instead. More detailed analysis in consultation with domain experts will be undertaken in the future.

Finally, gene-subnetwork identification using the proposed clustering-based method could identify more significant gene subnetworks. That is, by summing up the gene subnetworks in all clusters in each dataset, the total is greater than that of the whole dataset without clustering.

A. Significance of gene members

As a rough measure of the gene-level biomarkers, we simply evaluated the significance of those gene members found in gene-subnetworks by comparing with the list of reported disease-related genes for lung cancer from the GeneProspector function of the HugeNavigator (1,328 genes). The results show that the non-clustering method can identify genes with supporting evidence in the gene-subnetwork better than the clustering-based method for the GSE18842 dataset, while the clustering-based approach did better in the other two datasets (see Fig. 2).

B. Significance of subnetwork biomarkers

The significance of subnetwork biomarkers for the clustering-based and non-clustering-based case data was assessed using cross-validation. In particular, one dataset was used for training, with the other two independent datasets used for testing. The gene-subnetwork biomarkers were separated into gene-level subnetwork biomarkers and subnetwork-level subnetwork biomarkers, according to the classification model built. The cross-validation results for the gene-level and subnetwork-level are shown in Tables II and III, respectively.

From the cross-validation results, the clustering-based method performed generally better than the non-clustering method. When combining the results from both types of biomarkers, the improvement is statistically significant from a paired t-test, with both higher AUC (p-value = 0.01) and higher Recall (p-value = 0.008). Moreover, using the subnetwork-level biomarkers alone also yielded statistically significant improvements by clustering.

In summary, the cross-validation results indicate that the use of a clustering technique helps to improve gene subnetwork biomarker identification. Also, it provides a better classification model for medical screening purposes, especially when considering the significant improvement in terms of Recall.
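The paired t-test used in this section to compare the clustering-based and non-clustering results can be sketched as follows. This is a minimal sketch with hypothetical AUC values chosen only for illustration (they are not taken from Tables II or III); it computes the t statistic and the degrees of freedom, and leaves the p-value lookup to a statistics table or package.

```python
import math
from statistics import mean, stdev


def paired_t(baseline, treatment):
    """Paired t statistic and degrees of freedom for matched measurements."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical AUC values for four train/test pairs (illustration only).
auc_non_cluster = [0.90, 0.92, 0.94, 0.96]
auc_cluster = [0.93, 0.94, 0.97, 0.98]
t_stat, dof = paired_t(auc_non_cluster, auc_cluster)
```

With these illustrative numbers the statistic is about 8.66 on 3 degrees of freedom, well above the two-sided 5% critical value of roughly 3.182 for that number of degrees of freedom.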

TABLE I. CLUSTERING AND SUBNETWORK IDENTIFICATION RESULTS

Accession Number | Dataset | Number of Case samples | Number of Subnetworks | Genes in Subnetworks

GSE10072 | Whole set | 58 | 8 | CALM1, CAND1, CAV1, CBL, CD81, CDC37, CRYAB, CRYBA1, DES, DOCK4, EDNRB, FBXO9, GRK5, HDGF, HIST1H2BG, HIST1H2BK, HLA-F, HSPB8, HTR1A, KAT2A, KCNA3, KCNN3, LILRB2, MAPT, MMS19, NRGN, PIAS1, PIK3R1, POP7, PTPN6, RBBP6, RBP4, RPP30, RPP38, RPP40, RUNX1T1, S1PR1, S1PR3, SCAPER, SH3GU, SNCA, SPTBN1, TCF3, TCF4, TTN, UBQLN4, ZBTB16, ZBTB38, ZNF667

GSE10072 | Cluster 1 | 13 | 4 | ANGPT1, CAND1, CORT, KHDRBS1, KIAA0101, MPP6, NCKIPSD, PMM2, SEMA5A, SRC, SSTR3, SSTR5, STAT1, SUMO2

GSE10072 | Cluster 2 | 16 | 3 | ANGPT4, APCS, C4BPA, F10, F8, LRP1, PROS1, TEK, TIE1

GSE10072 | Cluster 3 | 29 | 7 | ACSL4, ARHGEF6, ARRB2, ATXN2, C20orf20, CDK17, CLASP1, CLASP2, CLIP2, CSDA, DAB2, DDX24, DDX56, DOCK4, ENG, EPAS1, FEZ1, FGFR2, GNAQ, GRB2, GRK5, HDGF, HIST1H2BC, HIST1H3J, HTT, KLRAP1, KPNA2, MYH10, PAFAH1B1, PCNP, PIK3R1, PRX, RPL37, SH3GL3, SPG20, SPTAN1, SPTBN1, SYNM, TACC1, TDRD7, TERF2IP, TGFB3, TGFBR2, TGFBR3, TJP1, TREM1, TSC22D1, TUBA1A, TYROBP, VIM, XBP1, ZAP70

GSE18842 | Whole set | 46 | 4 | MMP1, CAV1, MMP13, GRK5, MAT2A, ITGA2, FBXO38, CCNA2, PPP2R2A, STAT1, PTK2, PPARG, KLF7, SATB1, NR2F1, PIAS1, IRAK1, ATXN7, IL1RAP, SNRNP200, BCAN, PSMD4, USP7, MAT2B, KAT5, MMP8, HDAC7, ULBP1

GSE18842 | Cluster 1 | 19 | 26 | ALDH3B2, ATP8A1, BSCL2, C12orf48, C16orf88, CACNA2D2, CAND1, CANX, COL12A1, CPOX, CXCL16, EEF1A2, ELAVL1, GCLC, GIMAP5, GIMAP7, HIGD1B, KIAA0101, NDUFAB1, ODZ4, OLR1, RCCD1, RPL10AP3, RPS15AP25, SLC2A1, SLC6A8, SNORA7A, TCF3, TMEM194A, TMEM48, TTC34, USE1, VAMP5

GSE18842 | Cluster 2 | 14 | 1 | BSCL2, CANX, EEF1A2, SLC2A1, TCF3, USE1

GSE18842 | Cluster 3 | 13 | 9 | ALDH3B2, CAND1, CPOX, ELAVL1, GCLC, KIAA0101, ODZ4, RPL10AP3, RPS15AP25, SLC6A8, SNORA7A, TTC34

GSE19804 | Whole set | 60 | 6 | SPP1, KIAA0101, SPOCK2, ATP2A2, ALDH18A1, PRIM2, STK39, RAE1, CYB5R3, PRDX4, NME4, TADA1, ICT1, LDHA, MYCBP2, KAT2A, MRPL9, SUPV3L1, MPPED1, MRPS11, C7orf30, GRIPAP1, HMGN1, GRIP1, MRM1, COX2

GSE19804 | Cluster 1 | 21 | 10 | AFG3L2, ALDH18A1, ATP5C1, BMX, CARD11, CAV1, CBLC, CCTJ, CD2AP, CHFR, CLK3, COX2, CYB5B, ELAVL1, FOXM1, FZD6, HARS2, HDAC4, HLA-F, ICT1, IKBKG, INO80D, IRF6, ISG15, KRT6B, KRT80, LILRB2, MC1R, MGRN1, MRPL1, MRPL41, MRPL49, MRPS11, MRPS28, MYOF, NEDD4, PAAF1, PABPC1, PARS2, PDGFRB, PDPK1, PPARG, PRDX4, PSMB2, PSMD3, PTEN, PTPN6, SENP1, SFRP1, SIRT7, SREK1, SRPK1, THNSL1, TKT, TSG101, TUFM, UBE2L3, UBE2T, UCHL5, VDAC1, WNT2, XRCC6, ZBTB3, ZEB2, ZNF556

GSE19804 | Cluster 2 | 11 | 0 | (none)

GSE19804 | Cluster 3 | 9 | 29 | AAGAB, ABCC4, ABLIM1, ACTB, AGR2, ALS2CR11, ANXA7, APC, APC2, AR, ARAP3, ARHGEF6, ARRB1, ATF4, ATP1A1, ATR, AXIN1, BANF1, BARD1, BCL6, BCR, BMS1, CALM2, CASP3, CBX1, CCDC107, CCDC134, CD19, CDH15, CHAF1A, CHEK1, COMT, COPS6, COPS7A, CRMP1, CTNNB1, CTTN, CUL4A, CUL4B, DACH1, DAP3, DAZAP2, DCAF12, DCAF5, DFFA, DFFB, DGKE, DLC1, DNAJA1, DOCK7, DUSP5, EFHC1, EIF3A, EIF3C, F8A1, FAF1, FBXO25, FSHR, FTH1, FZD8, GAB2, GABBR1, GIT1, GNAI2, GRIPAP1, GRK5, H2AFV, H3F3B, HAND1, HAP1, HAUS5, HAUS6, HAX1, HDAC3, HEY2, HEYL, HIST1H2AG, HIST1H2AL, HIST1H2AM, HIST1H3A, HIST1H3E, HIST1H4F, HMGB2, HNRNPU, HOOK2, HOOK3, HSP90AA1, HSPA13, HSPA5, HTT, IFITM1, IKBKB, IRAK1, IRAK1BP1, IRAK4, IRF7, ITSN2, JUP, KAT2B, KAT5, KCNK3, KIAA0101, KPNA2, KRT14, LEF1, LRP6, LRRK2, LUC7L2, MAGED1, MAGEF1, MAP1B, MAP3K3, MAPK3, MAPK6, MBD4, MDK, MDM4, MEOX2, MLLT4, MSH3, NCKAP5, NEDD8, NFKBIA, NOSTRIN, NPHS2, NR3C1, NUDT16L1, PABPC4, PLA2G4C, PLG, PML, PPP2CA, PRDX3, PTPN1, PTPRE, PVRL1, PWP1, RASSF1, RBBP5, RBBP6, RBM15, REEP5, RGS1, RGS2, RHOT1, RPL18, RPL22, RPP14, RPP30, RPS14, RPS16, RPS24, S100A10, SAMM50, SH3GL3, SH3KBP1, SIN3A, SIRT5, SIRT7, SLC25A5, SMAD4, SMARCA2, SNAPIN, SNRNP70, SNX7, SPECC1L, SPP1, SREBF2, SRGN, STMN2, SYK, TAB1, TEX11, THOC2, TIMELESS, TOP2A, TP73, TRIM27, TRIM32, TRIP6, UBC, UBE2E3, UBE2H, UBE2T, UBQLN4, UBR2, UBR4, UCHL1, vJP, WNT1, YWHAZ, ZBTB38, ZDHHC17, ZNF382, ZNF691, ZYG11B

GSE19804 | Cluster 4 | 8 | 4 | AKT1, BRCA1, BSCL2, CANX, CFTR, CHMP1B, CUL3, CXorf56, GRIA1, GSK3B, HNRNPA1, KIAA0101, LUC7L, MYC, NFKBIL1, NPTX1, PLXNA1, PTEN, RAB11A, RHOD, RNPS1, SMARCA4, SNAP23, SUCLG1, TCERG1, USE1

GSE19804 | Cluster 5 | 11 | 5 | LMX1B, SSBP3, TAL1, TRIM33, CNNM4, UBC, PTP4A2, USF2, S100A9, ARRB1, TTN, SPIN1
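The number of clusters reported in Table I was chosen by the cross-validated log-likelihood rule described in Section II-D. A minimal sketch of that rule is given below, assuming one-dimensional data and a plain Gaussian-mixture EM written from scratch; it is not WEKA's EM clusterer, and the initialization, iteration count, variance floor and toy data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm(xs, k, iters=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture with k components by EM."""
    s = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [s[int((c + 0.5) * len(s) / k)] for c in range(k)]
    mu = sum(xs) / len(xs)
    var0 = max(sum((x - mu) ** 2 for x in xs) / len(xs), min_var)
    variances, weights = [var0] * k, [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc < 1e-8:
                continue  # leave an (almost) empty component unchanged
            weights[c] = nc / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            variances[c] = max(
                sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nc,
                min_var)
    return weights, means, variances


def avg_log_likelihood(xs, model):
    weights, means, variances = model
    total = 0.0
    for x in xs:
        p = sum(w * normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))
    return total / len(xs)


def cv_log_likelihood(xs, k, folds=10, seed=0):
    """Average held-out log-likelihood over the folds for a k-component model."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = [xs[i] for i in idx[f::folds]]
        train = [xs[i] for j, i in enumerate(idx) if j % folds != f]
        scores.append(avg_log_likelihood(test, fit_gmm(train, k)))
    return sum(scores) / folds


def select_k(xs, max_k=10):
    """Start at one cluster; add clusters while the CV log-likelihood increases."""
    k, best = 1, cv_log_likelihood(xs, 1)
    while k < max_k:
        nxt = cv_log_likelihood(xs, k + 1)
        if nxt <= best:
            break
        k, best = k + 1, nxt
    return k


# Hypothetical toy data: two well-separated groups of 40 values each.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(40)] + [rng.gauss(10, 1) for _ in range(40)]
chosen_k = select_k(data)
```

On data like this the held-out log-likelihood rises sharply from one to two components; whether a third component is accepted depends on the random split, which is one reason the selected number of clusters should be read as an estimate.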

Fig. 2. Overlap with reported susceptibility genes result (bar chart titled "Genes overlap with reported susceptibility genes"; series: Non-cluster and Cluster; datasets: GSE18842, GSE19804 and GSE10072; vertical axis from 0 to 0.35)

TABLE II. CROSS-VALIDATION RESULT USING GENE-LEVEL SUBNETWORK BIOMARKERS

Train | Test | AUC (Non-cluster) | AUC (Cluster) | Recall (Non-cluster) | Recall (Cluster)
GSE18842 | GSE10072 | 0.966±0.005 | 0.999±0.004 | 0.976±0.004 | 1±0.0
GSE19804 | GSE10072 | 0.922±0.014 | 0.935±0.004 | 0.925±0.025 | 0.953±0.007
GSE19804 | GSE18842 | 0.930±0.008 | 0.929±0.006 | 0.925±0.014 | 0.948±0.005
GSE10072 | GSE18842 | 0.979±0.003 | 0.966±0.007 | 0.998±0.005 | 0.979±0.007
GSE18842 | GSE19804 | 0.989±0.0 | 1±0.0 | 1±0.0 | 1±0.0
GSE10072 | GSE19804 | 0.957±0.005 | 0.990±0.009 | 0.978±0.004 | 1±0.0

The best performance of each cross-validation is highlighted in bold.

TABLE III. CROSS-VALIDATION RESULT USING SUBNETWORK-LEVEL SUBNETWORK BIOMARKERS

Train | Test | AUC (Non-cluster) | AUC (Cluster) | Recall (Non-cluster) | Recall (Cluster)
GSE18842 | GSE10072 | 0.965±0.007 | 0.974±0.009 | 0.978±0.001 | 0.985±0.015
GSE19804 | GSE10072 | 0.935±0.007 | 0.950±0.008 | 0.937±0.007 | 0.952±0.012
GSE19804 | GSE18842 | 0.897±0.015 | 0.930±0.009 | 0.897±0.015 | 0.957±0.009
GSE10072 | GSE18842 | 0.950±0.009 | 0.959±0.006 | 0.953±0.012 | 0.978±0.008
GSE18842 | GSE19804 | 1±0.0 | 0.988±0.004 | 1±0.0 | 1±0.0
GSE10072 | GSE19804 | 0.959±0.009 | 0.984±0.013 | 0.985±0.005 | 1±0.0

The best performance of each cross-validation is highlighted in bold.

Note that the results show that there is no significant difference between the use of gene-level and subnetwork-level subnetwork biomarkers. The paired t-test results when combining both types of biomarkers produced p-value = 0.168 for AUC and p-value = 0.109 for Recall.

IV. CONCLUSIONS

This study has extended existing gene-subnetwork biomarker identification methods by applying a clustering technique to homogenize gene expression datasets. From the comparative results, the clustering-based method performed statistically significantly better than the non-clustering method, with higher AUC and Recall, and also resulted in an increased number of significant subnetworks being identified. This work illustrates the benefits of using a clustering approach to preprocess data, which helps to identify better gene-subnetwork biomarkers of a complex disease like lung cancer.

ACKNOWLEDGMENT

This work is supported by a National Research Council of Thailand (NRCT) sub-grant.

REFERENCES

[1] L. Chen, J. Xuan, R. B. Riggins, R. Clarke, and Y. Wang, "Identifying cancer biomarkers by network-constrained support vector machines," BMC Syst. Biol., vol. 5, p. 161, 2011.
[2] A. L. Barabasi, N. Gulbahce, and J. Loscalzo, "Network medicine: a network-based approach to human disease," Nat. Rev. Genet., vol. 12(1), p. 56, 2011.
[3] T. Ideker and R. Sharan, "Protein Networks in Disease," Genome Res., vol. 18, pp. 644-652, 2008.
[4] J. Quackenbush, "Computational analysis of microarray data," Nat. Rev. Genet., vol. 2(6), pp. 418-427, 2001.
[5] W. Pan, "A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments," Bioinformatics, vol. 18(4), pp. 546-554, 2001.
[6] J. Su, B. J. Yoon, and E. R. Dougherty, "Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network," BMC Bioinformatics, vol. 11, p. 6:S8, 2010.
[7] S. Prom-On, A. Chanthaphan, J. H. Chan, and A. Meechai, "Enhancing biological relevance of a weighted gene co-expression network for functional module identification," J. Bioinform. Comput. Biol., vol. 9(1), pp. 111-129, 2011.
[8] J. Chen and B. Yuan, "Detecting Functional Modules in the Yeast Protein-Protein Interaction Network," Bioinformatics, vol. 22, pp. 2283-2290, 2006.
[9] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker, "Network-based classification of breast cancer metastasis," J. Mol. Syst. Biol., vol. 3, p. 140, 2007.
[10] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, "Identifying functional modules in protein-protein interaction networks: an integrated exact approach," Bioinformatics, vol. 24(13), pp. i223-31, 2008.
[11] E. B. van den Akker, B. Verbruggen, B. T. Heijmans, M. Beekman, J. N. Kok, P. E. Slagboom, and M. J. Reinders, "Integrating protein-protein interaction networks with gene-gene co-expression networks improves gene signatures for classifying breast cancer metastasis," J. Integr. Bioinform., vol. 8(2), p. 188, 2011.
[12] C. Wu, J. Zhu, and X. Zhang, "Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes," BMC Bioinformatics, vol. 13, p. 182, 2012.
[13] M. Li, X. Wu, J. Wang, and Y. Pan, "Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data," BMC Bioinformatics, vol. 13, p. 109, 2012.
[14] L. Zhang, S. Li, C. Hao, G. Hong, J. Zou, Y. Zhang, P. Li, and Z. Guo, "Extracting a few functionally reproducible biomarkers to build robust subnetwork-based classifiers for the diagnosis of cancer," Gene, vol. 526, pp. 232-8, 2013.
[15] H. Rakshit, N. Rathi, and D. Roy, "Construction and Analysis of the Protein-Protein Interaction Networks Based on Gene Expression Profiles of Parkinson's Disease," PLoS ONE, vol. 9(8), p. e103047, 2014.
[16] C. E. Meacham and S. J. Morrison, "Tumour heterogeneity and cancer cell plasticity," Nature, vol. 501(7467), pp. 328-37, 2013.
[17] C. Ribeiro, F. de Assis T. de Carvalho, and I. G. Costa, "Semi-supervised Approach for Finding Cancer Sub-Classes on Gene Expression Data," Advances in Computational Biology, Lecture Notes in Bioinformatics. Berlin: Springer Verlag, vol. 6268, pp. 24-34, 2010.
[18] A. Chatr-Aryamontri et al., "The BioGRID interaction database: 2015 update," Nucleic Acids Research, vol. 43, pp. D470-8, 2014.
[19] P. Shannon et al., "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Research, vol. 13, pp. 2498-2504, 2003.
[20] T. Barrett et al., "NCBI GEO: mining millions of expression profiles database and tools," Nucleic Acids Research, vol. 33, pp. D562-D566, 2005.
[21] M. T. Landi et al., "Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival," PLoS One, vol. 3, p. e1651, 2008.
[22] A. Sanchez-Palencia et al., "Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer," Int. J. Cancer, vol. 129, pp. 355-364, 2011.
[23] T. P. Lu et al., "Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women," Cancer Epidemiol. Biomarkers Prev., vol. 19(10), pp. 2590-7, 2010.
[24] M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, "A comparative study of different machine learning methods on microarray gene expression data," BMC Genomics, vol. 9, p. 13, 2008.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 34, pp. 1-38, 1977.
[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, issue 1, 2009.
[27] W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. L. Khoury, "A navigator for human genome epidemiology," Nature Genetics, vol. 40, pp. 124-125, 2008.
[28] T. Li, C. Zhang, and M. Ogihara, "A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression," Bioinformatics, vol. 20, pp. 2429-2437, 2004.
[29] J. H. Chan, P. Sootanan, and P. Larpeampaisarl, "Feature selection of pathway markers for microarray-based disease classification using negatively correlated feature sets," in Proc. International Joint Conference on Neural Networks (IJCNN 2011), San Jose, CA, 2011, pp. 3293-3299.
[30] P. Sootanan, S. Prom-on, A. Meechai, and J. H. Chan, "Pathway-based microarray analysis for robust disease classification," Neural Comput. & Appl., vol. 21, pp. 649-660, 2012.
[31] W. Engchuan and J. H. Chan, "Pathway activity transformation for multi-class classification of lung cancer datasets," Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.08.096.
[32] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Handling imbalanced dataset: A review," GESTS Int. Trans. ComSci. & Eng., vol. 30, pp. 25-36, 2006.
[33] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, Recall and F-score, with implication for evaluation," in Advances in Information Retrieval, pp. 345-359, 2005.
[34] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
