Tumor Biol. DOI 10.1007/s13277-015-3448-5

RESEARCH ARTICLE

Long non-coding RNA LINC01296 is a potential prognostic biomarker in patients with colorectal cancer Jia-jun Qiu 1 & Jing-bin Yan 1,2,3

Received: 23 February 2015 / Accepted: 8 April 2015 # International Society of Oncology and BioMarkers (ISOBM) 2015

Abstract Colorectal cancer (CRC), one of the most malignant cancers, is currently the fourth leading cause of cancer deaths worldwide. Recent studies indicated that long noncoding RNAs (lncRNAs) could be robust molecular prognostic biomarkers that can refine the conventional tumor-nodemetastasis staging system to predict the outcomes of CRC patients. In this study, the lncRNA expression profiles were analyzed in five datasets (GSE24549, GSE24550, GSE35834, GSE50421, and GSE31737) by probe set reannotation and an lncRNA classification pipeline. Twenty-five lncRNAs were differentially expressed between CRC tissue and tumoradjacent normal tissue samples. In these 25 lncRNAs, patients with higher expression of LINC01296, LINC00152, and FIRRE showed significantly better overall survival than those with lower expression (P < 0.05), suggesting that these lncRNAs might be associated with prognosis. Multivariate analysis indicated that LINC01296 overexpression was an independent predictor for patients’ prognosis in the test datasets (GSE24549, GSE24550) (P = 0.001) and an Electronic supplementary material The online version of this article (doi:10.1007/s13277-015-3448-5) contains supplementary material, which is available to authorized users. * Jing-bin Yan [email protected] 1

Shanghai Children’s Hospital, Shanghai Institute of Medical Genetics, Shanghai Jiao Tong University School of Medicine, 24/1400 West Beijing Rd., Shanghai 200040, China

2

Key Laboratory of Embryo Molecular Biology, Ministry of Health of China and Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, China

3

State Key Laboratory of Biocherapy/Collaborative Innovation Center for Biotherapy, West China Hospital, Sichuan University, Chengdu 610041, China

independent validation series (GSE39582) (P=0.027). Our results suggest that LINC01296 could be a novel prognosis biomarker for the diagnosis of CRC. Keywords Colorectal cancer . lncRNA . Biomarker . LINC01296

Introduction Colorectal cancer (CRC) is currently one of the most common cancers and the fourth leading cause of cancer deaths worldwide [1]. According to an estimate, there are 1.2 million new CRC cases and >600,000 deaths every year, which accounts for ∼8 % of all cancer deaths [2]. The incidence and death rates of CRC have been rapidly increasing over the last few years in Asian countries [3]. In China, the rates are much higher than the worldwide average [4]. Currently, clinicopathologic tumor staging, which is based on the tumor-node-metastasis (TNM) system, is the commonly used prognostic marker of CRC clinical outcomes. However, the TNM staging system is not a reliable predictor of CRC outcome. Histologically identical CRC patients may have totally different disease progression and clinical outcomes owing to their different genetic and epigenetic backgrounds. For example, although most TNM stage II patients with no lymph node metastasis have a better prognosis, one fourth of these patients may still have a high risk for relapse after surgical resection (classified as high-risk stage II patients) [5, 6]. DNA microsatellite instability (MSI) is a phenomenon displayed in most cancers of the colon and rectum; it refers to a clonal change in the number of repeated DNA nucleotide units in microsatellites caused by deletions or insertions, and it

Tumor Biol.

occurs in tumors with deficient mismatch repair [6]. MSI has been systematically analyzed for prognostic potential in CRC. The MSI-high (MSI-H) phenotype, which is present in 15 % of CRC, confers a good prognosis and a less aggressive clinical course than the MSI-low (MSI-L) or microsatellite stable (MSS) phenotype [7, 8]. MSI is the hallmark of Lynch syndrome which is an autosomal dominant hereditary syndrome caused by germline mutations in the MLH1, MSH2, MSH6, and PMS2 genes, although it is not solely restricted to hereditary CRC. Therefore, MSI is a marker for better clinical outcomes but appears to be more pronounced for Lynch syndrome [6, 8]. The vast majority of the human genome (98 %) does not code for proteins and gives rise to non-protein-coding RNAs (ncRNAs) [9]. Long non-coding RNAs (lncRNAs) are RNA polymerase II transcripts of >200 nucleotides that lack an open reading frame [10]. lncRNA makes up the biggest class of ncRNAs, with ∼58,000 human lncRNA genes annotated thus far [11]. Unlike the smaller noncoding micoRNAs, the functions of the majority of lncRNAs are not fully clear. However, with the improvement of technology and research in transcriptome profiles, increasing evidence shows that some lncRNAs, which can regulate gene expression at transcriptional, post-transcriptional, and epigenetic levels by interacting with DNA, RNA, and protein [10, 12, 13], play important roles in serial steps of cancer development [12]. These lncRNAs are involved in both oncogenic and tumor-suppressive pathways [14, 15]. Epigenetic studies have shown that lncRNA can predict cancer outcomes and further identify those patients who should require more aggressive treatments [16]. The aberrant expression patterns of lncRNAs can also be used to diagnose cancer or reflect disease prognosis and serve as predictors of patient outcomes. For instance, HOTAIR, a lncRNA located in HOX loci, is highly expressed in human cervical cancer and primary breast tumors, and its high expression level in tumors is a powerful biomarker of poor prognosis and metastases [17, 18]. To identify possible biomarkers for predicting CRC outcomes, we analyzed a cohort of published datasets from the gene expression omnibus (GEO) and investigated the correlation between the expression of some specific lncRNAs and clinical prognostic variables. Table 1 Five datasets included in this study

Materials and methods CRC gene expression data from GEO CRC expression data were obtained from GEO. The datasets were selected using the following criteria: (a) patients had CRC, (b) CRC tissue and tumor-adjacent normal tissue samples were available for comparison, (c) data were obtained from the same platform, and (d) more than three samples existed. According to these criteria, five datasets (GSE24549, GSE24550, GSE35834, GSE50421, and GSE31737) were chosen (Table 1). These data were obtained from the Affymetrix Human Exon 1.0 ST platform. In the next step, four datasets (GSE24549, GSE24550, GSE35834, and GSE50421) were used in the Bleave one dataset out^ process [23], and GSE31737 served as an independent dataset to validate the gene signatures derived from the meta-analysis. Among them, information on disease-free survival (DFS), MSI status, and TNM stage of samples were available for GSE24549 and GSE24550. Therefore, these two datasets were used to investigate the correlation between the lncRNA expression profiles and CRC prognosis. Individual data processing The raw CEL files of the five datasets were quantilenormalized and background-adjusted using robust multichip average (RMA), which is an effective tool for computing lncRNA profiling data with AltAnalyze software [24, 25]. The normalized data were analyzed with Linear Model for Microarray Data, a modified t test that incorporates the Benjamini–Hochberg multiple-hypotheses correction technique, through R 3.1.1 [26]. The probe sets for which the adjusted P-value was below 0.01 and the expression level differed by a fold change of ≥2 between two comparison groups were defined as significantly different probe sets. Identification of differentially expressed probe sets A workflow (Fig. 1) for the consistency-based meta-analysis approach was designed to identify the distinctive probe sets as described by Yang et al. [23]. A Bleave one dataset out^ validation process was used in this approach, and the result was

Dataset

No. of tumor

No. of control

Reference

Year published

Platform

GSE31737 GSE24549 GSE24550 GSE35834 GSE50421

40 83 77 30 24

40 13 13 23 25

Loo LW et al. [19] Sveen A et al. [20] Sveen A et al. [20] Pizzini S et al. [21] Aziz MA et al. [22]

2012 2011 2011 2014 2014

Affymetrix Human Exon 1.0 ST Array

Tumor Biol. Fig. 1 Consistency-based metaanalysis process. Four datasets (GSE24549, GSE24550, GSE35834, GSE50421) were shuffled to create four scenarios. In each scenario, three of the datasets were used to generate signatures, and the result was validated in an independent dataset in each scenario. The signatures produced from each scenario were confirmed by the fifth dataset (GSE31737)

validated in an independent dataset (GSE31737). First, the meta-analysis was replicated four times. One of the four datasets (GSE24549, GSE24550, GSE35834, and GSE50421) was left out each time, and the analysis was performed using the remaining three datasets as one scenario. In each scenario, after aggregating the distinctive probe sets in three datasets, the probe sets differentially expressed in at least two datasets were selected as the probe set signatures of this scenario. The differentially expressed probe sets were then validated in the fourth dataset. Second, the probe sets differentially expressed in at least two scenarios were picked out as the final probe set signatures. An independent dataset (GSE31737) was used for validation. Probe set reannotation and lncRNA classification pipeline To reannotate the probe sets, the sequences of probes were downloaded from the official website of Affymetrix (http:// www.affymetrix.com/Auth/analysis/downloads/na25/ wtexon/HuEx-1_0-st-v2.probe. fa.zip), and sequence alignment was performed between all of the probes obtained from the four meta-analysis scenarios (as presented in Fig. 1) and the Refseq database by NCBI Blast-2.2.30+. Because each probe set comprises four probes with 25 nucleotides in Affymetrix Human Exon 1.0 ST, probe sets were filtered and reannotated by the following criteria: (a) the probe should perfectly hit its target gene (E-value=2e-6, Query cover= 100 %, Ident=100 %, antisense), (b) the probe should not hit more than one target perfectly, (c) four probes in one probe set should hit the same gene, and (d) the accession number of the hit gene should be BNR_^ (NR indicates non-coding RNA in the Refseq database). According to the above process, the differentially expressed ncRNA probe sets were identified.

Then, to achieve the lncRNA probe sets, only the probe sets whose genes were defined as lncRNAs by NCBI remained. Data analysis To inspect the Bleave one dataset out^ cross-validation result visually, hierarchical clustering analysis (HCA) was performed on the differentially expressed lncRNAs obtained from each scenario and an independent dataset (GSE31737) with Cluster&TreeView [27]. Principal component analysis (PCA) was conducted with the bioconductor package pcaMethods [28]. The samples were grouped by HCA according to similarities in their gene expression profiles, whereas the PCA summarized the most important variables in a dataset as principal components and classified the samples using as few variables as possible [13]. In HCA cluster analysis, the Euclidean distance method was used to cluster arrays. Survival analysis The test series (GSE24549, GSE24550) and an independent validation series (GSE39582) were used to identify the Table 2

Number of probe sets differentially expressed in each dataset

GSE31737 GSE24549 GSE24550 GSE35834 GSE50421

Upregulated probe sets

Downregulated probe sets

Differential probe sets

5271 5923 11,185 3633 16,884

4893 51,266 37,423 12,105 11,210

10,164 57,189 48,608 15,738 28,094

Tumor Biol. Table 3

Number of distinctive lncRNAs in four scenarios

Scenario

Upregulated lncRNAs

Downregulated lncRNAs

Differential lncRNAs

Probe sets of differentially expressed lncRNAs

I II III IV

23 20 22 21

22 18 26 24

45 38 48 45

81 74 87 79

correlation between expression levels of specific lncRNAs and CRC prognosis. Dataset GSE39582 which is based on Affymetrix HG-U133 plus 2 platform contained a total of 566 samples with their clinical data, and 541 samples remained after filtering out those clinical data which were not complete [29]. To eliminate individual differences, for each lncRNAs, the expression value was normalized by dividing the average expression value of all of its probe sets by that of GAPDH. The probe sets hitting GAPDH were obtained from NetAffx and filtered by NCBI Blast-2.2.30+. Receiver Table 4

operating characteristic curves were used to determine the cutoff value of two groups distinguished by the expression level of a specific lncRNA [30]. The method of Kaplan and Meier was used to construct curves with diagnosis of CRC based on lncRNA expression status, and survival curves were compared by log-rank test. To evaluate the association between them, Cox proportional hazards analysis was performed to calculate the hazard ratio and the 95 % confidence interval (CI). In addition, a multivariate Cox regression was measured to identify independent prognostic factors of significance [30, 31]. A two-tailed P-value of 0.05 or less was considered as statistically significant.

Results Distinctive lncRNA expression pattern between tumor tissue and tumor-adjacent normal tissue samples Through individual data processing, differentially expressed probe sets of each dataset were obtained (Table 2). After the

Summary of differentially expressed lncRNAs

lncRNA

Expression status

Chromosome

Start

End

Description

BLACAT1 CASC19 CASC21 CDKN2B-AS1 CCAT1 CRNDE DLEU1

Upregulated Upregulated Upregulated Downregulated Upregulated Upregulated Upregulated

1q32.1 8q24.21 8q24.21 9p21.3 8q24.21 16q12.2 13q14.3

205434886

205456086

21994791 127207382 54918863 50082169

22121097 127219268 54929189 50528643

Bladder cancer-associated transcript 1 Cancer susceptibility candidate 19 Cancer susceptibility candidate 21 CDKN2B antisense RNA 1 Colon cancer-associated transcript 1 Colorectal neoplasia differentially expressed Deleted in lymphocytic leukemia 1

ELFN1-AS1 FAM83H-AS1 FIRRE FOXP1-AS1 FOXP4-AS1 HAGLR

Upregulated Upregulated Upregulated Downregulated Upregulated Downregulated

7p22.3 8q24.3 Xq26.2 3p13 6p21.1 2q31.2

1738630 143734140 131702650

17442310 143746337 131830643

41523895 176173195

41546255 176188958

KBTBD11-OT1 LINC01234 LINC01296 LINC00152 LINC00858 MIR4435-1HG LOC100190940 LOC100505938 LOC145837 LOC400706 UCA1 ZFAS1

Downregulated Upregulated Upregulated Upregulated Upregulated Upregulated Upregulated Upregulated Downregulated Upregulated Upregulated Upregulated

8p23.3 12 14q11.2 2p11.2 10q23.1 2q13 12q24.33 7p21.3 15q23 19q13.32 19p13.12 20q13.13

113744577 19076244 87455455 84279980 111367002 130033812 8237045 69561720 46563469 15828947 49278178

113773683 19096796 87521518 84294659 111495115 130042342 8262821 69571440 46580982 15836321 49289260

ELFN1 antisense RNA 1 FAM83H antisense RNA 1 (head to head) FIRRE intergenic repeating RNA element FOXP1 antisense RNA 1 FOXP4 antisense RNA 1 HOXD antisense growth-associated long non-coding RNA KBTBD11 overlapping transcript 1 Long intergenic non-protein coding RNA 1234 Long intergenic non-protein coding RNA 1296 Long intergenic non-protein coding RNA 152 Long intergenic non-protein coding RNA 858 MIR4435-1 host gene Uncharacterized LOC100190940 Uncharacterized LOC100505938 Uncharacterized LOC145837 Uncharacterized LOC400706 Urothelial cancer-associated 1 ZNFX1 antisense RNA 1

Tumor Biol.

Fig. 2 Validation of 25 differentially expressed lncRNAs using an independent dataset (GSE31737). In HCA (a), the x axis represents the samples, and lncRNAs are shown on the y axis. Red spots represent upregulated genes, and green spots represent downregulated genes. The sample types are shown with bar colors in the dendrogram; blue stripes

represent tumor-adjacent normal samples, and red stripes are tumor samples. In PCA (b), green dots represent tumor-adjacent normal tissue samples, and red dots represent tumor tissue samples. In GSE31737, CRC samples can be distinguished from tumor-adjacent normal tissues by HCA and PCA

Bleave one dataset out^ validation process, distinctive lncRNAs were identified from four scenarios through the probe set reannotation and lncRNA classification pipeline (Table 3). In each scenario, the lncRNA signatures aggregated from any three datasets were validated by an independent dataset. HCA clustering data revealed clear distinctions between CRC tissue and tumoradjacent normal tissue samples (Supplementary Figs. 1A, 2A, 3A, 4A), which could also be distinguished by PCA (Supplementary Figs. 1B, 2B, 3B, 4B). In scenario I, all samples were perfectly distributed into the CRC tissue and tumor-adjacent normal tissue groups by PC1=0 (Supplementary Fig. 1B). In scenario II, 29 of 30 tumor-adjacent normal tissue samples and 29 of 30 tumor tissue samples were distributed by PC1 = 0 (Supplementary Fig. 2B). In scenario III, all samples were perfectly distributed into two groups by PC1=7 (Supplementary Fig. 3B). In scenario IV, 76 of 77 tumor tissue samples and all tumor-adjacent normal tissue samples were distributed by PC1 = −6 (Supplementary Fig. 4B). Through aggregation of the four scenarios of lncRNA signatures, 25 lncRNAs were identified, which included 20 upregulated and five downregulated lncRNAs (Table 4). HCA and PCA of all samples in the independent dataset (GSE31737) using the final 25 differentially expressed lncRNAs indicated that most samples could be distributed by PCA1: PCA1>−2 in 35 of 40 CRC tissue samples and PCA1 < −2 in 36 of 40 tumor-adjacent normal tissue samples. A total of 71 samples could be correctly distributed by PCA1. Thus, CRC tissue could be discriminated from tumor-adjacent normal tissue using these 25 identified lncRNAs with an overall accuracy of 88.8 %

(Fig. 2). A total of nine of 80 samples were incorrectly classified based on their lncRNA expression profiles, with an error rate of 11.2 %. Table 5 Characteristics of the two independent colorectal cancer sample series Characteristic

Test seriesa (n=160)

Validation seriesb (n=541)

Age at diagnosis Gender Male Female TNM stage I II III IV Mean follow-up, years (minimum; maximum) Tumor location Distal Proximal MSI MSI-H MSI-L MSS Adjuvant chemotherapy Yes No

NA NA

67.0±13.2 300 241

90 70 4.58 (0.17; 10)

36 259 201 45 4.19 (0; 16.75)

NA 327 214 NA 21 16 100 NA 83

a

GEO accession numbers GSE24549 and GSE24550

b

GEO accession number GSE39582

232 309

Tumor Biol.

Identification of prognostic lncRNAs from 25 lncRNAs through the test series The correlation between expression levels of differentially expressed lncRNAs and CRC prognosis was analyzed in test series (GSE24549, GSE24550) (Table 5). The probe sets hitting GAPDH and 25 lncRNAs were shown in Table S1. The survival curves of 25 lncRNAs were drawn, and significant differences were identified for three lncRNAs (LINC01296, LINC00152, and FIRRE) (P

Long non-coding RNA LINC01296 is a potential prognostic biomarker in patients with colorectal cancer.

Colorectal cancer (CRC), one of the most malignant cancers, is currently the fourth leading cause of cancer deaths worldwide. Recent studies indicated...
667KB Sizes 0 Downloads 7 Views