REVIEW URRENT C OPINION

Statistical methods applied to omics data: predicting response to neoadjuvant therapy in breast cancer Nils Terne`s a,b, Monica Arnedos c, Serge Koscielny a, Stefan Michiels a,b, and Emilie Lanoy a

Purpose of review Omics technologies have become an essential part of clinical trials in oncology to provide a better understanding of molecular mechanisms and to unveil therapeutic targets. Standard statistical methods often fail in the high-dimensional setting. Therefore, an adequate modelling of the omics data is needed in order to identify ‘target’ genes of interest. Recent findings Several genes or gene signatures have been identified to predict the response to neoadjuvant therapies in breast cancer trials. We first reviewed statistical methods used to identify genes in 13 recent publications. Most of these studies had a small sample size (median: 42 patients) and were nonrandomized. We then focused on some popular methods – especially the so-called penalized methods used by three of the reviewed articles – and on the more recent methods proposed to predict causal estimates from observational data. We finally illustrated these methods in a nonrandomized neoadjuvant phase II trial of letrozole in estrogen receptor-positive breast cancer patients. Summary The review highlighted small sample sizes, few randomized trials and a large panel of statistical methods used in this setting. In our illustrated neoadjuvant example, causal inference methods did not outperform the penalized methods. Keywords breast cancer, clinical trial, gene expression, neoadjuvant

INTRODUCTION The application of omics technologies to biological samples collected in the context of clinical trials has become an integral part of clinical research in oncology. The objective is to provide a better understanding of molecular mechanisms of treatment effects based on mutation, genomic aberration or gene expression data. An appropriate data analysis is therefore essential. Compared to the usual clinical or biological data, omics data exhibit some particular statistical characteristics. Typically, the number of measured variables p (usually >10 000) is much larger than the sample size n (usually >n and high correlation between genes). The ridge and the lasso penalties are implemented in the R-package penalized [28]. More recently, the elasticnet combines the advantages of the ridge and lasso penalties [3]. This method overcomes the limitations of the lasso penalty regarding variable selection within a group of highly correlated variables. This method is implemented in the R-package elasticnet [29].

Causal methods In observational settings, an association cannot be easily interpreted as causation due to potential selection bias. In randomized clinical trials, the

treatment is randomly allocated to the patients and therefore independent of other covariates (measured and unmeasured). Groups of patients differ only on treatment allocation and so the estimated association between treatment and clinical outcome can be interpreted as causal effect. In the omics context, as no intervention experiment is possible on the human genome, only observational data are available. Hence, groups are not exchangeable (presence of confounders, bias) and the association between gene expression and outcome cannot be interpreted as a causal effect. Causal diagrams, called directed acyclic graphs (DAGs) can be used to infer causal effects from noninterventional data [30]. A DAG represents the ordered relationship between variables; an edge between two variables means that a given variable may or may not have a causal effect on another variable. No edge means no causal effect; information is in the lack of edge. Causal diagrams are exhaustive for all confounding factors allowing that bias selection can be controlled and causal effects estimated. However, in highdimensional settings, general processes between genes are unknown and so potential edges between genes can be not directed and DAG cannot be drawn. The intervention-calculus, when the DAG is absent (IDA) [5 ], is a statistical method that estimates lower bounds of causal effects from observational data, assuming the data come from an unknown DAG exhaustive from confounders which influence at least two measured factors. The IDA procedure comprises two steps. At the first step, the PC-algorithm [31,32] estimates the skeleton of the underlying DAG using the observed data, in which every edge is undirected. Then, some edges can be oriented (directed edges) or not (bidirected edges). Because of unidentifiable orientation, the PC-algorithm cannot estimate the true DAG alone, but provides a set of possible DAGs which describes the conditional independence information from the data (Fig. 1). At second step, causal effects for &&

G1

E

G1

G3 G2 CPDAG

E

G1

G3 G2 DAG 1

E

G1

G3 G2 DAG 2

E G3

G2 DAG 3

FIGURE 1. A completed partially DAG (CPDAG) and its equivalent class of DAGs representing relationships between an endpoint (E) and gene expression (G). A directed path represents an oriented relationship and a bidirected path represents a nonoriented relationship. DAG, directed acyclic graph.

1040-8746 ß 2014 Wolters Kluwer Health | Lippincott Williams & Wilkins

www.co-oncology.com

579

Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

Breast cancer

each gene (covariate) can be derived from each possible DAG (Fig. 1) using the do-calculus [30]. As the true DAG is unidentifiable, the true total causal effect is usually also unidentifiable. However, the true lower bound can be identified (the smallest effect of all DAGs). The IDA procedure is implemented in the R-package pcalg [33]. Recently, a resampling procedure, called Causal Stability Ranking (CStaR) [6 ], was proposed to ensure the robustness of the IDA results. This method is based on a repeated random subsampling scheme similar to that of Michiels et al. [34].

the signal-to-noise ratio, the average noise background was estimated using the R-package simpleaffy [40]. By combining these two filtering methods, 9615 probes were retained. The best Jetset overall scored probe was selected for a given gene. Finally, 8142 probes (genes) were retained. The expression data were scaled by subtraction of the mean and division by the standard deviation.

&

APPLICATION TO NEOADJUVANT TRIAL OF LETROZOLE The present application in ERþ early breast cancer patients treated with neoadjuvant aromatase inhibitor letrozole (FemaraTM, Novartis, Basel, Switzerland), aimed to identify early changes in gene expression that reflect the antiproliferative treatment. Penalized and causal methods are compared in this application.

Data source To illustrate these methods, publicly available data [35,36] were used. The aim of the phase II trial was to investigate genes whose expression differed between ERþ breast cancers that were clinically responsive or not to letrozole therapy. Tumour biopsies were collected initially (before treatment, D0), after 10–14 days of treatment (early changes, D14) and after 3 months (late changes, M3) (Fig. 2). For 56 patients, the expression of 22 283 probe sets (several probes per gene) on the Affymetrix HG-U133A array from tumour tissue at D0, D14 and M3 were collected and are publicly available in the inSilicoDb database [37].

Variables The absolute change at 3 months in the log-scale expression of the aurora kinase A (AURKA) gene involved in regulation, proliferation and cell survival [41], was chosen as antiproliferative response primary endpoint. Previous studies have shown that in the ERþ breast cancer population, changes in proliferation-based measures are associated with survival [42–46]. The change at 3 months in ESR1 expression [41] was chosen as secondary endpoint. The changes in expression between D0 and D14 of the 8142 candidate genes were evaluated as predictors of endpoints.

Main results During the first 3 months of letrozole therapy, the mean expression level of AURKA and ESR1 genes decreased. This decrease was more pronounced during the first 10–14 days (Fig. 3). We identified 26 ‘target’ genes associated with an antiproliferative response that were ranked in the top 10 of at least one of the five methods (Table 2). The results of penalized methods were highly similar, that is, seven predictors were ranked in the top 10 of each of the three methods. Five predictors singled out by CStaR were ranked in each top 10 of the penalized methods. However, the IDA procedure top 10 was very different from the other four methods top 10.

Preprocessing data

Start of letrozole therapy D0

D14

M3

Core biopsies Microarray Early change Late change

15

Gene expression (log base 2)

All microarray data were renormalized applying the frozen robust multiarray analysis normalization algorithm [38]. The signal-to-noise ratio and the Jetset quality score [39] were criteria for filtering leading to number of probes reduction. To compute

AURKA

14

ESR1

13 12 11 10 9 8 7 6 5 0

10

20

30

40

50

60

70

80

90

Number of days after treatment

FIGURE 2. Neoadjuvant phase II trial in hormone receptorpositive early breast cancer: time points of measurements and duration of letrozole exposure. 580

www.co-oncology.com

FIGURE 3. Change in mean expression (with 95% confidence interval) of AURKA and ESR1 over time. Volume 26  Number 6  November 2014

Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

Statistical methods applied to omics data Terne`s et al. Table 2. Top 10 predictors of the 3-month change in the expression of AURKA estimated by penalized and causal methods

Summary ranking Gene function according to GeneCardf

Penalized

Causal

Probeset

Early change in expression

zinc finger protein 106 homolog (mouse) mitochondrial ribosomal protein L4

217781_s_at

ZFP106b

2

3

2

218105_s_at

MRPL4

1

1

MORC family CW-type zinc finger 4

219038_at

MORC4b

3

aurora kinase A

208079_s_at

AURKAd

sphingomyelin phosphodiesterase 3, neutral membrane cyclin-dependent kinase inhibitor 2A

219695_at

ridge

lassoa

elasticnet

IDA

CStaR

Correlation with late change AURKA

2

þ0.58e

1

1

0.60e

2

3

3

þ0.55e

5

5

5

4

þ0.51e

SMPD3

4

4

4

9

0.55e

207039_at

CDKN2Ad

8

6

9

0.45e

zinc finger, DHHC-type containing 8 pseudogene 1 APEX nuclease (apurinic/ apyrimidinic endonuclease) 2 sterol regulatory element binding transcription factor 2

222274_at

ZDHHC8P1b

6

7

6

0.50e

204408_at

APEX2

7

201248_s_at

SREBF2b

follistatin-like 1

208782_at

FSTL1

1

þ0.43e

potassium voltage-gated channel, shakerrelated subfamily, member 6

205616_at

KCNA6

2

0.36e

transmembrane protein 59-like

219005_at

TMEM59L

3

0.40e

uncharacterized LOC100506963

213685_at

LOC100506963

ribosomal protein S6 kinase, 90kDa, polypeptide 5

204635_at

RPS6KA5d

5

0.16

phosphatidylinositol glycan anchor biosynthesis, class T

217770_at

PIGT

6

0.17

calpain 2, (m/II) large subunit

208683_at

CAPN2

DEK oncogene

200934_at

DEKc

small proline-rich protein 1B

205064_at

SPRR1B

desumoylating isopeptidase 2 v-mybmyeloblastosis viral oncogene homolog (avian)-like 1

212371_at

DESI2

213906_at

MYBL1c

sarcoglycan, epsilon

204688_at

SGCE

thrombospondin 1

201110_s_at

THBS1

4

6

10

0.49e

0.52e

10

5

0.50e

þ0.50e

7 7 7

þ0.48e 0.22

8

þ0.50e þ0.43e

8

8 8

þ0.40e þ0.48e (Continued )

1040-8746 ß 2014 Wolters Kluwer Health | Lippincott Williams & Wilkins

www.co-oncology.com

581

Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

Breast cancer Table 2 (Continued)

Summary ranking Gene function according to GeneCardf

Probeset

Early change in expression

protocadherin gamma subfamily B, 5

211807_x_at

PCDHGB5

von Willebrand factor

202112_at

VWF

membrane protein, palmitoylated 2 (MAGUK p55 subfamily member 2)

213270_at

MPP2

ribonucleotide reductase M2

201890 at

RRM2

Penalized ridge

lassoa

Causal

elasticnet

IDA

CStaR

Correlation with late change AURKA 0.47e

9 9

þ0.09 10

10

0.49e

þ0.14

CStaR, Causal Stability Ranking. a lasso selected only eight predictors. b transcription factor. c oncogene. d gene involved in the cell cycle control. e P-value < 0.01. f [50].

Interestingly, nine of the overall 26 identified predictor genes are involved in potential activation pathways of letrozole: transcription factors, oncogenes and genes involved in cell cycle control. Among the seven transcription factors and genes involved in cell cycle control, six were detected by several methods. Of note, letrozole-induced changes in gene expression related to cell cycle progression were also identified in the original analysis of this study [35,47]. However, oncogenes (DEK and MYBL1) were detected by only one method (respectively by lasso and CStaR). The proportion of ranked genes expected to be on the activation pathway of letrozole was higher in penalized methods (six among the top 10 of each method), as compared with CStaR (four among the top 10) and with IDA procedure (two among the top 10). An early change in AURKA expression satisfactorily predicted a late change in AURKA except using IDA in which the early change in AURKA was ranked 102nd. An early read-out of proliferation after short-term letrozole therapy has already shown to be associated with a clinical response to endocrine therapy [48]. For several selected genes such as CDKN2A, an early increase in gene expression was associated with a late antiproliferative response or a 3-month decrease in AURKA expression (correlation between an early change in CDKN2A and a late change in AURKA was equal to 0.45, Table 2). Cell cycle inhibition has also been described as potentially associated with a benefit from aromatase inhibition in breast cancer patients [49]. 582

www.co-oncology.com

In identifying effects of early changes in gene expression under letrozole on another endpoint, the late change in ESR1 expression, the analysis failed to provide stable parameter estimates. Early changes in ESR1 expression did not strongly predict late changes (SDC 1, http://links.lww.com/COON/A8). The results of penalized methods and CStaR did not differ strongly. Interestingly, both the ridge and the CStaR method identified an early change in CD8A as a predictor for a late change in ESR1. CD8A is found in most cytotoxic T-lymphocytes that mediate efficient cell-cell interaction with the immune system. A potential interaction between response to aromatase inhibitors and immune system induction has been previously described [49].

CONCLUSION The review of gene expression analysis in recent neoadjuvant phase II studies in breast cancer highlighted small sample sizes and few randomized trials. A large panel of statistical methods have been used to identify target genes that are associated with the treatment response, among which penalized methods are one of the most popular. In this particular neoadjuvant example, the penalized methods appeared to be more efficient than causal methods. Acknowledgements This research did not receive any specific grant from any funding agency in the public, commercial or not-forprofit sectors. Volume 26  Number 6  November 2014

Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

Statistical methods applied to omics data Terne`s et al.

Conflicts of interest The authors have no conflicts of interest to declare.

REFERENCES AND RECOMMENDED READING Papers of particular interest, published within the annual period of review, have been highlighted as: & of special interest && of outstanding interest 1. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970; 42:80–86. 2. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 1996; 58:267–288. 3. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B 2005; 67:301–320. 4. Bøvelstad HM, Nyga˚rd S, Størvold HL, et al. Predicting survival from microarray data: a comparative study. Bioinformatics 2007; 23:2080–2087. 5. Maathuis MH, Kalisch M, Bu¨hlmann P. Estimating high-dimensional interven&& tion effects from observational data. Ann Stat 2009; 37:3133–3164. This methodological article proposes the causal approach to analyse high-dimensional omics data. 6. Stekhoven DJ, Moraes I, Sveinbjo¨rnsson G, et al. Causal stability ranking. & Bioinformatics 2012; 28:2819–2823. This article presents a resampling approach to ensure the robustness of the causal procedure. 7. Sotiriou C, Pusztai L. Gene-expression signatures in breast cancer. N Engl J Med 2009; 360:790–800. 8. Ignatiadis M, Singhal SK, Desmedt C, et al. Gene modules and response to neoadjuvant chemotherapy in breast cancer subtypes: a pooled analysis. J Clin Oncol 2012; 30:1996–2004. 9. Desmedt C, Di Leo A, de Azambuja E, et al. Multifactorial approach to predicting resistance to anthracyclines. J Clin Oncol 2011; 29:1578–1586. 10. Gao Q, Patani N, Dunbier AK, et al. Effect of aromatase inhibition on functional gene modules in estrogen receptor-positive breast cancer and their relationship with antiproliferative response. Clin Cancer Res 2014; 20:2485–2494. 11. Horak CE, Pusztai L, Xing G, et al. Biomarker analysis of neoadjuvant && doxorubicin/cyclophosphamide followed by ixabepilone or paclitaxel in early-stage breast cancer. Clin Cancer Res 2013; 19:1587–1595. A large randomized phase II trial designed to investigate potential biomarkers that differentiate response to neoadjuvant doxorubicin/cyclophosphamide followed by ixabepilone or paclitaxel through the use of penalized regression models. Randomized trials are the most appropriate design to test for gene by treatment interactions in order to properly identity predictive genes. 12. Bonnefoi H, Potti A, Delorenzi M, et al. Validation of gene signatures that predict the response of breast cancer to neoadjuvant chemotherapy: a substudy of the EORTC 10994/BIG 00-01 clinical trial. Lancet Oncol 2007; 8:1071–1078. 13. Bonnefoi H, Piccart M, Bogaerts J, et al. TP53 status for prediction of sensitivity to taxane versus nontaxane neoadjuvant chemotherapy in breast cancer (EORTC 10994/BIG 1-00): a randomised phase 3 trial. Lancet Oncol 2011; 12:527–539. 14. Baselga J, Zambetti M, Llombart-Cussac A, et al. Phase II genomics study of ixabepilone as neoadjuvant treatment for breast cancer. J Clin Oncol 2009; 27:526–534. 15. Iwamoto T, Bianchini G, Booser D, et al. Gene pathways associated with prognosis and chemotherapy sensitivity in molecular subtypes of breast cancer. J Natl Cancer Inst 2011; 103:264–272. 16. Korde LA, Lusa L, McShane L, et al. Gene expression pathway analysis to predict response to neoadjuvant docetaxel and capecitabine for breast cancer. Breast Cancer Res Treat 2010; 119:685–699. 17. Tabchy A, Valero V, Vidaurre T, et al. Evaluation of a 30-gene paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide chemotherapy response predictor in a multicenter randomized trial in breast cancer. Clin Cancer Res 2010; 16:5351–5361. 18. Hatzis C, Pusztai L, Valero V, et al. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA 2011; 305:1873–1881. 19. Chen YZ, Xue JY, Chen CM, et al. PPAR signaling pathway may be an important predictor of breast cancer response to neoadjuvant chemotherapy. Cancer Chemother Pharmacol 2012; 70:637–644. 20. Gonzalez-Angulo AM, Iwamoto T, Liu S, et al. Gene expression, molecular class changes, and pathway analysis after neoadjuvant systemic therapy for breast cancer. Clin Cancer Res 2012; 18:1109–1119. 21. Shen K, Qi Y, Song N, et al. Cell line derived multigene predictor of pathologic response to neoadjuvant chemotherapy in breast cancer: a validation study on US Oncology 02-103 clinical trial. BMC Med Genomics 2012; 5:51.

22. Dunbier AK, Ghazoui Z, Anderson H, et al. Molecular profiling of aromatase inhibitor-treated postmenopausal breast tumors identifies immune-related correlates of resistance. Clin Cancer Res 2013; 19:2775–2786. 23. Mulligan JM, Hill LA, Deharo S, et al. Identification and validation of an anthracycline/cyclophosphamide-based chemotherapy response assay in breast cancer. J Natl Cancer Inst 2014; 106:djt335. 24. Singer CF, Klinglmu¨ller F, Stratmann R, et al. Response prediction to neoadjuvant chemotherapy: comparison between pretherapeutic gene expression profiles and in vitro chemosensitivity assay. PLoS One 2013; 8:e66573. 25. Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted && using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol 2014; 15:R47. In this article, the authors compared several available machine learning algorithms to analyse high-dimensional data. The penalized ridge regression was consistently the best performer. 26. Knudsen S, Jensen T, Hansen A, et al. Development and validation of a gene expression score that predicts response to fulvestrant in breast cancer patients. PLoS One 2014; 9:e87415. 27. El-Dereny M, Rashwan N. Solving multicollinearity problem using ridge regression models. Int J Contemp Math Sci 2011; 6:585–600. 28. Goeman J. Package ‘‘penalized’’. 2012. Available from: www.cran.r-project. org/web/packages/penalized/penalized.pdf. 29. Zou H, Hastie T. Elastic-Net for Sparse Estimation and Sparse PCA. 2012. Available from: www.cran.r-project.org/web/packages/elasticnet/elasticnet. pdf. 30. Pearl J. Causality: models, reasoning and inference. New York, United States: Cambridge University Press; 2000. 31. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. 2nd ed London, United Kingdom: The MIT Press; 2000. 32. Kalisch M, Bu¨hlmann P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res 2007; 8:613–636. 33. Kalisch M, Ma¨chler M, Colombo D, et al. Causal inference using graphical models with the R package pcalg. J Stat Softw 2012; 47:1–26. 34. Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005; 365:488–492. 35. Miller WR, Larionov A, Renshaw L, et al. Gene expression profiles differentiating between breast cancers clinically responsive or resistant to letrozole. J Clin Oncol 2009; 27:1382–1387. 36. Miller WR, Larionov A, Anderson TJ, et al. Sequential changes in gene expression profiles in breast cancers during treatment with the aromatase inhibitor, letrozole. Pharmacogenomics J 2012; 12:10–21. 37. Taminau J, Steenhoff D, Coletta A, et al. inSilicoDb: an R/Bioconductor package for accessing human Affymetrix expert-curated datasets from GEO. Bioinformatics 2011; 27:3204–3205. 38. McCall MN, Bolstad BM, Irizarry RA. Frozen robust multiarray analysis (fRMA). Biostatistics 2010; 11:242–253. 39. Li Q, Birkbak NJ, Gyorffy B, et al. Jetset: selecting the optimal microarray probe set to represent a gene. BMC Bioinformatics 2011; 12:474. 40. Wilson CL, Miller CJ. Simpleaffy: a BioConductor package for Affymetrix quality control and data analysis. Bioinformatics 2005; 21:3683–3685. 41. Desmedt C, Haibe-Kains B, Wirapati P, et al. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin Cancer Res 2008; 14:5158–5165. 42. Dowsett M, Ebbs SR, Dixon JM, et al. Biomarker changes during neoadjuvant anastrozole, tamoxifen, or the combination: influence of hormonal status and HER-2 in breast cancer: a study from the IMPACT trialists. J Clin Oncol 2005; 23:2477–2492. 43. Mackay A, Urruticoechea A, Dixon JM, et al. Molecular response to aromatase inhibitor treatment in primary breast cancer. Breast Cancer Res 2007; 9:R37. 44. Ellis MJ, Coop A, Singh B, et al. Letrozole inhibits tumor proliferation more effectively than tamoxifen independent of HER1/2 expression status. Cancer Res 2003; 63:6523–6531. 45. Ellis MJ, Ma C. Letrozole in the neoadjuvant setting: the P024 trial. Breast Cancer Res Treat 2007; 105 (Suppl 1):33–43. 46. Haibe-Kains B, Desmedt C, Loi S, et al. A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 2012; 104:311–325. 47. Miller WR, Larionov AA, Renshaw L, et al. Changes in breast cancer transcriptional profiles after treatment with the aromatase inhibitor, letrozole. Pharmacogenet Genomics 2007; 17:813–826. 48. Bedard PL, Singhal SK, Ignatiadis M, et al. Low residual proliferation after short-term letrozole therapy is an early predictive marker of response in high proliferative ER-positive breast cancer. Endocr Relat Cancer 2011; 18:721– 730. 49. Mello-Grand M, Singh V, Ghimenti C, et al. Gene expression profiling and prediction of response to hormonal neoadjuvant treatment with anastrozole in surgically resectable breast cancer. Breast Cancer Res Treat 2010; 121:399–411. 50. Rebhan M, Chalifac´caspi V, Prilusky J, Lancet D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998; 14:656–664.

1040-8746 ß 2014 Wolters Kluwer Health | Lippincott Williams & Wilkins

www.co-oncology.com

583

Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

Statistical methods applied to omics data: predicting response to neoadjuvant therapy in breast cancer.

Omics technologies have become an essential part of clinical trials in oncology to provide a better understanding of molecular mechanisms and to unvei...
322KB Sizes 3 Downloads 5 Views