Briefings in Bioinformatics Advance Access published September 16, 2014

BRIEFINGS IN BIOINFORMATICS

doi:10.1093/bib/bbu034

Predictive modelling of gene expression from transcriptional regulatory elements

David M. Budden, Daniel G. Hurley and Edmund J. Crampin

Submitted: 2nd July 2014; Received (in revised form): 20th August 2014

Abstract

Keywords: gene expression; transcriptional regulation; transcription factors; histone modifications; predictive modelling

INTRODUCTION

Understanding the regulatory logic underpinning the precise spatiotemporal control of eukaryotic gene expression is a central problem in molecular biology. Perturbation of regulatory elements affects critical cellular functions, including homeostasis, differentiation and apoptosis. The ability to understand and predict these effects is critical to developing treatments for complex diseases, particularly the hundreds of developmental, autoimmune, neurological, inflammatory and neoplastic disorders directly associated with abnormal patterns of regulatory elements [1, 2]. Significant regulation of eukaryotic gene expression is known to occur at the level of transcriptional initiation and control [3]. It is therefore unsurprising that many recent studies have focused on modelling the effects and interactions of transcription factors

Corresponding author: Edmund J. Crampin. Tel: +61 3 8344 6699; Fax: +61 3 8344 7412; E-mail: [email protected]

David M. Budden works in the Systems Biology Laboratory at the Melbourne School of Engineering. David’s research involves modelling the regulation of gene expression using machine learning and information-theoretic approaches, with particular focus on the interactions and feedback mechanisms between transcription factors, histone modifications and microRNAs and their dysregulation in cancer.

Daniel G. Hurley works in the Systems Biology Laboratory at the Melbourne School of Engineering. Daniel has a background in commercial IT, and his research applies network models in systems biology to solve problems in human health and disease.

Edmund J. Crampin is the Rowden White Chair of Systems and Computational Biology at the University of Melbourne, Director of the Systems Biology Laboratory at the Melbourne School of Engineering, and Professor in the Department of Mathematics and Statistics, and the Melbourne Medical School. Edmund leads an interdisciplinary team of researchers developing mathematical and computational approaches to investigate molecular networks underlying complex human diseases, including heart disease and cancer.

© The Author 2014. Published by Oxford University Press. For Permissions, please email: [email protected]

Downloaded from http://bib.oxfordjournals.org/ at University of Newcastle on September 29, 2014

Predictive modelling of gene expression provides a powerful framework for exploring the regulatory logic underpinning transcriptional regulation. Recent studies have demonstrated the utility of such models in identifying dysregulation of gene and miRNA expression associated with abnormal patterns of transcription factor (TF) binding or nucleosomal histone modifications (HMs). Despite the growing popularity of such approaches, a comparative review of the various modelling algorithms and feature extraction methods is lacking. We define and compare three methods of quantifying pairwise gene-TF/HM interactions and discuss their suitability for integrating the heterogeneous chromatin immunoprecipitation (ChIP)-seq binding patterns exhibited by TFs and HMs. We then construct log-linear and ε-support vector regression models from various mouse embryonic stem cell (mESC) and human lymphoblastoid (GM12878) data sets, considering both ChIP-seq- and position weight matrix (PWM)-derived in silico TF-binding. The two algorithms are evaluated both in terms of their modelling prediction accuracy and ability to identify the established regulatory roles of individual TFs and HMs. Our results demonstrate that TF-binding and HMs are highly predictive of gene expression as measured by mRNA transcript abundance, irrespective of algorithm or cell type selection and considering both ChIP-seq and PWM-derived TF-binding. As we encourage other researchers to explore and develop these results, our framework is implemented using open-source software and made available as a preconfigured bootable virtual environment.


To understand the effects and interactions of TFs and HMs, modelling frameworks have recently been developed to infer novel biology from the integration of large heterogeneous data sets (i.e. genome-wide ChIP- and RNA-seq; investigation into the effective integration of DNA methylation and miRNA-catalyzed mRNA degradation data is ongoing) for various eukaryotic systems [4–8]. As an example, such models have demonstrated that ChIP-seq binding data for a small panel of 12 TFs (i.e.

where $\eta_i = e^{\mu + \varepsilon_i}$. Although an improvement over the linear regression model, log-linear regression does not allow synergistic or logical (Boolean) regulatory interactions to be captured.

resulting in the following closed-form predictions of mRNA transcript abundance:

$$\hat{Y} = h(X) = X\hat{\beta} = X\left(X^{\top}X\right)^{-1}X^{\top}Y,$$
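The closed-form least-squares prediction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy matrix `X` (with an explicit column of ones for the intercept) and the response `y` are made up, and the helper names `solve`, `ols_fit` and `predict` are hypothetical rather than part of the authors' R-based framework.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for a square matrix a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols_fit(X, y):
    """Return beta_hat = (X^T X)^{-1} X^T y by solving the normal equations."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * t for x, t in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


def predict(X, beta):
    """Return Y_hat = X @ beta_hat."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Hypothetical toy data: 4 genes, an intercept column and two feature scores.
# y was generated as 1 + 2*x1 + 2*x2, so the fit should recover [1, 2, 2].
X = [[1, 0.0, 1.0], [1, 1.0, 0.0], [1, 2.0, 1.0], [1, 3.0, 3.0]]
y = [3.0, 3.0, 7.0, 13.0]
beta_hat = ols_fit(X, y)
y_hat = predict(X, beta_hat)
```

Solving the normal equations directly, rather than forming the inverse, is a common design choice for numerical stability; for real genome-wide matrices a tested linear-algebra library would be the safer option.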

ε-SUPPORT VECTOR REGRESSION

where $\gamma_k$ is the fitted coefficient for HM or DNase $k$; the sum over $j$ is the TF-binding term and the sum over $k$ is the HM+DNase term. This formulation captures only the additive regulatory interactions of the modelled TFs and HMs.

Log-linear regression

Ouyang et al. proposed the following log-linear model for describing mRNA transcript abundance of a gene $i$, $y_i$, as a function of a TFAS matrix $A$ [7]:

$$\log(y_i + \delta) = \mu + \sum_j \beta_j a_{ij} + \varepsilon_i, \tag{6}$$

where $\delta$ is fitted from held-aside data to prevent attempted evaluation of $\log(0)$ and the coefficients $\beta_j$ are estimated as per (4). McLeay et al. extended this model to include HM and DNase data as per (5) [6]:

$$\log(y_i + \delta) = \mu + \underbrace{\sum_j \beta_j a_{ij}}_{\text{TF binding}} + \underbrace{\sum_k \gamma_k b_{ik}}_{\text{HM+DNase}} + \varepsilon_i.$$

By rewriting the RHS as a product of exponentials, it can be seen that the multiplicative interactions of all TFs, HMs and DNase are captured to predict mRNA transcript abundance:

$$\begin{aligned}
y_i + \delta &= \exp\Big(\mu + \sum_j \beta_j a_{ij} + \sum_k \gamma_k b_{ik} + \varepsilon_i\Big) \\
&= \eta_i \prod_j \exp\left(\beta_j a_{ij}\right) \prod_k \exp\left(\gamma_k b_{ik}\right),
\end{aligned}$$
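As a small numerical check of the product-of-exponentials identity, the following Python sketch evaluates the log-linear prediction both ways. All values are made up for illustration, the names `mu`, `beta`, `gamma`, `a_i` and `b_i` are assumptions chosen to mirror the intercept, the TF and HM+DNase coefficients and the corresponding scores for one gene, and the error term is omitted.

```python
import math


def exp_of_sum(mu, beta, a_i, gamma, b_i):
    """Exponentiate the additive predictor: the model's value for y_i + delta."""
    linear = mu
    linear += sum(bj * aj for bj, aj in zip(beta, a_i))
    linear += sum(gk * bk for gk, bk in zip(gamma, b_i))
    return math.exp(linear)


def product_of_exps(mu, beta, a_i, gamma, b_i):
    """The same quantity written as a product of per-feature factors."""
    tf_factor = math.prod(math.exp(bj * aj) for bj, aj in zip(beta, a_i))
    hm_factor = math.prod(math.exp(gk * bk) for gk, bk in zip(gamma, b_i))
    return math.exp(mu) * tf_factor * hm_factor


# Hypothetical coefficients and scores for a single gene.
mu, beta, gamma = 0.5, [0.8, -0.3], [1.2]
a_i, b_i = [1.0, 2.0], [0.5]

additive_form = exp_of_sum(mu, beta, a_i, gamma, b_i)
multiplicative_form = product_of_exps(mu, beta, a_i, gamma, b_i)

# Raising the first TF score by one unit scales the prediction by exp(beta[0]),
# whatever the other scores are: each feature acts as a multiplier.
shifted = exp_of_sum(mu, beta, [a_i[0] + 1.0, a_i[1]], gamma, b_i)
fold_change = shifted / additive_form
```

Under these assumptions each feature contributes a multiplicative factor on the original expression scale, which is the sense in which the log-linear model captures multiplicative interactions.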

where the kernel function, $K(X_i, X) = \Phi(X_i) \cdot \Phi(X)$, implicitly maps $X \mapsto \Phi(X)$ to allow a nonlinear relationship, $Y = f(X)$, to be projected from a linear relationship in the higher-dimensional mapped space, $\Phi$ (a process known as the ‘kernel trick’). The optimal Lagrange multipliers, $\alpha$, can be determined by solving the following constrained quadratic optimization problem:

$$\min_{\alpha,\,\alpha^*} \; \frac{1}{2}(\alpha - \alpha^*)^{\top} Q\,(\alpha - \alpha^*) + \varepsilon \sum_i (\alpha_i + \alpha_i^*) + \sum_i y_i(\alpha_i^* - \alpha_i)$$

subject to

$$\mathbf{1}^{\top}(\alpha - \alpha^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C, \qquad i = 1, \ldots, n,$$

where $C > 0$ is the regularization parameter, $\varepsilon > 0$ controls the width of the ε-insensitive loss region [48] and $Q_{ij} = K(X_i, X_j)$. The radial basis function kernel, $K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right)$, has previously been applied to ε-SVR modelling of mRNA and miRNA transcript abundance [4, 5].
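The two ingredients named above, the radial basis function kernel matrix and the ε-insensitive loss, can be written out directly. The following Python sketch is illustrative only: the three feature vectors and the values of `gamma` and `epsilon` are made up, and the function names are hypothetical; it does not solve the quadratic programme itself.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(X, gamma):
    """Q[i][j] = K(X_i, X_j) for every pair of rows (genes) in X."""
    return [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero for residuals inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical feature vectors for three genes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Q = kernel_matrix(X, gamma=0.5)

inside_tube = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
outside_tube = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

The kernel matrix is symmetric with ones on the diagonal, and residuals smaller than `epsilon` incur no penalty, which is what lets the fitted model ignore small deviations.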

COMPARATIVE PERFORMANCE EVALUATION

In the following sections, we evaluate the performance of log-linear (6) and SV (7) regression in constructing predictive models of gene expression from various input data.

Experimental data

We consider predictive models of gene expression as measured by mRNA transcript abundance (RPKM- or FPKM-normalized RNA-seq for a


where $\hat{y}_i = y_i - \varepsilon_i$. The usage of $Y$ (i.e. RNA-seq expression) in training the model is a defining feature of supervised learning. If $X$ is the column-wise concatenation of the TFAS matrix, $A$, and HM+DNase score matrix, $B$, it is intuitive to express $y_i$ in its expanded form:

$$y_i = \mu + \underbrace{\sum_j \beta_j a_{ij}}_{\text{TF binding}} + \underbrace{\sum_k \gamma_k b_{ik}}_{\text{HM+DNase}} + \varepsilon_i, \tag{5}$$

The ε-support vector regression (ε-SVR) model describing mRNA transcript abundance of a gene $i$, $y_i$, as a non-linear function of a TFAS or HM+DNase score matrix, $X$, can be formulated as [47]:

$$y_i = \mu + \sum_i (\alpha_i - \alpha_i^*) K(X_i, X) + \varepsilon_i, \tag{7}$$


Table 1: All H. sapiens (GM12878 lymphoblastoid cell line) data used for model construction

| Data type | Data source | Notes |
| --- | --- | --- |
| RNA-seq | ENCODE [50] | 49 488 genes; FPKM normalized [35] |
| TSS | Ensembl hg19/GRCh37 [51] | Consider only the most 5′-located TSS for each gene [6, 7] |
| TF ChIP-seq | ENCODE [50] | c-Fos, Ctcf, Egr1, Nrf1, Nrsf, Pou2f2, Sp1, Srf, Stat3, Usf1 and Yy1 |
| HM ChIP-seq | ENCODE [50] | H3K4me1, H3K4me2, H3K4me3, H4K20me1, H3K27me3 and H3K36me3 |
| DNase-seq | ENCODE [50] | DNase I hypersensitivity |
| PWMs | ENCODE [50] | c-Fos, Ctcf, Egr1, Nrf1, Nrsf, Pou2f2, Sp1, Srf, Stat3, Usf1 and Yy1 |

Genes corresponding with haplotype variants, unmapped contig regions and low confidence RNA-seq mappings were removed, resulting in a set of 38 041 genes for analysis.

| Data type | Data source | Notes |
| --- | --- | --- |
| RNA-seq | http://grimmond.imb.uq.edu.au/mESEB.html [52] | 18 936 genes; RPKM normalized [34] |
| TSS | Ensembl mm8/NCBIM36.46 [51] | Consider only the most 5′-located TSS for each gene [6, 7] |
| TF ChIP-seq | http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11431 [11] | E2f1, Esrrb, Klf4, c-Myc, n-Myc, Nanog, Oct4, Smad1, Sox2, Stat3, Tcfcp2l1 and Zfx |
| HM ChIP-seq | ftp://ftp.broad.mit.edu/pub/papers/chipseq/ [38, 32] | H3K4me1, H3K4me2, H3K4me3, H3K9me3, H4K20me3, H3K27me3 and H3K36me3 |
| DNase-seq | http://research.imb.uq.edu.au/t.bailey/supplementary_data/McLeay2011a/ [6] | DNase I hypersensitivity |
| PWMs | http://www.sciencedirect.com/science/article/pii/S009286740800617X [11] | Esrrb, Klf4, c-Myc, n-Myc, Oct4, Sox2 and Stat3 |
| PWMs | UniProbe [53]; JASPAR [54]; TRANSFAC [55] | E2f1; Nanog; Smad1 |

Genes corresponding with haplotype variants, unmapped contig regions and low confidence RNA-seq mappings were removed, resulting in a set of 17 517 genes for analysis. Preprocessed RNA-seq and ChIP-seq data and mapped DNase I hypersensitivity in mES cells are available at http://research.imb.uq.edu.au/t.bailey/supplementary_data/McLeay2011a/ [6]

single sample [34, 35]). To capture the regulatory interactions responsible for the measured expression patterns, log-linear (6) and SV (7) regression models were constructed from the following H. sapiens and M. musculus data: ChIP-seq, in silico (PWM-predicted) and tissue-specific in silico (PWM+H3K4me3) TF binding; ChIP-seq HM; and DNase-seq DNase-I hypersensitivity. TF-gene association strengths (TFAS) and HM+DNase scores were calculated using the exponentially decaying affinity (1) and constrained sum of tags (2) methods, respectively. All H. sapiens (GM12878 lymphoblastoid cell line) and M. musculus (embryonic stem cell) data analysed are detailed in Tables 1 and 2. In contrast, a list of known regulatory elements and associated contexts not captured by this predictive modelling framework is provided in Table 3. The linear regression and ε-SVR with radial basis kernel function R implementations (from the ‘stats’ and ‘e1071’ packages, respectively) were used with default parameters. In silico (PWM-predicted) TF-binding data were generated using FIMO (Find Individual Motif Occurrences [49]) with default parameters. For tissue-specific in silico (PWM+H3K4me3) TF-binding, an H3K4me3 epigenetic prior was generated using the tools published by Cuellar Partida et al. [28] with parameters $\alpha = 4 \times 10^{5}$ and $\beta = 1$. All experiments were completed on a 2.0 GHz 24-core PC with 128 GB RAM. As we encourage other researchers to explore and develop these results, our framework is implemented using open-source software and made available as a preconfigured bootable virtual environment. This environment was created using a minimal installation


Table 2: All M. musculus (embryonic stem cell) data used for model construction


of Lubuntu 13.10 (http://lubuntu.net/), a lightweight Linux distribution that supports all the tools required. R version 3.0.1 was installed, along with the core set of packages and utilities required to explore the presented results. Full configuration details for the reference environment are available in the Supplementary Materials. Alternatively, all data and scripts are available online at http://sourceforge.net/projects/budden2014predictive/.

Table 3: Despite the assortment of known regulatory elements and contexts not captured by this framework, models constructed from a small panel of TFs and HMs were able to predict genome-wide mRNA transcript abundance with striking accuracy

| Regulatory element | Used | Notes |
| --- | --- | --- |
| TF binding | Yes | |
| Histone modifications | Yes | |
| DNase-I hypersensitivity | Yes | A proxy for open euchromatin |
| Higher-order chromatin structure | No | Not yet available at sufficient resolution |
| DNA methylation | No | Focus of ongoing investigation |
| miRNAs and other ncRNAs | No | Focus of ongoing investigation |
| Regulatory context | Used | Notes |
| Promoter-localized elements | Yes | |
| Enhancer-localized elements | No | Focus of ongoing investigation |
| Multiple promoters | No | Currently considers only the 5′-most TSS |
| Spatial binding patterns | No | e.g. 5′ H3K4me versus 3′ H3K36me in active genes |

This apparent paradox between the biological complexity of transcriptional regulation and the significant prediction accuracy of comparatively simple models presumably reflects uncharacterized redundancy in regulatory mechanisms. The nature of this redundancy is the focus of ongoing investigation by various groups.

Evaluation of prediction accuracy

The prediction accuracy of each model was evaluated as an ‘adjusted R² score’, which captures the proportion of variation in measured mRNA transcript abundance (for a single sample) explained by the model. Unlike the R² score (i.e. the ‘coefficient of determination’, equivalent to the square of the Pearson correlation coefficient and previously used to evaluate models of mRNA transcript abundance [4, 5]), the adjusted R² prevents spurious inflation due to the introduction of additional explanatory variables [56]. This is particularly important when comparing the prediction accuracy of models constructed from different numbers of regulatory elements. More specifically, the adjusted R² score presented for each model results from a 10-fold cross-validation process. This involves partitioning the data into 10 equal-sized subsets, of which 9 were used for training the model and the remaining subset reserved for evaluation. The process was repeated 10 times, so that each subset was used for evaluation exactly once, with the final adjusted R² score calculated as the mean of the corresponding values for each fold. Cross-validation is important for preventing overfitting of the model to noise in the experimental data at the expense of effective generalization to new inputs. This is particularly important for nonlinear models (e.g. SVR) that are better able to capture complex (and potentially artefactual) relationships.

Principal component analysis

In a recent review, Gunawardena described the difference between ‘forward’ and ‘reverse’ mathematical modelling in a biological context: forward modelling begins with known causalities expressed as a model, which is then used to make predictions; whereas reverse modelling attempts to infer causalities from correlations captured within a mathematical model, often without explicit integration of physical constraints [57]. As the predictive modelling framework we describe falls into the latter category, it is particularly important that the mechanisms of transcriptional regulation described by the model are consistent with established molecular biology. Principal component analysis (PCA) is a statistical framework that has previously been applied to capture characteristic modes of gene expression [58, 59] and clustering properties of TFs [60]. More recently, PCA has been used to identify the individual roles of TFs and HMs in predictive models of gene


expression [6, 7]. Specifically, the log-transformed and quantile-normalized TFAS or HM+DNase score matrix, $X$, is reformulated using the following singular value decomposition (SVD) [61]:

$$X = U \Sigma V^{\top}, \tag{8}$$

RESULTS

Although previous studies have applied log-linear and SV regression to construct predictive models of gene expression, there is no direct evaluation of which approach is the most effective. We evaluate the performance of models constructed from integrated transcript abundance and epigenetic-association data for GM12878 and mES cells (described in Tables 1 and 2), both as an adjusted R² score and by considering the accuracy of the regulatory interactions captured by the model.

Table 4: Prediction accuracy (10-fold cross-validation adjusted R², reported as the mean and standard deviation of the 10 folds) for log-linear (6) and SV (7) regression models of mRNA transcript abundance constructed from (a) mESC and (b) GM12878 data

| Cell type | Model | PWM | PWM+H3K4me3 | ChIP-seq |
| --- | --- | --- | --- | --- |
| (a) mESC | Log-linear regression | 0.25 (0.02) | 0.50 (0.03) | 0.58 (0.01) |
| (a) mESC | Support vector regression | 0.30 (0.02) | 0.51 (0.03) | 0.64 (0.02) |
| (b) GM12878 | Log-linear regression | 0.10 (0.01) | 0.28 (0.01) | 0.33 (0.01) |
| (b) GM12878 | Support vector regression | 0.11 (0.01) | 0.38 (0.02) | 0.39 (0.01) |

Three sets of TF-binding input data were considered: in silico (PWM-predicted), tissue-specific in silico (PWM+H3K4me3) and ChIP-seq. The relationships between actual (RNA-seq) and predicted mRNA transcript abundance are illustrated in Supplementary Figures 1 and 2 for (a) and (b), respectively.

Prediction accuracy

Previous studies have demonstrated that HM data can be used as a statistical prior, significantly improving PWM predictions of in vivo TF binding [28, 29]. To determine whether accurate predictive models of gene expression can be constructed from such data sets, log-linear and SV regression models were constructed from in silico (PWM), tissue-specific in silico (PWM+H3K4me3) and

ChIP-seq TF-binding data. The prediction accuracy (10-fold cross-validated adjusted R²) of these models is presented in Table 4. It is evident that in silico TF-binding data incorporating an H3K4me3 prior provides significantly more information regarding mRNA transcript abundance than data generated from PWMs alone, irrespective of the cell type or modelling algorithm chosen. Models constructed from ChIP-seq TF-binding perform marginally better than those constructed from the tissue-specific in silico equivalent. In addition to TF-binding data, HMs (responsible for the direct or indirect perturbation of chromatin structure) and DNase-I hypersensitivity (a proxy for nucleosome-depleted open chromatin) are known to be accurate predictors of gene expression [5, 6, 8]. Table 5 presents the prediction accuracy of log-linear and SV regression models constructed from TF-binding, HM+DNase and TF+HM+DNase data. It is evident that HM and DNase data provide more information regarding mRNA transcript abundance than TF-binding (particularly for GM12878 cells), irrespective of the modelling algorithm chosen and despite consisting of fewer regulatory elements. Furthermore, the incorporation of TF-binding, HM and DNase data into a single model yields surprisingly little improvement in prediction accuracy. The implied redundancy in transcriptional regulation is the focus of ongoing investigation by various groups. Considering both Tables 4 and 5, SVR yielded more accurate predictive models than log-linear


where $U$ is the matrix of component scores (transformed TFAS or HM+DNase scores), $\Sigma$ is the diagonal matrix of the singular values of $X$ and $V$ is the matrix of loadings (weights by which the TFAS or HM+DNase scores are multiplied to derive their respective component scores). In the context of modelling gene expression, the columns of the matrix $P = U\Sigma$ are the principal components (PCs), and the rows correspond with ‘eigengenes’ [58]. By individually substituting the TFAS or HM+DNase score matrix of a log-linear (6) or SV (7) regression model with each PC (i.e. a dimensionality reduction), the model yielding the highest adjusted R² prediction accuracy (and thus the most informative PC) can be identified. To demonstrate that the model is biologically accurate, the corresponding loading for each TF, HM or DNase should reflect its established regulatory role; i.e. known hallmarks of gene activation (e.g. H3K4me3 and DNase I hypersensitivity) and repression (e.g. H3K9me3 and H3K27me3) should correspond with positive and negative loadings, respectively.
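The description of the loadings as weights that turn feature scores into component scores follows from the decomposition in one step. A short derivation, assuming the usual thin SVD in which the columns of $V$ are orthonormal (an assumption not stated explicitly in the text):

```latex
X = U \Sigma V^{\top}, \qquad V^{\top} V = I
\;\Longrightarrow\;
X V = U \Sigma V^{\top} V = U \Sigma = P,
\qquad
p_{ij} = \sum_{m} x_{im}\, v_{mj}.
```

Under this assumption, the value of the $j$-th PC for gene $i$ is the sum of that gene's feature scores $x_{im}$ weighted by the loadings $v_{mj}$, so the sign of each loading indicates whether the feature raises or lowers the component score.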

regression in all 12 cases. However, the improvements were often small and came at the expense of a two order-of-magnitude increase in required CPU-time, suggesting that log-linear regression may be the preferred approach for large data sets.
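The cross-validated adjusted R² scores behind these comparisons can be sketched as follows. This is a minimal Python illustration, not the authors' R scripts: the one-feature least-squares model, the toy data and all function names are hypothetical, folds are taken as contiguous blocks (real data would normally be shuffled first), and the adjusted R² is assumed to be computed on each held-out fold using the standard formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R^2 for the number of explanatory variables."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


def k_fold_indices(n_obs, k):
    """Split range(n_obs) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_obs, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_adjusted_r2(X, y, fit, predict, k=10):
    """Mean adjusted R^2 over k folds, each fold held out exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        r2 = r_squared([y[i] for i in test_idx], preds)
        scores.append(adjusted_r_squared(r2, len(test_idx), len(X[0])))
    return sum(scores) / len(scores)


def fit_line(X, y):
    """Hypothetical stand-in model: least squares on a single feature."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_line(model, X):
    slope, intercept = model
    return [slope * row[0] + intercept for row in X]


# Toy data lying exactly on y = 2x + 1, so every fold is predicted perfectly.
X = [[float(i)] for i in range(30)]
y = [2.0 * row[0] + 1.0 for row in X]
cv_score = cross_validated_adjusted_r2(X, y, fit_line, predict_line, k=10)
```

Passing `fit` and `predict` as arguments keeps the evaluation loop independent of the model, so the same loop could in principle score a log-linear or an SVR model.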

Table 5: Prediction accuracy (10-fold cross-validation adjusted R², reported as the mean and standard deviation of the 10 folds) for log-linear (6) and SV (7) regression models of mRNA transcript abundance constructed from (a) mESC and (b) GM12878 data

| Cell type | Model | TF | HM+DNase | TF+HM+DNase |
| --- | --- | --- | --- | --- |
| (a) mESC | Log-linear regression | 0.58 (0.01) | 0.62 (0.01) | 0.68 (0.01) |
| (a) mESC | Support vector regression | 0.64 (0.02) | 0.67 (0.01) | 0.70 (0.01) |
| (b) GM12878 | Log-linear regression | 0.33 (0.01) | 0.42 (0.01) | 0.43 (0.01) |
| (b) GM12878 | Support vector regression | 0.39 (0.01) | 0.45 (0.01) | 0.46 (0.01) |

Three sets of ChIP-seq input data were considered: TF-binding (TF), HM and DNase-I hypersensitivity (HM+DNase) and the concatenation of both (TF+HM+DNase). The relationships between actual (RNA-seq) and predicted mRNA transcript abundance are illustrated in Supplementary Figures 3 and 4 for (a) and (b), respectively.

One way to demonstrate that a predictive model of gene expression is biologically accurate is to show

[Figure 1 plot: y-axis, weighted loading; x-axis, features H3K9me3, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H4K20me3, DNase, Smad1, Stat3, Sox2, Oct4, Nanog, Esrrb, Tcfcp2l1, Klf4, c-Myc, Zfx, n-Myc and E2f1]

Figure 1: The weighted loading of each feature for the most informative PC resulting from an SVD PCA of ChIP-seq TFAS+HM+DNase data (8). As each PC is simply a column of the matrix $P = U\Sigma$, the selection of modelling algorithm affects only the identification of the most informative PC (i.e. that corresponding with the most accurate predictive model) and the constant weight (i.e. the adjusted R² score for that model). As both log-linear and SV regression identified the second PC as most informative (adjusted R² ≈ 0.48 in both cases), the corresponding plots of weighted loadings for both algorithms are indistinguishable and thus only one (log-linear) is included.


that the weighted loading of each feature (for the most informative PC) corresponds with the experimentally verified regulatory roles for that TF, HM or DNase. Figure 1 presents the weighted loading of each feature resulting from an SVD PCA of TFAS+HM+DNase data (8). Figure 1 suggests that H4K20me3, H3K27me3 and H3K9me3 repress gene expression when localized to the promoter (i.e. negative loading). These elements are known hallmarks of gene silencing, by association with constitutive heterochromatin (H4K20me3 and H3K9me3) and facultative heterochromatin (H3K27me3), mediated by HP1 and PRC1/2 activity, respectively [17]. The activating effects of H3K4me2 and H3K4me3 are well established in molecular biology, and DNase-I hypersensitivity corresponds with areas of transcriptionally active open euchromatin [18, 62]. H3K36me3 (exhibiting the smallest positive loading) is known to localize towards the 3′ end of actively transcribed genes [63], explaining why its regulatory signal is only weakly captured in a promoter-centric analysis. Of the 12 TFs considered, the 4 with the largest positive loadings (E2f1, n-Myc, c-Myc and Zfx) are known to preferentially localize to promoter regions (e.g. E2f1 has been shown to bind to >60% of mouse embryonic stem cell (mESC) promoters [6]), whereas those with the smallest loadings

Experimental verification


DISCUSSION

We have provided the first review and direct quantitative evaluation of several techniques from the emerging field of predictive gene expression modelling. We defined and compared three methods of quantifying pairwise gene-TF/HM interactions, finding that the ‘constrained sum of tags’ and ‘exponentially decaying affinity’ were better suited to the ChIP-seq binding patterns exhibited by HMs and TFs, respectively. We described how these quantified interactions could be used to construct predictive models, and compared the robustness of the approach by constructing log-linear and SV regression models for mouse embryonic stem (mES) and human lymphoblastoid (GM12878) cell types. Although any regression algorithm could be applied, log-linear and SV regressions represent opposite ends of the speed-versus-power continuum, and the similarity of their performance suggests that simple fast algorithms are preferable for large data sets. We further demonstrated that accurate models can be constructed (using either algorithm) from PWM-predicted in silico TF-binding, which has major implications in preliminary investigation of dysregulation across disease states. Finally, we validated the regulatory interactions inferred from our models by comparing against the known roles (i.e. activator or repressor) of 20 key TFs and HMs.

Predictive modelling frameworks have the potential to fill an important gap between thermodynamically driven models of individual transcription regulatory events [66–68] and association-driven ‘network’ models of indirect gene regulation (e.g. those represented in the DREAM challenges [69]). Rather than modelling the regulation of specific genes, they can lead us to more general conclusions regarding the roles and interactions of TFs, HMs and other key regulators of gene expression. Furthermore, they avoid the common issue in bioinformatics of system underdeterminism by treating individual genes as observations of transcriptional regulatory logic in action, rather than variables in an association-driven analysis [70].

We believe that there are many hurdles preventing the research community from integrating new complex modelling frameworks into their pre-existing data analysis pipelines. One example is an overall lack of direct comparisons of modelling and feature extraction techniques across multiple data sets, which we have provided here for predictive gene expression modelling. Another is the wide range of tools and environments used for their implementation [71]. This also increases the difficulty of reproducing research, highlighted by a recent study by Begley et al. finding that 47 of 53 cancer-related research papers contained irreproducible results [72].
To address these issues, we have provided a preconfigured virtual environment containing all the code and data used to generate the results and figures presented. We encourage other researchers to explore these results, and we welcome discussion about this approach for enabling reproducible and extensible research.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.


(Smad1 and Stat3) bind further from the TSS (e.g. Smad1 has been shown to bind to
