Computational Biology in microRNA Yue Li1,2†∗ and Zhaolei Zhang2,3∗ MicroRNA (miRNA) is a class of small endogenous noncoding RNA species, which regulate gene expression post-transcriptionally by forming imperfect base-pair at the 3′ untranslated regions of the messenger RNAs. Since the 1993 discovery of the ﬁrst miRNA let-7 in worms, a vast number of studies have been dedicated to functionally characterizing miRNAs with a special emphasis on their roles in cancer. A single miRNA can potentially target ∼400 distinct genes, and there are over a 1000 distinct endogenous miRNAs in the human genome. Thus, miRNAs are likely involved in virtually all biological processes and pathways including carcinogenesis. However, functionally characterizing miRNAs hinges on the accurate identiﬁcation of their mRNA targets, which has been a challenging problem due to imperfect base-pairing and condition-speciﬁc miRNA regulatory dynamics. In this review, we will survey the current state-of-the-art computational methods to predict miRNA targets, which are divided into three main categories: (1) sequence-based methods that primarily utilizes the canonical seed-match model, evolutionary conservation, and binding energy; (2) expression-based target prediction methods using the increasingly available miRNA and mRNA expression data measured for the same sample; and (3) network-based method that aims identify miRNA regulatory modules, which reﬂect their synergism in conferring a global impact to the biological system of interest. We hope that the review will serve as a good reference to the new comers to the ever-growing miRNA research ﬁeld as well as veterans, who would appreciate the detailed review on the technicalities, strength, and limitations of each representative computational method. © 2015 John Wiley & Sons, Ltd.

How to cite this article:

WIREs RNA 2015. doi: 10.1002/wrna.1286

INTRODUCTION

M

icroRNAs (miRNAs) are a class of singlestranded endogenous noncoding RNAs that are about 19–25 nucleotides (nt) in length.1 During miRNA biogenesis in mammals, primary miRNAs † Present address: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA ∗ Correspondence to: [email protected], [email protected] 1 Department

of Computer Science, University of Toronto, Toronto, Ontario, Canada 2 Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada 3 Department

of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada Conflict of interest: The authors have declared no conflicts of interest for this article.

(pri-miRNAs) are either transcribed from independent miRNA genes or from portions of introns of protein-coding genes.2 The pri-miRNAs fold into hairpin structures, which are processed by Drosha into ∼70 nt precursor miRNAs (pre-miRNAs) and exported by exportin-5 into cytoplasm. The pre-miRNAs are cleaved by Dicer to produce a double-stranded RNA duplex, where one strand is selected to function as a mature miRNA and the other strand is usually degraded.2 Mature miRNAs are mostly known to repress gene expression at post-transcriptional levels by binding to the 3′ untranslated region (UTR) of the target mRNA transcripts with perfect (in plants) or imperfect (in animals) base-pairing.1 In human, over 500 miRNAs have been experimentally validated, and over 1000 miRNAs are predicted to exist based on miRBase

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

registry to date.3 Due to imperfect base pairing in mammalian systems, each miRNA on average would mediate the expression of about 400 target mRNAs. Thus, while the number of confirmed miRNAs is less than 3% of all genes, they potentially regulate roughly 60% of all of the protein coding genes.1,4 Given such a broad target scope, miRNAs are inevitably involved in virtually all biological processes.1 Specifically, numerous recent studies have elucidated miRNA regulatory roles in cell differentiation,5–8 immune responses,9–11 viral infections,12,13 and cancer biology.14–16 Functional characterization of miRNAs depends on precise identification of their corresponding targets. However, it has proved difficult to experimentally identify or validate individual miRNA–mRNA interactions. Recent records from miRTarBase17 show that 4572 interactions were validated by low throughput reporter assays or Western blots between 1232 miRNAs and 17,520 potential target genes among 18 species. Other popular databases such as TarBase hold comparable collections.18 However, computational prediction provides a rapid method to identify putative miRNA targets informative for the subsequent validations.19 Most of the prediction programs are based on seed region match (i.e., sequence complementarity between miRNA and the putative 3′ UTR target binding sites) and the evolutionary conservation of those sites.20 These programs generally have very poor agreements among them and can only achieve at maximum less than 50% specificity.20 The sensitivity of these programs is also questionable as about 30% of the confirmed target sites are not conserved among closely related species.20 Furthermore, most of these programs consider each miRNA–mRNA pair in isolation whereas an mRNA is likely to be targeted by multiple miRNAs and a miRNA can target multiple mRNAs that share the corresponding target site. The ‘many-to-many’ relationship between mRNA and miRNA can be conceptualized as a bipartite graph (Figure 1). Only recently, expression profiling of miRNAs and mRNAs (m/miRNAs) by microarray or nextgeneration sequencing across various tissues and cell-lines21–24 (following miRNA transfection/ knockdown25,26 ) promise to provide crucial (direct) evidence for the genome-wide functional relationship between miRNAs and their targets in tissue or cell-specific manners. However, integration of the knowledge gained from prior sequence studies with the high-throughput expression data is crucial but challenging and has been an on-going research topic with some recent progress.27–33

miR1

miR2

miR3

mRNA1

mRNA2

mRNA3

FIGURE 1 | A bipartite graph illustrates the many-to-many relationship between miRNA and their mRNA targets. Notably, the graph can also be considered as a directed acycilic graph (DAG), where miRNAs only regulate genes and genes are only regulated by miRNAs. This very property enables exact inference in several frameworks.

In this review, we will survey the representative state-of-the-art computational target prediction methods with the emphasis on the algorithmic aspects of each approach. In particular, we will (1) review some of the most widely used sequence-based programs (Table 1), (2) summarize recently developed expression-based methods (Table 2), which exploited the not only sequence information but also expression data to infer in vivo functional miRNA–mRNA interactions, and (3) highlight several network-based approach (Table 3) to study miRNA synergism.

SEQUENCE-BASED MIRNA TARGET PREDICTION TargetScan TargetScan34 is the very first mammalian miRNA target prediction algorithm developed by Lewis et al. Since then, it underwent a series of refinements,4,35–38 which maintains the program as one of the most widely used target prediction tools to date. Briefly, given an miRNA conserved across multiple organisms, TargetScan searches for seed match among a set of candidate 3′ UTR sequences in the query organism (e.g., human) that have orthologs in the reference organisms (e.g., mouse + rat + pufferfish). The seed match is defined as the perfect Watson-Crick (WC) complementary base-pairing between 2 and 8 nt seed of the 5′ end of the miRNA and anywhere in the candidate target 3′ UTR sequence (Figure 2). Comparing against controls generated from ‘miRNA-like’ permutations, TargetScan achieved 3.1:1 signals:noise ratio in predicting human miRNA targets. However, many target sites may have imperfect complementarity with the miRNA seeds such as the nematode let-7:lin-41 or vertebrate miR-196:HoxB8.35 However, the authors showed that relaxing the seed-match by allowing one mismatch essentially led to a nearly 1:1 signal:noise (i.e., equivalent to random guess). The dilemma underscores the need for additional discriminative features.

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

TABLE 1 Sequence-Based Methods Method TargetScan(S)

SM √

4,34–37

FBE √ ( ) √

√

DIANA-microT40

SAE

CT √ √

√

miRanda41 PicTar43 PITA48

√

√

√

√

CS √ ( )

Cb

√

√

√ √

The predictions methods are based on the following features: SM, seed match; FBE, free binding energy; SAE, site accessibility energy; CT, conserved targeting; CS, Context Score; Cb, combinatorial regulation of multiple miRNAs. TargetScan uses either CS or CT as score, where CT is computed for individual target sites as probability PCT rather than used as a binary filter as in other programs. Moreover, the CS and Cb only partially account for some of the confounding effects including target abundance (‘dilution effect’) and miRNA interaction in the form of synergistic or competitive regulation, respectively.

TABLE 2 Expression-Based Methods Methods 31

TargetScore

GenMiR27,52–54 BayesPro30

BM √

PanMiRa71

TA

Act

PO

TB √

√

VI √

MCMC

EB √

SS

√

√

√ √

MLR60,94,62 ProMISe32

LS

√

√

√

√ √

√

√

√

TargetScore: inferred target posterior based on Bayesian Gaussian mixture model (another similar method exists28 ); GenMiR: generative model of miRNA; BayesPro: Bayesian model using protein output; MLR: multivariate linear regression; ProMISe: probabilistic miRNA-mRNA interaction signature; PanMiRa: Pan-cancer miRNA association; BM: Bayesian method; LS: least squares; TA: trans-acting factors that affect target mRNA level (e.g., RNA saturation effect, RISC availability, copy number variation, DNA methylation, etc.); Act: inferring miRNA activities; PO: using protein output as response; TB: transfection-based model; VI: variational inference; MCMC: Markov chain Monte Carlo or sampling methods; EB: empirical Bayes; SS: sample-specific miRNA-mRNA relationships.

TABLE 3 Network-Based Method to Detect miRNA Regulatory Modules

Seed match 5′ – [ORF]

Method

Algorithm

Enumeration of Bi-Clique 81

Discrete network enumeration

Population-based learning 80 Genetic algorithm and estimation of distribution GroupMiR82

Nonparametric Bayesian with Indian Buffet Process

SNMNMF87

Constrained nonnegative matrix factorization

PIMiM88

Regularized linear regression with group LASSO constraint

Mirsynergy33

Detecting modules by greedy-based deterministic neighbourhood expansion

TargetScanS is a refined or ‘simplified’ version of TargetScan, with emphasis on the 6-mer conserved seed match, t1A anchor (Figure 2), and conservation island.35 The rationale for the latter criterion is that the increased overall conservation of a 3′ UTR results in decreased fraction of the conservation that can be explained by its pairing to the miRNA and consequently leads to more fortuitous conserved

N N N N N N N [A⏐N] 3′ –

Poly(A)

N N N N N N N N N N – 5′ 8 7 6 5 4 3 21 Seed

microRNA

FIGURE 2 | Canonical miRNA Watson-Crick base pairs to the 3′ UTR of the mRNA target site. The most critical region is a 6mer site termed as the ‘seed’ occurs at position 2–7 of the 5′ end of the miRNA.34 Three other variations centring at the 6mer seed are also known to be (more) conserved: 7mer-m8 site, a seed match + a Watson-Crick match to miRNA nucleotide 8; 7mer-t1A site, a seed match + a downstream A in the 3′ UTR; 8mer, a seed match + both m8 and t1A. The site efﬁcacy has also been proposed in the order of 8mer > 7mer-m8 > 7mer-A1 > 6mer.4,36 ORF, open reading frame; (NNNNN), the additional nucleotides to the shortest 19 nt miRNA; [A|N], A or other nucleotides; Poly(A), polyadenylated tail.

but nonfunctional pairing between the 3′ UTR and miRNAs. Friedman et al.4 developed a quantitative approach to assess individual target sites based on the probability of conserved targeting (PCT ). Briefly, 3′ UTRs are separated by a conservation rate into 10 bins examined independently of each other to account for the diverse basal conservation rates among the UTR sequences. A phylogenetic tree is constructed

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

from each candidate target site using 28-genome alignment corresponding to 28 vertebrate species. Lengths of branch segments connecting the species having the common target site are summed together and normalized to the averaged branch length in the corresponding bin to represent a unit-less branch length score as the relative conservation of the individual target sites. However, conservation may not imply function and vice versa. Accordingly, seven context-dependent but conservation independent features surrounding the seed match are proposed.36,37,39 These features include seed match type (6mer seed, 7mer-tA1, 7mer-m8, and 8mer; Figure 2), number of sites in the same target, site location relative to the ends of 3′ UTR, local AU content, 3′ -supplementary pairing, seed-pairing stability, and the target abundance.37 The contribution of these features is assessed by the corresponding linear coefficients from multivariate linear regression on mRNA expression fold-changes due to miRNA transfection. Collectively, a ‘context score+’ for each target site is the sum over the coefficients of the features and used to quantitatively assess the plausibility of the target sites. The context score is not only orthogonal to the conserved targeting score PCT but also more suitable to identify novel or species-specific target sites that cannot be identified by the former conservation-based approach. Recently, another feature, named differential 3′ UTR isoform usage, was introduced38 to account for affected isoform ratios (AIRs) that contain the target sites for each miRNA.

DIANA-microT The development of DIANA-microT40 differs from many other sequence-based algorithms in that it predicts miRNA targets based on computationally determined binding energy and a set of structural features termed ‘miRNA-recognition elements’ (MRE) that were experimentally induced in-house by the authors. DIANA-microT starts with filtering out miRNA:target with less than three consecutive WC pairings between the two sequences. For the remaining pairs, the program then approximates the minimum binding free energy between miRNA and every 38-nt segment of the 3′ UTR by sliding a 38-nt window across the 3′ UTR and assessing the base-pairing by global sequence alignment using a modified version of Needleman-Wunsch (dynamic programming) algorithm. The candidate interactions with low free energy were subject to filter by a set of putative MRE-rules derived from a series of point mutations of lin-28, a target mRNA of the endogenous miRNA let7-b.

MiRanda MiRanda41,42 is another popular sequence-based algorithm initially developed to predict miRNA targets in fly and later on applied to human genome. The program operates in three phases including SmithWaterman local alignments of miRNA:UTR, assessing thermodynamic folding energy of miRNA:UTR duplex by RNAfold, and conservation filtering based on evolutionary distance. The scores derived from the first two phases allow MiRanda to resolve cases where multiple miRNA bind to the same site by greedily choosing the sites with the highest alignment score and lowest folding energy. MiRanda was applied to identify miRNA targets in both fly41 and human genome42 . The false discovery rates for single sites in both fly and human are estimated to be 35%, a slightly higher rate than those estimated for TargetScan(S). Since MiRanda is similar to TargetScan, it shares similar limitations

PicTar PicTar43 (Probabilistic identification of combinations of Target sites) is based on hidden Markov model (HMM) with a maximum likelihood (ML) approach, which is built on the logics of several earlier works from Siggia group (Figure 3).44–47 Let S be a 3′ UTR sequence, L the length of S, and w the miRNA ‘word’ (i.e., seed region) of length lw , pw the transition probability of the occurrence of miRNA w, and pb the transition probability for the background of length lb = 1, which is simply estimated from the fraction of A, U, G, C (i.e., Markov model of order 0) either in S with length > 300 nt or from all query UTRs. Thus, S can be represented by multiple different ways of concatenating the segments corresponding to either miRNA target sites or background. The goal is to obtain at any arbitrary nucleotide position i of the 3′ UTR sequence S the posterior probability p(𝜋 i = w|S, 𝜃) that i is the last position of the word w, where 𝜃 is the model parameters controlling emission probabilities. Following Markov’s assumption, p(𝜋 i = w|S, 𝜃) is proportional to the products of the probabilities before and after position i, which can be computed in time O(LK) by Forward–Backward algorithm. The optimization of the transition probability pw in PicTar is performed using Baum-Welch algorithm in Expectation–Maximization (EM). At E-step, PicTar performs forward recursion to compute the joint likelihood Z(1, i, 𝜋 i = w) and backward recursion to compute Z(i + 1, … , L|𝜋 i = w). At M-step, PicTar updates transition probability pw . The algorithm alternates between E and M-step until the likelihood

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

Sequences of coexpressed mature microRNAs

CGUGCUAGCUACGAUCGAUCGACUUCGAUCGAC

3′ UTR sequence

p_1

miR1

p_1

p_b

p_1

PicTar maximum likelihood fit

p_2

p_1

p_b

BKG

p_3 p_b

p_2

p_3 p_b

p1

p2

p3 miR2

CGUGCUAGCUACGAUCGAUCGACUUCGAUCGAC

p_2

p_3 p_2

miR3

p_3

PicTar score

FIGURE 3 | PicTar43 schematic view. Given a 3′ UTR and a set of K miRNA sequences, PicTar uses a (K + 1)-state HMM to infer whether each segment of the 3′ UTR represents a seed match to one of the K miRNAs or background (BKG). As a simple illustration, only three miRNAs (miR1, 2, 3) are shown in the HMM model on the right.

p(S|𝜃) increases by less than a cutoff. Finally, the PicTar score is defined as likelihood ratio of observed over background. The major advantage of PicTar over other models discussed so far is that the coordinate actions of the miRNAs (synergistic in case of optimally spaced sites or antagonistic in case of overlapping binding sites) are naturally captured within the emission and transition probabilities. For instance, the PicTar score as the joint probability of multiple miRNA target sites will be higher than the linear sum of individual miRNA target sites (i.e., synergistic effects). Longer 3′ UTR will score less than shorter 3′ UTR if both contain the same number of target sites. However, the emission probabilities e(s|w) in PicTar are assumed known or set arbitrarily based on either seed match or conservation.

PITA PITA (Probability of Interaction by Target Accessibility) was developed to evaluate systematically the accessibility imposed by the sequence context of the target sites in order to explain the observed variability in target strength from identical seed matches.48 PITA was inspired by the experimental observation of differing reporter activity when embedding the same target sites into a closed stem and an open loop structure of the reporter constructs. The latter conferred considerably higher repression strength than the former structural context, which led to the hypothesis that target efficacy may depend on the accessibility energy elicited from the local secondary structural environment. Accordingly, PITA estimates the accessibility energy ddG as ddG = dGduplex − dGopen, where dGduplex

is the binding free energy of the miRNA-target duplex structure estimated by RNAduplex49 , and dGopen is the energy required to make the target region accessible for the miRNA binding and can be further decomposed into two components: dGopen = dGfree − dGunpaired , where dGfree is the free energy of the ensemble of all secondary structures of the target region and dGunpaired is the free energy of all target-region structures in which the target nucleotides (±70 nt flanking region) are required to be unpaired. Both quantities are estimated by RNAFold over all possible predicted secondary structures.49 Finally, ddG1, ddG2 , … , ddGn scores corresponding to multiple sites s1 , s2 , … , sn in the same target 3′ UTR are combined. Notably, however, fully addressing the structural variation is a daunting task, which requires considering not only the interaction between miRNA and target but also the effects that RNA-binding proteins such as HuR or ELAVL1 (an RNA-degrading factor) or PABPC1 (polyA binding protein) have on the secondary structure formation.

EXPRESSION-BASED MIRNA TARGET PREDICTION Over the past few years, the dramatically decreased costs of microarray and RNA-seq measurements resulted in an explosion of deposition of highthroughput data measuring m/miRNAs expression across various conditions and cell types. As in the current writing, there are 25,044 entries in GEO database that are associated with the word ‘microRNAs’ or ‘miRNA’. Consequently, there has been an ongoing

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

miRNA transfection

(a)

(b) Initial

Control

AAA

Treatment

AAA

Variational Bayesian Gaussian mixture model

AAA

RISC Site for transfected miRNA Transfected miRNA

N

Transfected control Sequence-based score: Over-expression data: xf: log2fold-change

x1: context score+

x2: PCT

...

FIGURE 4 | RACER62 schematics. (a) The mRNA expression of gene g (yg ) is modelled as a function of the following input variables (left to right): TF-binding signals (bg,TF ), DNA methylation (mg ), copy number variation (ng ), miRNA–mRNA interactions implicated in the sequence-based seed match (c g,miR ) 3′ UTR regions of the mRNA and miRNA expression (zmiR ). (b) Two-stage regression analysis. At stage 1, RACER estimates the sample-speciﬁc TF and miRNA activities for each sample t . At stage 2, RACER uses the inferred regulatory activities of TFs and miRNAs to estimate the interaction scores wg,TF and wg,miR between gene g and TF and between gene g and miRNA miR across all of the T samples, respectively.

paradigm shift from the conventional sequence-based predictions of potential miRNA targets to predictions of functional miRNA targets, that can statistically explain (at least partially) the observed expression patterns. The following subsections review several representative expression-based target prediction algorithms (Table 2).

Predicting miRNA Targets Using miRNA-Overexpression Data One of the most direct way to query the targets of a given miRNA is by transfecting the miRNA into a cell and examine the expression changes of the cognate target genes22 (Figure 4(a)). Presumably, a bona fide target will exhibit decreased expression upon the miRNA transfection. In particular, overexpression of miRNA coupled with expression profiling of mRNA by either microarray or RNA-seq has proved to be a promising approach.22,50 Consequently, genome-wide comparison of differential gene expression holds a new promise to elucidate the global impact of a specific miRNA regulation without solely relying on evolutionary conservation. However, miRNA transfection is prone to off-target effects. For instance, overexpressing a miRNA may not only repress the expression of its direct targets but may also cause a cascading repression of in-direct targets of the affected transcription activators. To our knowledge, there are few programs developed specifically for transfection-based miRNA target prediction. Among them, Sylamer is the earliest work developed to identify enriched k-mer motifs,

which are the seed regions among the top ranked targets based on p-values obtained from t-test51 . Thus, Sylamer does not model the distribution of fold-changes or sequence features but rather was mainly designed to visually inspect the enrichment of k-mer patterns rather than predicting specific targets. Liu et al. developed a Bayesian Gaussian mixture model (GMM) called Expmicro28 to take advantage of the available transfection-based expression data combined with sequence-based prior information. However, Expmicro requires training data in order to calculate the sequence-based scores using SVMicro and employs only a two-component GMM to model the fold-changes. To improve prediction accuracy for direct miRNA targets, we recently developed a probabilistic model called TargetScore that integrates expression change due to miRNA overexpression and sequence information such as context score36,37 and other orthogonal sequence-based features such as conservation4 into a probabilistic score. In TargetScore, each score feature is considered as an independent observed variable, which is the input to a Variational Bayesian-Gaussian Mixture Model (VB-GMM) (Figure 4(b)). Specifically, given expression fold-change (due to miRNA transfection), we use a three-component VB-GMM to infer down-regulated targets accounting for genes with little or positive fold-change (due to off-target effects26 ). Otherwise, two-component VB-GMM is applied to unsigned sequence scores. The parameters for the VB-GMM are optimized using Variational Bayesian EM (VB-EM) algorithm. The mixture component

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

with the largest absolute means of observed negative fold-change or sequence score is associated with miRNA targets and denoted as ‘target component’. The other components correspond to the ‘background component’. It follows that inferring miRNA–mRNA interactions is equivalent to inferring the posterior distribution of the target component given the observed variables. The targetScore is computed as the sigmoid-transformed fold-change weighted by the averaged posteriors of target components over all of the features. However, miRNA transfection experiments represent nonphysiological conditions (i.e. massive over-expression of exo/endogenous miRNAs) and may lead to misleading conclusions about endogenous miRNA targets. Thus, as discussed below, we may also need to examine the overall expression correlation across clinical samples or some other orthogonal evidence to ascertain the endogenous regulatory effects of a miRNA on its candidate targets.

Inferring miRNA Targets Using Paired mRNA and miRNA Expression GenMiR (Generative model of miRNA regulation) represents the first expression-based framework that couples miRNA expression with mRNA expression profiles in miRNA target prediction.27,52–54 Among its three generations, this review focuses on the most successful GenMiR++27,53 It is worth mentioning that by using only mRNA expression GenMiR++ explicitly assumes that mRNA degradation rather than translational repression is the primary mode of miRNA regulation, which is also supported by other studies22,23,55,56 although it is still a highly debatable subject30,57,58 . Essentially, GenMiR++ is a linear regression model, which assumes a bipartite graph (Figure 1) where miRNAs and mRNA expression levels are treated as independent variables and response variable, respectively. The target mRNA expression follows a Gaussian distribution with unknown covariance and expected mean( estimated from ) the linear∑ regression model p xg |Z, S, Γ, Λ, Θ = N(xg ; 𝜇 − k 𝜆k sgk Γzk , Σ). Being Bayesian, the goal is to infer the marginal posterior probability for the latent miRNA-target relationship sgk , whose close formed solution is difficult to obtain. As an approximation, VB-EM is used. Specifically, VB-EM starts with a surrogate distribution, which is much easier to solve due to the mean-field factorization than the exact posterior: q(S, Λ, Γ|C) = q(S|C)q(Λ)q(Γ), where )1−sgk ∏ s ( . The goal of q (S|C) = (g,k)|cgk =1 𝛽gkgk 1 − 𝛽gk the VB inference step in the VB-EM is to minimize

the Kullback–Leibler divergence, which is equivalent to setting the latent variable to their expected value conditioned on other variables. This comes to the EM part of the VB-EM algorithm. In the E-step, 𝛽 gk is chosen conditioned on the (initialized) model parameters; in the M-step, the model parameters are chosen conditioned on 𝛽 gk . The algorithm iterates until converges. Finally, the GenMiR++ score for a pair of miRNA k and mRNA g is defined as a log 𝛽 gk − log (1 − 𝛽 gk ). Notably, to achieve a high specificity, GenMiR++ operates only on the targets predicted from TargetScanS35 , which nonetheless limits its sensitivity. GenMiR++ was first applied to 1770 TargetScanS candidates from 78 mouse miRNAs using an expression consortium across 17 mouse tissues53 and subsequently to 151 human miRNAs and 16,063 mRNAs across 88 normal or tumor tissue samples27 . In both cases, GenMiR++ demonstrated promising specificity, and the predicted targets tend to be more enriched for meaningful biological processes.

Target Prediction Using Proteomic Data Numerous studies have demonstrated that miRNAs predominantly repress mRNA expression by mRNA degradation 22,23,55,56 as implicated in the negative correlation between mRNA and miRNA levels. Nonetheless, translational repression without influencing mRNA levels was suggested 57,58 . Regardless of the debate on the two major modes of miRNA repression, it is always ideal to predict miRNA targets directly from the protein outputs (although the costs for proteomic profiling are usually much higher than mRNA expression profiling). Accordingly, Li and Min et al. described a Bayesian framework30 to infer miRNA targets based on protein outputs measured by mass spectrometry as a function of miRNA expression and putative miRNA-target relationship derived from TargetScan. The proposed Bayesian framework to model protein output is similar to GenMiR++ except that Negative Binomial (NB) distribution was used to model the discrete peptide counts and that the marginal posterior distribution of interactions was inferred using Gibbs sampling.

Inferring Cancer-Specific miRNA Activities and the Cognate Targets Via Regularized Multivariate Regression Analysis Inferring the activities of transcriptional regulators based on the correlation between the occurrences of regulatory sequences (input variables) and gene expression (response) was first demonstrated more

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

than 10 years ago by Bussemaker et al.59 Essentially, the linear coefficients attributed to each regulator were considered as the corresponding activities. Notably, the MLR model here can operate on a single expression profile. This basic idea inspired many studies including ours to infer miRNA (and TF) activities.60–62 In particular, Setty et al. described a LASSO model to infer miRNA/TF activities attributable to expression changes of each gene yg measured from glioblastoma multiforme (GBM) tumor samples, taking into account the effects from copy number (CN) (Cg ), TF-binding sites (Ng,TF ) from the motif-based database TRANSFAC63 , and miRNA target sites (Ng,miR ) information60 : ∑ ∑ yg ∼ wCN Cg + miR wmiR Ng,miR + TF wTF Ng,TF . To estimate wmiR (miRNA activities) and wTF (TF activities), the authors proposed two variants of training the MLR, namely a sample-by-sample LASSO and a group multi-task ridge regression (L2 -norm) MLR model. The first LASSO operates on each expression profile separately to minimize for each sample the L1 -regularized squared error. The group LASSO model, however, aims to train all of the regression models simultaneously by sharing information across (a)

TF-binding signals (bg,TF)

samples and regularizing each model based on GBM subtype information. Using either of the MLR models, the authors assessed the importance of each regulator and their cognate targets based on the total residual errors induced by their exclusion. Recently, we developed a two-stage regression model called RACER (Regression Analysis of Combined Expression Regulation) by leveraging both the sample-specific expression data from The Cancer Genome Atlas (TCGA) and the TF occupancy data from ENCODE.62 Briefly, we first performed a similar regression model as above to infer sample-specific TF activities using ENCODE TF-binding data derived from K562 and miRNA activities using sequence-based information and miRNA expression levels for each sample (Figure 5). Using the activities as input, we then inferred the TF-mRNA and miRNA–mRNA relationships across all samples. As a case study, we used CNV, DNA methylation, m/miRNA expression measured from 173 patients with Acute Myeloid Leukemia (AML) from TCGA 64 and 97 TF occupancy data from ENCODE-K56265,66 . Encouragingly, power analysis on predicting validated or confidence miRNA targets and TF-binding sites revealed competitive Copy number variation (ng)

TF A × ng copies

Pol II

TF B Promoter

CDS

Seed match regions to miR (cg,miR)

DNA methylation (mg)

Pri-miRNA

RISC

polyA polyA polyA

Dicer

Drosha

3’UTR

miRNA expression (zmiR) mRNA expression (yg)

(b) Stage 1: Estimate sample-specific TF and miR activities (αTF,t, αmiR,t) in sample t: [yg,t]N×1 ≈ α0 + αCNV,t[ng,t]N×1 + αDM,t[mg,t]N×1 + [bg,TF]N×K×[αTF,t]K×1 + [cg,miR]N×M× ([αmiR,t]M×1[zmiR,t]M×1) Stage 2: Estimate TF/miR-gene interactions (wTF,g, wg,miR) for gene g across all samples: [yg,t]1×T ≈ w0 + wg,CNV[ng,t]1×T + wg,DM[mg,t]1×T + [wg,TF]1×K*×[αTF,t]K*×T + [wg,miR]1×M*×[αmiR,t]M*×T

FIGURE 5 | ProMISe32 schematics. (a) Expressed seed match of mRNA i for miRNA k is deﬁned as the product of the number of target sites c i ,k and the total expression of mRNA i . The probability of mRNA i ‘attracting’ miRNA k takes into account the expression of miRNA k and the total expression of other mRNA xt that carries compatible seed match for miRNA k . (b) Conversely, the expressed seed k for mRNA i is the number of target sites that miRNA k can recognize on mRNA i multiplied by the expression of miRNA k . The probability of miRNA k targeting mRNA i considers the total expression of mRNA i with respect to all of the other miRNA expression z that can recognize the target sites of mRNA i .

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

performance of RACER over other existing methods. RACER’s favourable performance is likely attributed to the linear decomposition of the various expression co-regulators, which may be over/underestimated in reduced or alternative model formalisms.

Inferring Probabilistic miRNA–mRNA Interaction Signatures Using Sample-Specific miRNA and mRNA Expression Despite diverse modelling approaches, the expressionbased methods reviewed so far (except for TargetScore) rely on a negative expression correlation between miRNA and mRNA. They usually require a large number of samples to compute miRNA–target interactions (MTI) and thus have difficulty in (a)

mRNA competition

mRNA targets for miRNA k

mRNA1

mRNA2

identifying ‘personalized’ MTI in individual samples. Indeed, each tissue or cell line has a unique miRNA regulatory network with weighted MTI edges, which can be used as molecular signatures similar to the uniqueness of mRNA/miRNA expression profile.21,67 Accordingly, we recently described an alternative model called ProMISe (Probabilistic MiRNA–mRNA Interaction Signature) inspired by an analogy of ‘role-switch’, where miRNAs and mRNAs are considered respectively as ‘predators’ and ‘preys’ in one scenario and then as ‘preys’ and ‘predators’ in the other scenario, respectively. In particular, ProMISe takes into account the competitions among mRNAs (and miRNAs) for the same miRNA (and mRNA) using paired expression profile coupled with target site information (Figure 6).

×

c1,k ×

c2,k ×

total mRNA expressed = expression seed matches x1(t)

x1(t)

(b) miRNAs that target mRNA i

miRNA1

ci,k

×

x2(t)

xi(t)

xi(t)

miRNA competition ×

ci,1 ×

miRNA expressed = expression seeds

ci,2 ×

z2

ci,k ×

.. .

xi(t)

z2

. . . miRNAk

mRNA i total expression

z1 z1

miRNA2

.. . zk

x2(t)

. . . mRNAi

miRNA k expression

zk zk

FIGURE 6 | PanMiRa71 schematics. (a) The pan-cancer study enabled by TCGA was performed on 12 distinct cancer types each containing hundreds of patient samples each measured by mRNA/miRNA expression, DNA methylation, and copy number. (b) Suppose target RNA expression (y i ,t ,d ) in sample t of cancer type d is a function of DNA methylation (x DM i ,t ,d ), copy number (x i ,t ,d CN ) and miRNA regulation (x k ,t ,d miR ). The expression change across samples for the target RNA is modelled as the response variable in a multivariate linear regression framework using the input variables as indicted above. (ç) The resulting linear coefﬁcient 𝛽 miR i ,k ,d s indicate the corresponding interaction between miRNA k and target gene i of cancer type d and are transformed into z -scores, which are then subsequently subjected to local false discovery rate (locfdr ) estimation.72 (d) The joint posteriors for the recurrent interactions given the z -scores are inferred by empirical Bayes using the probabilistic quantities obtained from the locfdr procedure above.

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

Briefly, the probability that mRNA i attracting a specific miRNA k is calculated as the reversed probability that miRNA k is attracted by other mRNA j (j ≠ i). Conversely, the probability that miRNA k targeting a particular mRNA i is calculated as the reversed probability that mRNA i is targeted by other miRNA l. Suppose we have N mRNAs and M miRNAs, let x and z be the observed mRNA and miRNA expression vectors, respectively, and C be the N-by-M seed-match matrix, where ci,k denotes the number of target sites on mRNA i for miRNA k. Assuming the cell is at a condition-specific equilibrium state, where mRNAs are being dynamically transcribed and degraded (partly due to miRNA binding), then the total amount of transcribed mRNA targets xt are unobserved and higher than the observed (equilibrium) expression level xo . Our goal is to infer p(ti,k |x(t) , z(t) , C) for whether miRNA k targets mRNA i, given the (hidden) total transcription levels of mRNA and miRNA. To this end, we perform an iterative procedure by estimating p(ti,k |x(t) , z(t) , C) given the total transcription level and estimating total transcription level given p(ti,k |x(t) , z(t) , C) (Box 1).

Inferring Recurrent miRNA–mRNA Interactions Via Pan-Cancer Analysis MiRNAs have long been proposed to contribute to cancer biology because they can function as tumour suppressors (e.g., miR-200 family), oncogenes (e.g., miR-155), or both (e.g., miR-29 family).68,69 Aberrant expression level of miRNAs are implicated in various cancers, and miRNA expression profiles have been shown to provide better discriminative power in classifying human cancers than mRNA profile.21 Systematic approaches to infer the recurrent miRNA–mRNA associations are underway. However, the systematic biases due to different experimental conditions, sample heterogeneity, sample size, and many other nonuniform aspects make the pan-cancer analysis particularly challenging. Recently, Jacobsen et al.70 developed a rank-based method called recurrence (REC) to evaluate the recurrent interactions across 11 cancer types using TCGA data. The miRNA–target interaction strength in individual cancer type is estimated by the corresponding coefficients in a multivariate linear regression model, which also takes into account the biases in estimating target expression changes due to the corresponding copy number and DNA methylation changes. Essentially, a specific miRNA–target pair that is ranked high in terms of its regression coefficients among all cancer types will confer a high REC score. As the rank-based approach is scale-free, REC avoids direct comparisons

BOX 1 PROMISE ITERATIVE ALGORITHM Initialize x(t) = x (o)

zk ∑ ⎡ j≠i cj,k x (t) ⎤ ) ( j x ⎥ |x(t) , zk, c., k = 1 − ⎢ ∑ 1. p ti,k ⎢ ′ c ′ x (t) ⎥ ′ j ,k j ⎣ j ⎦

( ) (z) (t) 2. p ti,k |xi , z, ci , . = 1 −

[∑

l≠k ci,l zl

∑

l′

]xi(t)

ci,l′ zj′

( ) ( ) (j) (t) (x) 3. p ti,k |x , z, C = p ti,k |X (t) , zk , c., k ( ) (z) (t) p ti,k |xi , z, ci , .

( ) 4. Δxi,k = 𝜂p ti,k |X (t) , zk , c., k xi(t)

5. xi(t)∗ = xi(t) +

∑ k

x (t)∗ Δxi,k ; xi(t) = ∑i (t)∗ T xi i

[ ( )t | (x) (t) |x , zk , c., k 6. Repeat 1 − 5 until max ||p ti,k | ( )t−1 |] (x) (t) | < tol, −p ti,k |x , zk , c., k | | [ ( ) ( )t−1 |] t | (z) (t) (z) (t) | < tol | max |p ti,k |xi , z, ci , . − p ti,k |xi , z, ci , . | | |

between cancer types, which can be misleading due to the aforementioned systematic biases inherent in each cancer dataset. Notably, however, the REC score is derived from a chi-squared approximation, and the magnitude of the score is arbitrary. To reliably infer the distribution of recurrent miRNA–target interactions from the pan-cancer data compendium, we recently developed a novel Bayesian framework called PanMiRa71 (Pan-cancer MiRNA-target Associations) (Figure 7). Briefly, we first constructed a large pan-cancer data compendium

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

BLCA

BRCA

CRC

HNSC

KIRC

LGG

LUAD

LUSC

PRAD

SKCM

THCA

UCEC

mRNA expression

microRNA expression

DNA methylation

Copy number

Genome-wide measurements

(a) TCGA data for 12 distinct cancer types:

(b) Linear regression model for sample t at cancer type d:

(c) Local false discover rate estimation:

(d) Empirical Bayesian infernece of

recurrent interactions across D cancer types:

False interactions p(zi,k,d|ti,k,d = 0)

Frequency

60000 40000

True interactions p(zi,k,d|ti,k,d = 1)

20000 0 –10

FIGURE 7 |

–5

0

5

TargetScore31

schematics. (a) Cartoon of miRNA overexpression or transfection experiment. A miRNA mimic of interest or control hairpin is transfected into a cell. True target genes are expected to exhibit expression decreases relative to the control cell. Additional to the expression fold-change (x f ) due to miRNA transfection, the input data also consists of sequence-based scores (x 1 , x 2 , … , x L ). All input variables are continuous. (b) For each score type of gene n (xn ), we used a Variational Bayesian-Gaussian Mixture Model (VB-GMM) to infer the posterior distribution of the binary target status (znk ), given the observed feature scores x n ∈ (x 1 , x 2 , … , x L ). The plate model indicates a repeating pattern of the generative model for all of the N genes.

from TCGA, providing paired mRNA and miRNA (m/miRNA) expression data as well as copy number (CN) and DNA methylation (DM) data measured for the same sample across 12 cancer types (Figure 7(a)). Suppose the mRNA target expression level is a function of DM, CN, and miRNA regulation (Figure 7(b)). To estimate the individual miRNA-target relationships, we used a multivariate linear regression model, which is fitted to the observed target expression, taking into account the biases due to the baseline expression level, DM, and CN effects (Figure 7(c)). The resulting linear coefficients are indicative of the corresponding miRNA–target relationships in the specific cancer type: the more negative they are the more likely the interactions are real. Finally, we exploited an empirical distribution of the linear coefficients estimated from the locfdr72 approach to estimate the joint posterior of the true interactions across the D cancer types via an empirical Bayes algorithm73 (Figure 7(d)). As a future work, we will develop a more sophisticated statistical model to first learn the intrinsic miRNA regulatory elements from the CLASH74 (cross-linking, ligation, and sequencing of

hybrids) data and then incorporate that information as Bayesian prior to infer the posterior distributions of the functional and direct miRNA–target interactions based on expression data.

NETWORK-BASED METHODS TO DETECT MIRNA-REGULATORY MODULES Although targets of individual miRNAs are significantly enriched for certain biological processes,75,76 it is also likely that multiple miRNAs are coordinated together to regulate synergistically one or more pathways.43,77,78 Indeed, despite their limited number (2578 mature miRNAs in human genome, miRBase V203 ), miRNAs may be in charge of more evolutionarily robust and potent regulatory effects through coordinated collective actions. The hypothesis of miRNA synergism is also parsimonious or biologically plausible because the number of possible combinations of the 2578 human miRNAs is extremely large, enough to potentially react to virtually countless

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

environmental changes. Intuitively, if a group of (miRNA) workers perform similar tasks together, then removing a single worker will not be as detrimental as assigning each worker a unique task.33,77 Several related methods have been developed to study miRNA synergism. Some early methods were based on pairwise overlaps79 or score-specific correlation78 between predicted target sites of any given two (co-expressed) miRNAs. For instance, Shalgi et al.79 devised an overlapping scoring scheme to account for differential 3′ UTR lengths of the miRNA targets, which may otherwise bias the results if standard hypergeometric test was used. Methods beyond pairwise overlaps have also been described. These methods considered not only the sequence-based miRNA-target site information but also the respective miRNA–mRNA expression correlation (MiMEC) across various conditions to detect miRNA-regulatory modules (MiRMs). For instance, Joung et al.80 developed a probabilistic search procedure to separately sample candidate module members from the mRNA and miRNA pools with probabilities proportional to their overall frequency of being chosen as the ‘fittest’, which was determined by their target sites and MiMEC relative to the counterparts. The algorithm found only the single best MiRM, which varied depending on the initial mRNA and miRNA set. Other network-based methods are limited to only the sequence information or differentially expressed (DE) m/miRNAs for a more disease-focused network construction. For instance, Peng et al.81 used an enumeration approach to search for maximal bi-clique on DE m/miRNAs to discover complete bipartite subgraphs, where every miRNA is connected with every mRNA. The approach operated on unweighted edges only, which required discretizing miRNA–mRNA expression correlation. Also, maximal bi-clique does not necessarily imply functional MiRMs and vice versa. We review below in more details four recently developed network methods (Table 3) to detect MiRMs. Despite distinct unsupervised learning frameworks, all three methods exploit the widely available paired m/miRNA expression profiles to improve upon the accuracy of earlier developed (sequence-based) network approaches.

GroupMiR: Inferring miRNA and mRNA Group Memberships with Indian Buffet Process The same mRNAs (miRNAs) may interact with different sets of miRNAs (mRNAs) in different pathways. The exact number of pathways is unknown and may grow with an increase of size or quality of the training data. Thus, it is more natural to infer the number

of common features shared among different groups of miRNAs and mRNAs. Accordingly, Le et al.82 proposed a powerful alternative model called GroupMiR (Group MiRNA target prediction). Same as the expression-based method described above, the input to GroupMiR are a sequence-based miRNA–target prediction matrix (to constrain the search space) and a pair of mRNA and miRNA expression matrices measured for the same cell types or samples across multiple time points or experimental conditions (i.e., expression data of mouse lung development over 7 time points). GroupMiR then explores the latent binary features or memberships possessed within mRNAs, miRNAs, or shared between mRNAs and miRNAs on a potentially infinite binary feature space empowered by a nonparametric Bayesian (NBP) formalism called the Indian Buffet Process (IBP)83 (which was later on proved to be equivalent to Beta process84 ). IBP can be understood from an analogy of a type of an ‘Indian buffet’ as follows. A finite number of N customers or objects form a line to enter one after another a buffet comprised of K dishes ∑ m or features. Each customer i samples k i k dishes selected by mk previous customers, and Poisson (𝛼/i) new dishes, where 𝛼 is a model parameter. The choices of N customers on the K dishes are expressed in an N-by-K binary matrix Z weighted by a sequence-based miRNA–mRNA interaction matrix. A left-order function (lof ) maps a binary matrix Z to a left-ordered binary matrix with columns (i.e., dishes) sorted from left to right by decreasing order of mk and breaking ties in favour of customers who enter the buffet earlier. This process defines an exchangeable distribution on equivalence class [Z] comprising all of the Z that have the same left-ordered binary matrix lof(Z) regardless of the order the customers enter the buffet (i.e., row order) or the dish order (i.e., column order). As a result, inference on potentially infinite number of features (i.e., MiRMs) becomes feasible. Similar to GenMiR27 , GroupMiR then performed a Bayesian linear regression on each mRNA expression using miRNA expression but placing more weight on the expression of miRNAs that shared one or more common features with that mRNA. Inference in GroupMiR was performed via MCMC. Finally, the posteriors of Z for all feature columns with at least one nonzero entry serve as the target prediction of GroupMiR. Specifically, the outputs from GroupMiR are membership assignments to each miRNA and mRNA. The biological interpretation is that multiple miRNAs that share latent features with the cognate mRNA targets will likely co-regulate a common pathway. Comparing with the outputs from standard clustering methods such as K-means or mixture model (where

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

each m/miRNA must belong to a single cluster or their posteriors for all clusters must sum to 1), the membership sharing system is more biologically plausible because each miRNA can be assigned with more than one membership, implying that they may engage in multiple biological processes each involving a distinct set of other miRNAs and cognate mRNA targets. However, the slow mixing rate is the main issue for IBP-based framework due to the intensive Gibbs samplings required to perform the inference, which prevented GroupMiR from fully exploring the data space at a genome scale offered by microarray or RNA-seq platforms, and consequently compromised the model accuracy. Adaptations of efficient IBP inference algorithms are underway.85

SNMNMF: Sparse Network-Regularized Multiple Non-Negative Matrix Factorization The non-negative matrix factorization (NMF) algorithm was originally developed to extract latent features from images.86 NMF serves as an attractive alternative to conventional dimensionality reduction techniques such as principle component analysis (PCA) because it factorizes the original s-by-n matrix V (for s images and n pixels) into two non-negative matrices V = WH, where W and H are the s-by-k ‘basis image’ and k-by-n ‘image encoding’ matrices, respectively. Recently, Zhang et al.87 adapted the NMF algorithm to detecting miRNA-MiRMs. Specifically, they proposed a sparse network-regularized multiple NMF (SNMNMF) technique by introducing two image encoding matrices, namely H1 and H2 , which are k-by-n and k-by-m matrices for n mRNAs and m mRNAs and their latent k-dimensional representation, respectively. The objective function is the sum of reconstruction error, L2 -norm of W, H1 , H2 , and the correlation between the encoding matrices H with the known gene-gene and miRNA-gene interaction matrices. Notably, NMF is most stably solved by multiplicative updates or alternating least squares with non-negativity projection. Here, authors adopted the latter strategy by noting that the partial derivative of the objective function with respect to each matrix (W, H1 , H2 ) depends on the optimal solution from the other two matrices. Accordingly, the authors employed a two-stage heuristic approach, which nonetheless guarantees to converge to a local optimal: (1) update W fixing H1 , H2 (which are initialized randomly); and (2) update H1 , H2 fixing W, repeat (1) and (2) until convergence. To infer MiRMs, the authors exploited the k-by-m and k-by-n encoding matrices H1 and H2 and assigned the individual latent features to miRNAs or mRNAs by comparing the

corresponding feature score with the average scores across all k features. However, the NMF approach requires a predefined number of modules in order to perform the matrix factorization, which may be data-dependent and difficult to determine beforehand. Additionally, the NMF solution is often not unique, and the identified modules do not necessarily include both miRNAs and mRNAs, which make reproducing and interpreting the results difficult. Moreover, the SNMNMF does not enforce negative MMEC (miRNA–mRNA expression correlation), whereas the negative MMEC is necessary to ascertain the repressive function of the miRNAs on the mRNAs within the MiRMs.

PIMiM: Regression-Based Module Learning with Protein-Interaction Constraint and Group LASSO Le and Bar-Joseph described another regression-based model called PIMiM (Protein Interaction-based miRNA Modules) in their more recent work.88 Comparing with GroupMiR, PIMiM is simpler by having a fixed number of memberships, which needs to be determined beforehand—similar to above-mentioned SNMNMF in that regard. The continuous linear coefficients defined as the membership values indicate the strength of miRNA i or mRNA j belonging to module k, respectively, which are estimated via an ML approach under the constraints on the static miRNA-–mRNA and protein–protein interaction information obtained from the existing databases. Similar to the models described by Setty et al.60 (reviewed above), there were two variants of the PIMiM model, namely the cancer-specific model with L1 -norm (i.e., LASSO) and the joint cancer model with L1 /L2 group LASSO penalty and constraints on the overall sum of membership coefficients across cancer types. As a result, the group LASSO encourages consistent module assignments across cancer types. Because the objective functions for the single-cancer and joint-cancer are nonconvex, the authors used quasi-Newton method to iteratively alternate between the update of the module assignments and the update of model parameters until convergence. The initial parameters were randomly set, and the best-performing models were chosen in several random starts.

Mirsynergy: Detecting Synergistic miRNA-Regulatory Modules by Overlapping Neighbourhood Expansion In one of our recent works, we described a novel model called Mirsynergy that integrates m/miRNA

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

expression profiles, target site information, and gene–gene interactions (GGI) to form MiRMs, where an m/miRNA may participate in multiple MiRMs, and the module number is systematically determined given the predefined model parameters.33 Notably, standard clustering methods such as k-means or hierarchical clustering are not suitable for constructing MiRMs since these methods assign each data point to a unique cluster. The general solution for solving an overlapping clustering problem is NP-hard.89 In this regard, we formulate the construction of synergistic MiRMs as a greedy-based overlapping clustering problem90 with two main stages. Prior to the two clustering stages, we first inferred miRNA–mRNA interaction weights (MMIW) by regressing on mRNA using as predictors the miRNA expression data and target site information. At stage 1, we form each MiRM by greedily including boundary or excluding internal miRNAs to maximize the synergy, which is proportional to the correlation between miRNAs in terms of their MMIW, until no more node can be added or removed to improve. At stage 2, we fix the MiRM assignments and greedily add (remove) genes to (from) each MiRM to maximize the synergy score, which is defined as a function of the MMIW matrix and the gene–gene interaction weight. An advantage of our deterministic formalism over some of the previously discussed methods is the automatic determination of module number (given the predefined thresholds to merge and filter low quality clusters) and efficient computation. In practice, Mirsynergy works the best on a sparse MMIW matrix such as the outputs from LASSO, which was the best performing expression-based method based on our comparison with other alternatives. We also explored other model formulations such as clustering m/miRNAs in a single clustering stage or using different MMIW matrices other than the one produced from LASSO, which tends to produce MiRMs each containing only one or a few miRNAs or several very large low quality MiRMs, which were then filtered out by the density threshold in either clustering stage. Thus, the performance of Mirsynergy is sensitive to the quality of MMIW and GGIW. Nonetheless, other MMIW or GGIW matrices (generated from improved methods) can be easily incorporated into Mirsynergy as the function parameters by the users of the Bioconductor package.

CONCLUSIONS AND OUTLOOK In this review, we surveyed in a fair amount of details three main computational schemes on miRNA target

predictions, namely the widely used sequence-based methods, the more recent expression-based prediction frameworks, and network methods that are designed to identify MiRMs. Despite active research and considerable progress made in miRNA research, there still seems to be some disjoints between the computational side and biological side of the miRNA research community. In particular, currently there seems to be two main branches in miRNA research: (1) understanding the basic miRNA biology and general principle; and (2) exploiting the miRNA-regulatory dynamics as predictors/biomarkers to explain complex diseases such as cancers. Both require integrative model design by harnessing more than one data type. However, the convergence of the two directions is crucial but only vaguely conceivable due to seemingly different research interests. The former centres on characterizing the miRNA recognition elements usually aided with novel experimental methods or designs. In particular, the canonical model of seed match has been constantly challenged by the recent discoveries of ‘seedless’ match and the varying degree of regulatory effects of miRNAs having identical seed regions (i.e., miRNA family). Recent technological advancements such as PAR-CLIP, CLASH, single-cell RNA-seq, Poly(A)-position Profiling By Sequencing (3P-Seq) provided unprecedented opportunities to refine the canonical model by revisiting the governing rules of MRE.38,74,91–93 The research is usually driven by experimental biologist. The latter focuses more on characterizing the molecular abnormalities of a complex disease using high-throughput data measurements across hundreds or thousands of clinical samples, which is the focus this review. Data at such a large scale are usually generated by mature technological platforms such as RNA-seq and microarrays. Research in this direction is usually driven by computationally inclined researchers. Importantly, the accuracy of the computational modelling hinges on the basic understanding of miRNA biology and the biological insights on the diseases. However, due to the scale of such research and other confounding factors such as sample heterogeneity, researchers usually pre-assume the basic model of seed-match to mitigate the model complexity. Thus, it appears to be a productive venue if future projects could unify the two fields in order to elucidate novel miRNA biology meanwhile generating testable hypotheses in explaining the underlying mechanisms of a physiological condition of interest. To this end, more interdisciplinary collaborations need to be established ideally right from the start of the project design.

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

REFERENCES

2. Filipowicz W, Bhattacharyya SN, Sonenberg N. Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight? Nat Rev Genet 2008, 2008:102–114.

18. Vlachos IS, Paraskevopoulou MD, Karagkouni D, Georgakilas G, Vergoulis T, Kanellos I, Anastasopoulos I-L, Maniou S, Karathanou K, Kalfakakou D, et al. DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res 2015, 43(D1):D153–D159.

3. Kozomara A. Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 2014, 42:D68–D73.

19. Dai Y, Zhou X. Computational methods for the identification of microRNA targets. Open Access Bioinformatics 2010, 2:29–39.

4. Friedman RC, Farh KK-H, Burge CB, Bartel DP. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 2009, 19:92–105.

20. Sethupathy P, Megraw M, Hatzigeorgiou AG. A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 2006, 3:881–886.

1. Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell 2009, 136:215–233.

5. Song L, Tuan RS. MicroRNAs and cell differentiation in mammalian development. Birth Defects Res C Embryo Today 2006, 78:40–149. 6. Goljanek-Whysall K, Sweetman D, Münsterberg AE. microRNAs in skeletal muscle differentiation and disease. Clin Sci (Lond) 2012, 123:611–625. 7. Berardi E, Pues M, Thorrez L, Sampaolesi M. miRNAs in ESC differentiation. Am J Physiol Heart Circ Physiol 2012, 303:H931–H939. 8. El Gazzar M, McCall CE. MicroRNAs regulatory networks in myeloid lineage development and differentiation: regulators of the regulators. Immunol Cell Biol 2012, 90:587–593. 9. Baltimore D, Boldin MP, O’Connell RM, Rao DS, Taganov KD. MicroRNAs: new regulators of immune cell development and function. Nat Immunol 2008, 9:839–845. 10. Bi Y, Liu G, Yang R. MicroRNAs: novel regulators during the immune response. J Cell Physiol 2009, 218:467–472. 11. Harris A, Krams SM, Martinez OM. MicroRNAs as immune regulators: implications for transplantation. Am J Transplant 2010, 10:713–719. 12. Sullivan CS, Ganem D. MicroRNAs and viral infection. Mol Cell 2005, 20:3–7. 13. Umbach JL, Kramer MF, Jurak I, Karnowski HW, Coen DM, Cullen BR. MicroRNAs expressed by herpes simplex virus 1 during latent infection regulate viral mRNAs. Nature 2008, 454:780–783.

21. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, et al. MicroRNA expression profiles classify human cancers. Nature 2005, 435:834–838. 22. Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 2005, 433:769–773. 23. Sood P, Krek A, Zavolan M, Macino G, Rajewsky N. Cell-type-specific signatures of microRNAs on target mRNA expression. Proc Natl Acad Sci USA 2006, 103:2746–2751. 24. Deng N, Puetter A, Zhang K, Johnson K, Zhao Z, Taylor C, Flemington EK, Zhu D. Isoform-level microRNA-155 target prediction using RNA-seq. Nucleic Acids Res 2011, 39:e61. 25. Linsley PS, Schelter J, Burchard J, Kibukawa M, Martin MM, Bartz SR, Johnson JM, Cummins JM, Raymond CK, Dai H, et al. Transcripts targeted by the microRNA-16 family cooperatively regulate cell cycle progression. Mol Cell Biol 2007, 27:2240–2252. 26. Khan AA, Betel D, Miller ML, Sander C, Leslie CS, Marks DS. Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs. Nat Biotechnol 2009, 27:549–555.

14. McManus MT. MicroRNAs and cancer. Semin Cancer Biol 2003, 13:253–258.

27. Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, Hughes TR, Blencowe BJ, Frey BJ, Morris QD. Using expression profiling data to identify human microRNA targets. Nat Methods 2007, 4:1045–1049.

15. Yang N, Coukos G, Zhang L. MicroRNA epigenetic alterations in human cancer: one step forward in diagnosis and treatment. Int J Cancer 2008, 122:963–968.

28. Liu H, Yue D, Zhang L, Chen Y, Gao S-J, Huang Y. A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling. BMC Genomics 2010, 11(suppl 3):S12.

16. Meltzer PS. Cancer genomics: small RNAs with big impacts. Nature 2005, 435:745–746.

29. Su N, Wang Y, Qian M, Deng M. Predicting MicroRNA targets by integrating sequence and expression data in cancer. In: Proceedings of the 2011 IEEE International Conference on Systems Biology (ISB), Zhuhai, China, 2011, 219–224.

17. Hsu S-D, Tseng Y-T, Shrestha S, Lin Y-L, Khaleel A, Chou C-H, Chu C-F, Huang H-Y, Lin C-M, Ho S-Y, et al. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res 2014, 42:D78–D85.

30. Li J, Min R, Bonner A, Zhang Z. A probabilistic framework to improve microrna target prediction by

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

incorporating proteomics data. J Bioinform Comput Biol 2009, 7:955–972.

Nucleic Acids. Cambridge: Cambridge University Press; 1998.

31. Li Y, Goldenberg A, Wong K-C, Zhang Z. A probabilistic approach to explore human miRNA targetome by integrating miRNA-overexpression data and sequence information. Bioinformatics 2014, 30:621–628.

45. Bussemaker HJ, Li H, Siggia ED. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci USA 2000, 97:10096–10100.

32. Li Y, Liang C, Wong K-C, Jin K, Zhang Z. Inferring probabilistic miRNA-mRNA interaction signatures in cancers: a role-switch approach. Nucleic Acids Res 2014, 42:e76.

46. Sinha S, van Nimwegen E, Siggia ED. A probabilistic method to detect regulatory modules. Bioinformatics 2003, 19(suppl 1):i292–i301.

33. Li Y, Liang C, Wong K-C, Luo J, Zhang Z. Mirsynergy: detecting synergistic miRNA regulatory modules by overlapping neighbourhood expansion. Bioinformatics 2014, 30:2627–2635. 34. Lewis BP, Shih I-h, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of mammalian microRNA targets. Cell. 2003, 115:787–798. 35. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are MicroRNA targets. Cell 2005, 120:15–20. 36. Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 2007, 27:91–105. 37. Garcia DM, Baek D, Shin C, Bell GW, Grimson A, Bartel DP. Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol 2011, 18:1139–1146. 38. Nam J-W, Rissland OS, Koppstein D, Abreu-Goodger C, Jan CH, Agarwal V, Yildirim MA, Rodriguez A, Bartel DP. Global Analyses of the Effect of Different Cellular Contexts on MicroRNA Targeting. Mol Cell 2014, 53:1031–1043. 39. Nielsen CB, Shomron N, Sandberg R, Hornstein E, Kitzman J, Burge CB. Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA 2007, 13:1894–1910. 40. Kiriakidou M, Nelson PT, Kouranov A, Fitziev P, Bouyioukos C, Mourelatos Z, Hatzigeorgiou A. A combined computational-experimental approach predicts human microRNA targets. Genes Dev 2004, 18:1165–1178. 41. Enright AJ, John B, Gaul U, Tuschl T, Sander C. MicroRNA targets in Drosophila. Genome Biol 2004, 5:R1–R14. 42. John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. Human MicroRNA Targets. PLoS Biol 2004, 2:e363. 43. Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, et al. Combinatorial microRNA target predictions. Nat Genet 2005, 37:495–500. 44. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and

47. Rajewsky N, Vergassola M, Gaul U, Siggia ED. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinform 2002, 3:30–42. 48. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E. The role of site accessibility in microRNA target recognition. Nat Genet 2007, 39:1278–1284. 49. Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA Package 2.0. Algorithms Mol Biol 2011, 6:26–39. 50. Arvey A, Larsson E, Sander C, Leslie CS, Marks DS. Target mRNA abundance dilutes microRNA and siRNA activity. Mol Syst Biol 2010, 6:1–7. 51. van Dongen S, Abreu-Goodger C, Enright AJ. Detecting microRNA binding and siRNA off-target effects from expression data. Nat Methods 2008, 5:1023–1025. 52. Huang J, Morris Q, Frey B. Detecting MicroRNA Targets by Linking Sequence, microrna and Gene Expression Data. New York, NY: Springer; 2006. 53. Huang JC, Morris QD, Frey BJ. Bayesian inference of microRNA targets from sequence and expression data. J Comput Biol 2007, 14:550–563. 54. Huang JC, Frey BJ, Morris QD. Comparing sequence and expression for predicting microRNA targets using GenMiR3. In: Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing, 2008, 52–63. 55. Farh KKH. The widespread impact of mammalian microRNAs on mRNA repression and evolution. Science 2005, 310:1817–1821. 56. Guo H, Ingolia NT, Weissman JS, Bartel DP. Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 2010, 466:835–840. 57. Baek D, Villén J, Shin C, Camargo FD, Gygi SP. The impact of microRNAs on protein output. Nature 2008, 455:64–71. 58. Selbach M, Schwanhäusser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N. Widespread changes in protein synthesis induced by microRNAs. Nature 2008, 455:58–63. 59. Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with expression. Nat Genet 2001, 27:167–171. 60. Setty M, Helmy K, Khan AA, Silber J, Arvey A, Neezen F, Agius P, Huse JT, Holland EC, Leslie CS. Inferring transcriptional and microRNA-mediated regulatory programs in glioblastoma. Mol Syst Biol 2012, 8:1–16.

© 2015 John Wiley & Sons, Ltd.

WIREs RNA

Computational Biology in microRNA

61. Mezlini AM, Wang B, Deshwar A, Morris Q, Goldenberg A. Identifying cancer specific functionally relevant miRNAs from gene expression and miRNA-to-gene networks using regularized regression. PLoS One 2013, 8:e73168.

76. Tsang JS, Ebert MS, van Oudenaarden A. Genome-wide dissection of microRNA functionsand cotargeting networks using gene set signatures. Mol Cell 2010, 38:140–153.

62. Li Y, Liang M, Zhang Z. Regression analysis of combined gene expression regulation in acute myeloid leukemia. PLoS Comput Biol 2014, 10:e1003908.

77. Boross G, Orosz K, Farkas IJ. Human microRNAs co-silence in well-separated groups and have different predicted essentialities. Bioinformatics 2009, 25:1063–1069.

63. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34:D108–D110.

78. Xu J, Li CX, Li YS, Lv JY, Ma Y, Shao TT, Xu LD, Wang YY, Du L, Zhang YP, et al. MiRNA-miRNA synergistic network: construction via co-regulating functional modules and disease miRNA topological features. Nucleic Acids Res 2011, 39:825–836.

64. The Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. New Engl J Med 2013, 368:2059–2074.

79. Shalgi R, Lieber D, Oren M, Pilpel Y. Global and local architecture of the mammalian microRNAtranscription factor regulatory network. PLoS Comput Biol 2007, 3:e131.

65. Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan K-K, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from ENCODE data. Nature 2013, 488:91–100.

80. Joung J-G, Hwang K-B, Nam J-W, Kim S-J, Zhang B-T. Discovery of microRNA-mRNA modules via population-based probabilistic learning. Bioinformatics 2007, 23:1141–1147.

66. The ENCODE Project and Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2013, 488:57–74. 67. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001, 98:15149–15154. 68. Calin GA, Croce CM. MicroRNA signatures in human cancers. Nat Rev Cancer 2006, 6:857–866. 69. Spizzo R, Nicoloso MS, Croce CM, Calin GA. SnapShot: microRNAs in cancer. Cell 2009, 137:586–586.e581. 70. Jacobsen A, Silber J, Harinath G, Huse JT, Schultz N, Sander C. Analysis of microRNA-target interactions across diverse cancer types. Nat Struct Mol Biol 2013, 20:1325–1332. 71. Li Y, Zhang Z. Potential microRNA-mediated oncogenic intercellular communication revealed by pan-cancer analysis. Sci Rep 2014, 4:7097–7111. 72. Efron B. Large-scale simultaneous hypothesis testing. J Am Stat Assoc 2004, 99:96–104. 73. Chen X, Slack FJ, Zhao H. Joint analysis of expression profiles from multiple cancers improves the identification of microRNA–gene interactions. Bioinformatics 2013, 29:2137–2145. 74. Helwak A, Kudla G, Dudnakova T, Tollervey D. Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 2013, 153:654–665. 75. Papadopoulos GL, Alexiou P, Maragkakis M, Reczko M, Hatzigeorgiou AG. DIANA-mirPath: integrating human and mouse microRNAs in pathways. Bioinformatics 2009, 25:1991–1993.

81. Peng X, Li Y, Walters K-A, Rosenzweig ER, Lederer SL, Aicher LD, Proll S, Katze MG. Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers. BMC Genomics 2009, 10:373–392. 82. Le H-S, Bar-Joseph Z. Inferring interaction networks using the IBP applied to microrna target prediction. Adv Neural Inf Process Syst 2011, 24:235–243. 83. Griffiths T, Ghahramani Z. Infinite latent feature models and the Indian buffet process. In: NIPS, Whistler, BC, Canada, 2005, 475–482. Cambridge, MA: MIT Press. 84. Thibaux R, Jordan MI. Hierarchical beta processes and the Indian buffet process. In: Proccedings of the International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007, 564–571. 85. Doshi-Velez F, Ghahramani Z. Accelerated sampling for the Indian buffet process. In: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 2009, 273–280. New York, NY: ACM. 86. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401:788–791. 87. Zhang S. Li Q, Liu J. Zhou XJ: A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. 2011, 27:i401–i409. 88. Le H-S, Bar-Joseph Z. Integrating sequence, expression and interaction data to determine condition-specific miRNA regulation. Bioinformatics 2013, 29:i89–i97. 89. Barthélemy J-P. Brucker Fc: NP-hard approximation problems in overlapping clustering. J Classif 2001, 18:159–183.

© 2015 John Wiley & Sons, Ltd.

wires.wiley.com/rna

Advanced Review

90. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 2012, 9:471–472.

Unambiguous identification of miRNA: target site interactions by different types of ligation reactions. Mol Cell 2014, 54:1042–1054.

91. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano J, Manuel, Jungkamp A-C, Munschauer M, et al. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010, 141:129–141.

93. Mukherji S, Ebert MS, Zheng GXY, Tsang JS, Sharp PA, van Oudenaarden A. MicroRNAs can generate thresholds in target gene expression. Nat Genet 2011, 43:854–859.

92. Grosswendt S, Filipchyk A, Manzano M, Klironomos F, Schilling M, Herzog M, Gottwein E, Rajewsky N.

94. Lu Y, Zhou Y, Qu W, Deng M, Zhang C. A Lasso regression model for the construction of microRNAtarget regulatory networks. Bioinformatics 2011, 27:2406–2413.

© 2015 John Wiley & Sons, Ltd.