Ab initio human miRNA and pre-miRNA prediction.

J. Bioinform. Comput. Biol. 2013.11. Downloaded from www.worldscientific.com by KING MONGKUT'S UNIVERSITY OF TECHNOLOGY THONBURI on 10/13/14. For personal use only.

Journal of Bioinformatics and Computational Biology Vol. 11, No. 6 (2013) 1343009 (16 pages) # .c Imperial College Press DOI: 10.1142/S0219720013430099

AB INITIO HUMAN miRNA AND PRE-miRNA PREDICTION

IGOR I. TITOV*,†,‡ and PAVEL S. VOROZHEYKIN†,§ *Institute

of Cytology and Genetics, SB RAS, 10 Lavrentyev Avenue Novosibirsk 630090, Russian Federation †

Novosibirsk State University, 2 Pirogov Street Novosibirsk 630090, Russian Federation ‡[email protected] § [email protected] Received 5 July 2013 Revised 16 September 2013 Accepted 10 October 2013 Published 3 December 2013

MicroRNAs (miRNAs) are small single-stranded noncoding RNAs that play an important role in post-transcriptional regulation of gene expression. In this paper, we present a web server for ab initio prediction of the human miRNAs and their precursors. The prediction methods are based on the hidden Markov Models and the context-structural characteristics. By taking into account the identi¯ed patterns of primary and secondary structures of the pre-miRNAs, a new HMM model is proposed and the existing context-structural Markov model is modi¯ed. The evaluation of the method performance has shown that it can accurately predict novel human miRNAs. Comparing with the existing methods we demonstrate that our method has a higher prediction quality both for human pre-miRNAs and miRNAs. The models have also showed good results in the prediction of the mouse miRNAs. The web server is available at http:// wwwmgs.bionet.nsc.ru/mgs/programs/rnaanalys (mirror http://miRNA.at.nsu.ru). Keywords: miRNA; pre-miRNA; binding site; hidden Markov model; partition function; computer prediction; secondary structure.

1. Introduction MicroRNAs (miRNAs) are small single-stranded RNAs of 19–27 nucleotides in length, which bind to the mRNA by complementary interactions and lead to mRNA degradation or translation inhibition.1 In the miRNA biogenesis initially a primary transcript (pri-miRNA) is transcribed from miRNA gene, the pri-miRNA contains one or more hairpin structures of the miRNA precursors (pre-miRNA).1 Then, in the process of pre-miRNA maturation its hairpin is cleaved into one or two mature miRNAs,2 ¯nally, mature miRNA is involved in the translation regulation or mRNA degradation. Hairpin structure is a key property used in almost all ab initio methods for the prediction of miRNAs and their precursors. 1343009-1


I. I. Titov & P. S. Vorozheykin

The miRNAs and their precursors were found recently,3 but their number as the number of computational methods to ¯nd them is rapidly increasing. Cross-species conservation of the sequences and their secondary structures is typical for some identi¯ed miRNAs and their precursors.1 Therefore, the basic computational methods use sequence homology to the already known and experimentally veri¯ed data.4 This approach is further improved by taking into account the pre-miRNA secondary structures conservation and the properties of the precursor's primary and secondary structures (MiRscan,5 miRseeker,6 MiRAlign,7 Vmir,8 RFam,9 miRNAFold,10 and others). Several computational approaches have been developed for ab initio prediction of the pre-miRNAs and miRNAs. The methods based on the support vector machines and the decision trees classify miRNAs and their precursors by the set of primary and secondary structure characteristics: free hairpin energy, loops lengths, base composition, etc. (miR-abela,11 Triplet-SVM,12 RNAmicro,13 MiRFinder,14 miPred,15 mirCos,16 microPred,17 Virgo,18 MiRPara,19 maturePred,20 MiRmat,21 and others). Several probabilistic models based on the naive Bayesian classi¯er (BayesMiRNA¯nd,22 MatureBayes,23 and others), stochastic context-free grammars (CIDmiRNA24), hidden Markov models (ProMiR I/II,25,26 miRRim,27 SSCPro¯ler,28 FOMmiR,29 and others) were developed to predict miRNAs and pre-miRNAs. Some of these methods use the precursor sequence and its secondary structure statistics as well as the information about the mature miRNA position, as the most conservative precursor region. A large number of new miRNAs and their precursors for Homo sapiens, Caenorhabditis elegans, Mus musculus, Rattus norvegicus, and other organisms were predicted by these computational methods.30 In this work, the computational programs for ab initio miRNA and pre-miRNA predictions are developed, which are able to ¯nd as well close sequence homologs as the distant and novel sequences. They are based on hidden Markov models with context-structural characteristics of the stem-loop of the miRNA precursors. In this paper, we show that the implemented method for miRNA prediction can successfully ¯nd novel human miRNAs. Also we compare the prediction results of our programs for human pre-miRNAs and miRNAs and show that our methods outperform other approaches. The described programs are united into the web-server including additional tools which can be useful for miRNA analysis. The algorithm of one of them, the program for miRNA biding sites prediction, is similar to the HMM's dynamic programming algorithms: therefore we have brie°y described it too.

2. Methods 2.1. Technical description of system The methods were realized as a set of web-based applications for the miRNA analysis. The web-server is developed in HTML/JavaScript/Perl/Cþþ, it is available 1343009-2

Ab Initio Human miRNA and Pre-miRNA Prediction

at wwwmgs.bionet.nsc.ru/mgs/programs/rnaanalys/ (mirror miRNA.at.nsu.ru). The program GArna31 was used in the calculations of the RNA secondary structures and their energies.


2.2. Datasets Training and testing of our methods were carried out on 1,872 human pre-miRNAs from the miRBase (release 20.0),2 these pre-miRNAs contain 2,794 mature miRNAs almost equally from the 5 0 and 3 0 arms of the precursor. The pre-miRNA overprediction error was evaluated by a negative dataset which includes 1,872 random sequences from the pseudo-precursors created by Xue and co-authors.12 They have selected these sequences from the coding regions, each of them may form a hairpin structure. All hairpins satisfy the same as the known premiRNAs constraints on their secondary structure.12 To evaluate the novel miRNA prediction we used the test dataset miR19-20, which contains the new human miRNAs appeared in the version 20.0 of the miRBase. Training of the HMM was carried out on the known human miRNAs from the miRBase, release 19.0. To compare our program with the existing methods for predicting human premiRNAs we used the training and test datasets described by Tempel and co-authors in their paper.10 For HMM training the known human pre-miRNAs from the miRBase, release 17.0 were used and the testing dataset consisted of two genomic sequences arti¯cial and real ones. The ¯rst one, 30,500 nt in length, was obtained by concatenating the sequences of 100 known human pre-miRNAs (miRBase, release 17.0) and alternating them with human mRNAs. The 100 pre-miRNAs have lengths in the range of 63–110 nt and start in the sequence for every 300 nt. The second sequence is a \þ" chain of the human chromosome 19 from the position 54169933 to the position 54485651. This sequence contains the cluster of 50 pre-miRNAs. The sequence of the chromosome 19 and the mRNA fragments were taken from the human genome release 37.2 (http://ncbi.nlm.nih.gov/genbank/). To compare our method with the existing miRNA prediction methods we have used the following test and training datasets. Five-fold cross-validation on a sample of human pre-miRNAs from the miRBase (release 2.2) was carried out to compare our method with the ProMiR. The dataset consisted of 136 pre-miRNAs sequences with 81 mature miRNAs from 5 0 arm of the precursor and 55 mature miRNAs of its 3 0 arm. To compare with the MatureBayes we used known human and mouse premiRNA and miRNA sequences. The training dataset contained miRNAs and their precursors from the database miRBase, release 10.1, the test dataset included the miRNAs and their precursors that had been added to the miRBase in the 11–14th releases. These datasets are identical to the datasets that authors already used to evaluate the quality of prediction of the MatureBayes.23 To compare our method with MaturePred and MiRmat we used human and mouse pre-miRNAs and miRNAs (release 17.0 of the miRBase) as a training sample; 1343009-3



the test was performed on the human and mouse pre-miRNAs and miRNAs that had been added in 18.0 release of the miRBase. Test results for ProMiR, MatureBayes, MaturePred and MiRmat were taken from the paper of the MiRmat21; testing of these methods was carried out on the samples mentioned above. For the BayesMiRNA¯nd, mirCos, MiRPara and FOMmir comparison was not performed due to the lack of available source codes or web servers. 2.3. Pre-miRNA prediction To predict human pre-miRNA sequences we have constructed a context-structural Markov model. Pre-miRNAs are the nucleotide sequences which can form a stemloop secondary structure (Fig. 1(a)). By taking into account the secondary structure of the pre-miRNA its hairpin can be interpreted as a pairwise sequence in which each position has two characteristics: The symbol from the alphabet fA, C, G, U, g and the structural property (this position is located in the loop or in the helix). The sequence starts with the 5 0 end of the pre-RNA. The structure of the Markov model includes ¯ve states: A, C, G, U, and X and the one initial state S. The ¯nal state in the model is not de¯ned. States A, C, G, U correspond to the symbols of nucleotides, the state X corresponds to the symbol \". Inserting the symbol \" equalizes the number of symbols between the hairpin complementary strands containing bulge or asymmetrical internal loops (Figs. 1(a) and 1(b)). The model can generate two types of symbols for each state: T for symbols that have a complementary pair in the secondary structure in this state, and F for the symbols that do not have the complementary pair (Fig. 1(b)). Only the symbol \" is generated in the X state. The possible transitions between the model states are shown in Fig. 1(c). For the transition probabilities we de¯ne the path ¼ 1 2 L a sequence of observed states in which the sequence of the observations x ¼ x1 x2 xL is emitted. Let Tkm ¼ P ðlþ1 ¼ m j l ¼ kÞ be a probability of transition from state k to

Fig. 1. The pairwise representation of the pre-miRNA's stem-loop, where the state of each pair includes the information about sequence and structure of the precursor. (a) The hairpin structure of the pre-miRNA hsa-let-7a-1 with two mature miRNAs hsa-let-7a-5p and hsa-let-7a-3p. (b) The pairwise sequence corresponding to the pre-miRNA's hairpin hsa-let-7a-1. (c) The structure of the Markov model. 1343009-4



state m, where l state l in the path . In particular T0m is the probability of transition from the initial state S into the state 1 ¼ m. Let El ðbÞ ¼ P ðxl ¼ b j l ¼ kÞ be the emission probability of the symbol b in the state l. The probabilistic model is de¯ned by the parameter set fT ; Eg, where T ¼ fTkm g is a matrix which consists of transition probabilities between states and E ¼ fEl ðbÞg is a matrix which consists of emission probabilities. Using the information about human pre-miRNAs the parameters fT ; Eg of model are evaluated by the method developed by Baum et al.32 Let the model ¼ fT ; Eg emits the sequence of observations x ¼ x1 x2 xL in the corresponding states of the model ¼ 1 2 L . Using the probabilities T and E one can calculate the total probability P ðxÞ of the sequence of observations in a Q given state of the model : P ðxÞ ¼ Ln¼1 Tn1 n En ðxn Þ, where L is the length of the sequence x, and 0 ¼ S is the model start state. The value of probability P ðxÞ decreases exponentially while increasing the sequence length L. Therefore, since the lengths of the pre-miRNAs are up to 150 nt, in the calculations we used the normalized value of the length L: log P ðxÞ K ¼ 10 ¼ L

PL

n¼1 ðlog10 ðEn ðxn ÞÞ

L

þ log10 ðTn1 n ÞÞ

:

ð1Þ

This structure of the model takes into account the statistical properties of both the primary and secondary structures of the precursor. Note that our model is not the same as the HMM from the program ProMiR.25 In particular, the number of parameters in our model is almost seven times smaller, 38 against 256. Reducing the number of the parameters allows us to avoid a de¯cit of learning sequences observed in ProMiR where certain parameters of HMM take the zero values or the small ones. As a result the ProMiR model is not capable to generate certain sequences or probability of generation is too low for classi¯cation. Also the structure of our model, unlike the ProMiR's model, ignores miRNA location in its precursor. Search for the pre-miRNAs in genomic sequence starts from extracting the subsequences that are capable to form the hairpin and to satisfy the conditions on length, on the G þ C nucleotide content and on the value of e-score and energy (see the web site for details). Since the program for the secondary structure calculation, GArna,31 is based on a stochastic algorithm we build three versions of hairpin structures with di®erent initial randomization parameters for each sliding window of the length L ¼ 120 nt and the step of 10 nt. To force hairpin formation within each window, the algorithm GArna was modi¯ed by adding the multi-loop penalties and by enforcing sequence ends binding. Then between three GArna outputs the structure with the lowest energy is selected as a pre-miRNA candidate. The primary and secondary structures of the candidate uniquely determine the sequence of the observations x ¼ x1 x2 x120 and the HMM's path of the observation states ¼ 1 2 120 . Finally the candidate is classi¯ed as pre-miRNA, if the value K (Eq. (1)) for it is less than a predetermined threshold value. The value of the threshold could be chosen basing on the prediction quality evaluation (Fig. 3). 1343009-5



2.4. MiRNA prediction Our ab initio miRNA prediction method is based on the paired HMM of the program ProMiR.25 ProMiR predicts miRNA/miRNA* duplex and de¯nes one mature miRNA from two duplex sequences. In their work Nam and colleagues interpreted a pre-miRNA hairpin (Fig. 2(a)) as a pairwise sequence from the hairpin beginning to its end. HMM contains four structural states M (helix), N (inner loop), I (insertion), and D (deletion), and two hidden states m (the precursor position within the miRNA) and n (the position outside the miRNA). Let us denote HMM as ¼ fT ; Eg and the transposed matrix of transition probabilities between states as T 0 . For each position of the pairwise sequence the following values were computed: SðiÞ ¼

Pm ði 1ÞTmn ; Pm ði 1ÞTmn þ Pn ði 1ÞTnn

0 Pm ði 1ÞT mn S 0 ðiÞ ¼ 0 0 : Pm ði 1ÞT mn þ Pn ði 1ÞT nn

ð2Þ

Pm ðiÞ and Pn ðiÞ are the probabilities of the position i to be inside or outside of miRNA region, respectively. The values of these probabilities are computed iteratively through the parameters of the HMM. Tmn is the transition probability between hidden states m and n, i.e. the probability to pass from a miRNA region to outside, and Tnn is the transition probability between the hidden states n and n, i.e. the probability of the model to pass from outside to outside of miRNA region. By calculating the S and S 0 values one can determine the positions of both boundaries of miRNA. We identify the Dicer site ¯rst ¯nding it immediately before the maximum of the SðiÞ. Then we ¯nd the Drosha site immediately after the maximum of the S 0 ðiÞ within the distances 22–26 nt from the Dicer site. The main but not all di®erences between our approach and the work of Nam25 are as follows. First, we use not one, but two HMMs, each one for the 3 0 and 5 0

Fig. 2. The pairwise representation of the pre-miRNA's stem-loop, where the each pair includes the information about the precursor's hairpin and the mature miRNA region. (a) The hairpin structure of the pre-miRNA hsa-let-7a-1 with two mature miRNAs hsa-let-7a-5p and hsa-let-7a-3p. (b) The pairwise sequence corresponding to the 5 0 miRNA into the pre-miRNA's hairpin hsa-let-7a-1. (c) The structure of the hidden Markov model. 1343009-6


miRNAs. Second, we predict the miRNA/miRNA duplex regarding each miRNA as \mature." Third, the prediction yields the optimal and suboptimal pairs of miRNA corresponding to the global and to the following global maxima of the function SðiÞ.


2.5. Additional tools for miRNA analysis The described web-server includes the additional tools which can be useful for miRNA analysis: Searching for the miRNA homologs and predicting the miRNA binding sites. One of these programs is based on the physicochemical model which takes into account the mRNA binding competition between miRNAs. Together with the HMM programs presented above, it exploits the dynamic programming algorithm: therefore we will describe it below. Let us consider a miRNA-mRNA duplex and de¯ne a statistical weight of its con¯guration at position k of the mRNA sequence S. This statistical weight Wk ðSÞ is calculated through the energy Ek of the miRNA/mRNA duplex and the temperature T : Ek

Wk ðsÞ ¼ e T :

ð3Þ

Let N to be the length of mRNA sequence S, m is the length of miRNA sequence and relates to the miRNA concentration value near the mRNA. Note that for the small values of (here 1–3) this approach resembles the search for thermodynamically most stable miRNA/mRNA complexes. The probability of miRNA-mRNA duplex starting from the position i in the mRNA sequence is the sum of the statistical weights of all con¯gurations in which miRNA binds mRNA starting from the position i, divided by the sum of all statistical weights. These sums are computed in three steps using a dynamic programming algorithm. \Forward step". The calculation of F1 ; . . . ; Fi ; . . . ; FN which are the sums of all statistical weights of the mRNA subsequences S1 ; . . . ; Si ; . . . ; SN , where Si is a sequence with the length i, SN is the whole mRNA sequence. F0 ¼ 1; Fi ¼ Fi1 ; for 1 i m; Fi ¼ Fi1 þ Fim1 P ðSim Þ; for i m þ 1; P ðSim Þ ¼ q Wim ðSÞ, q ¼ 10 6 is the normalization factor against the arithmetic over°ow. \Backward step". The calculation of R1 ; . . . ; Ri ; . . . ; RN which are the sums of 0 all statistical weights of mRNA subsequences S 10 ; . . . ; S i0 ; . . . ; S N , where S i0 is a 0 sequence of length i, S 1 is the whole mRNA sequence. RN ¼ 1; Ri ¼ Riþ1 ; for i N m þ 1; Ri ¼ Riþ1 þ Riþmþ1 P ðS i0 Þ; for 1 i m; P ðS i0 Þ ¼ q Wi ðS 0 Þ, q ¼ 10 6 is the normalization factor against the arithmetic over°ow. 1343009-7


\Final step". The calculation of P ðiÞ, which is the probability of the miRNA– mRNA duplex starting from the position i of mRNA:


P ðiÞ ¼

Fi1 q Wi ðSÞ Riþmþ1 ; R1

for i N m 1:

ð4Þ

The numerator is the sum of all statistical weights, where the miRNA–mRNA duplex starts from the position i of mRNA and denominator is a sum of all statistical weights. One can calculate the average miRNA occupancy at the position i as: Psum ðiÞ ¼

X

P ði kÞ;

for i N m 1:

ð5Þ

k¼0::m1

The program displays the graph of binding probability Psum ðiÞ at the each position i of the target mRNA sequence.

3. Results 3.1. Evaluation of the human pre-miRNA prediction To evaluate the prediction quality of our method we have performed the 5-fold crossvalidation test on the datasets of human pre-miRNAs and pseudo pre-miRNAs described above. The threshold for the hidden Markov model (Eq. (1)) is used to classify the precursor candidates. For test datasets at each threshold value we have calculated false positives and false negatives and plotted a graph of their dependence (Fig. 3). We have chosen the default threshold value 1.93, corresponding to false positives of 0.11 and false negatives of 0.11. The prediction accuracy (the ratio of sum of true positives and true negatives outcomes to all outcomes) was 89%.

Fig. 3. Evaluation of the human pre-miRNAs prediction over the 5-fold cross-validation. The curve is de¯ned as a plot of test false negative as the y-coordinate, versus the false positive as the x-coordinate. 1343009-8



3.2. Evaluation of the human miRNA prediction To evaluate the human miRNA prediction we have used the datasets described above. One dataset is used for the 5-fold cross-validation test and the second for the evaluating the prediction of the novel miRNAs. As a quality assessment, we have calculated the cumulative distribution function of the distances between the predicted and real 5 0 ends of the miRNAs (Tables 1 and 2). As can be seen from Tables 1 and 2 the method demonstrates a good prediction quality almost equally for the both datasets. The best of the optimal and sub-optimal miRNA pairs (the average position deviation of 1.61 nt from the real miRNAs, Table 1) yields much better predictions than the only optimal miRNA pair (3.13 nt, Table 1). At the same time the majority of the predicted miRNAs (71% and 90%, Table 1) are not far than 3 nt from the real ones. This signi¯cant di®erence between the average and the median of the distribution suggests that the cumulative distribution function could be more helpful for the miRNA prediction evaluation than the average position deviation. 3.3. Comparison of the methods for pre-miRNA prediction For comparing we used those programs with the ab initio prediction of pre-miRNAs in genomic sequences, which have an accessible web server for computing on large (more than 5000 nt) sequences or the available source code. Comparison is

Table 1. Evaluation of the human miRNAs prediction over the 5-fold cross validation. Table illustrates the percentages of predicted miRNAs that are located within 0–3 nucleotides from the start of the real mature miRNAs for the optimal miRNA pair and the best of the optimal and sub-optimal miRNA pairs. E (nt) is the average position deviation.

Our method, optimal miRNA pair Our method, the best of the optimal and sub-optimal miRNA pairs

E (nt)

0

1

2

3

3.13 1.61

19.50 27.19

40.79 60.47

56.53 78.71

71.20 89.98

Table 2. Evaluation of the human miRNAs prediction over the miR19-20 testing dataset. Table illustrates the percentages of predicted miRNAs that are located within 0–3 nucleotides from the start of the real mature miRNAs for the optimal miRNA pair and the best of the optimal and sub-optimal miRNA pairs. E (nt) is the average position deviation.

Our method, optimal miRNA pair Our method, the best of the optimal and sub-optimal miRNA pairs

E (nt)

0

1

2

3

3.18 1.81

18.52 27.51

43.56 59.79

59.79 77.43

74.96 89.24

1343009-9



made for the methods miRNAFold,10 mirPara,19 CID-miRNA,24 SSCPro¯ler,28 and Vmir.8 Testing was carried out on two classes of sequences used by the authors of the arti¯cial and the genomic ones. For the evaluation of our miRNAFold10 method we have chosen the 1.97 value of the HMM threshold and 15.0 and 0 kcal/ mol of the energy threshold. Following the miRNAFold's authors we consider only that pre-miRNA to be correctly predicted the hairpin loop center of which is closer than 10% of the hairpin size to the centers of a known pre-miRNA.10 Measures of sensitivity and selectivity were used for comparing the programs. The sensitivity shows the ability of the method to detect novel pre-miRNAs, the selectivity represents the probability that the predicted candidate is a real pre-miRNA. sensitivity ¼ 100 TP=ðTP þ FNÞ;

selectivity ¼ 100 TP=ðTP þ FPÞ:

ð6Þ

TP is the number of correctly predicted pre-miRNAs, FN is the number of unpredicted pre-miRNAs, FP is the number of the false positive. TP + FN is equal to 100 for arti¯cial sequences and to 50 for the genomic sequence. The results of testing the methods are presented in the Table 3. Table 3 shows that our method demonstrates higher sensitivity and selectivity than miRNAFold, MiRPara, CID-miRNA, and Vmir on both test sequences even without the energy constraint. The additional energy restriction further improves the quality of our method. At its default parameters the SSCPro¯ler is superior to all methods of the Table 3 by the selectivity on the real genomic sequence; however it shows a low sensitivity value.

Table 3. Comparison of the methods for the pre-miRNAs prediction on the arti¯cial and real genome sequences. Prediction results for miRNAFold, CID-miRNA, miRPara, and Vmir were taken from the miRNAFold's paper.10 The interface of the SSCPro¯ler web server does not provide the option to search the pre-miRNAs in a random nucleotide sequence; it was tested with the default parameters. Methods are ordered by the publication date.

Arti¯cial sequence

Real genomic sequence

Method

Sensitivity

Selectivity

Our method, the HMM's threshold is 1.97, energy 15 kcal/mol Our method, the HMM's threshold is 1.97 miRNAFold miRPara CID-miRNA Vmir

97

39.34

97 97 97 97 28

25.39 19.17 9.70 11.72 1.32

100

1.43

100 100 90 12 38 100

0.99 0.89 0.93 5.83 0.69 0.56

Our method, the HMM's threshold is 1.97, energy 15 kcal/mol Our method, the HMM's threshold is 1.97 miRNAFold miRPara SSCPro¯ler CID-miRNA Vmir

1343009-10


3.4. Comparison of the methods for miRNA prediction


To compare our method with the existing ones we used the test and learning samples of pre-miRNAs and miRNAs, described in the paper.21 The comparison is made with the programs ProMiR, MatureBayes, MaturePred and MiRmat; comparison with the other methods has not been carried out due to the lack of available source codes or computational web servers. 3.4.1. Comparison with ProMiR The comparison of our method with the ProMiR was carried out on a dataset of the human miRNAs from the database miRBase, release 2.2. Mean and standard deviation (SD) of the predicted miRNAs boundaries from the boundaries of the real miRNAs were estimated. The comparison results are shown in Table 4. In general, on this early miRBase release our method has showed (Table 4) higher prediction accuracy than the ProMiR for both sites, especially for 3 0 precursor's strand. 3.4.2. Comparison with MatureBayes, MaturePred and MiRmat The comparison of our method with MatureBayes, MaturePred and MiRmat programs was conducted on the sample of human and mouse pre-miRNAs. As a quality measurement we have used the commonly used cumulative distribution function of the distances between the predicted 5 0 ends of the miRNAs and the real ones. Actually, since the 5 0 region controls the miRNA site recognition,33–35 the correct prediction of the 5 0 end of the miRNA is more important for the further target prediction than of 3 0 end. Incorrect prediction of the 5 0 end of the miRNA can lead to misidenti¯cation of the miRNA functions.36 MatureBayes and MaturePred set the 3 0 end of the miRNA at a ¯xed distance (22 nt) from the predicted 5 0 end, while our method and the ProMiR predict the 3 0 end of the miRNA separately, so the length of miRNA varies. The comparison results are shown in Tables 5 and 6. By the numbers of the exactly predicted 5 0 ends of the miRNAs our method is only marginally better than MaturePred and MiRmat (Table 5). However, for other deviations our method demonstrates the stable better results. Also it outperforms

Table 4. Performance comparison between our method and the ProMiR. Two parameters were measured: the mean of the absolute distance between the predicted and true sites and the standard deviation (SD). 5 0 strand Drosha site

3 0 strand

Dicer site

Drosha site

Dicer site

Method

mean

SD

mean

SD

mean

SD

mean

SD

Our method, optimal miRNA pair Our method, the best of the optimal and sub-optimal miRNA pairs ProMiR

1.94 1.67

2.52 1.58

2.22 1.33

2.66 1.63

1.75 1.44

2.00 1.64

1.19 0.50

1.80 0.87

1.96

2.56

2.47

3.26

2.13

2.70

1.60

2.14

1343009-11

I. I. Titov & P. S. Vorozheykin Table 5. Performance comparison between our method, MiRmat and MaturePred. The distance from the truth is the absolute distance between predicted start position of mature miRNA and that of actual mature miRNA.


Distance from the truth Our method, optimal miRNA Our method, the best of the optimal and sub-optimal miRNA pairs MiRmat MaturePred

0

1

2

3

23.59 38.97 23.03 21.21

55.38 74.87 45.45 36.36

71.79 85.13 59.39 48.48

80.51 89.74 69.70 57.57

Table 6. Performance comparison between our method and MatureBayes. The distance from the truth is the absolute distance between predicted start position of mature miRNA and that of actual mature miRNA. Distance from the truth Our method, optimal miRNA Our method, the best of the optimal and sub-optimal miRNA pairs MatureBayes

0

1

2

3

24.47 33.98 12.75

49.21 66.85 34.20

65.47 82.08 45.51

74.84 88.22 59.13

MatureBayes for all deviations (Table 6). The ability to choose the best of the optimal and sub-optimal miRNA pairs signi¯cantly improves the prediction (Tables 5 and 6). 4. Discussion The precursor's secondary structure plays a key role in the miRNA biogenesis. For the transportation the pre-miRNA from nucleus into cytoplasm the 2-nucleotide 3 0 overhang structure and the double-stranded stem of the pre-miRNA are recognized by the exporting complex Exp-5.37 The hairpin is required for the cutting miRNA/ miRNA* duplex by Dicer and Drosha complexes.1,38 Therefore, no wonder that the coarse-grained secondary structure, rather than the pre-miRNA sequence, has a great impact on the miRNA processing.39,40 Considering a pre-miRNA hairpin in more depth it's stem is very often imperfect, thus the regularities of the loops locations could be important in for the miRNA maturation.36,41,43 Indeed, we observe the loops increasing near the boundaries of the miRNA: This property is implicitly utilized by many miRNA prediction methods.16,19–21,23,29 It is interesting that the loops frequency has also a peak within the miRNA (Fig. 4). Clearly the loops of such types could a®ect the miRNA maturation. First, the bulges and the asymmetrical internal loops presumably36,42 were the major source of the miRNA length diversity and of the shifts between the miRNA's boundaries in the homologous precursors. Then, the hairpin structure in the central region of the miRNA/miRNA* duplex and on the miRNAs boundaries may exert on the selection of the pre-miRNA's functional strand for some species, as it was previously suggested for the Drosophila miRNAs.41,43 Therefore, the precise calculation of pre-miRNA secondary structure is crucial for miRNA and pre-miRNA prediction. 1343009-12



(a)

(b)

Fig. 4. Loops distributions near the boundaries of the human miRNAs from miRBase. (a) The miRNA sequences were aligned by the boundary closest to the nonlooped hairpin side. (b) The miRNA sequences were aligned by the boundary closest to the precursor's hairpin loop. Negative numbers correspond to the positions outside of the miRNA region; the numbers show the distance to the aligned boundary.

5. Conclusion In this paper, we present the web application for ab initio prediction of the human miRNAs and their precursors by using the statistical classi¯ers, namely, the hidden Markov models, the information about the primary and secondary structures of premiRNAs and the identi¯ed patterns of miRNA precursor location. The implemented methods allow us to ¯nd the pre-miRNAs with good accuracy in an arbitrarily given nucleotide sequence, as well as to predict a candidate of miRNA/miRNA* duplex. Also our web server allows ¯nding the close miRNA homologs and the miRNA binding sites: The more detailed information is available at the web-server. Acknowledgments The work was ¯nancially supported by Russian Ministry of education and science (Project 8740 and NSh-5278.2012.4) and by RAS Presidium (Program #6 \Molecular and Cell Biology", Project 6.6). References 1. Bartel DP, MicroRNAs: Genomics, biogenesis, mechanism, and function, Cell 116:281– 297, 2004. 2. Gri±ths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ, MiRBase: microRNA sequences, targets and gene nomenclature, Nucleic Acids Res 34:D140–D144, 2006. 3. Lee RC, Feinbaum RL, Ambros V, The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14, Cell 75(5):843–854, 1993. 4. Weber MJ, New human and mouse microRNA genes found by homology search, FEBS J 272(1):59–73, 2005. 5. Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP, The microRNAs of Caenorhabditis elegans, Genes Dev 17(8):991–1008, 2003. 1343009-13



6. Lai EC, Tomancak P, Williams RW, Rubin GM, Computational identi¯cation of Drosophila microRNA genes, Genome Biol 4:R42, 2003. 7. Wang X, Zhang J, Li F, Gu J, He T, Zhang X, Li Y, MicroRNA identi¯cation based on sequence and structure alignment, Bioinformatics 21(18):3610–3614, 2005. 8. Grundho® A, Sullivan CS, Ganem D, A combined computational and microarray-based approach identi¯es novel microRNAs encoded by human gamma-herpesviruses, RNA 12(5):733–750, 2006. 9. Eddy SR, A new generation of homology search tools based on probabilistic inference, Genome Inform 23(1):205–211, 2009. 10. Tempel S, Tahi F, A fast ab-initio method for predicting miRNA precursors in genomes, Nucleic Acids Res 40(11):e80, 2012. 11. Sewer A, Paul N, Landgraf P, Aravin A, Pfe®er S, Brownstein MJ, Tuschl T, van Nimwegen E, Zavolan M, Identi¯cation of clustered microRNAs using an ab initio prediction method, BMC Bioinformatics 6:267, 2005. 12. Xue C, Li F, He T, Liu GP, Li Y, Zhang X, Classi¯cation of real and pseudo microRNA precursors using local structure-sequence features and support vector machine, BMC Bioinformatics 6:310, 2005. 13. Hertel J, Stadler PF, Hairpins in a Haystack: Recognizing microRNA precursors in comparative genomics data, Bioinformatics 22(14):e197–e202, 2006. 14. Huang TH, Fan B, Rothschild MF, Hu ZL, Li K, Zhao SH, MiRFinder: An improved approach and software implementation for genome-wide fast microRNA precursor scans, BMC Bioinformatics 8:341, 2007. 15. Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z, MiPred: Classi¯cation of real and pseudo microRNA precursors using random forest prediction model with combined features, Nucleic Acids Res 35(Web Server issue):W339–W344, 2007. 16. Sheng Y, Engstr€om PG, Lenhard B, Mammalian microRNA prediction through a support vector machine model of sequence and structure, PLoS One 2(9):e946, 2007. 17. Batuwita R, Palade V, microPred: E®ective classi¯cation of pre-miRNAs for human miRNA gene prediction, Bioinformatics 25(8):989–995, 2009. 18. Kumar S, Ansari FA, Scaria V, Prediction of viral microRNA precursors based on human microRNA precursor sequence and structural features, Virol J 6:129, 2009. 19. Wu Y, Wei B, Liu H, Li T, Rayner S, MiRPara: A SVM-based software tool for prediction of most probable microRNA coding regions in genome scale sequences, BMC Bioinformatics 12:107, 2011. 20. Xuan P, Guo M, Huang Y, Li W, Huang Y, MaturePred: E±cient identi¯cation of microRNAs within novel plant pre-miRNAs, PLoS One 6(11):e27422, 2011. 21. He C, Li YX, Zhang G, Gu Z, Yang R, Li J, Lu ZJ, Zhou ZH, Zhang C, Wang J, MiRmat: Mature microRNA sequence prediction, PLoS One 7(12):e51673, 2012. 22. Yousef M, Nebozhyn M, Shatkay H, Kanterakis S, Showe LC, Showe MK, Combining Multi-species genomic data for microRNA identi¯cation using a Naive Bayes classi¯er, Bioinformatics 22(11):1325–1334, 2006. 23. Gkirtzou K, Tsamardinos I, Tsakalides P, Poirazi P, MatureBayes: A probabilistic algorithm for identifying the mature miRNA within novel precursors, PLoS One 5(8): e11843, 2010. 24. Tyagi S, Vaz C, Gupta V, Bhatia R, Maheshwari S, Srinivasan A, Bhattacharya A, CIDmiRNA: A web server for prediction of novel miRNA precursors in human genome, Biochem Biophys Res Commun 372(4):831–834, 2008. 25. Nam JW, Shin KR, Han J, Lee Y, Kim VN, Zhang BT, Human microRNA prediction through a probabilistic co-learning model of sequence and structure, Nucleic Acids Res 33(11):3570–3581, 2005.

1343009-14



26. Nam JW, Kim J, Kim SK, Zhang BT, ProMiR II: A web server for the probabilistic prediction of clustered, nonclustered, conserved and nonconserved microRNAs, Nucleic Acids Res 34(Web Server issue):W455–W458, 2006. 27. Terai G, Komori T, Asai K, Kin T, miRRim: A novel system to ¯nd conserved miRNAs with high sensitivity and speci¯city, RNA 13(12):2081–2090, 2007. 28. Oulas A, Boutla A, Gkirtzou K, Reczko M, Kalantidis K, Poirazi P, Prediction of novel microRNA genes in cancer-associated genomic regions a combined computational and experimental approach, Nucleic Acids Res 37(10):3276–3287, 2009. 29. Shen W, Chen M, Wei G, Li Y, MicroRNA prediction using a ¯xed-order Markov model based on the secondary structure pattern, PLoS One 7(10):e48236, 2012. 30. Kozomara A, Gri±ths-Jones S, miRBase: Integrating microRNA annotation and deepsequencing data, Nucleic Acids Res 39(Database issue):D152–D157, 2011. 31. Titov II, Vorobiev DG, Ivanisenko VA, Kolchanov NA, A fast genetic algorithm for RNA secondary structure analysis, Russian Chemical Bulletin 51(7):1135–1144, 2002. 32. Rabiner LR, A tutorial on Hidden Markov Models and selected applications in speech recognition, IEEE 77(2):257–286, 1989. 33. Brennecke J, Stark A, Russell RB, Cohen SM, Principles of microRNA-target recognition, PLoS Biol 3(3):e85, 2005. 34. Maziere P, Enright AJ, Prediction of microRNA targets, Drug Discov Today 12(11– 12):452–458, 2007. 35. Bartel DP, MicroRNAs: Target recognition and regulatory functions, Cell 136(2):215– 233, 2009. 36. Starega-Roslan J, Krol J, Koscianska E, Kozlowski P, Szlachcic WJ, Sobczak K and Krzyzosiak WJ, Structural basis of microRNA length variety, Nucleic Acids Res 39(1):257–268, 2011. 37. Okada C, Yamashita E, Lee SJ, Shibata S, Katahira J, Nakagawa A, Yoneda Y, Tsukihara T, A high-resolution structure of the pre-microRNA nuclear export machinery, Science 326(5957):1275–1279, 2009. 38. Xia H, Li F, He T, Li Y, Distribution of mature microRNA on its precursor: A new character for microRNA prediction, Int J Inform Technol 11:1–8, 2005. 39. Krol J, Sobczak K, Wilczynska U, Drath M, Jasinska A, Kaczynska D, Krzyzosiak WJ, Structural features of microRNA (miRNA) precursors and their relevance to miRNA biogenesis and small interfering RNA/short hairpin RNA design, J Biol Chem 279(40): 42230–42239, 2004. 40. Kozlowski P et al., Structures of microRNA precursors, Current Perspectives in microRNAs (miRNA), Springer, Netherlands, pp. 1–16, 2008. 41. Ghildiyal M, Xu J, Seitz H, Weng Z, Zamore PD, Sorting of Drosophila small silencing RNAs partitions microRNA* strands into the RNA interference pathway, RNA 16(1): 43–56, 2010. 42. Starega-Roslan J, Koscianska E, Kozlowski P, Krzyzosiak WJ, The role of the precursor structure in the biogenesis of microRNA, Cell Mol Life Sci 68(17):2859–2871, 2011. 43. Okamura K, Liu N, Lai EC, Distinct mechanisms for microRNA strand selection by Drosophila Argonautes, Mol Cell 36(3):431–444, 2009.

1343009-15



Igor Titov received his M.Sc. degree in Physics from the Novosibirsk State University and Ph.D. in Solid State Physics from the Institute of Semiconductor Physics, Novosibirsk. He worked at the Institute of Solid State Chemistry, Novosibirsk and as a visiting scientist at the Institute of Physiology and Pathophysiology, Mainz, CNRITB, Milan and RIKEN, Tsukuba. Now he is a senior researcher at the Institute of Cytology and Genetics, Novosibirsk. He is also teaching students at the Informational Biology Department of the Novosibirsk State University. His main research interests are in bioinformatics and complex systems.

Pavel Vorozheykin received his B.Sc. and M.Sc. degrees in Applied Mathematics and Information Science (Faculty of Mechanics and Mathematics) from the Novosibirsk State University, Russian Federation. Currently he is an assistant of the Informatics Department at the High college of Informatics of the Novosibirsk State University. His main research interests are in bioinformatics and mathematical modelling.

1343009-16

Mirnacle: machine learning with SMOTE and random forest for improving selectivity in pre-miRNA ab initio prediction.

Revealing Missing Human Protein Isoforms Based on Ab Initio Prediction, RNA-seq and Proteomics.

Nonmetallic substrates for growth of silicene: an ab initio prediction.

Ab initio RNA folding.

IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy.

Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction.

Integration of QUARK and I-TASSER for Ab Initio Protein Structure Prediction in CASP11.

Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction.

AIDA: ab initio domain assembly for automated multi-domain protein structure prediction and domain-domain interaction prediction.

A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction.

Prediction of possible CaMnO3 modifications using an ab initio minimization data-mining approach.

AIDA: ab initio domain assembly server.

Ab-initio SCF investigation ofγ-aminobutyric acid.

A new era for ab initio molecular crystal lattice energy prediction.

General computational algorithms for ab initio crystal structure prediction for organic molecules.

A Parallel Framework for Multipoint Spiral Search in ab Initio Protein Structure Prediction.

Ab initio non-relativistic spin dynamics.

Ab Initio Screening of CO2-philic Groups.

Ab initio H2O in realistic hydrophilic confinement.

Ab Initio Prediction and Characterization of Mo2C Monolayer as Anodes for Lithium-Ion and Sodium-Ion Batteries.

A fragmentation and reassembly method for ab initio phasing.

Ab initio molecular crystal structures, spectra, and phase diagrams.

Macromolecular ab initio phasing enforcing secondary and tertiary structure.

Ab initio water pair potential with flexible monomers.