k./ 1992 Oxford University Press
Nucleic Acids Research, Vol. 20, No. 11 2871-2875
WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences Graziano Pesole, Nicola Prunella1, Sabino Liuni1, Marcella Attimonelli and Cecilia Saccone Dipartimento di Biochimica e Biologia Molecolare, Universita' di Bari, Bari 70125 and 1Centro di Studio sui Mitocondri e Metabolismo Energentico, CNR, Bari, Italy Received December 26, 1991; Revised and Accepted April 30, 1992
ABSTRACT We present here a fast and sensitive method designed to isolate short nucleotide sequences which have nonrandom statistical properties and may thus be biologically active. It is based on a first order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition we have found other statistically significant motifs, the biological roles of which are yet to be clarified.
INTRODUCTION The exponential growth of biosequence databases will become a very valuable tool for molecular biologists only if combined with parallel development of theoretical studies aimed at decoding the genetic information contained in the sequences. The detection of short sequence signals and the characterization of their functional role is a major challenge in molecular biology as these signals may play a fundamental role in the basic mechanisms of gene regulation and expression. In this context the linguistic analysis of DNA sequences, which can be considered as texts made up of several words of different lengths (1,2), may help us identify biologically interesting sequence patterns. For example, the finding that some words occur considerably more (or less) frequently than expected could help us clarify their biological significance. In particular they can guide the experimental studies that are obviously needed to establish the biological role of the sequences under consideration. It is now clear that in DNA sequences the usage of short oligomers is non random (1-9). In addition, it must be recalled that the words of genetic language can be of different lengths, e.g. triplets in coding regions, restriction sites of four or six nucleotides, transcription signals and protein binding sites of variable sequence length. Several pattern recognition methods based on the usage of words of different lengths in a single sequence or in a given set of sequences have been reported (1,3,10-17). These methods
can be used for taxonomic (18) or functional (3,19) classification of DNA sequences, or to extract consensus patterns from sets of functionally related nucleotide sequences (10 - 16). In this paper, we describe a novel algorithm for the identification of statistically significant nucleotide words which has unique features and is extremely fast. This algorithm is based on a first order Markov analysis. It allows us to analyze, in sets of unaligned sequences, known to be functionally related (e.g. promoter regions, introns, etc.) but not evolutionary homologous, the statistical pattern of DNA oligomers which are assumed to be Poisson distributed. By using a very fast string-matching method (20) modified by us for application to a four letter alphabet (manuscript in preparation), we are able to analyze words of lengths ranging from six to ten nucleotides, the typical size of the vast majority of DNA sequence signals such as promoters, enhancers and DNA binding proteins. T'his method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (21).
MATERIALS The method presented here has been implemented in WORDUP, a computer program written in FORTRAN running on a DEC VAX computer under VMS operating system. It has been tested on 521 eukaryotic promoter sequences extracted from release 24 of the Eukaryotic Promoter Database (EPD) (21) through the ACNUC-EPD retrieval program (22,23). All sequences have been extracted from position -100 to position +5 with respect to their transcription origin, as it is generally accepted that specific sequences within 100 nucleotides upstream of the transcription start site usually determine the efficiency of mRNA synthesis. In the EPD database (21) the promoter sequences are classified in various homology groups. If two promoters exhibit a similarity degree greater than 50%, they are considered to belong to the same homology group. In order to avoid, in the statistical analysis, the bias produced by sequence similarities between evolutionarily related sequences, only one member per homology group has been selected. The ACNUC-EPD retrieval program allows us to automatically extract only those non-homologous promoters which form an independent promoters subset. The program WORDUP is available from the authors upon request.
2872 Nucleic Acids Research, Vol. 20, No. 11
METHOD were in common with the other wor*- having an highr The aim of the algorithm described here is to single ouwoi' X1~Th s xn values fortheg 1 were that are significantly shared or avoided in a given set of ees wasortedgwT ime podu#was 1 timies repeated Mwerno having a particular biological property in con (e.g. meat lon un ficantwords errierIpbpg promoters, introns, non-coding regions,etc.). These words are The final x2 sorting has been thus determined. considered to be specific for that biological role. For a given x2 value and known di function P(c2) We define the sequence S by: we can ulate how many w-mers are e to exceed this S = XjX2 . Xn;  value as follows: xi=A,C,G,T which contains n-w+ 1 oligomers of length w, defined as: N(X2) = [l-P(x2)] .4*  For example if w=7 the expected number of aes which S= Xj, Xj+..., j+w-l j=l,n-w+l have x2 25 is less than 0.01 in 16,384 "his way, with a Consider now N nucleotide sequences SI, S2,..., S, L1 fixed a x2 heshold, we are able to evallkti*Mh words nucleotides long, i= 1,...,N. If oligomers are Poisson di d, be consdered statistically significant at a given level of the probability 7ri(Sk) that the oligomer k, w bases long, confidence. k= 1,...,4w, is found at least one time in the sequence Si is given by: RESULTS 7r(sk) = I-e-$ik  Among d* sequence patterns whose bio*p4lvity. hasbeee with: expe ely demonstrated are thW oonolling ihe can
the average number of sequences containing the w-mer k, in the sequence i, weighted on its base composition, where:  qi(Sk) = f(u1,2)f(u2,3). . .f(uw,._Xw)/f(u2) f(uw- 1) is the occurrence probability of the oligomer Sk in the sequec i, calculated according to Stuckle et al. (17) by assg a first order stationary Markov chain, with f(u ,y) and f(u.) being the respective frequencies of dinucleotides and mononucleotides observed in each sequence. The observed number of occurrences of the w-mer k in the set of the N sequences is given by:  POOk) = EiPi(Sk) where pi(sk) is 1 if Sk is found in the sequence i, otherwise it is 0. Whereas the expected number of occurrences is given by:
mechanism. For thi reason, our algorithm by analyzing.
the statistical significance of the occurrences of the k-th word can be established according to the elementary X2 stistics:  X2k = (P(sk) H(sk) )2 / Il(sk) It may occur that some words have a x2-value signifinty higher than expected not because they are really ially significant but as a result of their location overlapping with a trily significant word. For example, if the tmly significat word is: ACGTACGT, all the eight one-position shiftd 7/8 overlapping words: NACGTACG, CGTACGTN will be
to test the promot
,cz , f e a H
erroneously considered statistically significant.
obviously decreases with reduction of overlapping sites. In order to overcome this problem the following iterative procedure has been applied to each sequence. In the first iteratin all 4w words were sorted according to their x2-value and only the M words with a significantly high x2-value (> 10) were considered. The positions of the last M -1 words were compared with the word having the highest x2. All the occurrences of the M I words with overlapping positions having the highest 92 word were thus ignored. A word was considered in overlapping position when at least w/2 (w even) or w/2 + 1 (w odd) consecutive characters -
0 2 4 6 * I I I 1 2 4 1 1 0 2 4 6 O 0 0 O O O 0
i 0 0
j s Figure1.Lh obsred (left) and expected (right) o(m calcule fbr to 4W epamers. For each verical br lbetwm te upper a lower midpoints, that we calculatd borvd vahe ro h qtamr with below the bar and the previous and the following X2 $ x2 values higher thn expected (randm) are 6tb te ) in the distribution.
sequences. We analyzed a set of 521 promoter sequences extracted as described in the Materials section. A word length of 7 nucleotides was chosen for the statistical analysis. In figure 1, the observed and expected distributions of x2 values for all 4w heptamers is shown. The distributions are very similar at low x2 values, but, as we thought, more words than expected are found at higher x2 values because of the presence of statistically significant heptamers of promoter sequences. By comparing observed and expected distributions, the threshold x2 value can be fixed at 16, and all words with x2 value above this threshold can be considered statistically significant. In Table I, observed and expected occurrences, with their relevant x2 values, are reported for statistically significant words (c2 > 16). After correction for overlapping words, the number of such words decreased from 102 to 35, accounting for 33 sequences that occurred more than expected and 2 less. Cis-acting DNA sequences most commonly found in the region surrounding the transcription initiation site, such as TATA-box, CAAT-box and GC-box (24-26), were clearly identified. The most frequent and statistically significant words corresponded to TATA-box, such as CTATAAA, GTATAAA etc. Following TATA-box-like words the other significant heptamers belonged to the classes of the CAAT-box and the GC-box. In addition, some other words not clearly classifiable in the above mentioned promoter types have been identified. All the significant words were found to occur more than expected except for the two heptamers GGGGGGG and CCCCCCC which were found to occur less than expected. In the same way all those heptamers containing a six nucleotide homo-G or homo-C stretch were found to occur less than expected (data not shown). The reason for this finding remains to be clarified but a possible explanation could be that sequences of more than 5 G:C pairings produce some distortions in the
Table H. The names of the Transcription Factor Database motifs (25) matching with the highest x2 eptamers. The matching motifs without base ambiguities are reported in boldcase. WORD
observed GTATAAA CTATAAA TATATAA
TGACGTC ATAAAAG GGGGCGG GATTGGC TATATAT
62 61 63 11
65(-39) 40 13
TCACGTG TTGCATA CCAATCA GGGCGTG CCAATGA GTATAAG
AAAACAG GGGACCA GCCACCG TATAAAG TTCCTCC AGGCGTG GTGACGC TTCCTTT GAGCCTG CTGTCAC AAATAGG GGCCAGA TCCGCTG CCACACC CTGATAA TGACGGA AAATACG GGGGGGG
59(-32) 14 17
15 13 11
18 12 11
6(-1) 15 13
12 13 8 14 9 6
8 6 10
Signals found in the TFD sites collection
GTATAAA TIE, HSV late TATA box CTATAAA engrailed CS, CArG box, GR-lysozym.2-1, PR-lysozym.2-1, MAT-alpha-2 CS1, TFW TATATAA
TGACGTC CAMP RE, Somatostct, cytomegl9, ATF CS, TCR-beta decamer, HCMV2, CREGa, CRE-VIP, CRE-som, CREB-Som, CREB-TH, CREB-VIP, CREB-PTH, CREB-ACG, CREB-HTLV-II, CREB-CMV, CREB-IAP, CREB.IAP, CREsomat, IE1-II, EI-IV, IEI-VI, CREB-HTLVI.l ATAAAAG CArG box, TAB-hsp83, TFMIDMLP
GGGGCGG SPI CSI, MT/G-BOX, SPl-MMlA, SPl-DHFR.1, SPl-IE3/l,SPl-1E3/2, SPI-SV/4, SPI-SV/6, SPl-SV/l, SPl-TK/l,Spl-DHFR.2, Spl-DHFR.3, GAL4-GAL2.1, oD.globin-A, bA.globin.1, Spl?-MT-1.1, Spl-rasl.3, Spl-rasl.4, Spl-SV40.1, Spl-tk.2, glob.1, TK.2, GAL2 USI, T-Ag-SV40-II, GC-box, G.stfig.2 GATIGGC HLAf.Ybox, NF-El CS1, XY CS.2, HMGCoAred PF5, CAAT box.2, Y box TATATAT
topoUCS, GAL4-depGA, R2-bIFN.1, rPRL-FPI, PUF-l-prol, TATA-box-CS, Pit-l-PlP, )TFDCYCl-l, TFMD.LEU2-2
ATAAATA topoIRS, TAB-hsp7O, B-factor-hsp7O,qa-IFCSI
CDEI, AAVN.210.D, uteroglobinupstreamCS, GAL2ds, MLTF-pB48, M MLTF-HCMV, MLTF-AAVP5, MLTF-TS
OCTA-mutant-10, SV40.2, SV40.8, hlst.H2B.l, nifpromoterCS,Pit-1CS, Pit-l-PitB2
engrailed CS, NF-El CSI, CCAAT.4, CAAT.4, PU-MHC II I-Ab, MAT-alpha-2 CS1
GGGCGTG MT/G-box, Spl-EII-lat, EIIaE-C, GC-box CCAATGA
Table I. The observed and expected occurrences are reported together with their relevant x2 values in the case of the most statistically sinificant words (C2 > 16). The occurrences of words overlapping with higher words (reported in the brackets) were not taken into account in the xcalculation.
Research, Vol. 20, No. 11 2873
MATCH NOT FOUND
CYC7-H2, E2 RS. 14, nGRE. 1, VRE(IFN.alpha)-repA
CArG box, MAT-alpha-2 CS I
1.04 5.88 13.07 2.33 6.72 5.13 6.90 1.29 3.31 4.52 3.75 3.05 2.42
SP-A, malT CS2
55.44 48.56 44.38
5.57 3.32 2.95 3.91 4.92 2.25 1.52
5.09 4.17 2.37 3.79 4.28 2.10 4.96 2.55 1.33 2.13
43.01 37.49 34.76 34.41 34.40 33.62 32.39 30.38 27.66
AP2-SV40.1, AP2-SV40.2, AP2-SV40.3, NFAT-HIV.2, API-HV.4
XRE.2, XRE-cyt45ORS, XRE-2.2
21.95 21.04 20.63 20.17 19.59 19.23 18.68 18.52 17.73 17.71
16.56 16.42 16.23 16.20 16.13 21.73 17.25
B1-A/B, CArGbox, MMV-NF-IRS, MAT-alpha-2CS2
GH-site 3, AP2-GH.1
MATCH NOT FOUND
beto-globin US, b.globin, SV40.5, SV40.12, EBP20-SV40, AP3-SV40. 1, AP3-SV40.2, E2 RS. 1, ICP4-gD
PRL conserved motif, EF-1
MATCH NOT FOUND
MATCH NOT FOUND
globfin, A2vlt RE3,
2874 Nucleic Acids Research, Vol. 20, No. 11 double helix structure of DNA. On the contrary, many heptamers with G- or C-stretches of 3-5 nucleotides were found to be more abundant than expected (GC-boxes). In order to check appropriateness of the algorithm described here, the 33 words found significantly over-represented (C2 > 16) were compared with a set of sequence patterns experimentally recognized to act as transcriptional elements because of their interaction with protein factors involved in the transcription machinery. The collection of transcriptional element consensus sequences contained in the TFD database compiled by David Ghosh (27) has been used in this analysis. In table [I, the TFD sequence signals that match with the overrepresented heptamers listed in Table I, are reported. The majority of the words we have found to be significant (29 out of 33) match with one or more TFD sequence motifs and only four do not. A possible explanation is that these words either have no biological significance or their biological role remains to be experimentally demonstrated.
DISCUSSION We present here a new method which allows us to single out short DNA sequences (genetic words) which have non-random statistical properties and thus may be biologically active. In particular the goal of our method was to single out a set of nucleotide words from six to ten nucleotide long, the typical size of many DNA sequence signals, whose statistical properties largely deviate from expected average. The statistical significance of each word is given by comparing expected and observed number of sequences which contain that word at least once. The main problem is the correct calculation of the expected values. In this calculation, the fact should be taken into account that DNA sequences are not a random succession of four characters, the bases, but that they have a particular structure for both base composition and dinucleotide distribution (5,6). For example, the scarcity of the CpG dinucleotide in many genomes, as a result of its high mutation susceptibility (28), is well known. For this reason, we chose to perform our statistical analysis on the basis of a first order Markov chain which calculates the expected occurrences of the various words from observed dinucleotide frequencies. It has been recently demonstrated that higher order Markov chains usually produce uncorrected results (17). As we do not know a priori the length of the biologically functional sequence signals, it is advisable to use words of different lengths in the analysis. The most appropriate length for a subset of completely overlapping words having different lengths, is taken to be that of the word with the highest x2 value. Other methods proposed so far for finding unknown sequence signals are slow and impractical for words longer than six nucleotides (11,14,16). In addition, they require a priori alignment of sequences and, in some cases, do not adequately take into account the nucleotide composition of analyzed sequences, as they consider all four bases to be equally probable. Our method is much simpler, does not require a priori alignment of the sequences to be analyzed, and performs a rigorous statistical analysis that takes into account the dinucleotide frequencies of observed sequences. Furthermore, it is extremely fast. The pattern matching algorithm used in WORDUP is more than twice as fast as that used by the FINDPATTERN program of the GCG package (29), the fastest existing to our knowledge. It is well known that DNA regulatory sites are usually represented as 'consensus'. This, in our opinion, is not wholly
adequate. Only a small fraction of the biologically active DNA motifs exactly match with the consensus (10). The biologically active sequence patterns cannot be simply described in terms of consensus as the presence of a given base in a given position can influence the bases required upstream and downstream to preserve their biological functions. In addition, the biological activity of consensus neighborhood words (12, 30) cannot be linke to the number of mismatches with respect to the consensus. We know, for example, that not all mismatches from the canonical TATA-box consensus sequence, GTATAWAW (24), are biologically equivalent to each other (10). A striking feature of our method is that its application does not define consensus sequences, but, more accurately, provides a list of sequence motifs which are significantly shared (or avoided) in the sequences under examination. These lists can be also usefully employed for decoding of biological information encoded by the anonymous sequences which are now produced in large amounts in megasequencing p s. The functional relatedness of analyzed sequences does not exclude the possibility that different signals can be found in different subsets of sequences, even in similar positions in sequences. For example, in a set of promoter sequences various unrelated signals can be found. In the case of our test, we succeeded in identifying two unrelated words, TTGCATA and GTGACGC, both located approximately at position -50 with respect to the transcription start site in the sequences which contain these sequence signals. These words are contained in several enhancer signals (31-34). In such cases searching methods based on consensus matrices (10-12,14,16,33) would fail because multiple signals on the same position would tend to hide one another. The results shown in table Il clearly demonstrate the reliability of the algorithm presented here. Our method can be used for analyzing other sets of analogous sequences such as introns, mRNAs untranslated sequences, etc. in order to detect specific sequence signals which may play a fundamental role in the basic mechanisms of gene regulation and expression.
ACKNOWLEDGMENTS This work was financed by Progetto Finalizzato Biotecnologie Biostrumentazioni (CNR, Italy) and Progetto Finalizzato Ingegneria Genetica (CNR, Italy).
REFERENCES 1. Brendel, V., Beckmann, J.S. and Trifonov, E.N. (1986) J. Biomolecular Structure and Dynamics, 4(1), 11-21 2. Pevzner, P.A., Borodovsky M.Y. and Mironov A.A. (1989) J. Biomolecular Structure and Dynamics, 6, 1013-1026 3. Becknmann, J.S., Brendel, V. and Trifonov, E.N. J. Biomolecular Structure and Dynamics,(1986) 4(3), 91-400 4. Borodovsky, M.Y. and Gusein-Zade S.M. (1989) J. Biomolecular Structure and Dynamics, 6, 1001-1011 5. Nussinov, R. (1981) J. Mol. Biol. 149, 125-131 6. Beutler, E., Gelbart, T., Han, J., Koziol, J.A. and Beutler, B. Proc. Natd. Acad. Sci. USA 86:192-196 (1989) 7. Kozhukhin C.G., Pevzner P.A. (1991) CABIOS 7,39-49 8. Korn L.J., Queen C.L. and Wegman M.N. (1977) Proc. Natl. Acad. Sci. USA 74, 4401-4405. 9. Lipman D.J. and Wilbur W.J. (1984) Nucleic Acid Res. 12, 215-226. 10. Stormo G.D. (1990) Meth. Enzymol. 183, 211-221 11. Mengeritsky G., Smith T.F. (1987) CABIOS 3, 223-227 12. Waternan M.S. (1989) (In) Mathematical methods for DNA sequences, Waterman M.S., Ed., pp. 93-116, CRC Press, Boca Raton.
Nucleic Acids Research, Vol. 20, No. 11 2875 13. 14. 15. 16. 17.
20. 21. 22. 23. 24. 25. 26.
27. 28. 29.
Staden R. (1990) Meth. Enzymol. 183, 193-211 Bucher P. and Bryan B. (1984) Nucleic Acids Res. 12, 287-305 Gartman, C.J. and Grob, U. (1991) Nucleic Acids Research 19, 6033-6040 Galas D.J., Eggert M. and Waterman M.S. (1985) J. Mol. Biol. 186, 117-128 Stuckle, E.E., Emmrich, C., Grob, U. and Nielsen P.J. (1991) Nucleic Acids Research 18:6641 -6647 Blaisdell, B.E. (1986) Proc. Natl. Acad. Sci. USA 83:5155-5159. Smith, T.F., Waterman, M.S. and Sadler J.R. (1983) Nucleic Acids Res. 11, 2205-2220. Boyer, R.S. and Moore, J.S. (1977) Comm. ACM 20(10), 62-72. Bucher, P. (1991) The Eukaryotic promoter Database of the Weizman Institute of Science. EMBL Nucleotide Sequence Data Library Release 23, Postfach 10.2209, D-6900 Heidelberg. Attimonelli, M., Liuni, S. and Prunella, N. (submitted) Gouy M., Gautier C., Attimonelli M., Lanave C. and Di Paola G. (1985) CABIOS 1, 167-172. Breathnach, R. and Chambon, P. (1981) Ann. Rev. Biochem. 50:349-383. Everett, R.D., Baty, D. and Chambon, P. (1983) Nucleic Acids Research 11, 2447-2464. Efstratiadis, A., Posakony, J.W., Maniatis, T., Lawn, M.N., O'Connell, C., Spritz, R.A., Deriel, J.K., Forget, B.G., Weissman, S.M., Slightom, J.L., Blechl, A.E., Smithies, O., Baralle, F.E., Shoulders, C.C. and Proudfoot, N.J. (1980) Cell, 21, 653-668. Ghosh D. (1990) Nucleic Acids Res. 18, 1749-1756. Nussinov, R. (1984) Nucleic Acids Research 12, 1749-1763. Devereux, J., Haeberli, P. and Smithies 0. (1984) Nucleic Acids Res.
12:387-395. 30. Waterman M.S., Arratia R., Galas D.J. (1984) Bull. Math. Biol. 46:515-527. 31. Chen, R., Ingraham, H.A., Treacy, M.N., Albert, V.R. Wilson, L. and Rosenfeld, M.G. (1990) Nature 346, 583-586 32. Maeda, H., Araki, K., Kitamura, D., Wang, J. and Watanabe, T. (1987) Nucleic Acids Research 15, 2581-2569 33. Davidson, I., Fromental, C., Augereau, P., Wildeman, A., Zenke, M. and Chambon, P. (1986) Nature 323, 544-548 34. Casey, J.L., Di Jeso, B., Rao, K.K., Rounalt T.A., Klausner, R.D. and Harford, J.B. (1988) Nucleic Acids Research 16, 629-646 35. Hertz, G.Z., Hartzell, G.W. and Stormno, G.D. (1990) CABIOS 6, 81-92.