J Mol Evol (1992) 35:286-291

Journal of Molecular Evolution © Springer-Verlag New York Inc. 1992

Prototypic Sequences for Human Repetitive DNA

Jerzy Jurka, Jolanta Walichiewicz, and Aleksandar Milosavljevic

Linus Pauling Institute of Science and Medicine, 440 Page Mill Road, Palo Alto, CA 94306, USA

Summary.

We report a collection of 53 prototypic sequences representing known families of repetitive elements from the human genome. The prototypic sequences are either consensus sequences or selected examples of repetitive sequences. The collection includes: prototypes for high and medium reiteration frequency interspersed repeats, long terminal repeats of endogenous retroviruses, alphoid repeats, telomere-associated repeats, and some miscellaneous repeats. The collection is annotated and available electronically.

Key words: Repetitive DNA - Database - Human Genome - Primates

Offprint requests to: J. Jurka

Introduction

Proliferation of repetitive elements has long been viewed as a powerful evolutionary force reshaping eukaryotic genomes. At the level of an individual organism this "reshaping" means that repetitive elements can lead to significant genetic instability and to genetic diseases more often than to positive evolutionary changes. Thus, in addition to their evolutionary significance, studies on repetitive elements in the human genome are increasingly important from the medical point of view. Furthermore, detailed knowledge of repetitive elements is essential for the ongoing genome-mapping and sequencing projects. As sequencing continues with ever-increasing speed, the number of known repetitive sequences in the databanks grows as well.

It has been argued that this DNA should be viewed as "junk" or even "trash" DNA and that the sequencing of the repetitive portions of eukaryotic genomes should be abandoned. Fortunately, this is becoming a minority view as we are increasingly aware that, for better or worse, repetitive elements are integral components of genes, that they affect them in a variety of ways and, in addition, that they can provide us with important evolutionary information. Repetitive elements can cause illegitimate recombinations (Lerman et al. 1987; Groffen et al. 1989; Chen et al. 1989; Jurka 1990; Hu et al. 1991), introduce regulatory signals for transcription (Wu et al. 1990), or introduce new splice sites (Mitchell et al. 1991). They can even be incorporated in open reading frames of functional genes (Caras et al. 1987; Brownell et al. 1989; Post et al. 1990). Repetitive elements can serve as powerful evolutionary markers (Del Pozzo and Guardiola 1990; Raisonnier 1991; Okada 1991; Chuat et al. 1992). Finally, they are important for physical and genetic mapping (Nelson et al. 1989; Zucchi and Schlesinger 1992). Knowledge of repetitive elements is also indispensable for accurate sequence assembly.

Other chapters on repetitive DNA are yet to be written. We know that many repetitive elements are retroposed from a limited number of genes which have been evolutionarily preserved for tens of millions of years (Britten et al. 1988; Jurka and Milosavljevic 1991). The biological function of these genes remains to be determined, although we tend to believe that they interfere with viral infections at the intracellular level (Jurka 1989; Jurka and Milosavljevic 1991).

As the number of known sequences of repetitive families and our knowledge about them advance rapidly, so does the demand for an organized procedure for identification and up-to-date analysis of repetitive elements in newly sequenced DNA. There are two basic components of such a procedure: a database of known repetitive elements and specialized computer software for accurate identification of homologous repeats. In this paper we present a basic database of repeats in the form of a collection of human prototypic repetitive elements. This collection contains "complex" repeats, as well defined as possible, which can be used as reference sequences for identification of other members of known repetitive families.

Even with well-defined reference sequences, identification of homologous repeats may not be a straightforward task. Many genomic regions contain clusters of repetitive elements and repeats inserted in repeats. Some repeats are much more mutated than others and are not easily detectable by any single screening procedure. We continue to address these problems in the screening software developed for our research purposes. This software-in-progress can be run on our computer via electronic mail, as described below.
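The screening software itself is not described in this paper. Purely to illustrate the two components named above (a database of reference sequences and a program that compares new DNA against it), the following Python sketch uses exact k-mer matching. The matching method, the function names `kmer_index` and `screen`, the value of `k`, and the toy sequences are all assumptions made for illustration; they are not the authors' program and not entries from the collection.

```python
from collections import defaultdict


def kmer_index(prototypes, k=8):
    """Map every k-mer to the set of prototype names that contain it.

    `prototypes` is a dict {name: sequence}; it stands in for the
    database of reference sequences (hypothetical layout).
    """
    index = defaultdict(set)
    for name, seq in prototypes.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def screen(query, index, k=8):
    """Count, for each prototype, the query k-mers it shares exactly."""
    hits = defaultdict(int)
    query = query.upper()
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return dict(hits)


# Toy reference sequences, invented for this example only.
toy_prototypes = {
    "toy_A": "ACGTTGCAAGGCTTACCGATGGATCCTA",
    "toy_B": "TTAGGGTTAGGGTTAGGGTTAGGG",
}
toy_index = kmer_index(toy_prototypes)

# A query that embeds a 16-base fragment of toy_A between unrelated flanks.
toy_query = "CCCC" + toy_prototypes["toy_A"][6:22] + "CCCC"
print(screen(toy_query, toy_index))
```

Exact k-mer matching would miss the heavily mutated repeats mentioned above, which is one reason a single screening procedure is not sufficient; a practical tool would need an alignment step that tolerates substitutions and gaps.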

Prototypic Repetitive Elements

Given enough sequence examples, each family of repetitive elements can be best summarized by its consensus sequence. Whenever possible, we chose consensus sequences as prototypic repeats in our collection. However, if the number of known sequences is small, the consensus becomes too ambiguous. Currently, the majority of repetitive families are represented by only a few published DNA sequences. For these families we chose individual sequences or sequence fragments as prototypic repetitive elements.

Many interspersed repeats contain simple A-rich sequences or poly(A) tails of varying lengths. Simple, A-rich sequences are very common in eukaryotic genomes. Thus, any sequence with a long poly(A) tail would be a very nonspecific prototype for computer searches. For this reason we reduced such tails to a minimum. However, the internal poly(A) stretches were not eliminated.

The list of the prototypic elements is presented in Table 1. The first column lists short reference names for all the prototypes. These names are predominantly limited to three-letter words followed by numbers whenever necessary. The first two letters abbreviate published definitions or keywords and the third letter "R" denotes "repetitive elements." Identical three-letter reference names are followed by unique numbers. The exceptions are the well-established names of Alu and L1 sequences. The second column shows GenBank loci names and sequence positions of the prototypes. Sequences not present in GenBank (release 69.0), as well as consensus sequences, can be found in the publications listed in the last column.

The first two elements in Table 1 (Alu and MIR elements) represent abundant families of short interspersed repeats (often abbreviated to "SINE families"). The MIR prototype (Mammalian Interspersed Repeat) represents the second-largest family of interspersed repeats in primates and occurs in other mammals as well (Donehower et al. 1989; Korotkov 1991). This little-known family will be described in detail elsewhere (Jurka 1992, in preparation). L1 represents the family of long interspersed repeats (also known as the LINE-1, Kpn, or KpnI family). Like the MIR family, the L1 family is present in nonprimate mammals.

MER repeats, as well as HGR, A3R, and KER, represent medium reiteration frequency sequences. Their number usually ranges from several hundred to thousands of copies per human genome (Jurka 1990; Bosma et al. 1991; Kaplan et al. 1991). Some of them have been detected in nonprimate mammals (Kaplan et al. 1991). The LOR1 family represents low reiteration frequency sequences. It is present in about 100 copies in the human genome. Low-frequency families are arbitrarily defined as families of about 100 or fewer elements per genome.

THR and OFR sequences are components of a single transposable element, THE-1 (Paulson et al. 1985; Misra et al. 1987). The THR portion of the THE-1 transposon is flanked on both sides by LTR-like repeats called "O repeats" (Paulson et al. 1985). The O repeats often exist independently of the THE-1 transposons and are listed as separate OFR elements in our compilation. XBR, XTR, and DBR are potential repetitive elements. The first two have been found in the human alpha-fetoprotein gene, each in two copies (Gibbs et al. 1987), and the third one was found in the human prothrombin gene and some other genes (Degen and Davie 1987).

Sequences LTR1 through LTR10 represent long terminal repeats of endogenous human retroviruses. Although LTRs are not commonly linked to other repetitive elements, they are present in multiple copies in the genomic DNA and, for practical purposes, they are put together with other prototypic repeats in this compilation. We cannot exclude the possibility that at least some medium- and low-frequency repeats represent remnants of long terminal repeats of unknown or extinct retroviruses. This provides an additional rationale for joint consideration of LTRs with other repetitive elements.
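The two preparation steps described at the start of this section, summarizing a family by its consensus and shortening terminal poly(A) tails, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the collection: the majority-rule threshold `min_fraction`, the use of "N" for ambiguous columns, the `keep` length for the retained tail, and both function names are hypothetical, and the input sequences are assumed to be already aligned to equal length.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5):
    """Column-wise majority consensus of equal-length aligned sequences.

    A column is reported as 'N' when no single base exceeds
    `min_fraction` of the sequences (threshold is an assumption).
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    out = []
    for column in zip(*aligned):
        base, count = Counter(b.upper() for b in column).most_common(1)[0]
        out.append(base if count / len(column) > min_fraction else "N")
    return "".join(out)


def trim_poly_a_tail(seq, keep=3):
    """Shorten a terminal run of A's to at most `keep` bases.

    Only the 3' end is touched; internal A-rich stretches are kept,
    mirroring the rule stated in the text. `keep` is an assumed value.
    """
    stripped = seq.rstrip("Aa")
    tail = seq[len(stripped):]
    return stripped + tail[:keep]


# With three sequences every column has a clear majority ...
print(majority_consensus(["ACGT", "ACGA", "ACTT"]))
# ... with only two, disagreeing columns cannot be resolved.
print(majority_consensus(["ACGT", "ACCA"]))
# The internal AAAA stays; the terminal run of ten A's is cut to three.
print(trim_poly_a_tail("GGCAAAATTCAAAAAAAAAA"))
```

The second call shows why a consensus built from very few sequences becomes ambiguous, which is the situation in which an individual sequence was chosen as the prototype instead.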

Table 1. Prototypic repetitive elements (a)

Name      GenBank locus  Positions       Ref.        Definition
ALU       -              -               (2)         human Alu interspersed repetitive sequence, a consensus
MIR       HUMSATM01      631-730         (16)        mammalian-wide interspersed repeat
L1        HUMHBB         67,089-73,228   (3)         human L1 interspersed repetitive sequence (LINE-1, KpnI)
MER1      HUMTPA         40-578          (4)         human medium reiteration frequency repetitive sequence MER1
MER2      HUMFIXG        6669-7025       (4)         human medium reiteration frequency repetitive sequence MER2
MER3      HUMFIXG        24,781-24,979   (4)         human medium reiteration frequency repetitive sequence MER3
MER4      HUMTPA         5307-6297       (4)         human medium reiteration frequency repetitive sequence MER4
MER5      HUMFIXG        24,021-24,170   (4)         human medium reiteration frequency repetitive sequence MER5
MER6      HUMTPA         34,152-35,153   (4)         human medium reiteration frequency repetitive sequence MER6
MER7      HUMALUANO      137-392         (5)         human medium reiteration frequency repetitive sequence MER7
MER8      -              -               (6)         human medium reiteration frequency repetitive sequence MER8
MER9      HUMFIXG        22,119-21,742*  (6, 7)      human medium reiteration frequency repetitive sequence MER9
MER10     HUMHLASBA      9026-8621*      (6, 8, 9)   human medium reiteration frequency repetitive sequence MER10 (MstII)
MER11     HUMP45C17      789-52*         (6, 10)     human medium reiteration frequency repetitive sequence MER11
MER12-21  -              -               (6)         human medium reiteration frequency repetitive sequences (10 families)
MER22     HUMREPSST      1-1563          (11)        human SstI moderately repetitive DNA sequence family
MER24     HUMREP17P      1-290           (12)        human DNA 17p11.2 to 17p12
MER25     HUMHBEG        9524-9951       (13)        human medium reiteration frequency repetitive sequence MER25
HGR       HUMNALHGL      1-473           (17)        human non-alpha-globin repeat located in the gamma- and epsilon-globin intergenic region
A3R       HUMRA3A        189-38*         (18)        human A3 repeated element DNA
KER       HUMRSKF        1-337           (19)        human K element interspersed repeat DNA
LOR1      HUMIGKVKA      1-482           (28)        human low copy repetitive sequence
OFR       HUMRSO5C       26-384          (19)        human O interspersed repeat, clone O-5
THR       HUMRSOLTR      398-1944        (20)        human THE-1 repetitive sequence
DBR       HUMTHB         11,018-11,079   (34)        potential new repetitive element copy B
XBR       HUMAFP         11,548-11,850   (14)        Xba repetitive element copy A, from human alpha-fetoprotein gene
XTR       HUMAFP         564-664         (14)        repetitive element X-2 from human alpha-fetoprotein gene
LTR1      HUMERSP2A      11-852          (21)        LTR from human endogenous retrovirus-like sequence (HuERS-P2)
LTR2      HUMER142       13-461          (22)        LTR from human endogenous retrovirus (4-14), 3' long terminal repeat
LTR3      HUMERMTV2      22-450          (23)        LTR from human DNA related to mouse mammary tumor virus (MMTV) 3' LTR
LTR4      HUMERVA34A     2800-3387       (24)        LTR from human endogenous retrovirus ERV3, pol-env-3' LTR region
LTR5      HUMERVKB1      32-1000         (25)        LTR from human endogenous retrovirus 5' LTR, clone HERV-k18
LTR6      HUMLTRSSA      1-535           (26)        LTR from human SSAV-related endogenous retroviral LTR-like element
LTR7      HUMRTVLH2      6-455           (27)        LTR from human endogenous retrovirus RTVL-H2
LTR8      HUMERSP1B      11-700          (21)        LTR from human endogenous retrovirus-like sequence (HuERS-P1-1) 5'-LTR region
LTR9      HUMERSP3       1-608           (21)        LTR from human endogenous retrovirus-like sequence (HuERS-P3)
LTR10     HUMSATM13      125-430         (16)        human DNA sequence 5'-flanking minisatellite pms32
ALR       -              -               (1)         human alpha repetitive DNA, a consensus
CER       HUMREP         1-382           (33)        human D22Z3 repetitive DNA (centromeric DNA)
TAR1      HUMTARSTS2     1-2111          (32)        human telomere-associated repeat sequence, complete sequence
PTR5      HUMREPTR5      1-2438          (15)        human ptr5 mRNA for repetitive sequence
PTR7      HUMREPTR7      1-1895          (15)        human ptr7 mRNA for repetitive sequence
MSR1      HUMRSSA19      -               (29)        human 37 bp minisatellite repeats, specific to chromosome 19
IVR       -              -               (30, 31)    repetitive sequence from human involucrin gene, a consensus

(a) * A large number preceding the small one indicates the complementary sequence.

Refs.: (1) Vissel and Choo (1987); (2) Jurka and Smith (1988); (3) Hattori et al. (1985); (4) Jurka (1990); (5) Kaplan and Duncan (1990); (6) Kaplan et al. (1991); (7) Yoshitake et al. (1985); (8) Lawrance et al. (1985); (9) Mermer et al. (1987); (10) Picado-Leonard and Miller (1987); (11) Epstein et al. (1987); (12) Donehower et al. (1988); (13) Rogan et al. (1987); (14) Gibbs et al. (1987); (15) La Mantia et al. (1989); (16) Armour et al. (1989); (17) Jagadeeswaran et al. (1982); (18) Humphries et al. (1985); (19) Sun et al. (1984); (20) Paulson et al. (1985); (21) Harada et al. (1987); (22) Steele et al. (1984); (23) May and Westley (1986); (24) Cohen et al. (1985); (25) Ono (1986); (26) Brack-Werner et al. (1989); (27) Mager and Freeman (1987); (28) Straubinger et al. (1984); (29) Das et al. (1987); (30) Heller et al. (1985); (31) Eckert and Green (1986); (32) Brown et al. (1990); (33) Metzdorf et al. (1988); (34) Degen and Davie (1987)
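The position convention in the Table 1 footnote (a large number preceding a small one denotes the complementary strand) can be made concrete with a short Python sketch. It assumes that positions are 1-based and inclusive, as in GenBank flat-file numbering, and that the locus sequence is already available as a string; the function names `parse_positions` and `extract` and the toy sequence are hypothetical.

```python
# Translation table for complementing DNA bases (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_positions(text):
    """Parse a Table 1 style range such as '22,119-21,742*'.

    Thousands separators, the asterisk marker, and doubled hyphens
    are ignored. Returns (first, second) exactly in the order given.
    """
    cleaned = text.replace(",", "").replace("*", "").strip()
    first, second = (int(part) for part in cleaned.split("-") if part)
    return first, second


def extract(locus_seq, first, second):
    """Return the region between two 1-based inclusive positions.

    When `first` is larger than `second`, the region is taken from
    the complementary strand, i.e. reverse-complemented.
    """
    if first <= second:
        return locus_seq[first - 1:second]
    return locus_seq[second - 1:first][::-1].translate(_COMPLEMENT)


toy_locus = "ACGTTTGA"  # invented sequence, not a GenBank entry
print(parse_positions("22,119-21,742*"))
print(extract(toy_locus, 2, 4))   # forward strand
print(extract(toy_locus, 4, 2))   # complementary strand
```

Keeping the two numbers in the order in which they are printed, rather than sorting them, preserves the strand information that the footnote encodes.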

The rest of Table 1 includes prototypic sequences for (1) two centromeric repetitive sequences (ALR and CER); (2) one telomere-associated repetitive sequence (TAR1); (3) two related examples of interspersed repetitive sequences expressed in embryonic carcinoma cells (PTR5 and PTR7); (4) one minisatellite (MSR1) and one unusual repeat which represents an array of 39 repeats coding for involucrin protein (Eckert and Green 1986) and viewed by some as a repetitive DNA (Heller et al. 1986).

Availability

Through the end of this year this collection can be obtained directly from [email protected]. edu. Later on, it will be distributed via anonymous servers. In particular, it will be deposited at the National Center for Biotechnology Information.

A preliminary version of the program identifying repetitive elements in newly sequenced DNA can also be accessed via electronic mail. Currently, this program is in the testing stage, and its accessibility may be limited. To obtain detailed information about the input format and the current status of the program, one should send a message containing the single word "help" to the Internet address [email protected]. edu. In its present form the program automatically reads the incoming mail messages that contain newly sequenced DNA and then identifies regions homologous to the prototypic sequences from our collection. An output file containing the results is then mailed back to the sender.

Discussion

This collection is far from being exhaustive. First of all, it does not include prototypes for individual subfamilies present within families of repetitive sequences. Automatic assignment of individual Alu sequences to specific subfamilies has been addressed before (Jurka and Milosavljevic 1991), and this approach can be extended to other families.

We also did not include any prototypic sequences for so-called "simple repeats." Simple repeats represent very common and often polymorphic stretches of mononucleotides, dinucleotides, etc. (e.g., poly[A], poly[AC]), scattered over the genome. They need to be treated separately for the very reason of their simplicity. Available screening and alignment procedures are not sensitive enough to distinguish between different types of simple repeats without additional labor-intensive evaluation. For example, poly(AAAT) will efficiently match poly(A) and other A-rich simple sequences.

Finally, there are some retroposed pseudogenes not yet included in this collection which may turn out to be present in large copy numbers per human genome. For example, a retroposed pseudogene for small cytoplasmic Ro RNA (hY3 gene, Wolin and Steitz 1983) is present in two GenBank sequences: in the alpha-1-globin gene (Jurka et al. 1988) and in the human nonhistone chromosomal protein HMG-14 gene (Landsman et al. 1988). The latter pseudogene is reported here (GenBank accession number M21339, positions 5348-5440). Human sequences from GenBank represent only a small sample of the human genome (
