Nucleic Acids Research, Vol. 18, No. 1

Novel families of interspersed repetitive elements from the human genome

Jerzy Jurka

Linus Pauling Institute of Science and Medicine, 440 Page Mill Road, Palo Alto, CA 94306, USA

Received September 15, 1989; Accepted November 22, 1989

ABSTRACT

Six novel families of interspersed repetitive elements have been detected in the available human DNA sequences using computer-assisted analyses. The estimated total number of elements in the six reported families is over 17,000. Sequences representative of each family range from approximately 150 to 650 base pairs (bp) in length and are predominantly (A + T)-rich. Sequences from four families contain stretches of patchy complementarity up to 45 bp long. A member of one of the families is likely to be directly involved in a multigene deletion on chromosome 14. Two of the six sequence families are homologous to 'low reiteration frequency sequences' from monkey cells, detected first in defective variants of simian virus 40. Like the Alu and L1 families, the newly discovered families are probably composed of pseudogenes derived from functional genes.

INTRODUCTION

So far, the most abundant and the most thoroughly studied families of interspersed repetitive elements from the human genome are the Alu and L1 families. Both families are likely to be composed of active genes with unknown functions and of a large number of pseudogenes falling into distinct subfamilies (1-9). The third largest family is believed to be the THE family of transposon-like elements (10). Furthermore, there is an undetermined number of other families of repetitive elements which are comparable in size with the THE family or smaller. Studies on these families may be essential for our understanding of the origin, evolution and biological role of interspersed repetitive sequences and/or of their putative master genes.

Extensive computer searches and analyses of the available human DNA sequences revealed the presence of at least six novel families of repetitive elements, which are reported and briefly characterized in this paper. Each novel family is represented by at least two highly similar sequences extracted from introns or flanking regions of different genes. Any two genes are considered different if they have different biological functions and if the overall similarity between them is lower than 50%. The average similarity within the discovered families ranges from 69 to 82%.

Sequences from two of the six families reported in this paper are homologous to the previously described 'low reiteration frequency sequences' from defective SV40 variants from monkey cells (24, 28). However, as pointed out by one of the anonymous reviewers, in the human genome these repeats should be placed between high and low frequency repeats, as their estimated number ranges from one to several thousand per family. Therefore, they are referred to as 'medium reiteration frequency sequences', abbreviated to 'MER sequences'.

MATERIALS AND METHODS

Elimination of Alu and L1 sequences from the database

One of the difficulties with detection of new repetitive DNA by systematic computer searches is the large number of background scores coming primarily from Alu and L1 repetitive sequences. Therefore, prior to the analysis, all Alu and L1 sequences were eliminated from the database as follows. The L1 DNA sequence cD11 (37) and the Alu consensus sequence (4) were compared against human sequences from the GenBank DNA sequence database (release 55.0), using a computer program for rapid similarity searches (DASHER2, D. V. Faulkner 1985, unpublished) based on the algorithm by Wilbur and Lipman (12). The goal of the rapid search was to determine the approximate location of Alu and L1 sequences and of their complements. The exact location of the identified sequences was then established by alignment with the consensus sequences, using the algorithm by Smith and Waterman (13). This computer-intensive approach was chosen because not all Alu and L1 sequences can be extracted from GenBank based on the information present in the so-called 'site features' describing each sequence. All the detected Alu and L1 sequences were replaced by random letters different from A, C, T and G.
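The last two steps of this procedure (exact placement by local alignment, then masking) can be illustrated with a minimal Python sketch. It is not DASHER2 or the code used in the study: the function names, the scoring values (match 2, mismatch -1, gap -2) and the masking alphabet are assumptions chosen only for illustration, and the complementary strand is not searched.

```python
import random

def smith_waterman_span(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment of query against target (linear gap penalty).

    Returns (score, start, end) with start/end as 0-based, end-exclusive
    coordinates in target. Scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    # origin[i][j]: target index where the alignment ending at (i, j) began
    origin = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            diag = score[i - 1][j - 1] + step
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            score[i][j] = s
            if s == 0:
                origin[i][j] = j
            elif s == diag:
                # a fresh alignment starts at target[j - 1]
                origin[i][j] = origin[i - 1][j - 1] if score[i - 1][j - 1] > 0 else j - 1
            elif s == up:
                origin[i][j] = origin[i - 1][j]
            else:
                origin[i][j] = origin[i][j - 1]
            if s > best[0]:
                best = (s, origin[i][j], j)
    return best

def mask_span(sequence, start, end, alphabet="BDEFHIJKLMNOPQRSUVWXYZ", rng=None):
    """Replace sequence[start:end] with random letters other than A, C, G, T."""
    rng = rng or random.Random(0)  # fixed seed only to keep the example repeatable
    filler = "".join(rng.choice(alphabet) for _ in range(end - start))
    return sequence[:start] + filler + sequence[end:]

# Toy example: a short stand-in for a consensus, embedded in a longer sequence.
consensus = "GGCCGGGCGCGG"
target = "TTTT" + consensus + "AAAA"
_, start, end = smith_waterman_span(consensus, target)
masked = mask_span(target, start, end)
```

Masked positions can no longer match any real nucleotide, so later database-wide comparisons are not dominated by Alu or L1 hits.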

Search for new repetitive elements

After elimination of Alu and L1 sequences, the GenBank sequences were compared with all non-identical sequences from the same database using DASHER2. The output was then screened for high identity scores (> 10.0) between portions of seemingly unrelated sequences. The following GenBank sequences were selected for further analysis: HUMAGG-human angiogenin gene; HUMALBGC-human albumin gene; HUMAPOA4C-human apolipoprotein A4 gene; HUMAPOAI1-human apolipoprotein CIII gene; HUMERYA-human erythrocyte alpha-spectrin gene; HUMFIXG-human factor IX gene; HUMFOLA2-human dihydrofolate reductase gene; HUMGASTA-human gastrin gene; HUMNGFB-human beta-nerve growth factor gene; HUMPAIA-human tissue plasminogen activator inhibitor-1 gene; HUMRENA3-human renin gene (exons 3, 4 and 5); HUMTCRAJ1-human genes for T-cell receptor alpha chain J region; HUMTPA-human tissue plasminogen activator gene. In addition, the human gamma-B-crystallin gene (HUMCRYGBC) from GenBank release 59.0 has been included in the analysis.

The exact locations of the determined new repetitive subsequences have been established by alignment using the algorithm by Smith and Waterman (13). Multiple alignment and other sequence analyses have been done using a specialized sequence editor, MASE (14).

RESULTS AND DISCUSSION

MER1 family

This family is represented by five GenBank sequences: four from the human and one from the monkey genome (Fig. 1). The analyzed human MER1 elements belong to two different subfamilies, represented by shorter sequences at least 319 bp long and longer ones at least 539 bp long. They are referred to as the MER1a subfamily (Fig. 1, sequences 3, 4 and 5) and the MER1b subfamily (Fig. 1, sequences 1 and 2), respectively. Similarity between the human MER1a sequences is 77% and between MER1b sequences is 81%. The difference in length is due to two extra sequence fragments in MER1b relative to MER1a sequences. The first fragment begins at position 93 and the second one at position 255, relative to sequence 3 (HUMAGG) from Fig. 1. Both fragments have distinct base composition: the first one contains 65% (A+T) and the second 68% (A+G), as opposed to the rest of the MER1 sequences, which contain 47% (A+T) and 50% (A+G). This may indicate that MER1b sequences evolved from a MER1a-like sequence after insertion of these two fragments of foreign DNA.

Recently, MER1b sequences have been described by Bosma et al. (15). The authors view them as potential elements regulating the coordinate expression of the human tissue plasminogen activator and human tissue plasminogen activator inhibitor-1 genes. Similarity of these sequences to the remaining three MER1 sequences from Fig. 1, indicating their repetitive character, has not been reported. Three human MER1 sequences are located in the 5'-flanking regions of genes. The fourth one (HUMAPOAI1) is within the second intron of the apolipoprotein CIII gene, flanked on its 5'-end by an old Alu-Se sequence (4). After the analysis was completed, one more example of a MER1 repeat was identified in the region upstream from the so-called sigma(gamma4) sequence within the cluster of immunoglobulin heavy chain constant genes (positions 421-176, Fig. 2A in ref. 38). MER1a sequences are homologous (75% similar) to a fragment of monkey DNA from the viral-host recombinant junction of defective simian virus 40 (Fig. 1, SV4CVP8B). With the exception of the monkey DNA, all MER1 sequences contain long regions of patchy complementarity, overlined in Fig. 1.

Figure 1. MER1 repetitive elements from the following GenBank loci and the published literature: HUMAGG (981, 727) and Fig. 2 (716, 970), ref. (22); HUMAPOAI1 (6395, 6055) and Fig. 2 (1943, 2283), ref. (23); HUMPAIA (511, 1) and Fig. 3 (-1010, -1520) from ref. (15); HUMTPA (40, 578) and Fig. A (-3491, -2953) from ref. (19); SV4CVP8B (555-647) and Fig. 3 (555-647), ref. (24). The longest regions of imperfect complementarity are overlined. Alignment gaps (-) and identical bases (.) relative to the top sequence are indicated. Positions where base changes are attributable to deamination of cytosine in 'CpG spots' (34) are indicated (@). [Sequence alignment not reproduced.]

MER2 family

Four copies of these sequences have been found in human DNA and one in monkey DNA from defective SV40 (Fig. 2). Probably the most representative of this group is sequence 1 (HUMFIXG), 357 bp long. Sequences 1 and 2 contain short regions of complementarity, overlined in Fig. 2. Average similarity within the MER2 family is 76%. Sequence 3 (HUMTPA) contains a 158 bp long fragment of an old Alu-J insert (4).

Figure 2. MER2 repetitive elements from the following GenBank loci and the published literature: HUMAPOA4C (1, 145) and Fig. 2 (-815, -770), Karathanasis et al. (25); HUMERYA (2478, 2165) and Fig. 5 (2478, 2165), Linnenbach et al. (26); HUMFIXG (6669, 7025) and Fig. 3 (3703, 4059), Yoshitake et al. (27); HUMTPA (18851, 18392) and Fig. A (15320, 14861), Degen et al. (19); SV4EV210 (446, 698) and Fig. 6 (396, 648), Sheflin et al. (28). The longest regions of imperfect complementarity are overlined. Asterisks (*) indicate location but not the length of Alu-J sequence (4), dots (.) show bases identical relative to the top sequence and hyphens (-) indicate the alignment gaps. [Sequence alignment not reproduced.]

MER3 family

With an average similarity of 69%, this family is the most diverse of all MER families. The longest MER3 sequence is 199 bp long. MER3 sequences contain a very long but poorly preserved region of complementarity, overlined in Fig. 3.

Figure 3. MER3 repetitive elements from the following GenBank loci and the published literature: HUMALBGC (2933, 2739) and Fig. 2 (1196, 1002), Minghetti et al. (29); HUMFIXG (24781, 24979) and Fig. 3 (21815, 22013), Yoshitake et al. (27); HUMGASTA (4129, 4073) and Fig. 2 (4129, 4073), Kariya et al. (30); HUMRENA3 (2271, 2465) and Fig. 2 (3492, 3686), Hardman et al. (31); HUMTCRAJ1 (3178, 2987) and Fig. 5 (3178, 2987), Yoshikai et al. (32). The longest regions of imperfect complementarity are overlined. Dots (.) show bases identical relative to the top sequence and hyphens (-) indicate the alignment gaps. [Sequence alignment not reproduced.]

MER4 family

This family is represented by two sequences from the human tissue plasminogen activator gene and a sequence fragment from the human gamma-B-crystallin gene (17). Members of this family are around 647 bp long, which makes them the longest of the MER sequences described in this paper. Overall, the first two MER4 sequences from Fig. 4 are 80% similar. The similarity is even greater (87%) in the region homologous to the third fragment from Fig. 4 (HUMCRYGBC). The 5'-end of the first MER4 element from Fig. 4 is flanked by a repetitive sequence over 500 bp long and composed mostly of variants of a heptamer: TAGATGA. However, it is not clear if this 'tail-like' region is an inherent part of MER4 sequences, since it exists only in this single element.

After the analysis was completed, one more example of a MER4 fragment was found in the previously described hot spot for chromosomal recombination (Fig. 7, ref. 39). This spot is involved in the multigene deletion in the human immunoglobulin heavy chain (IGH) constant region locus, mapped on chromosome 14. The deletion at position 823 and an insertion at position 862 in Fig. 7 (ref. 39) are within the region complementary to MER4 sequences from Fig. 4 of this paper. It is likely that the multigene deletion is caused by a recombination between different MER sequences flanking the genes prior to deletion.

MER5 family

MER5 sequences are represented by two examples (Fig. 5). They are at least 152 bp long, 75% identical, and are the shortest of all MER sequences reported in this paper. They contain inverted repeats 16 bp in length.

MER6 family

The MER6 family (Fig. 6) is represented in the database by two sequences that are 82% identical. Members of the MER6 family are expected to be at least 264 bp long, which is the length of sequence 1. Sequence 2 is almost four times longer than sequence 1, as it contains two insertions relative to sequence 1. The first insertion is a full-length Alu-Se sequence. The second putative insertion is a 439 bp long sequence with a base composition different from that in the regions similar between the two sequences.
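Several of the family descriptions above rest on base composition, such as the (A+T) and (A+G) percentages of the MER1b fragments and the differing composition of the MER6 insertion. As a minimal sketch of how such percentages can be computed, the following Python function counts selected bases over the unambiguous nucleotides of a sequence. The function name and the decision to skip gaps and other symbols are assumptions for illustration, not a description of the software used in the study.

```python
def base_fraction(sequence, bases):
    """Fraction of A/C/G/T positions in `sequence` that fall in `bases`.

    Gap characters, dots and any other symbols are ignored, which is an
    assumption of this sketch.
    """
    counted = [b for b in sequence.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in bases for b in counted) / len(counted)

# Example with the heptamer quoted in the MER4 description.
heptamer = "TAGATGA"
at_fraction = base_fraction(heptamer, "AT")  # share of A and T
ag_fraction = base_fraction(heptamer, "AG")  # share of A and G (purines)
```

Comparing these fractions between a candidate insertion and its flanking sequence is one simple way to express the contrast in composition described above.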

Other potential MER families

Two repetitive elements, 147 and 62 bp long, have been reported by Degen and Davie (18). Fragments of these elements are strikingly similar to sequence fragments from other genes. Overall, however, these potential repeats are less similar to each other than to the MER elements. Analogous unreported 'patchy similarity' has been found between a portion of an intron from the human tissue plasminogen activator gene (Fig. A: 19206-19385, ref. 19) and other sequences from the database. The fragment most often shared with other sequences is: TGCCTCAGTTTCCTCATCTG.

Two pairs of repetitive elements have been reported in the human alpha-fetoprotein gene (20). So far, however, no sequences homologous to these elements have been found in other genes. Previously unreported pairs of repeats are also present in the human adenosine deaminase gene (21). The first pair is at positions 13564-13621 and 14279-14337 (Fig. 3, ref. 21), and the second one is at positions 25397-25449 and 26379-26422. Current data do not permit one to decide whether these sequences are scattered repetitive elements or simply a result of local sequence duplications. An attractive alternative is that they are long terminal repeats (LTR) flanking transposable elements.

Figure 4. MER4 repetitive elements from the following GenBank loci and the published literature: HUMTPA (5307, 6297) and Fig. A (1776, 2766), Degen et al. (19); HUMTPA (24498, 25144) and Fig. A (20968, 21614), Degen et al. (19); HUMCRYGBC (4720, 4877), see (17). Asterisks indicate location but not the length of Alu-Sd sequence (4); dots (.) show bases identical relative to the top sequence and hyphens (-) indicate the alignment gaps. [Sequence alignment not reproduced.]

Figure 5. MER5 repetitive elements from the following GenBank loci and the published literature: HUMFIXG (24021, 24170) and Fig. 3 (21056, 21205), ref. (27); HUMNGFB (3550, 3399) and from the '6.6kb insert' in Fig. 2 of Ullrich et al. (11). The longest regions of imperfect complementarity are overlined. Dots (.) show bases identical relative to the top sequence and hyphens (-) indicate the alignment gaps. [Sequence alignment not reproduced.]

Figure 6. MER6 repetitive elements from the following GenBank loci and the published literature: HUMFOLA2 (1, 264) and Fig. 3 (1, 264), ref. (33); HUMTPA (34152, 35153) and Fig. A (30622, 31623), ref. (19). Asterisks indicate location but not the length of Alu-Se sequence (4); dots (.) show bases identical relative to the top sequence and hyphens (-) indicate the alignment gaps. [Sequence alignment not reproduced.]
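The figure captions describe a compact alignment notation in which a dot marks a base identical to the top sequence and a hyphen marks an alignment gap. The Python sketch below shows one way to expand such a row and to score percent identity between two aligned rows. How similarity was actually computed in the study is not stated; skipping every column that contains a gap is an assumption of this sketch, and both function names are hypothetical.

```python
def expand_row(reference, row):
    """Expand dot notation: '.' in `row` takes the reference base of that column."""
    if len(reference) != len(row):
        raise ValueError("aligned rows must have equal length")
    return "".join(r if c == "." else c for r, c in zip(reference, row))

def percent_identity(row_a, row_b):
    """Percent of identical bases over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

# Toy rows, not taken from the figures.
top = "ACGTACGT"
second = expand_row(top, "...A....")
identity = percent_identity(top, second)
```

Counting gapped columns as mismatches instead would lower the values, so any comparison with the percentages quoted in the text depends on that choice.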

The abundance and age of MER sequences

There are around 3.5 million base pairs of human DNA in the GenBank database if one excludes cDNA sequences. The non-cDNA portion includes about 20 interspersed fragments of L1 repetitive elements and around 500 Alu sequences. The 20 L1 and 500 Alu sequences in the database correspond to 17,143 L1 and 428,571 Alu sequences in the entire human genome. It is proposed that the 20 human MER sequences from all 6 families described in this paper represent a number of elements comparable with that of the L1 family. These figures are in excellent agreement with the majority of current estimates (35, 36). They are more likely to be underestimates than overestimates, since GenBank contains a larger proportion of coding regions than the genome itself, and the coding regions are less likely to accommodate repeats than the non-coding ones.

MER sequences are associated with a number of Alu sequences. Following the terminology of Jurka and Smith (4), one can arrange Alu sequences from the oldest to the most recent ones in the following order: Alu-J, -Se, -Sd, -Sc, -Sb. Three Alu inserts have been found in MER sequences: Alu-J (Fig. 2), Alu-Sd (Fig. 4) and Alu-Se (Fig. 6). In addition, two more Alu sequences have been found next to MER sequences: Alu-J next to MER5 in the human factor IX gene (HUMFIXG) and Alu-Se next to MER3 in the human gastrin gene (HUMGASTA). The Alu-J subfamily is believed to be 55 Myr (million years) old and Alu-S(d+e), described as Alu-Sa, about 38 Myr old (16). The MER sequences containing Alu inserts should be at least as old as the Alu inserts themselves, as there is no apparent evidence for relocation of Alu sequences from the original insertion site to another site. For example, the MER2 (HUMTPA) sequence from Fig. 2 containing the Alu-J insert should be around 55 Myr old. The relationships between Alu sequences and the flanking MER sequences are less certain. It is also not known how 'young' the 'youngest' MER sequences are. More data will be necessary to resolve this question.
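The genome-wide copy numbers given earlier in this section follow from a simple proportional extrapolation of database counts. The Python sketch below reproduces that arithmetic. The haploid genome size of 3 x 10^9 bp is an assumption: it is not stated in the text, but it is the value that yields the quoted figures from 3.5 million bp of non-cDNA sequence.

```python
def extrapolate_copies(copies_in_sample, sample_bp, genome_bp):
    """Scale a copy count observed in a sample up to the whole genome."""
    return copies_in_sample * genome_bp / sample_bp

GENOME_BP = 3.0e9  # assumed haploid genome size, not stated in the text
SAMPLE_BP = 3.5e6  # non-cDNA human DNA in GenBank, as given in the text

l1_copies = extrapolate_copies(20, SAMPLE_BP, GENOME_BP)    # about 17,143
alu_copies = extrapolate_copies(500, SAMPLE_BP, GENOME_BP)  # about 428,571
mer_copies = extrapolate_copies(20, SAMPLE_BP, GENOME_BP)   # same count as L1
```

Because the 20 MER sequences scale by the same factor as the 20 L1 fragments, the estimate of over 17,000 MER elements follows directly.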

Origin and evolution of MER sequences

The presence of two subfamilies in the MER1 family, of which MER1b seems to be younger than MER1a, suggests that at least this family might have evolved in a manner similar to the Alu and L1 families, with one potential difference: no evidence for retroposition of MER elements has been found. Following the recent Alu/L1 models of evolution, MER1 subfamilies are expected to contain mostly pseudogenes generated by an unknown mechanism from an active gene.

Most pseudogenes outside 'CpG islands' undergo a process of 'CpG decay' (e.g. 34). This process is believed to be caused by deamination of methylated cytosine, producing TpG and CpA as the most common products of the decay (34). For example, positions marked by '@' in Fig. 1 show variations attributable to CpG decay. CpA and TpG in homologous positions can be traced back to ancestral CpG doublets, presumably inherited from the conserved parental gene, as suggested for Alu sequences (3). With the exception of MER5 sequences, one can find similar evidence for CpG decay in the other MER sequences reported in this paper. In addition to the CpG decay, MER sequences contain Alu inserts. Furthermore, some MER sequences are shorter at the ends than their homologues, which suggests that they may be truncated. Altogether, there are good reasons to believe that at least some MER sequences represent pseudogenes.

As pointed out above, the mechanisms generating MER pseudogenes are unknown. Unlike in the case of Alu and L1 sequences, no obvious poly(A) tails or direct flanking repeats that would suggest retroposition have been found in MER sequences. There is a possibility that at least some MER sequences can be replicated and disseminated by DNA viruses, as suggested by the presence of MER1 and MER2 sequences in SV40 recombinants.
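The CpG decay argument can be made concrete with a small Python sketch that scans aligned sequences for columns where one row keeps a CpG doublet while another row shows TpG or CpA at the same position. This is a deliberate simplification and not the procedure used to place the '@' marks in Fig. 1: it requires at least one row to retain the CpG, it ignores doublets interrupted by gaps, and the function name is hypothetical.

```python
def cpg_decay_columns(rows):
    """Return 0-based column indices i where columns (i, i+1) hold CG in at
    least one aligned row and TG or CA in at least one other row."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    hits = []
    for i in range(len(rows[0]) - 1):
        doublets = [r[i:i + 2].upper() for r in rows]
        if "CG" in doublets and any(d in ("TG", "CA") for d in doublets):
            hits.append(i)
    return hits

# Toy alignment, not taken from the figures.
aligned = ["AACGTT",
           "AATGTT",
           "AACATT"]
candidate_columns = cpg_decay_columns(aligned)
```

A column where every row already shows TpG or CpA would be missed by this sketch, although the text treats such homologous positions as traceable to an ancestral CpG as well.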

ACKNOWLEDGEMENTS

I would like to thank Steve Lawson, Raxit Jariwalla, Greg Spicer and Emile Zuckerkandl for critically reading the manuscript and for very helpful discussions, Martha Best and Diane Read for help with preparation of the manuscript, and the Bionet resource for generous computer support.

REFERENCES

1. Slagel, V., Flemington, E., Traina-Dorge, V., Bradshaw, H., and Deininger, P. (1987) Mol. Biol. Evol. 4, 19-29
2. Willard, C., Nguyen, H.T., and Schmid, C.W. (1988) J. Mol. Evol. 26, 180-186
3. Britten, R.J., Baron, W.F., Stout, D., and Davidson, E.H. (1988) Proc. Natl. Acad. Sci. U.S.A. 85, 4770-4774
4. Jurka, J., and Smith, T. (1988) Proc. Natl. Acad. Sci. U.S.A. 85, 4775-4778
5. Quentin, Y. (1988) J. Mol. Evol. 27, 194-202
6. Martin, S.L., Voliva, C.F., Burton, F.H., Edgell, M.H., and Hutchison, III, C.A. (1984) Proc. Natl. Acad. Sci. U.S.A. 81, 2308-2312
7. Skowronski, J., and Singer, M.F. (1986) Cold Spring Harb. Symp. Quant. Biol. 51, 457-463
8. Scott, A.F., Schmeckpeper, B.J., Abdelrazik, M., Comey, C.T., O'Hara, B., Rossiter, J.P., Cooley, T., Heath, P., Smith, K.D., and Margolet, L. (1987) Genomics 1, 113-125
9. Jurka, J. (1989) J. Mol. Evol. (in press)
10. Paulson, K.E., Deka, N., Schmid, C.W., Misra, R., Schindler, C.W., Rush, M.G., Kadyk, L., and Leinwand, L. (1985) Nature 316, 359-361
11. Ullrich, A., Gray, A., Berman, C., and Dull, T. (1983) Nature 303, 821-825
12. Wilbur, W.J., and Lipman, D.J. (1983) Proc. Natl. Acad. Sci. U.S.A. 80, 726-730
13. Smith, T.F., and Waterman, M.S. (1981) J. Mol. Biol. 147, 195-197
14. Faulkner, D.V., and Jurka, J. (1988) Trends Biochem. Sci. 13, 321-322
15. Bosma, P.J., van den Berg, E.A., Kooistra, T., Siemieniak, D.R., and Slightom, J.L. (1988) J. Biol. Chem. 263, 9129-9141
16. Labuda, D., and Striker, G. (1989) Nucl. Acids Res. 17, 2477-2491
17. Den Dunnen, J.T., Van Neck, J.W., Cremers, F.P.M., Lubsen, N.H., and Schoenmakers, J.G.G. GenBank release 59.0.
18. Degen, S.J.F., and Davie, E.W. (1987) Biochemistry 26, 6165-6177
19. Degen, S.J.F., Rajput, B., and Reich, E. (1986) J. Biol. Chem. 261, 6972-6985
20. Gibbs, P.E.M., Zielinski, R., Boyd, C., and Dugaiczyk, A. (1987) Biochemistry 26, 1332-1343
21. Wiginton, D.A., Kaplan, D.J., States, J.C., Akeson, A.L., Perme, C.M., Bilyk, I.J., Vaughn, A.J., Lattier, D.L., and Hutton, J.J. (1986) Biochemistry 25, 8234
22. Kurachi, K., Davie, E.W., Strydom, D.J., Riordan, J.F., and Vallee, B.L. (1985) Biochemistry 24, 5494-5499
23. Protter, A.A., Levy-Wilson, B., Miller, J., Bencen, G., White, T., and Seilhamer, J.J. (1984) DNA 3, 449-456
24. Wakamiya, T., McCutchan, T., Rosenberg, M., and Singer, M. (1979) J. Biol. Chem. 254, 3584-3591
25. Karathanasis, S.K., Oettgen, P., Haddad, I.A., and Antonarakis, S.E. (1986) Proc. Natl. Acad. Sci. U.S.A. 83, 8457-8461
26. Linnenbach, A.J., Speicher, D.W., Marchesi, V.T., and Forget, B.G. (1986) Proc. Natl. Acad. Sci. U.S.A. 83, 2397-2401
27. Yoshitake, S., Schach, G., Foster, D.C., Davie, E.W., and Kurachi, K. (1985) Biochemistry 24, 3736-3750
28. Sheflin, L., Celeste, A., and Woodworth-Gutai, M. (1983) J. Biol. Chem. 258, 14315-14321
29. Minghetti, P.P., Ruffner, D.E., Kuang, W.-J., Dennison, O.E., Hawkins, J.W., Beattie, W.G., and Dugaiczyk, A. (1986) J. Biol. Chem. 261, 6747-6757
30. Kariya, Y., Kato, K., Hayashizaki, Y., Himeno, S., Tarui, S., and Matsubara, K. (1986) Gene 50, 345-352
31. Hardman, J.A., Hort, Y.J., Catanzaro, D.F., Tellam, J.T., Baxter, J.D., Morris, B.J., and Shine, J. (1984) DNA 3, 457-468
32. Yoshikai, Y., Clark, S.P., Taylor, S., Sohn, U., Wilson, B.I., Minden, M.D., and Mak, T.W. (1985) Nature 316, 837-840
33. Yang, J.K., Masters, J.N., and Attardi, G. (1984) J. Mol. Biol. 176, 169-187
34. Bird, A.P. (1987) Trends Genet. 3, 342-347
35. Singer, M.F., and Skowronski, J. (1985) Trends Biochem. Sci. 10, 119-122
36. Weiner, A.M., Deininger, P.L., and Efstratiadis, A. (1986) Ann. Rev. Biochem. 55, 631-661
37. Skowronski, J., Fanning, T.G., and Singer, M.F. (1988) Mol. Cell. Biol. 8, 1385-1397
38. Akahori, Y., Handa, H., Imai, K., Abe, M., Kameyama, K., Hibiya, M., Yasui, H., Okamura, K., Naito, M., Matsuoka, H., and Kurosawa, Y. (1988) Nucl. Acids Res. 16, 9497-9511
39. Keyeux, G., Lefranc, G., and Lefranc, M.-P. (1989) Genomics 5, 431-441
