J Mol Evol (1992) 35:123-130

Journal of Molecular Evolution © Springer-Verlag New York Inc. 1992

Essential Role of Duplications of Short Motif Sequences in the Genomic Evolution of Bombyx mori

Sachiko Ichimura and Kazuei Mita

Division of Biology, National Institute of Radiological Sciences, Anagawa, Chiba-shi, Japan

Summary. The Bombyx fibroin gene has a discrete mosaic structure of various repetitive sequences, which may have evolved through various repeating arrangements. Detailed sequence analysis of the fibroin gene, including coding and noncoding regions, revealed that the whole sequence could be arranged as an array of short repetitive sequences. A portion of the intron of the fibroin gene is an interspersed repetitive element. We cloned a 1.5-kb DNA fragment of the Bombyx genome that contains interspersed elements homologous to the intron sequence. Sequence comparison between the intron and the 1.5-kb fragment shows that partial duplication has occurred frequently during evolution, and the resultant repetitive blocks of short motif sequences are abundant in the genome. These facts suggest that tandem duplication of short motif sequences is an important rearrangement in the genomic evolution of the fibroin gene.

Key words: Genomic evolution -- Repetitive sequences -- Duplication of DNA sequences -- Bombyx mori -- Silk fibroin gene

Introduction

Tandem or interspersed repetitive sequences are major components of most eukaryotic genomes. For example, Gage (1974) estimated that 45% of Bombyx genomic DNA consists of repetitive sequences. Many such sequences are species specific and may play an important role in gene evolution. In fact, some repetitive sequences mediate chromosomal rearrangements such as deletions, duplications, and inversions through unequal crossing over, transposition, and other recombination processes. The Alu family, which is the most abundant SINE (short interspersed element) in primate genomes, is a retroposon originating from processed 7SL RNA (Ullu and Tschudi 1984), and some Alu elements are considered to be transpositionally active even in recent human evolution (Matera et al. 1990). Furthermore, this element has been observed at recombination sites (Lehrman et al. 1986; Kudo and Fukuda 1989). Minisatellite regions dispersed in the vertebrate genome are built from short tandemly repetitive units containing a core sequence similar to the generalized recombination signal of Escherichia coli (Jeffreys et al. 1985a). The minisatellites are highly polymorphic due to allelic variations in repeat copy number and are used as genetic markers (Jeffreys et al. 1985b).

Multiplication of sequences also mediates the construction of many protein-coding sequences, e.g., Balbiani ring genes in Chironomus tentans (Hoog et al. 1988; Galli et al. 1990) and the repetitive protein genes of malaria (Kemp et al. 1987). A 228-bp unit of the coding sequence for ubiquitin is tandemly repeated, and the resultant polyubiquitin genes having various numbers of repeats are multiplied in the genome (Dworkin-Rastl et al. 1984). On the other hand, repetitive sequences are essential for maintaining the functional structure of the chromosome. Specific tandem repetitive sequences in the centromere are important for DNA replication. Highly conserved telomeric repetitive sequences help stabilize the chromatin structure (Meyne et al. 1989).

The silk fibroin gene of Bombyx mori is composed of a short exon 1, an intron, and a long exon 2 (Tsujimoto and Suzuki 1979a). The core of exon 2 is composed of a higher order repeating sequence constructed from two tandemly repetitive elements (Mita et al. 1988; Mita and Ichimura, unpublished). Parts of the 5'-side flanking and intron sequences are reportedly interspersed repetitive sequences (Pearson et al. 1981). Thus, the fibroin gene with its flanking sequences has a discrete mosaic structure of various repetitive sequences. In this study, the superstructure of the repetitive sequences of the fibroin gene, including noncoding and coding regions, was analyzed, and short tandem repetitive sequences were generally observed. One of the genomic fragments possessing sequences homologous to the fibroin intron was cloned, and its sequence was compared with that of the intron. It is estimated that sequence diversification occurred by frequent duplications of short motif sequences with deletions, additions, and point mutations in the ancestral sequence.

Offprint requests to: S. Ichimura

Materials and Methods

The fibroin gene sequences of the 5'-upstream flanking region, a portion of the coding region at the 5'-end, and the intron have been published (Tsujimoto and Suzuki 1979b; Kusuda et al. 1986). Recently, the 3'-side portion of exon 2, comprising a repetitive core and short C-terminal domains, has been sequenced (Mita and Ichimura, unpublished). We have also determined the sequence of the 1.5-kb fragment having a sequence homologous to the fibroin intron. The cloning and sequencing methods were as follows. A genomic library of B. mori was prepared using the Lambda DASH/BamHI Cloning Kit and Gigapack Gold (Stratagene). Among a number of clones containing a sequence homologous to the intron, one with a 13-kb insert containing Ala and Gly tRNA sequences was selected. A SalI fragment (1.5-kb fragment) located at the end of the 13-kb genomic insert was subcloned into M13mp18 and M13mp19 and sequenced by the dideoxy chain-termination method using a 7-deaza sequencing kit (Takara Biomedicals). Sequencing was completed using the deletion method of Henikoff (1984) and Yanisch-Perron et al. (1985).

The sequences of the fibroin gene were arranged to form short tandem repetitive sequences with some gaps by the following method. The sequences were divided into overlapping blocks of about 50 bases, and direct repeats in each block were searched using SDC GENETYX software (Software Development Co., Ltd.). Only repeats with short or no spacers were selected from the data, and tandem repetitive sequences were arranged to give maximum matching after introduction of gaps using the method of Takeishi and Gotoh (1982). The weights used are -1 for a nucleotide match, 1 for a mismatch, and k + 1 for a gap involving k unpaired bases. Homology among repeats in each block is more than 70% in Fig. 1.
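The weighting scheme stated above can be illustrated with a short sketch. This is not the GENETYX or Takeishi and Gotoh implementation; it only scores an alignment that has already been gapped, under the stated weights (-1 per match, 1 per mismatch, k + 1 per gap of k unpaired bases, lower totals being better). The function names are hypothetical, and the homology measure shown (identical columns over paired columns) is an assumption, since the text does not define how the 70% figure was computed.

```python
def alignment_weight(top: str, bottom: str) -> int:
    """Total weight of a gapped pairwise alignment; '-' marks an unpaired base."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_len = 0      # length of the gap currently being read
    gap_row = None   # row that carries the current gap ("top" or "bottom")
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        row = "top" if x == "-" else "bottom" if y == "-" else None
        if row is None or row != gap_row:
            # A gap has just ended (or switched rows): charge k + 1 for it.
            if gap_len:
                total += gap_len + 1
            gap_len = 0
        gap_row = row
        if row is None:
            total += -1 if x == y else 1
        else:
            gap_len += 1
    if gap_len:
        total += gap_len + 1
    return total


def percent_identity(top: str, bottom: str) -> float:
    """Identical columns as a percentage of paired (non-gap) columns (assumed measure)."""
    paired = [(x, y) for x, y in zip(top, bottom) if x != "-" and y != "-"]
    if not paired:
        return 0.0
    matches = sum(1 for x, y in paired if x == y)
    return 100.0 * matches / len(paired)


if __name__ == "__main__":
    # Nine matches (-9) and one gap of three unpaired bases (+4).
    print(alignment_weight("GGTGCTGGTTCA", "GGTGCT---TCA"))
    print(percent_identity("GGTGCTGGTTCA", "GGTGCT---TCA"))
```

Under these weights a single long gap is cheaper than several short gaps covering the same number of bases, which favors arrangements in which a repeat differs from its motif by one contiguous deletion.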

Results

The Fibroin Gene Can Be Arranged as an Array of Short Repetitive Sequences

The sequence of the fibroin gene containing the 5'-flanking region (-876~-1), exon 1 (1~66), and intron (67~1039), and a portion of the 5'-side of exon 2 (1040~1539), which does not have clear repeat structures as the main part of the gene has, was published (Tsujimoto and Suzuki 1979b; Kusuda et al. 1986). These sequences can be arranged as an array of tandem repeats of short motif sequences as shown in Fig. 1. Each sequence numbered in parentheses is presented as a repeat or an imperfect repeat of the respective motif sequence in square brackets. Many of these repetitive sequences would reflect specific rearrangements during evolution, as the probability of the formation of direct repeats within a short distance in a DNA sequence consisting of a random array of nucleotides is small. Namely, in a random sequence with equal frequencies of the four bases, the probability that an n-base unit repeats directly within a distance of m bases is $P = (1/4)^n \times (m + 1)$. This means that the average distance ($\bar{m}$) between direct repeats of n bases is about $4^n$, e.g., $\bar{m} \approx 1$ kb when n = 5. Various duplications in an ancestral DNA sequence appear to have been an essential process in fibroin gene evolution.

The main part of exon 2, of about 15 kb, was reported to be a homologous alternating array of repetitious crystalline and amorphous coding sequences (Gage and Manning 1980). Recently, we found that typical repetitive sequences initiate at 2009 and that the repetitive sequences have evolved by multistep amplification of repetitive elements (a) and (b) with conserved boundary sequences (Mita et al. 1988; Mita and Ichimura, unpublished). In order to clarify the characteristics of each element, elements (a) and (b) in the 5' and 3' end portions of the cDNAs are summarized in Fig. 2.
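The random-sequence estimate given earlier in this section can be checked numerically. The sketch below simply evaluates the expression $P = (1/4)^n \times (m + 1)$ and the corresponding average spacing $4^n$; the function names are hypothetical, and the expression is treated as the approximation stated in the text (it is only meaningful while P remains well below 1).

```python
def direct_repeat_probability(n: int, m: int) -> float:
    """P = (1/4)^n * (m + 1): chance that an n-base unit recurs within m bases
    in a random sequence with equal base frequencies."""
    return (m + 1) * 0.25 ** n


def mean_repeat_distance(n: int) -> int:
    """Approximate average distance between direct repeats of an n-base unit."""
    return 4 ** n


if __name__ == "__main__":
    # For n = 5 the average distance is 4^5 = 1024 bases, i.e. about 1 kb.
    print(mean_repeat_distance(5))
    # Chance of a 5-base direct repeat within a 50-base block.
    print(direct_repeat_probability(5, 50))
```

With n = 5 the average distance is 1024 bases, in agreement with the 1-kb figure quoted above, so repeats of this length recurring within blocks of about 50 bases are unlikely to arise by chance alone.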
Element (a) is formed by tandem reiteration of a unit sequence, GGTGCTGGTGCTGGTTCA, and element (b) comprises the imperfect tandem repetition of a motif sequence, GGTGCTGGATACGGAGCAGGAGCTGGCGTT. Partial deletion of the motif sequence of element (b) does not occur randomly. The short sequences GGTGCT at the 5' end and GGCGTT at the 3' end are frequently deleted, and the deletion sites have a consensus sequence of GG(A/T)GC(A/T)*GG. Interestingly, this sequence is homologous to the human minisatellite consensus at the breakpoints of oncogene translocations, GC(A/T)GG(A/T)GG (Krowczynska et al. 1990). A few deletions were also observed in element (a), in which the deletion site in the motif sequence is located immediately after the second codon, that is, GGTGCT*GGTGCTGGTGCTGGTTCA. This point may be a recombination site between elements (a) and (b), as every element (a) except for the last, which is not followed by element (b), ended at the second codon. Although deletion sites in elements (a) and (b) have the same consensus sequence [GG(A/T)GC(A/T)*GG], not every site with the same

[Fig. 1 alignment omitted: numbered blocks of the fibroin gene sequence from -876 to 1539, each aligned under its bracketed motif sequence.]

Fig. 1. The sequence of the 5'-end of the fibroin gene containing flanking, intron, and short coding regions arranged as an array of imperfect repetitive blocks. The square brackets enclose motifs of repetitive sequences. In the arrangement of repetitive sequences, the same nucleotides as the motif sequences are shown by dashes. Two arrows show the initiating points of transcription and the coding sequence, respectively. The boxed sequence is a TATA box. The intron sequence spanning from 67 to 1039 is enclosed by a solid line.

sequence is a hot spot for deletion in the motif sequence. One mechanism explaining the deletion sites is that two DNA strands of homologous chromosomes may be slightly misaligned at meiosis, followed by unequal crossing over at specific recombination sites. Such slight misalignment in the repetitive sequences might happen frequently because the homology of the two strands is approximately conserved even after the misalignment, as shown in Fig. 3. GGCGTT at the 3'-end or GGTGCT at the 5'-end of the motif sequence of element (b) is deleted by crossing over at site 1 or site 2, respectively. A similar unequal crossing-over event, as shown in Fig. 3B, results in the formation of the conserved joining sequence between elements (a) and (b). These models suggest that one of the recombination signals is the GGCGTTGGTGCT sequence. Some imperfect repetitive blocks in the other regions may be essentially the result of a similar mechanism, although other processes diversifying the sequence cannot be ignored.
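The misalignment model described above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' analysis: it takes two tandem copies of the element (b) motif, measures how much identity survives when one strand is displaced by six bases, and forms the recombinant produced by a crossover between the displaced strands. The function names and the two cut positions (24 and 30) are hypothetical; they were chosen only so that the products show the loss of GGCGTT or GGTGCT, and are not claimed to coincide with sites 1 and 2 of Fig. 3.

```python
MOTIF_B = "GGTGCTGGATACGGAGCAGGAGCTGGCGTT"  # element (b) motif, 30 bases


def shift_identity(seq: str, shift: int) -> float:
    """Fraction of positions at which seq matches itself displaced by `shift` bases."""
    pairs = len(seq) - shift
    if shift < 0 or pairs <= 0:
        raise ValueError("shift must be non-negative and shorter than the sequence")
    same = sum(1 for i in range(pairs) if seq[i] == seq[i + shift])
    return same / pairs


def unequal_crossover(seq: str, cut: int, shift: int) -> str:
    """Recombinant of two identical strands misaligned by `shift` bases.

    Position i of one strand pairs with position i + shift of the other, so a
    crossover at `cut` joins seq[:cut] to seq[cut + shift:] and deletes the
    `shift` bases in between.
    """
    if not 0 <= cut <= len(seq) - shift:
        raise ValueError("cut position outside the paired region")
    return seq[:cut] + seq[cut + shift:]


if __name__ == "__main__":
    array = MOTIF_B * 2
    # Identity retained after a six-base (two-codon) displacement.
    print(round(shift_identity(array, 6), 2))
    # Crossover before the 3'-terminal GGCGTT of the first copy.
    print(unequal_crossover(array, 24, 6))
    # Crossover at the copy boundary, removing the 5'-terminal GGTGCT of the second copy.
    print(unequal_crossover(array, 30, 6))
```

The reciprocal product of each event carries a six-base duplication instead of a deletion, which is one way such a mechanism could generate both shortened and lengthened imperfect repeats.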

[Fig. 2 alignment omitted. Motif of element (a): GGTGCTGGTGCTGGTTCA; motif of element (b): GGTGCTGGATACGGAGCAGGAGCTGGCGTT.]

Duplications of Short Motif Sequences Have Frequently Occurred in Genomic Evolution


A partial sequence of the fibroin intron is an interspersed repetitive element (Pearson et al. 1981). We cloned and sequenced a DNA fragment of 1.5 kb that contains a sequence homologous to the intron (Fig. 4). The sequences of the intron and the 1.5-kb fragment are compared in Fig. 5. Highly homologous sequences were observed locally; for example, 28~90 of the 1.5-kb fragment shows 84% homology to 569~633 of the intron, and 937~1006 of the 1.5-kb fragment shows 66% homology to 67~139 of the intron. If some repetitive sequences in the intron were replaced with the corresponding typical motifs presented in brackets, the 1.5-kb fragment sequence from the 5'-end to 584 is homologous to the intron sequence between 515 and the 3'-end numbered 1039. Furthermore, the 3'-side portion of the 1.5-kb fragment spanning from 938 to the 3'-end is also homologous to the intron sequence spanning from the 5'-end numbered 67 to 766, as shown schematically in Fig. 5. This suggests that sequence rearrangements by short motif duplications were important in the evolution of the intron sequence from an ancestral sequence similar to the 1.5-kb fragment. The sequence analysis suggested that short motif duplications also occurred frequently in the genomic evolution of the 1.5-kb fragment, although the result is not shown.

The 1.5-kb fragment contains a sequence homologous to U5 RNA, which is a small nuclear RNA (snRNA) mediating splicing of introns, and a complementary sequence of a Bm1 element, a major SINE of B. mori (Adams et al. 1986). The U5 RNA-like sequence is also highly conserved in the intron sequence. The rat U5 RNA sequence (Branlant et al. 1983) was used for comparison, as only a partial sequence has been reported for B. mori U5 RNA, which has high homology to rat U5 RNA (Adams et al. 1985). Our preliminary results show that the U5 RNA-like sequence is a SINE having a copy number of about $10^4$ in the Bombyx genome. Because the copy number of the Bm1 element is $2.4 \times 10^4$ (Adams et al. 1985), we infer that the U5 RNA sequence and the Bm1 element are linked in a repetitive element. In relation to this inference, some U6 RNA pseudogenes are closely linked with some of the interspersed repetitive elements in the


mouse genome (Ohshima et al. 1981). The Bm1 element was thought to be a retroposon, and because some Bm1 elements are transcriptionally active, the ancestral intron sequence might also be a retrotransposon inserted in the fibroin-coding sequence.

Fig. 2. The repetitive structure of elements (a) and (b), which are the main repetitive elements in the core region of the fibroin gene. Elements (a) and (b) in the two complementary DNAs around the 5' and 3' ends are summarized.
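Because the Bm1 element occurs in the 1.5-kb fragment as a complementary sequence, a search for such elements has to consider both strands. The following minimal sketch shows one way to do this for exact matches; the function names are hypothetical and the sequences in the usage lines are toy examples, not the Bm1 or U5 RNA sequences, which are not reproduced in this text.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(query: str, target: str) -> list:
    """Exact occurrences of `query` in `target` as (start, strand) pairs.

    '+' means the query itself was found; '-' means its reverse complement
    was found, i.e. the element lies on the complementary strand.
    """
    hits = []
    for strand, probe in (("+", query), ("-", reverse_complement(query))):
        start = target.find(probe)
        while start != -1:
            hits.append((start, strand))
            start = target.find(probe, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(reverse_complement("GGTGCT"))
    print(find_on_both_strands("GGTGCT", "AAAGCACCAAGGTGCT"))
```

A query that equals its own reverse complement would be reported once per strand, and real comparisons of diverged repeats would need an alignment that tolerates mismatches and gaps rather than exact matching.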

Discussion

The sequence analysis of the fibroin gene of B. mori shows that amplification of short motif sequences occurred frequently during gene evolution and that the resultant tandem repetitive sequences of short motif sequences are abundant in the gene, including

[Fig. 3 diagram omitted. (A) Misaligned copies of the element (b) motif GGTGCTGGATACGGAGCAGGAGCTGGCGTT with crossover sites 1 and 2. (B) Tandem copies of the element (a) unit GGTGCTGGTGCTGGTTCA paired with element (b).]

Fig. 3. Models of unequal crossing over due to the approximate sequence homology after slight misalignment. In model A, unequal crossing over at site 1 deletes GGCGTT at the 3'-end and that at site 2 deletes GGTGCT at the 5'-end of the motif sequence in element (b). Model B shows the crossing over that forms a conserved boundary sequence between elements (a) and (b).


i TCGACGTGCAACCTAACCTATGCATCAGCCGCTGAGTTTCTCGCCG•ATCTTCT•AGC•••TC•C•ATTCCGATCCG•TAGTA•ATTCAT 91 TCGC•AAACAATTGCTCTT•AGTTGTTAGGTCTCCTTCGGAGGCGCTCGGGCAGTTGTTAGCAAATCCCGCCCCTCTTGGCTGA•CCTTT

181 •CTCTCGCCCACCT•TCCT•GTGAAACTGGAAAGGCCTTCGG•••ACCA•TAAACTTTCAATCATAAAAAAAATAAAAAAAAAAAAAAAA 271 ACCTTAATGTGGGTAATGGCATTTAAAATATGATGTCTT•ACTCTGTGCTCTCATGGTAATGTATAATACCACGCGTTTCT••TGCACTT 361 AGTGAA•TTACTCTTAAAACGAGGCCTGTGAAGAGTATTAAATGGTAGGCAGCGGCTAGGTTCTGCCCATGGCATT•TTGAT•TCCGTGA 451 •CGACA•TAACCACTCACCATCAGGT•G•CCATATAATCGTCTGCCTACAC•GCCAATAAAAAAATATTAAATTTATGAAAACAAGTCAA 541 AAGTCGGA•TTGCTTTGGAAAAAGTTGATAGTGGCATTAACAATACGTAGATATGATTCACTAAAATCGAAGATAGGTAGGTCTAGAACC 631 TACAGACTCTTGAAC•AAAACATCATAAC•TTACCCCT•G•CCAATTACGCGTTTGCGCACATTAATAATTCAAAA•A•CACACAAATAT 72i TAAAACGAGACAATACAGAGATTCAACACGAAATTCCGTAACGTTGTAATCAAATGAAATTCAAAACGGTTCCTTTAAAGTTACACAGTA 811 CTTGTTTACACTAAATAAAGCGCAAAAGGTACAGATTTTGCGCGAATGTAAATCAAACTCGATGTTTTATTAAACACTCGGCTGTTCCAG 901 ATATTCAAATTAAGC•CGTGTCGAAGCTAAGGCTAT•TGGAGTCGGACTTTTTTTTTTATTGCTAGATGGGTGGACGAGCTCACA•CCCA 991 CCTGGTGTTAAGT••TTACTG•AGCCCATA•ACATATACGACGTAAATGC•CACCCACCTTGA•ATATAAGTTCTAAGGTCTCAATTATA 1081 GTTACAACGGCTGCCCCACCCTTCAAACTGAAACGCATTACTGCTTCAC•GCAGAAATAGGCAGGGCGGTGGTACATACCGCGTGGACTC i171 ACAAGAGGTCCTACCACCAGTAA•ATTGACTTGTGAAATACTAAAGACATTTGCGTTAGAATTTGTTATAGTATTACATCGTAAT•TACA 1261 TAATGCATGTGTAAAACTCCTGGCTTTGATGATGTCGGGTTTTTTTATTT•TATAAATGGGCA-AACAAAATCAGAACTGATTTGGTATTA 13SI AATAGTCAGCGGAGACCACCGAAATTTCTACAAAGTTTCAATT•CA•TGGATAATCGCTGAACCATCCTACTAACTGGGACGTAAACAGT 1441 TTCGCGACAGTTATAGGCACTATAACGATAAATAGAAAACTTTAATCATAGATGTTTGAACTGGATC Fig. 4.

The nucleotide sequence of the 1.5-kb fragment.

coding and noncoding regions. In fact, the short sequence CTGGACA, situated at -524~-518 of the fibroin gene of Bombyx mandarina, which is the closest relative of B. mori, is duplicated in B. mori (-529~-516 block in Fig. 1), although both fibroin genes are highly homologous (Kusuda et al. 1986).

Some evidence suggests that short motif repetitions are generally important for the evolution of the eukaryotic genome. For example, hypervariable minisatellite DNA sequences are short motif repetitive sequences present at numerous loci in eukaryotes (Jeffreys et al. 1985a; Kominami et al. 1988; Dixon et al. 1990). Simple repetitive sequences (AA/TT)n, (GA/CT)n, (GT/CA)n, and (CAG/GTC)n are ubiquitous repetitive components of eukaryotic genomes (Tautz and Renz 1984). Furthermore, some functional units would have evolved through short motif repetitions, e.g., the β-globin direct repeat element is a duplicated sequence of a 10-bp motif, which is conserved in various mammalian β-globin promoters and regulates transcription (Stuve and Myers 1990), and a virus-inducible enhancer of the human interferon gene is a tandemly repeated sequence of a 6-bp motif (Fujita et al. 1987).

[Fig. 5. Homology between the sequences of the 1.5-kb genomic fragment and the fibroin intron, with the rat U5 RNA sequence and the complementary sequence of the Bm1 element also compared; motif sequences of repetitive sequences are given in brackets, and the homologous regions are shown schematically. Alignment omitted.]

Many repetitive sequences analyzed here are imperfect, probably resulting from duplications and subsequent deletions. Another mechanism of imperfect duplication, a slight misalignment followed by unequal crossing over at a specific site, was inferred from the deletion of specific parts of the motif sequence as observed in element (b) of the fibroin coding region. These mechanisms producing imperfect duplications would be essential to diversify sequences during evolution.

Some motifs of the repetitive sequences in Fig. 1 resemble each other and contain the same short sequences. For example, the TTGTT motif of the -591~-577 block is present also in CATTGTT of the -181~-169 block and in TTGTTA of the -145~-130 block. T(A)n, C(A)n, and (T)nC are also frequently observed among various motif sequences, especially in the noncoding regions. These observations suggest that the motifs can be reduced to more essential or common ancestral sequences.
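The observation that different motifs share short core sequences lends itself to a simple check. The sketch below is illustrative only: it lists the motifs that contain a given core, using the three motifs quoted above, and counts T(A)n, C(A)n, and (T)nC runs in a sequence. The function names are hypothetical, and the minimum run length n = 3 is an assumption, since the text does not state a threshold.

```python
import re


def motifs_sharing_core(core: str, motifs: list) -> list:
    """Motifs that contain `core` as a substring, in their original order."""
    return [motif for motif in motifs if core in motif]


def simple_run_counts(seq: str, min_run: int = 3) -> dict:
    """Non-overlapping counts of T(A)n, C(A)n and (T)nC with n >= min_run (assumed)."""
    patterns = {
        "T(A)n": rf"TA{{{min_run},}}",
        "C(A)n": rf"CA{{{min_run},}}",
        "(T)nC": rf"T{{{min_run},}}C",
    }
    return {name: len(re.findall(pattern, seq)) for name, pattern in patterns.items()}


if __name__ == "__main__":
    # Motifs quoted in the text for the -591~-577, -181~-169 and -145~-130 blocks.
    quoted = ["TTGTT", "CATTGTT", "TTGTTA"]
    print(motifs_sharing_core("TTGTT", quoted))
    # Toy sequence containing one run of each type.
    print(simple_run_counts("TAAAACAAAGTTTTC"))
```

Applied to the full set of bracketed motifs in Fig. 1, a tabulation of this kind would indicate which cores recur most often among the noncoding motifs.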
In fact, Ohno (1987) has pointed out that short oligomers could be ancestor sequences and that their duplication and further amplification would be essential processes of gene evolution. Palindromic sequences probably due to inversions and higher order repetitions were also characterized in the fibroin gene, although these rearrangements are not analyzed here.

The codon usage pattern is biased in the fibroin gene (Mita et al. 1988). This can be explained by the difference in the codon usage pattern in the unit sequences of the repetition. The various codon usage biases observed in other genes could be explained by the difference in the motif sequences of imperfect repetitive sequences.

Acknowledgment. This work was supported by a grant-in-aid for specific research from the Ministry of Education, Science, and Culture of Japan.

References

Adams DS, Herrera RJ, Luhrmann R, Lizardi PM (1985) Isolation and partial characterization of U1-U6 small RNAs from Bombyx mori. Biochemistry 24:117-125
Adams DS, Eickbush TH, Herrera RJ, Lizardi PM (1986) A highly reiterated family of transcribed oligo(A)-terminated interspersed DNA elements in the genome of Bombyx mori. J Mol Biol 187:465-478
Branlant C, Krol A, Lazar E, Haendler B, Jacob M, Galego-Dias L, Pousada C (1983) High evolutionary conservation of the secondary structure and of certain nucleotide sequences of U5 RNA. Nucleic Acids Res 11:8359-8367
Dixon LK, Bristow C, Wilkinson PJ, Sumption KJ (1990) Identification of a variable region of the African swine fever virus genome that has undergone separate DNA rearrangements leading to expansion of minisatellite-like sequences. J Mol Biol 216:677-688
Dworkin-Rastl E, Shrutkowski A, Dworkin MB (1984) Multiple ubiquitin mRNAs during Xenopus laevis development contain tandem repeats of the 76 amino acid coding sequence. Cell 39:321-325
Fujita T, Shibuya H, Hotta H, Yamanishi K, Taniguchi T (1987) Interferon gene regulation: tandemly repeated sequences of a synthetic 6 bp oligomer function as a virus-inducible enhancer. Cell 49:357-367
Gage LP (1974) The Bombyx mori genome: analysis by DNA reassociation kinetics. Chromosoma (Berl) 45:27-42
Gage LP, Manning RF (1980) Internal structure of the silk fibroin gene of Bombyx mori. J Biol Chem 255:9444-9450
Galli J, Lendahl U, Paulsson G, Ericsson C, Bergman T, Carlquist M, Wieslander L (1990) A new member of a secretory protein gene family in the dipteran Chironomus tentans has a variant repeat structure. J Mol Evol 31:40-50
Henikoff S (1984) Unidirectional digestion with exonuclease III creates targeted breakpoints for DNA sequencing. Gene 28:351-359
Hoog C, Daneholt B, Wieslander L (1988) Terminal repeats in long repeat arrays are likely to reflect the early evolution of Balbiani ring genes. J Mol Biol 200:655-664
Jeffreys AJ, Wilson V, Thein SL (1985a) Hypervariable 'minisatellite' region in human DNA. Nature 314:67-73
Jeffreys AJ, Wilson V, Thein SL (1985b) Individual-specific 'fingerprints' of human DNA. Nature 316:76-79
Kemp DJ, Coppel RL, Anders RF (1987) Repetitive proteins and genes of malaria. Annu Rev Microbiol 41:181-208
Kominami R, Mitani K, Muramatsu M (1988) Nucleotide sequence of a mouse minisatellite DNA. Nucleic Acids Res 16:1197
Krowczynska AM, Rudders RA, Krontiris TG (1990) The human minisatellite consensus at breakpoints of oncogene translocations. Nucleic Acids Res 18:1121-1127
Kudo S, Fukuda M (1989) Structural organization of glycophorin A and B genes: glycoprotein B gene evolved by homologous recombination at Alu repeat sequences. Proc Natl Acad Sci USA 86:4619-4623
Kusuda J, Tajima Y, Onimaru K, Ninaki O, Suzuki Y (1986) The sequence around the 5' end of the fibroin gene from the wild silkworm, Bombyx mandarina, and comparison with that of the domesticated species, B. mori. Mol Gen Genet 203:359-364
Lehrman MA, Russell DW, Goldstein JL, Brown MS (1986) Exon-Alu recombination deletes 5 kilobases from the low density lipoprotein receptor gene, producing a null phenotype in familial hypercholesterolemia. Proc Natl Acad Sci USA 83:3679-3683
Matera AG, Hellmann U, Schmid CW (1990) A transpositionally and transcriptionally competent Alu subfamily. Mol Cell Biol 10:5424-5432
Meyne J, Ratliff RL, Moyzis RK (1989) Conservation of the human telomere sequence (TTAGGG)n among vertebrates. Proc Natl Acad Sci USA 86:7049-7053
Mita K, Ichimura S, Zama M, James TC (1988) Specific codon usage pattern and its implications on the secondary structure of silk fibroin mRNA. J Mol Biol 203:917-925
Ohno S (1987) Evolution from primordial oligomeric repeats to modern coding sequences. J Mol Evol 25:325-329
Ohshima Y, Okada N, Tani T, Itoh Y, Itoh M (1981) Nucleotide sequences of mouse genomic loci including a gene or pseudogene for U6 (4.8S) nuclear RNA. Nucleic Acids Res 9:5145-5158
Pearson WR, Mukai T, Morrow JF (1981) Repeated DNA sequences near the 5'-end of the silk fibroin gene. J Biol Chem 256:4033-4041
Stuve LL, Myers RM (1990) A directly repeated sequence in the β-globin promoter regulates transcription in murine erythroleukemia cells. Mol Cell Biol 10:972-981
Takeishi K, Gotoh O (1982) Computer analysis of the sequence relationships among 4.5S RNA molecular species from various sources. J Biochem 92:1173-1177
Tautz D, Renz M (1984) Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res 12:4127-4138
Tsujimoto Y, Suzuki Y (1979a) Structural analysis of the fibroin gene at the 5' end and its surrounding regions. Cell 16:425-436
Tsujimoto Y, Suzuki Y (1979b) The DNA sequence of Bombyx mori fibroin gene including the 5' flanking, mRNA coding, entire intervening and fibroin protein coding regions. Cell 18:591-600
Ullu E, Tschudi C (1984) Alu sequences are processed 7SL RNA genes. Nature 312:171-172
Yanisch-Perron C, Vieira J, Messing J (1985) Improved M13 phage cloning vectors and host strains: nucleotide sequences of the M13 mp18 and pUC19 vectors. Gene 33:103-119

Received September 4, 1991/Revised February 5, 1992
