Cell, Vol. 18, 771-779, November 1979, Copyright © 1979 by MIT

Nucleotide Sequence and Genetic Organization of the Polyoma Late Region: Features Common to the Polyoma Early Region and SV40

Prescott Deininger, Abby Esty, Patricia LaPorte and Theodore Friedmann
Department of Pediatrics
University of California, San Diego
La Jolla, California 92093

Summary

The nucleotide sequence of the late region of the polyoma genome has been determined. It consists of 2366 bp and encodes the virion capsid proteins VP1, VP2 and VP3. Extensive open reading frames identify the possible coding sequences of VP2 and VP3 toward the 5’ end of the late region, and of the major capsid protein VP1 toward the 3’ end of the late region. The 5’ end of the sequence encoding VP1 overlaps the 3’ end of the VP2/VP3 region by 29 nucleotides and is in a different reading frame. The predicted amino acid sequences for all three known capsid proteins show extensive homology with the analogous capsid proteins of SV40 throughout most of their length. The VP2/VP3 amino acid homology between the two viruses is 34%, while the major capsid protein VP1 is much more highly conserved, showing 54% homology. These homologies, together with the extent of open reading frames, help to define the extent of the coding sequences. The VP2 initiator begins at position 269 and the coding region extends to the first termination codon beginning at 1226. The predicted size of VP2 is 35,007 daltons. A probable VP3 initiator is within the VP2 coding sequence at position 614 and is in the same frame as VP2. This coding sequence can also utilize the terminator at position 1226, and the predicted size of the VP3 translation product is 22,979 daltons. The VP1 coding region begins at position 1197 and continues in a frame different from that of VP2/VP3 to a termination point at 2349. The molecular weight of VP1 is predicted to be 42,634 daltons. The 5’ untranslated region contains sequences that resemble a potential ribosomal binding site and a possible mRNA capping sequence similar to those found in other eucaryotic systems. There is also a sequence (5’-TCAAGTAAGTGA-3’) almost identical to one found in two regions containing potential splice sites in the early region of polyoma.
The 5’ untranslated region does not show the extensive repeated sequences found in the similar region of SV40. The 3’ untranslated region contains the sequence 5’-AATAAA-3’, thought to represent a polyadenylation signal. As in the early region of polyoma, the extensive nucleotide and deduced amino acid homology with SV40 indicates a close evolutionary relationship between the two viruses, and helps to identify regions of common and important structure-function relationships.

Introduction

The availability of methods for the rapid determination of nucleotide sequences (Maxam and Gilbert, 1977; Sanger, Nicklen and Coulson, 1977) has begun to allow the deduction of important organizational and structural features of some large and relatively complex genomes and their encoded gene products. The phenomenon of RNA splicing (Crick, 1979) has at the same time revealed immense organizational complexity and versatility in eucaryotic genomes, including those of the small DNA tumor viruses. Many of the organizational features of the SV40 genome have been clarified from the determination of its complete nucleotide sequence (Fiers et al., 1978; Reddy et al., 1978), and similar studies have recently been published on the early region of the polyoma genome (Soeda and Griffin, 1978; Friedmann, LaPorte and Esty, 1978a; Friedmann, Doolittle and Walter, 1978b; Friedmann et al., 1979; Soeda et al., 1979). The late region of polyoma encodes the capsid proteins, and features of the location of the late functions are known in some detail (Fried and Griffin, 1977). The genetically related capsid proteins VP2 and VP3 are encoded at the 5’ end of the late region and are translated from separate 19S and 18S messenger RNA molecules, respectively. The major capsid protein VP1 is encoded at the 3’ end of the late region and is translated from a 16S messenger RNA. The three species of mRNA have a common series of heterogeneous leader sequences of up to approximately 70 bases at their 5’ end (Legon et al., 1979). The leaders for two of the mRNA species (VP3 and VP1) are joined to noncontiguous bodies by mRNA splicing, while no splicing event has yet been identified between the leader and the body of the message during the formation of the VP2 mRNA. Multiple leader sequences have been found on at least some of those messages, however, probably due to head-to-tail joining of one leader sequence to another (Legon et al., 1979).
All three species of mRNA have the same family of heterogeneous cap structures at their 5’ ends (Flavell et al., 1979). In this regard the genetic organization of polyoma closely resembles that of SV40. An additional feature of the SV40 genome is the overlapping region containing the 3’ ends of the VP2/VP3 genes and the 5’ end of the VP1 gene, where two different reading frames are utilized simultaneously (Fiers et al., 1978; Reddy et al., 1978). We have now determined the complete nucleotide sequence of the late region of polyoma. Together with the previous determination of the early region sequence (Friedmann et al., 1979), this report completes the nucleotide sequence determination for the genome of strain 3 large plaque polyoma.

Results

Figure 1 illustrates the approach we used to determine the sequence of the late region. The direction and extent of the arrows indicate the direction and extent of sequence determination from a labeled restriction enzyme site. Solid lines indicate sequence determined by the chemical method of Maxam and Gilbert (Maxam and Gilbert, 1977); dashed arrows indicate sequences determined by the dideoxy method of Sanger (Sanger et al., 1977). A number of the arrows represent sequence determined from both the 5’ and 3’ strands of the DNA molecule. Figure 2 gives, in single-stranded form, the nucleotide sequence of the entire late region of polyoma. The late region is defined to be the region from the nucleotide immediately following the T8A sequence near the origin of DNA replication (Friedmann et al., 1978a) to the end of the putative polyadenylation sequence 5’-AATAAA-3’ on the 3’ side of the VP1 terminator. The strand shown in Figure 2 has the same 5’ → 3’ polarity as the late mRNAs. Major restriction enzyme sites used during the sequence determination are indicated. Figure 3 illustrates the positions in all three potential reading frames of the three termination codons, and demonstrates the existence of two major open reading frames: one at the 5’ end in frame 2 encoding proteins VP2 and VP3, and an overlapping region in frame 3 at the 3’ end encoding the major capsid protein VP1. The VP2/VP3 coding area extends from position 269 through 1226, and overlaps by 29 nucleotides the VP1 coding sequence extending from positions 1197 to 2349. Table 1 illustrates the codon usage for the three separate late gene products. There is the expected paucity (for a eucaryotic DNA) of CG doublets, as shown by the rarity of the CGX codons for Arg compared with AGA or AGG, and the rarity of corresponding CG-containing Ser, Pro, Thr and Ala codons. Table 2 summarizes the deduced composition of all three capsid proteins and gives the predicted molecular weights for the final products without consideration of possible post-translational modifications.
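The coordinates quoted in this section can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes 1-based nucleotide positions, treats each end coordinate as the first base of the termination codon (the text states this explicitly only for VP2), and numbers reading frames as ((start - 1) mod 3) + 1. The frame-numbering convention and all function names are assumptions made for the example, not taken from the paper.

```python
# Coordinates as quoted in the text (1-based). Each "stop" is assumed to be
# the first nucleotide of the termination codon.
CODING_REGIONS = {
    "VP2": (269, 1226),
    "VP3": (614, 1226),
    "VP1": (1197, 2349),
}

def reading_frame(start):
    """Frame 1, 2 or 3 under the assumed convention ((start - 1) mod 3) + 1."""
    return (start - 1) % 3 + 1

def codon_count(start, stop):
    """Number of sense codons from the initiator up to (not including) the terminator."""
    length = stop - start
    if length % 3 != 0:
        raise ValueError("start and stop are not in the same reading frame")
    return length // 3

def overlap(upstream, downstream):
    """Nucleotides shared by two regions, not counting the upstream terminator."""
    return max(0, upstream[1] - downstream[0])

for name, (start, stop) in CODING_REGIONS.items():
    print(name, "frame", reading_frame(start), "codons", codon_count(start, stop))
print("VP2/VP3 and VP1 overlap:",
      overlap(CODING_REGIONS["VP2"], CODING_REGIONS["VP1"]), "nucleotides")
```

Under these assumptions the sketch places VP2 and VP3 in frame 2 and VP1 in frame 3, and gives a 29-nucleotide overlap, in agreement with the text. The codon counts include the initiator methionine.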

[Figure 1: restriction map of the late region with sequencing arrows; axes: map units and nucleotide number]

Figure 1. Strategy for Sequencing the Polyoma Late Region

The arrows indicate the direction and approximate extent of sequence determination from the indicated restriction enzyme sites, and the numbers refer to the restriction enzyme fragment numbers. Sequences determined by the chain termination method (Sanger et al., 1977) are indicated by dashed lines. Starred arrows from Mbo I and Hpa II sites indicate regions where both strands were sequenced from the same sites by both 5’ and 3’ end labeling. The double asterisk on the arrow from the Alu 3/12 junction also indicates determination of the sequence in both strands by 5’ and 3’ end labeling from the Hind III site corresponding to the Alu 3/12 site. A total of 95% of the sequence has been determined on both strands.


The portion of the sequence corresponding to the overlap of VP2/VP3 with VP1 for both polyoma and SV40 is shown in Figure 4. The overlap in polyoma extends for 29 bp, in contrast to the 122 bp overlap in SV40. The conserved nucleotide sequence has resulted in extensive amino acid conservation for the C terminal region of both the VP2/VP3 genes and the VP1 genes. Figure 5 shows the nucleotide and deduced amino acid homologies between the polyoma and SV40 late regions and the encoded capsid proteins. The sequences have been aligned visually to optimize the deduced amino acid homologies in the coding regions. In the VP2/VP3 region, positions of nucleotide identity constitute 48% of all positions, while in the VP1 coding region 52% of the nucleotide positions are identical with corresponding positions in SV40. Within the same region, the homologous amino acid positions are 34 and 54% for the VP2/VP3 and VP1 coding regions, respectively.

Discussion

5’ Untranslated Region

A striking feature of the polyoma early region, and even more so of the SV40 early and late regions, is the presence of extensive and multiple repeated sequences in the 5’ untranslated portions of the genome (Subramanian, Dhar and Weissman, 1977). This feature seems to be absent from the polyoma late region, although the polyoma early region does have repeated sequences which are shorter and less extensive than those in SV40 (Friedmann et al., 1978a). The capped 5’ ends of the three late polyoma mRNA species have been characterized by Flavell et al. (1979). There are identical families of heterogeneous leader sequences of different lengths attached to each

Figure 3. Positions of the TAG, TAA and TGA Terminators in All Reading Frames

The positions of the terminators for each of the three frames are plotted relative to map unit and nucleotide number. The approximate position of the mRNA leader sequence is marked. The coding regions available for VP1, VP2 and VP3 are labeled and marked with heavy lines.
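A terminator map of this kind can be produced by scanning a sequence for TAG, TAA and TGA and sorting each hit by reading frame. The following is a minimal sketch under stated assumptions: positions are 1-based, a triplet starting at position p is assigned to frame ((p - 1) mod 3) + 1, and the short test sequence is invented for illustration (it is not polyoma sequence). The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions_by_frame(seq):
    """Return {frame: [1-based position of each terminator's first base]}."""
    seq = seq.upper()
    frames = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            frames[i % 3 + 1].append(i + 1)
    return frames

def longest_open_stretch(seq, frame):
    """Longest terminator-free run in `frame`, as (first base, position of the
    closing terminator or of the end of the last complete codon + 1)."""
    stops = stop_positions_by_frame(seq)[frame]
    n_codons = (len(seq) - (frame - 1)) // 3
    end = frame + 3 * n_codons
    best = (frame, frame)
    begin = frame
    for stop in stops + [end]:
        if stop - begin > best[1] - best[0]:
            best = (begin, stop)
        begin = stop + 3  # resume after the terminator
    return best

# Invented toy sequence, used only to exercise the functions.
TOY = "ATGAAATAGCCCTGA"
print(stop_positions_by_frame(TOY))
for frame in (1, 2, 3):
    print("frame", frame, "longest open stretch", longest_open_stretch(TOY, frame))
```

Applied to a full late-region sequence, the per-frame lists correspond to the tick marks plotted for each frame, and the long terminator-free stretches correspond to the candidate coding regions.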

Figure 2. Nucleotide Sequence of the Polyoma Late Region

The strand presented has the same 5’ → 3’ orientation as the late polyoma messenger RNAs. The major restriction cleavage sites used in the sequencing procedure are indicated.


Table 1. Codon Utilization in Probable Coding Regions

1st   ------- T -------   ------- C -------   ------- A -------   ------- G -------   3rd
      aa    VP2 VP3 VP1   aa    VP2 VP3 VP1   aa    VP2 VP3 VP1   aa    VP2 VP3 VP1

T     Phe     4   2   5   Ser     4   1   2   Tyr     5   4   8   Cys     0   0   5   T
T     Phe     3   3   4   Ser     2   1   ?   Tyr     4   3   3   Cys     0   0   1   C
T     Leu     3   1   1   Ser     ?   ?   ?   Term    0   0   0   Term    0   0   0   A
T     Leu     2   1   1   Ser     2   0   ?   Term    0   0   0   Trp     7   6   6   G

(? = value not legible)

C     Leu     7   5   6   Pro     6   6   4   His     6   6   4   Arg     1   1   0   T
C     Leu     6   3   9   Pro     1   1  16   His     3   2   1   Arg     1   1   2   C
C     Leu    12   8   4   Pro     6   6  10   Gln     6   5   6   Arg     1   0   0   A
C     Leu     6   3   9   Pro     0   0   2   Gln     9   7   6   Arg     4   3   0   G

A     Ile     6   1   4   Thr     9   3   6   Asn     9   6  10   Ser     3   1   3   T
A     Ile     3   2   4   Thr     3   2   7   Asn     3   2  10   Ser     2   2   5   C
A     Ile     9   7   3   Thr     9   5  24   Lys     1   1  18   Arg     8   7  12   A
A     Met     8   4  13   Thr     1   0   1   Lys     2   2   9   Arg     3   3   2   G

G     Val     4   3   6   Ala    10   5   3   Asp     9   0  12   Gly    10   7   2   T
G     Val     4   2   7   Ala    10   5   5   Asp     3   2  10   Gly     5   4  10   C
G     Val     6   4   8   Ala     6   5   4   Glu    12   7  12   Gly     9   4   8   A
G     Val    10   6  14   Ala     2   1   0   Glu     9   3  12   Gly     6   4  13   G

The probable coding regions are those depicted in Figure 3. VP2 extends from the Met codon at 288 to the terminator at position 1226. The VP3 coding region extends from 614 to 1226, and the VP1 coding region overlaps the VP2,3 coding region in a different reading frame, extending from 1200 to 2349.

Table 2. Predicted Capsid Protein Amino Acid Compositions

                            VP2      VP3      VP1
Phe                           7        5        9
Leu                          36       21       32
Ile                          18       10       10
Met                           8        4       12
Val                          26       15       35
Ser                          21       12       18
Pro                          15       13       32
Thr                          22       10       39
Ala                          30       16       13
Tyr                           9        7       12
His                           9        8        6
Gln                          15       12       12
Asn                          12        6       20
Lys                           3        3       27
Asp                          12       10       22
Glu                          21       10       24
Cys                           0        0        6
Trp                           7        6        6
Arg                          18       15       16
Gly                          30       19       33
Total molecular weight   35,007   22,979   42,834

The protein compositions described are based on the proposed coding regions in Table 1.
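A molecular weight can be estimated from an amino acid composition by summing residue masses and adding one water for the free termini. The sketch below is a rough check, not the authors' calculation: the paper does not say which mass values were used, so the standard average residue masses here are an assumption, and the compositions are the VP2 and VP1 columns of Table 2 as read from the extracted text (the row giving 9, 7 and 12 is taken to be Tyr).

```python
# Standard average residue masses in daltons (amino acid minus one water).
# Assumed values; the source does not state which masses were used.
RESIDUE_MASS = {
    "Gly": 57.05, "Ala": 71.08, "Ser": 87.08, "Pro": 97.12, "Val": 99.13,
    "Thr": 101.10, "Cys": 103.14, "Leu": 113.16, "Ile": 113.16, "Asn": 114.10,
    "Asp": 115.09, "Gln": 128.13, "Lys": 128.17, "Glu": 129.12, "Met": 131.19,
    "His": 137.14, "Phe": 147.18, "Arg": 156.19, "Tyr": 163.18, "Trp": 186.21,
}
WATER = 18.02  # added once per polypeptide chain

# Residue counts from Table 2 (VP2 and VP1 columns).
VP2 = {"Phe": 7, "Leu": 36, "Ile": 18, "Met": 8, "Val": 26, "Ser": 21, "Pro": 15,
       "Thr": 22, "Ala": 30, "Tyr": 9, "His": 9, "Gln": 15, "Asn": 12, "Lys": 3,
       "Asp": 12, "Glu": 21, "Cys": 0, "Trp": 7, "Arg": 18, "Gly": 30}
VP1 = {"Phe": 9, "Leu": 32, "Ile": 10, "Met": 12, "Val": 35, "Ser": 18, "Pro": 32,
       "Thr": 39, "Ala": 13, "Tyr": 12, "His": 6, "Gln": 12, "Asn": 20, "Lys": 27,
       "Asp": 22, "Glu": 24, "Cys": 6, "Trp": 6, "Arg": 16, "Gly": 33}

def predicted_mass(composition):
    """Unmodified polypeptide mass in daltons from residue counts."""
    return sum(RESIDUE_MASS[aa] * n for aa, n in composition.items()) + WATER

for name, comp in (("VP2", VP2), ("VP1", VP1)):
    print(name, sum(comp.values()), "residues,",
          round(predicted_mass(comp)), "daltons (estimate)")
```

The residue totals for these two columns (319 and 384) match the number of codons between the quoted initiator and terminator positions. With the assumed masses the estimates should fall close to, but not exactly on, the tabulated molecular weights.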

Figure 4. Overlap between the Polyoma VP2, VP3 and VP1 Proteins and Their Homology with SV40

[Figure 4: aligned polyoma and SV40 nucleotide sequences of the overlap region, with the VP2/VP3 and VP1 translations]

The region of polyoma which has been deduced from the DNA sequence to represent overlap between the coding regions for the VP2 and VP3 proteins and VP1 is presented with translations. The overlap involves 29 nucleotides and is aligned with a portion of the SV40 sequence representing a portion of the 122 nucleotide overlap region in SV40. Lines are drawn to show those nucleotides which are identical between the genomes in this region, and the conserved amino acids are enclosed in boxes.

of the late messages; in the case of VP3 and VP1 mRNAs, this is done by presumed RNA splicing mechanisms. The leaders are derived from a portion of the genome between 67 and 65.5 map units (Legon et al., 1979). More specifically, the leaders extend from approximately 62 to 94 nucleotides on the 5’ side of the VP2 initiator at position 269 (Figure 2). It must be remembered, of course, that the identification of the mRNA cap sites was carried out with the A2 strain of polyoma, which is different from the strain used in this study.

The sequence 5’-AAAATTA-3’ is at position 205, and the sequence 5’-ATTTTAA-3’ is at 187. These AT-rich clusters are similar to the sequence 5’-TATAAAA-3’ located ±25 nucleotides to the 5’ side of a prototypical capping sequence found in a variety of eucaryotic genomes (S. Goldberg, personal communication) and in the early region of polyoma (Friedmann et al., 1979). It is not certain that either of these AT-rich clusters constitutes a promoter, and in fact a similar region in SV40, 5’-TTATTT-3’, lies within the cap structure of the late mRNAs. At least for SV40, such an AT cluster may therefore be distinct from the promoter and be related instead to the site of mRNA capping (Ghosh et al., 1978).

As is the case in the polyoma early region (Friedmann et al., 1979), there is a sequence, 5’-CCTCCGCCATC-3’ at position 225, that closely resembles the so-called capping sequences found in β-globin, immunoglobulin and late adenovirus genes +25 nucleotides 3’ to the possible AT-rich promoter (Konkel, Tilghman and Leder, 1978). Unlike the sequence in the early polyoma region, it is not possible to identify a stable potential hairpin loop containing this putative capping sequence; nor can this sequence overlap with structures if the latter do extend 94 bases 5’ to the VP2 initiator at position 269 (Konkel et al., 1978; Flavell et al., 1979). In the same location as the possible capping sequence is a region that can interact in a stable hypothetical structure with a portion of the 3’ end of the 18S ribosomal RNA (Hagenbuchle et al., 1978) in the following way:

    Py (from position 222)   5’ ----GGCCCTCCGCC----
    18S RNA                  3’ HO-AUU--ACUAGGAAGGCGUCC-----

This interaction, according to the Tinoco rules (Tinoco et al., 1973), shows a cumulative interaction of -16.2 kcal, and is therefore stable enough to constitute a possible ribosomal binding site.

The heterogeneous polyoma leader sequences are known to extend to just beyond the Mbo I site beginning at position 246, just before the initiation triplet for protein VP2. In the case of the VP2 message, no evidence exists for splicing between the leader and the message body, while there are splices in the messenger RNAs for VP3 and VP1 (Legon et al., 1979). The leader sequences for all three species of mRNA show a similar distribution (Flavell et al., 1979). This observation, combined with analysis of prototype splicing sequences (Seif, Khoury and Dhar, 1979), indicates that there may be only one sequence used as the proximal splice site for the VP1 and VP3 messages. Beginning at position 248 is the sequence 5’-TCAA↓GTAAGTGA-3’, almost identical to sequences found at the putative early polyoma splice sites (Friedmann et al., 1979) in the form 5’-TCCAA↓GTAAGA-3’, with G and T occurring as alternative bases. The position of the arrow indicates the site of potential scission during mRNA splicing that conforms to the general prototypic splice site sequences in eucaryotic systems (Seif et al., 1979). mRNA sequencing will be necessary to locate unambiguously the position of the splicing sites.

Coding Regions: VP2, VP3

Genetic evidence exists that VP2 and VP3 are related to each other and that the nucleotide sequence encoding VP3 lies entirely within that for VP2 (Fried and Griffin, 1977). Analysis of labeled late region restriction enzyme fragments after hybridization with late viral mRNA has shown that at least two major splicing events occur during late gene expression, with the result that all three late messages contain leader sequences derived from the region from map positions 67 to 65.5. Leaders often appear to be arranged in a tandem repetition, possibly due to a head-to-tail joining event (Legon et al., 1979). Attached directly to these leaders is the possibly contiguous transcript leading directly to the initiator for VP2 at map position 65.2. On the other hand, the leaders are attached to a noncontiguous VP3 body resulting from the excision of an intervening sequence of approximately 300 bp (Legon et al., 1979). These data are consistent with predictions of coding regions made from nucleotide sequence data presented here (Figure 3) and from the identification of probable translation products through their strong homology with those of SV40. The ATG triplet beginning at position 269 initiates the potential translation product Met Gly Ala Ala Leu Thr, a sequence identical to that predicted at the N terminus of SV40 VP2 (Fiers et al., 1978; Reddy et al., 1978) and lying close to map position 65, the known location of the VP2 initiator. The ATG triplet at 269 is just 12 bases into the large open reading frame believed to be used for VP2 and VP3, and appears just 3’ to the sequence 5’-TCAAGTAAGTGT-3’ (beginning at position 248), which shows marked similarity to the regions surrounding the probable early region splice sites of polyoma. It is also close to a region containing a possible ribosomal binding site and a possible mRNA capping signal as described above. A potential proximal splice site at position 251 has been described above. This poten-


Figure 5. Comparison of the Polyoma and SV40 Late Regions and Their Translation Products

The nucleotide sequences of the polyoma and SV40 genomes are presented in the same orientation as the messenger (5’ → 3’). The sequences are aligned to produce maximum homology and the positions where the nucleotides align are connected by lines. The region before the VP2 initiator at polyoma position 216 is aligned very arbitrarily, since there were no major regions of detectable homology. (For that reason we have not drawn any lines to indicate homology.) The VP3 initiator is at polyoma position 614, in the same frame as that for VP2. Both the polyoma VP2 and VP3 proteins
