J. Mol. Biol. (1976) 107, 445-458

Gene F of Bacteriophage φX174. Correlation of Nucleotide Sequences from the DNA and Amino Acid Sequences from the Gene Product

G. M. AIR, E. H. BLACKBURN†, A. R. COULSON, F. GALIBERT‡, F. SANGER, J. W. SEDAT§ AND E. B. ZIFF‖

Medical Research Council, Laboratory of Molecular Biology
Hills Road, Cambridge CB2 2QH, England

(Received 6 May 1976)

Application of three different methods to obtain nucleotide sequences from φX174 DNA has yielded a continuous sequence of 281 nucleotides and partial sequences totalling another 263 nucleotides from adjoining regions. Simultaneous work on amino acid sequences of φX174 coat proteins showed that these nucleotide sequences are part of the F gene, coding for the largest coat protein of the bacteriophage. A total of 136 codons has been identified, of which 56% have thymine as the base in the third position of the coding triplet.

1. Introduction

(a) The nucleotide sequences

In developing methods to determine nucleotide sequences in φX174 DNA several approaches have been used in this laboratory. Essentially they are (1) partial degradation of 32P-labelled single-stranded DNA with endonuclease IV or the depurination reaction (Burton & Petersen, 1960) to obtain small oligonucleotides which can then be sequenced by partial digestion with spleen and snake venom phosphodiesterases (Ling, 1972; Robertson et al., 1973; Ziff et al., 1973; Galibert et al., 1974); (2) transcription of the DNA with RNA polymerase and sequencing the 32P-labelled RNA products (Blackburn, 1975,1976); (3) copying regions of the DNA using DNA polymerase with a specific primer to give short stretches of radioactive DNA. Two approaches have been used to determine such sequences: (a) incorporation of a ribonucleotide into the DNA, thus providing sites for ribonuclease or alkali cleavage to give oligonucleotides of suitable length for sequencing (Sanger et al., 1974); (b) the "plus and minus" method (Sanger & Coulson, 1975).

† Present address: Department of Biology, Yale University, New Haven, Conn. 06520, U.S.A.
‡ Present address: Laboratoire d'Hématologie Expérimentale, Centre Hayem, Hôpital Saint-Louis, Paris X, France.
§ Present address: Department of Radiobiology, Yale University Medical School, New Haven, Conn. 06520, U.S.A.
‖ Present address: Rockefeller University, 1230 York Avenue, New York, N.Y. 10021, U.S.A.


In the work summarized here these methods have been used to obtain nucleotide sequences from a region comprising about 10% of the φX174 genome.

Ziff et al. (1973) showed that when the single-stranded DNA from φX174 that had been labelled with 32P in vivo was digested under mild conditions with endonuclease IV and the products fractionated by ionophoresis on acrylamide gel, a number of relatively pure fragments could be isolated. Methods were developed for studying the sequences of such fragments of 32P-labelled DNA and the first sequence to be completed was that of band 6, which was 48 nucleotide residues long (Galibert et al., 1974). These studies have been extended to some of the larger bands and this work is reported in an accompanying paper (Sedat et al., 1976). It was found that all the fragments formed by limited endonuclease IV digestion (bands 2, 3, 5A, 5B, 6, 7 and 8) came from the same region of the DNA, and their relative order was found to be as shown in Figure 5 of Sedat et al. (1976). The complete sequence of bands 5A and 5B, most of band 3, and fragmentary sequences of band 2 have been determined.

Another approach to DNA sequencing is by transcription of unlabelled DNA with RNA polymerase in the presence of 32P-labelled triphosphates and studying the synthesised RNA by techniques that are already available. Blackburn (1975,1976) has used this method on endonuclease IV bands 6 and 3, and the results confirm and supplement the data obtained by direct sequencing of these bands.

DNA may also be sequenced by copying with DNA polymerase and a method has been developed for determining specific sequences by using suitable primers (Sanger & Coulson, 1975). Either synthetic oligonucleotides or fragments from restriction enzyme digests can be used as primers and the technique described is rapid and simple. A sequence of 70 residues could be predicted by priming with Hind fragment 1 (Sanger & Coulson, 1975). This sequence corresponds to the 3' end of endonuclease IV band 3. From the combined results of the three methods it was possible to deduce the complete sequence of band 3, although no one method alone was sufficient.

(b) Localization of the endonuclease IV bands in fragments from restriction enzyme digestion

In order to relate the oligonucleotides obtained by endonuclease IV partial digestion to restriction enzyme fragments, one of us (E. H. Blackburn) has studied transcription products of the fragments produced by digestion with restriction endonucleases from Haemophilus influenzae (HindII and HindIII) (Smith & Wilcox, 1970; Edgell et al., 1972) and Haemophilus aegyptius (HaeIII) (Middleton et al., 1972). Fragments Hind 1, Hind 6b, Hae 4 and Hae 8 were transcribed and the products compared with products obtained from the endonuclease IV bands. This established the relationships summarised in Figure 2.

(c) Localization of nucleotide sequences in the genetic map

The preliminary map of the restriction enzyme fragments relative to the genetic map of φX174 (Chen et al., 1973; Sedat et al., 1976) suggested that the sequence of the endonuclease IV fragments would probably code for the gene E protein or near it. However, Air (1976) has been studying the amino acid sequences of the structural proteins of φX174 and as soon as sufficient amino acid and DNA sequences were available it was clear that sequences in the largest coat protein (the product of gene F) corresponded according to the genetic code to DNA sequences in the endonuclease IV fragments. This was amply confirmed by further work. The 5' end of


endonuclease IV band 5A corresponded to the third amino acid residue of the F protein, thus establishing an exact relationship between the genetic map and the DNA fragments from both restriction enzyme and endonuclease IV digests.

2. Results

The detailed methods and results of the DNA and protein sequence work referred to in this paper are given in preceding papers (Galibert et al., 1974; Sanger & Coulson, 1975; Blackburn, 1975) and in the accompanying papers (Air, 1976; Blackburn, 1976; Sedat et al., 1976). We give here only the additional experiments used to correlate the sequence data.

(a) Location of endonuclease IV bands in restriction enzyme fragments

Endonuclease IV band 3 contains the sequence G-G-C-C (Sedat et al., 1976) which, in double-stranded DNA, is recognised and cleaved by endonuclease HaeIII. φX174 duplex DNA was therefore predicted to be cleaved at this point by the Hae enzyme. Endonuclease IV band 2 contains the sequence G-T-T-A-A-C near the 5' terminus, hence a junction of two Hind fragments was expected at this position. To ascertain which Hae and Hind fragments these were, use was made of a number of relatively long and characteristic polypyrimidine tracts present in the endonuclease IV band sequences. In order to identify these tracts in the corresponding restriction enzyme fragments and hence to line up the three sets of fragments, purified restriction enzyme fragments were transcribed with RNA polymerase and the products digested with ribonuclease A (Fig. 1 and Table 1). The large products (polypurine tracts) were then studied by ribonuclease T1 digestion. For any polypyrimidine tract present in the double-stranded restriction enzyme fragments there will be a complementary ribonuclease A (polypurine) product. Thus the correspondence between the transcripts of restriction fragments and polypyrimidine tracts in the endonuclease IV bands was used to establish overlaps as shown in Figure 2. Table 1 shows that Hind fragment 1 contains the polypyrimidine tracts A', B' and C', while the sequences D', E' and F' are found in Hind 6a or 6b. Fragment Hae 4 contains the sequences A', B', C' and D'.
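The logic of this alignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test sequence is hypothetical and is not taken from the paper, and the helper names (`find_sites`, `pyrimidine_tracts`, `transcript_of`) are invented for the sketch. It locates the HaeIII sequence G-G-C-C and the Hind sequence G-T-T-A-A-C quoted above, finds long pyrimidine runs, and writes the complementary RNA, which consists only of purines and would therefore be expected to survive ribonuclease A digestion (ribonuclease A cleaves after pyrimidine residues) as a large product.

```python
import re

def find_sites(seq, site):
    """Return the 0-based start of every (possibly overlapping) occurrence of site."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]

def pyrimidine_tracts(seq, min_len=5):
    """Return (start, tract) for each run of C/T of at least min_len residues."""
    return [(m.start(), m.group()) for m in re.finditer(f"[CT]{{{min_len},}}", seq)]

def transcript_of(tract):
    """RNA complementary to a DNA tract, written 5' to 3'."""
    pair = {"A": "U", "C": "G", "G": "C", "T": "A"}
    return "".join(pair[base] for base in reversed(tract))

# Hypothetical sequence for illustration only; it is not taken from the paper.
seq = "ACGTGGCCTTGCTATTGACTCTCCTTTCAGTTAACGA"

hae_sites = find_sites(seq, "GGCC")      # HaeIII recognition sequence
hind_sites = find_sites(seq, "GTTAAC")   # the Hind site quoted in the text
tracts = pyrimidine_tracts(seq)
purine_products = [transcript_of(tract) for _, tract in tracts]
print(hae_sites, hind_sites, tracts, purine_products)
```

A tract found in an endonuclease IV band and the matching polypurine product found in the transcript of a restriction fragment place the band inside that fragment, which is the inference drawn from Table 1.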
In addition, analysis of ribonuclease T1 oligonucleotides in transcripts of the restriction fragments Hae 8, Hae 4 and Hind 6a + 6b helped in assigning the endonuclease IV bands to restriction fragments. The sequence data on these oligonucleotides are given in Table 2 and the results included in Figure 2. In these experiments the Hind fragments 6a and 6b were not separated, but the Hind cleavage map of φX174 shows that the fragments containing the endonuclease IV bands can only be 6b (Jeppesen et al., 1976).

(b) Location of the endonuclease IV bands in the genetic map

A genetic map of φX174 has been deduced by Benbow et al. (1972,1974) from genetic data and estimation of the sizes of the proteins coded by φX174. A cleavage map of the φX174 genome is also known in which the fragments produced by the restriction endonucleases HaeIII and HindII are accurately related to each other (Lee & Sinsheimer, 1974; Jeppesen et al., 1976). The amino acid and nucleotide sequences summarized here show that the 5' nucleotide of endonuclease IV band 5A is the third nucleotide of the triplet coding for amino acid residue number 3 of the


FIG. 1. Pancreatic RNAase fingerprints of transcripts of denatured fragments of φX174 replicative form DNA. The template fragments were produced by cutting the DNA with restriction endonuclease HindII and HindIII (Smith & Wilcox, 1970; Edgell et al., 1972) or HaeIII (Middleton et al., 1972). Each DNA fragment was denatured and transcribed with Escherichia coli RNA polymerase, as described previously (Air et al., 1975). Transcripts were labelled by incorporation of [α-32P]ATP (shown for Hind fragment 1) and [α-32P]GTP (shown for fragments Hae 4 and Hind 6a + 6b). The synthesised RNA was digested with pancreatic RNAase and the products fractionated in 2 dimensions. The lettered spots correspond to depurination products sequenced in endonuclease IV bands.


[Figure 2: schematic map with rows labelled Endonuclease IV bands, Depurination products, Ribonuclease T1 products, Hind fragments and Hae fragments.]

FIG. 2. Schematic representation of the positions of the large depurination products (A' to F') whose complements (A to F) were identified in transcripts of restriction fragments (Table 1) and of the large ribonuclease T1 products (Table 2). The positions of the endonuclease IV bands relative to restriction fragments are shown.

F gene product. Having established the relationships between the endonuclease IV bands and the Hind and Hae restriction fragments as above, the genetic map and the Hind/Hae cleavage map can be superimposed as shown in Figure 3. The alignment in the first parts of gene F, as above, and gene G (Air et al., 1975) is exact to the nucleotide level. The amino acid sequence data for the F protein (Air, 1976) and G protein (Air et al., 1975, 1976) indicate that the molecular weights estimated by gel electrophoresis in the presence of sodium dodecyl sulphate are close to correct. The total length of the J protein is reported to be 37 amino acid residues (Edgell & Hutchison, unpublished data). The rest of the genetic map is not precise, since the protein molecular weights are not reliably known. The molecular weights used by Benbow et al. (1974) would require over 20% more DNA than is available, even with no allowance for intergenic sequences. It is known, however, that the origin of replication, in the A gene, is within Hind fragment 3 (Johnson & Sinsheimer, 1973) and that a major promoter site before gene D is in both Hind fragment 6 and Hae


FIG. 3. The relationship between the genetic map (Benbow et al., 1974) and the restriction enzyme cleavage map (Lee & Sinsheimer, 1974; Jeppesen et al., 1976) of φX174 DNA. The gene lengths correspond to the molecular weights of the F and G gene products, otherwise the quantitative alignment is unknown, as explained in the text. From the outermost circle the maps shown are HaeIII (Z), Hind (R), endonuclease IV (2, 3, 6 and 5A), and the genetic map. Promoter sites are indicated P (Chen et al., 1973).


fragment 3 (Chen et al., 1973) and the genetic map in Figure 3 takes these into account. We hope that further DNA and protein sequences will clarify the severe anomalies in the H-A-B-C-D-E region of the map.

(c) Location of amino acid sequences in the genetic map

The amino acid sequence data on the gene F (capsid) protein are fragmentary (Air, 1976) but, when considered in conjunction with the nucleotide sequences, many overlaps are provided. The overall fitting together of the DNA and protein sequences is shown in Figure 4. As explained above, the 5' end of endonuclease IV band

[Figure 4: linear map with scales in nucleotides and in amino acids; tracks labelled Endo IV, Hae (Z) and Hind (R); marked positions: start of F coding, approximate end of F coding, start of G coding.]

FIG. 4. Alignment of the nucleotide and amino acid sequence data for the F and G genes of φX174 and their products. The solid bars indicate sequences completely known; partial sequences are cross-hatched. The junction between Hind fragments 9 and 1 is known to be in the F-G intergenic region (Sanger & Coulson, 1975; Fiddes, 1976), so there must be some error in the length either of fragment R1 or of the protein.

5A is part of the codon for residue 3 of the amino acid sequence. By combining the results obtained by direct sequencing of the DNA (Galibert et al., 1974; Sedat et al., 1976), transcription (Blackburn, 1976), and using DNA polymerase with Hind fragment 1 as the primer (Sanger & Coulson, 1975), the complete sequence of endonuclease IV bands 5A, 6 and 3 is known. The amino acid sequence is continuous to residue 50, but non-overlapped peptide sequences coded by the DNA sequence are available up to residue 91. Thus the overlaps between amino acid residues 50-51, 56-57 and 74-75 are established by the DNA sequence. Endonuclease IV band 3 would end in residue 94 of the amino acid sequence, and 12 nucleotides into endonuclease IV band 2 have been sequenced from this junction, but no corresponding amino acid sequences have yet been found. Sedat et al. (1976) also have some very preliminary,

FIG. 5. Nucleotide and amino acid sequences from gene F and its product.
(a) The DNA sequence obtained by direct methods of endonuclease IV bands 5A, 6 and 3 (Sedat et al., 1976), and those obtained by other methods, are distinguished in the figure by different styles of underlining: sequences transcribed by RNA polymerase (Blackburn, 1975,1976); sequences obtained by priming with restriction fragment Hind 1 (Sanger & Coulson, 1975); sequences of the polypyrimidine tracts (Ling, 1972; Harbers et al., 1976, and this paper). Sites of cleavage by endonuclease IV, HindII and HaeIII restriction endonucleases are indicated. The amino acid sequences are from Air (1976). Vertical bars indicate where either a DNA or protein sequence was overlapped only by reference to the other. The initial experiment using Hind fragment 1 as primer (Sanger & Coulson, 1975) suggested that the sequence started A-G-T-T-G. A subsequent experiment in which the initial polymerisation was less extensive indicated that this sequence should be A-G-T-G.
(b) The nucleotide and amino acid sequences found for endonuclease IV band 2 (Sedat et al., 1976; Air, 1976). A double vertical bar marks the end of a block of residues; an unknown number of residues may follow before the start of the next block. The blocks probably occur in the order 1-2-3-4-11-6-7-8-9-10, with block 5 immediately to the right of either block 1 or block 4. In block 8 the G residue at position 7 in Table 4 of Sedat et al. (1976) has been deleted, as otherwise the amino acid sequence becomes out of phase.
(c) The nucleotide sequence primed by fragment Hae 1 and the corresponding amino acid sequence.

[Figure 5, sequence content. Underlining, vertical bars and cleavage-site marks are not carried over.]

Panel (a)

Residues 1-4 (amino acid sequence only):
Ser-Asn-Ile-Gln

Residues 5-14:
A-C-T-G-G-C-G-C-C-G-A-G-C-G-T-A-T-G-C-C-G-C-A-T-G-A-C-C-T-T
Thr-Gly-Ala-Glu-Arg-Met-Pro-His-Asp-Leu

Residues 15-24:
T-C-C-C-A-T-C-T-T-G-G-C-T-T-C-C-T-T-G-C-T-G-G-T-C-A-G-A-T-T
Ser-His-Leu-Gly-Phe-Leu-Ala-Gly-Gln-Ile

Residues 25-34:
G-G-T-C-G-T-C-T-T-A-T-T-A-C-C-A-T-T-T-C-A-A-C-T-A-C-T-C-C-G
Gly-Arg-Leu-Ile-Thr-Ile-Ser-Thr-Thr-Pro

Residues 35-44:
G-T-T-A-T-C-G-C-T-G-G-C-G-A-C-T-C-C-T-T-C-G-A-G-A-T-G-G-A-C
Val-Ile-Ala-Gly-Asp-Ser-Phe-Glu-Met-Asp

Residues 45-54:
G-C-C-G-T-T-G-G-C-G-C-T-C-T-C-C-G-T-C-T-T-T-C-T-C-C-A-T-T-G
Ala-Val-Gly-Ala-Leu-Arg-Leu-Ser-Pro-Leu

Residues 55-64:
C-G-T-C-G-T-G-G-C-C-T-T-G-C-T-A-T-T-G-A-C-T-C-T-A-C-T-G-T-A
Arg-Arg-Gly-Leu-Ala-Ile-Asp-Ser-Thr-Val

Residues 65-74:
G-A-C-A-T-T-T-T-T-A-C-T-T-T-T-T-A-T-G-T-C-C-C-T-C-A-T-C-G-T
Asp-Ile-Phe-Thr-Phe-Tyr-Val-Pro-His-Arg

Residues 75-84:
C-A-C-G-T-T-T-A-T-G-G-T-G-A-A-C-A-G-T-G-G-A-T-T-A-A-G-T-T-C
His-Val-Tyr-Gly-Glu-Gln-Trp-Ile-Lys-Phe

Residues 85-91:
A-T-G-A-A-G-G-A-T-G-G-T-G-T-T-A-A-T-G-C-C
Met-Lys-Asp-Gly-Val-Asn-Ala

Beyond residue 91 (nucleotide sequence only):
A-C-T-C-C-T-C-T-C-C-C-G-A-C-T-G-T-T-A-A

Panel (b)

The block layout and the alignment of nucleotide runs with peptide residues (numbered B1 onwards) are not recoverable. Legible nucleotide runs, in order of appearance:
A-A-T-G-A-G; A-T-G-A-T-G-C; C-C-G; T-A-T-G-G; C-A-T-C-T; C-A-T-T-T-G-G-A-C-T-G-C-T-C-C-G-C-T-T-C-C-T-C-C-T; A-G-A-C-T-G-A-G-C-T-T-T-C-T-C-G-C-C; T-T-G-C-T-G-C; C-A-A-A-A; T-C-G-T; G-C-T-A-A-C-C-C-T; C-A-A-A-C-T-A-C-T-G-G-T-T-A-T; A-T-T-G-A

Panel (c)

A-C-C-G-T-C-C-T-T-T-A-C-T-T-G-T-C-A-T-G-C-G-C-T-C-T-A-A-T-C-T-C-T-G-G-G-C-A-T-C-T-G-G-C-T-A-T-G-A-T-G-T-T-G-A-T-G-G-A-A-C*-T-G-A
Arg-Pro-Leu-Leu-Val-Met-Arg-Ser-Asn-Leu-Trp-Ala-Ser-Gly-Tyr-Asp-Val- ... -Asp
(residue markers C40, C50 and C60 are shown)


ordered but not overlapped, sequences from endonuclease IV band 2 which total about 250 nucleotides. None of the peptide sequences available can be matched to the first half of this band, but in the second half the nucleotide and amino acid sequences provide mutual overlaps and an almost completely unambiguous sequence can be written for each of them. The details are included in Figure 5.

The amino acid sequence data also yielded another continuous sequence of 142 amino acids (C1 to C70, Air, 1976; C71 to C142, unpublished data). To determine its position in the protein an experiment was done in which Hae fragment 1 was used as the primer in the plus and minus system (Sanger & Coulson, 1975). The results are shown in Figure 6. A sequence of 65 nucleotides was deduced which corresponded to residues

[Figure 6: radioautograph; lanes labelled -G, -A, -T, -C, +G, +A, +T, +C.]

FIG. 6. Radioautograph of an experiment in which Hae fragment 1 was used as a primer with the plus and minus system (Sanger & Coulson, 1975).
* From this experiment a single C residue was predicted in this position; however, the amino acid sequence indicated two C residues. Apart from this the predicted DNA sequences agreed with the amino acid sequence.


C39 to C60 of the amino acid sequence. Thus, allowing for the 15 to 20 nucleotides closest to the Hae site that are not seen by this method, the amino acid sequence C1 to C142 is aligned as shown in Figure 4.

Figure 5 shows the correspondence of all nucleotide and amino acid sequence data obtained so far for the F gene of φX174 and its product. The correlated sequences total 136 codons, including assignment of several Glu, Gln, Asp and Asn residues from the nucleotide sequence that in the amino acid sequence results could only be designated as Glx or Asx (Air, 1976). Amino acid residues 63, 70 to 73, 81, B22 and C50 were given by the DNA sequence as these were not identified, or were uncertain, in the protein sequence work.
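The matching of a nucleotide sequence to a peptide, and the use of the DNA to settle Glx and Asx residues, can be illustrated with a short Python sketch. The nucleotide string is the sequence primed by Hae fragment 1 as it is legible in Figure 5(c); the peptide containing Asx placeholders (written B) is a hypothetical input constructed for the illustration, and the function names are likewise assumptions of the sketch rather than anything used in the original work.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# One-letter placeholders for residues the protein work could not resolve.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}  # Asx, Glx

def translate(dna, frame=0):
    """Translate complete codons of dna starting at the given frame offset."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def matches(peptide_residue, dna_residue):
    """True if the DNA-derived residue is compatible with the peptide residue."""
    return dna_residue in AMBIGUOUS.get(peptide_residue, peptide_residue)

def align_frame(dna, peptide):
    """Return (frame, offset, resolved peptide) for the first fit, else None."""
    for frame in range(3):
        protein = translate(dna, frame)
        for off in range(len(protein) - len(peptide) + 1):
            window = protein[off:off + len(peptide)]
            if all(matches(p, d) for p, d in zip(peptide, window)):
                return frame, off, window
    return None

# Sequence primed by Hae fragment 1, as legible in Figure 5(c).
hae1_primed = ("ACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATC"
               "TGGCTATGATGTTGATGGAACTGA")

# Hypothetical peptide read with two unresolved Asx (B) residues.
peptide = "RPLLVMRSBLWASGYBV"

print(align_frame(hae1_primed, peptide))
```

Only one of the three reading frames accommodates the peptide, and in that frame each Asx placeholder is resolved to the residue specified by its codon.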

3. Discussion

In the determination of these sequences we have used several independent methods which each yielded partial information. The combined results however establish a continuous sequence of nucleotides and we have found in this work, as in the work on gene G and its product (Air et al., 1975, 1976), that the combination of several partial results is more satisfactory than trying to obtain a complete result by a single method. This was desirable in the DNA work where copying methods were used, which might have introduced errors, although such errors have not been detected (Blackburn, 1975, 1976; Sanger & Coulson, 1975; Air et al., 1975). In amino acid sequence work, where the methods are long-established and supposedly routine, the work involved in determining the final details can be extremely frustrating. In this project the provision of overlaps by the DNA sequence and being able to locate sequence fragments in the protein, as shown in Figure 4, were of great assistance in obtaining sequences in a relatively short time.

(a) Secondary structure in F gene DNA

Although we have not carried out a rigorous search for possible hydrogen-bonding into "looped" structures, there is not much secondary structure possible in these DNA sequences. In the endonuclease IV bands there are many runs of T bases but no similar runs of A bases to which they could hydrogen-bond. This absence of secondary structure is compatible with the action of endonuclease IV in excising this region specifically under conditions of maximum hydrogen-bond formation (Ziff et al., 1973; Galibert et al., 1974).

(b) The coding of the F and G proteins of φX174

The distributions of the 136 F gene codons shown in Figure 5 and of the 65 codons previously identified in gene G (Air et al., 1975) are given in Table 3. Although the data are limited, there appears to be a significant preference for certain codons, particularly for those with T in the third position.
Table 4 shows the frequency of each base in the third position and confirms that this preference is real: 56% of codons in the F gene end in T and 59% of those in gene G. The DNA of φX174 is in any case rich in T (32.7%), but if it is assumed that the first two positions in each codon are random (i.e. 25% of each base) and that the high T content of the DNA is only due to changes in the third position, the frequency of T in this position would be only 48%. It seems likely therefore that this is a local effect and it suggests that there are regions elsewhere in the DNA with a lower content of T.
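The figure of 48% follows from simple averaging over the three codon positions, and can be checked with the sketch below. The function name is an assumption of the sketch; the inputs (32.7% overall T, 25% T at each of the first two positions, and 76 of 136 F gene codons ending in T) are the values given in the text and in Table 4. The calculation treats the whole DNA as if it were composed of codons, as the argument in the text implicitly does.

```python
def third_position_fraction(overall, first=0.25, second=0.25):
    """Third-position frequency implied by an overall base frequency.

    The overall frequency is the mean over the three codon positions:
        overall = (first + second + third) / 3
    so third = 3 * overall - first - second.
    """
    return 3 * overall - first - second

expected_t = third_position_fraction(0.327)   # from 32.7% T overall
observed_t_gene_f = 76 / 136                  # Table 4, gene F

print(f"expected {expected_t:.1%}, observed in gene F {observed_t_gene_f:.1%}")
```

The observed value for gene F exceeds the expected one, which is the basis for regarding the excess of T in this region as a local effect.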


TABLE 3
The codons identified in φX174 genes F and G

[The tally of each codon in genes F and G and the per-codon totals are not legible. The codons tabulated are listed below, followed by the column totals.]

Asp   GAU GAC
Asn   AAU AAC
Thr   ACU ACC ACA ACG
Ser   UCU UCC UCA UCG AGU AGC
Glu   GAA GAG
Gln   CAA CAG
Pro   CCU CCC CCA CCG
Gly   GGU GGC GGA GGG
Ala   GCU GCC GCA GCG
Cys   UGU UGC
Val   GUU GUC GUA GUG
Met   AUG
Ile   AUU AUC AUA
Leu   UUA UUG CUU CUC CUA CUG
Tyr   UAU UAC
Phe   UUU UUC
Trp   UGG
His   CAU CAC
Lys   AAA AAG
Arg   CGU CGC CGA CGG AGA AGG

Total: F 136, G 65, F and G together 201

TABLE 4
The frequency of each base in the third position of codons in the F and G genes of φX174 and for comparison in the RNA phages MS2, R17, f2 and Qβ

                  φX174                      MS2, R17, f2, Qβ
          Number          %                  %
Base      F      G        F       G
T         76     38       55.9    58.5       29.1
C         33     5        24.3    7.7        26.9
A         8      10       5.9     15.4       23.6
G         19     12       14.0    18.4       20.4
Total     136    65

The data used for the MS2, R17, f2 and Qβ codons were taken from Barrell & Clark (1974) and totalled 461 codons.


The reason for the high T content is not known. It is similarly high in other single-stranded DNA phages including the filamentous viruses fd and f1. Previous studies on RNA phages (Adams et al., 1969; Steitz, 1969; Min Jou et al., 1972; Contreras et al., 1972,1973; Jeppesen et al., 1972; Haegeman & Fiers, 1973; Sanger, 1971) show no significant preference for any particular codons, in contrast to the results with φX174. In studies on frameshift mutants in the lysozyme gene of bacteriophage T4, in which the T (and A) content is 32.5%, a somewhat similar phenomenon is observed. Of 19 codons, seven had U and seven had A as the third base (Ocada et al., 1970).

It is possible that the cause is chemical rather than biological, i.e. that there are chemical reasons for changes towards T residues to occur in these phages and, if there is no selective preference for the third position of codons, T residues would accumulate. It seems more probable however that the effect has been brought about by natural selection and is of some advantage to the organism. Either it is an advantage to have an overall high T content and changes have occurred in the variable third position to accommodate this, or it is an advantage to have T in the third position and the overall analysis is merely a result of this. In the former case the effect may be concerned with the physical constraints of the DNA and may involve the packaging into the virion, though both the filamentous and the spherical phages, in which the packaging must be very different, have a high T content. There are however other stages in the life-cycle of the organism that could equally well be affected by the physical properties of the DNA.

Denhardt & Marvin (1969) have suggested that the high T content of the single-stranded DNA phages might be explained on the basis of an alteration in the mechanism of translation on infection such that codons ending in T residues are more readily recognised than other codons. A change in the "wobble" pairing might, for instance, account for this. However, this would predict a low frequency of C in the third position, which is observed in gene G codons but not in gene F. In the F gene codons A is noticeably low in the third position, and this is true also when F and G gene codons are averaged.

In these coding sequences the out-of-phase terminator codons are not randomly distributed. They do not appear to be associated with internal ATG triplets. In spite of similarities in some of these internal ATG sequences to initiation ATG sequences (e.g. Robertson et al., 1973) as noted in phage MS2 RNA sequences (Haegeman & Fiers, 1973), there is no evidence that false initiations would be corrected by rapid termination.

REFERENCES

Adams, J. M., Jeppesen, P. G. N., Sanger, F. & Barrell, B. G. (1969). Nature (London), 223, 1009-1014.
Air, G. M. (1976). J. Mol. Biol.
Air, G. M., Blackburn, E. H., Sanger, F. & Coulson, A. R. (1975). J. Mol. Biol. 96, 703-719.
Air, G. M., Sanger, F. & Coulson, A. R. (1976). J. Mol. Biol. (in the press).
Barrell, B. G. & Clark, B. F. C. (1974). Handbook of Nucleic Acid Sequences, Joynson-Bruvvers Ltd., Oxford.
Benbow, R. M., Mayol, R. F., Picchi, J. C. & Sinsheimer, R. L. (1972). J. Virol. 10, 99-114.
Benbow, R. M., Zuccarelli, A. J., Davis, G. C. & Sinsheimer, R. L. (1974). J. Virol. 13, 898-907.
Blackburn, E. H. (1975). J. Mol. Biol. 93, 367-374.
Blackburn, E. H. (1976). J. Mol. Biol. 107, 417-431.


Burton, K. & Petersen, G. B. (1960). Biochem. J. 75, 17-27.
Chen, C.-Y., Hutchison, C. A. & Edgell, M. H. (1973). Nature New Biol. 243, 233-236.
Contreras, R., Vandenburghe, A., Volkaert, G., Min Jou, W. & Fiers, W. (1972). FEBS Letters, 24, 339-342.
Contreras, R., Ysebaert, M., Min Jou, W. & Fiers, W. (1973). Nature New Biol. 241, 99-101.
Denhardt, D. T. & Marvin, D. A. (1969). Nature (London), 221, 769-770.
Edgell, M. H., Hutchison, C. A. & Sclair, M. (1972). J. Virol. 9, 574-582.
Fiddes, J. C. (1976). J. Mol. Biol. 107, 1-24.
Galibert, F., Sedat, J. & Ziff, E. B. (1974). J. Mol. Biol. 87, 377-407.
Haegeman, G. & Fiers, W. (1973). Eur. J. Biochem. 36, 135-143.
Harbers, B., Delaney, A. D., Harbers, K. & Spencer, J. H. (1976). Biochemistry, 15, 407-414.
Jeppesen, P. G. N., Barrell, B. G., Sanger, F. & Coulson, A. R. (1972). Biochem. J. 128, 993-1006.
Jeppesen, P. G. N., Sanders, L. & Slocombe, P. M. (1976). Nucleic Acids Res. 3, 1329-1339.
Johnson, P. H. & Sinsheimer, R. L. (1973). Fed. Proc. Fed. Amer. Soc. Exp. Biol. 32, 491.
Lee, A. S. & Sinsheimer, R. L. (1974). Proc. Nat. Acad. Sci., U.S.A. 71, 2882-2886.
Ling, V. (1972). Proc. Nat. Acad. Sci., U.S.A. 69, 742-746.
Middleton, J. H., Edgell, M. H. & Hutchison, C. A. (1972). J. Virol. 10, 42-50.
Min Jou, W., Haegeman, G., Ysebaert, M. & Fiers, W. (1972). Nature (London), 237, 82-88.
Ocada, Y., Amagase, S. & Tsugita, A. (1970). J. Mol. Biol. 54, 219-246.
Robertson, H. D., Barrell, B. G., Weith, H. L. & Donelson, J. E. (1973). Nature New Biol. 241, 38-40.
Sanger, F. (1971). Biochem. J. 124, 833-843.
Sanger, F. & Coulson, A. R. (1975). J. Mol. Biol. 94, 441-448.
Sanger, F., Donelson, J. E., Coulson, A. R., Kössel, H. & Fischer, D. (1974). J. Mol. Biol. 90, 315-333.
Sedat, J., Ziff, E. B. & Galibert, F. (1976). J. Mol. Biol. 107, 391-416.
Smith, H. O. & Wilcox, K. W. (1970). J. Mol. Biol. 51, 379-391.
Steitz, J. A. (1969). Nature (London), 224, 957-964.
Ziff, E. B., Sedat, J. W. & Galibert, F. (1973). Nature New Biol. 241, 34-37.
