J. Mol. Biol. (1976) 108, 519-533

Nucleotide and Amino Acid Sequences of Gene G of ~X174 G. ]~. AIR, F. SANGER AND A. R. COULSO~

Medical Research Council Laboratory of Molecular Biology Hills Road, Cambridge CB2 2QH, England (Received 16 July 1976) The nucleotide sequence of tlm coding region of gene G of r and the amino acid sequence of the G-coded "spike" protein of the virion have now beert completed. From the 5' A of the initiating ATG to the 3' end of the terminator triplet, the gene consists of 528 nucleotides and codes for a protein of 175 amino acids, molecular weight 19,053. 1. I n t r o d u c t i o n

In studying ribosome binding to r single-stranded DNA, Robertson et al. (1973) isolated and sequenced a ribosome-protected DNA fragment which proved to contain the initiation region for translation of a spike protein of the virion (M r 19,000) which is coded by gene G (Air & Bridgen, 1973). The DNA and protein sequences were extended (Donelson et al., 1975; Air et al., 1975), and we have now obtained the complete nucleotide and amino acid sequences of gene G (528 nueleotides) and its product (175 amino acids). As in earlier work (Air et al., 1975) neither the DNA nor the protein sequence was determined to absolute certainty, but any doubtful residues of one sequence have been confirmed by reference to the other. Since the emphasis has been to develop reliable methods of sequencing DNA, the nucleotide sequence has been the primary aim, with enough amino acid sequence work to make sure that the sequence was correct. In the previous work (gobertson et al., 1973; Donelson et al., 1975; Air et al., 1975) several methods of DNA sequencing have been employed, including incorporation of 32P-labelled deoxyribonucleotide triphosphates using a synthetic primer and DNA polymerase with the single-stranded viral DNA as template. The rapidly expanding knowledge of restriction enzymes and their sites has meant that almost any region of r DNA contains enough known restriction sites to isolate fragments suitable for sequence determination by priming methods, with no necessity for synthesising oligonueleotide primers. In the DNA sequencing described here, all the 331 residues which complete the gene G sequence were determined using the "plus and minus" method (Sanger & Coulson, 1975) with viral strand r DNA as template and restriction fragments produced by the enzymes from Haemophilus influenzae R r (Hinf 1, Middleton, Stankus, Edgell, Hutchison & Barrell, unpublished data), H. aphrophilus (tlap I I, Takamtmi, 1973), H. haemolyticus (Hha [, Roberts, unpublished data) as primers. The primers were selected by reference to the physical map of r (Jeppesen et al., 1976), part of which is shown in Figure 1. This restriction 34

519

520

G. M. A I R , F . S A N G E R A N D A. R. C O U L S O N

f r a g m e n t m a p has b e e n a l i g n e d w i t h t h e genetic m a p in t h e F - G regions as p r e v i o u s l y r e p o r t e d (Air et al., 1975,1976). T h e a m i n o a c i d sequence o f t h e p r o t e i n c o d e d b y gene G h a s b e e n d e t e r m i n e d b y the standard methods of trypsin, chymotrypsin and CNBr digestion and isolating a n d sequencing t h e p e p t i d e s p r o d u c e d , e x c e p t t h a t t h e D N A sequence h a s been used to o v e r l a p p e p t i d e sequences a n d to confirm u n c e r t a i n residues.

2. Materials and Methods Single-stranded viral D N A was p r e p a r e d from ~X174 am3cs70 (lysis defective). Replicarive form ~X D N A was p r e p a r e d b y B. G. Barrell and J. C. Fiddes b y the m e t h o d of Godson & Vapnek (1973) a n d some was a gift from C. A. Hutchison. Restriction enzyme fragments were p r e p a r e d a n d m a p p e d by various members of this l a b o r a t o r y (N. L. Brown, B. G. Barrell, J. C. Fiddes, P. M. Slocombe). They were usually prepared b y fractionating the digests on gradient polyacrylamide gels (Jeppesen, 1974) a n d eluting electrophoretically (Galibert et al., 1974). 32P-labelled deoxyribonucleotide triphosphates were obtained from New E n g l a n d Nuclear a t a specific a c t i v i t y of about 100 mCi/pmol, and were used as soon after receipt as possible. I t appeared t h a t less satisfactory results were obtained with older triphosp h a t e preparations a n d this was not only due to loss of specific activity. T4 D N A polymerase was p r e p a r e d b y the m e t h o d of Goulian et al. (1968). F o r the minus reactions we have used a m u t a n t form of D N A polymerase I which lacks the 5' exonuelease a c t i v i t y (Heijneker et al., 1973). All test tubes and reactivials used in the D N A work were siliconized before use. Those for the E d m a n a n d dansyl reactions were acid-washed. F o r isolation of the phage coat proteins, cells from two 300-1itre cultures of Escherichia cell C infected with r am3cs70 were obtained from the Microbiological Research Establishment, Porton Down, Wiltshire, England. Trypsin (TPCK-treated), chymotrypsin a n d earboxypeptidases A a n d B were from Worthington. Phenylisothioeyanate, trifluoroacetic acid a n d b u t y l acetate were sequenal grade (Pierce). (a) D N A sequencing The plus a n d minus m e t h o d used for the D N A sequencing was essentially as described b y Sanger & Coulson (1975) with slight modifications. The following detailed description applies to the experiments using fragment Hap 5 as primer (Figs 6 and 7). The techniques used in the other experiments were very similar, the most significant difference being t h a t restriction enzyme digestion was originally done after the plus a n d minus polymerase reactions. I n recent experiments we have found it preferable a n d simpler to do it before. After isolation of the restriction enzyme fragments t h e y were dissolved in water such t h a t 1 t~l contained the product derived from 1 ~g CX R F . 10 tA of this solution was m i x e d with 2 tA single-stranded ~X D N A (1.2 t~g), sealed in a capillary tube a n d heated a t 100~ for 3 rain. After cooling a n d opening the capillary, 2 t~l H • 10 buffer (66 mM-Tris.HC1 (pH 7-4), 66 mM-MgC12, 10 m~-dithiothreitol, 0"5 M-NaC1) a n d 6 t~l water were added. The capillary was resealed and incubated for 4 h at 67~ The annealed m a t e r i a l was then a d d e d to 20 ~1 of a solution containing 10 ~Ci [32P]dATP, 0.1 m ~ - d G T P , dCTP a n d T T P in t I buffer. After cooling to 0~ 2 ~1 D N A polymerase 1 (Boehringer, grade 1) were added. After 2 rain, a p p r o x i m a t e l y one-third of the reaction mixture was removed and a d d e d to 5 ~1 0.2 M-EDTA. One-third samples were again removed after 8 a n d 30 min, combined with the first sample, 25 tA phenol added, a n d thoroughly m i x e d in a 0.5-ml reactivial. This was centrifuged, the aqueous layer sucked off with a drawn-out capillary tube a n d washed twice with 2 ml ether to remove excess phenol. Excess ether was then e v a p o r a t e d b y a current of ah" and the solution applied to a column of Sephadex G100 (Pharmacia) in a 1-ml disposable plastic serological p i p e t t e (Falcon}, which was m a d e up a n d run in 5 n~-Tris.HC1 (pH 7.4) 0-1 mM-EDTA. These columns run much faster t h a u

r

GENE d~ SEQUEI~CES

521

the agarose columns originally used (Sanger & Coulson, 1975) a n d the results appear to be more satisfactory. However, care must be taken to avoid t h e m drying out. The fractionation was followed using a b a n d radiation monitor. The synthesized DNA was eluted in the void volume a n d clearly separated from the excess triphosphate. I t was collected in about 200 ~1 a n d t a k e n to dryness in a desiccator, dissolved in 20 ~1 H buffer a n d digested with the appropriate restriction enzyme. The a m o u n t of enzyme used was stffficient to digest 3 ~g DNA completely u n d e r the conditions used (usually 1 h at 37~ I n the case of Hap fragment 5, the enzyme Hpa I I was used, which has the same specificity (C-C-G-G) as the Hap enzyme. After digestion the solution was diluted to a suitable volume (usually a b o u t 50 ~1) and 2-~1 samples used in the plus and minus reactions. The yield of radioactive D N A varies considerably between experiments. I n general this solution shotfld be diluted to contain about 3 nCi/~l. Plus mixes are 0.2 rn~ solutions of a single deoxyribotriphosphate i n 1"5 x H buffer. Minus mixes are 0.01 mM solutions of the appropriate tbree triphosphates in 1.5 • H buffer. The reaction mixtures contained 2 ~1 of the DNA reaction product, 2 ~l of the appropriate " m i x " a n d 1 ~l DNA polymerase in a drawn-out capillary. Minus reactions were carried out at 0~ for 30 rain with approximately 0.8 units of the m u t a n t DNA polymerase (Heijneker et al., 1973) a n d the plus reactions for 30 rain at 37~ with T4 I)NA polymerase (approx. 0.02 unit). At the end of the reaction 1 ~l 0.2 M-EDTA a n d 25 ~l freshly deionized formamide containing 0.03o/0 xylene cyanol F F a n d 0-03~o bromophenol blue were added. The solutions were heated at 100~ for 3 rain a n d then loaded onto the polyacrylamide gel. After the electrophoresis the gel was soaked for 10 to 30 rain in 10~/o acetic acid to fix the oligonucleotides and prevent diffusion of the bands. I t was then soaked in water for 1 to 2 rain before covering with Saran Wrap for autoradiography. (b) Preparation of polyacrylamids gels The plus a n d minus reaction products were fractionated on 12~/o gels in 8 M-urea at p H 8.3. To facilitate production of gels, a Perspex box with removable front (see Fig. 2) was made to hold six 20 em x 40 c m x 1.5 m m slab gel templates, each sealed at the edges with silicone grease a n d waterproof tape. The gels were poured from the b o t t o m via a raised funnel, with the slot formers in place. (The slots, 9 or 13 per gel, were the full thickness of the gel, i.e. 1.5 ram.) The buffer system was Tris.borate/EDTA (pH 8"3) (TBE) (Peacock & Dingman, 1967). 10 x T B E was made b y dissolving 108 g Tris base, 55 g boric acid a n d 9.3 g E D T A in water to 1 1. The gel solution was made as follows. 720 g urea (Schwarz-Marm, u l t r a pure), 180 g acrylamide and 7"5 g bis-aerylamide (both BDH, "purified for eleetrophoresis") were made up to approximately 1"25 1 with water a n d dissolved with slight warming. This solution was stirred for 30 rain with approximately 50 g Amberlite MB-1 resin (BDH), which was removed b y filtration through W h a t m a n no. 1 filter paper in a glass sinter funnel. 150 ml of 10 x T B E was added, the solution made up to 1.5 1 a n d degassed. 5 ral of 10~o a m m o n i u m persulpbate a n d 0"5 ml _hr,N,N',•'-tetraethylmethylenediRmlne were added prior to pouring. The gels were removed from the box after 1 to 2 h, the slot formers removed a n d excess acrylamide washed off. The gels can be stored for at least 2 weeks immersed in 8 M-urea/TBE. Prior to use about 0.5 cm of gel is removed to allow insertion of the W h a t m a n 3MM paper wick, a n d the 8 M-urea/TBE storage buffer is washed out a n d replaced b y TBE. I t has been found unnecessary to have a n y urea in the blfffer compartments of the electrophoresis apparatus. With a gradient chamber in place of the funnel the above apparatus has been used for making gradient gels. (c) Protein sequencing The purification of the gone G protein and the methods used to separate a n d sequence the peptides produced b y CNBr, trypsin a n d chymotrypsin were as described in previous papers (Air et al., 1975; Air, 1976). The a m o u n t s of protein used to determine

G. 1~I. ALP., F. S A N G E R AND A. 1%. C O U L S O N

522

the G protein sequence were 6"5 mg for trs~tie digestion, 20 mg for chymotryptic digestion a n d ,'~-,30 mg for CNBr cleavage. The peptides eluted from paper after electrophoresis a n d chromatography were fl'eeze-dried aud taken up iu 200 F1 water. 20 /~l were taken for acid hydrolysis a n d amino acid analysis, and 150 /A for sequencing by the m a n u a l d a n s y l - E d m a n method as detailed previously (Air et al., 1975; Air, 1976). Carboxypoptidaso A and B digestions were at an enzyme : substrate ratio of 1 : 50 in 1% a m m o n i u m bicarbonate at room tomperatttre. Portions were taken at 0, 10 and 30 rain, immediately frozen in a solid CO2/acetone bath, and lyophilised. The dry samples were ~aken up in sample loading bttffer, centrifuged, and the supernatants analysed olx the D u r r u m amino acid analyser.

3. Results

(a) D N A sequence The only method used to study the DNA sequence was the plus and minus method using restriction enzyme fragments as primers (Sanger & Coulson, 1975). Figure 1 shows the sites of splitting of the gene G region with restriction enzymes based on the results of Edgell et al. (1972), Lee & Sinsheimer (I974) and Jeppesen et al. (1976). N terminus of G protein

I E Hm~ I / Hind

Hho

C terminus of G protein

I OOi R9

200,

500,

I RIO I F58

Il

I

400,

5j00I

R2 F8

IF4 I ]

H3

Hop Tr~

2

t

t H2

5

=4

FZG. 1. The sites of restriction enzy]ne cleavage in the gene G region of ~bX174 as mapped by Edgell et al. (1972), Lee & Sinsheimer (1974), Jeppesen et al. (1976). The enzymes shown are H i n d I I , H i n f I, Tlha I, H a p II, and their exact, sites of cleavage are included in Fig. 10.

0.9 cm

)

22-5 cm 6.5 c ~ 2.5 mm brass screw to take wing nut

'

I f

0"5 cm

--0-5 cm

I I' I I 43 cm i i

I I I I I

~cm

a o

o ~ tj~

o

I-5 mm 0 ring

t

0.75 cm LD.s t a i n l e s s

steel inlet

Fzo. 2. Apparatus for production of 6 aerylamide gels.

2

r

GENE

G SEQUENCES

523

Fragments Hha 2, Hinf 4, Hap 5 and Hinf 8 have been used as primers and some results are shown in Figures 4 to 8 which are radioautographs of the acrylamide gels used in the plus and minus method. Using the four primers it was possible to obtain data covering the 331 nueleotide sequences of the G gene t h a t had not been sequenced in the previous paper (Air et al., 1975). I t will be seen that the quality of the experiments varies considerably. Thus the radioautographs for Hha 2 (Fig. 4) or Hap 5 (Fig. 6) are clear and relatively easy to read, whereas t h a t for Hinf 4 is very much harder with a number of artefact bands and bands missing. In general several experiments were done with each primer until the data were sufficiently good to establish the DNA sequence together with the amino acid sequence data. The sequences finally deduced in this way are shown in the Figures. The 3' end of the minus strand of fragment Hha 2 is about 55 residues from the DNA coding for the C terminus of the G protein. An initial priming with this fragment suggested a sequence whose complement is shown in Figure 3. At this time the Cterminal sequence of the G protein had not been identified, but a number of tryptic peptides had been sequenced whose position in the protein was unknown. One of these had the sequence shown in Figure 3. The last six amino acids corresponded b y the genetic code to the DNA sequence and since this was followed by the termination codon T-G-A it suggested t h a t the peptide was the C-terminus of the protein. This was confirmed b y further protein work as described below. Nucleotide s e q u e n c e deduced i n p r e l i m i n a r y

A-A-A-G-G-G-A-T-T-A-T-

-T-G-T-C-T-C-C-A-G-C-C-A-C-T-T-A-A-G-T-G-A

experiment Amino

acid

sequence

O tryptic peptide

of

Corrected DNA sequence

GIx -

-

-

A

Ile

-

lie

- Cy| -

Leu -

GIx -

Pro -

Leu -

Lys

T

FIG. 3. Identification of the t e r m i n a t i o n codon of the G gene.

In the experiment shown in Figure 4 the aerylamide gel eleetrophoresis was run for a relatively long time so that the oligonucleotides in the bands at the bottom of the gel are about 60 nueleotides long. The ohgonucleotide in position 85 is approximately 145 residues long. Nevertheless, bands in this area are separated and the sequences can be read with reasonable accuracy. This radioautograph is relatively dark which facilitates the reading. There are, however, a few artefacts. The most noticeable is in position 4 which contains a band in the - - T system. The main reason for identifying this as an artefact is t h a t the next oligonucleotide gives a band in the q-C system. Where there are runs of a particular nueleotide in a sequence one oecasionally sees bands in the minus system corresponding to each component of the run but more often only one band is seen which corresponds to the residue before the run and this is usually stronger than any subsequent bands. Thus the presence of two bands in consecutive ohgonucleotides in the - - T system in positions 4 and 5 is further evidence t h a t one of them is an artefact. This same artefact was found in most other experiments in which fragment Hha 2 was used as a primer so that the possibility of a T residue instead of a C residue in this position could not be excluded. Nor could it be excluded by the amino acid sequence. However, when fragment Hap 4, which primes about 40 residues beyond the Hha 2 priming site, was used this artefact was completely absent. Other artefacts which could be more easily eliminated were found in the -kC system in positions 14 and 41. The exact reason for these occasional artefacts

G. M. A I R , F. S A N G E R A N D A. R. C O U L S O N

524

+C

+T +A +G -C

-T

-A -G G 48 tA C

A G A

T2

A

41

A 1" G 70 A

"T ,A2

73

~

30

T C T

A

~

.,o._

~2'

58 T

20

A G A .G2 T C

miD....

A3 49 G /

3

T

1o

A2

qqm,

T2 C ~A

c T

1

C2 A C :FIo. 4. Autoradiograph of a polyacrylamide gel electrophoresis of samples from an experiment in which fragment Hha 2 was used as a primer for DNA polymerase on CX DNA using the plus and minus method. The slower blue marker (xytene eyanol FF) was at the bottom of the gel, the top 10 cm of which is not shown. The sequence shown is that finally deduced from the results of several similar experiments and of the amino acid sequence determinations of the G protein. The lines from the sequence point to the appropriate band in the plus or minus system which is indicated by a dot. The residues are numbered in the 5' to 3' direetion along the minus strand, starting at the T residue which is complementary to the 3' end of the terminating codon for the G protein. is u n k n o w n . T h e y v a r y c o n s i d e r a b l y f r o m one e x p e r i m e n t t o a n o t h e r b u t some, such as t h a t in p o s i t i o n 4, a r e e x t r e m e l y p e r s i s t e n t . T h e r e were c o n s i d e r a b l y m o r e w h e n H i n f 4 was u s e d as p r i m e r (Fig. 5), e.g. i n p o s i t i o n s 94 ( + C ) a n d 96 ( + T ) . P o s i t i o n s 42 t o 46 give t h e sequence G - A - T - T - C w h i c h is a specificity site for Hinf ( H u t e h i s o n & Barrell, u n p u b l i s h e d d a t a ) a n d r e p r e s e n t s t h e j u n c t i o n b e t w e e n Hinf f r a g m e n t s 4 a n d 8. l e r a g m e n t Hinf 4 was t h e r e f o r e u s e d as a p r i m e r t o e x t e n d t h e sequence (l~g. 5). B y c o m p a r i n g t h e p a t t e r n o f b a n d s i n F i g u r e 5 w i t h t h o s e a t t h e t o p o f F i g u r e 4 i t is clear t h a t t h e s a m e sequences are i n v o l v e d , t h u s e s t a b l i s h i n g a n o v e r l a p b e t w e e n t h e t w o sets o f d a t a . S i m i l a r o v e r l a p s can be seen b e t w e e n o t h e r

r

GENE

+C 140

G~

SEQUENCES

+T +A +G - C

T

-T

-A

525

-G 166

G G2 A2

160

128

A

Ga C2

150

118 G

110

141

105

95

81

..y.

~2

FIG. 5. Autoradiograph from an experiment in which fragment H i n f 4 was used as a primer (see Fig. 4). The xylene eyanol F F marker was near position 105. The top 7 cm of the gel are no6 shown. The results in the - - G system in this experiment were unsatisfactory and were n o t used. The electrophoresis has run rather unevenly so t h a t the b a n d s lie on curves, especially a t the edges. This curvature is taken into account in marking the positions of the sequences.

primed sequences. For instance, the pattern at the bottom of Figure 8 matches that at the top of Figure 6. I t m a y be seen t h a t the pattern of bands in any one area of a gel represents a "fingerprint" or characterization of a specific sequence and it is thus possible to use these patterns to obtain overlaps and as an alternative though somewhat more laborious method of determining a restriction enzyme fragmentation map. The next restriction site is the Hap site (C-C-G-G) in position 119 to 122. Fragment Hap 5 was used to prime from this site (Fig. 6). Although the results in the -t-A system were unsatisfactory in the experiment shown, the sequence is relatively easy to read off. Positions 236 to 243 are an alternating sequence of A and G residues, in

526

G. M. A I R ,

+C

229

F. SANGER

+T

+A

A N D A. R . C O U L 8 O N

+G-C

-T-A

-G

A: G: C:

9 ::;..:

. " ~-~i~:, :!

T G A

'. ,:i/:?~:",i'-

/'%

;~:':i!,.! T' i~!~, 221

A-

'"

c\ G3

"

m

200

A3 C2

A3 C2/ 190

~

m

250

~ ! .,,.~.:.':

AT_ A G2

2 260

m

A C

A6

~ ~ . G

9

A C~

274

C

A 210

T

9 .: .. . .,.! ":~.:!.:,.~

~

i/!

-

%! " !

~...~::i : :. ,,'.~f:.,. ,,::

!. 9 :

G

-':: :"!:J,- ~'

.-:'::.,

_. 9

I

....

239 2

9

.:::/.-'.,

G

iil ":

-:,~::_ .: , - ~ : . ~ . ~.,. ,.::- -~. ;:;.:,'.:.!::.:' ::":-

A

180

#Y~,'-:~',y:,:,,:;,~: ..,--:: .-.-\ :~ .:

C

i~::~ ~

........

A

170

T

~

A2 T" A 164

~t~: -

, ~

G

FIG. 6. Autoradiograph from an experiment in which fragment Hap 5 was used as a primer (see Fig. 4). The xylene cyanol F F marker was around position 170. The top 5 cm of the gel are not shown. The results with the 4- A system were unsatisfactory in this experiment,.

r

GENE

527

G SEQUENCES

+C +T +A +G - A

-G -C

-T

250 C ~ .

A3 T C A2 239 G A2

9

A2 / "

T

Fro. 7. The result of an experiment with fragment Hap 5 as in Fig. 6, except t h a t the polyacrylamide gel electrophoresls was carried out for twice as long. The top 19 cm of the gel are not shown.

which there was some doubt of the relative frequency of each residue, and this could not be fully resolved by the amino acid sequence. Figure 7 shows an experiment similar to that shown in Figure 6 except that the electrophoresis was run for twice as long. The bands in the - - A and - - G systems are now further spread out and it is relatively simple to deduce the sequence. The corresponding pyrimidine tract has been identified in the plus strand b y Harbers et al. (1976). Figure 8 shows an experiment primed by fragment H i n f 8 from the site G-A-C-T-C at position 242 to 246. The sequence extends into the sequence primed by fragment H i n d 2 from the G-T-T-G-A-C at position 328 to 333, which was reported in a previous paper (Air et al., 1975). (b) A m i n o acid sequence After tryptic digestion of the G protein only seven soluble peptides were obtained. These were purified directly by paper electrophoresis and chromatography, detected with fluorescamine, eluted and sequenced as far as possible. As would be expected, chymotryptic digestion gave a much more complex mixture of peptides, and a preliminary fractionation on Sephadex G50 in 1% ammonium

528

G. M. A I R ,

F. SANGER

+C

+T+A+G

AND

A. R. COULSON

-C-T-A-G

333 331

i

348

319

T -

-

A

340

312

L,, A2 C2 ~

A

.._--- ---"~"llllb

',,,,,.*-

~,.

i :!:%. )! f~:,

274

, " ~ ; '!-

3 / /

FIG. 8. Autoradiograph from an experiment in which fragment H i n f 8 was used as a primer (see Fig. 4). The xylene cyanol FF marker was at position 295. The top 2 cm and bottom 8 cm of the gel are nob shown. bicarbonate was carried out. Four fractions were pooled and the peptides in each resolved by paper electrophoresis and chromatography. I t was not found necessary to a t t e m p t te purify a n y insoluble tryptic or chymotryptic peptides, particularly because the nucleotide sequence could be used to order the peptides and check t h a t no amino acid residues were missing. Amino acid analysis of the G protein showed three methionine residues, and since one is ~T-terminal t r e a t m e n t with CNBr should give tlu'ee fragments. The CNBr digest was fractionated firstly on Sephadex G75 in 50% formic acid. This separated a small (30 residues) fragment later shown to be C-terminal, but the N-terminal fragment

~X174 G E N E

G

SEQUENCES

529

(61 and 62 residues due to partial cleavage of the N-terminal methionine residue) and the middle fragment (83 residues) were not resolved. The latter was purified by chromatography on a column of W h a t m a n DE52 equilibrated and loaded in 8 x-urea, 10 mM-Tris, 50 mM-mercaptoethanol (pH 8.0) with a linear gradient to 0.6 ~-KC]. Tryptic digestion of the CNBr fragments yielded some peptides not seen after trypsin t r e a t m e n t of the whole protein (residues 116 to 124, 135 to 145, 157 to 166). The C-terminus of this protein was not easy to characterise. All the tryptic or c h y m o t r y p t i c peptides isolated contained an amino acid of appropriate specificity at the C-terminus, so none could be assigned to the carboxyl end of the protein. Resolution of homoserine or homoserine lactone on the older amino acid analysers was erratic, and its absence from hydrolysates of a CNBr fragment was not a reliable indication t h a t it contained the C-terminus of the protein. Digestion of the protein, which as purified is v e r y insoluble, with carboxypeptidases A and B did not give a convincing result, and after solubilization b y succinylation of ~- and c-amino groups (Klotz, 1967) no amino acids were released b y carboxypeptidases. As described above, in the nucleotide sequence of the gene G region the tryptic peptide (~lx-Ile-Ile-Cys-I~eu-Glx-Pro-Leu-Lys could be seen, followed b y a termination codon, just where the G protein sequence would end if the molecular weight estimation from gel electrophoresis in sodium dodecyl sulphate (19,000) is correct. To confirm this C-terminal sequence the small CNBr fragment (residues 146 to 175) was repurified. Amino acid analysis on the D u r r u m analyser showed t h a t it did not contain homoserine, therefore it is the C-terminal fragment. T r e a t m e n t of the CNBr fragment with carboxypeptidases A and B released only ]ysine, as would be expected from a -Pro-Leu-Lys sequence. The sequence of this CNBr fragment had been ResiduesIto67: Air e/o/. (1975) 65 70 75 -MetlAsx-Thr-Ser-VQI-Asx-Ala-Ala-Asx-Gtx- Val-VoI-Ser-Vol-

CNBr peptides Tryptic peptides Chymotryphc peptides

CNBr peptides Tryptic peptides Chymotryphc peptides CNBr peptides Tryptic peptides Chymotryptic peptides

-AsnlAla-Ala-Asx-GIx(Val,VaI,Ser,Vol ,

Gly-Ala-Asx- Ire-

80

- Phe-Asx

85

90

95

I00

Ph e-Phe-Ala-Cys-Leu-Vol- Arg Phe-GIx-Ser -Ser-Ser-Vol- ProGIy, A o,Asx, Ile,Ala)Phe;Asp-A,.-Asp-Pro{Lys,Phe Phe)I IVoI-Atg-Phe. Glu-Ser-Ser-Ser-Va -Pro' I 105

I10

ll5

120

125

Thr-Thr-Leu(Pro,Thr,Ala,Tyr,Asx,Va ,Tyr,Pro,Leu,Asx,Gly)ArglHiz-Asp-Gly-Gly-Tyr(Tyr,Thr,Vol)LylJA= Thr-Thr•Leu•Pr•-Thr•••••T•r•A•p-V•••Tyr•Pr•-LeuiAsp••Iy•Arg-••••Asp•••y•G•y•Tyr•Tyr•Thr•V••~163 1:50

135

140

145

p-

150 -Ser-Asx-Phe-

CNBr peptides Tryplic peptides Chymotryplic peptides

Cys'V~ I-Thr-Ile-Asp-V~ Leu-Pr~ I Thr-Pr~ Gty-Asx-Asx-V~176 I-Gly- Phe-M~f VolCys-Vol- Thr- He- Asx-Vol- Leu ( Pro, Arg) Thr ( Pro ,Gly ) Asx (Asx ,Vol, Tyr)[Vol- Gly-Phe I

CNBr peptides Tryptic pepfides Chymotryptic peptides

155 160 165 170 175 Thr-AIo-Thr-Lys-Cys-Arq-G y-Leu-VoI-Ser-Leu-Asx-GIx-Vol-Ile-Lys(GIx~Ile,Ile,Cys,Leu,GIx, Pro,Leu,Ly|) Ic=*-Argl G,~ I L.. ,Vo .. Ser. Leu, Asn,GIn .Vol ,Ile )Lys [GIx- lie- II.-Cys-Leu-G I .-Pro-Leu-L.|, ] IThr-Ale-Thr-Lys- Cys-Ar g-e ly- Leu !vo i- Ser-Leu I

FIe. 9. Amino acid sequences of the peptides used to deduce the sequence of residues 67 to 175 of the protein coded by gene O. The sites of cleavage of enzymes or CNBr are indicated by vertical bars; broken bars indicate partial cleavages. Residues in parentheses were present in the amino acid composition but the sequence was not determined. The C-terminus was confirmed as described in the text.

530

G. M. A I R ,

F. SANGER

A N D A. R . C O U L S O N

determined by the dansyl-Edman method as far as the lysine at r~uidue 166. The amino acid composition of the fragment showed that only the tryptic peptide GlxIle-Ile-Cys-Leu-Glx-Pro-Leu-Lyscould be added to the sequence; this was confirmed by tryptic digestion of the fragment when this was the olfly peptide not contained in the known sequence of the CNBr fragment. The amino acid sequence data used to determine the sequence of the first 67 amino acids of this protein were given previously (Air et al., 1975), and in Figure 9 these data. are continued for the remaining 108 residues. Although tile peptide data has given overlaps all through this region many of them are dangerously short and could not be relied upon. The nucleotide sequence has provided an unambiguous confirmation of the peptide order. The nucleotide sequence gave a Trp residue at position 147 which had not been detected in the CNBr fragment, presumably because it was destroyed during the CNBr reaction. Several amide residues were assigned by the DNA sequence, where protein sequencing gave only "Glx" or "Asx" (residues 70, 71, 78, 138, 139, 149, 167, 172) and some doubtful residues were confirmed by the DNA sequence. The "spike" protein coded for by gene G of r DNA is 175 amino acid residues in length, with a calculated molecular weight of 19,053, in good agreement with tile estimate from sodium dodecyl sulphate gel electrophoresis.

4. Discussion The results obtained by nucleotide and amino acid sequencing arc combined in Figure 10, which shows the nucleotide sequence of gene G of r from the 5' A of the initiating A-T-G codon to the 3' end of the termination codon (528 nucleotides) and the amino acid sequence of its product (175 amino acids). Also included in Figure 10 are the sites of cleavage by restriction enzymes Hind II, Alu I, Hinf I and Hap II. Four sequences previously identified as depurination products (Ling, 1972; Harbers et al., 1976) are present, three at positions 129 to 137, 236 to 243, 357 to 364 of the DNA sequence and one in the position corresponding to residues 13 to 16 of tile amino acid sequence. As noted previously in r DNA protein coding sequences (Air et al., 1975, 1976), there is a preponderance of T residues in the third position of the codons. Of the 175 codons of gene G, T occurs as the third base in 95 of them (54-3%). The frequency of T at the first and second positions of tile codons, which is dependent on (or dictates) the amino acid sequence is 25.7% and 31.4%, respectively, and in the overall composition of gene G 37.1~ so we are still expecting to find regions in r DNA of low T content to bring the total base composition to the known value of 32.7% T. A search was made for possible base-pairing between sequences ~dthin the gene but there appeared to be no significant sequences longer than would be expected for a random sequence (A. McLachlan, unpublished data). This is in contrast to the results with the RNA bacteriophages where extensive hairpin loops are found. The protein coded by gene G is located in the 12 spikes of the virion, the molar amount present giving five molecules of G protein per spike (Burgess, 1969). Electron micrographs (J. T. Finch, J. Robertus, A. Crowther, unpublished data) show a pentamer of globular subunits which may be the individual G protein molecules. The rules of Chou & Fasman (1974) have been applied to the amino acid sequence

~X174 GENE G SEQUENCES

531

I0 Me1 -

Phr

-

Gin

-

Thr

-

Ph~

-

]11= -

Set

-

Ar 9

-

HIL

-

A~n

~0 -

Set

-

A~

-

Ph~

-

Phe

-

Set

-

A~p

-

Lu

-

Leu

~

~ol

-

Leu

A-T-G-T-T-T-C-A-G-A-C-T-T-T-T-A-T-T-T-C-T-C-G-C-C-A-C-A-A-T-T-C-A-A-A-C-T-T-T-T-T-T-T'C-T-G-A-T-A'A-G-C-T-G-G-T-T-

-

C-'1 -C -

AluZ ~ Alut6 30 Thr

-

A-c-'r

Set

-

Vol

-

Thr

-

Pro

-T -C-T-G-T-T-A-C-T-C

-

AIo

-

Ser

-

Set

" C - A - G "- CT- T - T T- C - C ~ G -

-

Ala

-

Pro

"G~C-A'C-

40 -

VoI

-

Leu

C-T'G -T-T-T'T

-

Gin

-

Thr

-A-C-A'G'A'C"

-

Pro

-

Ly$

-

AI~I

A" C-C -T~A -A-A-G

-

Thr

-

Ser

-C "T-A-C-A-T-C-G

-

Ser

-

- T - C "A - A IO

50 Thr

-

Leu

Tyr

-

-

Phe

-

Asp

-

Ser

A * C-G-T-T-A-T-A-T-T-T-T-G-A-T-

-

Leu

-

Thr

-

Vol

-

Ash

60 -

Ale

-

Gly

A-G-T-T-T-G-A-C-G-G-T-T-A-A-T-G-C-T-

-

Asn

-

Gly

-

Gty

G-G-T~A-A-T-G-G-i"-G-

-

Phe

-

Leu

S-T-T-T-T-C-

-

His

-

Cys

-

MI!

-

Asp

-

Thr

-

Ser

-

Vol

-

Asn

-

Ale

-

Ale

-

Asn

~50

70 -

-

T-T-C-A-T-T-G-C-A-T-T 360

Gin

lie

BO -

Gin

-

Val

-

Vol

-

Ter

-

Vol

-

Gly

C-A-G-A-T-G-G-A-T-A-C-A-T-C-T-G-T-C-A-A-C-G-C-C-G-C-T-A-A-T-C-A-G-G-T-T-G-T-T-T-C-A-G-T-T-G-G-T-G-C-T-G-A-T-A-T-T-G-C-~ 340 330 320 3[0

-

Ale

-

Asp

-

Ire

-

Ale

-

300

ZCO

Hind I0 ~ Hind 2 90 Phe

-

Asp

T-T-T-G-

-

Ale

-

As~

-

P~'o -

A-T-G ~C-C-G-A-C280

Lys

~

Phe

-

Phe

-

Aq~

~ Cy~

C- ~-1"-A-A-A-T-T-T-T-T-T-G-C-C-T-G-T-'F270

I O0 -

Le~

-

Voq

-

Ar~/

T-G-G-T-T-C-

-

Phe

-

G~u -

Ser

-

Set

-

G- C-T-T-T-G-A-G-T-C-T-T-C250 ~ 240

260

Set

-

V~L

-

Pc~

T-1"-C- G- G-T-T-C-CG;~30

Hinf $b | Hinf 8 I10 Thr

-

Thr

-

Leu

A-C-T-A-C-C-

-

Pro

-

Thr

-

Ale

-

Tyr

-

A~,p

C-T-C-C-C-G-A-C-T-G-C-C-T-A-T-G22(3 210

-

Vol

-

Tyr

A-T-G-T-T-

120 -

Pro

-

Leu

T-A-T-C-C-T-T200

-

Asp

-

Gly

T-G- G-A-T-

-

ArcJ

-

His

-

Aso

G- G-1"-C-G-Q-C-A-T-

ice

-

Gly

-

Gly

G-A- T-G-G-T-G-G-

-

Thr

-

Vol

-

Lys

-

Asp

-

T-A-T-A-C-C-G-T-C~A-A-G-G-A-C-T-G

Cys

-

Vet

-

Thr

-

Ire

-T-G-T-G-A-C-T-AtSO

[60

-

Asp

Tyr

~

T- T-A- T 170

ISO

130 Tyr

-

IAO -

Vol

-

Leu

T-T-G-A-C-G-T-C-C-T-T 140

-

Pro

-

Arg

- C-C-C-C-G 130

-

Thr

-

-T-A-C-G-C-

Pro

-

Gly

~ Asn

C-C- G-G-C-A-A ,~ I ; ~ 0

-

Asn

-'Vol

-

-T- A-A-C-G-T-CllO

Hop 2 | Hop 5 150 Tyr

- Vol

- Gly

T-A-C-G-T=T-G-

-

Phe

- Mel

G-T- T-T-C-A-T100

- Vai

- Trp

-

Ser

- Asn

-

G-G-T-T-T-G-G-T-C-T-A-A-C-T-T-T90

Phe

)60 - Thr

- AIO

- Thr

A-C ~ C-G- C- T-A-C80

-

Lys

- Cys

- Ar(:J -

Gly

T-A- A ~A-T- G-C- C-G-C-G-G 70 60

-

Leu

- Vat

- Ser

-

~A-T- T-G- G* T- T-T-C-G50

tTO

t.eu

-

A~n

C-T~G -A-A~T-

- Gin

~

H/nf 8

-

Val

-

E.le -

C-A-G-G-T-T-A-T-T40

Lys

-

Glu

-

A-A-A-G-A-G-A~T-T 30

lie

-

lie

-A-T-T-

-

Cys

-

Leu

T- G-T- C-T-C" 20

~ Gtn

-

C-A-G-C-

PrQ -

Leu

-

Lys

C - A - C~ T - T - A - A - G - T - G

~A

I0

Hlnf 4

Fro. 10. The nucleotide sequence of gene G of ~bX174 and the amino acid sequence of its product. The nucleotide sequence is given as the DNA sequence of the viral (plus) strand, which is the complement of the minus strand sequence obtained in Figs 4 to 8. l~estriction sites of Hind II, Hinf, Hap I and Alu I are indicated. The numbering of the amino acid residues from the Nterminus is shown above the amino acid sequence and of the nucleotide residues are shown below the corresponding sequence. They are numbered in the 3' to 5' direction starting from the last residue of the termination codon and correspond to the numbering used in Figs 4 to 8.

of the G protein and predict m a n y , often alternating, regions of ~-helix and fl-sheet, consistent with a globular conformation. Of the results described in this paper t h e 108 a m i n o acids of the protein sequence were identified f r o m 163 successful thin-layer c h r o m a t o g r a m s , whereas the 331 residues of D N A sequence could essentially h a v e been deduced f r o m four 12% a c r y l a m i d e - u r e a gels, t h o u g h in fact duplicate e x p e r i m e n t s were carried out. As in studies on the F protein of r (Air, 1976), m a n u a l m e t h o d s of amino acid sequencing h a v e been used a l m o s t exclusively. The o n l y m e t h o d found to separate the phage coat proteins has been gel filtration in 4 ~t-gu~nidine hydroehloride or 8 ~1-urea, which works v e r y well but is e x p e n s i v e and m e s s y to scale up. H e n c e it was judged preferable to m a k e small a m o u n t s of protein, isolate peptides rapidly b y

532

G. M. A I R ,

F. SANGER

AND

A. R. COULSON

paper electrophoresis and chromatography, and sequence them manually rather than purifying the much larger amounts of peptides necessary for automated sequencing. The basic reason why the simple plus and minus method of sequencing DNA camlot be adapted for proteins is the same reason that nucleotide sequencing was once considered intrinsically more difficult than protein sequencing, namely that the four nueleotides have a great deal of similarity and can be made to behave essentially identically as regards solubility, electrophoretie mobility, susceptibility to both polymerising and degradative enzymes, and so on. Amino acids, on the other hand, have vastly different properties. Peptides do not behave consistently as residues are added or removed, and rates of exo- or endoprotease reactions are unpredictable until the whole sequence is known. Hence the impossibility of determining the Cterminal sequence of this protein (-Pro-Leu-Lys) with carboxypeptidases and the existence of chymotryptie peptides containing three phenylalanine residues (residues 67 to 88) or three tyrosines (residues 108 to 121). Insolubility remains a serious problem in protein sequence work : if the DNA sequence had not been available for confirming overlaps attempts would have been made to separate the insoluble tryptic peptides, which contain 40% of the sequence, and there is no general method known to purify small, insoluble peptides. This solubility problem is absent in nucleotide sequencing work; all fragments from a restriction enzyme digest can be seen on an appropriate acrylamide gel in equimolar proportions so that bands containing unresolved fragments are immediately recognised as double the intensity of their neighbours. Peptides m a y be present in varying yields due to partial cleavage or insolubility, and since any detection methods are dependent to some extent on the amino acid composition the major products cannot be readily distinguished from the many minor ones. The plus and minus method used to sequence the DNA of gene G is rapid and straightforward, but some anomalies may arise, as detailed in Results, and as yet we have not had complete confidence that very long sequences will not have errors. Sequences can be confirmed using the other well-established methods of DNA sequencing (Robertson et al., 1973; Galibert et al., 1974; Sanger et al., 1973) or protein coding regions can be compared to the amino acid sequence (Air et al., 1975,1976). As previously discussed, most uncertainties in a DNA sequence obtained b y the plus-minus method are in determining the lengths of runs of a single nucleotide. Since the amino acid sequence gives the phase of the coding sequence it has been very useful in checking that the DNA sequence is of the correct length. However, as more experience is gained with the plus and minus method we are finding that this checking is becoming less necessary since most run lengths can be resolved by running the gel longer, as in Figure 7, or priming with another restriction fragment. We wish to thank other members of this laboratory (B. G. Barrell, N. L. Brown, J. C. Fiddes, C. A. ttutchison, P. M. Slocombe and M. Smith) for help, advice and the preparation of materials, H. L. tteijneker for a gift of the mutant DNA polymerase I, R. Kamen for a gift of T4 I)NA polymerase, and J. H. Spencer for making his results available to us before their publication. REFERENCES Air, G. M. (1976). J . Mol. Bicl. 107, 433-434. Air, G. M. & Bridgen, J. (1973). Nature N e w Biol. 241, 40-41. Air, G. M., Blackburn, E. H., Sanger, F. & Coulson, A. R. (1975). J . Mol. Biol. 96,703-719.

r

GENE G SEQUENCES

533

Air, G. M., Blackburn, E. H., Coulson, A. R., Ga|ibest, F., Sanger, F., Sedat, J. W . & Ziff, E. B. (1976). J. Mol. Biol. 107, 445-458. Burgess, A. B. (1969). Proc. Nat. Acad. Sci., U.S.A. 64, 613-617. Chou, P. Y. & F a s m a n , G. D. (1974). Biochemistry, 18, 222-245. Donelson, J. E., Barrell, B. G., Weith, H. L., K6ssel, H. & Schott, H. (1975). Eur. J. Biochem. 58, 383-395. Edgell, M. H., Hutchison, C. A. & Sclair, M. (1972). J. Virol. 9, 574-582. Galibert, F., Sedat, J. & Ziff, E. B. (1974). J. Mol. Biol. 87, 377-407. Godson, G. N. & Vapnek, D. (1973). Biochem. Biophys. Acta, 299, 516-520. Goulian, M., Lucas, Z. J. & Kornberg, A. (1968). J. Biol. Chem. 243, 627-638. Harbers, B., Delaney, A. D., Harbers, K. & Spencer, J. H. (1976). Biochemistry, 15, 407414. I-Ieijneker, H. L., Ellens, D. J., Tjeerde, R. H., Glickman, B. W., van Dorp, B. & Pouwels, P. H. (1973). Mol. Gen. Genet. 124, 83-96. Jeppesen, P. G. N. (1974). Anal. Biochem. 58, 195-207. Jeppesen, P. G. N., Sanders, L. & Sloeombe, P. M. (1976). Nucl. Acids Res. 3, 1323-1339. Klotz, I. M. (1967). I n hlethods in Enzymology (Hirs, C. H. W., ed.), vol. 11, pp. 576-540, Academic Press, New Y o r k a n d London. Lee, A. S. & Sinsheimer, R. L. (1974). Proc. Nat. Acad. Sci., U.S.A. 71, 2882-2886. Ling, V. (1972). Proc. Nat. Acad. Sci., U.S.A. 69, 742-746. Peacock, A. C. & I)ingman, C. W. (1967). Biochemistry, 7, 668-674. Robertson, H. D., Barrell, B. G., Weith, H. L. & Donelson, J. E. (1973). Nature New Biol. 241, 38-40. Sanger, F. & Coulson, A. R. (1975). J. Mol. Biol. 94, 441-448. Sanger, F., Donelson, J. E., Coulson, A. R., K6ssel, It. & Fischer, D. (1973). Proc. Nat. Acad. Sci., U.S.A. 70, 1209-1213. Takanami, M. (1973). F E B S Letters, 34, 318-322.

Nucleotide and amino acid sequences of gene G of omegaX174.

J. Mol. Biol. (1976) 108, 519-533 Nucleotide and Amino Acid Sequences of Gene G of ~X174 G. ]~. AIR, F. SANGER AND A. R. COULSO~ Medical Research Co...
11MB Sizes 0 Downloads 0 Views