.

.

.

.

.

. .

.

.

Volume 305, number 2, 163-165

.

.

.

.

.

.

.

.

FEBS 11250

.,...

June 1992

© 1992 Federati.',n of European Biochemical Societies 00145793/9T$5.00

Nucleotide composition of genes and hydrophobicity of the encoded proteins Jan Mfizek and Jaroslav Kypr Institute o f Biophysics, C'.echoslovak Academy of Sciences, C8-61265 Brno, Czechoslovakia •'

"

R e ~ i v e d 20 May i992 "

We find that true proteins are generally more hydrophobic than the eorrespondinl~ hypothetical proteins encoded by the randomized gcnc nucleotidc sequences. Furthermore, the protein hydrophobicity but not its gene nuclcotide composition is conserved within evolutionary families of ftmctioflally related proteins, These two findings indicate that there is a general drift to modify gene nucl¢otide composition in the course of evolution. • An inspection of codon usage in genes shows that the drift mainly increases the content of adenine at the c)~penseof thymine. Gene nucleotide composition; Adenine-to-thymine ratio; Protein hydrophobicity

1. I N T R O D U C T I O N

Genes code for proteins but thenucleotide sequences also bear further information. For example, the genetic code degeneracy plays a significant role in the regulation of gene expression [1], the (G+C)-content of genes correlates with the (G+C)-content of the neighboring genome refstons [:2], nucleotide organization in genes hinders their partial melting [3], and the distribution of dinucleotides contributes to D N A positioning in nudeosomes [4]. In addition, many point mutations do not affect protein functional activities so that there is a further degree of freedom which permits to optimize gene function regarding the above and other aspects that do not affect the encoded protein properties. The present study is a converging continuation of two directions of our previous work. The first deals with the information deposited in genes which buffers the effects of single point mutations on stability of the encoded protein three-dimensional structure [5]. The second is concerned with the anomalies of the nucleotide composition [fi, 7] and codon usage [8] in AIDS viruses. The present • results are based on an extensive statistical study of the EMBL database of nueleotide sequences regarding a feature which we have happened to notice during studies of AIDS virus proteins and their genes.

Correspondence address: J. Mrgaacutc;zek, Institute of Biophysics, Czechoslovak Academy of Sciences, C8-61265 Brae, Czechoslovakia.

Published by Elsevier Science Publisher.~' B. V,

2. MATERIALS A N D METHODS 2.1. Nucleotide sequences All sequencesused arc from the EMBL Nuclootide Sequence Database [9]. For this study, only protein coding sequences were n~dcd. To avoid introns, a smaller subdatabase of mature mRNA sequences was created using programs of the GCG-paekage [10].Only the mRNA sequences of complete g~n~ coding for polypcptide chains Of 100 and more amino acids were included into the subdatabas¢. The sabdatabase contains about 4000 sequences mostly from eukaryotic species. 2.2. Protein families" The following families of proteins were considered: histories Hi, H2A, H2B, H3 and H4, eytoehrome c, elongation factor Tu, glyceraldehyde-3-pho~phatc dehydrogenase, aspartie proteinaso, aetin and ras.oncogene. The histone sequences were chosen from the EMBL Nuclcotide Sequence Database by programs of the GCG-packaBe which facilitate finding sequences by keywords. Duplicate sequences were removed manually. The HSSP database of homology derived protein smactures [I 1] was used to define members of the remaining families. Control sets of ;Sequences coding for evolutionary unrelated proteins were chosen from the mRNA sequence subdatabase. 2.3. Hydeophobictties The hydrophobicities of amino acids reported by Black and Mould [12] were used. Th~'y are given in sealed values so that the most hydrophilic amino acid (arginine) has hydrophobicity 0 and the most hydrophobie amino acid (phenylalanine) 100. l-lydrophobicity of a protein is the average value of the hydrophobicities of its amino acids. The expected hydrophobicities characterize hypothetical proteins encoded by randomized gene sequences preserving, renpectively, the gene nuclcodde composition or the nueleotide compositions at the three codon positions, The hydrophobieities are calculated from the frequencies of occurren¢~ of the four basez in genes or in the first, second and third positions of the codons pn.'sent in the gene. 2.4. l/ariabilio, uf sequences Variability of hydrophobicity in sequences within a protein family or a control set was characterized by the average ar/d maximum diffcren~s between th,-" hydrophobieities of all couples of pi'oteins of the family or set. The naeleotide compositions of genes were delined by the frequencies of occurrence of the four nuclcotid~'s at the three codon

163

Volume 305, amber 2

FEBS LETFERS

imsitions. The d i f f e n m ~ between the nucleotide compositions of two was defined by the cc~fficienl D,

D=I

June 1992 I

7

I

i=!

tX

5

whexe./:-d and.f:~ are the twelve freq.,le~es in the first and the second z-

3

3. RF_SULTS

@

Wc hav~ noticed during studies o f AIDS proteins and proteins a r e m o r e hydrophobic than hypothetical proteins encoded by randomized nucteotide ~ o n s o f their genes. This motivated us to look whether tht~ is a general property or a specific feature o f the AIDS viruses. Results o f th~ analysis reveal that the above property is general (Fig. ~). r~le hydrophobicity predi~.,~d fi-om the global mmleotide commotion is shifted by approximate!3 5% o f the dil: ference between tlm most hydrophilie aad ~h~ ,~-).~ h,¥drophobi~ amino agids in the di,~ction oflower hydrophobi~ities. Such a chang~ is dramatic because it cam for example, be achieved if 15 argjnines (the most hydrophilicamino acid) are replaced in a 300 amino acid polypeptid~ by phenyLalanines (the most hydrophobic amino acid). The shift is smaller but also significant if nucleotide comlx~itions in the particular codon positions arc pixy-fred. Fig. 1 is based on the nucleotide sequences o f about 4000 genes ori~nating irom many organisms so that the above conclusions are statistically t h k ~ g e n e s .+hat ~

siLmificant. The general dissonance between the protein hydrophobicities and nucleotide compositions o f their genes

0.2r

.,':

-

b.....J/ \

A 0 •. _

1

,~5

50

55

0.03

0.04

0.05

I

7 0.06

0.07

Fig. 2. Protein hydrophobicity and gene nucleotide composition variab~ities (their averages ever all pairs o f the compared sequences) .~:Lhin evolutionarily unrelated sequences (dosed circles) and families v~ evolutionarily related sequences which include ',he families of histone HI (A), histone H2A (B), histone H2B (C), histtme H3 (Dh histone H4 (E}, actin (F), aspartic proteinase (G), ¢ytoehrome c (H), elongation factor Tu (I), glyceraldehyde-3-phosphate dehydrogenase

(J) and ras (K).

can be explained in two extreme ways. The first possibility is that protein hydrophobicity increases in the evolution while the gene nuclcotide composition remains conserved. The other explanation is that protein hydrophobicity is what is conserved in evolution while the gene nucleotide composition is modified. In order to find the answer, we analyzed a number o f families o f the proteins which are evolutionary and functionally related and found that their hydrophobicities were strongly conserved. In contrast, the nucleotide compositions of their genes varied (Fig. 2). The bases involved in the modification of the gene nucleotide composition can easily be identified by calculating their average contributions to the amino acid hydrophobicity (Table I). This gives the well known fact Table I Average occurrence of bases in genes of several species and their contributions to hydrophobicity of the encoded proteins (compilation o f codon u s a ~ in genes comes from [14]

50

!. ~ b u f i o n s ofapproximatety4000 proteins a c c e d i n g to their ~drophobk:iey (~olid l ~ ) , hyd___ccop~bk-:53.ypr~....L~ from the nuc!eolidv compodtinn o f their genes (dashed Iine) and hydrophobicity predicted from tim n ~ ¢ o m ~ at the particular codon positions of their genes (dotmd line}, flm distributions are given as the relative number o f , ~ q u e n o ~ whose hydrophobi6t ;_~_lie within small ~nterv~ of h y d J ~ w l r , ~ The ~ axis scale was chosen ~a , ~ tim area umier ~ c h curve iz one.

164

E

O

Nucleotide c o m p o s i t i o n variabiIity

Contribution to hydrophobicP.? ~

HFdrophoblclty Fig-

G

D

Base 40

J

H

C

A C G T

17.8 24.7 20.4 32.5

Occurrence (per 1000 eodons) Human

Rat

753 791 808 620

770 767 800 636

Drosophila Yeast 745 825 822 575

951 575 623 786

• A sum over all 61 sense codons of the hydrophobicities of the encoded amino acids multiplied by the occurrence of the base in the codon (from 0 to 3).

Volume 305, number2

FEBS LETTERS

that thymine is the base predominantly coding for hydrophobic amino acids while adenine codes for hydrophffic amino acids [13]. Guanine and c y t ~ m e are more o r less neutral/n !hls respect. A n inspection 6 f a recent compilation o f codon usage in genes [14] suggests that adenine indeed dominates over thymine in genes o f various o rffant~rng (Table I). T h e aderJine-to-thymine ratio ranges from i.2 to 1.3. Consequently, we arrive at the conclusion that there is a general tendency to increase the amount o f adenine in genes at the expense o f thymine, A consequence o f this tendency is the. finding which motivated this work, i ~. the fact that proteins axe generally more hydrophobic than it would be predicted by the nucleotide compositions o f their genes. 4. DISCUSSION This work reports on a new property of gene nucleotide sequences which is general and stat'~cally significant so that its further elaboration can, for exa~npl~, extend the spectrum o f methods to discriminate be,.~¢~a protein coding and non-coding nucleotide sequences which are essential especially in projects o f eukaryotic genome sequencing. ~ aspect as well as exceptions regarding the present general rule will be reported elsewhere. Here we only address the question from where the general drift to substitute t h y m i n e by adenln¢ it] ge,es might originate. Tramontano and Macchiato [15] have introduced a useful property o f codons called the information value, which reflects a probability that a codon mutated by a random sin~e point mutation perturbs the encoded protein function. Codons having high information values are eliminated from genes to stabilize protein struc-

June 1992

ture and function against ra.nd-:~:.~mutations. This tendency is so strong and general that it can be used to discriminate ~ t w c e n p r o e m ~ i n g ~ d n o n ~ i n g nueleotide s e q n ~ f i ~ [i ~ , S ~ p l e ~ a t i o n s sh6~ that thymine is the most informative base in the above sense [5]. This is a very good reason to eliminate it from all positions in genes where it can be replaced by another base.

Acknowledgemems: The authors arc grateful to D F G which supported a short-term visit of.LM, in the European Molecular Biology Laboratory. Chris Sander is thanked for his hospitality and stimulating discussions.

REFERENCES [I] Ikemura, T. (1985) Mol. Biol. Evol. 2, 13-34. [2] Bernardi, G. (1989} Annu. Rev. Genet. 23, 637~61. [3] Wada, A. and Suyama, A. 0986) in: Biomolecular Stereodynamits IV (R.H. Sarma and M.H. Sarma eds.) pp. 255-269, Adenine Press. ~4] Trifonov, E.N. and Sussman, J.L. (1980) Proc. Natl. Acad. SCi. USA 77, 3816-3820. [5] Kept, J. (1986) Biochem. Biophys. Res. Commun. 139. 1094I097.

[6] Kypr, J. and M~.ek, J. (I987) Nature 327, 20. [7] Kypr, J., Mfizek, J. and Reich, J. 0989) Bioclfim. Biophys. Acta 1009, 280-282. [8] Kypr, J. and Mgz~k, J. (I989) J. Theor. Biol. I41,423- 424. [9] Harem, G. a__-.dCameron, G. 0986) Nucleic Acids Res. 14, 5-10. [14}] Devereux, J., Haeberli, P. and Smithies, (3. (1984) Nucleic Acids Res. 12, 387-395. [11] Sander, C. and Schneider, R. (1991) Proteins 9, 56-68. [I2] Black, S.D. and Mould, D.R. (1991) Anal. Biochem. 193, 72-82. [13] Kypr, J. and Mfizc~ J. 0987) Int. J. Biol. MacromoL 9, 49-53. [14] Wada. K., Wada, Y., DoL H., lshibashi, F., Gojobori, T. and lkemura, T. (1991) Nucleic Acids Res. 19, 1981-1986. [15] Tramontano. A. and Macchiato, M.F. (1986} Nucleic Acids Res. 14, 127-135.

165

Nucleotide composition of genes and hydrophobicity of the encoded proteins.

We find that true proteins are generally more hydrophobic than the corresponding hypothetical proteins encoded by the randomized gene nucleotide seque...
222KB Sizes 0 Downloads 0 Views