This article was downloaded by: [Rutgers University] On: 09 April 2015, At: 21:34 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biomolecular Structure and Dynamics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tbsd20

The Significant Conservative and Variable Regions of the Homologous Protein Sequences

Pavel V. Kostetsky & Rimma R. Vladimirova

M.M. Shemyakin Institute of Bioorganic Chemistry, USSR Academy of Sciences, Moscow, 117871, USSR

Published online: 21 May 2012.

To cite this article: Pavel V. Kostetsky & Rimma R. Vladimirova (1992) The Significant Conservative and Variable Regions of the Homologous Protein Sequences, Journal of Biomolecular Structure and Dynamics, 9:6, 1061-1072, DOI: 10.1080/07391102.1992.10507979

To link to this article: http://dx.doi.org/10.1080/07391102.1992.10507979

PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is


expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102, Volume 9, Issue Number 6 (1992), ©Adenine Press (1992).

The Significant Conservative and Variable Regions of the Homologous Protein Sequences

Pavel V. Kostetsky and Rimma R. Vladimirova

M.M. Shemyakin Institute of Bioorganic Chemistry, USSR Academy of Sciences, Moscow 117871, USSR

Abstract

A method for identifying significant conservative and variable regions in homologous protein sequences is presented. A set of aligned homologous sequences is divided into two groups consisting of the m and n most closely related sequences. Each pair of sequences from different groups is compared using a unitary similarity matrix. The superposition of the pairwise comparisons, scanned with a window of 10 amino acid residues, gives the intergroup local variability profile (VP). The area S of the figure between the VP and its mean value line is compared with the average area Sr of 1000 VPs of artificial homologous protein families. The difference (S-Sr), expressed in units of the standard deviation σr, is taken as the overall irregularity of amino acid substitutions along the homologous protein sequences: OI = (S-Sr)/σr. If OI > 2, the extrema of the real VP that contain the surplus area S-(Sr+2σr) are cut off. The cut-off stretches are likely to be significant conservative and variable regions. The significant conservative and variable regions of six homologous sequence families (phospholipases A2, cytochromes b, α-subunits of Na,K-ATPase, L- and M-subunits of the photosynthetic bacteria photoreaction centre and human rhodopsins) were identified. It was shown that for artificial homologous protein sequences derived by k-fold lengthening of natural proteins the OI value rises as √k. To compare the degree of substitution irregularity in homologous protein sequence families of different length L, the value of the standard substitution overall irregularity for L = 250 is proposed.

Introduction

A comparison of protein amino acid sequences is useful for studying the three-dimensional structure of homologous proteins (1-9). Homologous amino acid sequences often contain positions of high conservativity and high variability. For the protein families of globins (1,2), Cu,Zn-dismutases and others (3-9) it has been shown that hydrophobic amino acid residues inside the protein core are more conserved than the residues located at the protein surface, which are subject to various substitutions. Notably, many invariant amino acid residues form the active centres of enzymes and make the contacts between subunits in multisubunit proteins (6,8,9). It is well known, however, that conservative and variable residues in amino acid sequences of homologous proteins are located irregularly and form clusters of various lengths (9-14). In mammalian serine proteases there are a number of conservative regions consisting of segments with regular secondary structure (14). For ribulose 1,5-bisphosphate carboxylases (9), composed of 8 large and 8 small subunits, 10 conservative regions have been identified. The amino acid residues forming these regions were shown to be responsible for contacts between large and small subunits and between the large subunits themselves. For amino acid sequences of homologous ribulose 1,5-bisphosphate carboxylases (9), conservative regions are regarded as those having 4 or more invariant residues in adjacent positions.

Abbreviations: VP, variability profile.

However, until now there has been no generally accepted method for distinguishing conservative and variable regions from those with a random distribution of amino acid substitutions. In 1981 Greer (14) proposed to regard as structurally conserved regions those groups of residues in which the α-carbons of the compared amino acid residues lie very close to each other (the maximum deviation permitted for any α-carbon in the group is 1 Å). Evidently, this approach requires three-dimensional structure data for all the proteins compared. Such data are not always available, which is why it is convenient to use only the amino acid sequence data for identifying significant conservative and variable regions of homologous proteins. In 1989 Howell proposed (11) to define highly conserved and highly diverged segments as those whose number of identical residues is, respectively, greater or less than the mean score by one standard deviation. Based on the analysis of similarity plots for various pairs of amino acid sequences, Howell identified several conserved and diverged regions in the family of homologous cytochromes b. The number and location of these regions depended on which pair of sequences was compared.

DeLisi and coworkers (12) developed and applied a new method for locating hypervariable residues in immunoglobulin-related molecules. It was assumed that hypervariable residues do not appear randomly in the sequence but are confined to a few sequentially localized regions. A filtering algorithm was used to eliminate spurious hypervariable residues. The method was applied to immunoglobulin light chains and to class 1 and class 2 products of the major histocompatibility complex. Simulations based on a Markovian stochastic process indicate that hypervariable residues can be reliably identified by this method when 10 or more sequences are available.

More recently we have proposed a graphic method for locating significant conservative and variable regions in a given set of homologous protein sequences divided into two phylogenetic groups (13,15). To estimate the overall irregularity of amino acid substitutions along the homologous protein sequences, we suggest using the area S of the figure between the intergroup variability profile (VP) and the mean variability value line. The value S is compared with the average area Sr of 1000 random VPs generated by permutation of the amino acid residue columns of the initial aligned sequence family. If S is greater than Sr by more than 2 standard deviations σr, the real VP is considered to contain significant peaks and hollows, which correspond to variable and conservative regions of the homologous protein sequences.


The family of 11 homologous snake phospholipases A2 of the Naja and Bitis genera was studied in detail (13,15). It was shown that the intergroup VP of the amino acid sequence family, divided into two groups according to a phylogenetic tree, does not depend on the number of representatives in each group. The substitution overall irregularity and the location of conservative and variable regions do not change noticeably even if one of the groups is represented by a single sequence.

In this report we describe in detail the method for locating conservative and variable regions in homologous protein sequences and apply it to six families of homologous proteins: phospholipases A2, cytochromes b, α-subunits of Na,K-ATPase, L- and M-subunits of the photosynthetic bacteria photoreaction centre and human rhodopsins. To estimate the overall irregularity of amino acid substitutions along the homologous protein sequences we introduce the value OI = (S-Sr)/σr. For all the homologous protein families considered we have investigated the dependence of the value OI on the length of the proteins and on the length of the scan window. To compare the degree of substitution overall irregularity of homologous protein sequences of different length L, the standard substitution overall irregularity value OI250 for L = 250 is proposed.

Materials and Methods

Homologous Protein Families

Six families of homologous proteins represented by 32 amino acid sequences are studied: mitochondrial cytochromes b of autotrophs (wheat (16)) and heterotrophs (mouse, drosophila and yeast (16)); α-subunits of Na,K-ATPase of fish (Torpedo californica (17)) and mammals (sheep (18), pig (19) and human (20)); L- and M-subunits of the photosynthetic bacteria photoreaction centre (Chloroflexus aurantiacus (21,22), Rhodospirillum rubrum (23), Rhodobacter capsulatus (24), Rhodobacter sphaeroides (25,26), Rhodopseudomonas viridis (27)); snake phospholipases A2 of the genus Naja (N.n.oxiana (28), N.n.atra, N.n.kaouthia, N.melanoleuca I, N.mossambica, N.mossambica pallida (16), N.nigricollis (29)) and of the genus Bitis (B.gabonica, B.nasicornis, B.caudalis (16)); human rhodopsins (black-white, blue, red and green (30,31)).

Sequence Alignment and Phylogenetic Trees

Homologous amino acid sequences are aligned according to published data: cytochromes b (11), human rhodopsins (31), α-subunits of Na,K-ATPase (20), L- and M-subunits of the photosynthetic bacteria photoreaction centre (21-23), phospholipases A2 (15) (Table I). Phylogenetic trees of the homologous proteins are constructed from the distance matrix following Fitch (32). Each element of the distance matrix is the percentage of amino acid substitutions in the pairwise comparison of two aligned sequences.

Variability Profiles

To obtain the intergroup VP, a set of sequences is first divided into two groups of m and n sequences according to the phylogenetic tree (13). All m sequences of the first group are compared with all n sequences of the other one; the amino acid substitutions at each position are summed and normalized to n*m to give the position variability value vi. The position variability values vi averaged over a given number l of consecutive positions give the local variability value Vi (in this paper the scanning window length is l = 10):

$$V_i = 0.1 \sum_{j=i}^{i+9} v_j \qquad [1]$$

The local variability values Vi plotted against the number of the first position of the window define the intergroup VP. The VP is determined for the common part of the sequences; the additional N- and C-terminal residues of some proteins are not included in the analysis (11,13). Deletions are counted as a 21st amino acid residue.

Table I
Group and Species Composition of Six Homologous Protein Families and Location of Deletions in the Aligned Sequences

Phospholipases A2 of snakes, len=127
  First group: N.kaouthia, N.n.atra, N.n.oxiana, N.m.mossambica, N.m.pallida, N.nigricol. (deletions: see Figure 1)
  Second group: B.caudalis, B.nasicornis, B.gabonica (deletions: see Figure 1)

Cytochromes b, len=398
  First group: wheat (deletions: none)
  Second group:
    mouse (deletions: 1-5, 113, 114, 389-398)
    drosophila (deletions: 1-4, 113, 114, 385-398)
    yeast (deletions: 1-6, 392-398)

α-subunits of Na,K-ATPase, len=1024
  First group:
    human (deletions: 28)
    sheep (deletions: 22, 23, 28)
    pig (deletions: 22, 23, 28)
  Second group: T.californica (deletions: 20)

L-subunits of photosynthetic bacteria photoreaction centre, len=320
  First group: C.aurantiacus (deletions: 244-246, 297, 315-320)
  Second group:
    Rps.virid. (deletions: 1-15, 30, 33-40, 44-53, 86, 98-100, 116, 313-320)
    Rb.capsulatus (deletions: 1-15, 30, 33-40, 44-53, 92, 98-100, 116)
    Rs.rubrum (deletions: 1-15, 30, 33-40, 44-53, 90, 98-100, 116, 315-320)
    Rb.sphaeroides (deletions: 1-15, 30, 33-40, 44-53, 92, 98-100, 116)

M-subunits of photosynthetic bacteria photoreaction centre, len=327
  First group: C.aurantiacus (deletions: 1-8, 10, 106, 318-327)
  Second group:
    Rps.virid. (deletions: 39, 106, 290, 306)
    Rb.capsulatus (deletions: 39, 106, 290, 306, 311-327)
    Rs.rubrum (deletions: 39, 106, 290, 306, 310-327)
    Rb.sphaeroides (deletions: 290, 306, 310-327)

Human rhodopsins, len=367
  First group: white-black (deletions: 1-16, 353-355)
  Second group:
    red, green (deletions: 336, 362, 363)
    blue (deletions: 336, 362, 363; 1-19)
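The construction of the intergroup VP described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `position_variability` and `variability_profile` are hypothetical, the sequences are assumed to be already aligned strings of equal length, and a gap character `-` is simply compared like any other symbol, which is one way to read "deletions are counted as a 21st amino acid residue".

```python
from itertools import product


def position_variability(group_a, group_b):
    """Position variability v_i: share of the m*n intergroup pairs that
    differ at aligned column i (unitary matrix: mismatch = 1, identity = 0)."""
    length = len(group_a[0])
    pairs = list(product(group_a, group_b))  # all m*n intergroup pairs
    v = []
    for i in range(length):
        # '-' is compared like a residue, i.e. treated as a 21st symbol
        substitutions = sum(1 for a, b in pairs if a[i] != b[i])
        v.append(substitutions / len(pairs))
    return v


def variability_profile(v, window=10):
    """Local variability V_i: mean of v over `window` consecutive positions
    starting at i, as in equation [1] for window = 10."""
    return [sum(v[i:i + window]) / window
            for i in range(len(v) - window + 1)]
```

For an alignment of length L and a window of 10 the profile has L-9 points, which matches the summation limits used for the area S in the next section.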


Substitution Overall Irregularity

The area S enclosed between the VP and the mean local variability value line characterizes the irregularity of the amino acid substitutions along the protein sequences. The area is calculated as

$$S = \sum_{i=1}^{L-9} |V_i - \bar{V}| \qquad [2]$$

where L is the length of the common part of the aligned homologous protein sequences, Vi is the local variability value according to [1], and

$$\bar{V} = \frac{1}{L-9} \sum_{i=1}^{L-9} V_i \qquad [3]$$

is the mean local variability value. The area S obtained is compared with the average area Sr of 1000 random VPs generated by permutation of the amino acid residue columns of the initial aligned sequence family (13,15,33). To generate random VPs it is sufficient to permute only the position variability values vi. The difference (S-Sr), expressed in units of the standard deviation σr, is suggested to define the substitution overall irregularity along the homologous protein sequences (13,15):

$$OI = (S - S_r)/\sigma_r \qquad [4]$$
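Equations [2]-[4] and the permutation procedure can be illustrated with the following Python sketch. It is not the authors' program: the function names, the fixed random seed and the use of the sample standard deviation for σr are assumptions made here for illustration. As stated in the text, only the position variability values vi are permuted.

```python
import random
import statistics


def profile(v, window=10):
    """Local variability profile V_i (equation [1] for window = 10)."""
    return [sum(v[i:i + window]) / window
            for i in range(len(v) - window + 1)]


def area(v, window=10):
    """Area S of equation [2]: sum of |V_i - mean(V)| over the profile."""
    vp = profile(v, window)
    mean_v = sum(vp) / len(vp)  # equation [3]
    return sum(abs(x - mean_v) for x in vp)


def overall_irregularity(v, window=10, n_random=1000, seed=0):
    """Return (OI, S_r, sigma_r) following equation [4].

    Assumption: sigma_r is the sample standard deviation of the random
    areas; the text does not specify which estimator was used.
    """
    rng = random.Random(seed)
    s_real = area(v, window)
    shuffled = list(v)
    random_areas = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # permute the position variabilities v_i
        random_areas.append(area(shuffled, window))
    s_r = statistics.mean(random_areas)
    sigma_r = statistics.stdev(random_areas)
    return (s_real - s_r) / sigma_r, s_r, sigma_r
```

A strongly clustered input, for example 30 invariant positions followed by 30 fully variable ones, gives an OI well above 2, whereas a completely uniform input makes σr zero and the ratio undefined, so such degenerate inputs need separate handling.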

Note that random permutation of the amino acid residue columns of natural aligned homologous protein sequences gives artificial homologous protein sequences identical to those of the initial family in amino acid composition and pairwise distance matrix. However, the VPs of artificial homologous protein families evidently contain only peaks and hollows that occur by chance and have a moderate influence on the VP area, which rarely exceeds the value Sr + 2σr. Thus S > Sr + 2σr (i.e. OI > 2) is the natural condition for the real VP to include significant peaks and hollows.

Identification of Conservative and Variable Regions

If OI > 2, the area S of the real VP is, according to equation [4], greater than the average area Sr of the randomly generated VPs by more than 2σr. The profile extrema containing the surplus area S-(Sr+2σr) are cut off. The cut-off stretches (peaks and hollows) are likely to be variable and conservative regions. To calculate the ordinates of the cut-off lines, an iterative interval-halving algorithm is used until the accuracy of the result is better than 0.05. The maximum and minimum variability values of the VP are taken as the initial ordinates of the upper and lower cut-off lines, respectively. These lines are then moved iteratively so that the total area of the cut-off peaks and the total area of the cut-off hollows are each equal to S/2-(Sr/2+σr).
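The interval-halving search for a cut-off line can be sketched as follows. This is a minimal illustration, not the original implementation: the names `cut_area` and `find_cutoff` are hypothetical, the tolerance is assumed to apply to the cut-off area, and the search interval is assumed to run between the mean line and the extreme value of the profile. The `target` argument is the area to be removed by one line, i.e. half of the surplus area.

```python
def cut_area(vp, level, upper=True):
    """Area of the profile above (upper=True) or below (upper=False) a line."""
    if upper:
        return sum(max(x - level, 0.0) for x in vp)
    return sum(max(level - x, 0.0) for x in vp)


def find_cutoff(vp, target, upper=True, tol=0.05, max_iter=100):
    """Bisect for the ordinate whose cut-off area equals `target`.

    Assumption: `tol` is the allowed error in area; the text only states
    an accuracy of 0.05 without naming the quantity.
    """
    mean_v = sum(vp) / len(vp)
    lo, hi = (mean_v, max(vp)) if upper else (min(vp), mean_v)
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        a = cut_area(vp, mid, upper)
        if abs(a - target) < tol:
            return mid
        if upper:
            # raising the upper line cuts off less area
            if a > target:
                lo = mid
            else:
                hi = mid
        else:
            # raising the lower line cuts off more area
            if a > target:
                hi = mid
            else:
                lo = mid
    return (lo + hi) / 2
```

Profile points lying above the upper ordinate then mark candidate variable stretches, and points below the lower ordinate mark candidate conservative stretches. Because the two lines remove equal areas from unequal halves of the profile, they generally end up at different distances from the mean line.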

Analysis of an auxiliary symbol string (Figure 1) allows significant conservative and variable regions to be localized more accurately. Each symbol of the string (blank, semicolon or asterisk) corresponds to the degree of variability vi of the given column in comparison with the mean position variability v̄. The conservative and variable regions are examined to determine which residues, if any, in adjoining positions should be included.

Figure 1: A comparison of snake phospholipase A2 amino acid sequences of the Naja and Bitis genera: N.n.kaouthia (1), N.n.atra (2), N.n.oxiana (3), N.melanoleuca I (4), N.m.mossambica (5), N.m.pallida (6), N.nigricollis (7), B.caudalis (8), B.gabonica (9), B.nasicornis (10). Gaps (-) have been inserted to achieve maximum homology. Dots represent residues identical to those of the N.n.kaouthia sequence. Each element of the lower row corresponds to the degree of variability of the given column over the 7*3 possible intergroup Naja-Bitis comparisons. An asterisk indicates positions occupied by identical residues, a semicolon marks positions with variability vi lower than the average value, and a blank marks positions with variability greater than or equal to the average value v̄ = 0.59.

Sequence 1 (N.n.kaouthia), positions 1-65:
NLYQFKNMIQCTVPNRSWWDFADYGCYCGRGGSGTPVDDLDRCCQVHDHCYDEAEKISRCWPYFK
Sequence 1 (N.n.kaouthia), positions 66-127:
TYSYECSQGTLTCKNGNNACAAAVCDCDRLAAICFAGAPYNHN-NYNIDLKARCQ-------
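The auxiliary symbol string described in the Figure 1 legend can be generated from the position variability values with a short Python function. The function name `symbol_string` is hypothetical, and the character `;` is used here only because the legend calls the mark a semicolon. An asterisk is emitted when vi is zero, which is equivalent to the column being occupied by identical residues in all intergroup pairs.

```python
def symbol_string(v, mean_v=None):
    """Map each position variability v_i to '*', ';' or ' ' (blank).

    '*' : v_i == 0 (identical residues in every intergroup comparison)
    ';' : 0 < v_i < mean position variability
    ' ' : v_i >= mean position variability
    """
    if mean_v is None:
        mean_v = sum(v) / len(v)  # mean position variability
    symbols = []
    for x in v:
        if x == 0:
            symbols.append("*")
        elif x < mean_v:
            symbols.append(";")
        else:
            symbols.append(" ")
    return "".join(symbols)
```

Printed under the alignment, such a string makes it easy to see whether the positions adjoining a cut-off stretch are themselves invariant or highly variable.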

Results and Discussion

The aligned homologous protein sequences considered contain many deletions (15% in the case of the L-subunits of the bacterial photoreaction centre). Most of them, however, are located at the N- and C-termini of the proteins (Table I). On average only 0.5% of the positions in the common part of the sequences are deletions (only in snake phospholipases A2 does the proportion of deletions exceed 2%). To decrease the influence of deletions on the VP, we do not include the additional N- and C-terminal residues in the analysis. The homologous amino acid sequences of each family considered are divided into two groups of most closely related proteins according to the phylogenetic trees (Table I). Intergroup comparison shows that all families have a high degree of irregularity of amino acid substitutions along the homologous protein sequences, from OI = 3.2 for the L-subunits of the photosynthetic bacteria photoreaction centre to OI = 10.3 for the α-subunits of Na,K-ATPase (Table II).

Figure 2: Intergroup local variability profile of snake phospholipase A2 sequences from the Naja and Bitis genera presented in Figure 1. The profile shows the variability value averaged over 10 consecutive positions versus the first position of the segment (residue number). The unbroken line indicates the mean local variability value along the protein sequences. The upper and lower dashed lines give the cut-off values used to identify significant variable and conservative regions of the sequences.

Most of the families of homologous sequences considered (phospholipases A2, L- and M-subunits of the photosynthetic bacteria photoreaction centre and rhodopsins) contain more than 50% amino acid changes in intergroup comparison. The cytochrome b family contains about 40% changes, and only the family of α-subunits of Na,K-ATPase has less than 15% changes on average. As a consequence, the VPs of the first five families mentioned above can be easily analysed. These VPs contain significant peaks and hollows (Figure 2), corresponding to variable and conservative regions of the homologous protein sequences (Table II).

If the proteins have a very high degree of homology, the identification of conservative regions is not easy. For example, the intergroup comparison of the α-subunits of Na,K-ATPase reveals high homology of the proteins (only 13-14% substitutions). For about one third of its length the corresponding VP runs along the abscissa axis (Figure 3). In this case random VPs also run along the axis and, therefore, despite their high conservativity these regions are not statistically significant. Hence, only variable regions can reasonably be identified in this case.

Figure 3: Intergroup local variability profile of α-subunits of Na,K-ATPases from fish (T.californica) and mammals (human, pig and sheep), plotted against residue number. Unbroken and dashed lines as in Figure 2.

Note that the dashed lines identifying the significant peaks and hollows of the VPs of phospholipases A2 and α-subunits of Na,K-ATPase (Figures 2 and 3) evidently lie at different distances from the mean value line (unbroken line), because these lines cut the same amount of area from the upper and lower parts of the VP, which are not the same.

It should be mentioned that in the family of cytochromes b six regions of high conservativity are identified (Table II), 5 of which were found earlier by Howell (11). Howell analysed a series of comparisons in which five different proteins, each used as a master, were compared with the others. In contrast, the intergroup VP of the cytochrome b family divided into two phylogenetic groups gives information about all significant conservative and variable regions at once (Table II).

Although most of the families shown in Table I consist of homologous proteins from various species, the family of rhodopsins is intraspecies. It consists of 4 human rhodopsins providing black-white and colour vision. In this case the black-white rhodopsin is compared with the 3 colour rhodopsins; the corresponding VP demonstrates a high substitution overall irregularity (OI = 6.9) and contains a number of significant peaks and hollows (Table II). It should be noted that this family also includes several homologous rhodopsins of other species (fruit fly, bovine, octopus and others (16)). However, we use only human rhodopsins because the focus of our analysis in this report is on homologous proteins with one dimension of diversity, i.e. proteins that either come from different species or differ in function. This condition facilitates dividing the family into two groups and interpreting the results of the intergroup comparison.

Table II
The Results of Intergroup Comparison in Six Homologous Protein Sequence Families

Phospholipases A2: Naja and Bitis, residues 1-120
  Intergroup amino acid changes: 54-64%
  OI value: 5.0
  Segments of high conservativity: 4-9, 24-34, 41-51, 92-101
  Segments of high variability: 10-23, 52-66, 75-89, 102-118

Cytochromes b: autotrophs and heterotrophs, residues 7-384
  Intergroup amino acid changes: 34-50%
  OI value: 6.4
  Segments of high conservativity: 39-59, 70-95, 133-161, 170-186, 207-217, 254-298
  Segments of high variability: 7-19, 96-100, 113-119, 196-202, 239-250, 298-307, 314-319, 325-341, 361-384

α-subunits of Na,K-ATPase: T.californica and mammalia, residues 1-1024
  Intergroup amino acid changes: 13-14%
  OI value: 10.3
  Segments of high conservativity: difficult to identify
  Segments of high variability: 5-33, 54-62, 120-127, 231-234, 260-265, 437-441, 469-476, 495-504, 521-537, 560-583, 674-678, 882-899, 972-977, 1001-1013

L-subunits of photosynthetic bacteria photoreaction centre: green and purple, residues 54-311
  Intergroup amino acid changes: 58-61%
  OI value: 3.2
  Segments of high conservativity: 56-67, 192-201, 226-237
  Segments of high variability: 70-90, 98-103, 238-250, 277-297

M-subunits of photosynthetic bacteria photoreaction centre: green and purple, residues 9-309
  Intergroup amino acid changes: 56-63%
  OI value: 4.8
  Segments of high conservativity: 46-53, 101-115, 194-206, 246-260
  Segments of high variability: 9-43, 54-67, 86-95, 142-152, 226-241, 277-291, 300-309

Human rhodopsins: white-black and colour, residues 20-367
  Intergroup amino acid changes: 57-60%
  OI value: 6.9
  Segments of high conservativity: 82-94, 117-130, 142-158, 190-204, 259-270, 307-322
  Segments of high variability: 20-38, 52-70, 100-116, 131-141, 165-185, 207-229, 236-249, 289-306, 333-349

Figure 4: The dependence of the substitution overall irregularity value OI of homologous rhodopsins upon the length of the scan segment (l).

Table III
Observed and Calculated Values of Amino Acid Substitution Overall Irregularity OIobs and OIcalc for Artificial Homologous Protein Sequences Derived by k-Fold (k=2,3) Lengthening of Natural Homologous Sequences (k=1)*

Phospholipases A2 of snakes, L=120
  k=1: OI = 5.0
  k=2: OIobs = 6.3, OIcalc = 7.0
  k=3: OIobs = 8.3, OIcalc = 8.6
  OI250 = 7.2

Cytochromes b: autotrophs and heterotrophs, L=378
  k=1: OI = 6.4
  k=2: OIobs = 9.1, OIcalc = 10.3
  k=3: OIobs = 11.1, OIcalc = 11.6
  OI250 = 5.2

α-subunits of Na,K-ATPase: T.californica and mammalia, L=1023
  k=1: OI = 10.3
  k=2: OIobs = 14.2, OIcalc = 14.6
  k=3: OIobs = 19.8, OIcalc = 17.8
  OI250 = 5.2

L-subunits of photosynthetic bacteria photoreaction centre: green and purple, L=259
  k=1: OI = 3.2
  k=2: OIobs = 4.7, OIcalc = 4.5
  k=3: OIobs = 5.9, OIcalc = 5.5
  OI250 = 3.2

M-subunits of photosynthetic bacteria photoreaction centre: green and purple, L=300
  k=1: OI = 4.8
  k=2: OIobs = 6.9, OIcalc = 6.8
  k=3: OIobs = 8.2, OIcalc = 8.3
  OI250 = 4.4

Human rhodopsins: white-black and colour, L=348
  k=1: OI = 6.9
  k=2: OIobs = 9.2, OIcalc = 9.7
  k=3: OIobs = 12.1, OIcalc = 11.9
  OI250 = 5.8

* OI and OIobs are obtained from numerical experiments using equation [4]; OIcalc = OI√k. The standard substitution overall irregularity value for a hypothetical homologous protein family of fixed length L = 250 is calculated as OI250 = OI√(250/L).
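The two scaling rules given in the footnote to Table III are simple enough to check numerically. The sketch below uses hypothetical function names and takes the tabulated (rounded) OI values as input, so small differences from the printed columns are to be expected.

```python
import math


def oi_lengthened(oi, k):
    """OIcalc for a k-fold lengthened family: OI * sqrt(k)."""
    return oi * math.sqrt(k)


def oi_standard(oi, length, standard_length=250):
    """Standard overall irregularity: OI * sqrt(standard_length / length)."""
    return oi * math.sqrt(standard_length / length)
```

For the phospholipase A2 row (OI = 5.0, L = 120), `oi_lengthened(5.0, 2)` is about 7.07 and `oi_standard(5.0, 120)` is about 7.22, in line with the tabulated 7.0 and 7.2.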

We have studied scan segments of different length to find out whether the results of region identification depend on the segment length. It was found that the maximum value of the substitution overall irregularity OI is achieved at a segment length l = 8-9 amino acid residues for most of the families, and at l = 12 for the α-subunits of Na,K-ATPase. The number and location of the significant peaks and hollows of the VP are almost unchanged at l = 8-12. In contrast, at l of about 20 the OI values decrease and the peaks and hollows become indistinct and inconsistent (Figure 4). Consequently, a segment of l = 10 seems to be suitable for successful identification of conservative and variable regions of homologous protein sequences.

It should be mentioned that the substitution overall irregularity value OI depends on the protein length: the longer the protein family, the greater the OI value. The data in Table III illustrate this dependence: the OI values are presented for the six families considered and for artificial families derived by k-fold lengthening of the natural homologous protein sequences. Evidently, the OI value increases in proportion to √k, deviating from it by only 6% on average.

To compare the degree of substitution overall irregularity of different homologous protein families it is convenient to use a substitution overall irregularity referred to a standard length. In this report a length of 250 amino acid residues, corresponding to the length of a large protein domain (34), is suggested. The formula for calculating the standard substitution overall irregularity OI250 is:

$$OI_{250} = OI \sqrt{250/L} \qquad [5]$$

where OI is calculated according to [4] and L is the length of the common part of the aligned sequences. Note that OI250 = 5.2 for the α-subunits of Na,K-ATPase is half the value OI = 10.3 (Table III), whereas for the phospholipase A2 family OI250 = 7, while OI = 5. Thus, among the 6 families of homologous protein sequences considered, the family of snake venom phospholipases A2 has the largest value of the standard substitution overall irregularity. It seems reasonable to identify conservative and variable regions if OI250 > 2. According to [4] and [5], this inequality can be written as follows:

$$S > S_r + 2\sigma_r \sqrt{L/250} \qquad [6]$$

Thus, the cut-off surplus of area is S-(Sr+2σr√(L/250)). Evidently, for protein families with L > 250 the restriction [6] becomes more rigid the longer the protein sequences are. For example, the homologous sequences with L > 1000 and OI
