J Mol Model (2014) 20:2136 DOI 10.1007/s00894-014-2136-5

ORIGINAL PAPER

C-H…pi interactions in proteins: prevalence, pattern of occurrence, residue propensities, location, and contribution to protein stability

Manjeet Kumar & Petety V. Balaji

Received: 6 September 2013 / Accepted: 2 January 2014 / Published online: 14 February 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract C-H…pi interactions are a class of non-covalent interactions found in different molecular systems including organic crystals, proteins and nucleic acids. High-resolution protein structures have been analyzed in the present study to delineate various aspects of C-H…pi interactions. Additionally, to determine the extent to which redundancy of a database biases the outcome, two datasets differing from each other in the level of redundancy have been analyzed. On average, only one out of six {with C-H(Aro) group} or eight {with C-H(Ali) group} residues in a protein participate as C-H group donors. Neither the frequency of occurrence of an amino acid in proteins nor the number of C-H groups present in it is correlated to its propensity to participate in C-H…pi interactions. Most of the residues that participate in C-H…pi interactions are solvent-shielded. The solvent-shielded nature of most of the C-H…pi interactions and the prevalence of intra- as well as inter-secondary structural element C-H…pi interactions suggest that the contribution of these interactions to the enthalpy of the folded form will be significant. The separation in the primary structure between donor and acceptor residues is found to be correlated to secondary structure type. Other insights obtained from this study include the presence of networks of C-H…pi interactions spanning multiple secondary structural elements. To our knowledge this has not been reported so far. A substantial number of residues involved in C-H…pi interactions are found in catalytic and ligand binding sites, suggesting their possible role in maintaining active site geometry. No significant differences in C-H…pi interactions between the two datasets are found for any of the parameters/features analyzed.

Electronic supplementary material The online version of this article (doi:10.1007/s00894-014-2136-5) contains supplementary material, which is available to authorized users.

M. Kumar · P. V. Balaji (*)
Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
e-mail: [email protected]

Keywords C-H…pi interaction · Secondary structure elements · Stability · PDB data mining · Solvent accessible surface area (SASA) · Adjacency

Introduction

The 3D structures of biological molecules and their complexes are stabilized by intra- and inter-molecular non-covalent interactions. These interactions are categorized as hydrogen bonding, electrostatic, van der Waals, etc. for convenience of study [1]. A variety of computational and experimental approaches have been used to probe the relative strengths and importance of such interactions. The appropriateness of the model system and transferability of data to biological systems is a matter of concern. For example, the strength of an N-H…O=C H-bond has been measured by both experimental and computational methods in different model systems [2–9] as well as in proteins [10–16]. The strength of this H-bond, as reported in these studies, varies and expectedly depends on the system and environment, besides the measurement approach. A complementary approach is to mine the high-resolution 3D structure data to study the frequency of occurrence of specific types of non-covalent interactions. Such data mining studies also provide information about the microenvironment and geometry (distance, angle, etc.) of non-covalent interactions (see, for example, refs. [14–16]).

C-H…pi interactions are a class of non-covalent interactions. They can be either electrostatic- or dispersion-dominated depending upon the acidity of the C-H group [17]. Early quantum chemical work on methane-benzene and benzene-benzene systems showed that C-H…pi interactions involving aliphatic or aromatic C-H groups are dominated by dispersion interactions [18]. Consequently, such C-H…pi interactions can be prevalent in both polar and non-polar environments.


Mining of 3D structural databases has shown the ubiquity of these interactions in many types of molecular systems including organic crystals, proteins, and complexes of proteins with carbohydrate and other ligands [19–27]. The prevalence, geometry and location of C-H…pi interactions in proteins, and associated residue propensities have been investigated by many research groups [20, 21, 24, 28–30]. One such study found recurrent patterns of C-H…pi interactions from an analysis of all high-resolution structures [21]. In another study, the geometry of C-H…pi interactions in a high stringency non-redundant dataset was analyzed [20]. It was observed that the perpendicular distance from the C atom to the plane of the pi ring has a clear maximum at 3.7 Å. It was also reported that C-H…pi interactions are more frequent between the side chains of residues that are separated by fewer residues in the primary structure. Both these studies concluded that C-H…pi interactions play an important role in the stabilization of secondary structure elements [20, 21].

Two other studies investigated the conservation of C-H…pi interactions in functionally related proteins. From an analysis of single-chain all-alpha proteins [26], it was found that a significant fraction of donor (75 %) and acceptor residues (65 %) involved in C-H…pi interactions are conserved and, from this, it was concluded that C-H…pi interactions make an important contribution to protein stability. From an analysis of the structures of 26 SH2 domains, it was found that mutations have conserved the C-H…pi interactions in these proteins [28].

The objective of the present study is to analyze various aspects of intramolecular C-H…pi interactions as observed in high-resolution protein structures available to date. Two datasets have been used for analysis by filtering out redundant sequences with high and low stringencies (35 and 100 % sequence identity cut-offs, respectively). The reason for taking two such datasets is to determine the extent to which patterns derived from a highly redundant dataset match those derived from a highly non-redundant dataset.

Methods

Datasets

The entire set of polypeptide chains in the protein databank (PDB; [31]) was culled in February 2012 using PISCES, a protein sequence culling server [32] with the following settings:

(i) theoretical models and entries with only Cα coordinates are excluded
(ii) only crystal structures are considered
(iii) resolution is ≤2.5 Å and crystallographic R-factor is ≤25 %
(iv) length is between 40 and 10,000 residues
(v) sequence identity cutoff is 100 %.
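The per-chain part of these culling settings can be sketched as a simple filter. This is only an illustration of criteria (i) to (iv), not the PISCES implementation: the `ChainRecord` type, its field names, the method labels and the example chain identifiers are all hypothetical, and the sequence identity step (v) is deliberately left out because it requires sequence clustering.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata; field names are illustrative only."""
    chain_id: str
    method: str        # assumed labels, e.g. "X-RAY", "NMR", "MODEL"
    ca_only: bool      # True if the entry has only C-alpha coordinates
    resolution: float  # in angstrom
    r_factor: float    # as a fraction, so 0.25 means 25 %
    length: int        # number of residues in the chain


def passes_culling(rec, max_resolution=2.5, max_r_factor=0.25,
                   min_len=40, max_len=10_000):
    """Apply criteria (i)-(iv) from the text to a single chain record."""
    return (
        rec.method == "X-RAY"                 # (i), (ii): crystal structures only
        and not rec.ca_only                   # (i): no C-alpha-only entries
        and rec.resolution <= max_resolution  # (iii)
        and rec.r_factor <= max_r_factor      # (iii)
        and min_len <= rec.length <= max_len  # (iv)
    )


# Made-up records used purely to exercise the filter.
chains = [
    ChainRecord("demo1_A", "X-RAY", False, 1.8, 0.19, 250),
    ChainRecord("demo2_A", "NMR", False, 0.0, 0.0, 120),
    ChainRecord("demo3_B", "X-RAY", False, 2.8, 0.22, 300),
    ChainRecord("demo4_A", "X-RAY", True, 2.0, 0.20, 150),
    ChainRecord("demo5_A", "X-RAY", False, 2.5, 0.25, 40),
    ChainRecord("demo6_A", "X-RAY", False, 2.0, 0.20, 39),
]
kept = [c.chain_id for c in chains if passes_culling(c)]
```

Treating the resolution, R-factor and length limits as inclusive follows the "≤" and "between" wording of the list; a chain that sits exactly on a limit is therefore retained in this sketch.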


This resulted in a non-redundant dataset of polypeptide chains named D100. The 100 % sequence identity cutoff ensured that proteins with multiple structure depositions are considered only once. Inclusion of this criterion resulted in a nearly 50 % reduction in the total number of proteins. The PISCES server was used again to obtain a second dataset, named D35, from D100 by setting the sequence identity cutoff to 35 %. No other constraint or filter was employed for generating D35. The coordinates of proteins included in the dataset were downloaded from the protein databank [33]. From these, the following residues were excluded: (i) residues with partial occupancy, (ii) residues with one or more missing atoms, (iii) residues with doubtful sequence (i.e., more than one type of side chain), and (iv) residues whose average side chain B-factor is >40 Å². After applying these filters, only polypeptide chains that have at least one aromatic residue were retained, leading to 19,672 proteins in D100 and 5,399 proteins in D35 (for brevity, the word ‘protein’ will be used henceforth to refer to a polypeptide chain). Most of the proteins are in the resolution range 1.5–2.5 Å and R-factor range 15–25 % (Fig. S1). The number of proteins in the lower resolution range is very similar in D35 and D100, but this is not so in the higher resolution range, as can be expected. The distributions of the lengths of proteins and the amino acid compositions of the two datasets are comparable to those of the Swiss-Prot and TrEMBL databases (Fig. S2).

Identification of C-H…pi interactions

Hydrogen atoms were added using GROMACS [34–37]. The side chains of His, Phe, Trp, and Tyr were considered as pi acceptors. The arithmetic average of the coordinates of ring atoms was taken to be the position of the pi acceptor. All the C-H groups were considered as potential donors and were categorized as either aliphatic [denoted as C-H(Ali)] or aromatic [C-H(Aro)] donors. The C-H groups from the side chains of His, Phe, Trp and Tyr, with the exception of CB-H, were considered as aromatic C-H groups; all other C-H groups were considered as aliphatic C-H groups. The H…pi distance, with a cutoff of 3.5 Å, was used to identify C-H…pi interactions. For all the C-H…pi interactions, the C…pi distance, the C-H-pi angle (denoted as θ) and the angle ω between the plane of the aromatic residue and the plane containing H and the two nearest atoms of the aromatic ring were also calculated. θ and ω are defined as in [19, 30, 38] (Fig. S3). STRIDE [39] was used to assign absolute solvent accessible surface areas (SASA) and secondary structure to residues. VMD [40] and PyMol (The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC.) were used for visualization and rendering.
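The distance criterion described above can be illustrated with a minimal sketch. It assumes that the pi position is the plain centroid of the ring atoms (as stated), that the 3.5 Å cutoff is inclusive, and that θ is the angle at H between the H→C and H→pi vectors; the last two points are readings of the text rather than details confirmed by it. The function names are hypothetical, the ring is an idealized planar hexagon rather than coordinates from a structure, and the angle ω is not computed here.

```python
import math

H_PI_CUTOFF = 3.5  # angstrom; H...pi distance cutoff given in the text


def centroid(points):
    """Arithmetic average of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def _sub(a, b):
    return tuple(a[i] - b[i] for i in range(3))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _angle_deg(u, v):
    """Angle between two vectors in degrees, clamped for numerical safety."""
    cos_a = sum(u[i] * v[i] for i in range(3)) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def ch_pi_geometry(c, h, ring_atoms):
    """Return (H...pi distance, C...pi distance, theta) for one C-H group."""
    pi = centroid(ring_atoms)
    d_h_pi = _norm(_sub(pi, h))
    d_c_pi = _norm(_sub(pi, c))
    theta = _angle_deg(_sub(c, h), _sub(pi, h))  # angle C-H-pi, vertex at H
    return d_h_pi, d_c_pi, theta


def is_ch_pi(c, h, ring_atoms, cutoff=H_PI_CUTOFF):
    """Distance-only test; the cutoff is assumed to be inclusive."""
    return ch_pi_geometry(c, h, ring_atoms)[0] <= cutoff


# Idealized six-membered ring in the xy-plane (C-C bond length 1.39 angstrom),
# with a C-H group pointing straight down at the ring centre.
ring = [(1.39 * math.cos(k * math.pi / 3), 1.39 * math.sin(k * math.pi / 3), 0.0)
        for k in range(6)]
h_atom = (0.0, 0.0, 2.70)
c_atom = (0.0, 0.0, 3.79)
d_h_pi, d_c_pi, theta = ch_pi_geometry(c_atom, h_atom, ring)
```

In this idealized arrangement the H atom sits 2.70 Å above the centroid and θ is 180°, so the pair passes the distance test; moving the same C-H group 4 Å above the ring would fail it.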


Results

The ratio of the number of proteins in D100 to that in D35 is comparable to the ratio of the number of C-H…pi interactions in these two datasets (Table 1). The numbers are comparable whether the C-H donor group is aliphatic or aromatic. Thus, use of a lower stringency sequence similarity cut-off for generating the non-redundant dataset has not influenced the number of C-H…pi interactions. In fact, (i) the frequencies of occurrence of the 20 amino acids (Fig. S2) and (ii) the fraction of a residue type participating in C-H…pi interactions (Fig. 1) are also very similar in the two datasets.

Number of C-H…pi interactions

On average, only one out of six {C-H(Aro)} or eight {C-H(Ali)} residues in a protein participate as C-H group donors. However, two out of three aromatic residues participate as a C-H group acceptor (Table 1). Such low donor participation can be ascribed to the number of aromatic residues being much less than the number of aliphatic residues and the presence of fewer aromatic residues in the neighborhood of aliphatic residues (Fig. S4). These observations show that C-H…pi interactions are less prevalent. Many residues participate in multiple C-H…pi interactions (Fig. 3). Acceptor residues have a higher propensity to be involved in such multiple interactions than donors since the number of acceptor residues is less than the number of donor residues.

Table 1 Number of C-H…pi interactions and number of proteins in which they are found

a The amino acids His, Phe, Trp, and Tyr are considered as the aromatic residues. The remaining 16 amino acid residues are treated as the non-aromatic residues. Donor C-H groups from the aromatic and non-aromatic residues are referred to as C-H(Aro) and C-H(Ali), respectively

b The number of residues that participate in the corresponding C-H…pi interactions

This suggests a clustering of C-H(Ali) donor groups around aromatic residues. The frequency of occurrence of an amino acid in the dataset is not correlated to how often it participates as a donor in a C-H…pi interaction (Fig. 2). Leu, Ala, Gly, and Val are the most frequently occurring residues but Phe, Met, Trp, and Pro participate in C-H…pi interactions the most. The four rotatable single bonds in the side chain of Met presumably facilitate its participation in C-H…pi interactions. C-H groups from Met participate in 26,161 interactions in D100; of these, CE-H, CG-H, CB-H, and CA-H are involved in 47.2, 30.4, 18.3, and 4.1 % of interactions, respectively. A C-H…pi interaction between Met and His in the transmembrane domain of the viral potassium channel Kcv has in fact been shown to be important for function [41]. As with the frequency of occurrence, the mutability of the 20 residues derived from human genome sequence analysis [42] is not correlated to the frequency with which a residue participates in C-H…pi interactions. Phe, Met, and Trp participate most frequently in C-H…pi interactions but are ranked 5, 6, and 20 in terms of their mutability. Ile has the highest mutability but an intermediate frequency of participation in C-H…pi interactions. The generality of the mutability values derived from an analysis of the human genome is not known. Hence, it is not clear if the mutability of an amino acid is correlated in any way to the frequency with which it participates in a C-H…pi interaction.

Description                                      D100         D35          D100:D35 ratio

Number of proteins in the dataset                19,672       5,399        3.64
Number of non-aromatic residues (a)              3,205,874    876,238      3.66
Number of aromatic residues (a)                  474,078      128,934      3.68
Total number of residues                         3,679,952    1,005,172    3.66

C-H(Ali)…pi
  Number of interactions                         556,083      150,780      3.69
  Number of donor residues (b)                   407,548      110,692      3.68
  Number of acceptor residues (b)                319,994      86,923       3.68
  Number of interactions per acceptor            1.74         1.74         1.0

C-H(Aro)…pi
  Number of interactions                         90,324       25,032       3.61
  Number of donor residues                       76,171       21,172       3.59
  Number of acceptor residues (b)                77,155       21,464       3.59
  Number of interactions per acceptor            1.17         1.17         1.0

Average number of interactions per protein
  Non-aromatic residue {i.e., C-H(Ali)}          28.3         27.9         1.0
  Aromatic residue {i.e., C-H(Aro)}              4.6          4.6          1.0

Average number of interactions per residue
  Non-aromatic residue {i.e., C-H(Ali)}          0.17         0.17         1.0
  Aromatic residue {i.e., C-H(Aro)}              0.19         0.19         1.0
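The derived rows of Table 1 follow from the raw counts by simple division, which can be checked with a short sketch. The counts below are copied from the table; the dictionary keys and the `derived` helper are names introduced here for illustration, and the assumption that the per-protein and per-residue averages are plain quotients of the tabulated totals is an inference from the numbers, not a statement from the text.

```python
# Raw counts copied from Table 1.
counts = {
    "D100": {
        "proteins": 19_672, "non_aromatic": 3_205_874, "aromatic": 474_078,
        "ali_interactions": 556_083, "ali_donors": 407_548, "ali_acceptors": 319_994,
        "aro_interactions": 90_324, "aro_donors": 76_171, "aro_acceptors": 77_155,
    },
    "D35": {
        "proteins": 5_399, "non_aromatic": 876_238, "aromatic": 128_934,
        "ali_interactions": 150_780, "ali_donors": 110_692, "ali_acceptors": 86_923,
        "aro_interactions": 25_032, "aro_donors": 21_172, "aro_acceptors": 21_464,
    },
}


def derived(d):
    """Quotients assumed to underlie the averaged rows and the 'one out of N' figures."""
    return {
        # average number of interactions per protein
        "ali_per_protein": d["ali_interactions"] / d["proteins"],
        "aro_per_protein": d["aro_interactions"] / d["proteins"],
        # average number of interactions per residue of the donor class
        "ali_per_residue": d["ali_interactions"] / d["non_aromatic"],
        "aro_per_residue": d["aro_interactions"] / d["aromatic"],
        # fraction of residues acting as donors / acceptors
        "ali_donor_fraction": d["ali_donors"] / d["non_aromatic"],
        "aro_donor_fraction": d["aro_donors"] / d["aromatic"],
        "ali_acceptor_fraction": d["ali_acceptors"] / d["aromatic"],
    }


summary = {name: derived(d) for name, d in counts.items()}
for name, s in summary.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

For D100 the donor fractions come out near 0.127 for non-aromatic and 0.161 for aromatic residues, which is where the "one out of eight" and "one out of six" figures come from, and about 0.675 of aromatic residues act as acceptors of C-H(Ali) groups, i.e. roughly two out of three.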


Fig. 1 Comparison of the frequencies of occurrence (a) and frequencies of participation in C-H…pi interactions (b) of the 20 amino acids in D100 and D35. AAtot: The total number of all the 20 amino acid residues in the dataset. Xtot: The number of times an amino acid residue X is found in the dataset. XC-H…pi: The number of times the amino acid residue X participates in a C-H…pi interaction
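The quantities named in this caption can be combined as in the following sketch. It assumes that the frequency of occurrence is Xtot/AAtot and that the frequency of participation is XC-H…pi/Xtot; the caption defines the three counts but does not spell out the ratios, so both formulas are assumptions. The residue counts are made-up numbers chosen only to show that the two rankings need not agree, and the function name is hypothetical.

```python
def occurrence_and_participation(x_tot, x_chpi):
    """Return {residue: (Xtot / AAtot, X_CHpi / Xtot)} for each residue type."""
    aa_tot = sum(x_tot.values())
    return {
        res: (x_tot[res] / aa_tot, x_chpi.get(res, 0) / x_tot[res])
        for res in x_tot
    }


# Hypothetical counts, not taken from D100 or D35.
x_tot = {"LEU": 900, "ALA": 800, "MET": 200, "TRP": 100}
x_chpi = {"LEU": 180, "ALA": 40, "MET": 80, "TRP": 60}

freqs = occurrence_and_participation(x_tot, x_chpi)
by_occurrence = sorted(freqs, key=lambda r: freqs[r][0], reverse=True)
by_participation = sorted(freqs, key=lambda r: freqs[r][1], reverse=True)
```

With these toy counts the most abundant residues (LEU, ALA) are not the ones with the highest participation fraction (TRP, MET), which mirrors the kind of decoupling between abundance and participation that the text reports.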

Among the different atom types, CA atoms participate the least in C-H…pi interactions. However, in most of the CA-H…pi interactions, the residue involved is either Gly, Ala or Pro. In fact, Gly:CA has the highest atomic solvent accessibility value among the CA atoms of all residues [43]. A comparison of homologs from thermophilic and mesophilic organisms has shown that C-H…O and C-H…pi interactions are more numerous in thermophilic proteins and, for the latter, Gly and Pro contribute more [44].

Frequency of participation and the number of C-H donor groups

The number of C-H groups that can potentially participate in a C-H…pi interaction varies across the 20 amino acids because of the differences in their structures and sizes. Gly has only two C-H donor groups, whereas Leu and Ile have ten each. However, the number of C-H…pi interactions that a residue participates in is not strictly correlated to the number of C-H donor groups present in it (Fig. 4). Ala occurs as frequently as Leu but the fraction of Ala involved in C-H…pi interactions is much less than that of Leu, probably because it has fewer C-H groups. However, such a rationalization based on the number of potential C-H groups is not applicable in all the cases: for example, Ile and Leu have the same number of C-H groups but fewer Ile participate in C-H…pi interactions than Leu (Fig. 4). Two pairs of amino acid residues (Asn/Gln and Asp/Glu) have comparable frequencies of occurrence in proteins (Fig. S2), and within a pair, not surprisingly, the residue that has more C-H groups participates more frequently in C-H…pi interactions (Fig. 2). Phe and Trp have a higher propensity to participate in C-H…pi interactions than His and Tyr (Fig. 2). In the case of Trp, this can be ascribed to its larger surface area. In fact, Gly and Trp are among the other amino acids that show significant deviations in the frequencies of participation/occurrence plots. Quantum chemical calculations with model systems in the gas phase have shown that the interactions are strongest when Trp is the acceptor, whereas with His as acceptor they are weakest [45]. These observations suggest that His plays a minor role as pi acceptor in the stabilization of protein structure. This is also in accordance with the observation that His often has a functional role such as metal binding and catalysis [46–49].

Fig. 2 Comparison of the fraction of times a residue participates in C-H…pi interactions to its frequency of occurrence in datasets D100 (a) and D35 (b). AAtot, Xtot and XCH…pi are the same as in Fig. 1. For the aromatic residues, counts include the number of times they participate as donors as well as acceptors

Geometry of C-H…pi interactions

With a few exceptions as noted below, the H…pi distances span the entire range from 2


Fig. 3 Representative renderings showing the participation of a residue in multiple C-H…pi interactions. a Bacillus sp. sarcosine oxidase (PDB id 2A89). b Sus scrofa alpha-amylase (PDB id 1PPI). c Dictyostelium discoideum non-muscle type myosin-2 heavy chain (PDB id 3BZ9). Color code: hydrogen, white; nitrogen, blue; oxygen, red; carbon, magenta (participating in C-H…pi interactions) or green

to 3.5 Å for both C-H(Ali) and C-H(Aro) interactions (Fig. 5 for D100 and Fig. S5 for D35). The number of interactions is low in the lower H…pi distance range and high in the higher distance range. The number of interactions increases as the H…pi distance increases, unlike the histograms for θ and ω, which show a clear maximum. Gas phase calculations have shown that the interaction energy continually increases (i.e., progressively weaker interactions) with distance [50, 51]. This indicates that only a few C-H…pi interactions have optimal geometry and hence, maximal contribution to enthalpy. Overall, the distribution of H…pi distances and the angles θ and ω indicate that the energy of C-H…pi interactions varies widely. In the case of C-H(Aro) interactions in D35, the increase in the number of interactions with distance is not uniform (Fig. S5). The histogram shows an oscillatory pattern with well-defined maxima at certain distances. This suggests preferences for certain H…pi distances but the disappearance of this pattern in D100 (Fig. 5) implies that the preference is not strong. The H…pi distance is 120 or
