Pairwise decomposition of an MMGBSA energy function for computational protein design.

FULL PAPER

WWW.C-CHEM.ORG

Pairwise Decomposition of an MMGBSA Energy Function for Computational Protein Design Thomas Gaillard* and Thomas Simonson* Computational protein design (CPD) aims at predicting new proteins or modifying existing ones. The computational challenge is huge as it requires exploring an enormous sequence and conformation space. The difficulty can be reduced by considering a fixed backbone and a discrete set of sidechain conformations. Another common strategy consists in precalculating a pairwise energy matrix, from which the energy of any sequence/conformation can be quickly obtained. In this work, we examine the pairwise decomposition of protein MMGBSA energy functions from a general theoretical perspective, and an implementation proposed earlier

for CPD. It includes a Generalized Born term, whose manybody character is overcome using an effective dielectric environment, and a Surface Area term, for which we present an improved pairwise decomposition. A detailed evaluation of the error introduced by the decomposition on the different energy components is performed. We show that the error remains C 2014 Wiley reasonable, compared to other uncertainties. V Periodicals, Inc.

Introduction

lytical approximation of the PB model, which is more efficient.[72,73] The nonpolar effect of solvation can be modeled with a term proportional to the solvent-accessible surface area (SA) of the solute atoms.[74–77] The combination of MM with PB/ GB and SA solvation terms leads to an “MMPBSA/MMGBSA” energy function. These have been widely used to estimate and decompose the free energy of molecular systems.[78,79] In addition to implicit solvent models, several other strategies have been used to reduce the cost of energy evaluation when exploring the vast sequence-conformation space. A first, widely used approximation consists in using a fixed backbone.[8] In some CPD applications, like sidechain reconstruction or the inverse folding problem, the backbone is indeed an input of the problem. In general, it would be preferable to take backbone flexibility into account, in particular when the system is known to adopt distinct conformational states. In such cases, one can do a series of CPD calculations with different backbones,[80] or alternate cycles of sequence exploration with a fixed backbone and cycles of backbone conformation exploration with a fixed sequence.[21] A second key idea consists in replacing the continuous conformational space of sidechains by a discrete set of favorable conformations called rotamers.[81] Rotamer libraries of different sizes and built with different approaches have been used in CPD.[82] A third strategy consists in precalculating and storing energy components, from which the energy of any sequence-conformation can be quickly obtained.[10,15,19,22,83] Such approaches are common in computer science and known as lookup table strategies. The

Protein design aims at conceiving novel proteins or modifying existing ones to achieve desired functions. Computational techniques are valuable to help rationalize the predictions and guide experimental tests.[1–7] Computational protein design (CPD) has stimulated important methodological efforts,[8–25] which have been applied to sidechain reconstruction,[9,15,17,22,26] mutation stability prediction,[22,23] redesign of full protein sequences,[10,15,16,20,24,27–30] fold recognition,[31,32] and the creation of proteins with a wide range of new structures or activities.[33–46] CPD is particularly difficult as it involves exploring an enormous sequence and conformational space. For a protein of 300 amino acids, with an average of 10 conformations per amino acid, an exhaustive exploration would visit (10 3 20)300 possibilities. Efficient exploration algorithms are thus required.[9,14,15,26,47–58] Another critical aspect is the choice of the energy function.[59–61] Energy functions for CPD can either be the same ones widely used for molecular mechanics (MM) simulations,[15,19,20,22–25] or a combination of empirically readjusted MM functions and knowledge-based statistical potentials.[16,26,33,40,62–64] Energy functions based on physical principles have the advantage of better transferability and can benefit from the latest developments and additions in force field libraries and solvation models. Indeed, an essential component of the energy function is the solvent contribution.[65] In MM, the water molecules can either be explicitly represented or implicitly modeled through a continuum having the same average properties as water.[66–68] While explicit representations of the solvent are more accurate, implicit solvent models have the advantage of reduced computational cost, increased conformational sampling speed, and easier access to thermodynamic properties. The electrostatic effect of solvation can be described with the Poisson–Boltzmann (PB) equation.[69–71] The Generalized Born (GB) model is an ana-

DOI: 10.1002/jcc.23637

T. Gaillard T. Simonson Department of Biology, Laboratoire de Biochimie (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France E-mail: [email protected] and [email protected] C 2014 Wiley Periodicals, Inc. V

Journal of Computational Chemistry 2014, 35, 1371–1387

1371

FULL PAPER

WWW.C-CHEM.ORG

total number of interaction terms to precalculate is then of around ð103203300Þ2 =2 for a protein with 300 amino acids. Our CPD implementation follows all three of these strategies.[22–24] The latest version of our CPD software Proteus was presented recently.[84] The lookup strategy just described requires that the energy function is “pairwise additive”: it has the form of a sum over residue pairs. MM energy terms from commonly used, nonpolarizable force fields can be written in this way. However, this is not true of GB and SA solvation terms, which both have a many-body character. Indeed, the SA term depends on the surface area (SA) of a union of atomic spheres. Intersections between three or more atomic spheres can lead to SA energy terms that depend on more than two residues. The GB interaction energy between two residues depends on the Born solvation radii of all their atoms. These radii measure the distance of each atom to the solvent, and depend on the geometry of the whole system. In this work, we present a general analysis of many-body effects in MMGBSA energy functions and a method for their pairwise decomposition, with CPD applications in mind. We distinguish three types of many-body effects. The first type is well known; it comes from the many-body character of the solvation energy, mentioned above. This effect has already been studied, and approximate pairwise decomposition methods have been proposed for both SA [15,85] and GB terms,[18,19,86,87] and also for PB terms.[88–90] Here, to make the solvation energy pairwise additive, we use a GB “Native Environment” approximation (NEA), presented earlier,[46,84,91] and an improved pairwise decomposition for the SA terms, presented here. With the NEA, the Born solvation radii are obtained for each residue with the rest of the system in a native or native-like conformation, which provides a plausible, effective dielectric environment. While the NEA eliminates one type of GB many-body effects, it introduces another type. Indeed, the Born radius of each atom is directly related to the GB “self” interaction of the atom with itself, mediated by solvent.[92] Within dielectric continuum theory,[70] this self-energy takes the form of an integral over the protein volume. In GB models, the self-energy is made pairwise additive by replacing the volume integral by a sum over atomic volumes.[93,94] This approximation is usually good but can deteriorate for unphysical conformations, where some of the atomic volumes significantly overlap,[95–98] so that there is a double-counting of desolvation effects. With the NEA, the Born radii are obtained for each residue and each of its rotamers with the rest of the system in its native conformation. Not uncommonly, some rotamers will overlap with the native environment, leading to inaccurate Born radii and GB energies. Below, we evaluate this second type of many-body effect and introduce a simple method to correct for it. Finally, a third many-body effect arises with some CPD implementations, including ours.[15,24,84] Indeed, when the pairwise interaction energies are computed and tabulated, our method applies a short energy minimization, which allows the pair of interest to adjust its conformation slightly. This helps alleviate the rotamer approximation and, more importantly, 1372

Journal of Computational Chemistry 2014, 35, 1371–1387

the use of fixed geometries for covalent bonds and angles, which is common in CPD applications. These minimizations imply that the pair interactions in the energy lookup table all use slightly different structures, so that they do not add up exactly to the energy computed for any single structure. This third, “structural” many-body effect is less general than the first type and will not be studied here. Rather, we do test calculations where no pairwise minimization is performed and the third effect does not exist. In the next section, the theoretical basis for MMGBSA pairwise decomposition is given. We then describe the computational implementation and the setup for test calculations on three proteins. In the Results section, we report the accuracy and efficiency of the energy decomposition for three test proteins. The last section is a Discussion.

Theory In this section, we present the theoretical basis for MMGBSA energy pairwise decomposition. The MM, GB, and SA components are analyzed separately. While general, the analysis is written with CPD applications in mind. The fixed and variable subsystems In the CPD context, the system is usually made up of two parts: a fixed part (e.g., the protein backbone), and one or more variable parts (e.g., the sidechains). The presence of a fixed part has practical consequences for the pairwise decomposition of the SA term, and so we make it explicit: fsystemg5ffix; var1 ; var2 ; . . .g

(1)

Energy terms only involving the fixed part are constant and do not need to be calculated as they cancel out when comparing different sequences or conformations of the protein. Other energy terms are variable and need to be evaluated for each combination of sequence/conformation. The variable part of the energy will be approximated below by a sum of self-energies EII (the interaction of a sidechain I with itself and with the fixed part) and of pairwise interaction energies EIJ: Evariable

X I

EII 1

X

EIJ

(2)

I

An efficient parallel algorithm for accelerating computational protein design.

Expanded explorations into the optimization of an energy function for protein design.

Computational design of enzyme-ligand binding using a combined energy function and deterministic sequence optimization algorithm.

An accurate method for direct dual-energy calibration and decomposition.

Crysalis: an integrated server for computational analysis and design of protein crystallization.

Rosetta:MSF: a modular framework for multi-state computational protein design.

Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability.

Direct Calculation of Protein Fitness Landscapes through Computational Protein Design.

Exploring the repeat protein universe through computational protein design.

Improving computational efficiency and tractability of protein design using a piecemeal approach. A strategy for parallel and distributed protein design.

Designing protein function - macromolecular design.

Towards the computational design of protein post-translational regulation.

Computational design of a self-assembling symmetrical β-propeller protein.

Computational design of a pH-sensitive IgG binding protein.

Computational study on the vinyl azide decomposition.

An empirical energy function for structural assessment of protein transmembrane domains.

Pairwise interactions in collagen and the design of heterotrimeric helices.

Computational protein design enables a novel one-carbon assimilation pathway.

Antibody humanization by structure-based computational protein design.

Removing T-cell epitopes with computational protein design.

Guided ion beam and computational studies of the decomposition of a model thiourea protein cross-linker.

An efficient algorithm for pairwise local alignment of protein interaction networks.

Improved scFv anti-HIV-1 p17 binding affinity guided from the theoretical calculation of pairwise decomposition energies and computational alanine scanning.

Pairwise cluster randomization: an exposition.