FULL PAPER

WWW.C-CHEM.ORG

Quantum Supercharger Library: Hyper-Parallel Integral Derivatives Algorithms for Ab Initio QM/MM Dynamics

C. Alicia Renison, Kyle D. Fernandes, and Kevin J. Naidoo*

This article describes an extension of the quantum supercharger library (QSL) to perform quantum mechanical (QM) gradient and optimization calculations as well as hybrid QM and molecular mechanical (QM/MM) molecular dynamics simulations. The integral derivatives are, after the two-electron integrals, the most computationally expensive part of the aforementioned calculations/simulations. Algorithms are presented for accelerating the one- and two-electron integral derivatives on a graphical processing unit (GPU). It is shown that a Hartree–Fock ab initio gradient calculation is up to 9.3X faster on a single GPU compared with a single central processing unit running an optimized serial

version of GAMESS-UK, which uses the efficient Schlegel method for s- and l-orbitals. Benchmark QM and QM/MM molecular dynamics simulations are performed on cellobiose in vacuo and in a 39 Å water sphere (45 QM atoms and 24,843 point charges, respectively) using the 6-31G basis set. The QSL can perform 9.7 ps/day of ab initio QM dynamics and 6.4 ps/day of QM/MM dynamics on a single GPU in full double precision. © 2015 Wiley Periodicals, Inc.

Introduction

The use of energy or forces provides two fundamental and distinct ways to discover chemical mechanisms.[1] In the companion paper to this work, we described the acceleration of the one- and two-electron integrals and the self-consistent field (SCF) routines that are central to computing the energy of a molecular system.[2] Key molecular processes that underpin, for example, chemical reactivity can be understood through the net forces and appropriately partitioned components of the net forces on the nuclei in a molecular system.[3] Optimizing small molecular systems in vacuum at 0 K with a view to understanding the effect of geometry and conformation on enthalpically based macroscopic properties forms the kernel of research in many laboratories.[4] This is because the potential energy surface (PES) and the stationary points on it underlie the macroscopic behavior of molecular systems.[5] Establishing the optimal conformation and geometry is principally achieved through computing the gradient of the energy with respect to the nuclear coordinates. Reaction mechanisms are preferably discovered from minimum pathways traced in free energy space. Chemical reactions passing through complex intermediates in enzymes[6] or in solution[7,8] are best understood from their multidimensional free energy volumes. Previously, we[4,9] and others[6,10] have argued that the use of multidimensional free energy reaction surfaces within a quantum mechanical and molecular mechanical (QM/MM) paradigm[11–13] is central to understanding chemical reaction mechanisms. Hybrid QM/MM methods are extensively used for studying electronic events in condensed phase and enzymatic environments. Typically a semiempirical QM method is used, even though these vary in levels of accuracy, as we have shown in chemical glycobiology simulations.[14–16] The use of a higher level of theory, such as an ab initio method, to describe the electronic Hamiltonian is closer to a realistic model of a chemical reaction. Here, we present hyper-parallel implementations of two important components of the derivative of the electronic energy, namely the first derivative of the two- and one-electron integrals with respect to geometric variables. The former results in an acceleration of calculations such as geometry optimizations, while the latter is key to massively speeding up ab initio QM/MM simulations. The combination of this work and the quantum supercharger library (QSL) Hartree–Fock (HF) algorithm, which we present separately,[2] forms the core of the QSL package that may easily accelerate legacy codes, such as GAMESS-UK,[17] GAMESS-US,[18,19] and NWChem,[20] used in electronic structure calculations. The only other complete graphical processing unit (GPU) implementation of the derivative evaluation in ab initio computations is that of the Martínez group.[21] Their work is based on developing a new package called TeraChem that they have interfaced with the AMBER[22] molecular dynamics package to enable users to run QM/MM molecular dynamics simulations.[23]

Journal of Computational Chemistry 2015, 36, 1410–1419

DOI: 10.1002/jcc.23938

C. Alicia Renison, K. D. Fernandes, K. J. Naidoo
Scientific Computing Research Unit and Department of Chemistry, University of Cape Town, Rondebosch, Cape Town 7701, South Africa
E-mail: [email protected]
Contract grant sponsor: South African Research Chairs Initiative (SARChI) of the Department of Science and Technology and the National Research Foundation (NRF); Contract grant number: 48103 (K.J.N.)
© 2015 Wiley Periodicals, Inc.

WWW.CHEMISTRYVIEWS.COM


Figure 1. Computational bottlenecks in (a) a gradient calculation based on olestra and (b) a static QM/MM simulation of cellobiose in a water sphere (20 Å, 3267 point charges). The bars show CPU (blue) and GPU (red) times in seconds. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Electronic Structure Optimization and QM/MM Evaluations

The analytical first derivatives of the energy with respect to nuclear coordinates provide the capability to calculate the forces on atoms, locate minima on PESs, and perform conformational and/or configurational analyses. Previously, we showed that accelerating the two-electron integrals significantly improves the performance of a static QM calculation.[2] However, when performing an optimization, gradient, or QM dynamics simulation, the two-electron integral derivatives become a second major bottleneck in an accelerated electronic structure modeling package (see Fig. 1a). Of greater significance to the computational effort required for QM/MM simulations is the time taken to compute the one-electron integral derivatives. The relative time taken to calculate this third bottleneck along with the other two is illustrated in Figure 1b.

Ionic Forces

The forces on the nuclei in molecules provide a valuable physical picture of molecular structure and reactivity. A practical route to computing these forces at the ground state of the system comes from the Hellmann–Feynman theorem.[24,25] The HF energy is given by

\[
E=\langle\Psi|H|\Psi\rangle
=2\sum_i\left(\psi_i|H_1|\psi_i\right)
+\sum_{ij}\left[2\left(\psi_i\psi_i\left|\tfrac{1}{r_{12}}\right|\psi_j\psi_j\right)-\left(\psi_i\psi_j\left|\tfrac{1}{r_{12}}\right|\psi_i\psi_j\right)\right]+V_{\text{nuc-nuc}}
=\sum_{qg}P_{qg}h_{qg}+\frac{1}{2}\sum_{qgcd}P_{qgcd}\,(qg||cd)+V_{\text{nuc-nuc}}
\tag{1}
\]

where $P_{qg}$ and $P_{qgcd}$ are the one- and two-particle density matrices, $h_{qg}$ and $(qg||cd)$ are the one- and two-electron integrals, and $V_{\text{nuc-nuc}}$ is the nuclear–nuclear repulsion. The first derivative of the energy with respect to a nuclear coordinate $A_i$ is, therefore,

\[
\frac{\partial E}{\partial A_i}
=\left\langle\Psi\left|\frac{\partial H}{\partial A_i}\right|\Psi\right\rangle
+2\left\langle\frac{\partial\Psi}{\partial A_i}\Big|H\Big|\Psi\right\rangle.
\tag{2}
\]

The first term of this equation, which results from differentiating the one-electron operator, is the Hellmann–Feynman force and can be calculated as an expectation value of this operator. The second term is a correction to the Hellmann–Feynman force that is computed by differentiating the basis functions. It is referred to as the wave function force (or Pulay force)[26] and is evaluated by the one- and two-electron integral derivatives.[27,28] In the event of an exact wave function the second term disappears. However, numerical computations using Gaussian basis functions are not exact representations of the molecular wave function and, therefore, the second term makes a significant contribution. Computing the Pulay force is the most computationally demanding step of a gradient calculation. A computationally practical form of the gradient expression is gained by expanding the above equation in terms of derivatives with respect to the nuclear coordinates as well as the coordinates of the basis functions, $\chi_q$, that is,

\[
\frac{\partial E}{\partial A_i}
=\sum_{qg}P_{qg}\frac{\partial h_{qg}}{\partial A_i}
+\frac{1}{2}\sum_{qgcd}P_{qgcd}\frac{\partial(qg||cd)}{\partial A_i}
+\frac{\partial V_{\text{nuc-nuc}}}{\partial A_i}
-2\sum_{qg}P'_{qg}\frac{\partial S_{qg}}{\partial A_i}
\tag{3}
\]

where the one-electron integral derivative is expressed as a function of the nuclear attraction, $V_{qg}$, and the kinetic energy, $T_{qg}$, giving

\[
\frac{\partial h_{qg}}{\partial A_i}
=\frac{\partial V_{qg}}{\partial A_i}
+\sum_c\left(\frac{\partial V_{qg}}{\partial \chi_c}+\frac{\partial T_{qg}}{\partial \chi_c}\right).
\]

The next sections expand on the two nuclear attraction terms above and the two-electron integral derivatives, as massively parallel algorithms have been developed for them. These are later shown to achieve significant acceleration on GPUs compared with mature CPU algorithms currently in use in legacy codes.
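To make the bookkeeping of eq. (3) concrete, the following is a minimal sketch of the gradient assembly for one nuclear coordinate. All matrices and values here are hypothetical 2x2 stand-ins, not real integrals, and the two-electron contraction is collapsed into a single precomputed number:

```python
# Toy assembly of eq. (3) (hypothetical values, not real integral derivatives):
# dE/dA = sum(P * dh) + 0.5 * sum(P2 * dG) + dVnn - 2 * sum(P' * dS)
P = [[1.2, 0.3], [0.3, 0.8]]     # one-particle density matrix
dh = [[0.05, -0.01], [-0.01, 0.02]]  # one-electron integral derivatives
W = [[0.9, 0.1], [0.1, 0.6]]     # energy-weighted density matrix P'
dS = [[0.0, 0.004], [0.004, 0.0]]    # overlap derivatives
dVnn = -0.07                     # nuclear-nuclear repulsion derivative
two_e = 0.013                    # stand-in for 0.5 * sum(P2 * d(qg||cd))

def dot(A, B):
    # Element-wise contraction of two matrices
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

dE = dot(P, dh) + two_e + dVnn - 2 * dot(W, dS)
print(dE)
```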

McMurchie–Davidson Algorithm

We discussed our choice of the McMurchie–Davidson (MD) algorithm[29] for evaluating one- and two-electron Gaussian integrals in the presentation of the hyper-parallel HF


routines.[2] Here, we show how to evaluate the integral derivatives using the MD approach. In the MD scheme, the product of two Gaussians can be expanded in terms of Hermite Gaussian functions as shown in equation 32 of the accompanying paper. An alternative way of expressing Hermite Gaussian functions is to define



\[
\Lambda_j\!\left(x_{1P};\alpha_{\lambda\mu}\right)\exp\!\left(-\alpha_{\lambda\mu}x_{1P}^2\right)
=\left(\frac{\partial}{\partial P_x}\right)^{\!j}\exp\!\left(-\alpha_{\lambda\mu}x_{1P}^2\right).
\]

The advantage of using this form of the Hermite Gaussian functions is that the integral and partial derivative operators can be interchanged under certain conditions using the Leibniz integration rule.[30,31] This alludes to the MD method being ideal for the calculation of the derivatives of the integrals over nuclear coordinates. Previously, we introduced Cartesian Gaussians or basis functions centered on an atom $A=(A_x,A_y,A_z)$ with $\mathbf{r}_1=(x_1,y_1,z_1)$ and normalization constant $N_{nlm}$ explicitly defined as

\[
\phi_\lambda\!\left(n,l,m,\alpha_\lambda,\mathbf{r}_{1A}\right)
=N_{nlm}\,(x_1-A_x)^n\,(y_1-A_y)^l\,(z_1-A_z)^m\exp\!\left(-\alpha_\lambda r_{1A}^2\right).
\]

An advantage of the Hermite form of the Gaussian functions is that the evaluation of the derivatives is similar to the evaluation of higher angular momentum integrals. Differentiation of a Cartesian Gaussian with respect to the x-component of center A gives a linear combination of Cartesian Gaussians with lower and higher angular momentum quantum numbers, that is (ignoring normalization),

\[
\frac{\partial\phi_\lambda}{\partial A_x}
=\frac{\partial}{\partial A_x}\left[(x_1-A_x)^n(y_1-A_y)^l(z_1-A_z)^m\exp\!\left(-\alpha_\lambda r_{1A}^2\right)\right]
=\left[-n(x_1-A_x)^{n-1}+2\alpha_\lambda(x_1-A_x)^{n+1}\right](y_1-A_y)^l(z_1-A_z)^m\exp\!\left(-\alpha_\lambda r_{1A}^2\right).
\tag{4}
\]

Two-electron integral derivative

Using the expansion in eq. (4), the first derivative of the two-electron integral can be written in terms of the auxiliary functions, expansion coefficients, and other variables defined before[2] as

\[
\frac{\partial}{\partial A_x}[\lambda\mu|\kappa\rho]
=c_{\lambda\mu}c_{\kappa\rho}E_{\lambda\mu}E_{\kappa\rho}\,
\frac{2\pi^{5/2}}{\alpha_{\lambda\mu}\alpha_{\kappa\rho}\sqrt{\alpha_{\lambda\mu}+\alpha_{\kappa\rho}}}
\sum_{(N+1)LM}\left(-n\,d_N^{n-1,n}+2\alpha_\lambda\,d_N^{n+1,n}\right)e_L^{l,l}f_M^{m,m}
\sum_{N'L'M'}d_{N'}^{n',n}\,e_{L'}^{l',l}\,f_{M'}^{m',m}\,(-1)^{N'+L'+M'}\,R_{N+N',L+L',M+M',0}.
\tag{5}
\]

The first derivative of $[ss|ss]$ with respect to the first center is

\[
c_{\lambda\mu}c_{\kappa\rho}\frac{\partial}{\partial A_x}[ss|ss]
=c_{\lambda\mu}c_{\kappa\rho}\,2\alpha_\lambda E_{\lambda\mu}E_{\kappa\rho}\,
\frac{2\pi^{5/2}}{\alpha_{\lambda\mu}\alpha_{\kappa\rho}\sqrt{\alpha_{\lambda\mu}+\alpha_{\kappa\rho}}}
\left[\overline{PA}_x\,R_{0,0,0,0}+\left(2\alpha_{\lambda\mu}\right)^{-1}R_{1,0,0,0}\right]
\tag{6}
\]

where $R_{N,L,M,0}$ are the auxiliary functions previously defined. Moreover, the auxiliary functions are related by the same recursion relationship as shown previously.[2]

Nuclear attraction integral derivative

We previously expressed the nuclear attraction integral, $V_{qg}$, in terms of the MD formalism.[2] The first derivative expression for the nuclear attraction integral is

\[
c_{\lambda\mu}\left(\lambda\left|\frac{\partial}{\partial A_x}\frac{Z_C}{r_{1C}}\right|\mu\right)
=c_{\lambda\mu}E_{\lambda\mu}Z_C\sum_{(N+1)LM}\left(-n\,d_N^{n-1,n}+2\alpha_\lambda\,d_N^{n+1,n}\right)e_L^{l,l}f_M^{m,m}
\int\Lambda_N(x_{1P})\Lambda_L(y_{1P})\Lambda_M(z_{1P})\exp\!\left(-\alpha_{\lambda\mu}r_{1P}^2\right)\frac{1}{r_{1C}}\,d\mathbf{r}_1
\]
\[
=c_{\lambda\mu}E_{\lambda\mu}Z_C\,\frac{2\pi}{\alpha_{\lambda\mu}}
\sum_{(N+1)LM}\left(-n\,d_N^{n-1,n}+2\alpha_\lambda\,d_N^{n+1,n}\right)e_L^{l,l}f_M^{m,m}\,R_{N,L,M,0}.
\tag{7}
\]

The first derivative of $\left(s\left|\frac{Z_C}{r_{1C}}\right|s\right)$ with respect to the first center is

\[
c_{\lambda\mu}\left(s\left|\frac{\partial}{\partial A_x}\frac{Z_C}{r_{1C}}\right|s\right)
=c_{\lambda\mu}\,2\alpha_\lambda E_{\lambda\mu}Z_C\,\frac{2\pi}{\alpha_{\lambda\mu}}
\left[\overline{PA}_x\,R_{0,0,0,0}+\left(2\alpha_{\lambda\mu}\right)^{-1}R_{1,0,0,0}\right].
\]

Hellmann–Feynman force

Feynman[25] showed that the force on a nucleus is calculated as the charge on the nucleus times the electric field at the nucleus, where the electric field is the gradient of the electric potential. The x-component of the electric field integral is then

\[
F_{qg}=\int\chi_q(\mathbf{r}_1)\chi_g(\mathbf{r}_1)\,x_{1C}\,r_{1C}^{-3}\,d\mathbf{r}_1
\]

which is easily found by differentiating the nuclear attraction operator as shown by McMurchie and Davidson, that is,

\[
\frac{\partial}{\partial C_x}\frac{1}{r_{1C}}=x_{1C}\,r_{1C}^{-3}
\]

giving

\[
c_{\lambda\mu}\left(\lambda\left|\frac{x_{1C}Z_C}{r_{1C}^3}\right|\mu\right)
=-\,c_{\lambda\mu}\,\frac{2\pi}{\alpha_{\lambda\mu}}E_{\lambda\mu}Z_C
\sum_{(N+1)LM}d_N^{n,n}\,e_L^{l,l}\,f_M^{m,m}\,R_{N+1,L,M,0}.
\tag{8}
\]

The Hellmann–Feynman force for $\left(s\left|\frac{x_{1C}Z_C}{r_{1C}^3}\right|s\right)$ with respect to the nuclear coordinate is given by

\[
c_{\lambda\mu}\left(s\left|\frac{x_{1C}Z_C}{r_{1C}^3}\right|s\right)
=-\,c_{\lambda\mu}\,\frac{2\pi}{\alpha_{\lambda\mu}}E_{\lambda\mu}Z_C\,R_{1,0,0,0}.
\]
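Equation (4) is easy to verify numerically. The following sketch checks the analytic derivative of a one-dimensional Cartesian Gaussian factor against a central finite difference; the exponent, quantum number, and evaluation point are toy values, and normalization is ignored:

```python
import math

# Check of eq. (4) in 1D: d/dAx of (x-Ax)^n exp(-a (x-Ax)^2) equals
# [-n (x-Ax)^(n-1) + 2a (x-Ax)^(n+1)] exp(-a (x-Ax)^2).
n, a = 2, 0.7          # toy angular momentum quantum number and exponent
x, Ax = 0.9, 0.25      # toy evaluation point and center coordinate

def phi(Ax):
    return (x - Ax) ** n * math.exp(-a * (x - Ax) ** 2)

analytic = (-n * (x - Ax) ** (n - 1)
            + 2 * a * (x - Ax) ** (n + 1)) * math.exp(-a * (x - Ax) ** 2)

h = 1e-6
numeric = (phi(Ax + h) - phi(Ax - h)) / (2 * h)  # central finite difference
print(analytic, numeric)
```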

GPU Algorithms

Computational approximations

In the discussion that follows, we will refer to the integral (qg|cd) as the combination of a bra and ket, that is, (qg| and


Figure 2. GPU implementation of the two-electron integral derivatives, where N = (x, y, z) and Ipn is the pair index.

|cd) pair. The evaluation of the gradient of the two-electron integral with respect to nuclear coordinates involves the calculation of 12 derivatives with respect to four centers, as each center is differentiated with respect to each nuclear coordinate (x, y, z). The computational effort for the calculation of the first derivatives formally scales as O(N⁴), where N is the number of basis functions.[32] The scaling can be reduced to O(N²) using prescreening and translational invariance schemes. The screening test used here is based on the principle that the overlap of two basis functions becomes negligible with increasing distance between them. It is, therefore, related to the size of the pre-exponential factor. In addition to screening, most CPU packages make use of translational invariance to reduce the number of significant derivatives to at most nine. The translational property results in the sum of the components of the gradient being zero,[33] that is,

\[
\frac{\partial}{\partial A_i}(qg|cd)+\frac{\partial}{\partial B_i}(qg|cd)+\frac{\partial}{\partial C_i}(qg|cd)+\frac{\partial}{\partial D_i}(qg|cd)=0.
\]

Following this, the calculation of only three derivatives for an integral where all the centers are unique is necessary, as the fourth is calculated from the negative sum of the other three. For integrals where some of the centers are equal, this number can be reduced even further. While we take advantage of the translational invariance property in the CPU computations, this approximation is not used in the GPU implementation for reasons that will be explained below.
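The translational invariance property can be checked numerically for the simplest case: a primitive [ss|ss] integral has the standard closed form 2π^(5/2)/(pq√(p+q)) E_AB E_CD F₀(ρ|P−Q|²), and central finite differences of its four x-derivatives should sum to zero. A self-contained sketch with assumed toy centers and exponents:

```python
import math

def boys0(t):
    # Boys function F_0(t) = (1/2) sqrt(pi/t) erf(sqrt(t)), small-t limit 1
    if t < 1e-12:
        return 1.0 - t / 3.0
    return 0.5 * math.sqrt(math.pi / t) * math.erf(math.sqrt(t))

def ssss(A, B, C, D, a, b, c, d):
    # Primitive [ss|ss] integral over four unnormalized s Gaussians
    p, q = a + b, c + d
    P = [(a * A[i] + b * B[i]) / p for i in range(3)]
    Q = [(c * C[i] + d * D[i]) / q for i in range(3)]
    AB2 = sum((A[i] - B[i]) ** 2 for i in range(3))
    CD2 = sum((C[i] - D[i]) ** 2 for i in range(3))
    PQ2 = sum((P[i] - Q[i]) ** 2 for i in range(3))
    E = math.exp(-a * b / p * AB2) * math.exp(-c * d / q * CD2)
    rho = p * q / (p + q)
    return 2 * math.pi ** 2.5 / (p * q * math.sqrt(p + q)) * E * boys0(rho * PQ2)

def grad_x(centers, exps, i, h=1e-6):
    # Central finite difference w.r.t. the x-coordinate of center i
    plus = [list(c) for c in centers]; plus[i][0] += h
    minus = [list(c) for c in centers]; minus[i][0] -= h
    return (ssss(*plus, *exps) - ssss(*minus, *exps)) / (2 * h)

centers = [[0.0, 0.0, 0.0], [0.5, 0.3, -0.2], [1.1, -0.4, 0.6], [-0.3, 0.8, 0.9]]
exps = (1.3, 0.8, 2.1, 0.5)
g = [grad_x(centers, exps, i) for i in range(4)]
print(sum(g))  # ~0 by translational invariance
```

Any one of the four center derivatives can thus be recovered as the negative sum of the other three, which is exactly the reduction exploited on the CPU.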

Two-electron integral derivative

The QSL's two-electron integral derivative implementation is similar to the 1T1B (1 Thread 1 Batch) scheme reported by the Martínez group.[21,34–36] A major difference is that the derivatives are summed into the atoms and are not needed in the form of a Fock matrix. This means that the Coulomb and exchange contributions do not need to be separated. Further differences include the points in the algorithm where the

primitive to contracted integral contraction is implemented and where the Hermite to Cartesian Gaussian contractions occur. The gradient calculation is performed once on completion of the direct SCF procedure described previously. While the QSL can be used with p- or l-shells, only the l-shell implementation is shown here to maintain a concise description of the parallel design. On the host, we start by generating contracted shell pair lists based on angular momentum, that is, ss, sl, and ll. These pair lists are formed by taking the qg ↔ gq symmetry into account, giving Nx(Nx+1)/2 ss and ll pairs and NxNy sl pairs, where Nx is the number of x-shells. Furthermore, for the expansion to primitives, the contraction symmetry is taken into account when q = g. The ket pairs are exactly the same as the bra pairs, and therefore, we only generate one set of pairs that are used for both. The contracted shell pairs are screened using a prefactor test to give the significant contracted shell pairs. The variables $2\pi^{5/2}\alpha_{\lambda\mu}^{-1}E_{\lambda\mu}c_{\lambda\mu}$, P, $\overline{PA}$, and $\overline{PB}$ are precalculated. An efficient approach is to calculate integral derivatives in classes based on angular momentum. This is because the integrals in a class are evaluated by similar equations and use the same variables. In this initial release nine kernels, namely ∇[ss|ss], ∇[ss|sl], ∇[sl|ss], ∇[sl|sl], ∇[ss|ll], ∇[ll|ss], ∇[sl|ll], ∇[ll|sl], and ∇[ll|ll], were coded. In the GPU QSL design, we used a one-dimensional (1D) grid that has the size of the number of significant contracted shell pairs, as shown in Figure 2 below. A hypothetical system of 3 shells is used to illustrate the concept based on the 6-31G basis set with sextuply, triply, and singly contracted shells. In a strategy similar to the Coulomb integrals, each contracted shell pair is mapped onto a thread block. Within each block the bra loop runs over all the primitive pairs of a specific contracted pair sequentially, and the ket loop runs over all the primitive pairs in the system and is mapped onto the threads within a block. This means that each thread comprises the calculation of multiple primitive integral derivatives. The ordering of the grid is similar to the integral evaluation described in the
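The host-side pair-list construction described above can be sketched as follows. The shell list (types, contraction degrees, and center coordinates) and the screening threshold are hypothetical, mimicking the 3-shell 6-31G illustration:

```python
import math
from itertools import combinations_with_replacement

# Hypothetical 3-shell system: (type, contraction degree, center x-coordinate)
shells = [("s", 6, 0.0), ("l", 3, 0.0), ("l", 1, 1.4)]
ANGULAR_ORDER = {"s": 0, "l": 1}

def pair_class(ti, tj):
    # Canonical class label: ss, sl, or ll
    return "".join(sorted(ti + tj, key=ANGULAR_ORDER.get))

def contracted_pairs(shells):
    """Bra shell-pair lists by angular momentum class, using the qg <-> gq
    symmetry: N(N+1)/2 same-type pairs and Nx*Ny mixed pairs."""
    pairs = {"ss": [], "sl": [], "ll": []}
    for i, j in combinations_with_replacement(range(len(shells)), 2):
        pairs[pair_class(shells[i][0], shells[j][0])].append((i, j))
    return pairs

def prefactor_screen(pairs, shells, mu=1.0, tol=1e-10):
    # Toy prefactor test: keep pairs whose pre-exponential factor
    # exp(-mu * |A-B|^2), with an assumed reduced exponent mu, is significant
    kept = {}
    for cls, plist in pairs.items():
        kept[cls] = [(i, j) for i, j in plist
                     if math.exp(-mu * (shells[i][2] - shells[j][2]) ** 2) > tol]
    return kept

pairs = prefactor_screen(contracted_pairs(shells), shells)
print({cls: len(p) for cls, p in pairs.items()})  # {'ss': 1, 'sl': 2, 'll': 3}
```

The same list serves as both bra and ket, as noted above; only the per-pair precalculated variables differ between the kernels.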


Figure 3. GPU implementation of the Hellmann–Feynman (left) and nuclear attraction (right) contributions to the gradient, where N = (x, y, z) and INuc is the nuclear index.

associated publication, with the exception that a pre-exponential factor value (instead of a Schwartz value) is used to preserve calculations that will make a significant contribution. The kernel contains equations to calculate the derivatives of the first and second centers (q and g) in each direction. The number of equations provides the abstraction to basis functions. Centers q and g remain the same within a block (e.g., in block 2 center q is always shell 2 and center g is always shell 1). Therefore, a sum reduction can be performed within each block (i.e., IntSum11 for block 1) and the atom information that is needed later can be stored. There is an IntSum value for each of the two centers in each direction, but only one is shown in Figure 2 for clarity as they are all summed in the same way. The same cannot be done for centers 3 and 4 (c and d), as the integrals correspond to different shells and summing them would discard the atom information. Because of the permutation symmetry of the two-electron integrals, only the unique integrals corresponding to the upper triangle (shown in red) need to be computed, accompanied by the doubling of the off-diagonal elements. However, we note that centers 1 and 2 in (11|21) are equal to centers 3 and 4 in (21|11), as shown in blue. Therefore, if we use the complete grid (almost double the number of integrals we need) we can also calculate centers 3 and 4 easily. The integrals are combined with the relevant density matrix element within the kernel. Following this, the IntSum values are copied back to the CPU and summed into the atoms as shown in Figure 2. Implementation of translational invariance to reduce the number of derivatives was avoided, as this would require the derivative values for the first three centers to be within the same block.

Nuclear attraction integral derivative

The generation and screening of contracted shell pairs are performed in exactly the same way as for the two-electron integral derivatives. As before, a 1D grid is used that is sized according to the number of significant contracted shell pairs


as shown in Figure 3 below. A difference is that the precalculated variable used is $2\pi\alpha_{\lambda\mu}^{-1}E_{\lambda\mu}c_{\lambda\mu}$. The bra loop (each contracted shell pair) is mapped onto a thread block and the ket loop (all the nuclei) is mapped onto the threads within a block. A sum reduction is performed within each block and the values copied back to the host. The derivatives are summed only into the QM atoms. In the case of a QM/MM CPU calculation (see Fig. 1b) this is a major computational bottleneck.

Hellmann–Feynman force

The hyper-parallel Hellmann–Feynman force algorithm uses the same buildup on the host as described in the nuclear attraction algorithm above. A 1D grid sized according to the number of nuclei is used, where the bra loop (each nucleus) is mapped onto the thread blocks and the ket loop (all the primitive shell pairs belonging to the contracted shell pairs) is mapped onto the threads within a block. Therefore, each block calculates the force on the nucleus by all the primitive shell pairs in the system (Fig. 3). A sum reduction is performed within each block, resulting in the derivative for each nucleus. These values are then copied back to the host.
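The grid/block mapping for the Hellmann–Feynman kernel can be emulated on the host with a toy per-pair contribution standing in for the real eq. (8) term. All nuclei, pair centers, weights, and the contribution function below are hypothetical:

```python
import math

# Toy host-side emulation of the Hellmann-Feynman kernel mapping: one block
# per nucleus (bra loop); threads within a block cover all primitive shell
# pairs (ket loop); a block-level sum reduction yields the force on that
# nucleus. The contribution function is a stand-in, not eq. (8).
nuclei = [((0.0, 0.0, 0.0), 8.0), ((1.8, 0.0, 0.0), 1.0)]       # (position, Z)
prim_pairs = [((0.4, 0.1, 0.0), 0.9), ((1.2, -0.2, 0.3), 0.5)]  # (P, weight)

def contribution(nucleus, pair):
    (C, Z), (P, w) = nucleus, pair
    r = math.dist(C, P)
    return Z * w * (P[0] - C[0]) / r ** 3   # toy x-component field term

forces_x = []
for nucleus in nuclei:                      # block index == nucleus index
    block_sum = sum(contribution(nucleus, p) for p in prim_pairs)  # reduction
    forces_x.append(block_sum)              # one derivative per nucleus

print(forces_x)
```

On the device the inner sum is a parallel shared-memory reduction over the threads of the block rather than a sequential loop.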

System Software and Hardware

The GAMESS-UK[17] package version 8.0 (June 2008, revision 6299) was used for all CPU computations. The QSL was linked to GAMESS-UK, where the HF-SCF and the one- and two-electron gradients were computed on the GPU. The CPU calculations were performed on a single core of an Intel Xeon E5620 at 2.4 GHz with a disk capable of I/O at 210 MiB/s, while the GPU calculations were performed on an Nvidia Tesla K20c card. The compute capability of the K20c GPU is 3.5; however, the QSL is compatible with cards of compute capability 2.0 and higher. The GPU code was developed using the PGI CUDA FORTRAN implementation. The CUDA 6.5 Toolkit along with cuBLAS is used for all SCF operations except the diagonalizations, for


Table 1. Times (s) for a static QM/MM CPU[a] calculation of cellobiose in water using different methods in GAMESS-UK.

Method     1-e⁻ integrals[b]   1-e⁻ derivs[b]   2-e⁻ derivs   Total[c]
DRK-D[d]   10.58               55.26            79.87         483.41
DRK-C[e]   10.56               57.72            93.63         233.40
PHS-D[f]   10.57               55.33            9.94          206.84
PHS-C[g]   10.58               57.58            10.66         128.83

[a] Intel Xeon E5620 @ 2.40 GHz; disk I/O at 210 MiB/s. [b] All 1-e⁻ contributions use Dupuis, Rys, King. [c] Total SCF + gradient calculation time for the QM/MM calculation. [d] Direct SCF; Dupuis, Rys, King. [e] Conventional SCF; Dupuis, Rys, King. [f] Direct SCF; integrals: Pople–Hehre, derivatives: Schlegel. [g] Conventional SCF; integrals: Pople–Hehre, derivatives: Schlegel.

which we use MAGMA version 1.5.0. Both the CPU and GPU codes were compiled with PGI version 14.10 and linked with Intel MKL version 11.1.2. The code was optimized using the following compiler optimization flags:

CPU: pgf90 -fast -O3
GPU: pgf90 -fast -O3 -Mcuda=Cuda6.5,cc35,llvm
MAGMA compilation: nvcc -O3 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35

Results and Discussion

Current optimal CPU implementation

Concerns of overstatement often arise in response to claims of massive speed-up on accelerators such as GPUs.[37] These concerns have some legitimacy, as GPU code developers may compare (1) timings of selected algorithms favoring GPU performance out of context of a complete computation as used by users of legacy packages and (2) with CPU algorithms that may not be optimally configured and compiled within legacy packages. We address the first concern by comparing the QSL's performance for ionic force calculations within a complete SCF computation and QM/MM simulation on CPU and GPU hardware platforms. Addressing the second concern, we now search for an optimal CPU algorithmic implementation using the optimal compiler optimization flags given above. The Dupuis, Rys, and King[38–40] (DRK) and MD quadrature methods are superior (speed and accuracy) for computing high angular momentum classes and have a low memory footprint. Moreover, the MD quadrature method has a lower floating point operation (FLOP) count.[41,42] For these reasons they (particularly the MD scheme) are suited to a GPU implementation. However, the GAMESS-UK CPU package does not have the option to evaluate the integrals using the MD algorithm to allow direct comparison to the QSL. Nonetheless, the default GAMESS-UK options choose an optimal integration algorithm based on angular momentum type. To establish the optimal CPU compute configuration we perform a complete SCF and gradient calculation on a static QM/MM model, that is, cellobiose (45 QM atoms and 251 basis functions) in a 20 Å water sphere (3267 point charges) using the 6-31G basis set. A comparison of all combinations of one-electron integral evaluation, integral derivative evaluation, and


SCF methods as used in an optimized GAMESS-UK CPU release was made (Table 1). The conventional SCF scheme is the default setting in GAMESS-UK and produces the best results (Table 1). Similar to the rotated-axis method of Pople and Hehre[43] (PH) used for the two-electron integrals, GAMESS-UK uses the Schlegel[44] method as the default to evaluate the two-electron integral derivatives for s- and l-type orbitals. The DRK method can be selected with the INTEGRAL HIGH keyword for all angular momentum types and provides marginally improved accuracy over the Schlegel method but is significantly slower (Table 1). Unlike the two-electron contributions, the one-electron integrals and derivatives are evaluated with the DRK quadrature for all angular momentum types. The fastest CPU times are gained from the combination of PH (for integral evaluation) and Schlegel (for derivatives) using a conventional SCF scheme, shown as method PHS-C in Table 1. We use this combination of algorithms when reporting CPU timings in comparison with GPU timings.

Performance of Accelerated Two-Electron Integral Derivatives

The performance of the QSL gradient computations is evaluated through a comparison of complete SCF + gradient calculations for a set of test molecules using the 6-31G basis set. The test molecules are glucose (C6H12O6), caffeine (C8N4H10O2), cellobiose (C12H22O11), cholesterol (C27H46O), cellotetraose (C24H42O21), taxol (C45NH49O15), buckyball (C60), cellooctaose (C48H82O41), valinomycin (C54N6H90O18), and olestra (C156H278O19). (Coordinates for the test set of molecules can be found in the supporting information.) We include carbohydrates comprising glucose monomers linked together, as computational glycobiology presents a special challenge to the accuracy gained from semiempirical methods. The rest of the set forms part of a growing standard used for benchmarking.[21,45] We compare the time taken for the gradient evaluation alone as well as the time taken for the total SCF + gradient evaluation on the GPU and CPU (Table 2). In both implementations, the SCF converges in the same number of steps. The speed-up for the two-electron integral derivatives averages around 8X for small molecules that have a high ratio of hydrogen to heavy atoms (e.g., glucose) and increases to 9.5X for a larger system such as olestra. The speed-up for the total SCF + gradient calculation increases approximately with system size. Besides the performance, the accuracy of a GPU calculation is of equal importance. Unlike in the SCF procedure, where a single measure of energy can be used to probe the accuracy, in a gradient calculation there are 3N derivative values to consider. In this case, the root mean squared error (RMS error) can be used to determine the accuracy, as has been shown by the Martínez group.[21] We used full double precision in our GPU implementation. There is no equivalent MD method implemented in GAMESS-UK with which to make a direct comparison.
We, therefore, compared the GPU-implemented MD with the CPU-implemented Schlegel and DRK quadrature methods


Table 2. Analytical energy gradient and total SCF + gradient calculation times (s) for a set of test molecules using a 6-31G basis set.

                Two-electron integral derivative time (s)   Total SCF + gradient time (s)     RMS difference
Molecule        CPU[a]    GPU[b]   Speed-up                 CPU        GPU      Speed-up      Schlegel    DRK
Glucose         2.36      0.29     8.14                     9.14       2.58     3.54          1.21E-5     2.11E-7
Caffeine        3.03      0.41     7.39                     12.63      3.94     3.21          1.94E-5     3.99E-7
Cellobiose      10.90     1.29     8.45                     64.95      9.68     6.71          2.53E-5     4.05E-7
Cholesterol     29.16     3.28     8.89                     173.80     27.31    6.36          3.67E-5     8.70E-7
Cellotetraose   45.50     5.76     7.90                     275.04     37.28    7.38          4.72E-5     5.63E-7
Taxol           98.05     13.55    7.24                     1312.83    110.67   11.86         7.17E-5     7.83E-7
Buckyball       131.29    22.49    5.84                     2160.31    116.86   18.49         5.07E-5     8.06E-6
Cellooctaose    182.03    23.85    7.63                     2166.79    133.57   16.22         8.02E-5     6.04E-7
Valinomycin     197.59    25.73    7.68                     5544.52    321.76   17.23         8.33E-5     5.82E-7
Olestra         733.01    78.95    9.28                     10892.32   647.55   16.82         1.16E-4     1.25E-6

[a] GAMESS-UK on Intel Xeon E5620 @ 2.40 GHz; disk I/O at 210 MiB/s. [b] Nvidia Tesla K20c card.

for the derivatives. We compare to the Schlegel method as it is the best performing, and to the DRK method because it is algorithmically more similar to the MD method used in the QSL. For the accuracy comparison the CPU and GPU gradient calculations were started from the same density matrix to estimate the error for the integral derivatives only. The RMS gradient differences are excellent, being only on the order of 5E-5 and 5E-7 for QSL versus Schlegel and QSL versus DRK, respectively, even though the methods are vastly different.
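The RMS metric used above is simply the root mean squared element-wise difference over the 3N gradient components. A sketch with hypothetical values standing in for CPU and GPU gradients:

```python
import math

# Two 3N-component gradient vectors (hypothetical stand-ins for the CPU and
# GPU results of the same calculation, here N = 2 atoms).
grad_cpu = [0.0123, -0.0045, 0.0311, -0.0207, 0.0009, -0.0150]
grad_gpu = [0.0123, -0.0045, 0.0312, -0.0207, 0.0010, -0.0150]

def rms_difference(g1, g2):
    # Root mean squared difference over all gradient components
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)) / len(g1))

rms = rms_difference(grad_cpu, grad_gpu)
print(rms)  # ~5.77e-05
```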

Optimization

We also performed a geometry optimization of glucose (24 atoms and 132 basis functions) on the CPU and GPU (Nvidia Tesla K20c) with the 6-31G basis set, starting from the same initial geometry. Both optimizations were performed with the DL-FIND[46] module built into the GAMESS-UK package. We used the default limited-memory Broyden–Fletcher–Goldfarb–Shanno[47–52] (LBFGS) optimizer recommended for a minimum search. A comparison of the SCF energy for the CPU and GPU runs along the course of the optimization is given in Figure 4a. It is clear that the two curves match each other very well. A closer look at the comparative accuracy of the GPU and CPU calculations (Fig. 4b) reveals that the difference is on the order of 1E-6 hartree; this equates to 0.0006 kcal/mol, which is well within chemical accuracy. This is an excellent comparison given that the CPU and GPU implementations use different integral/gradient evaluation methods and cutoffs, which contribute to this millionth-hartree difference in energy.

QM dynamics

We used the existing coupling between GAMESS-UK and CHARMM[53–55] to estimate the effect of the QSL library on molecular dynamics simulation speed-up. The QM dynamics speed-up is the same as for the static gradient and optimization calculations. This is because the two-electron integrals and corresponding derivatives remain the bottleneck in all three computations. Therefore, the speed-ups are not shown, as the same gradient calculation is performed at each step of the dynamics. Notwithstanding this, the simulation demonstrates the instant availability of fast ab initio dynamics simulations on desktop computers via a QSL plugin. As an example, 9.7 ps/day of cellobiose (45 QM atoms and 251 basis functions) QM dynamics using a 6-31G basis set with CHARMM/GAMESS-UK accelerated by the QSL can be achieved on a single Nvidia Tesla K20c GPU. (Further computational details about the simulation can be found in the supporting information.)

Figure 4. Progress of a CPU (blue) and GPU (red) glucose optimization: a) total SCF energy and b) difference in energies. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Performance of Accelerated QM/MM

Figure 5. QM/MM representation based on an electrostatic embedding scheme. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

The total energy of a QM/MM system is composed of a sum of the energy of the QM system, the MM system, and the interaction between the two. An accurate representation of the interaction between the classical and quantum partitions is core to successful QM/MM modeling. The electrostatic component of the nonbonded interaction is considered the most important factor determining the accuracy of the model. A common approach that delivers accurate electrostatic coupling between the QM charge density and MM point charge field is the use of electrostatic embedding. In the electrostatic embedding scheme, the MM point charge field polarizes the QM region. This is achieved by treating the point charges as one-electron terms in the QM Hamiltonian (Fig. 5).[56] Most QM/MM molecular dynamics simulations on large biomolecules are performed using a semiempirical QM method,[57,58] because semiempirical methods are several orders of magnitude faster than ab initio or DFT methods. However, we have found that there are significant limitations in the accuracy of semiempirical methods when modeling key chemical properties such as glycan ring puckering in transition states.[14] Despite these shortcomings, until now it has not been feasible at the laboratory level (i.e., on desktop computers) to use legacy codes such as GAMESS-UK to undertake ab initio QM/MM dynamics simulations for large molecules.
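The electrostatic embedding step can be sketched as follows: each MM point charge contributes a nuclear-attraction-like one-electron term to the core Hamiltonian. The matrices, charges, sign convention, and the stand-in attraction integral below are all hypothetical toys; in the QSL these terms are the accelerated one-electron integrals and derivatives discussed above:

```python
# Toy electrostatic embedding: H_core[q][g] gains a term for every MM point
# charge M, analogous to sum_M q_M * V_qg(M), where V_qg(M) is the attraction
# integral of basis pair (q, g) with a unit charge at MM site M.
nbf = 2
h_core = [[1.0, 0.2], [0.2, 1.5]]                               # toy core Hamiltonian
mm_charges = [(-0.834, (3.0, 0.0, 0.0)), (0.417, (3.8, 0.6, 0.0))]  # TIP3P-like (q, site)

def v_attraction(q, g, site):
    # Hypothetical stand-in for the real one-electron attraction integral;
    # symmetric in (q, g) and decaying with distance of the MM site
    x, y, z = site
    return 0.1 / (1.0 + x * x + y * y + z * z + q + g)

for q in range(nbf):
    for g in range(nbf):
        for charge, site in mm_charges:
            # toy sign convention: positive MM charge lowers the matrix element
            h_core[q][g] -= charge * v_attraction(q, g, site)

print(h_core)
```

Because each element gains one term per point charge, the cost of these one-electron contributions (and of their derivatives) grows linearly with the number of MM point charges, which is exactly the scaling behavior examined below.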

In practice, more accurate QM methods used within a QM/MM scheme have the calculation of a great number of two-electron integrals as a severe bottleneck. A second consideration, as shown in Figure 1b, is the one-electron integral derivatives, which consume by far the largest amount of computational resources because these computations depend on the number of point charges in the system. We performed (on a CPU and repeated on an Nvidia Tesla K20c GPU) four static QM/MM calculations of cellobiose (45 QM atoms and 251 basis functions) in water spheres of increasing size (20 Å, 25 Å, 31 Å, and 39 Å) using GAMESS-UK and the 6-31G basis set. The water spheres were represented as point charges in GAMESS-UK. Thus, all the other bottlenecks are kept constant and we only increase the number of MM point charges affecting the one-electron terms. In the CPU calculation, the cost of the one-electron integral derivatives increases linearly with the number of point charges (Table 3), while the GPU compute time grows more slowly; consequently the speed-up increases with sphere size for the systems studied here (Fig. 6). As the sphere size increases beyond 31 Å, the one-electron integrals also start making a significant contribution to the total calculation time, as shown in Table 3. (The one-electron integral algorithm was described previously.[2]) This suggests that for larger QM/MM systems accelerating only the two-electron contributions on the GPU will not deliver sufficient performance gain. To make ab initio QM/MM dynamics viable for regular laboratory use, the acceleration of the one-electron contributions is a necessity (Table 3). The speed-up gained from the one-electron derivative acceleration increases from 304 times for a 20 Å sphere to 364 times for a 39 Å sphere. Further, the computational cost on the GPU does not grow as fast with system size as on the CPU.
This comparison is made against the best algorithmic CPU combination (Table 1, method PHS-C). Recall the first concern we mentioned earlier: the overall computational speed-up is the only real concern of users applying modeling techniques in their research. With this in mind, we observe that the speed-up of the QM/MM calculation is 13 times for cellobiose in a 20 Å sphere and 34 times for cellobiose in a 39 Å sphere. The latter is comparable to the system sizes used to perform reaction dynamics simulations of, for example, a protein in water.

Table 3. Effect of the one-electron contributions on the computational cost of a QM/MM simulation.

                             1-electron integrals time (s)   1-electron derivatives time (s)   Total time[c] (s)
Sphere size   Point charges   CPU[a]   GPU[b]   Speed-up     CPU      GPU    Speed-up           CPU      GPU     Speed-up
20 Å          3267            10.58    0.11     96.18        57.68    0.19   303.58             128.73   9.96    12.92
25 Å          6522            20.85    0.15     139.00       114.47   0.34   336.68             196.17   10.52   18.65
31 Å          12429           39.56    0.21     188.38       216.70   0.63   343.97             318.21   12.05   26.41
39 Å          24843           78.66    0.35     224.74       444.59   1.22   364.42             589.37   17.32   34.03

[a] GAMESS-UK on Intel Xeon [email protected] GHz; disk I/O at 210 MiB/s. [b] Nvidia Tesla K20c card. [c] Total SCF + gradient calculation time for a QM/MM calculation.
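The speed-up columns of Table 3 follow directly from the CPU and GPU timings; the short Python sketch below (data transcribed from the table) re-derives them and exposes the scaling behavior discussed in the text.

```python
# One-electron derivative timings versus MM point-charge count (Table 3).
point_charges = [3267, 6522, 12429, 24843]
deriv_cpu = [57.68, 114.47, 216.70, 444.59]  # seconds, GAMESS-UK on CPU
deriv_gpu = [0.19, 0.34, 0.63, 1.22]         # seconds, Tesla K20c

# Speed-up is simply the ratio of CPU to GPU wall time per sphere size.
speedups = [c / g for c, g in zip(deriv_cpu, deriv_gpu)]
print([round(s) for s in speedups])  # [304, 337, 344, 364]

# CPU seconds per point charge stay nearly constant (linear scaling) ...
cpu_per_charge = [c / n for c, n in zip(deriv_cpu, point_charges)]
# ... while GPU seconds per point charge shrink as the card fills up,
# which is why the speed-up grows with sphere size.
gpu_per_charge = [g / n for g, n in zip(deriv_gpu, point_charges)]
```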


How to cite this article: C. A. Renison, K. D. Fernandes, K. J. Naidoo, J. Comput. Chem. 2015, 36, 1410–1419. DOI: 10.1002/jcc.23938


Figure 6. Effect of the point charges on the computational expense of the one-electron integral derivative. The bars show CPU (blue) and GPU (red) and the black line shows the total speed-up (right axis). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

QM/MM dynamics

We now assess the viability of using an ab initio method for system sizes approximating those regularly used in biomolecular reaction dynamics simulations. We obtained 6.4 ps/day of QM/MM dynamics for cellobiose in the 39 Å water sphere using CHARMM/GAMESS-UK/QSL on a Kepler K20c. Further computational details about the simulation can be found in the supporting information. We have also tested this on the current, more cost-effective, off-the-shelf GTX Titan Black GPU cards and achieve values closer to 10 ps/day. This points to the practical viability of QSL-accelerated ab initio QM/MM enzyme dynamics for routine laboratory-level computation. We elaborate on this in greater detail later.
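The ps/day throughput figures can be related to per-step wall time with simple arithmetic; a hedged Python sketch (our arithmetic, not from the paper, and assuming a 1 fs MD timestep — the actual timestep is given in the supporting information):

```python
# Convert per-step wall time into simulated picoseconds per day of dynamics.
def ps_per_day(seconds_per_step: float, timestep_fs: float = 1.0) -> float:
    steps_per_day = 86400.0 / seconds_per_step  # 86400 s in a day
    return steps_per_day * timestep_fs / 1000.0  # 1000 fs = 1 ps

# 6.4 ps/day at a 1 fs timestep implies roughly 13.5 s of wall time per step:
print(round(86400.0 / (6.4 * 1000.0), 1))  # 13.5
print(round(ps_per_day(13.5), 1))          # 6.4
```

Note that 13.5 s/step is consistent with the 17.32 s total SCF + gradient time reported for the 39 Å sphere in Table 3, once per-step variation and the classical MM work are taken into account.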

Conclusions

The routines responsible for the one- and two-electron integral derivatives have been massively parallelized and ported to GPUs. A CPU-optimized version of GAMESS-UK linked to the QSL produces speed-ups of 8X for medium to large molecules. When combined with the previous HF QSL routines, the speed-up is around 16X. Of greatest significance, however, is the speed-up of a QM/MM solution simulation comparable in size to a small protein and ligand in a water droplet. Here an Nvidia K20c delivers 6.4 ps of dynamics per day, making ab initio QM/MM simulations a realistic prospect for regular laboratory use.

Acknowledgments

C. Alicia Renison and Kyle D. Fernandes would like to thank SARChI for PhD fellowship support. The authors thank the Nvidia Professor Partnership Program (to KJN) for a generous donation of equipment.

Keywords: integral derivatives · graphical processing units · ionic force · Pulay force · code acceleration · GPU · GAMESS-UK · quantum supercharger library · QSL · QM/MM


Additional Supporting Information may be found in the online version of this article.

[1] B. M. Deb, Rev. Mod. Phys. 1973, 45, 22.
[2] K. D. Fernandes, C. A. Renison, K. J. Naidoo, J. Comput. Chem. 2015, DOI: 10.1002/jcc.23936.
[3] J. C. Slater, Quantum Theory of Molecules and Solids: Electronic Structure of Molecules; McGraw-Hill: New York, NY, 1963.
[4] K. J. Naidoo, Phys. Chem. Chem. Phys. 2012, 14, 9026.
[5] D. J. Wales, Science 2001, 293, 2067.
[6] B. Ensing, M. De Vivo, Z. W. Liu, P. Moore, M. L. Klein, Acc. Chem. Res. 2006, 39, 73.
[7] B. K. Carpenter, Acc. Chem. Res. 1992, 25, 520.
[8] P. R. Schreiner, H. P. Reisenauer, D. Ley, D. Gerbig, C.-H. Wu, W. D. Allen, Science 2011, 332, 1300.
[9] K. Naidoo, Sci. China Chem. 2011, 54, 1962.
[10] J. Chandrasekhar, S. F. Smith, W. L. Jorgensen, J. Am. Chem. Soc. 1984, 106, 3049.
[11] A. Warshel, M. Karplus, J. Am. Chem. Soc. 1972, 94, 5612.
[12] A. Warshel, M. Levitt, J. Mol. Biol. 1976, 103, 227.
[13] M. J. Field, P. A. Bash, M. Karplus, J. Comput. Chem. 1990, 11, 700.
[14] C. B. Barnett, K. J. Naidoo, J. Phys. Chem. B 2010, 114, 17142.
[15] K. K. Govender, K. J. Naidoo, J. Chem. Theory Comput. 2014, 10, 4708.
[16] K. Govender, J. Gao, K. J. Naidoo, J. Chem. Theory Comput. 2014, 10, 4694.
[17] M. F. Guest, I. J. Bush, H. J. J. van Dam, P. Sherwood, J. M. H. Thomas, J. H. van Lenthe, R. W. A. Havenith, J. Kendrick, Mol. Phys. 2005, 103, 719.
[18] M. S. Gordon, J. H. Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. Su, T. L. Windus, M. Dupuis, J. A. Montgomery, J. Comput. Chem. 1993, 14, 1347.
[19] M. S. Gordon, M. W. Schmidt, In Theory and Applications of Computational Chemistry: The First Forty Years; C. E. Dykstra, G. Frenking, K. S. Kim, G. E. Scuseria, Eds.; Elsevier: Amsterdam, 2005; pp. 1167–1189.
[20] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, W. A. de Jong, Comput. Phys. Commun. 2010, 181, 1477.
[21] I. S. Ufimtsev, T. J. Martínez, J. Chem. Theory Comput. 2009, 5, 2619.
[22] D. A. Case, T. A. Darden, T. E. Cheatham III, C. L. Simmerling, J. Wang, R. E. Duke, R. Luo, R. C. Walker, W. Zhang, K. M. Merz, B. Roberts, S. Hayik, A. Roitberg, G. Seabra, J. Swails, A. Götz, I. Kolossvary, K. F. Wong, F. Paesani, J. Vanicek, R. M. Wolf, J. Liu, X. Wu, S. R. Brozell, T. Steinbrecher, H. Gohlke, Q. Cai, X. Ye, J. Wang, M. J. Hsieh, G. Cui, D. R. Roe, D. H. Mathews, M. G. Seetin, R. Salomon-Ferrer, C. Sagui, V. Babin, T. Luchko, S. Gusarov, A. Kovalenko, P. A. Kollman, AMBER 12; University of California: San Francisco, 2012.
[23] C. M. Isborn, A. W. Götz, M. A. Clark, R. C. Walker, T. J. Martínez, J. Chem. Theory Comput. 2012, 8, 5092.
[24] H. Hellmann, Einführung in die Quantenchemie; Deuticke: Leipzig, 1937; p. 285.
[25] R. Feynman, Phys. Rev. 1939, 56, 340.
[26] P. Pulay, Mol. Phys. 1969, 17, 197.
[27] P. Pulay, Mol. Phys. 2002, 100, 57.
[28] H. B. Schlegel, Theor. Chem. Acc. 2000, 103, 294.
[29] L. E. McMurchie, E. R. Davidson, J. Comput. Phys. 1978, 26, 218.
[30] S. Reine, T. Helgaker, R. Lindh, Wiley Interdiscip. Rev. Comput. Mol. Sci. 2012, 2, 290.
[31] Z. B. Maksic, K. Kovacevic, M. Primorac, Pure Appl. Chem. 1989, 2075.
[32] H. B. Schlegel, M. Frisch, In Theoretical and Computational Models for Organic Chemistry; S. Formosinho, I. Csizmadia, L. Arnaut, Eds.; Springer: Netherlands, 1991; pp. 5–33.
[33] A. Komornicki, K. Ishida, K. Morokuma, R. Ditchfield, M. Conrad, Chem. Phys. Lett. 1977, 45, 595.
[34] I. S. Ufimtsev, T. J. Martínez, J. Chem. Theory Comput. 2008, 4, 222.
[35] I. S. Ufimtsev, T. J. Martínez, Comput. Sci. Eng. 2008, 10, 26.
[36] A. V. Titov, I. S. Ufimtsev, N. Luehr, T. J. Martínez, J. Chem. Theory Comput. 2013, 9, 213.


[37] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, P. Dubey, SIGARCH Comput. Archit. News 2010, 38, 451.
[38] M. Dupuis, J. Rys, H. F. King, J. Chem. Phys. 1976, 65, 111.
[39] J. Rys, M. Dupuis, H. F. King, J. Comput. Chem. 1983, 4, 154.
[40] H. F. King, M. Dupuis, J. Comput. Phys. 1976, 21, 144.
[41] M. Head-Gordon, J. A. Pople, J. Chem. Phys. 1988, 89, 5777.
[42] P. M. W. Gill, M. Head-Gordon, J. A. Pople, Int. J. Quantum Chem. 1989, 36, 269.
[43] J. A. Pople, W. J. Hehre, J. Comput. Phys. 1978, 27, 161.
[44] H. B. Schlegel, J. Chem. Phys. 1982, 77, 3676.
[45] Y. Miao, K. M. Merz, J. Chem. Theory Comput. 2012, 9, 965.
[46] J. Kästner, J. M. Carr, T. W. Keal, W. Thiel, A. Wander, P. Sherwood, J. Phys. Chem. A 2009, 113, 11856.
[47] C. G. Broyden, IMA J. Appl. Math. 1970, 6, 76.
[48] R. Fletcher, Comput. J. 1970, 13, 317.
[49] D. Goldfarb, Math. Comput. 1970, 24, 23.
[50] D. Shanno, Math. Comput. 1970, 24, 647.
[51] J. Nocedal, Math. Comput. 1980, 35, 773.
[52] D. Liu, J. Nocedal, Math. Program. 1989, 45, 503.

[53] B. R. Brooks, C. L. Brooks, A. D. Mackerell, L. Nilsson, R. J. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. R. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. W. Pastor, C. B. Post, J. Z. Pu, M. Schaefer, B. Tidor, R. M. Venable, H. L. Woodcock, X. Wu, W. Yang, D. M. York, M. Karplus, J. Comput. Chem. 2009, 30, 1545.
[54] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, M. Karplus, J. Comput. Chem. 1983, 4, 187.
[55] A. D. MacKerell, B. Brooks, C. L. Brooks, L. Nilsson, B. Roux, Y. Won, M. Karplus, In Encyclopedia of Computational Chemistry; John Wiley & Sons, Ltd: Chichester, England, 2002; pp. 271–277.
[56] H. M. Senn, W. Thiel, Angew. Chem. Int. Ed. 2009, 48, 1198.
[57] C. B. Barnett, K. A. Wilkinson, K. J. Naidoo, J. Am. Chem. Soc. 2011, 133, 19474.
[58] I. L. Rogers, K. J. Naidoo, J. Phys. Chem. B 2015, 119, 1192.

Received: 19 February 2015; Revised: 15 April 2015; Accepted: 16 April 2015; Published online: 14 May 2015


