Fuzzy cluster analysis of molecular dynamics trajectories.

PROTEINS Structure, Function, and Genetics 14:249-264 (1992)

Fuzzy Cluster Analysis of Molecular Dynamics Trajectories Heather L. Gordon and Rajmund L. Somorjai Institute for Biological Sciences, National Research Council of Canada, Ottawa, Ontario, Canada K I A OR6

ABSTRACT We propose fuzzy clustering as a method to analyze molecular dynamics (MD) trajectories, especially of proteins and polypeptides. A fuzzy cluster analysis locates classes of similar three-dimensional conformations explored during a molecular dynamics simulation. The method can be readily applied to results from both equilibrium and nonequilibrium simulations, with clustering on either global or local structural parameters. The potential of this technique is illustrated by results from fuzzy cluster analyses of trajectories from MD simulations of various fragments of human parathyroid hormone (PTH). For large molecules, it is more efficient to analyze the clustering of root-mean-squaredistances between conformations comprising the trajectory. We found that the results of the clustering analysis were unambiguous, in terms of the optimal number of clusters of conformations, for the majority of the trajectories examined. The conformation closest to the cluster center can be chosen as being representative of the class of structures making up the cluster, and can be further analyzed, for example, in terms of its secondary structure. The CPU time used by the cluster analysis was negligible compared to the MD Simulation time. 0 1992 Wiley-Liss, Inc. Key words: fuzzy clustering, molecular dynamics, simulations, nonequilibrium, protein conformation, parathyroid hormone (PTH) INTRODUCTION The use of molecular dynamics (MD) simulations to study proteins dates back to the pioneering paper of McCammon et al.,' and by now the technique is used extensively for theoretical calculations on biomolecules.' With the addition of a few special techniques, the development of MD algorithms for the study of biomacromolecules was relatively straightforward, following work established by MD studies of simple molecular liquids. However, the complexity of biomolecules is such that their structural descriptors are necessarily totally different from those used to characterize the structure of molecular liquids (e.g., radial and angular distribution functions). Consequently, the analytical tools with 0 1992 WILEY-LISS, INC.

which to interpret the vast quantities of data produced from MD simulations of biomolecules are far less developed than the algorithms themselves. There is a need to design new techniques to fully exploit the information content contained in MD trajectories. For example, recent work in this area includes an analysis that involves novel graphical presentation of the time evolution of conformational and helicoidal parameter^,^ and a method that describes the motion of secondary structure in

protein^.^ Molecular dynamics simulations can provide useful information on the conformations adopted by residues in polypeptides. However, structures that represent an average conformation of the molecule, such as those interpreted from thermodynamic experiments, can be computed only if the simulation .~ a has reached a steady-state e q ~ i l i b r i u m When steady state is attained, similar results are found for average conformations computed from independent simulations, each starting from a different initial polypeptide conformation. A long simulation will be required for the system to reach a steady state if the equilibrium structure is not readily accessible from the initial conformation. This is likely to be the case when no a priori knowledge of the equilibrium structure is available with which to help select the initial conformer. In terms of conventional statistical methods, meaningful analyses of average structures from MD simulations are restricted to those cases where the simulator is confident in the validity of a n analysis based on the assumption of a Gaussian (normal) distribution of the sample. When this assumption cannot be made, it would be desirable to have a technique, other than visual inspection, by which to quantify classes of conformations explored during the MD trajectory. This would be especially useful for large molecules, such as polypeptides, whose con-

Received August 30, 1991; revision accepted January 13, 1992. Address reprint requests to R.L. Somorjai, Senior Research Officer, Informatics, Institute for Biodiagnostics, National Research Council of Canada, 435 Ellice St., Winnipeg, Manitoba, Canada R3B 1Y6. H.L. Gordon is a Canadian Government Laboratory Visiting Fellow, 1990-1991,

250

H.L. GORDON AND R.L. SOMORJAI

formations are described by more than a few parameters. Moreover, conventional techniques that are used to analyze (nonequilibrium) MD trajectories of polypeptides, such as time courses of the values of dihedral angles, are local, not global, descriptors of structure. In this paper we demonstrate the successful application of a fuzzy clustering technique to analyze MD trajectories of various fragments of human parathyroid hormone. This clustering analysis of MD trajectories is applicable whether or not the system has attained equilibrium. Structural similarity among classes of conformers can be flexibly defined: similarity measures can range from the global, e.g., an all-atom comparison, to various local quantities, e.g., a restricted set of structural parameters, down to a single entity.

THEORY Clustering techniques partition a data set into identifiable subgroups or clusters. The measure of similarity among members of the same cluster is defined by some objective function. The concept of fuzzy sets was introduced by Zadeh? and was applied to cluster analysis by D ~ n n Gustafson ,~ and Kessel: Ruspini: and Bezdek et a1.lO-l’ Fuzzy clustering uses a “membership” function as a similarity index. Each data point is assigned membership values between 0 and 1to denote its degree of belonging to any and all of a predetermined number of clusters. Values close to 1indicate a high degree of cluster membership; values close to 0 indicate a low degree of membership. “Hard” clustering of the data corresponds to assigning binary membership values of either 1 or 0: each data point belongs unambiguously to one and only one of the clusters. The advantage of a fuzzy rather than a hard partitioning of data is that both outliers (aberrant data) and data that do reflect properties of more than one subgroup can be described by the allocation of nonzero membership values to several clusters. We present the basic equations for fuzzy clustering in this section. Algorithms are available elsewhere for fuzzy clustering based on the coordinates Of11-13 or on the distances among14 data points, and so will not be elaborated upon here. In the following, we denote a vector of points {x,} by a boldface lowercase letter (x) and a matrix of points {x”} by a boldface sans serif uppercase letter (X). An estimate of a true value, x, is denoted by a circumflex, 32. In order to illustrate the concepts discussed below, the results of a fuzzy cluster analysis of a simple twodimensional problem are provided in Appendix A. Consider a set of N data points Y where ykj is the j t h coordinate of the kth data point, j = 1, . . . ,M where each point yk is an M-dimensional vector. For example, an MD trajectory comprised of N configurations of a molecule having n atoms can be repreykZ, sented by an (N x 3n) matrix Y, where {bkl, Yk3), b k 4 , Y k 5 , Yk6), . . . bk 3n--29 Yk 3n-19 Y k 3n)) are 9

the Cartesian coordinates of the n atoms in the kth configuration. We want to optimally partition the data set into C clusters by determining a matrix of membership values U where u,,(y,) measures the degree of membership of the kth data point in the ith cluster. The additional constraints are that C

C i=

ULk =

1,

o 5 u , k 5 1,

for all k data points

1

(1) and

zl N

Uik

> 0,

for all i clusters.

(2)

Equation (1) is a normalization constraint: each data point must belong to a t least one of the C clusters. Equation (2) states that each cluster must be occupied by a t least one member. Different mathematical criteria can be used to optimally subdivide the data points into clusters and assign membership values uik (see, for example, refs. 9, 11, 15, and 16). The clustering criterion used here is (local) minimization of the generalized leastsquared error functional”

where &k

= b’k - Vilw = ( Y k - V J T A ( y k - Vi).

(4)

In Eqs. (3) and (41, m is a weighting constant, vi is a vector of coordinates that designates the center of the ith cluster; c& is the squared A-distance from the kth data point to the ith cluster center. The superscript T denotes the transpose of a vector or matrix. In Eq. (41, A is an (N x N) positive definite matrix chosen to control the shape of the optimal clusters.” In general, A is the inverse of the covariance matrix. Here, we choose A to be the identity matrix so that hyperspherical clusters are formed. The optimal partitioning of data into a predetermined number of clusters- is achieved by choosing estimates 8 ( = {G ,}I and U ( = {Liik}) of the optimal cluster centers and membership values that locally minimize J,. For m > 1, the necessary (but not sufficient) conditions are10-12*17 N

and C fiik =

1 / c (&k/djjk)2’(m-1),

lsk5N;

J=1

lsisc;

Y k f $ V k j .

(6)

In Eq. (5), Gu is the j t h coordinate of the ith cluster centre. Solutions for iri and dik are found via iterative

FUZZY CLUSTER ANALYSIS

the number of clusters C is predetermined; an initial (usually random) set of membership values U is assigned that satisfies the constraints of Eqs. (1) and (2); the coordinates of the cluster centers V computed via Eq. (5); a new matrix of membership values is then calculated from Eqs. (4) and (6). The last two steps in this process are repeated until self-consistency in U is attained and J , is minimized. From Eq. (5), we see that the data point having the largest membership in a cluster has the most weight in determining the coordinates of the cluster center iri. The weighting constant m, 15 m < a,controls the contributions of the dfk to J,. If m = 1, then the minimization of J , gives only hard partitioning of the data (u, = 0 or 1).As m+m, the minimization of J,,, produces uik = 1/C for all i and k. From Eqs. (1) and (2), the membership of a data point Y k is equally divided or shared among the C clusters if U i k = 1/c, for all l ~ i s CThus, . as the value of m increases, the local minima of J , that are found correspond to increasingly fuzzy clusters. The advantage of using m > 1 is that the fractional solutions U indicate the degree to which data points are shared by different clusters (as demonstrated in Appendix A). This is not possible with a hard partitioning of the data, where each data point is assigned to one, and only one of the identified clusters. The value of m also determines the solutions for the cluster centers. For m + 1 in Eq. (5), iri is the geometric center of the ith cluster. Each member of the ith cluster contributes equally toward the determination of iri. For m > 1,iri can be viewed as the center of “mass” of the ith cluster of points, where the “mass” of each point is proportional to its membership value to the ith cluster. Thus, although the cluster center is not one of the data points, it can be used to represent the class of structures that comprise the cluster. The actual value of m is chosen empirically; there is no a priori method for selecting an optimal value of m. As the value of m increases, so does the number of ways in which the data set can be partitioned among C clusters, with the estimates (U,V) producing increasing fuzzy (ill-defined) clusters. Typical values used are in the range 1 < m < 3. Stability of clusters with respect to small changes in m implies the validity of the clustering [The effect of changes in m on (U,V) is also shown in the sample problem of Appendix A.] One may think of m as a measure of the “temperature” of the M-dimensional swarm of data points. That is, data that consist of diffuse clusters can be interpreted as possessing a higher temperature (kinetic energy) than data that are tightly clustered. In the latter case, the integrity of identified clusters will be better conserved as m is increased. A data point yk can be preferentially assigned to the ith cluster if the value of its membership to that cluster, i&k, is greater than all other &,k, j # i. The

25 1

relative (or percent) “fuzziness” of a cluster can be determined by computing the ratio of numbers of “fuzzy” members to total members of a cluster, where of a “fuzzy” member is less than some arbitrary threshold value, Uthreshold. Here, we use Utheshold = (1+ C)/2C, which is midway between U i k = 1/12and 1 (i.e., completely fuzzy and completely hard). As mentioned previously, there is no unique solution for V and U that locally minimizes J , when m > 1. Examination of various cluster validity functional^^^^^^*'^^^^-^^ for different pairs of solutions V and U assists in the determination of an optimal (or sufficiently good) fuzzy partitioning of the data. Note that fuzzy clustering finds the optimal partitioning of data among a fixed, predetermined number of clusters C. Additional criteria must be examined in order to determine the optimal number of clusters into which a given data set can be divided. Here we use agreement among a number of functionals to select the optimal number of clusters of conformations in an MD trajectory. The functionals we used are the partition coefficient (F),11,12r21 entropy (H),11,12 nonfuzzy index (NFn,14and minimum and maximum hard tendencies (min-ht and mean-ht)” of the clusters. The equations for these validity functionals, and their significance are explained in Appendix B. Optimal partitioning of the sample data tends to maximize F, NFZ, min-ht, and mean-ht, and minimizes H and the percent fuzziness of the clusters. Consensus among the majority of these functionals is indicative of the most appropriate number of clusters into which the sample data can be divided. From the above discussion, it should be apparent that a fuzzy cluster analysis is not fully automated but necessarily involves some judgments on the part of the user. Also, minimization of Eq. (3) partitions data into two or more clusters; the fuzzy cluster analysis alone cannot discern whether a data set that forms a single cluster of uniform density is compact (high density) or diffuse (low density). In Appendix A, we demonstrate how these various considerations are used to judge the optimal number of clusters into which a sample data set may be partitioned. Since we generally compare polypeptide conformers within an intramolecular and not a space-fixed frame of reference, it is necessary to superimpose each conformer in a painvise fashion, prior to a cluster analysis. (Alternatively, any representation of conformers that depends only on translationally and rotationally invariant coordinates could be used, e.g., dihedral angles.) For N conformations of an n atom molecule, cluster analysis using the optimally superimposed atomic coordinates entails storage of an (N x 3n) matrix. For macromolecules where 3n > N , it is computationally more efficient to do clustering on the root-mean-square (rms) distance matrix D (or on D2).Here D is the symmetric (N x N )

252


TABLE I. Amino Acid Sequence of 1-34 PTH 1 Ser

Val

Ser

Glu

GlY

LYS

His

Glu

np

Leu

11

Leu 21 Val 31 Val

5 Ile 15 Leu 25 Arg

10

Gln

Leu

Met

His

Asn

Ser

Met

Glu

LYS

LYS

Leu

Gln

Asn 20

Arg 30 ASP

34

His

Asn

Phe

matrix whose element d , is the rms distance between optimally superimposed conformers i and j . Once D has been formed, the computational demands of the clustering problem depend only on the number of configurations in the MD trajectory and not on the number of atoms in the system. The elements of the rms distance matrix are substituted directly for d, in Eqs. (3) and (6), and Eq. (5) is modified

In Eq. (7), now is a scalar that represents the weighted average of the rms distances among members of the ith cluster. In the case of a cluster analysis of the distances, the coordinates of the cluster center are not directly available as they are when the sample coordinates are used in Eq. (5). Thus, for clustering of the rms distances, we choose the data point having the largest membership value in a cluster as being closest to the center of the cluster. Note that for a cluster analysis of the coordinates Y, D as defined by Eq. (4) must be updated for each new iterative solution to the cluster centers 8 and memberships U. Therefore a cluster analysis of the rms distance matrix is computationally less time consuming per iteration, since D is fixed, and only and U are updated.

+

CALCULATIONS Model System Molecular dynamics simulations were carried out on models of several fragments of human parathyroid hormone (PTH). Parathyroid hormone is a n 84 residue polypeptide that regulates the serum calcium concentration." To date, the exact structure that PTH assumes in solution is not known, although interpretations of results from spectroscopic studies suggest varying amounts of helical struct ~ r e , ' ~ - ' ~as do secondary structure predictions24,28,29 or thermodynamic argument^.'^ In the absence of specific information about the native structure of PTH, initial starting conformations for MD simulations must be chosen arbitrarily. We found that simulations of 2-3 nsec were of insufficient length to achieve a steady-state equilib-

rium even for the smallest fragment studied (20-34 PTH). Since computing average structures from these simulations was not valid, fuzzy clustering was used to identify sequences of structures explored during the MD trajectories.

Molecular Dynamics Simulations Molecular dynamics simulations of 1-34, 13-34, and 20-34 PTH were performed using the CHARMm p a ~ k a g e . ~ OTable . ~ l I lists the residues of 1-34 PTH. Standard protonation states at pH 7 were used, so that charges were + 1 for Lys and Arg and -1 for Asp and Glu; all other residues were uncharged. The zwitterionic forms of the polypeptides were used (NH; for the N-terminus and CO, for the C-terminus). All atoms, including hydrogens, were modeled explicitly. The total potential energy was taken as a sum of bonded (bond, angle, dihedral, and improper dihedral energies) and nonbonded interactions (electrostatic and van der Waals energies). No potential cutoff was applied, so that all nonbonded interactions were included in the calculation of the potential energy. The solvent was modeled as a dielectric continuum, that is, the electrostatic interactions were screened via a dielectric function ~ ( r=) &br, where r is the interatomic distance, and with E~ chosen to be 72 or 2. All MD simulations used the Verlet algorithm. The details of the MD simulations are given in Table 11. Simulations were initiated from various arbitrary starting conformations (helical, random, or extended), since no specific information was available about the equilibrium conformations of the examined PTH fragments. All simulations but one (run X) were at 300 K. For these room temperature runs, the time step was 1 fsec; covalent bonds to hydrogen were constrained by SHAKE to their equilibrium distance^^'.^^; data were collected over a period of 2-3 nsec for the 2034 and 13-34 fragments and 0.5 nsec for the 1-34 fragment. Prior to the data collection period, the systems were equilibrated at the desired temperature by scaling of the atom velocities. No velocity scaling was applied during the data collection period and the total energies of the systems did not drift. One calculation (run X) was performed to illustrate how fuzzy clustering categorized the trajectory of a high

253


TABLE 11. Details of Molecular Dynamics Simulations Run PTH 20-34** A B C D X PTH 13-34** E PTH 1-34** F G

Initial conformer*

62

Heating time* (psec)

Equilibration times (psec)

Helical-like" Randomized*** Extendedttt Randomized*** Extendedttt

78 78

-

150 350

78

3

400

2 78

-

20

Helical***

78

3

Helical*** Helical"'

78 2

0.6 0.6

Collection time** (psec)

Trajectory++

40

2000 3000 2000 2000 1000

1000/2 psec 1000/3 psec 1000/2 psec 1000/2 psec 1000/1 psec

20

2000

1000/2 psec

50

3.9 6.0

495.5 450

99U0.5 psec 990/0.5 psec

*Initial conformation of molecule for MD simulation. +Valueof cb for dielectric function used to screen electrostatic interactions (see text). *Runs C, E to G Kinetic energy added by assigning atom velocities corresponding to 5 K increments every 50 MD time steps until system a t 300 K. Run X Kinetic energy added by assigning atom velocities corresponding to 10 K increments every 50 MD time steps until system at 600 K. §Runs A to G: Atom velocities scaled every 50 steps if I average T - 300 K I > 10 K. Run X: Atom velocities scaled every 50 steps if 1 average T - 600 K 1 > 10 K. **No scaling of atom velocities. rtCompositionof trajectory file compiled during collection period for use in clustering analysis: in the form AIB where A is the number of configurations and B is the time interval between configurations. **RunsA to G were a t 300 K; MD time step of 1 fsec; covalent bonds to hydrogen constrained by SHAKE to equilibrium distances. Run X was at 600 K; MD time step of 2 fsec; all bonds constrained by SHAKE to equilibrium distances. @Conformerextracted from final conformation of run F. ***Conformer obtained after heating initial conformer of run A a t 2,000 K for 100 psec. +++Fully extended conformer, energy minimized to remove bad contacts. $**Fullya-helical conformer, energy minimized to remove bad contacts.

temperature simulation where a much broader range of configuration space is accessed. Simulation X was at 600 K, starting from an extended conformation. (We were not concerned with a realistic representation of the dielectric screening of water given this unrealistic temperature, and so chose a value of 78 for E,,.) In this one case, the MD time step was increased to 2 fsec, as all bond lengths were constrained by SHAKE to their equilibrium values. The data collection period was 1nsec after an equilibration period of 0.04 nsec.

Fuzzy Cluster Analysis For each of the simulations listed in Table 11, a condensed trajectory file was constructed, consisting of - 1,000 configurations taken at discrete, fixed time intervals over the entire length of the collection run (column 7, Table 11). A fuzzy cluster analysis of the rms distance matrix was chosen as requiring less computer time, since the total number of atomic coordinates (the minimum and maximum numbers of atoms were 284 in 20-34 PTH and 581 in 1-34 PTH) was comparable to the total number of conformations. The rms distance matrix was calculated using a n optimal superposition alg0rithm.3~ Two rms distance matrices were computed for each PTH fragment, using first a n all-atom and then a C,-atom comparison of conformers in the MD trajectory. The number of clusters varied from 2 to 6 or 10.

Our algorithm solved the clustering problem iteratively''; the initial values of the cluster memberships were assigned randomly, subject to the constraints of Eqs. (1)and (2). The value of the clustering parameter, m, was varied between 1.1and 1.3. We found that, for a given rms distance matrix, the same optimal partitioning of data occurred for the same optimal number of clusters given different random assignments of the initial values of the cluster memberships and small changes in the value of m. This gives us confidence in the robustness of our results. We chose consensus among the following cluster validity functionals as a criterion for optimal number of clusters: partition coefficient, entropy, NFI, minimum, and mean hard tendency. All these functionals agree on the optimal number of clusters when well-defined, well-separated clusters are present.

RESULTS The total CPU times taken for construction of the rms distance matrix using the optimal superposition a l g ~ r i t h m ?on ~ a n SGI 4D/280S were between 11to 14 min for (1000 x 1000) matrices of 15 to 34 C,atom conformers and between 55 to 106 min for similar size matrices of 284 to 581 all-atom conformers. These times can be substantially reduced, if definitions of the rms distance other than that used here are employed. The superposition algorithm34 opti-

254


mally aligns different conformers by minimizing the rms distance calculated from intermolecular distances between coordinates of identical atoms. A more rapidly computed distance matrix would result from comparison of intramolecular atom-atom distances, which are also dependent on the conformations of the polypeptide. The subsequent clustering of the rms distance matrix took varying amounts of time as the minimization of Eq. (3) is iterative and depends on the number of clusters C , the fuzziness of the data set, and the value of the clustering parameter m. The distance clustering algorithm consumed about 15 to 30 CPU min to find the optimal partitioning of a (1000 x 1000) distance matrix into {2,3, . . . ,lo} clusters. These times are trivial when compared to the amount of time taken to generate the MD simulations themselves (several hundred hours for each simulation in Table 11). Tables I11 to V and Figures 1 and 2 contain information about the optimal number of fuzzy clusters identified for each of the simulation trajectories, given analyses of both all-atom and C,-atom rms distance matrices. Between one to three clusters of conformations were identified for each simulation. Only one cluster was inferred if there was no clear consensus among the various cluster validity functionals as to the optimal number of clusters, if the optimal number of clusters varied with small changes in the clustering parameter m, or if the clustering procedure for all values of C gave only very fuzzy (% fuzziness > 25%) clusters for small values of m ( m 5 1.15).(Table VI contains values of the cluster validity functionals and percent fuzziness of the corresponding clusters as a function of cluster number for two representative MD trajectories, runs B and X.) The rms distance between the cluster center and the initial conformation, from which the simulation was started, can indicate if the system stayed in the region of the original configuration. The maximum rms distance within each cluster can be used as a measure of the maximum “diameter” or range of conformation space explored within each cluster. The mean rms distance from the cluster center to each of the members of the cluster is also a measure of the average range of conformation space encompassed by the cluster. As mentioned previously, the conformation with the largest membership value within a cluster was chosen as being representative of the center of the cluster. Figure 3 contains plots of the positions of the C, atoms of all these conformers. The optimal numbers of clusters for runs A, B, D, E, and F were unambiguous, as all (or all but one) cluster validity functionals were in agreement. (In order to illustrate how the optimal number of clusters were chosen, Table VI contains, for run B, the values of F, H , NFZ, mean-ht, min-ht, and percent fuzziness of clusters as the number of clusters C is increased from 2 to 6.)The identified clusters are not

very fuzzy (between 0.2 and 9% fuzziness for m = 1.25 or 1.30: column 7, Table 111). Figure 1 illustrates graphically the all-atom rms distances between the two or more cluster centers for each simulation. This shows that the cluster centers are well separated (rms distances > 4 A) and so represent different regions of conformation space explored by the polypeptide (compare plots of cluster centers labeled Runs A, B, D, E, and F: Figure 3). For runs A, B, D, E, and F, the fractions of the conformers in the all-atom and C,-atom clusters for these runs are almost identical (column 6, Table 111). Figure 2 shows that the partition of conformers among the clusters is almost identical for all-atom and C,-atom clusters. This implies that the conformations of the polypeptide can be adequately described solely from the positions of the backbone atoms, even for small polypeptides, which are anticipated to have a great deal of flexibility. The optimal superposition of conformers for construction of the rms distance matrix given the positions of all the atoms takes more computer time than a C,-atom comparison: 4.9 times longer for all 284 atoms versus 15 C, atoms of 20-34 PTH;7.3 times longer for all 581 atoms versus 34 C, atoms of 1-34 PTH. Therefore, a fuzzy clustering analysis based on the C,-atom rather than a n all-atom rms distance matrix is more efficient. (Other definitions of the distance matrix would not produce such sizable time differences between C,- and all-atom comparisonssee above.) Table IV contains the rms distances between C,and all-atom conformers identified as being closest to the cluster centers. For all simulations, with the exception of the high temperature run X, these cluster centers are within 2 A of each other. So even if the C,-atom and all-atom cluster centers are wellseparated in MD time they are very similar in conformation space (also see Fig. 3, where the superimposed all-atom and C,-atom cluster centers can be seen to be very similar, except for run X). For example, both C,- and all-atom cluster analyses partition the trajectory of simulation A into two clusters, and both analyses apportion almost identical conformers to each of these two clusters (see Fig. 2). The allatom cluster center for the first cluster is at 58 psec, whereas the conformer identified as the center of the first cluster by the C,-atom analysis is at 284 psec. These two conformers are well-separated in MD time along the trajectory, yet they are only 1.7 A apart. Figure 2 demonstrates that all conformers of the intervening trajectory are also assigned to the first cluster, so that major conformational transitions have not occurred during the time course between 58 and 284 psec. These results confirm that comparable clusters and cluster centers are obtained from fuzzy clustering analysis of the C,- and allatom rms distance matrices at “normal” temperatures.

255


TABLE 111. Results of Fuzzy Cluster Analysis of MD Simulations

Atoms* PTH 20-34 All

Runt

m*

A

1.25

c,

A

1.25

All

B

1.25

c,

B

1.25

All

-

All

C C D

1.25

c,

D

1.25

All

X X

-

E

1.30

E

1.30

F

1.30

c,

F

1.30

All

G G

-

c,

c, PTH 13-34 All

c, PTH 1-34 All

c,

-

-

Cluster center§ (psec)

u,**

Ltt

58 1956 284 1982 201 1083 2688 192 900 2718 914 714 694 1676 232 1684 92 1 22 1

0.995 0.996 0.999 0.999 0.998 0.994 0.985 0.999 1.000 0.993

74 1440 74 948 33 265.5 32 285 144.5 114

Fuzziness**

Maxrms cluster§§ center***

Rmsfrom initialttt

Maxrms overall’’*

(%)

(A,

(A,

(A,

(A)

8.0 2.9 8.7 3.9 1.4 2.3 8.4 1.4 1.2 5.3

-

7.9 6.7 5.4 5.0 6.5 5.9 5.8 4.3 4.1 4.2 5.5 4.0 5.5 4.4 3.4 2.7 11.0 9.5

2.8 2.5 1.6 1.5 2.6 2.0 2.6 1.5 1.0 1.6 2.1 1.0 2.5 1.9 1.2 1.0 6.8 4.4

2.5 5.9 2.9 6.0 5.3 7.7 7.1 5.9 7.5 7.0 12.9 12.9 5.4 6.7 5.9 6.7 12.4 13.6

7.9

-

0.62 0.38 0.61 0.39 0.14 0.53 0.33 0.14 0.52 0.34 1.00 1.00 0.64 0.36 0.65 0.35 1.00 1.00

0.997 0.999 0.999 1.000

0.15 0.85 0.14 0.86

5.4 1.1 3.5 0.8

8.7 7.8 6.6 6.2

3.3 2.9 2.1 1.9

2.6 9.2 2.6 8.7

10.1

0.997 0.999 0.999 1.000

0.23 0.77 0.23 0.77 1.00 1.00

2.6 0.4 2.6 0.3

9.4 7.3 8.2 6.0 5.9 4.7

3.8 3.1 3.0 2.2 2.1 1.5

6.8 12.6 7.0 12.6 2.4 2.5

13.8

-

0.985 0.999 0.997 1.000

-

-

0.2 0.3 0.8 2.3

-

5.6 8.4

5.5

5.5 4.0 5.9 4.2 11.0 9.5

9.0

12.9 5.9 4.7

*Results from clustering of all-atom or C,-atom rms distance matrix. ‘Simulation runs described in Table 11. *Weighting parameter for fuzzy clustering, Eq. (3). Value of MD time step of conformer closest to cluster center. For two or more clusters, the cluster center is the conformer having the largest membership uLkin a cluster. For one cluster, the cluster center is the conformer having the smallest sum of rms distances to all other conformers in the trajectory. **Value of membership of conformer closest to cluster center. ‘+Fraction of conformers in cluster. *$Percentfuzziness computed from ratio of numbers of “fuzzy” members to total members in a cluster, where a “fuzzy” member has hL&(1+C)/ZC, where C is the total number of clusters. *‘Maximum rms distance between members within a cluster. ***Mean rms distance between cluster center and all other members of cluster. ‘‘+Rms distance between initial conformation (see Table 11) and conformation closest to cluster center. ***Maximumrms distance between conformers over the entire trajectory.

Figure 2 illustrates that the clustering of the polypeptide conformations tends to be divided along time domains, and that these are of lengths on the order of a hundred to a thousand picoseconds. This makes sense intuitively, as transitions between relatively widely separated conformations are not likely to occur more frequently than this. It is important to note

that Figure 2 illustrates that our MD simulations a t 300 K are not fully equilibrated in a conformational sense, despite their being energetically equilibrated (no drifting of total energy was observed during the collection periods). If the simulations were long enough so that the polypeptide fragments fully sampled conformation space, we would expect either

256


TABLE IV. RMS Distances Between Cluster Centers Determined by All-Atom and C,-Atom Comparison

even less likely that simulation E of 13-34 PTH or the much shorter (-0.5 nsec) simulations F and G of 1-34 PTH attained equilibrium. For simulations A, D, E, and F, for which more than one cluster was identified, the rms distance between the cluster center and the initial conformation (column 10, Table 111) increases with the MD time step of the cluster center. This also implies that for these simulations, the system is still moving away from the starting configuration, and has not equilibrated. This trend is also noticed for the first and second cluster centers of simulation B. However, the second and third cluster centers, which are more than 1,500 psec apart, are both a comparable distance from the initial conformer (7-8 A), although they themselves are well separated (rms distance of 4.1 A). Figure 2 indicates that there is an oscillation between conformations associated with the second and third cluster in the latter part of the trajectory. Both these results may indicate that the 20-34 PTH has started to equilibrate after -2 nsec, but the simulation would need to be more than a nanosecond longer t o verify this. Run C was found to exhibit ambiguous clustering. The optimal number of clusters varied with small changes in m; low values of m ( m 5 1.15) had to be employed for the percent fuzziness of clusters to drop below 10%; different optimal number of clusters were found for all-atom and C,-atom rms distances. Also the rms distance between cluster centers was less than 3 A. Therefore, only one “cluster” is given in Table I11 for run C. In cases where only one “cluster” was found, the conformer that had the smallest mean rms distance from all other conformers in the trajectory was identified as being closest to the “center” of the conformation space explored during the simulation. The rms distance between the cluster center and the initial, extended conformer is much larger than the maximum rms distance over the tra-

Cluster center (psec) All atom Allatom C, atom rms (A) rms (A) ( 7 ,

Run FTH 20-34 A

B C D X

FTH 13-34 E PTH 1-34 F G

58 1956 20 1 1083 2688 914 694 1676 92 1

284 1982 192 900 2718 714 232 1684 221

1.7 1.1 1.5 1.7 1.9 1.3 2.1 1.1 7.7

0.6 0.5 0.8 0.5 1.0 0.5 0.8 0.4 3.6

74 1440

74 948

0.0 2.3

0.0 1.1

32 285 114

1.1 1.6 1.1

0.7 1.0 0.6

33 265.5 144.5

that cluster membership would be composed of groups of conformers from different time segments of the MD simulation, and/or that many more clusters would be found, covering a larger region of conformation space. In addition, rms distances between the cluster centers from independent MD simulations started from different initial conformations of 20-34 PTH are large (> 4 A: Table V). This means that different regions of conformation space were explored in runs A, B, C, and D for 20-34 PTH. Therefore, even after simulations of 2-3 nsec, equilibrium structures were not found for 20-34 PTH. Thus, it is

TABLE V. Rms Distances Between Cluster Centers Identified From Different Simulations of 20-34 PTH* Rudcluster center A 58 psec 1956 psec B 201 psec 1083 psec 2688 psec

B

A

C

D

58 psec

1956 psec

201 psec

1083 psec

2688 psec

914 psec

694 psec

1676 psec

-

5.5

5.5

-

5.5 4.5

6.6 7.1

6.0 7.3

6.9 7.1

6.3 6.2

6.9 6.0

5.5 6.6 6.0

4.5 7.1 7.3

-

5.9

5.9 7.2

-

7.2 4.1

4.1

-

7.1 7.9 7.8

5.0 6.8 8.4

6.8 7.0 8.1

914 psec

6.9

7.1

7.1

7.9

7.8

6.2

6.4

694 psec

6.3

6.2

5.0

6.8

8.4

6.2

-

4.6

6.9

6.0

6.8

7.0

8.1

6.4

4.6

C D

1676 psec

*Rms distances between conformers in

A,

computed after optimally superimposing all atoms.

-

257


Run A

Run D

Run E

Fig. 1. Rms distances between cluster centers. Results shown for two or more clusters of all-atom rms distance matrices (Table 111). Simulation runs are those described in Table II. Rms distances between the labeled cluster centers are in angstroms.

The area of each circle is proportionalto the fraction of conformers in the cluster. Overlap between circles represents “fuzzy” members shared by different clusters.

jectory of the collection period (12.9A versus 5.5 A, for all atoms). A significant distortion from the initial (unfavorable) extended conformation occurred during the 403 psec of heating and equilibration, relative to the narrower range of conformational change during the subsequent, longer 2-nsec collection period. The high temperature simulation X did, as expected, explore a larger region of conformation space than did the 300 K simulations of 20-34 PTH (maximum rms distance over the entire trajectory of run X was 11 versus 4 to 8 for runs A to D: column 11, Table 111). In contrast to simulation C, the conformational change from the initial extended configuration was not greater in the heating and equilibration periods than in the collection period-the rms distance between cluster center and initial conformer and maximum rms distance over the collection trajectory are comparable (12.4 and 11.0 A). The greater kinetic energy of simulation X as compared to simulation C enabled more extensive sampling of conformation space during the collection period. No significant clustering of conformers was found for this run since any clusters found at m

= 1.15 were more than -25% fuzzy. (Table VI contains values of the cluster validity functionals and percent fuzziness of clusters for run X, given 2 to 6 clusters.) Note that the all-atom and C,-atom cluster centers of run X are not only MD time separated (921 versus 221 ps) but are spatially separated as well (all-atom rms distance of 7.7 A: Table IV), as can be seen in Figure 3. Since these two structures are not similar, this implies that for run X, it is not sufficient to consider only backbone configurations when describing the conformation space explored by the polypeptide. In this case of (unrealistically) high kinetic energy, positions of the side chains as well as the polypeptide backbone are also important determiners of conformation. Both simulations of 1-34 PTH (runs F and G ) were started from a fully helical conformation. In the nonpolar solvent (E,, = 2), this helical configuration was very stable-throughout the entire 0.5 nsec simulation, the fragment remained in a n ahelix. Not surprisingly, only one cluster was found for run G . Computer animation of the trajectory showed the body of the a-helix moving with largescale, cyclical oscillations about the helical axis,

258


J/F/p II201 ps

1083 ps

/ /

Y

All-atom

Y

//

2680 ps

I

Ca-atom

Run A

2000 ps

Run B

3000 ps

E-

2000 DS

All-atom 1676 ps

Ca-atom Run D

2000 ps

.Run ..

0 PS

1-1 ' 0 PS

All-atom

Run F

495.5 ps

Fig. 2. Division of MD trajectories into clusters of conformations. Results shown for two or more clusters of all-atom and C,-atom rms distance matrices (Table Ill). Simulation runs are those described in Table II.The horizontal axes cover the MD time course of the collection periods. ( x ) marks cluster centers which are also labeled by MD time step. Vectors denote groupings of

conformers that are members of a single cluster. Cross-hatching indicates "fuzzy" members shared by different clusters. In run B, the "fuzzy" members shown in the latter part of the trajectory are shared by all-atom clusters having centers at 1,063and 2,688 psec, and by C,-atom clusters with centers at 900 and 2,718 psec.

somewhat like the bending of a spring. Thus, it makes sense that the rms distance between the cluster center and the initial, rigid cylindrical helix (2.4 A) is less than the maximum rms distance among conformations of the trajectory (5.9 A). The cluster center corresponds to a nearly perfect cylindrical configuration (see plot labeled Run G Fig. 3) about which the helix flexes, and the maximum rms distance corresponds to two conformers a t the opposite extrema of the oscillations. However, in the water~ 78), the full helical conformation like solvent ( E = was not stable. A bend defined by residues 11-19 formed. A large proportion (0.77) of the conformations in run F belong to the second cluster, whose center displays this bend and has a significant amount of helical structure surviving [see plot labeled Run F, (2): Fig. 31. The simulation results are consistent with results from circular dichroism experiments, in that a greater proportion of residues of

It is possible to further analyze the cluster centers, for example, we could examine their backbone torsion angles, orientations of side chains, et cet. We have performed an analysis of the secondary structure existing in the conformations chosen as being closest to the cluster centers, using QUANTA.35 These results have been reported elsewhere in conjunction with circular dichroism results and thermodynamic predictions of a-helices, and the interested reader is directed to ref. 27.

1-34 PTH are in a helical conformation for solvents

MD trajectory. The conformer closest t o the cluster

having low dielectric p e r m i t t i v i t i e ~ . ~ ~

center can be regarded as being representative of a

CONCLUSIONS Fuzzy cluster analysis has been applied to the rms distance matrices between conformers from MD simulations of various fragments of a small polypeptide. A variety of clustering validity functionals usually agreed unambiguously as to the optimal number of clusters of conformations (if present) comprising an

259


TABLE VI. Cluster Validity Functionals and Percent Fuzziness as a Function of the Number of Clusters for T w o MD Traiectories

c*

Ft

Run B, m = 2 0.83 3 0.91* 4 0.82 5 0.82 6 0.71

@

NFP min-ht** mean-httt 1 1.25, clustering analysis of all-atom rms distance matrix” 0.30 0.17* 0.34 0.35 0.56

0.66 0.87* 0.76 0.77 0.65

1.11 1.44 1.51 1.52* 1.32

0.98 1.24* 0.96 0.96 0.69

2.3 1.4 1.4 16.9 21.6

Run X, m = 1.15, Clustering analysis of all-atom rms distance matrix*** 2 0.73* 0.43* 0.45 0.63* 0.54* 42 3 0.65 0.61 0.48* 0.54 0.51 40 4 0.61 0.73 0.48* 0.57 0.49 35 5 0.56 0.86 0.45 0.56 0.43 44 6 0.50 0.99 0.40 0.60 0.39 52

2

Fuzziness of cluster**(%) 3 4 5

3.7 2.3 16.4 4.8 5.0

8.4 25.0 2.4 7.8

0.8 0.8 29.4

14.2 29.5

23 32 29 28 51

32 43 53 55

27 26 44

50 45

6

0.7

26

*Number of clusters. ‘Partition coefficient, Eq. (Bl). *Entropy, Eq. (B2). $Nonfuzzy index, Eq. (B3). **Minimum hard tendency, see Appendix B. “Mean hard tendency, see Appendix B. **Percentfuzziness computed from ratio of number of “fuzzy” members to total members in a clusters, where a “fuzzy” member has n,,$l +C)I2C. “Simulation run described in Table 11.The m is the weighting constant in Eq. (3). An asterisk (*) denotes maximum values ofF, NFI, min-ht, and mean-ht, and minimum value of H. Four of the five cluster validity functionals indicate that the optimal number of clusters is 3. The percent fuzziness of the clusters is small for C = 3 (compare with results of run X).Note that percent fuzziness increases for C > 3. All results are strong indicators of cluster validity. ***Simulation run described in Table 11. The m is the weighting constant in Eq. (3).An asterisk (*) denotes maximum values of F, NFI, min-ht, and mean-ht, and minimum value of H.Four of the five validity functionals indicate that the optimal number of clusters is 2. However, note that the percent fuzziness of clusters is large (compare with results of run B), even for this small value of m ( m = 1.15). This tends to indicate that if there is clustering of the sample data, it is very fuzzy.

class of structures the polypeptide assumes and can be further analyzed, e.g., for secondary structure. For simulations at 300 K, the results of clustering of the rms all-atom and C,-atom distance matrices were very similar both in terms of the size and membership of clusters, and with regards to the structure of the conformer closest to the cluster center. Therefore, conformations of these small polypeptides are determined primarily by the positions of the atoms of the backbone. Construction of the C,-atom rms distance matrix is less time consuming than the allatom rms distance matrix. Nevertheless, fuzzy cluster analysis with either distance matrix consumes a trivial amount of computer time compared to the generation of the MD trajectory. Cluster analysis should prove to be a n additional, useful tool with which to examine the results from MD simulations on proteins and polypeptides. Fuzzy cluster analysis of MD trajectories is relatively fast, easy to apply, and can be used to analyze trajectories from nonequilibrium simulations, since it does not require that the sample be drawn from any statistical distribution. Indeed, examination of the cluster membership can indicate whether or not conformational equilibrium has been attained. If a sequence

of clusters is generated from successive MD time domains, as was seen here (see Fig. 2), then it is unlikely that conformational equilibrium has been achieved. On the other hand, if conformers comprising a given cluster originate from disparate parts of the MD trajectory, then the system is probably oscillating between some equilibrium conformations. In the latter case, if the MD simulation were of sufficient length, it would be possible to estimate both relative populations of different structural conformers and transition rates among those conformers from the details of the clustering analysis. Although we have specifically examined MD simulations, because they are the most prevalent simulation technique for proteins, fuzzy clustering is equally applicable to molecular conformations produced from Monte Carlo simulations. It is worth emphasizing that fuzzy cluster analysis is an effective new indicator of (global) conformational equilibrium. Evaluating whether or not a model system has attained equilibrium is of significant concern for both MC and MD simulation^.^ Lack of drifting of the system energy is not always a sufficiently good indication that the observable property to be measured has equilibrated. For the polypeptide fragments studied


260

p-? Run C

Run E

Y / b ) / v p a Run F

Run G Fig. 3. Cluster centers of MD trajectories. The conformers and closest to the cluster centers as found from all-atom (-) C -atom (- - -) rms distance matrices are superimposed (see Tiible 111). Only C, atoms are plotted. Run A: (1) 58 and 284 psec; (2) 1,956 and 1,982 psec. Run 8 : (1) 201 and 192 psec; (2), 1,083

and 900 psec; (3)2,688 and 2,718 psec. Run C:914 and 714 psec. Run D: (1) 694 and 232 psec; (2),1,676 and 1,684 pSeC. Run X: 921 and 221 psec. Run E: (1) 74 psec; (2), 1,440 and 948 psec. Run F: (1) 33 and 32 psec; (2) 265.5 and 285 psec. Run G: 144.5 and 114 psec.


here, fuzzy clustering conclusively demonstrated that conformational equilibration had not been attained, despite equilibration of the energy.

REFERENCES 1. McCammon, J.R., Gelin, B.R., Karplus, M. Dynamics of folded proteins. Nature (London) 267585490, 1977. 2. Hermans, J. (ed). “Molecular Dynamics and Protein Struc-

ture.” Western Springs, IL: Polycrystal Book Service, 1985. 3. Swaminathan, S., Ravishanker, G., Beveridge, D.L., Lavery, R., Etchebest, c., Sklenar, H. Conformational and helicoidal analysis of the molecular dynamics of proteins: “Curves”, dials and windows for a 50 psec dynamic trajectory of BPTTI. Proteins 8:179-193, 1990. 4. Rojewska, D., Elber, R. Molecular dynamics study of secondary structure motions in proteins: Application to myohemerythrin. Proteins 7:265-279, 1990. 5. Allen, M.P., Tildesley, D.J. “Computer Simulation of Liquids.” Oxford Clarendon Press, 1987. 6. Zadeh, L.A. Fuzzy sets. Inform. Control 8:338-353, 1965. 7. Dunn, J.C. A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters. J . Cybernet. 3(3):32-57, 1974. 8. Gustafson, D.E., Kessel, W.C. Fuzzy clustering with a fuzzy covariance matrix. In: “Advances in Fuzzy Set Theory and Applications.” Gupta, M.M. (ed). Amsterdam: North-Holland, 1979. 9. Ruspini, E.H., Numerical Methods for Fuzzy Clustering, Inform. Sci. 2:319-350, 1970. 10. Bezdek, J.C., Coray, C., Gunderson, R., Watson, J. Detection and characterization of cluster substructure. S U M J. Appl. Math. 40:339-372, 1981. 11. Bezdek, J.C. “Pattern Recognition with Fuzzy Objective Function Algorithms.” New York Plenum Press, 1981. 12. Bezdek, J.C., Ehrlich, R., Full, W. FCM: The Fuzzy cMeans clustering algorithm. Comput. Geosci. 10:191-203, 1984. 13. Kim, T., Bezdek, J.C., Hathaway, R. Optimality tests for fixed points of the Fuzzy c-Means algorithm. Pattern Recog. 21:651-663, 1988. 14. Roubens, M. Pattern classification problems and fuzzy sets. Fuzzy Sets Syst. 1:239-253, 1978. 15. Leszczynski, K., Penczek, P., Grochulski, W. Sugeno’s fuzzy measure and fuzzy clustering. Fuzzy Sets Syst. 15: 147-158, 1985. 16. Foglein, J . Objective functions in fuzzy clustering. In: “Proceedinas. 8th International Conference on Pattern Recognition, Paris, 1986,”pp. 625-627. Computer Society Press, 1986. 17. Hathaway, R.J., Bezdek, J.C., Tucker, W.T. An improved convergence theory for the Fuzzy c-Means clustering algorithms. In: “Analysis of Fuzzy Information, Vol. 3.” Bezdek, J.C. (ed.) Boca Raton: CRC Press, 1987. 18. Rivera, F.F., Zapata, E.L., Carazo, J.M. Cluster validity based on the hard tendency of the fuzzy classification. Pattern Recog. Let. 11:7-12, 1990. 19. Jain, A.K., Moreau, J.V. Bootstrap technique in cluster analysis. Pattern Recog. 20547-568, 1987. 20. Gath, I., Geva, A.B. Fuzzy clustering for the estimation of the parameters of the components of mixtures of normal distributions. Pattern Recog. Lett. 9:77-86, 1989. 21. Bezdek, J.C. Cluster validity with fuzzy sets. J . Cybernet. 3(3):58-73, 1974. 22. Habener, J.F., Rosenblatt, M., Potts, J.T., J r . Parathyroid hormone: biochemical aspects of biosynthesis, secretion, action, and metabolism. Physiol. Rev. 64:985-1053, 1984. 23. Lee, S.C., Russell, A.F. Two-dimensional ‘H-NMR study of the 1-34 fragment of human parathyroid hormone. Biopolymers 28:1115-1 127, 1989. 24. Klaus, W., Dieckmann, T., Wray, V., Schomburg, D., Wingender, E., Mayer, H. Investigation of the solution structure of the human parathyroid hormone fragment (1-34) by ‘H NMR spectroscopy, distance geometry, and molecular dynamics calculations. Biochemistry 306936-6942, 1991. 25. Barden, J.A., Kemp, B.E. NMR study of a 34-residue Nterminal fragment of the parathyroid-hormone-related

26.

27.

28. 29. 30.

31. 32.

33. 34. 35.

261

protein secreted during humoral hypercalcemia of malignancy. Eur. J. Biochem. 184:379-394, 1989. Zull, J.E., Smith, S.K., Wiltshire, R. Effect of methionine oxidation and deletion of amino-terminal residues on the conformation of parathyroid hormone. J. Biol. Chem. 265: 5671-5676, 1990. Neugebauer, W., Surewicz, W.K., Gordon, H.L., Somorjai, R.L., Sung, W., Willick, G.E. Structural elements of human parathyroid hormone and their possible relationship to biological activity. Biochemistry 31:2056-2063, 1992. Zull, J.E., Lev, N.B. A theoretical study of the structure of parathyroid hormone. Proc. Natl. Acad. Sci. U.S.A. 77: 3791-3795,1980. Fiskin, A.M., Cohn, D.V., Peterson, G.C. A model for the structure of bovine parathormone derived by dark field electron microscopy. J . Biol. Chem. 252:8261-8268, 1977. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J . Comput. Chem. 4:187-217, 1983. CHARMm version 21.2, Polygen Corporation, Waltham, MA, 1990. Rychaert, J.P., Ciccotti, G., Berendsen, H.J.C. Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J . Comput. Phys. 23:327-341, 1977. van Gunsteren, W.F., Berendsen, H.J.C. Algorithms for macromolecular dynamics and constraint dynamics. Mol. Phys. 34:1311-1327, 1977. Zuker, M., Somorjai, R.L. The alignment of protein structures in three dimensions. Bull. Math. Biol. 5155-78, 1989. Polygen Corporation, Waltham, MA; QUANTA version 3.0, 1990.

APPENDIX A. FUZZY CLUSTERING ANALYSIS OF ‘‘BU’ITERFLY’’ DATA Table A1 and Figures A1 and A2 contain the results of a fuzzy cluster analysis of a two-dimensional set of “butterfly” data. The coordinates of the data are given in the footnotes of Table AI. A clustering analysis using the coordinates of the 11 points was performed, for C = 2 to 5, with the weighting constant rn = 1.5,2.0,3.0.The identity matrix was used for A in Eq. (4)[hyperspherical (circular in two dimensions) clusters are sought]. The values of the cluster validity functionals F, H , NFZ, min-ht, and mean-ht are shown in Table AI. The optimal number of clusters into which these data can be partitioned is the value of C for which F, NFZ, min-ht, and mean-ht attain their maximum values, and H its minimum value. From Table AI, it appears that the data are best divided among 2 or possibly 3 clusters. As C + N , the number of data points, the functional will begin to indicate optimal partitioning at N clusters, with each cluster having only one of the data points as a member. (Note that NFZ, min-ht, and mean-ht peak a t C = 2 or 3, decrease, and then start to increase a t C = 5, for which there is a mean cluster membership of only 2.) Figure A1 illustrates the optimal partitioning of the data between two clusters, the positions of the cluster centers, and the values of the memberships liik for each data point. For rn = 1.5, all but the central point at coordinates (0,O)have hard memberships to either one or the other of the clusters. As rn increases, the clusters become more fuzzy. Note the corresponding changes in the validity functionals. For example, the maxi-

262


TABLE AI. Cluster Validity Functionals as a Function of the Number of Clusters C and Weighting Constant m for “Butterfly” Data* mT 1.5

2.0

3.0

C

F

lP

NFI**

min-htt’

2 3 4 5 2 3 4 5 2 3 4 5

0.949* 0.940 0.897 0.893 0.873* 0.744 0.689 0.713 0.698* 0.590 0.507 0.562

0.081* 0.131 0.204 0.222 0.231* 0.444 0.588 0.573 0.475* 0.709 0.935 0.896

0.898 0.910* 0.863 0.866 0.747* 0.616 0.585 0.641 0.397* 0.385 0.343 0.453

2.511* 2.029 1.776 3.616 1.298 5.519* 0.945 8.791 0.672 5.331* 0.509 6.374

mean-ht**

1.641* 1.619 1.325 2.245 0.990 2.170* 0.716 2.270 0.568 1.993* 0.383 1.616

*Number of data points, N = 11; coordinates (ykl,y k z ) = {(-4,2),(-4,0),(-4,-2),(-2,1),(-2,l), (0,0),(2,1)(2,-1),(4,2),(4,0),(4,-2)}. An asterisk (*) denotes maximum values of F, NFZ, min-ht, and mean-ht, and minimum value of H for 2 5 C 5 4 (C is approaching N at C = 5, and these results are reported for interest only-see text). tWeighting constant in Eq. (3). $Partition coeficient, Eq. B1. ‘Entropy, Eq. B2. **Nonfuzzy index, Eq. B3. ttMinimum hard tendency, see Appendix B. **Meanhard tendency, see Appendix B.

m = 1.5, C = 2

m = 2.0, C

(0.78.0.22)

=2

(0.22, 0.78)

(0.89,O.ll)

(0.11,0.89)

(0.78.0.22)

(0.22.0.78)

m = 3.0, C = 2 Fig. Al. Membership values and cluster centers for optimal partitioning of “butterfly” data into two clusters for rn = 1.5, 2.0, and 3.0. (0) Data points (coordinates given in Table Al; membership values (Olk,Ozk) are above each data point; ( x ) cluster centers 0 , and 0,: for rn = 1.5,(Pll, P12) = (-3.0,0.0), (Pzl, GZ2) =

(3.0,O.O); for rn = 2 and 3, (P,l, P,J = (-3.1,0.0), (P2,, PZ2) = (3.1,O.O).The data point at coordinates (0,O)that falls within the intersection of the clusters is equally shared by both clusters.

263


I

'

I)

(0.55.0.39.0.06)

1

I'

(0.99,0.01,0) 'X

1

\

\

(0.06,0.39,0.55)

'

):0(

1

(o,y;o.99)

(0.92,0.06,0.02)

m=2,C=3 (0.7,0.2,0.1)

(0,0.1,0.9)

(0.9,0.1,0) 1

X'

X

X '

2

,

(0.4,0.4,0.2)

m=3,C=3 Fig. A2. Membership values and cluster centers for optimal partitioningof "butterfly" data into three clusters form = 2 and 3. (0) Data points (coordinates given in Table Al); membership Values (Oln02n03k) are above each data point; ( x ) cluster centers O , , 3,, and 0,: for rn = 2.0, (P,,, P12) = (-3.6,0.0), (P21, P22) =

P32) = (3.6,O); for m = 3, (P,,, P12)= (-3.8,0.0), = (3.8,O). The data points that fall within the intersection of two clusters are fuzzy members of both clusters.

mum value that both F and NFI can attain is 1, indicating a totally hard partitioning of the data. For the same partitioning of the data between two clusters, the value of F and NFI decrease with increasing m. The central point at (0,O) is equally = CZk shared by both clusters for all values of m, = 1/C = 0.5.The data points a t (-4,O) and (4,O) are closest to the cluster centers, and have the largest membership values to the corresponding cluster. As m increases, it becomes less clear whether the optimal partitioning of the data is among two or three clusters. Figure A2 shows how the data set is divided among three clusters for m = 2 and 3, with the corresponding cluster centers and membership values. Here, data points with coordinates (-2,l) and (- 2, - 1)are fuzzy members of the first and second clusters, and points (2,l) and (2,-1) are fuzzy members of the second and third clusters. As m increases, it is apparent that & + 1/C (1/3) as the membership of these four points becomes more fuzzy. The decision as to whether or not this data set is best represented by two or three clusters may depend

upon contextual information about the sample, e.g., prior knowledge about the distribution from which the sample was drawn, and also upon the purposes for which the information about the clusters is to be used. Whereas for this particular example, visual inspection alone can be used to recognize patterns (which will not be possible for higher dimensional samples), fuzzy clustering enables us to express numerically the extent of membership sharing.

(0,O):(P,,,

(f21,v22) = (O,O), (P,,, P32)

APPENDIX B: CLUSTER VALIDITY FUNCTIONALS All of the following cluster validity functionals are independent of both the number of clusters C and the size of the data set N . This means that they can be used to discriminate between different locally optimal clustering solutions for a given value of C and between cluster solutions for different C . Partition Coefficient'1*12y21 The partition coefficient F is defined as N

F

C

(&)'/N.

= 1i=l

264

H.L. GORDON AND R.L.SOMORJAI

It can be shown that 1/C 5 F 5 1, where the lower bound corresponds to a totally fuzzy partitioning of the data (a, = 1/C, V i,k), and the upper bound to a totally hard partitioning of the data (ai,= 1or 0, V i,k). The magnitude of F indicates the average relative painvise sharing of members between clusters. Larger values of F correspond to partitionings of the data set that have fewer shared members, and thus form tighter clusters.

Nonfuzzy Index14 The equation for the nonfuzzy index NFZ is

Entropy”.12 The entropy functional H is given by

The range of values are 0 5 NFZ 5 1. The lower bound corresponds to a totally fuzzy (aik= 1/C, V i,k) partition of the data and the upper bound to a completely hard (ai,= 1 or 0, V i,k) partitioning of the data. Maximal values of NFI correspond to well-defined clusters.

The range of H is between 0 5 H 5 In C , where the lower bound corresponds to a totally hard (ajk= 1or 0, V i,k) and the upper bound to a totally fuzzy (ai,= 1/C,V i,k) partitioning of the data. Tightly clustered data can be said to exhibit a low degree of disorder (of which entropy is a measure). A fuzzy cluster solution that has a low value of H has fewer shared members than one with a larger H .

Minimum and Mean Hard Tendencies” The mean hard tendency (mean-ht) of a set of clusters is a measure of the average “hardness” or “tightness” of the clusters. The minimum hard tendency (min-ht) functional is the value of the hard tendency of the most hardhonfuzzy of the C clusters. The range of values are 0 5 min-ht; mean-ht 5 a. Maximal values of both functionals are indicative of well-defined, hard clusters. Several equations are required to define the mean-ht and min-ht, and interested readers are directed to ref. 18.

Available Instruments for Analyzing Molecular Dynamics Trajectories.

Molecular Dynamics of "Fuzzy" Transcriptional Activator-Coactivator Interactions.

Real-time interactive frequency filtering of molecular dynamics trajectories.

MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories.

Molecular analysis of cluster headache.

mDCC_tools: characterizing multi-modal atomic motions in molecular dynamics trajectories.

Clustering molecular dynamics trajectories for optimizing docking experiments.

Cluster analysis of molecular simulation trajectories for systems where both conformation and orientation of the sampled states are important.

Fuzzy molecular surfaces.

Two worlds collide: image analysis methods for quantifying structural variation in cluster molecular dynamics.

Resolving the anomalous infrared spectrum of the MeCN-HCl molecular cluster using ab initio molecular dynamics.

Compositional Dynamics: Defining the Fuzzy Cell.

Cluster Analysis as a Method of Recovering Types of Intraindividual Growth Trajectories: A Monte Carlo Study.

Trajectories of microsecond molecular dynamics simulations of nucleosomes and nucleosome core particles.

Single Molecule Cluster Analysis dissects splicing pathway conformational dynamics.

Projection of Monte Carlo and molecular dynamics trajectories onto the normal mode axes: human lysozyme.

Multivariate spatial condition mapping using subtractive fuzzy cluster means.

Random Matrix Theory in molecular dynamics analysis.

NMR Order Parameter Determination from Long Molecular Dynamics Trajectories for Objective Comparison with Experiment.

hyperactivity disorder with fuzzy entropy.

JED: a Java Essential Dynamics Program for comparative analysis of protein trajectories.

Molecular Dynamics of Biomolecules through Direct Analysis of Dipolar Couplings.

Molecular dynamics analysis of the aggregation propensity of polyglutamine segments.

Fuzzy clustering analysis of osteosarcoma related genes.