J. Mol. Biol. (1992) 227, 371-374

Predicting Protein Secondary Structure with a Nearest-neighbor Algorithm

Steven Salzberg and Scott Cost
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, U.S.A.

(Received 3 March 1992; accepted 7 April 1992)

We have developed a new method for protein secondary structure prediction that achieves accuracies as high as 71.0%, the highest value yet reported. The main component of our method is a nearest-neighbor algorithm that uses a more sophisticated treatment of the feature space than standard nearest-neighbor methods. It calculates distance tables that allow it to produce real-valued distances between amino acid residues, and attaches weights to the instances to further modify the structure of feature space. The algorithm, which is closely related to the memory-based reasoning method of Zhang et al., is simple and easy to train, and has also been applied with excellent results to the problem of identifying DNA promoter sequences.

Keywords: protein secondary structure; memory-based reasoning; nearest-neighbor methods; neural nets

This communication reports on a new machine learning algorithm that has been applied successfully to the problem of protein structure prediction. The method classifies amino acid sequences into one of three categories, α-helix, β-sheet, or coil, and has achieved accuracy of 71% on a data set of 106 proteins. In a previous study reported in this journal, a neural net learning method achieved 64.3% accuracy on the same data (Qian & Sejnowski, 1988). More recently, another study in this journal by Zhang et al. (1992) described a hybrid method that achieves 66.4% accuracy on a different data set consisting of 107 proteins. All of these methods compare favorably to earlier techniques (e.g. Lim, 1974; Chou & Fasman, 1978; Garnier et al., 1978). The purpose of this research note is to draw attention to the capabilities of methods in the machine learning community, with the hope that researchers in the biological sciences will take advantage of these methods. Our method has also been shown to work well on identifying DNA promoter sequences and pronouncing English text. Details of those experiments, and full details of the algorithm, can be found in Cost & Salzberg (1992). The purpose of this note is to present the results on protein secondary structure prediction, in order to complement the two longer papers previously published in this journal.


The disciplines of machine learning and pattern recognition have for many years been exploring methods for classification. The protein folding problem falls into the set of problems appropriate for these methods. In particular, when protein folding is expressed as a problem of taking as input a string of amino acids and producing a label from the set {α-helix, β-sheet, coil}, it becomes clear that many classification techniques might be tested on this problem. Nearest-neighbor classification is one of the most well-known methods, having been studied since the early 1950s (Fix & Hodges, 1952). More recent methods have emphasized building models in the form of rules, decision trees (Quinlan, 1986), or hyper-rectangles (Salzberg, 1991), but the nearest-neighbor method is probably the simplest algorithm for performing classification. However, when the features have symbolic, unordered values (e.g. the 20 amino acids in a globular protein, which have no natural inter-value "distance"), nearest-neighbor methods typically resort to much simpler metrics, such as Hamming distance. The Hamming distance between two amino acid strings is simply the number of positions at which the two strings mismatch. Simpler metrics may fail to capture the complexity of the problem domain, and as a result may not perform well.
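For concreteness, the Hamming distance between two fixed-length residue windows can be computed in a few lines; the helper below is our own illustrative code, not part of PEBLS.

```python
def hamming_distance(window_a: str, window_b: str) -> int:
    """Number of positions at which two equal-length residue strings differ."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(a != b for a, b in zip(window_a, window_b))

# Example: two 13-residue windows that differ at three positions
print(hamming_distance("TDYGNDVEYXGQV", "TDYGNEVEYAGKV"))  # -> 3
```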

Our algorithm enhances standard nearest-neighbor classification by first constructing distance tables that define a numeric distance between two amino acids, and then attaching weights to individual examples. These weights create "exception spaces" around the examples: i.e. areas of the attribute space in which a single example dictates how to classify new examples. The combination of these two techniques results in a robust nearest-neighbor learning algorithm that works for any domain with symbolic feature values. Our implementation, PEBLS (Parallel Exemplar-Based Learning System), performs as well as any previously reported method for predicting protein secondary structure. Its training time is much faster than the neural net methods that have been used for this problem. In addition, it is easy to parallelize our algorithm for even greater speed-ups, as we have shown by implementing our system on a parallel, transputer-based computer.

PEBLS was designed to process instances that have symbolic feature values. The heart of the PEBLS algorithm is the way in which it measures distance between two examples. This consists of essentially three components. The first is a modification of Stanfill & Waltz's (1986) Value Difference Metric (VDM), which defines the distance between different values of a given feature. (Zhang et al. (1992) also use a modification of the Stanfill-Waltz VDM, though it is different from ours.) Our metric gives us a good estimate of the distance between two amino acids. Our second component is a standard distance metric for measuring the distance between two examples in a multi-dimensional attribute space. Finally, the distance is modified by a weighting scheme that weights instances in memory according to their performance history (Salzberg, 1990). The learning problem can be described as follows: given a sequence of residues from a fixed-length window from a protein chain, classify the central residue in the window as α-helix, β-sheet, or coil. The setup is as follows:

    window:  TDYGNDVEYXGQVT E GTPGKSFNLNFDTC
                            ^
                     central residue

Qian & Sejnowski (1988), Holley & Karplus (1989), and Zhang et al. (1992) formulated the problem in exactly the same manner. PEBLS requires two passes through the training set. During the first pass, tables are constructed that contain the distances between the amino acid values. A different table is created for each position in the window, according to the equations shown below. In the second pass, the system attempts to classify each instance in the training set, using the nearest neighbor to make each prediction. The system then checks to see if the classification is correct, and uses this feedback to adjust a weight on the old instance. The weight reflects how reliable a classifier each instance is. Finally, the new instance is stored in memory. During testing, examples are classified in the same manner, but no modifications are made to memory or to the distance tables.

In 1986, Stanfill and Waltz presented a powerful new method for measuring the distance between attribute values in domains with symbolic attributes. They applied their technique to the problem of pronouncing English text, with impressive results (Stanfill & Waltz, 1986). Their Value Difference Metric takes into account the correlations between each possible attribute value and each class. Our method works similarly, using the 20 amino acids as the set of attribute values and the three fold types as the classes. First, we estimate the distance between all amino acids statistically, based on the examples in the training set. The distance δ between two amino acids is defined in equation (1):

    δ(A1, A2) = Σ_{i=1}^{n} | C1i/C1 − C2i/C2 |^k                    (1)

In the equation, A1 and A2 represent two amino acids. The distance between the amino acids is a sum over all n classes, where n = 3 for this experiment. C1i is the number of times A1 was classified into category i, C1 is the total number of times A1 occurred (C2i and C2 are defined analogously for A2), and k is a constant, usually set to 1. The idea behind this metric is that values should be similar if they occur with the same relative frequency for all classes. The term C1i/C1 represents the likelihood that the central residue will be classified as i given that the feature in question has value A1. Thus we say that two values are similar if they give similar likelihoods for all possible classifications. Equation (1) computes overall similarity between two values by finding the sum of the differences of these likelihoods over all classifications.

Equation (1) defines a geometric distance on a fixed, finite set of values. It obeys all the requirements of a distance metric: (i) δ(a,b) > 0, a ≠ b; (ii) δ(a,b) = δ(b,a); (iii) δ(a,a) = 0; (iv) δ(a,b) + δ(b,c) ≥ δ(a,c). Stanfill and Waltz's original VDM was non-symmetric; e.g. δ(a,b) ≠ δ(b,a). However, both our metric and the one used by Zhang et al. (1992) are symmetric.
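As an illustration of how one of the per-position distance tables of equation (1) might be estimated from training counts, the following sketch builds the table for a single window position. The function name, the class labels and the input format are our own choices, not the PEBLS implementation.

```python
from collections import defaultdict

CLASSES = ["helix", "sheet", "coil"]          # the n = 3 secondary-structure classes

def value_difference_table(column_values, column_classes, k=1):
    """Estimate delta(A1, A2) of equation (1) for one window position.

    column_values:  amino acid observed at this position in each training window
    column_classes: class of the central residue of the corresponding window
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[value][class]
    totals = defaultdict(int)                        # totals[value]
    for value, cls in zip(column_values, column_classes):
        counts[value][cls] += 1
        totals[value] += 1

    table = {}
    for a1 in totals:
        for a2 in totals:
            # sum over classes of |C1i/C1 - C2i/C2|^k
            table[(a1, a2)] = sum(
                abs(counts[a1][c] / totals[a1] - counts[a2][c] / totals[a2]) ** k
                for c in CLASSES
            )
    return table

# Toy usage: one window position observed in four training examples
tbl = value_difference_table(["A", "A", "G", "L"],
                             ["helix", "helix", "coil", "sheet"])
print(tbl[("A", "G")])   # distance between alanine and glycine at this position
```

In PEBLS a separate table of this form is built for each of the window positions during the first training pass.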

The total distance Δ between two instances is given by equation (2):

    Δ(X, Y) = wX wY Σ_{i=1}^{N} δ(xi, yi)^r                    (2)

where X and Y represent two instances (e.g. two windows for the protein folding domain), with X being an exemplar in memory and Y a new example.


The variables xi and yi are values of the ith feature for X and Y, where each example has N features. wX and wY are weights assigned to exemplars, described below. For a new example Y, wY = 1. r is a constant, in this experiment set to 1.

Some stored instances are more reliable classifiers than others. Intuitively, one would like these trustworthy exemplars to have more "drawing power" than others. An important change that our method makes to standard nearest-neighbor methods is the use of the weight wX in our distance formula: reliable exemplars are given smaller weights, making them appear closer to a new example. Our weighting scheme was first adopted in the EACH system (Salzberg, 1990, 1991), which assigned weights to exemplars according to their performance history. wX is the ratio of the number of uses of an exemplar to the number of correct uses of the exemplar; thus, accurate exemplars will have wX ≈ 1. Unreliable exemplars will have wX > 1, making them appear further away from a new example. These unreliable exemplars may represent either noise or "exceptions", small areas of feature space in which the normal rule does not apply. The more times an exemplar is incorrectly used for classification, the larger its weight grows. When PEBLS computes distance from a new instance to a weighted exemplar, that distance is multiplied by the exemplar's weight. Intuitively, that makes it less likely for a new instance to appear near an exemplar as the exemplar's weight grows. Geometrically, the use of weights creates a spherical envelope around the exemplar with the larger weight, defining an "exception space" that shrinks as the weight difference increases. Only points inside the sphere will match the point with the larger weight.
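Equation (2) and the weight bookkeeping just described might be sketched as follows. The class and function names are ours, not those of the parallel PEBLS implementation, and the handling of unseen value pairs and of never-correct exemplars is our own guard rather than anything specified by PEBLS.

```python
def weighted_distance(exemplar_window, new_window, tables, weight, r=1):
    """Delta(X, Y) of equation (2); tables[i] is the distance table for position i."""
    d = sum(tables[i].get((x, y), 1.0) ** r          # default distance for unseen pairs (our choice)
            for i, (x, y) in enumerate(zip(exemplar_window, new_window)))
    return weight * d                                 # w_Y = 1 for a new example, so only w_X appears

class Exemplar:
    def __init__(self, window, label):
        self.window, self.label = window, label
        self.uses, self.correct = 0, 0

    @property
    def weight(self):
        # ratio of uses to correct uses; reliable exemplars stay near 1, unreliable ones grow
        return self.uses / self.correct if self.correct else self.uses + 1

def classify(memory, new_window, tables):
    """Return the nearest (weighted) exemplar to the new window."""
    return min(memory, key=lambda e: weighted_distance(e.window, new_window,
                                                       tables, e.weight))

def training_pass(memory, window, true_label, tables):
    """Second training pass: predict, update the nearest exemplar's weight, then store."""
    if memory:
        nearest = classify(memory, window, tables)
        nearest.uses += 1
        if nearest.label == true_label:
            nearest.correct += 1
    memory.append(Exemplar(window, true_label))
```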


The protein sequences used for our experiments were originally from the Brookhaven National Laboratory. Secondary structure assignments of α-helix, β-sheet and coil were made based on atomic co-ordinates using the method of Kabsch & Sander (1983). Qian & Sejnowski (1988) collected a database of 106 proteins, containing 128 protein subunits. This experiment used the same set of proteins and subunits. For our initial experiments, we divided the data into a training set containing 100 protein subunits and a test set containing 28 subunits. There was no overlap between the two sets. The composition of the sets is detailed in Table 1, which shows that the percentages of the three categories were approximately the same in the test set as in the training set.† Protein subunits were not separated: i.e. all instances drawn from one subunit resided together either in the training or the testing set.

Table 1. Comparison of training and test sets

We trained PEBLS on a variety of different window sizes, ranging from 3 to 21. We found that a window of size 19 was optimal, though nearly identical results were obtained with a window of size 17. The PEBLS algorithm includes, for this domain, a post-processing step based on the minimum sequence-length restrictions used by Holley & Karplus (1989). These restrictions stated that a β-sheet must consist of a contiguous sequence of no fewer than two residues, and an α-helix of no fewer than four. When a predicted subsequence did not conform to these restrictions, the individual residues were re-classified as coil.

The PEBLS program achieved 71.0% overall accuracy on the test set, compared to the best result of 64.3% reported by Qian & Sejnowski (1988) for the same data. Another frequently used measure of performance is the correlation coefficients of Matthews (1975), which provide a measure of accuracy for each of the categories. A comparison of PEBLS with the two studies previously published in the Journal of Molecular Biology, as well as the study by Holley & Karplus (1989), appears in Table 2. In each study, the reported accuracies in the Table were measured on a test set that was entirely separate from the training set. Note that Holley & Karplus and Zhang et al. used different data sets from ours. In addition, Zhang's data set was carefully screened to minimize homologies between proteins. Zhang et al. (1992) present additional comparisons with other methods, none of which perform better than the results presented here.

We used the same methodology as Qian & Sejnowski (1988) in order to facilitate direct comparison of the results. Thus our experiments used a single training and test set, which can be misleading.
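A minimal sketch of the minimum sequence-length post-processing described above, assuming predictions arrive as a per-residue list of class labels (our own helper, not the PEBLS code):

```python
from itertools import groupby

MIN_RUN = {"helix": 4, "sheet": 2}   # Holley & Karplus minimum lengths

def apply_length_restrictions(labels):
    """Re-classify helix/sheet runs that are too short as coil."""
    out = []
    for label, run in groupby(labels):
        run = list(run)
        if label in MIN_RUN and len(run) < MIN_RUN[label]:
            out.extend(["coil"] * len(run))
        else:
            out.extend(run)
    return out

print(apply_length_restrictions(
    ["coil", "helix", "helix", "helix", "sheet", "sheet",
     "helix", "helix", "helix", "helix"]))
# the 3-residue helix run becomes coil; the 2-residue sheet and 4-residue helix runs survive
```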

† Qian & Sejnowski (1988) carefully balanced the overall frequencies of the 3 categories in the training and test sets, and we attempted to do the same. Their training set contained 18,105 residues, while ours was slightly smaller, with just 17,142 residues. Although the two collections of proteins were identical, we did not have access to the specific partitioning into training and test sets used by Qian and Sejnowski.

Table 2. Comparison of correlation coefficients
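For reference, the per-category correlation coefficients compared in Table 2 are the Matthews (1975) coefficients, which treat each category as a one-versus-rest binary prediction problem. A sketch of that computation (our own helper, not part of PEBLS):

```python
from math import sqrt

def matthews_coefficient(predicted, actual, category):
    """Correlation coefficient for one category, treating it one-vs-rest."""
    tp = sum(p == category and a == category for p, a in zip(predicted, actual))
    tn = sum(p != category and a != category for p, a in zip(predicted, actual))
    fp = sum(p == category and a != category for p, a in zip(predicted, actual))
    fn = sum(p != category and a == category for p, a in zip(predicted, actual))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

preds = ["helix", "helix", "coil", "sheet", "coil"]
truth = ["helix", "coil",  "coil", "sheet", "coil"]
print(round(matthews_coefficient(preds, truth, "helix"), 2))  # -> 0.61
```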

As Zhang et al. (1992) demonstrate, the variation in performance of a single algorithm from one test set to another can be quite large. Thus a better measure of the accuracy of an algorithm is its average performance on several different test sets. Of the results in Table 2, only those of Zhang et al. (1992) used this methodology. We conducted further experiments with our algorithm using different test sets, as follows. We randomly selected 10% of the data as a test set, and trained on the remaining 90%. We repeated this experiment 10 times, choosing different test and training sets each time, and averaged the results. The overall classification accuracy of PEBLS was WI O. in this experiment, using a window of size 19. In this experiment, we did not post-process the data using minimum sequence-length restrictions (described above).‡ Our algorithm is quite similar to the MBR method of Zhang et al., which achieved cil.f accuracy, also without post-processing.

Our studies show that PEBLS is equal or slightly superior to other methods for predicting protein secondary structure. More details, and results on other problems such as DNA promoter identification, are reported by Cost & Salzberg (1992). The basic learning and classification method is nearest-neighbor, one of the simplest of all methods. The weights that we used to enhance our method may be computed through simple record-keeping, as described above. In addition, the model created by a nearest-neighbor method such as PEBLS is much easier to interpret than, for example, the weight assignments in a neural network. After training, PEBLS contains exemplars that provide specific reference instances, or "case histories" as it were, which may be cited as support for a particular decision. By examining the weights, one can determine whether each instance is a reliable classifier. The distance tables reveal an order on the set of symbolic values that is not apparent in the values alone. The purpose of this research note is to draw attention to the capabilities of methods such as ours.

‡ Our experimental design prevented us from using the minimum sequence-length restrictions, because we did not keep all the residues in a subunit together in this experiment. Instead, we chose examples entirely at random for inclusion in the test set, without considering whether the adjacent examples were also in the test set. Thus, when predicting the secondary structure of a particular residue, one or more of its neighboring residues might be missing from the test set.
