Structure prediction and modelling.

Mark B. Swindells and Janet M. Thornton
University College London, London, UK

Protein structure prediction from sequence remains a major goal in molecular biology. The methods described in this review concentrate on deriving structural information through the detection of similarities between a test sequence and a database of known structures. Such methods are often referred to as knowledge-based strategies, reflecting the use of a structural database in the analyses. The past year has seen considerable advances in both the development of automated procedures and their application to protein sequences of outstanding biological interest. (This review is an updated and modified version of a review first published in Current Opinion in Structural Biology 1991, 1:219-223.)

Current Opinion in Biotechnology 1991, 2:512-519

Introduction

Research into the relationship between sequence and structure of proteins has now been in progress for over a quarter of a century. Initial attempts were limited to predicting secondary structure because there was insufficient information available to predict the overall fold. More recently, however, modelling in three dimensions using related structures has become increasingly important. This is because many newly sequenced proteins are found to belong to a known superfamily. Although homology modelling is also a form of structure prediction, its importance has led to a separate classification within the predictive scheme. With the rapid increase in the number of known sequences to over 25 000 [1], including many of medical importance, it is not surprising that these areas are receiving substantial interest.

Adopting a suitable strategy

The approach used to derive structure from sequence is determined by the degree of similarity between the test sequence and the sequences of known structure. If one can be certain, through the detection of significant sequence similarities, that a novel sequence will have a similar fold to at least one known structure, a model may be constructed by strategically modifying the family of known structures. In the past, all sequences with at least 25% residue identity have been considered to have the same topology. This generalization does not hold for peptides, however, where long-range interactions are absent. For example, pentapeptides with identical sequences can have completely different structures [2,3]. Recently, Sander and Schneider [4••] have attempted to correlate residue identity with sequence length. Their results show that to be confident of topological similarity, only 25% of residues need to be similar when sequences of greater than 80 residues are compared, whereas this number rises to nearly 80% when comparing decapeptides. At lower identities, topology may still be conserved even though it is not yet feasible to detect it from sequence. For example, the globins all look remarkably similar despite sequence identities as low as 14%. In the absence of detectable sequence similarities, the researcher must resort to alternative approaches to structure prediction. In this review, we shall first discuss the publications that are relevant to predicting structure in the absence of such information. We shall then focus on new strategies for modifying known structures, following the order shown in Fig. 1.

Structure prediction

Traditionally, algorithms have concentrated on predicting a residue's secondary structure, in the hope that accurate assignments at this level could be transformed at a later stage into a correct tertiary prediction. Methods generally limit predictions to whether a residue is in a helix, strand or coil conformation and invariably rely on the Kabsch and Sander [5] procedure for deriving these assignments in sequences of known structure. Perhaps the best known methods for predicting secondary structure are those of Chou and Fasman [6], Gibrat et al. [7] and Lim [8], although many other groups have also tried to perfect this approach. More recently, pattern matching methods [9,10] and neural networks [11,12] have gained in popularity. Unfortunately, the accuracy of all these techniques has been disappointing. No method can assign more than 65% of residues to the correct conformation (where there are three conformational states: helix, strand and coil). Faced with this limitation, it is necessary to try to elucidate why predictions go wrong. It would appear that the current size and diversity of the structural database is a major limiting factor [13•,14•]. An increase in the number of known structures would assist research in two ways: firstly, more structural relationships would be detected and secondly, longer sequence patterns with higher inherent reliabilities could be analysed. Assuming that the size of the structural database will not quadruple overnight, one is faced with the prospect of developing novel methods using existing data.

Abbreviations: RMS, root mean square; SCR, structurally conserved region.

© Current Biology Ltd ISSN 0958-1669

Fig. 1. Steps in modelling by homology.

Garratt et al. [15•] have noticed that the edges of β-strands tend to be predicted with less accuracy than their respective internal residues. Using hydrogen bonding patterns to assign β-strand residues as internal or external, propensities were derived for each and applied using the formalism of information theory [7]. Although the results look promising, the method still suffers from a lack of structural data. If additional structural information is available from methods such as circular dichroism, it can be used to significant advantage. For instance, when Niermann and Kirschner [16•] confined their work to the prediction of triose phosphate isomerase barrel topologies, the success of their algorithm, which was again an adaptation of information theory, rose dramatically, with accuracies of up to 95% recorded. Similarly, Kneller et al. [17•] have noted that neural networks behave differently when trained on proteins from distinct structural classes. This observation can be turned to the researcher's advantage in the light of additional information.
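The accuracy figures quoted in this section (65% for general methods, up to 95% in the restricted barrel case) are per-residue scores over three conformational states. As a minimal sketch of how such a score can be computed, assuming that predicted and observed states are supplied as equal-length strings over H (helix), E (strand) and C (coil), the following Python functions count matching positions. The function names and the single-letter encoding are illustrative assumptions and are not taken from any of the cited programs.

```python
def three_state_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def per_state_accuracy(predicted: str, observed: str) -> dict:
    """Accuracy computed separately for each observed state (H, E, C)."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    result = {}
    for state in "HEC":
        # Positions where the observed structure is in this state.
        positions = [i for i, o in enumerate(observed) if o == state]
        if positions:
            correct = sum(predicted[i] == state for i in positions)
            result[state] = correct / len(positions)
    return result


if __name__ == "__main__":
    observed = "HHHHEEEECC"   # hypothetical assignment for a known structure
    predicted = "HHHCEEECCC"  # hypothetical output of a prediction method
    print(three_state_accuracy(predicted, observed))
    print(per_state_accuracy(predicted, observed))
```

Reporting the per-state breakdown alongside the overall figure is a design choice: it exposes cases such as the strand-edge errors noted by Garratt et al., which an overall percentage would hide.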

Neural networks are a useful tool because they allow the rapid detection of relationships (in this case between sequence and structure) through the optimisation of a set of weights, dispensing with the need for manual intervention. Because networks are flexible they can be applied to a range of problems. For instance, networks have been trained with α-carbon distance matrices in order to predict the Cα traces of unknown structures. So far, however, the approach has only been tested with a highly homologous (74% residue identity) structure in the training set [18•]. In addition, Holbrook et al. [19•] have predicted residue accessibility and Muskal et al. [20] have estimated the likelihood of cysteine residues participating in disulphide bridges using similar types of networks.

An alternative to the above technique involves the analysis of hydrophobicity profiles. In order to perform such analyses, a scale of hydrophobicities for the 20 amino acids and an appropriate algorithm are required. Two algorithms are commonly used. The first averages the hydrophobicity values over a specified window of residues in order to predict whether the residue located centrally within the window is buried or accessible. The second approach detects patterns of amphipathicity which are indicative of accessible α-helices and β-strands. For a comprehensive summary of amphipathic α-helices, see the review by Segrest et al. [21•]. Whichever algorithm one decides to use, the choice of hydrophobicity scale remains a matter of heated debate. For an update of the many scales that are now available, the reader is referred to an excellent review by Cornette et al. [22].

One disadvantage of using this type of scale is that residues with both hydrophilic and hydrophobic groups cannot be represented adequately by a single parameter. To circumvent this problem, Lesser and Rose [23••] (the latter arguably being the originator of hydrophobicity analysis) have analysed the contributions of each constituent atomic group to a residue's hydrophobicity. Although this additional information cannot be incorporated into traditional procedures, it should be useful to researchers involved with more ab initio approaches to predicting structure.
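The two profile algorithms described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reproduction of any cited method: the scale used is that of Kyte and Doolittle (one of many possible choices, and not one the review singles out), the default window of seven residues is arbitrary, and amphipathicity is quantified with an Eisenberg-style hydrophobic moment, which is one common formulation among several.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(seq: str, window: int = 7) -> list:
    """Mean hydropathy over an odd-sized window, reported for the central residue.

    Returns (index, mean) pairs; residues too close to either end are skipped.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    values = [KD[r] for r in seq]
    profile = []
    for i in range(half, len(seq) - half):
        mean = sum(values[i - half:i + half + 1]) / window
        profile.append((i, mean))
    return profile


def hydrophobic_moment(seq: str, angle_deg: float = 100.0) -> float:
    """Per-residue hydrophobic moment for a given rotation angle per residue.

    100 degrees corresponds to an ideal alpha-helix; angles of roughly
    160-180 degrees are typically used for beta-strands.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    step = math.radians(angle_deg)
    sx = sum(KD[r] * math.cos(i * step) for i, r in enumerate(seq))
    sy = sum(KD[r] * math.sin(i * step) for i, r in enumerate(seq))
    return math.hypot(sx, sy) / len(seq)


if __name__ == "__main__":
    peptide = "LKKLLKKLLKKLLKKL"  # hypothetical test sequence
    for index, mean in window_average(peptide, window=5):
        print(index, round(mean, 2))
    print(round(hydrophobic_moment(peptide, 100.0), 2))
```

A strongly positive window average suggests a buried central residue on this scale, whereas a large moment at 100 degrees suggests a helix with one hydrophobic face. Both readings depend on the scale chosen, which is precisely the point of debate noted in the text.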

Modelling

The first attempts to model by homology were made as early as 1969 when Browne et al. [24] proposed a model for α-lactalbumin based on the crystal structure of hen egg white lysozyme, but the first rigorous approach involving the use of multiple structures was made by Greer [25] using mammalian serine proteinases. From his study, the principal steps in modelling shown in Fig. 1 were established.


[Running head: Protein engineering]

Framework definition

Chothia and Lesk [26] showed that for proteins of a similar fold, the root mean square (RMS) deviation of topologically equivalent residues increased as the sequence identity between them decreased. This effect is often observed as a result of relative movement between regions of secondary structure. On superimposing very similar structures upon one another, one is immediately able to distinguish regions of higher conservation; these are commonly referred to as structurally conserved regions (SCRs). At lower percentage identities, however, rigid body superposition fails to define satisfactorily all the residues that one would intuitively consider equivalent. Thus, an alternative description of protein structure is required.

Three groups [27,28••,29] have approached this problem independently. Taylor and Orengo [27] make use of a combination of distance matrices and vectors. Using these, similarities can be detected and alignments made by applying dynamic programming to the resultant similarity scores. Šali and Blundell [28••] approach the problem in a similar way, but also consider the structural environment of each residue and use simulated annealing to optimise geometric relationships such as hydrogen bonding. Mitchell et al. [29] describe protein structure using graph theory in which nodes correspond to regions of regular secondary structure and edges relate to the angles and distances separating the nodes. By detecting similarities between different graphs, topological equivalences may be inferred. All three methods appear to work well and will greatly assist the automatic detection of similar motifs within different structures.
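To illustrate why a distance-based description avoids the need for rigid body superposition, the sketch below compares two structures through their internal Cα-Cα distance matrices. It is a deliberately simple stand-in, not the Taylor and Orengo algorithm or any other cited method: it assumes that the residue equivalences are already known and that coordinates are given as (x, y, z) tuples in ångströms. The function names are illustrative.

```python
import math


def distance_matrix(coords):
    """All pairwise distances within one set of 3D coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_rms(coords_a, coords_b):
    """RMS difference between equivalent intramolecular distances.

    Internal distances are unchanged by rotation and translation, so no
    superposition step is needed before the comparison.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of equivalent atoms")
    n = len(coords_a)
    if n < 2:
        raise ValueError("at least two atoms are required")
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each unique pair once
            total += (da[i][j] - db[i][j]) ** 2
    pairs = n * (n - 1) / 2
    return math.sqrt(total / pairs)


if __name__ == "__main__":
    # Hypothetical three-atom fragment and a rotated, translated copy of it.
    original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
    moved = [(10.0, 5.0, 1.0), (10.0, 8.8, 1.0), (6.2, 8.8, 1.0)]
    print(distance_rms(original, moved))  # close to zero
```

One known limitation of a pure distance comparison is that it cannot distinguish a structure from its mirror image, which may be one reason why the published methods combine distances with additional information such as vectors or residue environment.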

Alignments

There are many algorithms available for aligning a family of sequences. Pattern matching methods based on multiple sequence alignment have been developed extensively by numerous groups and have been reviewed recently [30••]. Two methods of particular importance specifically address the structure-based problem of distinguishing conserved regions from the more variable loops. This is achieved by generating a consensus or template sequence using the proteins of known structure [31,32•]. It has been suggested that alignments could also be improved by incorporating substitution matrices that consider specifically the structural environment of a residue, instead of the standard Dayhoff-type matrix. Such matrices might be useful for investigating the substitution patterns of buried residues whose main chain conformations are α-helical [33•]. However, such specific tables may lack statistical significance because of insufficient crystallographic data.

Assessing the significance of alignments is also problematic because certain regions of a sequence may be aligned more reliably than others. To circumvent this problem, Vingron and Argos [34••] have developed an algorithm that identifies regions of higher accuracy within a pairwise alignment. A simple yet novel extension of this procedure based on matrix multiplication enables the detection of conserved regions within a family of distantly related sequences [35••].
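Most of the alignment methods mentioned in this section build on dynamic programming. As a minimal sketch, assuming a linear gap penalty and a caller-supplied scoring function, the following Python code performs a global pairwise alignment of the Needleman-Wunsch type. The `identity_score` function and the gap value of -2 are placeholders chosen for illustration; a Dayhoff-type or environment-specific substitution table could be passed in its place without changing the algorithm.

```python
def global_align(a: str, b: str, score, gap: int = -2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # f[i][j] holds the best score for aligning a[:i] with b[:j].
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair two residues
                f[i - 1][j] + gap,                            # gap in b
                f[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return f[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def identity_score(x: str, y: str) -> int:
    """Placeholder scoring: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1


if __name__ == "__main__":
    best, top, bottom = global_align("ACGT", "AGT", identity_score)
    print(best)
    print(top)
    print(bottom)
```

Keeping the scoring function separate from the recursion is the design choice that makes the structural refinements discussed above straightforward to try: only the table changes, not the alignment procedure.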

Construction of structurally conserved regions

SCR construction is still generally approached in the manner defined by Greer [25], using sequentially similar SCRs from the homologous proteins to define the new core. Although this approach is satisfactory when the percentage identity is at least 40%, the errors are larger when the residue identities are lower. A suitable alternative might incorporate distance and chirality constraints. Using these parameters as the input for distance geometry calculations, Havel and Snow [36•] have successfully constructed several trypsin inhibitors. In a similar manner, Šali et al. [37•] have proposed a method for generating models which uses a combination of conjugate gradient minimisation and simulated annealing.

Modelling loops

In homologous structures, most of the changes in sequence and structure are restricted to the loop regions. Thus, modelling the loops accurately is one of the most demanding steps in producing a reliable prediction. Loop conformations are derived either by searching a database for those with appropriate constraints or by using more ab initio molecular mechanics and dynamics approaches. In attempts to understand the basis of loop conformation, several recent papers highlight the importance of key residues in determining immunoglobulin loops [38,39•] as well as the critical role that glycine residues play in defining tight β-hairpin conformations [40]. Chothia et al. [41] have developed a loop modelling approach for immunoglobulin sequences based on their hypothesis that there are only a few canonical main chain conformations. In a separate approach, Martin et al. [42] have developed an algorithm that combines knowledge-based loop extraction, global searches of φ/ψ loop space for possible loop conformations, and energy evaluations. These approaches yield results that are in relatively good agreement with the crystallographic data. Comparisons of predicted and observed structures showed RMS superpositions for main chain atoms from 0.3-1.5 Å and for all atoms from 1.0-2.9 Å.

Clearly, there is a need for a fully automated package for loop modelling that takes into account the key residue hypothesis, searches a database for suitable loops giving preference to those from the same structural family, efficiently attaches the selected loop to the framework, and incorporates rebuilding and energetic approaches, possibly using simulated annealing.

Side chains

Currently, most approaches to modelling involve delineating the main chain conformation and then appending the side chain atoms. Side chain conformation is often conserved between homologous residues even when a replacement has occurred [43]. Consequently, where possible, χ values are taken from the parent structure(s) and used to model the substitute. Other approaches involve conformational searching and energy evaluation in space. Summers & Karplus [44] have presented an excellent and well tested procedure for rule-based side chain modelling. Their method involves four stages: the derivation of information from the template structure; a χ space search using conformational sampling; an initial energy evaluation; and extensive refinement and evaluation involving the calculation of self and interaction energies, accessibilities and hydrogen bonding energies. In their analysis, 92% of the χ1 and 81% of the χ2 angles were positioned correctly using this procedure and disagreements were generally restricted to highly accessible side chains. This careful and exhaustive method, which combines template knowledge with energetic considerations, is to be recommended.

Following previous workers, Schiffer et al. [45•] have used energetics and maximal atomic overlap to determine side chain conformation, and to demonstrate the importance of including a solvation term in energy evaluations for predicting the energy of accessible side chains. A novel method for predicting side chain conformations given the correct backbone structure has also been published recently [46••]. This rapid and completely automated method uses simulated annealing to optimise side chain packing by considering only van der Waals interactions. Over nine proteins, the program gave RMS deviations of 1.77 Å from the native structure, dropping to 1.25 Å for buried core residues. This method is clearly an important advance and will be especially relevant to modelling the side chain conformations of proteins with very low residue identities.

Analyses of side chain packing [47•,48•] have provided an insight into the rules by which side chains pack in proteins. Singh and Thornton [47•] have created an atlas of all 20 x 20 side chain interactions using high resolution structures from the Brookhaven Protein Databank [49]. This will be useful in assessing the predictive power of energy calculations as well as in predicting empirically side chain packing and sites of intermolecular recognition.

Establishing the accuracy of a model

Once a model has been built, it is important to assess its quality. Using simple energy calculations, Novotny et al. [50] have tried to distinguish incorrect structures from their native folds. Non-native conformations were generated by selecting two structures of identical length from different structural classes (e.g. an all α-helical protein and an all β-sheet protein) and associating the sequence of one protein with the structure of the other. The empirical energy function used for this assessment included covalent and non-covalent terms and is typical of those used in commercial modelling packages. Unfortunately, the application of these potentials failed to distinguish the native conformation from the incorrect folds. The inadequacy of such an approach can be attributed at least in part to the exclusion of a solvation energy term in the calculation.

Recently, another attempt to address this problem was made by Hendlich et al. [51•]. Their approach uses a different set of potentials known as potentials of mean force, which are derived from proteins of known structure. Although, initially, such potentials seem to be less sophisticated than those used by Novotny, their derivation from known protein structures means that they effectively include a term for the hydrophobic effect. Using such an approach, this group were able to distinguish successfully the native conformations in a similar experiment to that carried out by Novotny et al. With suitable modifications, this method could be applied easily to any newly constructed model. It is important to realise that the cases tested by Hendlich et al. [51•] are rather extreme examples, and it is still uncertain whether such an approach would detect the correct fold when (more realistically) a homologous structure is available instead of the native conformation.

If one is certain that the correct fold has been modelled, subtle errors such as those in torsion angles become a major source of interest. In these cases, it is advisable to use more traditional energy functions such as those employed by Novotny et al. [50] or to compare the modelled structure with the results of relevant analyses. For instance, if a β-hairpin loop was being modelled it would be prudent to refer to the work of Sibanda et al. [40].

An alternative critical assessment involves the comparison of models whose structures are subsequently solved by X-ray crystallography. This year has seen the publication of two such assessments which confirm that regions of higher sequence conservation are more likely to be correctly modelled [52•,53•]. Greer [52•] investigated relationships within the family of mammalian serine proteinases. By comparing models from his earlier studies [25] with their subsequently derived experimental structures, he investigated the causes of modelling errors and then aligned 35 proteinase sequences using the knowledge gained. From this, he suggests that future models will be more accurate and regions of potential uncertainty will be suitably classified. In another paper [53•], two independently derived models of human immunodeficiency virus proteinase are compared with the subsequently derived crystal structure. The analysis demonstrates clearly the advantage of basing a model on the most sequentially homologous structure.
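The idea of deriving a potential from known structures can be illustrated with the inverse Boltzmann relation, E = -kT ln(f_observed / f_reference), applied to histogram counts. The sketch below is a generic illustration of that relation only. It is not the procedure of Hendlich et al., whose binning, reference state and parameters are not described in this review; the value of kT (about 0.593 kcal/mol near 298 K), the pseudocount and the example counts are all assumptions.

```python
import math


def mean_force_energies(observed_counts, reference_counts, kT=0.593, pseudocount=1.0):
    """Convert observed and reference histogram counts into energies per bin.

    A negative energy marks a bin seen more often in known structures than the
    reference state predicts; a positive energy marks one seen less often.
    With pseudocount=0, every count must be non-zero to avoid log(0).
    """
    if len(observed_counts) != len(reference_counts):
        raise ValueError("histograms must have the same number of bins")
    bins = len(observed_counts)
    total_obs = sum(observed_counts) + pseudocount * bins
    total_ref = sum(reference_counts) + pseudocount * bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / total_obs
        p_ref = (ref + pseudocount) / total_ref
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


if __name__ == "__main__":
    # Hypothetical counts of a residue pair in three distance bins.
    observed = [120, 60, 20]
    reference = [70, 70, 60]
    for energy in mean_force_energies(observed, reference):
        print(round(energy, 3))
```

Because the counts come from folded proteins, effects such as burial of hydrophobic groups are captured implicitly, which is the property the text credits for the success of this class of potential.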

Applications of structure prediction and modelling

Within the past year, modelling has been performed in many areas of therapeutic importance. Examples include the modelling of cytochrome P450 17α, which is a major target for the chemotherapy of prostatic cancer [54], and a model for the central domain of dystrophin, the deficient protein in X-linked Duchenne muscular dystrophy [55]. In addition, regions of the cystic fibrosis gene product have been modelled using their predicted structural homology to the ATP-binding cassette superfamily [56•]. Other papers describe investigations of the structure of RNase Pchl using the sequentially homologous RNase T1 [57], and the structure of photosystem II components based on their homology with the photosystem from purple bacteria [58].

Recently, a model of intact α1-antitrypsin has been constructed using the X-ray structures of two cleaved serpins: α1-antitrypsin and plakalbumin [59•]. Cleavage of the intact structure leads to a remarkable 70 Å separation between the newly generated termini. On the basis of the assumption that the central strand of a β-sheet may be implicated in this movement, a model of the intact form was generated. The intact form of plakalbumin subsequently derived suggests that the strand movement was correctly anticipated, whereas the reactive centre may have been predicted inaccurately.

Modelling techniques are often most reliable for predicting the effects of point mutations on a given structure. Bott and Frane [60•] have developed a procedure for identifying significant structural differences between mutant and wild-type proteins. Using different structural solutions of subtilisin variants, the analysis first established a relationship between structurally equivalent residue identities and the crystallographic B value or temperature factor, which is used to indicate the uncertainty associated with each atomic coordinate. This relationship is then used to detect significant differences between mutated residues. In a separate analysis, Bordo and Argos [61•] have used nine structural families to derive substitution matrices for accessible and inaccessible residues. This type of information should improve the design of future site-directed mutagenesis experiments.

Future directions

Although modelling techniques are normally used to modify known structures, they could also be used to help interpret poor electron density maps. Correa [62••] has developed a fully automated procedure for predicting protein structure solely from α-carbon coordinates. A main chain is generated using glycine, alanine and proline residues and then side chains are appended to the structure. During construction, the structure is continually refined using molecular dynamics. Only a few years ago, projects such as this would have been unrealistic, but current computing facilities now enable molecular dynamics and simulated annealing algorithms to be used almost routinely during refinement.

One continuing problem in structural analyses concerns the visual representation of protein structure. To the untrained eye, proteins, though elegant in form, can be confusing even using the most sophisticated graphics programs. Past procedures have relied on simplified representations of tertiary structure such as the topology diagrams used by Richardson [63]. Unfortunately, information about the precise location of residues and their associated hydrogen bonds is lost using this technique. A new program called HERA [64••] transforms proteins into two-dimensional diagrams, while retaining relevant structural information. As well as enabling hydrogen-bonding diagrams and helical wheels to be drawn automatically, HERA also allows the extraction of simple structural motifs such as β-hairpins. Programs such as this will become increasingly important as more structures are solved.

Many of the newly solved structures reveal a topology which has already been seen, although sequence similarities between the proteins are almost undetectable. For example, the structure of human interleukin-1β closely resembles Erythrina trypsin inhibitor, despite only six identities existing at topologically equivalent positions out of 153 residues. Currently, such low sequence similarities cannot be recognized by standard sequence alignment programs although many groups are working towards this goal. Indeed, the possibility of a significant fraction of the 25 000 sequences currently in the databanks adopting already known topologies highlights the importance of developing reliable modelling packages, particularly once the human genome project comes on line.

References and recommended reading

Papers of special interest, published within the annual period of review, have been highlighted as:
• of interest
•• of outstanding interest

1. BLEASBY AJ, WOOTTON JC: Construction of a Validated, Nonredundant Composite Protein Sequence Database. Protein Eng 1989, 3:153-159.

2. KABSCH W, SANDER C: On the Use of Sequence Homologies to Predict Protein Structure: Identical Pentapeptides can have Completely Different Conformations. Proc Natl Acad Sci USA 1984, 81:1075-1078.

3. ARGOS P: Analysis of Sequence Similar Pentapeptides in Unrelated Protein Tertiary Structures. J Mol Biol 1987, 197:331-348.

4. •• SANDER C, SCHNEIDER R: Database of Homology Derived Protein Structures and the Structural Meaning of Sequence Alignment. Proteins 1991, 9:56-68.
In this paper, an attempt is made to correlate residue identity with sequence length. The results show that to be confident of topological similarity, only 25% of residues need to be similar when sequences of greater than 80 residues are compared, whereas this number rises to nearly 80% when comparing decapeptides. In addition, a database of homology-based secondary structure assignments is derived by aligning to each known structure all sequences deemed to be related on the basis of the current work.

5. KABSCH W, SANDER C: Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22:2577-2637.

6. CHOU PY, FASMAN GD: Empirical Predictions of Protein Conformation. Annu Rev Biochem 1978, 47:251-276.

7. GIBRAT J-F, GARNIER J, ROBSON B: Further Developments of Protein Secondary Structure Prediction Using Information Theory. J Mol Biol 1987, 198:425-443.

8. LIM VI: Structural Principles of the Globular Organisation of Protein Chains. A Stereochemical Theory of Globular Protein Secondary Structure. J Mol Biol 1974, 88:857-872.

9. LEVIN JM, GARNIER J: Improvements in a Secondary Structure Prediction Method Based on a Search for Local Sequence Homologies and its Use as a Model Building Tool. Biochim Biophys Acta 1988, 955:283-295.

10. ROOMAN MJ, WODAK SJ: Identification of Predictive Sequence Motifs Limited by Protein Structure Database Size. Nature 1988, 335:45-49.

11. QIAN N, SEJNOWSKI TJ: Predicting the Secondary Structure of Globular Proteins Using Neural Networks. J Mol Biol 1988, 202:865-884.

12. HOLLEY LH, KARPLUS M: Protein Secondary Structure Prediction with a Neural Network. Proc Natl Acad Sci USA 1989, 86:152-156.

13. • STERNBERG MJE, ISLAM SA: Local Protein Sequence Similarity does not Imply a Structural Relationship. Protein Eng 1990, 4:125-131.
Database searches are used to show that local sequence similarity does not necessarily indicate structural similarity in the absence of additional evolutionary or functional relationships. Although this type of analysis has been executed previously for pentapeptides, this work suggests that the same observations can hold even with sequences longer than 20 residues.

14. • ROOMAN MJ, WODAK SJ: Weak Correlation Between Predictive Power of Individual Sequence Patterns and Overall Prediction Accuracy in Proteins. Proteins 1991, 9:69-78.
It is found that although certain sequence patterns have a high (72%) predictive power, this cannot be translated into an equally successful algorithm because of a shortage of accurate patterns. It is further suggested that these peptides of high intrinsic accuracy could represent the conformations adopted during the first stages of folding, that is, before long-range interactions influence the final structure.

15. • GARRATT RC, THORNTON JM, TAYLOR WR: An Extension of Secondary Structure Prediction Towards the Prediction of Tertiary Structure. FEBS Lett 1991, 280:141-146.
A novel modification to the information theory approach is described. This is based on the observation that there are two distinct substates for β-strands, which the authors classify as external and internal. Predictions using the modified algorithm attempt to distinguish core β-sheet residues from those situated at the edges, and examples show that this approach can be applied to the α/β class of proteins.

16. • NIERMANN T, KIRSCHNER K: Improving the Prediction of Secondary Structure of 'TIM-Barrel' Enzymes. Protein Eng 1991, 4:137-147.
A procedure is devised to optimise secondary structure predictions for triose phosphate isomerase barrel proteins. Improvements were attained by averaging the propensities at each aligned position. Further improvements were achieved by considering other properties such as amphipathicity and flexibility.

17. • KNELLER DG, COHEN FE, LANGRIDGE R: Improvements in Secondary Structure Prediction by an Enhanced Neural Network. J Mol Biol 1990, 214:171-182.
Neural networks have previously been applied to secondary structure prediction with limited success as the networks only included sequence and secondary structure information. This paper has attempted to improve predictions in two ways. The first approach, in which network units were added to detect periodicities in the input sequence, only gave small improvements. However, the inclusion of tertiary structural class led to significant increases in accuracy. Such approaches are valid as protein class can often be anticipated from experiments such as circular dichroism.

18. • BOHR H, BOHR J, BRUNAK S, COTTERILL RMJ, FREDHOLM H, LAUTRUP B, PETERSON SB: A Novel Approach to the Prediction of 3-Dimensional Structures of Protein Backbones by Neural Networks. FEBS Lett 1990, 261:43-46.
Alternative ways to predict protein structure are always of interest. These authors show that neural networks can be used to generate a protein Cα distance matrix using knowledge of sequences, secondary structure patterns and sets of binary distance constraints from homologous proteins.

Swindells and Thornton

19. •

HOLBROOKSR, MUSKALSM, KIM S-H: Predicting Surface Exp o s u r e of Amino Acids from Protein Sequence. Protein Eng 1990, 3:659-665. A neural network is used to predict accessibility from amino acid sequence. The network is first trained using one set of proteins and then tested on another set. Distinguishing only two states (buried and exposed), an accuracy of 72% is achieved. 20.

MUSKALSM, HOLBROOKSR, KIM S-H: Prediction of t h e Disulp h i d e Bonding State of Cysteine in Proteins. Protein Eng 1990, 3:667-672.

21. •

SEGREST JP, DE IX)OF H, DOHLMAN JG, BROUILLETYE CG, ~ANTHARAMAIAHGM: Amphipathic Helix Motif: Classes and Properties. Proteins 1990, 8:103-117. A comprehensive review on amphipathic or-helices. The paper deals with numerous different classes of amphipathic proteins including those associated with membranes. 22.

CORNETrEJL, CEASE KB, MARGAIaTH, SPOUGEJF, BERZOFSKYJA, DELIsI C: Hydrophobicity Scales and Computational Techniques for Detecting Amphipathic Structures in Proteins. J Mol Biol 1987, 195:659-685.

23. •• LESSER GJ, ROSE GD: Hydrophobicity of Amino Acid Subgroups in Proteins. Proteins 1990, 8:6-13. Hydropathy values are derived for each atom within all 20 amino acid residues. By comparing constituent groups such as -COOH within all residue types, it is noted that, on average, each group buries a constant fraction of its accessible surface area.

24. BROWNE WJ, NORTH ACT, PHILLIPS DC, BREW K, VANAMAN TC, HILL RL: A Possible Three Dimensional Structure of Bovine α Lactalbumin Based on that of Hen's Egg White Lysozyme. J Mol Biol 1969, 42:65-86.

25. GREER J: Comparative Modelling of the Mammalian Serine Proteases. J Mol Biol 1981, 153:1027-1042.

26. CHOTHIA C, LESK AM: The Relation Between the Divergence of Sequence and Structure in Proteins. EMBO J 1986, 5:823-826.

27. TAYLOR WR, ORENGO CA: Protein Structure Alignment. J Mol Biol 1989, 208:1-22.

28. •• ŠALI A, BLUNDELL TL: Definition of Topological Equivalence in Protein Structures - a Procedure Involving Comparison of Properties and Relationships through Simulated Annealing and Dynamic Programming. J Mol Biol 1990, 212:403-428. A flexible method of comparing protein structures which does not rely on RMS superposition techniques.

29. MITCHELL EM, ARTYMIUK PJ, RICE DW, WILLETT P: Use of Techniques Derived from Graph Theory to Compare Secondary Structure Motifs in Proteins. J Mol Biol 1990, 212:151-166.

30. •• TAYLOR WR, JONES DT: Templates, Consensus Patterns and Motifs. Curr Opin Struct Biol 1991, 1:327-333. Describes methods for consensus pattern searching and template matching in sequence databases. Such methods are the most sensitive for recognising distant sequence similarities.

31. TAYLOR WR: Identification of Protein Sequence Homology by Consensus Sequence Alignment. J Mol Biol 1986, 188:233-258.

32. • BARTON GJ, STERNBERG MJE: Flexible Protein Sequence Patterns - a Sensitive Method to Detect Weak Structural Similarities. J Mol Biol 1990, 212:389-402. Methods are suggested for the combination of separate information sources in order to enhance the sensitivity of template recognition techniques.

33. • OVERINGTON JP, JOHNSON MS, ŠALI A, BLUNDELL TL: Tertiary Structural Constraints on Protein Evolutionary Diversity: Templates, Key Residues and Structure Prediction. Proc R Soc Lond 1990, 241:132-145. The problem with classic substitution matrices is that they fail to take account of the structural environment of the protein. By comparing sequences from homologous families, the authors were able to construct context-specific substitution tables.

34. •• VINGRON M, ARGOS P: Determination of Reliable Regions in Protein Sequence Alignments. Protein Eng 1990, 3:565-569. Describes a method for determining reliable regions of sequence alignments when only two sequences are available. This method can be applied to great effect when aligning two distantly related sequences because certain areas of the alignment can be anchored, thus enabling more detailed analyses of the remaining regions.

35. •• VINGRON M, ARGOS P: Motif Recognition and Alignment for Many Sequences by Comparison of Dot-Matrices. J Mol Biol 1991, 218:33-43. An algorithm is presented which enables the delineation of reliably aligned regions in dot plots when more than two sequences are available. The procedure, which is based on matrix multiplication, reduces the need for judicious gap penalty applications when aligning distantly related proteins.

36. •• HAVEL TF, SNOW ME: A New Method for Building Protein Conformations from Sequence Alignments with Homologues of Known Structure. J Mol Biol 1991, 217:1-7. An automated procedure is described for building structures by applying distance geometry to a known structure and a related family of aligned sequences. By using distance and chiral constraints as input for the distance geometry calculation, it is possible to generate an ensemble of modelled conformations.

37. •• ŠALI A, OVERINGTON JP, JOHNSON MS, BLUNDELL TL: From Comparisons of Protein Sequences and Structures to Protein Modelling and Design. Trends Biochem Sci 1990, 15:235-240. This paper suggests many novel approaches to modelling that would enable future procedures to move away from treating structures as rigid bodies. Alternatives are suggested to the current methods of defining structural equivalences, aligning new sequences and building models of unknown structures.

38. CHOTHIA C, LESK AM: Canonical Structures for the Hypervariable Regions of Immunoglobulins. J Mol Biol 1987, 196:901-917.

39. • TRAMONTANO A, CHOTHIA C, LESK AM: Framework Residue 71 is a Major Determinant of the Position and Conformation of the Second Hypervariable Region in the VH Domains of the Immunoglobulins. J Mol Biol 1990, 215:175-182. In a detailed study of the second hypervariable region of heavy chain immunoglobulins, the identity of residue 71, which is part of the β-barrel framework, is shown to be crucial in determining the loop conformation. This is a good example of the key residue hypothesis.

40. SIBANDA BL, BLUNDELL TL, THORNTON JM: Conformation of Hairpins in Protein Structures. J Mol Biol 1989, 206:759-777.

41. CHOTHIA C, LESK AM, TRAMONTANO A, LEVITT M, SMITH-GILL SJ, AIR G, SHERIFF S, PADLAN EA, DAVIES D, TULIP WR, ET AL.: Conformations of Immunoglobulin Hypervariable Regions. Nature 1989, 342:877-883.

42. MARTIN ACR, CHEETHAM JC, REES AR: Modelling Antibody Hypervariable Loops - a Combined Approach. Proc Natl Acad Sci USA 1989, 86:9268-9272.

43. SUMMERS NL, CARLSON WD, KARPLUS M: Analysis of Side Chain Orientations in Homologous Proteins. J Mol Biol 1987, 196:175-198.

44. SUMMERS NL, KARPLUS M: Construction of Side Chains in Homology Modelling. Application to the C Terminal Lobe of Rhizopuspepsin. J Mol Biol 1989, 210:785-811.

45. • SCHIFFER CA, CALDWELL JW, KOLLMAN PA, STROUD RM: Prediction of Homologous Structures Based on Conformational Searches and Energetics. Proteins 1990, 8:30-43. A method is developed to model side chains using template information and energetics. The authors stress the importance of including a solvation term to improve the prediction of accessible side chains.

46. •• LEE C, SUBBIAH S: Prediction of Protein Side Chain Conformation by Packing Optimisation. J Mol Biol 1991, 217:373-388. A rapid automated procedure for predicting side chain conformations is described which uses simulated annealing to optimise the packing. Only van der Waals interactions are considered during the process. In a test set of nine proteins, the procedure performed well, with an overall RMS deviation of 1.77 Å from the native conformations. Significantly, the results for core residues were more accurate. This probably reflects the additional constraints of the limited volume available for packing.

47. • SINGH J, THORNTON JM: SIRIUS - an Automated Method for the Analysis of Preferred Packing Arrangements Between Protein Groups. J Mol Biol 1990, 211:595-615. Information on the preferred packing arrangements for the 20 × 20 side chain pairs is extracted from 62 high resolution structures. Three-dimensional distributions can be visualized and analysed using polar coordinates and compared with 'random' distributions. These data should be useful for testing energy parameters and for empirical predictions of packing interactions.

48. • IPPOLITO JA, ALEXANDER RS, CHRISTIANSON DW: Hydrogen Bond Stereochemistry in Protein Structure and Function. J Mol Biol 1990, 215:457-471. An analysis of hydrogen bond stereochemistry is made by extracting data from 50 high resolution protein structures. For example, the way in which hydrogen bond donors are clustered about carboxylates in aspartic and glutamic acid residues is considered.

49. BERNSTEIN FC, KOETZLE TF, WILLIAMS GJB, MEYER EF, BRICE MD, RODGERS JR, KENNARD O, SHIMANOUCHI T, TASUMI M: The Protein Databank: A Computer-Based Archival File for Macromolecular Structures. J Mol Biol 1977, 112:535-542.

50. NOVOTNY J, BRUCCOLERI RE, KARPLUS M: An Analysis of Incorrectly Folded Models - Implications for Structure Predictions. J Mol Biol 1984, 177:787-818.

51. •• HENDLICH M, LACKNER P, WEITCKUS S, FROSCHAUER R, GOTTSBACHER K, CASARI G, SIPPL M: Identification of Native Folds Amongst a Large Number of Incorrect Models. J Mol Biol 1990, 216:167-180. Describes a technique for identifying native protein folds from incorrect structures by detecting the conformation of lower energy. The method uses an empirically derived energy function to assess this energy.

52. • GREER J: Comparative Modelling Methods - Application to the Family of Mammalian Serine Proteases. Proteins 1990, 7:317-334. This extension of Greer's 1981 paper [25] summarizes his view of model building procedures and makes use of subsequently solved structures to assess the accuracy of earlier models.

53. WEBER IT: Evaluation of Homology Modelling of HIV Protease. Proteins 1990, 7:172-184. The only reliable way to assess the accuracy of a model is to wait until its structure is solved by crystallography. In this paper, both Weber's model and an earlier attempt by Pearl and Taylor are subjected to such analysis. The study shows that the accuracy of a model is dependent on the percentage residue identity between the sequences.

54. LAUGHTON CA, NEIDLE S, ZVELEBIL MJJM, STERNBERG MJE: A Molecular Model for the Enzyme Cytochrome P450 17α, a Major Target for the Chemotherapy of Prostatic Cancer. Biochem Biophys Res Commun 1990, 171:1160-1167.

55. CROSS RA, STEWART M, KENDRICK-JONES J: Structural Predictions for the Central Domain of Dystrophin. FEBS Lett 1990, 262:87-92.

56. • HYDE SC, EMSLEY P, HARTSHORN MJ, MIMMACK MM, GILEADI U, PEARCE SR, GALLAGHER MP, GILL DR, HUBBARD RE, HIGGINS CF: Structural Model of ATP-Binding Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial Transport. Nature 1990, 346:362-365. This paper pushes the limits of current model building technology in order to shed some light on the important problem of cystic fibrosis. Although no sequentially related proteins could be identified, a consensus secondary structure prediction suggested that regions of the cystic fibrosis gene product could have a similar fold to the ATP-binding proteins.

57. FLOEGAL R, ZIELENKIEWICZ P, SAENGER W: Tertiary Structure of RNase Pch1 Predicted from the Model Structure of RNase Ms and the Crystal Structure of RNase T1. Eur Biophys J 1990, 18:225-233.

58. SVENSSON B, VASS I, CEDERGREN E, STRYING S: Structure of Donor Side Components in Photosystem II Predicted by Computer Modelling. EMBO J 1990, 9:2051-2060.

59. ENGH RA, WRIGHT HT, HUBER R: Modelling of the Intact Form of α Lytic Proteinase Inhibitor. Protein Eng 1990, 3:469-478. Extensive structural reorganisation must occur during the cleavage of intact α-lytic proteinase. This paper suggests how this might occur and proposes a structure for the intact form.

60. • BOTT R, FRANE J: Incorporation of Temperature Factors in the Statistical Analysis of Protein Tertiary Structures. Protein Eng 1990, 3:649-657. A method is developed to identify significant structural differences between mutant and wild-type proteins. A linear relationship is found to exist between the logarithm of the distance between equivalent atoms in two structures and their mean temperature factor. This relationship is then used to detect significant differences between atoms having low temperature factors even though their distances fall below the overall RMS deviation for all equivalent atoms. Conversely, false significance resulting from high mobility can also be detected.

61. • BORDO D, ARGOS P: Suggestions for 'Safe' Residue Substitutions in Site Directed Mutagenesis. J Mol Biol 1991, 217:721-729. Matrices of preferred amino acid exchanges have been constructed through the comparison of nine protein families. The derived matrices are intended to guide site-directed mutagenesis experiments when the structure of a topologically related protein is known. Three matrices are shown: one each for accessible and buried residues, with the third being a summation of the other two.

62. •• CORREA P: The Building of Protein Structures from α Carbon Coordinates. Proteins 1990, 7:366-377. A new fully automated procedure is developed to generate a complete protein structure, given only the α-carbon coordinates. After first generating main chain coordinates using glycine, proline and alanine residues and refining with molecular dynamics, side chains are appended and refined in a similar fashion. For the best modelled structure (α-lytic proteinase), the overall RMS deviation was 1.24 Å. This procedure is computer intensive, but useful for modelling and interpreting electron density.

63. RICHARDSON J: The Anatomy and Taxonomy of Protein Structure. Adv Protein Chem 1981, 34:167-339.

64. •• HUTCHINSON EG, THORNTON JM: HERA - a Program to Draw Schematic Diagrams of Protein Secondary Structures. Proteins 1990, 8:203-212. This package transforms proteins into two-dimensional diagrams, while retaining relevant structural information. As well as being able to automatically draw hydrogen bonding diagrams and helical wheels, HERA can extract simple structural motifs such as β-hairpins. Programs such as this will become increasingly important as more structures are solved.
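Several of the annotations above report model accuracy as an overall RMS deviation between modelled and native coordinates. As a minimal sketch of that measure, the calculation might look as follows. It assumes that the two structures have already been superposed and that equivalent atoms are listed in the same order. The function name `rms_deviation` and the toy coordinates are illustrative assumptions and are not taken from any of the cited papers.

```python
import math


def rms_deviation(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) tuples.

    Assumption: the structures are already superposed and atom i in
    coords_a is equivalent to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("at least one atom pair is required")

    # Sum the squared distances between equivalent atoms.
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2

    # Mean over atom pairs, then square root.
    return math.sqrt(total / len(coords_a))


# Toy example (hypothetical coordinates, in angstroms).
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rms_deviation(model, native))
```

In practice the value depends on which atoms are included (for example α-carbons only, or all side chain atoms), so figures from different studies are comparable only when the atom sets match.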

MB Swindells and JM Thornton, Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
