An efficient automated computer vision based technique for detection of three dimensional structural motifs in proteins.

This article was downloaded by: [Rutgers University] On: 10 April 2015, At: 14:44 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biomolecular Structure and Dynamics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tbsd20

An Efficient Automated Computer Vision Based Technique for Detection of Three Dimensional Structural Motifs in Proteins

Daniel Fischer (a, b), Orly Bachar (a), Ruth Nussinov (b, c) & Haim Wolfson (a, d)

(a) Computer Science Department, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel

(b) Sackler Inst. of Molecular Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel

(c) Laboratory of Mathematical Biology, PRI/DynaCorp, NCI-FCRF, Bldg 469, rm 151, Frederick, MD 21701

(d) Robotics Research Laboratory, Courant Inst. of Mathematical Sciences, New York University, 715 Broadway, 12th fl., New York, NY 10003

Published online: 21 May 2012.

To cite this article: Daniel Fischer , Orly Bachar , Ruth Nussinov & Haim Wolfson (1992) An Efficient Automated Computer Vision Based Technique for Detection of Three Dimensional Structural Motifs in Proteins, Journal of Biomolecular Structure and Dynamics, 9:4, 769-789, DOI: 10.1080/07391102.1992.10507955 To link to this article: http://dx.doi.org/10.1080/07391102.1992.10507955


Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102, Volume 9, Issue Number 4 (1992), ©Adenine Press (1992).

Abstract

As the number of available three dimensional coordinates of proteins increases, it is now recognized that proteins from different families and topologies are constructed from independent motifs. Detection of specific structural motifs within proteins aids in understanding their role and the mechanism of their operation. To aid in identification and use of these motifs it has become necessary to develop efficient methods for systematic scanning of structural databases. To date, methods of structural protein comparison suffer from at least one of the following limitations: they (1) are not fully automated (require human intervention), (2) are limited to relatively similar structures, (3) are constrained to linear alignments of the structures, (4) are sensitive to insertions, deletions or gaps in the sequences or (5) are very time consuming. We present a method to overcome the above limitations. The method discovers and ranks every piece of structural similarity between the structures compared, thus allowing the simultaneous detection of real 3-D motifs in different domains, between domains, in active sites, on surfaces, etc. The method uses the Geometric Hashing Paradigm, which is an efficient technique originally developed for Computer Vision. The algorithm exploits the geometrical constraints of rigid objects; it is especially geared towards recognition of partial structures in rigid objects belonging to large databases and is straightforwardly parallelizable. Computer Vision techniques are for the first time applied to molecular structure comparison, resulting in an efficient, fully automated tool. The method has been tested in a number of cases, including comparisons of the haemoglobins, immunoglobulins, serine proteinases, calcium binding proteins, DNA binding proteins and others. In all examples our results were equivalent to the published results from previous methods, and in some cases additional structural information was obtained by our method.

Correspondence should be addressed to Ruth Nussinov at NCI-FCRF, Bldg 469, rm 151, Frederick, MD 21712.

Introduction

Studies of protein structure have indicated the presence of recurring motifs, conserved through evolution. Many of the specific folding motifs have been identified and classified. Moreover, it has become increasingly clear that proteins can be clustered into different structural families, built from a limited set of motifs (1). These structural motifs recur many times in different proteins. An interesting example is the family of DNA-binding proteins. Recent analyses of these proteins have indicated the presence of autonomous domains (2). Structural and functional studies of these domains have demonstrated the existence of several structural motifs. A classical example of such a motif is the helix-turn-helix motif (3). This motif was first found in prokaryotic activator and repressor proteins. Other examples of structural motifs in DNA-binding proteins include the zinc finger (4), leucine zipper (5) and homeodomain (6) motifs.

The fact that families of proteins retain a common underlying structure, even though their amino acid sequences differ, suggests that during evolution tertiary structure changes much more slowly than amino acid sequence. Finding structural motifs in proteins is crucial for understanding their role and the mechanism by which they work. These can be inferred by analogy with other proteins containing the motif. The problem we are faced with is to devise efficient techniques for routine scanning of structural databases, searching for recurrences of inexact structural motifs.

The development of X-ray crystallographic techniques in the last few years has led to an increasing availability of three-dimensional coordinate data for proteins, via the Protein Data Bank (7). To date, it contains 3-D structural information for about 700 macromolecules, of which only about 100 protein sequences are non-homologous.
Unfortunately, there are still many well known proteins for which no crystallographic data has been obtained, although the size of the database is growing fast.

Protein comparison techniques are usually based on the following steps:

1. Find (relatively small) subsets of the proteins that form an initial match.
2. Find a superposition of the proteins that achieves the closest match of corresponding atoms of the above subsets.
3. Transform one of the proteins according to the superposition found and extend the initial match by choosing additional pairs of atoms that lie close enough under the superposition.


Usually, step (1) above is the most time consuming. For relatively homologous proteins, visual inspection using computer graphic devices can be helpful in finding the initial match. For less homologous proteins, quantitative methods are required. These methods usually perform a search of the large space of possible subsets (for a review see reference 8). Remington and Matthews (9-10) compare all possible pairs of linear structural fragments of a chosen length between two structures. A drawback of this method is its sensitivity to insertions and deletions between the sequences being compared. Rossmann and Argos (11-13) compare the structures in all possible orientations in order to find similar substructures. Chothia and Lesk (14) propose a method where corresponding segments of secondary structures in the proteins being compared are individually identified and superimposed. All these methods are computationally expensive, exploit the linear order of the amino acid chain or require that the structures being compared be relatively similar.

A number of authors have used "diagonal" or "distance" plots for the rapid visual recognition of structural domains. Equivalent secondary structures may be shown on such plots. Automated attempts at plot comparison have been carried out (e.g. see reference 15). Taylor and Orengo (16) automatically determine the 3-D structural alignment of two proteins without the need of an initial alignment. Their method is not sensitive to insertions and deletions and is based on distance plot analysis. It applies the dynamic programming algorithm of Needleman and Wunsch (17) at two levels. At the lower level, the algorithm is applied to find the maximum scores for each possible matched pair of atoms based on residue-residue inter-distances. At the higher level, each element of the weight matrix holds the cumulative sum of the maximum scores from the lower level. The method used by Sali and Blundell (18) is also based on the dynamic programming technique.
It uses a weight matrix whose elements are a weighted sum of different features and properties such as residue position in space, residue identity, side/main-chain orientations and accessibilities, residue local fold, residue type, dihedral angles, hydrogen bonding and others (for an explanation of these features and properties see reference 18). Thus, the method allows a systematic inclusion of a number of different types of information into the comparison of proteins. Nevertheless, it is difficult to determine what features or properties to choose for each comparison, or how to assign their respective weights. The computation of some of the features requires an initial superposition of the structures that are to be compared, while other features require prior visual inspection using graphic devices. The method is not fully automated and is dependent on the setting of the values of the parameters. However, a major advantage of this approach is that it takes chemical properties into account, providing a useful tool in the modeling of proteins by homology and analogy.

The method we propose follows the above defined steps as well. However, the novelty of our work lies in the approach taken in solving each of the steps. Computer Science techniques in the field of Computer Vision are applied for the first time to protein comparison. Computer Vision techniques have for many years explored the problem of finding a superposition of two images composed of points. In our case, the images are the proteins and the points are the atoms. In particular, we use the Geometric Hashing Paradigm for model-based object recognition in Computer Vision introduced by Lamdan, Schwartz and Wolfson (19) and adapted for macromolecules by Nussinov and Wolfson (20).

Another major difference between previous work and ours is that whereas previous methods compare 3-D structures belonging to contiguous amino acids, ours is completely independent of the order of the amino acids in the chain. Our method compares proteins in a "real" 3-D approach. A linear match or alignment follows the "progression rule": for elements i and k from one sequence and elements j and l from the other sequence, if element i is matched to element j and element k is matched to element l, and if k is to the right of i in the first sequence, then l must also be to the right of j in the second sequence. For example, the structural sequence "a b c d e f g h i" cannot be aligned with the (structural) sequence "F E D A B C G H I" if the progression rule is to be maintained (each letter in the above sequences can represent either a single amino acid or a segment containing a short sequence of amino acids). Nevertheless, we would like to automatically identify each existing similar pattern (i.e. "a b c" should match "A B C"; "d e f" should match "F E D" inverted; and "g h i" should match "G H I"). Using existing techniques, the only way to find these similarities is to first identify each substructure and then compare them. In addition, 3-D comparison conserving amino acid "linearity" cannot focus on 3-D motifs that are strictly spatial. Our method can find such 3-D motifs contributed by different segments or by isolated single amino acids. It may thus find motifs in protein active sites, on surfaces, in cores, etc.
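The progression rule can be checked mechanically. The following is a minimal sketch in Python, not part of the original program; the function name `follows_progression_rule` and the 0-based position encoding are assumptions made for illustration. It applies the rule to the "a b c d e f g h i" versus "F E D A B C G H I" example above.

```python
def follows_progression_rule(pairs):
    """Return True if the matched pairs can be read as a linear alignment.

    `pairs` is a list of (i, j) tuples: position i in the first sequence
    is matched to position j in the second. Positions i are assumed distinct.
    """
    ordered = sorted(pairs)
    return all(j1 < j2 for (_, j1), (_, j2) in zip(ordered, ordered[1:]))


# Illustrative encoding of the example from the text (0-based positions).
first = "abcdefghi"
second = "FEDABCGHI"
pairs = [(i, second.index(ch.upper())) for i, ch in enumerate(first)]

print(follows_progression_rule(pairs))      # False: "d e f" maps to "F E D" inverted
print(follows_progression_rule(pairs[:3]))  # True: "a b c" -> "A B C" alone is linear
```

A sequence-order-independent method accepts the full set of pairs, whereas a linear alignment method could at best return one of the sub-matches that passes this check.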
An intrinsic limitation of any method based on linear sequence-structure alignment is that an alignment is not always complete, meaningful or accurate for dissimilar proteins and cannot focus on 3-D motifs that are strictly spatial. In addition, more than one possible match may exist.

Our method for protein comparison searches for non-predefined, sequence independent matches, and requires neither prior knowledge of the motifs nor an initial alignment of the proteins. The method is highly efficient, fully automated, and is not constrained to local or linear motifs (i.e. the residues need not be contiguous on the chain). Our procedure is efficient both when the motifs are not predefined and when searching for occurrences of a known motif in new proteins. Also, our method simultaneously analyzes several matches, ranks them and reports every piece of structural similarity between the structures compared.

In order to show the accuracy and efficiency of the method, we examine typical structure comparison problems and compare our results with published ones of previous methods. Some of the examples below compare homologous structures for which previous methods were used. In these examples, the matches produced by our method are equivalent to those of conventional alignment techniques. Thus, these examples appear to follow the "progression rule" assumed by previous methods, mainly because in these cases the best match coincides with the sequential alignment. It should be noted, however, that no information about the order of the residues in the primary chain has been exploited by the algorithm. Moreover, some additional structural, 3-D matches (i.e. not conserving the progression rule) are also obtained by our method. In other examples, a motif is searched for within a protein. The motif can be a sequential motif or a real 3-D, non sequential motif. In these examples our method correctly detects the motif within the protein.

Methods

Next, a very brief summary of the three major steps of our method is presented. A detailed description of the implementation can be found in reference 21.

Step 1: Finding an Initial Match

This step is based on the Geometric Hashing paradigm for model-based object recognition. The goal of model-based recognition systems is to recognize objects within an image that belong to a library of known models. The algorithm presented here is tailored for a library containing only one model: one of the structures to be compared. The extension to a library of many models is straightforward.

Mathematically, the structure comparison problem can be stated as follows. Given the 3-D coordinates of the atoms of two different molecules, find a rigid transformation (rotation and translation) in space, so that a "sufficient" number of atoms of one molecule match the atoms of the other molecule. One of the structures will be referred to as the "model" and the other as the "scene". When searching for a rigid transformation, the main problem is to find adequate corresponding reference frames both in the model and in the scene.

Our system is composed of two interrelated phases: preprocessing and recognition. In the preprocessing stage the model is transformed by the machine to an internal representation which aids the recognition phase. In the recognition phase, the scene data is acquired and compared to the prerecorded model representation in order to discover similar substructures.

The goal is to represent the coordinates of the atoms by few intrinsic parameters in a rotation and translation invariant manner. The idea is to represent, prior to recognition, every model point in all possible reference frames. At recognition time, adequate reference frames are searched for in the model and in the scene. A reference frame can be uniquely defined by any triplet of non-collinear atoms. All the other atom coordinates can uniquely be represented in this coordinate frame by a triplet of scalars which are invariant under the rigid motion transformation.
Alternatively, the atoms can also be represented using different rotation and translation invariant parameters by choosing a reference frame of only two atoms. All other atom coordinates can be represented by the three distances of the sides of the triangle they form with the two atoms in the reference frame. Nevertheless, since weaker geometric constraints are applied, the latter representation is not unique (symmetrical atoms with respect to the reference frame may have the same distances). The ambiguity thus produced is resolved by a verification phase of the algorithm (see below).
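The two-atom reference frame can be illustrated with a short Python sketch. It is not taken from the original C implementation; the helper names `triangle_key` and `rotate_z_and_shift` and all coordinates are hypothetical. It shows that the three side lengths are unchanged by a rigid motion, and that an atom mirrored through a plane containing the frame yields the same three distances, which is the ambiguity left to the verification phase.

```python
import math


def triangle_key(a, b, c):
    """Side lengths of the triangle formed by frame atoms a, b and a third atom c."""
    return (math.dist(a, b), math.dist(a, c), math.dist(b, c))


def rotate_z_and_shift(p, angle, shift):
    """Rigid motion used for the demo: rotation about the z axis, then a translation."""
    x, y, z = p
    ca, sa = math.cos(angle), math.sin(angle)
    return (ca * x - sa * y + shift[0], sa * x + ca * y + shift[1], z + shift[2])


# Hypothetical coordinates (in angstroms) for a frame (a, b) and a third atom c.
a, b, c = (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (1.9, 3.3, 1.0)
moved = [rotate_z_and_shift(p, 0.7, (5.0, -2.0, 1.5)) for p in (a, b, c)]

key = triangle_key(a, b, c)
key_moved = triangle_key(*moved)
# Invariance: the three distances survive the rigid motion (up to rounding).
print(all(math.isclose(u, v) for u, v in zip(key, key_moved)))

# Ambiguity: a and b lie in the plane z = 0, so mirroring c through that plane
# gives a different atom position with the same three distances.
c_mirror = (c[0], c[1], -c[2])
key_mirror = triangle_key(a, b, c_mirror)
print(all(math.isclose(u, v) for u, v in zip(key, key_mirror)))
```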


Substep 1: Preprocessing


For each reference frame, all model atoms are represented by the lengths of the sides of the triangle that they form with the two atoms of the reference frame (see above). This (redundant) information is stored in a Hash Table for quick reference during recognition. (Every model atom is represented many times: one representation for each possible reference frame.) The hash table address (index) is defined by these three distances. In each entry, the three atoms forming the triangle (the reference frame and the third atom) are stored.

In practice, not all possible triangles are computed; a limit on the length of the triangles' sides can be imposed, saving time and memory. In this case, the triangles are composed of three atoms whose atom-to-atom distances are below the above limit. Thus, only nt² triangles are considered, where t is the average number of "spatial neighbors" for each atom, and n is the number of atoms in the model. In the results below the maximum distance allowed is about 20 Å.

Note that the preprocessing stage is done without any knowledge of the scene to be recognized. The major advantage of the redundant representation contained in the hash table is its ability to allow efficient matching of objects having only partial (previously unknown) equivalent substructures.
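The preprocessing stage can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' C code: the names `quantize` and `build_hash_table`, the bin width of 0.6 used to turn distances into a discrete address, and the toy coordinates are all assumptions. For brevity it enumerates every ordered triplet and filters by side length, rather than using a spatial-neighbor search that would give the nt² behaviour described above.

```python
import math
from collections import defaultdict
from itertools import permutations


def quantize(lengths, width):
    """Map three side lengths to a discrete hash-table address (assumed binning)."""
    return tuple(int(d // width) for d in lengths)


def build_hash_table(model, max_side=20.0, width=0.6):
    """Store every model triangle under the address given by its side lengths.

    `model` maps an atom label to (x, y, z). Each entry records the ordered
    reference frame (a, b) and the third atom c.
    """
    table = defaultdict(list)
    for a, b, c in permutations(model, 3):
        sides = (math.dist(model[a], model[b]),
                 math.dist(model[a], model[c]),
                 math.dist(model[b], model[c]))
        if max(sides) <= max_side:  # skip triangles with a side above the limit
            table[quantize(sides, width)].append(((a, b), c))
    return table


# Hypothetical four-atom model; every atom appears once per possible frame.
model = {"m1": (0.0, 0.0, 0.0), "m2": (3.8, 0.0, 0.0),
         "m3": (5.0, 3.6, 0.0), "m4": (2.0, 5.5, 1.0)}
table = build_hash_table(model)
print(sum(len(entries) for entries in table.values()))  # 4 * 3 * 2 = 24 ordered triangles
```

Storing ordered frames doubles the table size but lets the recognition phase look up a scene triangle without trying both orientations of the frame.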

Substep 2: Recognition

During this phase, all possible scene frames (pairs of atoms) are tested. Each scene atom is represented by the distances of the sides of the triangle formed with each possible reference frame. (As for the model, the allowed lengths of the sides of the triangles are below a defined limit.) For every such atom and reference frame, the hash table (from the preprocessing phase) is accessed to search for possible matches. The procedure is:

(a) For each pair of scene atoms forming a reference frame, do the following:

   1. Hold a vote counter for each different model reference frame that appears in the Hash Table. Each vote counter will hold a list of pairs of the form (model-atom, scene-atom), corresponding to the (matching) third atoms of the model triangles and the scene triangles, respectively. Each such list is built in step (a)3 described below and is called the vote list for the corresponding model reference frame.
   2. Compute the three distances of the triangles formed between the scene frame and every other atom.
   3. For each such set of distances, (1) access the hash table at the address defined by them; (2) extract every model triangle appearing in this entry; (3) for each triangle found, add a vote to the vote counter corresponding to the triangle's model frame and (4) add to the corresponding vote list the pair (model-atom, scene-atom), where the model-atom is the third atom of the triangle from the hash table, and the scene-atom is the current third atom of the current scene triangle.

(b) After all the scene atoms have been processed, check the vote counters. For each model frame that scores a large number of votes, "remember" its vote list (together with the current scene frame and model frame), go back to step (a) and choose another frame in the scene.
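Steps (a) and (b) can be sketched in Python as below. This is a simplified illustration, not the original implementation: the function names, the constants `WIDTH` and `MAX_SIDE`, and the toy coordinates are assumptions. The sketch looks up only the exact hash bin (it does not probe neighboring bins for inexact matches), enumerates all frames by brute force, and uses a vote threshold of 3 because the toy model has only five atoms.

```python
import math
from collections import defaultdict
from itertools import permutations

WIDTH = 0.6      # assumed bin width, in angstroms
MAX_SIDE = 20.0  # longest triangle side considered


def address(p, q, r):
    """Quantized side lengths of triangle (p, q, r), or None if a side is too long."""
    sides = (math.dist(p, q), math.dist(p, r), math.dist(q, r))
    if max(sides) > MAX_SIDE:
        return None
    return tuple(int(d // WIDTH) for d in sides)


def preprocess(model):
    """Hash every model triangle as ((frame atoms), third atom)."""
    table = defaultdict(list)
    for a, b, c in permutations(model, 3):
        key = address(model[a], model[b], model[c])
        if key is not None:
            table[key].append(((a, b), c))
    return table


def recognize(table, scene, min_votes):
    """Return (scene_frame, model_frame, vote_list) for each well-supported frame pair."""
    remembered = []
    for sa, sb in permutations(scene, 2):             # step (a): each scene frame
        vote_lists = defaultdict(list)                # (a)1: one vote list per model frame
        for sc in scene:
            if sc in (sa, sb):
                continue
            key = address(scene[sa], scene[sb], scene[sc])  # (a)2
            if key is None:
                continue
            for model_frame, mc in table.get(key, ()):      # (a)3
                vote_lists[model_frame].append((mc, sc))
        for model_frame, votes in vote_lists.items():       # step (b)
            if len(votes) >= min_votes:
                remembered.append(((sa, sb), model_frame, votes))
    return remembered


# Hypothetical model, and a scene that is the model shifted by (10, -4, 2).
model = {"m1": (0.0, 0.0, 0.0), "m2": (4.0, 0.0, 0.0), "m3": (5.5, 3.5, 0.0),
         "m4": (2.0, 5.5, 1.0), "m5": (-1.5, 3.0, 2.5)}
scene = {"s" + k[1:]: (x + 10.0, y - 4.0, z + 2.0) for k, (x, y, z) in model.items()}

hits = recognize(preprocess(model), scene, min_votes=3)
for scene_frame, model_frame, votes in hits:
    if scene_frame == ("s1", "s2") and model_frame == ("m1", "m2"):
        print(votes)  # [('m3', 's3'), ('m4', 's4'), ('m5', 's5')]
```

Each remembered vote list is a candidate initial match; real data would need the tolerance on the triangle sides and the verification described in the following steps.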


At the end of this process, each "remembered" vote list corresponds to one possible initial match. The threshold of step (b) as well as all other thresholds and parameters used by the algorithm are set to default initial values. Alternatively, the thresholds can be set to user specified values. In addition, during execution, the program automatically tunes many of its parameters in accordance with the progress of the comparison. For relatively similar proteins, where many candidate matches receive high scores in the intermediate stages, the thresholds are automatically tightened in order to reduce the execution time. In relatively dissimilar proteins, the thresholds are relaxed in order not to lose small candidate matches. In particular, the default value for the threshold of step (b) above is 7 votes. For a detailed description of this feature, and other implementation related details, see reference 21.

During the recognition phase, mt² triangles are considered, where t is the average number of "spatial neighbors" for each atom, and m is the number of atoms in the scene.

Step (a)3 above accesses hash table entries whose addresses are defined by the lengths of the three sides of the current triangle. To allow for inexact matches, entries corresponding to slightly longer or shorter sides should also be accessed in this step. This can be viewed as allowing each side of the triangles to have a specified width. There is an obvious tradeoff between the width of the triangle's edges and program performance. In the results presented here, a side width of approximately 0.6 Å was used.

Step 2: Superpositioning the Initial Matches

This step is composed of three substeps:

Substep 1: Verification. Each vote list obtained in the above step implies a rigid transformation between model and scene. Before computing the transformations, each vote list is verified, because the vote lists may contain pairs of atoms that do not correspond to a rigid transformation, owing to the symmetry limitation in the reference frame definition. Verification is done by choosing three pairs from the list, two of which are the pairs corresponding to the model and scene frames. The third pair is chosen in such a way that the transformation defined by these three pairs produces the best least squares fit between all pairs in the vote list for which the distance between the two atoms is below a predefined threshold (i.e. pairs in the list that lie too far apart are not considered). After an appropriate third pair has been chosen, pairs for which the distance between their components is above a given threshold are eliminated from the list. This procedure has quadratic complexity in the size of the initial match.



Substep 2: Least Squares Fitting Algorithm. A rigid transformation can now be computed for each "cleaned" vote list. A variation of the least-squares fitting algorithm by Schwartz and Sharir (22) was used. The algorithm's input is a list of pairs of coordinates (the vote lists) and its output is a 3 × 3 rotation matrix, a 3 × 1 displacement vector in 3-D, and the value of the least-squares distance. Its complexity is linear in the number of pairs. The transformation obtained from the whole list should be more accurate than the one obtained in the verification substep, as it takes into account more than three pairs of atoms.

Substep 3: Clustering of Transformations. Many similar transformations are usually obtained. To produce concise results, a clustering algorithm is applied. The algorithm clusters into groups the transformation parameters (a rigid transformation can be represented by 3 translational and 3 rotational parameters) of all the transformations computed above. Transformations belong to the same cluster if the 6 dimensional distance between any two of them is below a predefined threshold. At the end, each cluster has one representative transformation. In addition, each cluster contains a vote list that is obtained by joining the initial vote lists of the individual transformations composing the cluster. This merged vote list contains matched pairs of atoms representing the initial match of the cluster. This new list is used to recompute the transformation representing the cluster. Clustering can be performed in time linear in the number of elements to be clustered. Currently, our system uses a quadratic but more compact algorithm.

Each group obtained above defines an "initial match", which consists of a vote list and its corresponding rigid transformation.
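The clustering substep can be illustrated with a simple quadratic-time Python sketch. It is an assumption-laden illustration rather than the authors' algorithm: the function name `cluster_transformations`, the greedy assignment order, the plain Euclidean distance over the six parameters (which mixes translational and rotational units), and the numbers in the demo are all hypothetical.

```python
import math


def cluster_transformations(candidates, threshold):
    """Greedy clustering of (params, vote_list) candidates.

    `params` is a 6-tuple (3 translational, 3 rotational parameters). A
    candidate joins the first cluster in which it lies within `threshold`
    of every member; its vote list is merged into the cluster's vote list.
    """
    clusters = []
    for params, votes in candidates:
        for cluster in clusters:
            if all(math.dist(params, other) < threshold
                   for other in cluster["members"]):
                cluster["members"].append(params)
                cluster["votes"].update(votes)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append({"members": [params], "votes": set(votes)})
    return clusters


# Hypothetical candidates: two near-identical transformations and one outlier.
candidates = [
    ((10.0, -5.0, 2.0, 0.10, 0.30, -2.50), [("m3", "s3"), ("m4", "s4")]),
    ((10.2, -5.1, 2.1, 0.12, 0.28, -2.52), [("m4", "s4"), ("m5", "s5")]),
    ((-20.0, 40.0, 15.0, 1.90, -0.80, -2.60), [("m1", "s7")]),
]
clusters = cluster_transformations(candidates, threshold=1.0)
print(len(clusters))                 # 2
print(sorted(clusters[0]["votes"]))  # merged vote list of the first cluster
```

The merged vote list of each cluster would then be passed back to the least-squares fit to recompute the representative transformation.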

Step 3: Extending the Initial Matches

It is not easy to decide which scene atoms match model atoms, because many plausible candidates can exist. This is both because of the inaccuracy of crystallographic data and because inexact matches are searched for. Thus, an efficient matching algorithm to extend the initial matches is needed.

A best match between two sets of coordinates (the model coordinates and the transformed scene coordinates) is equivalent to the "Standard Assignment Problem" (e.g. see reference 23) known in graph theory, where the weights are defined in this case by the distances from each model atom to every (transformed) scene atom. An implementation of the well known Vogel's Approximation Method (24) was used. In the current implementation, the time required by the matching process is quadratic in the size of each match.
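The extension step can be illustrated with a much simpler heuristic than the one used in the paper. The Python sketch below is a greedy nearest-pair assignment, not Vogel's Approximation Method; the function name `extend_match`, the distance threshold and the coordinates are assumptions chosen for illustration. It pairs each model atom with at most one transformed scene atom and leaves atoms without a close partner unmatched.

```python
import math


def extend_match(model, scene_transformed, max_dist):
    """Greedy one-to-one pairing of model atoms with transformed scene atoms.

    Candidate pairs closer than `max_dist` are accepted in order of
    increasing distance; each atom is used at most once.
    """
    candidates = sorted(
        (math.dist(p, q), m, s)
        for m, p in model.items()
        for s, q in scene_transformed.items()
        if math.dist(p, q) <= max_dist
    )
    used_model, used_scene, pairs = set(), set(), []
    for d, m, s in candidates:
        if m not in used_model and s not in used_scene:
            pairs.append((m, s, d))
            used_model.add(m)
            used_scene.add(s)
    return pairs


# Hypothetical coordinates: s1 and s2 lie near m1 and m2; s3 has no partner.
model = {"m1": (0.0, 0.0, 0.0), "m2": (3.8, 0.0, 0.0), "m3": (7.6, 0.0, 0.0)}
scene_transformed = {"s1": (0.3, 0.0, 0.0), "s2": (3.5, 0.4, 0.0),
                     "s3": (20.0, 0.0, 0.0)}

pairs = extend_match(model, scene_transformed, max_dist=1.5)
print([(m, s) for m, s, _ in pairs])  # [('m1', 's1'), ('m2', 's2')]
```

A greedy pass can miss the globally best assignment when several atoms compete for the same partner, which is why an assignment-problem formulation is the more careful choice.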

Implementation

Source code is written in the C programming language, except for a module to obtain the rigid transformation from the initial matches, which is written in Fortran 77. The examples below were run on a Sun 4/390 workstation (SPARC CPU).†

† Requests for structural comparisons under our program can be sent to danif@taurus.tau.ac.il.

Our system accepts any subset of atoms as points: Cα atoms, Cβ atoms, residue geometric centers, residue gravity centers, or any combination desired. In the examples shown below, only Cα atoms were taken as interest points. Other choices of interest points will be studied in detail later.

Running Times

A typical comparison of two 100 residue structures takes about 25 seconds CPU time. About 25000 triangles are built for each protein. A 15 residue motif search within a 100 residue protein takes about 7 seconds CPU time and the hash table contains about 1500 entries.

Results

We present some results of our comparisons applied to a range of well known structures. We examine how the results from our method compare with previously published ones. All published results perform a linear alignment of the structures compared. In all cases, the results of our method agree with previous comparisons. The fact that our 3-D matching algorithm produces results equivalent to previous alignments demonstrates that in these examples the best real 3-D structural match corresponds to a sequential alignment. As the structures compared are relatively homologous (belonging to the same family), this was expected. Nevertheless, in some examples where the structures are less similar, our results also show some "real" 3-dimensional, sequence independent matches. These non-sequential matches, belonging to amino acids not contiguous on the primary chain, appear where the alignments reported a large r.m.s. deviation between the structures.

We also present some examples where, instead of comparing two full proteins, a predefined motif is searched for within a protein. This motif can be a sequential segment of another protein, or contain residues from different positions of the chain. Other examples not shown here (21) include comparisons from the haemoglobins, serine proteinases, aspartic proteinases and others. Our results in all these comparisons were also equivalent to published results.

The output of our program may show more than one possible match between the model and the scene. Each match represents one substructural similarity between the structures being compared. Within each match, not all Cα atoms are paired; a scene atom is matched to a model atom only if both atoms lie close enough (up to a predefined threshold) under the rigid transformation between the model and the scene. Atoms that have no structural (i.e. "spatial") partners are left unmatched.
If more than one match is produced by the program, we show only those results whose number of matched pairs is at least 80 percent of the number of matched pairs in the largest match. In all tables given below Ca. atoms are shown with their sequence position and the residue they belong to. As described in methods, our procedure first finds an initial set of corresponding atoms in model and scene. Next this set is expanded to include other pairs. Model matched atoms obtained from the initial set are shown in uppercase; model matched atoms obtained from the extension of the initial match are shown in

778

Fischer eta/.

Table I Comparison of three DNA-binding proteins: tryptophan represor (2wrp ), lambda CRO (lcroB) and lambda 434 (2cro). The HTH motif in each protein is correctly matched.

rms: 0.90   MODEL 2cro   SCENE 2wrp
MODEL: 44-f 43-r ..... 37-G 36-a 35-E 34-i 33-L 32-q 31-i 30-s 29-Q 28-q 27-K 26-v 25-g 24-A 23-k 22-t 21-a 20-L 19-e 18-t 17-Q 16-t ..... 13-l 11-i 9-r
SCENE: 88-S 87-N 86-S 85-G 84-R 83-T 82-I 81-T 80-A 79-I 78-G 77-A 76-G 75-L 74-E 73-N 72-K 71-L 70-E 69-R 68-Q 67-S 66-M 65-E 64-G 63-R ..... 53-T ..... 50-A

rms: 1.29   MODEL 2wrp   SCENE 1croB
MODEL: 89-l 88-S 87-N 86-s 85-G 84-R 83-t 82-I 81-t 80-a 79-I 78-g 77-a 76-G 75-l 74-e 73-n 72-k 71-l 70-e 69-r 68-q 67-s 66-m 65-e ..... 63-r 61-L ..... 59-E ..... 55-v
SCENE: 40-I ..... 37-G 36-A 35-H 34-I 33-A 32-K 31-N 30-I 29-A 28-S 27-Q 26-Y 25-V 24-G 23-L 22-D 21-K 20-A 19-T 18-K 17-T 16-Q 15-G 14-F 13-R ..... 9-D 8-K 7-L 6-T

rms: 0.97   MODEL 1croB   SCENE 2cro
MODEL: 55-v ..... 51-y 52-a ..... 44-i ..... 39-k ..... 36-A 35-H 34-I 33-A 32-k 31-n 30-i 29-A 28-s 27-Q 26-y 25-V 24-g 23-L 22-d 21-K 20-a 19-T 18-k 17-t 16-q 15-G 14-f 13-r ..... 10-Y ..... 8-k 7-l
SCENE: 60-Q ..... 53-N ..... 50-M 49-A ..... 37-G 36-A 35-E 34-I 33-L 32-Q 31-I 30-S 29-Q 28-Q 27-K 26-V 25-G 24-A 23-K 22-T 21-A 20-L 19-E 18-T 17-Q 16-T 15-M 14-K ..... 10-R 9-R 8-K 7-K 6-L ..... 2-L

Transformations listed with the table:
TRAN: 16.6, -7.7, -2.4    ROT: -0.11, 0.29, -2.53
TRAN: -16.6, -49.6, -20.1    ROT: 2.34, 0.64, -0.90
TRAN: -23.4, -45.2, -19.7    ROT: 1.91, -0.77, -2.59



lowercase. Unmatched (scene) residues are left blank. Below each match, the 3 translational (TRAN) parameters and the 3 rotational (ROT) parameters of the rigid body transformation applied to obtain the match are shown.
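The reporting rule described above (keep only matches with at least 80 percent of the pairs of the largest match) can be sketched as follows. This is a minimal illustration, not the authors' program: the `Match` record, the `select_matches` function and the `percent` parameter are hypothetical names introduced here, and the ordering by number of pairs and then by r.m.s. deviation is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One candidate superposition (hypothetical record, not the authors' data structure)."""
    label: str
    n_pairs: int   # number of matched C-alpha pairs
    rms: float     # r.m.s. deviation of the matched pairs


def select_matches(matches, percent=80):
    """Keep matches with at least `percent` percent of the pairs of the largest match."""
    if not matches:
        return []
    largest = max(m.n_pairs for m in matches)
    # Integer arithmetic avoids floating-point surprises exactly at the threshold.
    kept = [m for m in matches if 100 * m.n_pairs >= percent * largest]
    # Assumed ordering: largest matches first, ties broken by smaller r.m.s. deviation.
    return sorted(kept, key=lambda m: (-m.n_pairs, m.rms))
```

With a largest match of 25 pairs, for instance, a match of 20 pairs is kept and a match of 19 pairs is dropped.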

Helix-Turn-Helix


The helix-turn-helix (HTH) motif is found in some DNA-binding proteins and is involved in DNA binding in some repressor proteins. Repressor proteins play an integral role in the control of gene transcription (for a recent review see reference 25). The motif contains two alpha helices connected by a variable turn. We compare three transcriptional regulatory proteins known to contain the HTH motif: tryptophan repressor (PDB code: 2WRP), lambda CRO (PDB code: 1CRO) and phage 434 CRO (PDB code: 2CRO). In 1CRO, there are four crystallographically unrelated monomers in the asymmetric unit. These monomers have been assigned chain identifiers O, A, B and C. The dimer of 1CRO that exists in solution is presumed to be the O-B dimer, which is thought to be the one that actually binds DNA. We use the B monomer in the comparison shown below, but a comparison using all four monomers produces similar matches. The sequence positions where the HTH motifs appear are:

Prot    Pos     Sequence
2wrp:   66-88   MS QRELKNELGA GIATITRGSNS
1cro:   14-36   FG QTKTAKDLGV YQSAINKAIHA
2cro:   15-37   MT QTELATKAGV KQQSIQLIEAG

In the three pairwise comparisons below (see Table I), our method succeeds in matching the HTH motif from one protein to the HTH motif from the other. Very few other pairs are matched, showing that the only equivalent substructure between the proteins is the HTH motif itself. The pairs outside the HTH motif are 3-D nonlinear matches. In the above example and the ones below, we do not try to qualify the matches obtained by our method, or assess any biological significance. We only present the best matches produced by the program, i.e. those matches with the largest number of matched pairs and with the smallest r.m.s. deviation. Since our method is a purely geometric one, no biologically true result or false positive can be verified. The fact that our method, unbiased by the order of the amino acid sequences, produces results equivalent to those previously reported (which specifically exploit the sequence order) indicates that the best structural superposition between the structures in these cases is indeed one that conserves the sequence order. Such a coincidence of results is definitely not random. Nevertheless, sporadic nonlinear matches, which have not been discovered in earlier work, appear in our results. These matches reflect the fact that, under the given transformation, the atoms in the matched pairs came sufficiently close together to be matched. The biological significance of such pairs is beyond the scope of this paper.
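The idea that atoms are paired purely because they come close together under a given rigid transformation, regardless of sequence order, can be illustrated with a short sketch. It is not the matching procedure of the paper: the greedy closest-first pairing, the `max_dist` cutoff of 1.5 and the function names are assumptions made for illustration, and the rotation is passed as an explicit 3x3 matrix because the parametrization behind the three ROT values is not stated in this section.

```python
import math


def apply_rigid(points, rotation, translation):
    """Rotate, then translate, each 3-D point; `rotation` is a 3x3 matrix given as rows."""
    moved = []
    for p in points:
        moved.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return moved


def match_pairs(model, scene, rotation, translation, max_dist=1.5):
    """Pair transformed model atoms with scene atoms closer than `max_dist`.

    Greedy closest-first pairing (an assumption, not the authors' procedure).
    Returns the sorted (model_index, scene_index) pairs and their r.m.s. deviation.
    """
    moved = apply_rigid(model, rotation, translation)
    candidates = []
    for i, m in enumerate(moved):
        for j, s in enumerate(scene):
            d = math.dist(m, s)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # closest candidate pairs are accepted first
    used_model, used_scene, pairs, sq_sum = set(), set(), [], 0.0
    for d, i, j in candidates:
        if i in used_model or j in used_scene:
            continue  # each atom takes part in at most one pair
        used_model.add(i)
        used_scene.add(j)
        pairs.append((i, j))
        sq_sum += d * d
    rms = math.sqrt(sq_sum / len(pairs)) if pairs else 0.0
    return sorted(pairs), rms
```

Because the pairing looks only at distances, the indices of the matched scene atoms need not increase with the indices of the model atoms, which is how nonlinear matches can arise.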


Table II An immunoglobulin motif was defined as the Cα residues from the pin of domain VL from 3FAB. This pin motif was detected in the CH1, CL and VH domains of 3FAB.

rms: 0.69   MODEL VLpin   SCENE ch1
MODEL   SCENE
87-C    200-C
86-Y    199-I
85-Y    198-Y
.....
73-A    187-T
72-L    186-V
71-T    185-V
70-A    184-S
69-S    183-S
.....
34-W    158-W
33-K    157-S
32-V    156-V
.....
24-g    146-V
        145-L
22-c    144-C
21-s    143-G
20-i    142-L
TRAN: -27.3, 22.2, -40.0   ROT: 0.42, 0.12, -0.66

rms: 0.56   MODEL VLpin   SCENE cl
MODEL   SCENE
87-C    195-C
86-Y    194-S
85-Y    193-Y
.....
73-a    181-S
72-L    180-L
71-t    179-Y
70-A    178-S
69-s    177-S
.....
34-W    150-W
33-k    149-A
32-V    148-V
.....
24-G    138-I
        137-L
22-c    136-C
21-S    135-V
20-i    134-L
TRAN: -54.7, 25.4, 8.2   ROT: -1.52, -1.12, -2.09

rms: 0.58   MODEL VLpin   SCENE vh
MODEL   SCENE
87-c    95-C
86-y    94-Y
85-y    93-Y
.....
73-A    81-R
72-L    80-L
71-T    79-S
70-A    78-F
69-S    77-Q
.....
34-w    36-W
33-k    35-T
32-v    34-S
.....
24-G    24-V
        23-T
22-C    22-C
21-s    21-T
20-i    20-L
TRAN: -23.8, 5.8, 44.3   ROT: -1.82, -0.25, -2.33

Immunoglobulin Domains

Immunoglobulin molecules consist of six domains that have similar secondary and tertiary structures and low amino-acid sequence homology. Each domain contains two stacked β-sheets pinned together by a disulphide bridge. One β-sheet contains four strands and the other three. Lesk and Chothia (26) extensively studied the cores of several immunoglobulin domains. They observed a group of 35 residues present in all domains which they compared and called it the β-sheet core. Within the core, they defined a group of 9 residues that occur at the center of the region between the two β-sheets and called it the pin. The antigen-binding fragment of FAB NEW (PDB code: 3FAB) contains four of these domains designated VH, VL, CL and CH1. The VH and VL domains are highly homologous to each other. The CL and CH1 domains are also highly homologous to each other, and less homologous to VH and VL.

Table III Comparison of the immunoglobulin VL domain with the CL and CH1 domains from 3FAB.


rms

: 1.17 MODEL ISCENE I vl I ch11 -------------1 I 1••••• I 9-s 1217-PI 103-t 1216-EI 1215-VI 1214-KI 100-g 1213-KI 99-f 1212-DI 98-v 1211-VI 95-r 1210-KI 1..... 1

91-d 90-y 89-s 88-q 87-c 86-y 85-y 83-a 82-e

1204-HI 1203-NI 1202-VI 1201-NI 1200-CI 1199-II 1198-YI 1.••.• 1 1196-QI 1195-TI 1..•.. 1

rms 1.14 MODEL ISCENE I vl I ell -------------1 I 1•..•. 1 45-1

99-f 98-v 95-r 94-1 93-s 92-r 91-d 90-y 89-S 88-Q 87-c 86-y 85-y

78-q

74-i 73-A 72-1 71-t 70-A 69-s 68-S 66-s 65-k 64-s 63-V 62-s

1188-VI 1187-TI 1186-VI 1185-VI 1184-SI 1183-SI 1182-LI 1••.•• 1 1170-FI 1169-TI 1168-HI 1167-VI 1166-GI 1..••• 1

1206-KI 1205-EI 1204-VI 1203-TI 1202-SI 1201-GI 1200-EI 1199-HI 1198-TI 1197-VI 1196-QI 1195-CI 1194-SI 1193-YI 1.••.. 1

35-y 34-v 33-k 32-v 31-h 30-n

1159-NI 1158-WI 1157-SI 1156-VI 1155-TI 1154-VI I. .... I

27-n

1149-YI 1148-DI 1147-KI 1146-VI 1145-LI 1144-CI 1143-GI 1142-LI 1141-AI 1..... 1

24-G 23-t 22-C 21-s 20-I 19-t

44-k 43-p 36-q 35-y 34-v 33-k 32-v 31-h 30-n 29-g 27-s 25-s 24-g 23-T 22-c 21-S 20-I 19-t 18-v 17-r 16-q

1185-EI 1..... 1 74-i 1182-LI 73-a 1181-SI 72-L 1180-LI 7-p 71-t 1179-YI 6-Q 1127-PI 6-q 70-A 1178-S I 5-t 1126-FI 5-t 69-s 1177-SI 4-1 1125-VI 4-1 68-S 1176-AI 3-v 1124-SI 3-v 1•••.. 1 1•...• 1 66-s 1164-TI 65-k 1163-TI 64-S 1162-EI 63-v 1161-VI 62-s 1160-GI I .•••. 1 TRAN:-26.8, 20.6,-40.1 TRAN:-55.2, 24.9, 7.5 ROT: 0.38, 0.11,-0.60 ROT: -1.55, -1.16, -2.06

156-PI 155-SI 154-SI 153-DI 152-AI 151-KI 150-WI 149-AI 148-VI 147-TI 146-VI 145-AI •.••. I 141-FI 140-DI 139-SI 138-II 137-LI 136-CI 135-VI 134-LI 133-TI 132-AI 131-KI 130-NI .•••• 1 120-FI 119-LI 118-TI 117-VI 116-SI .•••. I


Table IV Comparison of the immunoglobulin VH domain with the CL and CH1 domains from 3FAB.


rms

: 1.23 MODEL (SCENE( vh ell I -------------1 I I I I I 1..... 1 112-1 1209-AI 111-s 1208-VI 110-g 1207-TI 109-q 1206-KI 107-w 1205-EI 1204-VI 106-v 1203-TI 1..... 1 97-r 1197-VI 96-a 1196-QI 95-c 1195-CI 94-y 1194-SI 93-y 1193-YI 92-v 1192-SI 91-a 1191-KI 1..... 1 81-R 1181-SI 80-1 1180-LI 79-s 1179-YI 78-f 1178-SI 77-q 1177-SI 1176-AI 1..... 1

69-m 68-t

1161-VI 1160-GI 1..... 1

rms :1.21 MODEL I SCENE vh I ch1 -------------1 MATCH # 1 I I I I I 1..... 108-G 1213-K 107-w 1212-D 1..... I I 97-r 1202-VI 96-A 1201-NI 95-C 1200-CI 94-Y 1199-II 93-y 1198-YI 92-v 1197-TI 91-a 1196-QI 1..... 1 81-R 1187-TI 80-1 1186-VI 79-S 1185-VI 78-F 1184-SI 77-Q 1183-SI 76-N 1182-LI 1..... 1 71-v 1170-FI 1..... 1 69-M 1167-VI 1166-GI 67-v 1165-SI l ..... f

48-i 47-w 46-e 45-1 39-q 37-v 36-w 35-T 34-s 33-y

26-g 25-s 24-V 23-t 22-c 21-t 20-L 19-s

8-g 7-s 6-Q 5-e 4-1 3-q 2-v

1157-VI 1156-PI 1155-SI 1154-SI 1153-DI 1152-AI 1151-KI 1150-WI 1149-AI 1148-VI 1147-TI 1146-VI 1..... 1 1140-DI 1139-SI 1138-II 1137-LI 1136-CI 1135-VI 1134-LI 1133-TI t ..... f

48-i 47-w

37-v 36-w 35-T 34-s 33-Y 32-y 31-D

24-v 23-T 22-C 21-t 20-L 19-s 9-p 8-g

1121-PI 1120-FI 1119-L I 1118-TI 1117-VI 1116-SI 1115-PI 1..... 1

6-q 4-1 26-g 2-v

TRAN:-30.1, 39.8, 1.6 ROT: 0.43, 0.10,-0.82

TRAN:-65.8, 30.2, 44.7 ROT: -1.43, -1.38, -2.63

1163-LI 1162-AI 1..... 1 I I I I I I 1159-NI 1158-WI 1157-SI 1156-VI 1155-TI 1154-VI 1153-PI 1..... 1 I I I I 1146-VI 1145-LI 1144-CI 143-GI 142-LI 141-AI

..... (

131-SI 130-PI 129-AI 128-LI 127-PI 126-FI 125-VI 124-SI 123-PI

..... (

1121-KI 1..... 1

We compared every domain to each other and defined the pin as a motif to be searched within the domains.

1) Motif Search

A motif was defined as the Cα atoms from the VL domain pin, plus some other

Table V An EF-hand motif defined from calmodulin (3cln1) is detected twice within parvalbumin (3cpv).

MATCH 1   rms: 1.07   MODEL 3cln1
MODEL: 38-s 37-R 36-M 35-v 34-T 33-g 32-L 31-e 30-K 29-t 28-t 27-I 26-T 25-G 24-d 23-g 22-d 21-K 20-d 19-f 18-l 17-s 16-f 15-a 14-e 13-k 12-f 11-e

MATCH 2   rms: 1.11   MODEL 3cln1
MODEL: 38-s 35-V 34-t 33-G 32-l 31-E 30-K 29-t 28-t 27-i 26-T 25-g 24-d 23-G 22-D 21-K 20-d 19-f 18-l 17-s 16-F 15-a 14-e 13-K 12-f 11-e

SCENE 3cpv: 108-A 107-K 106-V 105-L 104-A 103-T 102-F 101-E 100-D 99-V 98-G 97-I 96-K 95-G 94-D 93-G 92-D 91-S 90-D 89-G 88-A 87-K 86-L 85-F 84-T 83-K 82-T 81-E 80-G ..... 68-Q 67-L 66-F 65-L 64-K 63-L 62-E 61-D 60-E 59-E 58-I 57-F 56-G 55-S 54-K 53-D 52-Q 51-D 50-I 49-I 48-A 47-F 46-A 45-K 44-K 43-V 42-D

MATCH 1: TRAN: -31.8, 38.8, 42.6   ROT: 0.77, 0.07, 2.36
MATCH 2: TRAN: -3.8, 66.8, 7.1   ROT: 2.31, 0.82, 0.17


residues from the core. This motif can represent the basic fold of an immunoglobulin domain. The residues used to define the motif are: 20-I, 21-S, 22-C, 24-G, 32-V, 33-K, 34-W, 69-S, 70-A, 71-T, 72-L, 73-A, 85-Y, 86-Y and 87-C. It includes the disulphide bridge, some of its neighbor residues, a conserved tryptophan as well as additional ones. We searched for this motif in the VH, CL and CH1 domains. The residues from each domain that matched the motif residues define the motif occurring within the domain. The results of our method (see Table II) show the same matches as those defined by Lesk and Chothia.
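Defining such a motif amounts to selecting the Cα coordinates of the listed residues from a coordinate file. The sketch below shows one way this could be done for a file in the fixed-column PDB format; it is an illustration only, not the authors' code. The function name, the constant `VL_PIN_MOTIF` and the idea of passing a chain identifier are assumptions; the chain identifier of the VL domain in 3FAB must be checked in the file itself.

```python
# Residue numbers of the VL motif as listed in the text above.
VL_PIN_MOTIF = [20, 21, 22, 24, 32, 33, 34, 69, 70, 71, 72, 73, 85, 86, 87]


def read_ca_atoms(pdb_lines, chain_id, residue_numbers):
    """Return {residue number: (residue name, x, y, z)} for the requested C-alpha atoms.

    Uses the fixed columns of PDB ATOM records: atom name in columns 13-16,
    residue name in 18-20, chain in 22, residue number in 23-26 and
    x, y, z in columns 31-38, 39-46 and 47-54.
    """
    wanted = set(residue_numbers)
    atoms = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skips HETATM records, so calcium ions named CA are ignored
        if line[12:16].strip() != "CA":
            continue
        if line[21] != chain_id:
            continue
        res_seq = int(line[22:26])
        if res_seq in wanted and res_seq not in atoms:
            atoms[res_seq] = (
                line[17:20].strip(),
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return atoms
```

A call such as `read_ca_atoms(open(path), chain_id, VL_PIN_MOTIF)`, with `path` and `chain_id` supplied by the user, would return the motif coordinates to be searched for in the other domains.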


2) Domain to Domain Comparison

We show our results (Tables III-IV) of the comparisons of the VH and VL domains with the CL and CH1 domains. Our results are equivalent to those of Lesk and Chothia for the core regions, except for minor displacements in one of the β strands. The VH-CH1 comparison differs from the alignment reported by Taylor and Orengo (16) by a global two residue displacement.

Calcium-Binding EF-Hand Motif

Proteins from the calmodulin superfamily contain the so-called EF hand structure, which is a calcium binding motif (27). The 29-residue motif contains two alpha helical regions flanking a 12-residue calcium binding loop that chelates the ion. Two EF hand motifs usually form a domain. We use the first of the EF hand motifs from calmodulin (PDB code 3CLN) as a motif to be searched for in parvalbumin (PDB code 3CPV). The calmodulin motif lies on residues 11-38. Parvalbumin contains the expected two EF hand motifs (residues 42-69 and 81-108) plus an additional motif (in the N-terminus) that does not bind calcium. The residues involved in calcium binding are:

Calmodulin (first) motif:   20-D 22-D 24-D 26-T 28-T 31-E
Parvalbumin first motif:    51-D 53-D 55-S 57-F 59-E 62-E
Parvalbumin second motif:   90-D 92-D 94-D 96-K 98-G 100-E

Our method produces two different matches (Table V), each corresponding to a different transformation needed to superimpose the calmodulin motif on one of the two EF-hand motifs of parvalbumin. The matched pairs indicate the occurrence of the EF-hand motif within parvalbumin. This motif from calmodulin was also used to search some other calcium-binding proteins known to contain the EF-hand motif (results not shown). The search was done on the following PDB files: 1TNC, 3ICB, 4TNC, 5TNC and 3CLN itself. Each occurrence of the EF-hand motif within these proteins was detected by our method.

Plastocyanin-Azurin

Plastocyanin (PDB code: 3PCY) and azurin (PDB code: 1AZU) are small copper-



Table VI Comparison of the azurin (1azu) and plastocyanin (3pcy) proteins. See Figure 1 and text for explanation.


rms

:1.70 rms :1.67 MODEL ISCENE I MODEL 3pcy I 1azul 3pcy -------------1 1--------------MATCH # 1 I MATCH# 2 I 1128-KI 98-v 1127-LI 77-k 97-t 1126-TI 79-e 96-v 1125-LI 80-y 95-K 1124-TI 95-K 94-g 1123-G I 94-g 93-V 1122-K I 93-v 92-• 1121-111 92-a 91-G 1120-LI 1119-A I 91-G 88-q 1118-S I 88-q 87-h 1117-HI 87-h I ••••• I 86··p 1114·-FI 86-P 85-s 1113-TI 85-s 84-C 1112-C j 84-c 83-y 1111-FI 83··Y 82-F 1110-F I 82-F 81-s 1109-MI 81-s 80-·Y 1108-YI 79-e 1107-QI 78-g 1106-EI 1105-GI 77-k 1104-EI 76-n 1103-KI 48-s 1102-LI 1101-KI 50-v ..••. 1 72-v 97-FI 71-e 96-TI 70-f 95-VI 69-t 94-SI 68-e 93-DI 67-g 92-KI 69-t 65-a 91-EI 68-e 90-C I 66-k 89-SI 65-a 88-GI 64-n 63-1 87-11 63··1 62-1 86-LI 62-1 85-KI 61-d 84-TI 57-a 57-a 83-HI 56-s 82-AI 56-s 55-I 81-II 52-a 80-VI TRAN: 52-a 79-RI ROT: 59-e 73-LI 59-e 72-YI 60-e 71-DI 60-e .•••• I

·····'

42-d 43-e 44-d 45-s 46-i 41-f 40-V 39-i 38··n 37-h 36-p

33-a 31-n 30-k 28-v 26-k 22-s 21-i 20-s 19-f 18·-e 17-s 16-p 15-v 13-a 12-1

6-g 5-1 4-1 2-d 1-i

56-Ill 55-DI 54-AI 53·-AI 52-TI 51-SI SO-LI 49-VI 48··WI 47-NI 46-HI 45-GI 44-lil 1..••• 1 I 36-PI I 35-HI I 34-SI I 33-LI I 32-111 I 31-VI I 30-TI I 29·-FI 1....• 1 I 23-DI ~ 22-V I 71-TI 20-II 19-AI 18-Nl 11-TI 16-tll 15-FI 14-QI 13-MI 12-QI 11-DI 10-NI 9-GI 8-QI 7-II 6-DI 5-VI 4-SI 3-CI

4'>-s

42-d 41-F 40-v 39-1 38-n 37-h 36-p 35-f 32-n 31-n 30-k 71-e 72-v 73-a 23-p 98·-· 97-t 96-v 18-e 17-s 15-v 14-F 13-A 12-1 11-s 33-a 6-g 5-1 4-1 3-v 28-v 27-i 26·-k 24-g

MATCH 1: TRAN: -0.5, 46.3, -11.8   ROT: 2.27, 0.24, 1.01
MATCH 2: TRAN: -2.4, 45.9, -11.1   ROT: 2.14, 0.56, 0.84

[Figure 1a (Match 1) and Figure 1b (Match 2): schematic of the matched AZU-PCY residue pairs in β-sheet I and β-sheet II]

Figure 1: Schematic figure of the matches for 3PCY and 1AZU shown in Table VI. (Some matched pairs corresponding to the copper ligands and the residues adjacent to them were included in the figure although they do not belong to the beta-sheets).

binding proteins composed of two β-sheets packed face-to-face and an α-helix. Each β-sheet contains four strands. The copper atom is bound by four homologous residues: 37-H, 84-C, 87-H and 92-M in plastocyanin and 46-H, 112-C, 117-H and 121-M in azurin. Chothia and Lesk (28), Adman (29) and Taylor and Orengo (16) have aligned these proteins, obtaining different equivalences for β-sheet I. Adman's (29) and Taylor and Orengo's (16) results for this sheet differ from Chothia and Lesk's (28) by a two residue displacement. Our method produces two matches (Table VI) that partially agree with the above works. They perfectly match the copper ligands and the residues adjacent to them but cannot be compared to any alignment, as real 3-D matches (not conserving the



linearity of the main chain) were obtained. Figure 1 depicts the results of our method.

1) Match 1

Figure 1a depicts the results from match 1. The matches obtained for β-sheet II are equivalent to those obtained by Chothia and Lesk (28), Adman (29) and Taylor and Orengo (16). Nevertheless, some of the strands of β-sheet I differ from the above works by a one residue displacement.


2) Match 2

Figure 1b depicts the results from match 2. The strands of the β-sheets are shown with different angles, with residues from equivalent strands getting farther apart until they become closer to the neighbor strand to their right. For example, residues 30 to 32 of plastocyanin are matched to residues 34 to 36 of the equivalent strand of azurin, whereas residues 26 to 28 of plastocyanin are matched to residues 4 to 6 of the neighbor strand of azurin. The first matched pairs (from the top of the figure downwards) are equivalent to the alignments reported in the above works for β-sheet II, whereas for β-sheet I, the first matches are equivalent to those obtained by Taylor and Orengo and by Adman, but disagree with Chothia and Lesk by a two residue displacement. The other matches are purely geometric, non-sequential matches. These results are produced by our method considering only the 3-D coordinates of each protein, without taking into account any biological criteria such as residue-residue interactions or the sequential order of the chain.

In addition, we compared the β-sheets from plastocyanin to the corresponding β-sheets from azurin (not shown). In each sheet we included all the copper ligands and the residues adjacent to them. The results from comparing the β-sheets II were identical to all reported alignments, with all the ligands and the residues adjacent to them properly matched (translation 0.1, 44.1, -13.0; rotation 2.26, 0.42, 1.02). Nevertheless, in the comparison of the β-sheets I, our method produced two different matches: one agreeing with Lesk and Chothia's results (translation -9.0, 43.3, -14.0; rotation 2.04, 0.20, 1.11), and the other with Taylor and Orengo's (translation 3.9, 37.8, -12.9; rotation 2.36, 0.49, 1.19). In the latter match, only part of the ligands and the residues adjacent to them were correctly matched.
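Residue displacements and strand switches of the kind described for match 2 become easy to see when the matched pairs are grouped into runs of constant offset between scene and model positions. The following sketch is only an illustration of that bookkeeping: the function name is hypothetical, and the example pairs assume that the plastocyanin residues quoted above (30 to 32 and 26 to 28) are paired with the azurin residues (34 to 36 and 4 to 6) in ascending order, which the text does not state explicitly.

```python
def offset_runs(pairs):
    """Group (model_pos, scene_pos) pairs into runs of consecutive model residues
    that share the same offset scene_pos - model_pos.

    Returns a list of (model_start, model_end, offset) tuples.
    """
    runs = []
    for model_pos, scene_pos in sorted(pairs):
        offset = scene_pos - model_pos
        if runs and runs[-1][2] == offset and runs[-1][1] + 1 == model_pos:
            start = runs[-1][0]
            runs[-1] = (start, model_pos, offset)  # extend the current run
        else:
            runs.append((model_pos, model_pos, offset))  # start a new run
    return runs


# Plastocyanin (model) to azurin (scene) pairs from the match 2 example,
# with ascending pairing inside each segment assumed for illustration.
example_pairs = [(30, 34), (31, 35), (32, 36), (26, 4), (27, 5), (28, 6)]
print(offset_runs(example_pairs))
```

Two alignments of the same strand that differ by a two residue displacement show up here as runs whose offsets differ by 2, while a jump to a neighbor strand appears as a large change in offset.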

Conclusions

An efficient method for comparison of protein structures is introduced, based on the Geometric Hashing Paradigm originally designed for Computer Vision. The results shown above demonstrate that the system produces structural equivalences similar to those already reported in the literature. Many of the comparisons were originally determined with considerable difficulty, often requiring the use of interactive computer graphic tools and/or extensive computations. Our method is fully automated and has a better performance than previous methods. With such an efficient tool, an extensive structural comparison of proteins and motifs in protein data banks can more easily be attained.



By applying a completely general technique such as the Geometric Hashing Paradigm and other Computer Science techniques to protein structural comparison, we obtained an efficient, fully automated method that is not sensitive to insertions, deletions, gaps or displacements of equivalent substructures between the molecules being compared. "Pure" 3-D comparison provides a way to obtain sequence independent matches not constrained by the "progression rule" of alignment techniques. This is particularly useful when searching for motifs in active sites, surfaces, cores, etc. Another advantage of our method is that it produces every piece of structural similarity found, thus obtaining different ranked matches for each comparison. Such output is useful when the proteins being compared are relatively dissimilar and a unique global alignment may be neither meaningful nor accurate.

Our future work will involve a systematic search for motifs in the protein data bank. We hope this search will find more structural similarities between proteins that could hardly be obtained using previous methods. Parallelization of the algorithm is straightforward, and will contribute a meaningful speedup of the search procedure. Another aspect of our future work will be the inclusion of more biological criteria into a weighted voting strategy of the matching algorithm. In addition to Cα atoms, the set of features used can be expanded to include other atoms, interatomic vectors, residue centers of gravity, directions, dihedral angles as well as other criteria such as hydrophobicity, hydrogen bonding or residue types.

Acknowledgments

We would like to thank Drs. D. Covell, R. Jernigan and J.V. Maizel for their interest and discussions. In particular, we are grateful to G. Smythers of the Advanced Scientific Computing Laboratory, at the NCI in Frederick for his invaluable help during the course of this work. Research sponsored at least in part by the National Cancer Institute, DHHS, under contract No. 1-C0-74102 with Program Resources, Inc. The contents of this publication do not necessarily reflect the views or policies of the DHHS, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

References and Footnotes

1. Richardson, J.S., Adv. Protein Chem. 34, 167-339 (1981).
2. Nussinov, R., CRC Crit. Rev. Biochem., In press (1990).
3. Pabo, C.O. and Sauer, R.T., Annu. Rev. Biochem. 53, 293-321 (1984).
4. Klug, A. and Rhodes, D., Trends Biochem. Sci. 12, 464-469 (1987).
5. Landschulz, W.H., Johnson, P.F. and McKnight, S.L., Science 240, 1759-1764 (1988).
6. Gehring, W.J., Science 236, 1245-1252 (1987).
7. Bernstein, F.C., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T. and Tasumi, M., J. Mol. Biol. 112, 535-542 (1977).
8. Matthews, B.W. and Rossmann, M.G., Methods Enzymol. 115, 397-420 (1985).
9. Remington, S.J. and Matthews, B.W., Proc. Natl. Acad. Sci. USA 75, 2180-2184 (1978).
10. Remington, S.J. and Matthews, B.W., J. Mol. Biol. 140, 77-99 (1980).
11. Rossmann, M.G. and Argos, P., J. Biol. Chem. 250, 7525-7532 (1975).
12. Rossmann, M.G. and Argos, P., J. Mol. Biol. 105, 75-96 (1976).
13. Rossmann, M.G. and Argos, P., J. Mol. Biol. 109, 99-129 (1977).
14. Chothia, C. and Lesk, A.M., EMBO Jour., vol. 5 no. 4, 823-826 (1986).
15. Richards, F.M. and Kundrot, C.E., Protein Struct. 3, 71-84 (1988).
16. Taylor, W.R. and Orengo, C.A., J. Mol. Biol. 208, 1-22 (1989).
17. Needleman, S.B. and Wunsch, C.D., J. Mol. Biol. 48, 443-453 (1970).
18. Sali, A. and Blundell, T.L., J. Mol. Biol. 212, 403-428 (1990).
19. Lamdan, Y., Schwartz, J.T. and Wolfson, H.J., In Proc. IEEE Int. Conf. Robotics and Automation, 1407-1413, Philadelphia, Pa., April (1988).
20. Nussinov, R. and Wolfson, H.J., Proc. Natl. Acad. Sci. (USA) 88, 10495-10499 (1991).
21. Bachar, O., Fischer, D., Nussinov, R. and Wolfson, H.J., T.R. 1991, Tel Aviv University, In preparation.
22. Schwartz, J.T. and Sharir, M., The Int. Jour. Robotics Research 6(2), 29-44 (1987).
23. Taha, H.A., In Operations Research, An Introduction. The Macmillan Co., 123-125 (1971).
24. Zukhovitskiy, S. and Avdeyeva, L., In Linear and Convex Programming, Philadelphia, Pa. Saunders Co., 147-155 (1966).
25. Brennan, R.G. and Matthews, B.W., Trends Bioch. Sci. 163, 286-290 (1989).
26. Lesk, A.M. and Chothia, C., J. Mol. Biol. 160, 325-342 (1982).
27. Kretsinger, R.H., Crit. Rev. Biochem. 8, 119-174 (1980).
28. Chothia, C. and Lesk, A.M., J. Mol. Biol. 160, 309-323 (1982).
29. Adman, E.T., In Metalloproteins (Harrison, P.M., ed), part 1, Ch. 1, 1-142, Verlag Chemie, Weinheim (1985).

Date Received: June 21, 1991

Communicated by the Editor R.H. Sarma
