Selection of a representative set of structures from Brookhaven Protein Data Bank.

PROTEINS: Structure, Function, and Genetics 14:265-276 (1992)

Jorma Boberg,1 Tapio Salakoski,1 and Mauno Vihinen2

1Department of Computer Science, University of Turku, SF-20520 Turku, Finland; 2Department of Biochemistry, and Centre for Biotechnology, University of Turku, SF-20500 Turku, Finland

ABSTRACT Reliable structural and statistical analyses of three-dimensional protein structures should be based on unbiased data. The Protein Data Bank is highly redundant, containing several entries for identical or very similar sequences. A technique was developed for clustering the known structures based on their sequences and their contents of α- and β-structures. First, sequences were aligned pairwise. A representative sample of sequences was then obtained by grouping similar sequences together and selecting a typical representative from each group. The similarity significance threshold needed in the clustering method was found by analyzing similarities of random sequences. Because the three-dimensional structures of proteins of the same structural class are generally more conserved than their sequences, the proteins were also clustered according to their contents of secondary structural elements. The results of these clusterings indicate conservation of α- and β-structures even when sequence similarity is relatively low. An unbiased sample of 103 high-resolution structures, representing a wide variety of proteins, was chosen based on the suggestions made by the clustering algorithm. The proteins were divided into structural classes according to their contents and ratios of secondary structural elements. Previous classifications have suffered from a subjective view of secondary structures, whereas here the classification was based on backbone geometry. The concise view led to reclassification of some structures. The representative set of structures facilitates unbiased analyses of relationships between protein sequence, function, and structure, as well as of structural characteristics. © 1992 Wiley-Liss, Inc.
Key words: representative PDB structures, sequence clustering, significance of sequence similarity, classification of protein structures, amino acid composition

INTRODUCTION

Structural information is required to understand the relationship between protein structure and function.

Refined structures constitute a general framework for computer-aided molecular modeling. Reasonable sequence similarity is a prerequisite for reliable modeling, since proteins having the same function generally have the same fold. Computer modeling is severely limited if no structure of a similar type of protein is known. Refined structures have been used to analyze protein folding and structures, and often further to predict several features. Promising knowledge-based methods have been applied to predict the three-dimensional folding of proteins based on their amino acid sequences.

The Brookhaven Protein Data Bank (PDB), which contains the coordinates of determined three-dimensional structures, is highly redundant, including several entries for very similar structures. Statistical analyses of refined structures have been applied to develop techniques for predicting, e.g., hydropathy, secondary structural elements, antigenic sites, and flexibility. The quality of these predictions, as well as general knowledge of the structure and function of proteins, could be improved by having a rationally selected, unbiased set of structures for analysis.

A representative sample of any objects is obtained by grouping similar objects together and selecting a typical representative from each group. These groups are commonly called clusters, and the process of forming them is called clustering. We have developed a method for protein clustering and applied it first to sequences and second to proportions of α- and β-structures. One typical structure was chosen to represent each cluster. Although contents of secondary structures can vary greatly, quite similar results were obtained from both clusterings. A total of 103 structures of 3 Å resolution or better were selected for the analysis of structural features. Instead of sequences, it would have been ideal to analyze the three-dimensional structures, but this was not possible because present-day algorithms for structural alignment are unfeasible. In practice this would mean years of computer time. Since amino acid sequences contain the necessary information for protein folding, they provide a reasonable basis for the selection of structures from the Protein Data Bank.

Received June 11, 1991; revision accepted January 21, 1992. Address reprint requests to Dr. Mauno Vihinen, Department of Biochemistry, University of Turku, SF-20500 Turku, Finland.
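The idea of grouping similar objects and choosing a typical member of each group, as described in the Introduction, can be illustrated with a minimal Python sketch. It is not the authors' clustering method, which this section does not specify. It assumes single-linkage grouping (two items share a cluster when a chain of pairs at or above the threshold connects them) and takes as "typical" the member with the highest summed similarity to the rest of its cluster. The function names, the labels `p1` to `p5`, the similarity values, and the threshold of 0.6 are all hypothetical.

```python
from itertools import combinations


def cluster_by_threshold(names, similarity, threshold):
    """Group items linked by pairwise similarity >= threshold (single linkage)."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def pick_representative(cluster, similarity):
    """Return the member with the highest summed similarity to the others."""
    return max(
        cluster,
        key=lambda m: sum(similarity(m, other) for other in cluster if other != m),
    )


# Hypothetical toy data: labels and similarity values are made up for illustration.
names = ["p1", "p2", "p3", "p4", "p5"]
toy_scores = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p2", "p3")): 0.8,
    frozenset(("p1", "p3")): 0.5,
    frozenset(("p4", "p5")): 0.7,
}


def toy_similarity(a, b):
    # Pairs not listed above are treated as dissimilar (assumed value 0.1).
    return toy_scores.get(frozenset((a, b)), 0.1)


clusters = cluster_by_threshold(names, toy_similarity, threshold=0.6)
representatives = [pick_representative(c, toy_similarity) for c in clusters]
print(clusters)
print(representatives)
```

In this toy input, `p1` and `p3` fall below the threshold as a pair but still share a cluster because both are similar to `p2`. This chaining is a property of single linkage and may or may not match the behavior of the method used in the study.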

METHODS

The Outline of the Study

The amino acid sequences of proteins in PDB (July 1990) were aligned by using the GCG program GAP,11 which is an implementation of the algorithm of Needleman and Wunsch12 using linear gap functions, the theory of which has been discussed by Waterman13 and others. The secondary structural analysis was based on the proportions of amino acids taking part in α- and β-structures according to the program DSSP.14 A clustering method was developed to group the proteins from PDB according to their sequence similarity in order to find typical representatives for mutually similar proteins. The clustering method was also applied to the secondary structural information of the same proteins. The visual inspection of the locations of secondary structural elements was made with the program Insight (Biosym Technologies, Ltd.) on the Evans & Sutherland PS390 graphics terminal. The outline of the study is shown in Figure 1.

Phase P: preliminary selection of sequences

The number of sequences was reduced by omitting bibliographic entries, entries having only α-carbon or backbone coordinates, poorly refined structures (resolution worse than 3.0 Å), modeled structures, and point mutations of well-characterized proteins. Also the entries having short (
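The global alignment step named in The Outline of the Study can be illustrated with a minimal Python sketch of Needleman-Wunsch scoring. It is not the GAP program: it assumes simple identity scoring (a match or mismatch value instead of an amino acid comparison table) and charges the same penalty for every gapped position, which is the simplest gap cost that grows linearly with gap length. The function name and the default values for `match`, `mismatch`, and `gap` are hypothetical and are not GAP's defaults.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a per-position gap penalty.

    Only the optimal score is computed, so two rows of the dynamic
    programming table are enough.
    """
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]  # column 0: prefix of seq_a against an empty prefix
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            gap_in_b = prev[j] + gap      # residue a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # residue b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# Toy sequences, made up for illustration.
print(global_alignment_score("ACDE", "ACDE"))
print(global_alignment_score("ACDE", "ADE"))
```

A score of this kind, computed for every pair of sequences, is the sort of input that a similarity-based clustering step would consume. How GAP scores are converted into the similarity measure used for clustering is not covered by this sketch.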
