SESAM: a relational database for structure and sequence of macromolecules.

PROTEINS: Structure, Function, and Genetics 11:59-76 (1991)

Martina Huysmans,¹ Jean Richelle,² and Shoshana J. Wodak²

¹BIM, B-3078 Everberg, Belgium, and ²Unité de Conformation de Macromolécules Biologiques, Université Libre de Bruxelles, CP 160, B-1050 Bruxelles, Belgium

ABSTRACT A system is described that integrates data on protein structure, sequence, and survey results with molecular graphics and molecular mechanics software. Its major component is the relational database SESAM, presently implemented under the commercial package SYBASE. By design, the database allows full integration, within the same data organization, of raw data on protein structure, sequence, ligands, and heterogroups, obtained from the Brookhaven Protein Databank, with pure sequence information available from other databanks such as SWISS-PROT. It contains in addition higher level descriptions of structural and topological properties, as well as survey results, obtained by executing specialized computer programs. Aside from the very useful attribute of closely combining structural and nonstructural information, other important features distinguish it from analogous systems developed elsewhere. It includes a molecular dictionary with a complete description of the geometric properties and energy parameters used in modeling and conformational energy calculations. Using this dictionary, structural data are validated by checking for localized inconsistencies in atomic coordinates, atomic symbols, and chirality definitions, and by flagging errors and incomplete entries. Because of both the dictionary and the validation procedures, SESAM can be readily interfaced with conventional molecular graphics and mechanics software packages, or with other specialized application programs. With the aid of appropriate interfaces, data access is sufficiently fast for SESAM to be interrogated interactively. Prototypes of user interfaces, as well as an interface with the molecular graphics package BRUGEL, are described, and the power of the system is illustrated in applications such as homology-based protein modeling, computer-aided protein design, protein structure prediction, analysis of local structure motifs, and analysis of relationships between protein sequence and structure.
Keywords: protein structure, amino acid sequence, database, macromolecules, molecular modeling, knowledge base

© 1991 WILEY-LISS, INC.

Received August 9, 1989; accepted February 22, 1991. Address reprint requests to Jean Richelle, Unité de Conformation de Macromolécules Biologiques, Université Libre de Bruxelles, CP 160, Av. P. Heger, B-1050 Bruxelles, Belgium.

INTRODUCTION

In recent years there has been an explosive growth of protein sequence data and a slower, yet swift, growth of data on protein structure and function. Managing this information has become one of the major problems in molecular biology.¹ So far, information on proteins and nucleic acids has been stored in flat sequential files that are grouped in several databanks. Such databanks already play an invaluable role. Sequence databanks such as GenBank, PIR-NBRF, and SWISS-PROT enable the detection of sequence homology, which often also leads to the identification of protein function. The Brookhaven Databank of 3D structures² has been extremely helpful in deriving structure prediction methods, in improving our understanding of the principles that govern protein structures,⁸ and in modeling proteins.⁹ The latter aspect, in particular, is increasingly exploited by the biotechnology and pharmaceutical industry to design drugs, vaccines, and therapeutics. Examples include modeling of human renins,¹⁰,¹¹ of second generation tissue plasminogen activators, AIDS virus protease,¹² insulin-like growth factors,¹³ and immunoglobulins.¹⁴ In line with these applications, most protein modeling programs (FRODO,¹⁵ INSIGHT,¹⁶ and others) offer access to the Brookhaven Databank or to libraries of structural fragments generated from it, which serve as a source of data on recurrent folding motifs.

According to the envisaged application, however, specific tools have to be provided for retrieving and cross-correlating the data, which are organized in a series of flat files. As a result, adding new information and transferring data between different programs are both cumbersome and nonstandard. As the amount of information on protein structure and sequence is rapidly expanding, the problem will only get worse unless more sophisticated information handling systems are used. Such systems would require integrating and expanding existing databanks into a more complete and accessible “knowledge base” that would include not only raw data such as atomic coordinates and amino acid sequence but also higher order descriptions based on concepts and relationships abstracted from human thinking.¹⁷ A key aspect would be the combination of structural with nonstructural information on macromolecules in the same conceptual framework. Practically, this would imply highly efficient ways of storing and cross-referencing information, user-friendly interfaces, and close integration with a library of procedures and programs for data validation and analysis.

Relational databases can be considered a first step in this direction, since they provide a framework for uniform representation of data and relationships between data items. Commercially available database management systems add to this an integrated set of programs and tools to efficiently manage and access information in the database. Several laboratories worldwide are already involved in developing such databases. The first system was developed by Morffew et al.¹⁸ The largest present effort is that of the Protein Engineering Club Database Group in the UK.¹⁹,²⁰ Other efforts are by Hirayama and Ceska of Kyowa Hakko Kogyo Co. Ltd, Tokyo, Japan, and Go and collaborators at Kyoto University, Japan. More recently Bryant²¹ has reported a system integrating a high-level query language with the Brookhaven Databank, and Gray et al.²² have described an Object Oriented Protein DataBase. The results of these efforts are just becoming accessible to the scientific community and there is no doubt that they will be immensely useful.

The present paper describes a system that provides ways of integrating structural data and sequence information on individual molecules with survey results. It moreover provides means of improving the efficiency with which available knowledge on proteins can be combined with numerical simulations in modeling applications.
Here the main goal has been to make it possible to address, within the context of a model building session, a very diverse set of questions that otherwise require surveying different databanks and executing specialized programs. For example, one should be able to ask about conformational preferences for a protein loop by looking up tables of sequence patterns previously found to be associated with certain structure motifs in surveys of the Brookhaven Databank,²³ or ask about homologous sequences identified in surveys of PIR-NBRF²⁴ or SWISS-PROT by analyzing tables where multiple aligned sequences²⁵ are stored. When modeling a sequence change in a protein, it should be possible not only to consult mutability matrices determined from various surveys of sequence and structure data,²⁶ but also to obtain information on the specific conformations, spatial interactions, and structural context of side chains in other known structures. Conversely, it should be possible to use graphics and energy calculations to analyze and interpret survey results without recourse to additional programming.

The core of the system is the relational database SESAM, implemented under the commercial package SYBASE. This core is accessed through general purpose user interfaces that facilitate interrogation and allow integration with specialized computer programs. The database moreover contains a molecular dictionary with a full description of the geometric and physical properties of atoms and residues, as well as parameters for conformational energy calculations. Thanks to this dictionary, structural data entered into SESAM can be thoroughly validated and the database can be readily interfaced with molecular graphics software such as BRUGEL.²⁷ The database and its interfaces are described in the next two sections, respectively. To illustrate the power of the system, examples of its use in investigating problems of protein modeling and structure analysis are then described, followed by a general discussion.

MATERIALS AND METHODS

A Relational Database for Macromolecules: Design Strategy

The relational data model is simple and powerful as a result of the uniform representation of data and relationships in tables.²⁸ The subject represented in a table is described by a number of attributes stored in different columns that can be accessed in any order. The order in which rows, i.e., table entries, are accessed cannot, however, be predefined, which poses a problem in handling an ordered set of residues along the amino acid sequence. This problem is circumvented by adding a column that identifies the position of the residues along the chain, and executing a special ordering command.

Several commercially available packages implement the relational model. They usually provide a processor for an interactive query language, most commonly SQL (Structured Query Language), whereby users can issue high-level commands to the database not only for data retrieval, but also for creating new tables and for various “housekeeping” tasks.

There are many ways of storing data in a database for any given application. An efficient system requires careful analysis aimed at achieving the following goals: (1) the database structure (i.e., the table and column definitions) should correspond as closely as possible to the user’s view of the field, (2) access for frequently executed queries and frequently retrieved data must be efficient, and (3) problems of information inconsistency should be avoided by reducing data redundancy, following Codd’s normalization rules.²⁹ As these goals cannot all be achieved simultaneously, compromises are needed. The database described here is a result of such compromises, which have been based on thorough research aiming to develop an efficient and application-independent tool. Preference has been given to a design that offers data integrity, clear structure, and user friendliness: (1) table and column names are chosen to be self-explanatory, and columns containing the same type of information have similar names, (2) data elements are grouped by level (macromolecule, sequence, and atom levels), and (3) minimal redundancy has been tolerated to facilitate database use.
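The ordering issue described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than SYBASE, and the table and column names (`sequence`, `mol_id`, `serial_nr`, `res_name`) are hypothetical, chosen only for illustration; they are not the SESAM schema.

```python
import sqlite3

def residues_in_order(rows, mol_id):
    """Load (mol_id, serial_nr, res_name) rows and return names in chain order."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    con.execute(
        "CREATE TABLE sequence (mol_id TEXT, serial_nr INTEGER, res_name TEXT)"
    )
    con.executemany("INSERT INTO sequence VALUES (?, ?, ?)", rows)
    # The relational model does not guarantee row order, so the order is
    # requested explicitly through the position column.
    cur = con.execute(
        "SELECT res_name FROM sequence WHERE mol_id = ? ORDER BY serial_nr",
        (mol_id,),
    )
    names = [row[0] for row in cur]
    con.close()
    return names

# Rows are deliberately inserted out of order.
demo_rows = [("demo", 3, "SER"), ("demo", 1, "MET"), ("demo", 2, "LYS")]
print(residues_in_order(demo_rows, "demo"))  # ['MET', 'LYS', 'SER']
```

The explicit `ORDER BY` plays the role of the "special ordering command" mentioned in the text: without it, the order of the returned rows would be left to the database engine.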

Sources of Data

Data from different sources can be entered into SESAM. These sources presently include the Brookhaven Protein Databank for protein structures² and the sequence databank SWISS-PROT. Only minor changes in the SESAM load programs would be required to enter information from other databanks such as PIR-NBRF,²⁴ GenBank, and JIPID.

Full integration of structural and sequence data within the same data structure is one of the key features of SESAM. It increases the efficiency with which structural and sequence information can be analyzed and cross-correlated. This is particularly useful when dealing with families of homologous proteins. Indeed, since multiple aligned sequences can be stored together with any subset of the corresponding three-dimensional structures, sequence variability associated with given structural motifs, in a specific position within a motif or a structure, can be readily analyzed.

The Brookhaven Databank is the only public source for data on structures of macromolecules. It is essentially an archive of sequential files describing mostly results of analyses by X-ray diffraction techniques, with each file containing data on one macromolecule. SESAM presently contains data from all the Brookhaven Databank entries (January 1990 version) that include detailed atomic coordinates. For each entry all data records are loaded except for the SITE record and the salt-bridge and hydrogen bond information from the CONECT record.

From protein sequence databanks, the following data are loaded: amino acid sequence, disulfide bonds, protein-ligand bonds whenever appropriate, as well as general information such as the name, source, and function of the protein and literature citations. SESAM contains at present pure sequence data for only a limited number of protein families for which several three-dimensional structures are also stored. These include the globins,³⁰ cytochromes (Chothia, personal communication), and microbial RNases.³¹

A third major source of data is the “dictionary” files used by the molecular modeling package BRUGEL.²⁷ Such files are required by virtually all molecular modeling programs that display proteins and perform conformational energy calculations³² or molecular dynamics simulations.³³ The dictionary files contain a description of the properties of atoms and molecules. For each molecule a full description is given. It includes atom names, atomic charge, connectivity between atoms, as well as parameters used for energy calculations. A very important use of this information is in checking the quality and completeness of the structural data from the Brookhaven Databank files, as described below.

Additional information, derived from these basic data through execution of specialized programs, is also loaded into the database. Good examples are the secondary structure assignments obtained by applying DSSP,³⁴ and the limits of secondary structure elements. Other derived data include hydrogen bonds made by the backbone and side-chain atoms, and residue dihedral angles computed using routines in the BRUGEL package. Solvent-accessible surface areas and volumes of atoms and residues, computed using an analytic algorithm,³⁵ are also included. Inclusion of higher level topological information is also well under way. For the moment, this concerns limits of structural domains obtained by applying the algorithm of Wodak and Janin,³⁶ definitions of conserved structural features in homologous proteins,³¹ an exhaustive repertoire of recurrent local structural motifs, and the spatial orientation of α-helices and β-strands described by the directions of best fitting axes. In the near future, information about amino acid residues participating in contacts between secondary structure elements, domains, and subunits will also be included. Other very basic data such as atom and residue symbols are loaded into SESAM from flat files by means of built-in tools.

The data mentioned above only constitute a general basis on which individual users can build their own system and applications: different sources of data (public or private) can be loaded. The commercially available database packages moreover offer facilities for adding tables and columns, allowing research results and data from surveys to be integrated into the database. The database described here can therefore easily be tuned to more specific needs.
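Among the derived data mentioned above, residue dihedral angles are a simple function of four atomic positions. The following sketch shows one common way of computing a signed dihedral angle from Cartesian coordinates. It is not the BRUGEL routine, whose implementation is not described here; the function names and the test coordinates are assumptions made for illustration.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the two outer bonds onto the plane perpendicular to b1.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Four points with the last atom rotated a quarter turn about the z axis.
print(round(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)), 6))  # 90.0
```

Applied to consecutive backbone atoms (for example C, N, CA, C for the angle usually called phi), a function of this kind yields the per-residue torsion angles that a sequence-level table could store.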

Loading and Validating the Source Data

Four procedures are presently available for loading information into the database. They perform the following tasks: (1) analyze and load the contents of a dictionary file, (2) combine and load data from a Brookhaven type file with the output of the DSSP program³⁴ for the macromolecule, (3) calculate and load data using BRUGEL, and (4) load information from the various sequence databanks. The programs are written in C and FORTRAN. Moreover, to check input data and to compute additional relevant information from it, use is made of SQL procedures, an extension of standard SQL offered by SYBASE, the database package we chose for the implementation of SESAM.

Data validation is a major issue when managing a large amount of information on protein sequence and structure. Incorrect or incomplete information may crucially bias subsequent research based on this information. Ideally, validation procedures should be able to detect errors in protein sequences and structures, flag them, and, when possible, correct them. This is, however, not straightforward, since we do not as yet have reliable criteria for detecting errors in sequence and structure. It seems, however, that grossly incorrect polypeptide folds could be detected by simple tests on the atomic coordinates or backbone dihedral angles,³⁷ while errors in amino acid sequence can be detected only by checking the consistency of sequences from different sources or with the corresponding DNA sequence when available.

A much more frequent and annoying problem is small localized errors in atomic coordinates, or incomplete side-chain or main-chain descriptions given by the crystallographers, mostly due to difficulties in interpreting the electron density map. Graphic display and modeling programs, which all work with dictionaries containing well-defined building blocks for naturally occurring amino acids and nucleic acids, cannot handle such incomplete entries. The problem is then solved by ad hoc “corrections,” which either add missing atoms or create new dictionary entries corresponding to the incomplete residues. Other annoying common problems are errors in atomic symbols, which may lead to wrong descriptions of chiral atoms, and finally simple typographic errors, of which only those leading to inconsistencies in the data can be detected.

The data validation procedures presently implemented are mainly aimed at “cleaning up” information extracted from the Brookhaven Databank, and thus bring solutions to the most commonly encountered problems. Validation is performed at several levels.
First, symbols of atoms and residues, as well as residue structure descriptions, are checked against those in the “dictionary” files using mostly SQL procedures. Next, the following tasks are performed: (1) all bond connectivities are generated, including those in disulfide bridges whenever absent in the source, (2) the lengths of all covalent bonds are checked, (3) names of atoms in valine, leucine, phenylalanine, aspartic acid, glutamic acid, and tyrosine residues are checked against IUPAC conventions, and corrected whenever necessary, and (4) incomplete residues are flagged and missing atoms are added using standard geometry for bonds and valence angles. Data that fail the validation tests are nevertheless stored in the database, together with a description of the problem and a reference to the corresponding data in the input file, which appears in a special table. Finally, following the different steps detailed above, the user can interactively visualize the errors encountered and change the data manually.

Fig. 1. General overview of the sections included in the database. The sections are represented by ovals and the most important relationships between the sections are indicated by connecting lines. It is important to note that in a relational implementation all sections are defined and used at the same level, even though the relationships between the different sections may indicate a hierarchical structure. See text for detailed description of the contents of individual sections.
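The bond-length check and the flagging of incomplete residues listed among the validation tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the SESAM validation code: the reference lengths are approximate textbook values for a peptide backbone, and the tolerance, the function name, and the data layout are hypothetical.

```python
import math

# Approximate reference bond lengths in angstroms (illustrative values only;
# a real dictionary file would supply these per residue and atom pair).
IDEAL_BOND_LENGTH = {("N", "CA"): 1.46, ("CA", "C"): 1.52, ("C", "O"): 1.23}

def check_bond_lengths(coords, bonds, tolerance=0.1):
    """Return a list of (atom1, atom2, problem) tuples; nothing is discarded."""
    problems = []
    for a, b in bonds:
        if a not in coords or b not in coords:
            # Incomplete residue: flag it instead of rejecting the entry.
            problems.append((a, b, "missing atom"))
            continue
        d = math.dist(coords[a], coords[b])
        ideal = IDEAL_BOND_LENGTH[(a, b)]
        if abs(d - ideal) > tolerance:
            problems.append((a, b, f"length {d:.2f} deviates from {ideal:.2f}"))
    return problems

# One residue with a stretched CA-C bond and no carbonyl oxygen.
residue = {"N": (0.0, 0.0, 0.0), "CA": (1.46, 0.0, 0.0), "C": (1.46, 2.0, 0.0)}
backbone_bonds = [("N", "CA"), ("CA", "C"), ("C", "O")]
for problem in check_bond_lengths(residue, backbone_bonds):
    print(problem)
```

Returning a list of problems rather than raising an error mirrors the policy stated in the text: data that fail validation are still stored, with the problem recorded alongside them.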

Database Organization

General overview

Figure 1 gives a general overview of how the subjects describing various aspects of molecular structure are organized in the database. Information on individual macromolecules is spread over three sections: MACROMOLECULE DESCRIPTION, TOPOLOGICAL INFORMATION, and FUNCTIONAL INFORMATION. The first section contains information on molecular structure. Sequence-mapped information and descriptions of topological features based either on the secondary or on the tertiary structure are stored in the second section. The third section, which is still empty in the present version of the database, is destined to harbor information related to protein function, such as active sites, catalytic residues, epitopes, and other functional entities. The HIGHER LEVEL CLASSIFICATION section contains information about sequence homology, multiple sequence alignments of protein families, protein subsets, or families of protein fragments defined using various criteria. A fifth section, BUILDING BLOCKS AND BASICS, describes the structure and properties of residues (mainly amino acids and nucleotides) as well as the values of energy parameters. Tables I and II list all the tables currently implemented, together with a succinct description of their contents.



TABLE I. List of the Tables in the Higher Level Classification, Macromolecule Description, and Topological Information Sections: Other Accessory Tables Are Also Listed*

Higher Level Classification
  seqhomology         Data on sequence homologies between two macromolecules
  homology-method     Method and parameters used to determine sequence homology between two macromolecules
  prot-set            Sets of protein fragments complying with specific characteristics (e.g., high resolution, low homology)
  prot-set-desc       Description of sets in the table prot-set
  alignment           Position of the residues of a protein in a given alignment
  align-prot          List of the proteins making up the different alignments
  align-desc          Description of the different alignments

Macromolecule Description
  instance            General description of each macromolecule
  lit                 Literature citations relative to each macromolecule
  remark              Text of the REMARK records (Brookhaven files) of each macromolecule
  inst-conformations  General description of each macromolecule conformation
  cryst               Information and parameters related to the crystallographic study of each macromolecule
  refinement          Refinement methods and R-factors of each macromolecule
  transmat            Contents of the ORIGX, SCALE, and MTRIX records (Brookhaven files) for each macromolecule
  crystl              Unit cell parameters and space group designation of the crystallographic study of each macromolecule
  sequence            Complete residue description (including dihedral angles and DSSP assignments) at the level of the sequence
  atoms               Description of the macromolecule at the atomic level (e.g., coordinates of the atoms)
  atomstdev           Reported standard deviations on atomic positions, occupancies, and temperature factors
  atombond            Description of special bonds such as disulfide bonds

Topological Information
  pdb-chain           Descriptions of the macromolecule chains according to the chain identifiers in the Brookhaven files
  domain              Definition of the domains of the macromolecules (limits and structural type)
  secinpdb            Secondary structure elements of each macromolecule as described by the author in the Brookhaven file
  stele               Description of secondary structure elements of each macromolecule, derived from the DSSP assignments in the table sequence
  primpatterns        Sequence pattern of amino acids and/or residue properties
  sec-motifs          Patterns of secondary structure motifs
  associations        Associations between sequence patterns and structure motifs
  territories         Location in the proteins of specific associations between sequence patterns and structure motifs

Miscellaneous
  journals            List of journals with their bibliographic specifications
  books               List of books with their bibliographic specifications

*The name and a brief description of the contents are given for each table.

The MACROMOLECULE DESCRIPTION section Figure 2 shows the database tables used to describe protein 3D structure, as well as some of the relationships between them. The contents of these tables is summarized in Table I. Three levels can be distinguished according to the hierarchical nature of protein architecture: the Macromolecule, Sequence, and Atomic level. The Macromolecule level gives general information such as the name of the macromolecule, its source, its residue composition, the name(s) of the author(s), literature citations, remarks, and information on the protocol used to derive the coordinates. Presently,

only X-ray diffraction results are loaded; protocol information thus includes, when available, data on crystallization conditions, refinement method, resolution, R-factors, unit-cell parameters, space group designation, and coordinate transformation matrices. The Sequence level includes the primary sequence extended with heterogroups and water molecules, whenever present, torsion angles, secondary structure (DSSP) and other types of local structure assignments, residue solvent accessibilities, and volumes. Finally, at the Atomic level, we store the 3D coordinates of the atoms, the temperature factors, occupancies, atomic solvent accessibilities, and volumes. Interresidue connections like disulfide bonds,

64

M. HUYSMANS ET AL.

TABLE 11. List of the Tables in the Building Blocks Section* Building Blocks Basic Tables iupac iuDac-alias mendeleev residue res-alias type

Definition of the IUPAC atom symbols List of aliases for the IUPAC svmbols Definition of the Mendeleev symbols for atoms General description of the residues List of aliases for residues Definition of residue types characterizing the position of residues in a polypeptide chain Definition of the library models library Definition of the library versions lib-version Definition of the chemical types chemtype Definition of the different residue properties property Assignments of residue properties (e.g., hydrophobicity) res-prop Assignments of residue properties according to Taylor46 res-prop-t aylor Definition of the different properties used in the table resqrop-taylor prop-taylor Residue hydropathy scales according to Cornette et al.47 hydropathy Definition of the hydropathy scales of the table hydropathy hydropathy-basic Definition of residue classes according to the type of bond between two res-binding-class residues Residue Structure and Energy Parameters General description of the residues as a function of their type res-type List of atoms for each residue res-desc Dihedral angles definition for each residue res-coneangle Description of residue bond connectivity res-graph Description of the interresidue connections; each row contains a bond between resres-connections two atoms of different residues General description of the library model of residues as a function of their type res-lib-type Chemical types of each atom in a residue as a function of the residue model res-lib-desc List of dihedral angles that can be computed for each residue according to the res-lib-conv-angle residue model Atomic charges of the residues as a function of the residue model and the res-lib-Val library version Atomic parameters (e.g., radii) for each chemical type as a function of the chemtype-Val residue model and the library version van der Waals energy parameters for the interaction between two atoms of vdw given chemical type Parameters for the hydrogen bond energy between two atoms 
of given hbond chemical type Parameters for the covalent bond energy between two atoms of given bond chemical type Parameters for the bond angle energy between two atoms of given chemical bondangle type Parameters for the torsion energy between two atom of given chemical type torsion Parameters for the planarity deformation energy for a given central atom and planarity another one of given chemical type Parameters for the tetragonality deformation energy for atoms of given tetragonality chemical type *The name and a brief description of the contents are given for each table.

protein-ligand interactions, and H-bonds are also defined on the atomic level and stored in a separate table. In the following we highlight specific design aspects of this section. Representation of the residue sequence. An important feature of the database design is that heterogroups, nucleotide sequences, and water molecules are stored along with amino acid sequences. For each macromolecule first the amino acid and/or nucleotide sequence is stored, followed by heterogroups and water molecules if any of these exists. The position of the residues in the sequence is of utmost importance for the manipulation of the macromole-

cule structure. Each residue is therefore assigned a unique sequence position identifier, which is an extension of the sequence numbering used in the Brookhaven Databank. It is composed of the following four elements: (1) a one-letter code indicating the chain class; possible values are “ ” (a blank character) when dealing with amino acid sequence, “N”, nucleotide sequence, “W’, water molecules, “ H , heterogroups; (2) a chain identifier (one-letter code), (3) a residue sequence number that always increases

starting from the N-terminus for proteins or the

5’-

terminus for nucleic acids; this number can have negative values as numbering schemes can be estab-

65


{ ;

cryst

‘.

refinement

;

remark

transmat

;

lit

crystl

:

Macromolecule

instance

:

Sequence

inst-conformations

I

,____________________. ;

Atoms

I

Fig. 2. Diagram showing the macromolecule description tables of the database. Each table is indicated by a box labeled with the table name. A succinct description of the tables and their contents can be found in Table I. The lines between the boxes indicate the most direct relationship between the tables. It is of

course possible to cross-correlate (join) information between any set of tables of the diagram. The hierarchical nature of the protein architecture is reflected in the three levels, Macromolecule, Sequence, and Atoms.

lished relative to a homologous protein, and (4) a one-letter code indicating residue insertion. The last three elements make up the residue sequence number used in the Brookhaven Databank. They are stored together in a single alphanumeric field in the database. The first element, the chain class, is introduced to avoid ambiguity when representing amino acids, nucleotides, heterogroups, and water molecules in the same table; it is stored in a separate field. The sequence position identifier, although unique for each residue, cannot be used to order the sequence because the third element may take negative values. Therefore, as already mentioned, a serial number attribute is assigned to each residue, starting at 1 for each macromolecule description and incrementing by 1 along the sequence, heterogroups and water molecules included.

Completeness of the macromolecule description. The present database design includes a number of flags for each macromolecule description; they indicate the amount of information extracted from the experimental study. The flags are situated at different levels: at the atomic level, indicating presence or absence of the atomic coordinates; at the sequence level, indicating the number of atoms in the residue and the completeness of the backbone atoms. These flags moreover make it possible to distinguish macromolecules for which no three-dimensional structure is available from those for which the three-dimensional structure is available. This is particularly useful when dealing with families of evolutionarily related proteins. Several other flags are included in the database indicating, e.g., the residue model to be used when analyzing the macromolecule structure.

Loading several conformations or descriptions for a given macromolecule. An important use for a database for macromolecule structure and sequence is as an archive for research results. This implies the storage of several descriptions for a given macromolecule, such as the results from a molecular dynamics simulation, from different stages of crystallographic refinement, or a set of structures that comply with distance constraints derived from NMR experiments. These descriptions may differ in the conformation of the macromolecule, resulting in changes in all tables containing information based on atomic coordinates. Those tables therefore include a “conformation number” attribute. A summary of the conformations loaded for each macromolecule is provided. When macromolecule descriptions also differ in their sequence (mutant or variant), separate descriptions must be loaded in the database. Individual descriptions are treated as different macromolecules, and are therefore assigned a specific “macromolecule identifier.” This identifier is used to reference unequivocally the macromolecule in all relevant tables.
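The need for a separate serial number can be illustrated with a minimal Python sketch. It is not taken from SESAM: the `Residue` class, the `assign_serial_numbers` function, and the toy identifiers are hypothetical, and the identifier is simplified to chain identifier, position number, and insertion code written as one string. The sketch only shows that sorting such identifiers does not reproduce the order along the chain, whereas a serial number assigned in entry order does.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Residue:
    chain_class: str  # e.g. "aa", "het", "hoh"; kept in its own field
    seq_id: str       # simplified identifier: chain, position number, insertion code
    name: str


def assign_serial_numbers(residues: List[Residue]) -> List[Tuple[int, Residue]]:
    """Number residues 1..N in the order in which they appear in the entry."""
    return [(serial, res) for serial, res in enumerate(residues, start=1)]


# Toy entry (hypothetical): negative position numbers and an insertion code.
entry = [
    Residue("aa", "A-2", "MET"),
    Residue("aa", "A-1", "GLY"),
    Residue("aa", "A1", "SER"),
    Residue("aa", "A1A", "THR"),  # insertion code "A"
    Residue("aa", "A2", "LEU"),
    Residue("hoh", "W1", "HOH"),  # water molecules are numbered as well
]

numbered = assign_serial_numbers(entry)
entry_order = [res.seq_id for _, res in numbered]
sorted_by_identifier = sorted(res.seq_id for res in entry)
```

Here `entry_order` follows the chain, while `sorted_by_identifier` places "A-1" before "A-2", which is why the identifier alone cannot serve as an ordering key.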

The TOPOLOGICAL INFORMATION section
Presently, this section contains the description in terms of chains for each macromolecule, limits of structural domains obtained using the algorithm of Wodak and Janin, limits and geometric descriptions (in terms of length and direction of least squares axes) of secondary structures derived from the DSSP assignments, and those extracted from the Brookhaven Databank. It also contains survey results on sequence patterns that characterize local structure motifs, which are either the usual secondary structures or families of short fragments that adopt similar conformations.23,39,40

The HIGHER LEVEL CLASSIFICATION section
This section stores information concerning groups of macromolecules. For proteins, this includes families of molecules related by evolution or grouped according to various other criteria. The description of a protein family includes an identifier of the family as a whole, identifiers of family members, and a description of the protein core defined as the structural parts which are most conserved in proteins within a family.31,41,42 The latter information is particularly useful when building the model of a protein having a sequence homologous to a protein of known 3D structure. Alternative descriptions of sets of proteins, or protein fragments, that comply with specific conditions such as a specified range in resolution or sequence identity, are also available. This section contains, in addition, information on sequence homology between two proteins obtained from sequence alignment programs, as well as multiple aligned sequences obtained from several sources. These aligned sequences also include entries, obtained from surveys of sequence databanks such as GenBank, SWISS-PROT, or PIR/NBRF, for which there is no corresponding structural information.
The BUILDING BLOCKS section
This section contains data from the “dictionary” files used by the molecular modelling package BRUGEL.27 Two main parts of these files are represented in the database: the definition of residue structure and topology and the values for energy parameters used in molecular mechanics and dynamics simulations. The Building Blocks section also includes tables with definitions of all basic symbols and conventions for atomic and molecular descriptions used in the database. This section contains, in addition, data from the literature such as residue properties as defined by Taylor,46 hydropathy scales from Cornette et al.,47 as well as the average solvent-accessible surface area of residues in proteins of known structure.

Residue structure definitions. Residues and heterogroups are described at the “physical” level by designating the atoms that make up the residues and the covalent bonds connecting the atoms. Several “residue types” are distinguished according to the position of the residue at the N-terminus, C-terminus, or middle of the polypeptide chain. Other physical properties include connections between residues and definition of conventional dihedral angles (φ, ψ, χ1, and others). Finally, each residue is assigned a “residue class”: “aa” (amino acid), “nuc” (nucleotide), “het” (heterogroup), “hoh” (water molecule), “pre” (prefix, indicating chemical groups at the beginning of a polypeptide chain), “suf” (suffix, i.e., group added at the end of a polypeptide chain). All this, however, is not sufficient to ensure the completeness of the molecular description since it may vary according to the level of detail used to represent the structure. Certain numerical simulations (as well as structural information from neutron diffraction studies) require a very detailed molecular model with explicit positions for all hydrogen atoms including those bound to aliphatic carbons. Other analyses use only positions of hydrogens attached to polar atoms, or even no hydrogens at all. Descriptions of these three levels of detail are therefore included in the database. For each one, the atoms that make up the residues as well as the applicable bonds and torsion angles are given.

Energy parameters. Forces and interactions between atoms in systems of biological macromolecules can be computed using classical empirical energy functions, with parameters derived from the analysis of known structures, mostly small organic molecules.32 These functions are expressed as sums of pairwise interactions between atoms, and different energy parameters are assigned to atoms of different chemical types. To efficiently store energy parameters associated for example with van der Waals interactions, or bond stretching, atoms are grouped according to their chemical type, and atoms with the same chemical type are assigned identical energy parameters. Thus, the latter need to be stored only once for each chemical type, avoiding unnecessary redundancy. An additional complication arises from the fact that there is little consensus on the values of the energy parameters, and different laboratories often tend to use different parameter sets. In the database, these are referred to as library versions, and every table of energy parameters includes two columns indicating both the residue model (level of detail for hydrogen representation; see above) and the library version to which it corresponds.
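The storage scheme for energy parameters can be sketched as follows, using Python's built-in `sqlite3` module rather than SYBASE. Everything in the sketch is an assumption made for illustration: the table names `atom_type` and `vdw_param`, the column names other than `res_ext_code` and `iupac_symbol`, the chemical type label, the library version labels, and above all the numerical values, which are placeholders and not real force-field parameters.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical table: chemical type of each atom of each residue, per residue model.
CREATE TABLE atom_type (
    res_ext_code  TEXT,
    iupac_symbol  TEXT,
    residue_model TEXT,
    chem_type     TEXT,
    PRIMARY KEY (res_ext_code, iupac_symbol, residue_model)
);
-- Hypothetical table: one parameter row per chemical type, residue model,
-- and library version, so values are never repeated per atom.
CREATE TABLE vdw_param (
    chem_type       TEXT,
    residue_model   TEXT,
    library_version TEXT,
    radius          REAL,
    well_depth      REAL,
    PRIMARY KEY (chem_type, residue_model, library_version)
);
""")

con.executemany("INSERT INTO atom_type VALUES (?, ?, ?, ?)", [
    ("ALA", "CB", "polar_h", "C_aliph"),
    ("VAL", "CB", "polar_h", "C_aliph"),
])
# Placeholder numbers only; they do not come from any published parameter set.
con.executemany("INSERT INTO vdw_param VALUES (?, ?, ?, ?, ?)", [
    ("C_aliph", "polar_h", "lib_A", 1.0, 0.1),
    ("C_aliph", "polar_h", "lib_B", 2.0, 0.2),
])


def vdw_for(res, atom, model, library):
    """Look up the parameters of one atom through its chemical type."""
    return con.execute("""
        SELECT p.radius, p.well_depth
        FROM atom_type t JOIN vdw_param p
          ON p.chem_type = t.chem_type AND p.residue_model = t.residue_model
        WHERE t.res_ext_code = ? AND t.iupac_symbol = ?
          AND t.residue_model = ? AND p.library_version = ?
    """, (res, atom, model, library)).fetchone()
```

Both `vdw_for("ALA", "CB", "polar_h", "lib_A")` and `vdw_for("VAL", "CB", "polar_h", "lib_A")` resolve to the same stored row, and switching the library version selects another parameter set without touching the atom definitions.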



Implementation Aspects

The full database schema as described above has been implemented under the SYBASE Relational Database Management System. The database tables containing information on 224 proteins loaded from the Brookhaven Databank occupy 70 megabytes, which on average amounts to 0.875 kilobytes per residue. Additional space is however required, among other things, for SYBASE internal work space as well as for keeping a record of all the changes made to the database. The efficiency of accessing the database has been measured under an implementation on a SUN SPARC 2 computer, by executing a set of four queries with an increasing number of joined tables, as illustrated in Table III. These queries extract information on atoms and residues from catalase, one of the largest proteins in the database. The first query extracts all the atomic coordinates in 12 sec. Other queries extract in addition information such as residue names and properties in 19 to 24 sec, according to the complexity of the query, which is determined by the number of joined tables. We have also established that database size has no notable influence on performance (data not shown). This is primarily due to the use of properly defined indexes, which are sets of pointers to rows defined on the basis of the values in any set of columns in a table. As expected, these indexes considerably speed up the access to individual data items, by avoiding complete table scans. Thanks to the fact that SQL is a relatively standardized database description and query language shared by various management systems, prototypes of SESAM were readily implemented under other Relational Database Systems such as ORACLE or SUN-Unify.
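The effect of an index on a query of the first type can be reproduced in miniature with Python's `sqlite3` module. This is only a sketch under stated assumptions: SQLite stands in for SYBASE, the table is filled with synthetic rows, the protein codes of the form `P042` and the index name `atoms_inst` are invented, and the query planner output is specific to SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE atoms (inst_id TEXT, res_pos_nr INTEGER, "
    "iupac_symbol TEXT, x REAL, y REAL, z REAL)"
)
# Synthetic content: 200 hypothetical proteins with 100 atoms each.
rows = [(f"P{p:03d}", r, "CA", 0.0, 0.0, 0.0)
        for p in range(200) for r in range(1, 101)]
con.executemany("INSERT INTO atoms VALUES (?, ?, ?, ?, ?, ?)", rows)


def plan(sql):
    """Return SQLite's description of how it would execute the query."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))


query = ("SELECT res_pos_nr, iupac_symbol, x, y, z "
         "FROM atoms WHERE inst_id = 'P042'")

plan_without_index = plan(query)   # reports a full scan of the table
con.execute("CREATE INDEX atoms_inst ON atoms (inst_id)")
plan_with_index = plan(query)      # reports a search using atoms_inst

n_rows = len(con.execute(query).fetchall())
```

Without the index the planner must read every row of `atoms`; with it, only the rows of the requested protein are visited, which is consistent with the observation that database size has little influence on access time once suitable indexes exist.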

INTERFACES
A crucial aspect in using SESAM is the means available to the user for accessing the database. Commercial packages offer general purpose interfaces; they are of two types: an SQL interpreter and a library of routines in FORTRAN and C. The first type of interface is rather primitive. Its syntax is simple but it requires detailed knowledge of the database schema and table names, which are of no concern to the everyday user. A set of higher level interfaces has therefore been developed using the SYBASE routine library. The interfaces for which prototypes are presently available are summarized in Figure 3. ISQL and DWB are interfaces available in SYBASE; LOAD is the set of programs that load data into SESAM; BRUGEL is the graphics package through which the database can be accessed; ALI (A command Line Interpreter) is a dedicated interface that provides simple and powerful access to the database. An intermediate layer, the QUERY BUILDER, automatically builds complete and syntactically correct SQL queries from simple commands issued by the user, thereby eliminating formulation of complex join clauses. To issue these commands all that is necessary is to know the names of the fields that contain the requested information but not how they are distributed among the database tables. Changes in database organization (schema) therefore do not affect the manner in which it is interrogated, provided the contents and names of data fields remain unchanged. The commands themselves are handled at an upper level by ALI. One of the useful features of ALI is that it allows queries to be progressively refined according to answers from previous queries, as illustrated in the examples given in Figure 4. An exhaustive list of proteins is extracted from SESAM by specifying the fields inst_id and name, referring, respectively, to the Brookhaven code and protein name, using the column command issued in Figure 4A. This list can be reduced in a subsequent pass (Fig. 4B) by issuing the command protein, which specifies the individual proteins to which the search operation defined in Figure 4A should be restricted. The condition command formulates SQL-like condition clauses. For example, the condition res_code = 'G', in Figure 4C, states that one is looking for all instances of a glycine residue in the protein subset defined at that stage. The build command displays the full SQL query constructed by the QUERY BUILDER as illustrated in Figure 4D,E, which shows that the appropriate joins and join conditions have been generated from the commands issued by ALI. The user can moreover directly edit the generated SQL commands or further modify the query by issuing other ALI commands. Furthermore, ALI allows the database to be searched for specific sequence patterns that occur in a defined structural context and to show the context of the patterns found.
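The idea behind the QUERY BUILDER, namely that the user names fields and the software works out tables and join conditions, can be sketched in a few lines of Python. This is not the SESAM implementation: the dictionaries `FIELD_TABLE` and `JOINS` and the function `build_query` are hypothetical, they cover only two tables, and the SQL is assembled by plain string formatting, which would be unsafe with untrusted input.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical catalogue: the table that holds each field.
FIELD_TABLE: Dict[str, str] = {
    "inst_id": "instance",
    "name": "instance",
    "res_hist_number": "sequence",
    "res_code": "sequence",
}
# Hypothetical join rule for each pair of tables that can be combined.
JOINS: Dict[Tuple[str, str], str] = {
    ("instance", "sequence"): "instance.inst_id = sequence.inst_id",
}


def build_query(columns: Iterable[str],
                proteins: Iterable[str] = (),
                conditions: Iterable[Tuple[str, str, str]] = ()) -> str:
    """Build a SELECT statement from field names only.

    conditions holds (field, operator, SQL literal) triples.
    """
    tables: List[str] = []

    def qualify(field: str) -> str:
        table = FIELD_TABLE[field]
        if table not in tables:
            tables.append(table)  # remember every table the query touches
        return f"{table}.{field}"

    select = [qualify(c) for c in columns]
    where = [f"{qualify(f)} {op} {lit}" for f, op, lit in conditions]
    proteins = list(proteins)
    if proteins:
        codes = ", ".join(f"'{p.upper()}'" for p in proteins)
        where.append(f"{qualify('inst_id')} IN ({codes})")
    # Join conditions are added automatically for the tables actually used.
    joins = [clause for (a, b), clause in JOINS.items()
             if a in tables and b in tables]
    sql = f"SELECT {', '.join(select)} FROM {', '.join(tables)}"
    if joins or where:
        sql += " WHERE " + " AND ".join(joins + where)
    return sql


sql = build_query(["inst_id", "name", "res_hist_number"],
                  proteins=["1mbd", "4hhb", "1ecd"],
                  conditions=[("res_code", "=", "'G'")])
```

With these inputs, which mimic the column, protein, and condition commands of Figure 4A to C, `sql` contains the join condition between instance and sequence even though the caller never mentioned a table name.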
In the example of Figure 4E, the database is surveyed for the occurrences of the pattern Gxxbαx++, where G stands for glycine, b stands for any buried residue, α stands for a helical residue, + is a residue with a positive backbone φ angle, and x stands for any of the naturally occurring amino acids. Formulating such a pattern search with classical SQL leads to very intricate queries involving multiple join operations (see legend of Fig. 4); execution times for such SQL queries crucially depend on pattern complexity because they require self-joining of the table sequence as many times as there are conditions formulated on specific sequence positions in the pattern. ALI not only alleviates the task of formulating the queries but also processes them much more efficiently by using the buffering option of the SYBASE C-interface, which partially circumvents the usual SQL procedures. Because of this option it can randomly access any number of data rows that are retrieved in a well-defined


TABLE III. Influence of Joins on Performance*

Query 1. Elapsed time (sec): 12
select res_pos_nr, iupac_symbol, x, y, z
from atoms
where inst_id = '8CAT'

Query 2. Elapsed time (sec): 19
select res_hist_number, res_ext_code, iupac_symbol, x, y, z
from sequence, atoms
where atoms.inst_id = '8CAT'
and sequence.inst_id = atoms.inst_id
and sequence.res_pos_nr = atoms.res_pos_nr
and sequence.conf_id = atoms.conf_id

Query 3. Elapsed time (sec): 21
select instance_id, res_hist_number, res_ext_code, iupac_symbol, x, y, z
from instance, sequence, atoms
where atoms.inst_id = '8CAT'
and sequence.inst_id = atoms.inst_id
and instance.inst_id = atoms.inst_id
and sequence.res_pos_nr = atoms.res_pos_nr
and sequence.conf_id = atoms.conf_id

Query 4. Elapsed time (sec): 24
select instance_id, res_hist_number, hydrophobic, sequence.res_ext_code, iupac_symbol, x, y, z
from res_property, instance, sequence, atoms
where atoms.inst_id = '8CAT'
and sequence.inst_id = atoms.inst_id
and instance.inst_id = atoms.inst_id
and sequence.res_pos_nr = atoms.res_pos_nr
and sequence.conf_id = atoms.conf_id
and res_property.res_ext_code = sequence.res_ext_code

*The queries, which are numbered according to the number of tables they invoke, extract information for catalase (8CAT), one of the largest proteins in the database. All the queries use the table containing the atomic coordinates (table atoms), which is the largest table of the database. Query 1 extracts the residue serial number (res_pos_nr), the IUPAC symbol (iupac_symbol), and the atomic coordinates (x, y, z) for all the atoms of the protein, which is identified by its Brookhaven Databank code (inst_id = '8CAT'). This query is progressively extended by extracting information spread over different tables. Query 2 extracts in addition for each residue its sequence number (res_hist_number, including the chain identifier, a position number, and an insertion code) and its name in three-letter code (res_ext_code) from the table sequence. Query 3 uses the table instance to extract a unique ordinal load number specific to each protein (instance_id). Finally, query 4 uses the table res_property to determine for each residue its hydrophobic character (1 if it is hydrophobic, 0 otherwise). The join conditions, in queries 2 to 4, ensure that properly related information is extracted from the different tables: for example, sequence.inst_id = atoms.inst_id means that only rows having the same inst_id, i.e., belonging to the same protein, will be extracted from tables sequence and atoms. The total number of rows extracted by each of these queries is 8,032. The set of queries has been executed on SUN SPARC 2 hardware (with 16 megabytes of central memory) under SYBASE with the database residing in a raw partition.
Elapsed time, the time it takes between sending the query and getting the results, averaged over 6 executions, is given in seconds for the database containing the following 224 proteins: 155C, 156B, 1ACX, 1ALC, 1AMT, 1AZU, 1BDS, 1BP2, 1CAC, 1CC5, 1CCR, 1CHG, 1CHO, 1CN1, 1CPV, 1CRN, 1CSE, 1CTF, 1CTS, 1CTX, 1CY3, 1CYC, 1ECA, 1ECD, 1ECN, 1ECO, 1EST, 1ETU, 1F19, 1FB4, 1FBJ, 1FC1, 1FC2, 1FD2, 1FDH, 1FDX, 1FVB, 1FVW, 1FX1, 1FXB, 1GCN, 1GCR, 1GD1, 1GF1, 1GF2, 1GOX, 1GP1, 1GPD, 1HBS, 1HCO, 1HDS, 1HFM, 1HHO, 1HIP, 1HKG, 1HMG, 1HMZ, 1HNE, 1HOE, 1HVP, 1IG2, 1IGE, 1INS, 1L01, 1L02, 1L04, 1L05, 1L06, 1L07, 1L08, 1L09, 1L10, 1L11, 1L12, 1L13, 1L14, 1L15, 1L16, 1LDB, 1LDM, 1LDX, 1LH1, 1LH2, 1LH3, 1LH4, 1LH5, 1LH6, 1LH7, 1LLC, 1LYM, 1LYZ, 1LZ1, 1LZT, 1MB5, 1MBC, 1MBD, 1MBN, 1MBO, 1MBS, 1MCP, 1MEV, 1MLP, 1MLT, 1NTP, 1NXB, 1OVO, 1P2P, 1PAD, 1PAZ, 1PCY, 1PFC, 1PFK, 1PHH, 1PP2, 1PPD, 1PPT, 1PRC, 1PSG, 1PYP, 1R69, 1RBB, 1RDG, 1RE1, 1RHD, 1RLX, 1RN3, 1RNS, 1RNT, 1RS1, 1RSM, 1SBC, 1SBT, 1SGC, 1SGT, 1SN3, 1TEC, 1TGB, 1TGC, 1TGN, 1TGS, 1TGT, 1TIM, 1TLP, 1TMN, 1TON, 1TPA, 1TPO, 1TPP, 1TRM, 1UBQ, 1UTG, 1WRP, 1WSY, 1XY1, 1XY2, 2AAT, 2ABX, 2ACT, 2ALP, 2APP, 2APR, 2ATC, 2AZA, 2B5C, 2BP2, 2C2C, 2CAB, 2CCY, 2CDV, 2CNA, 2CR0, 2CTS, 2CYP, 2EST, 2FB4, 2GN5, 2LBP, 2LH4, 2LHB, 2LZM, 2MDH, 2MT2, 2OVO, 2PAB, 2PKA, 2SEC, 2SGA, 2SNS, 2SOD, 2STV, 2TS1, 2YHX, 3ADK, 3C2C, 3DFR, 3EST, 3FD1, 3FXC, 3GRS, 3ICB, 3RP2, 3SGB, 3TLN, 3WGA, 451C, 4ADH, 4APE, 4FD1, 4FXN, 4HHB, 4LDH, 4MDH, 4RXN, 4TNC, 4TS1, 5CHA, 5CPA, 5PT1, 6LDH, 7RSA, 8ADH, 8CAT, 9PAP.

physical order due to appropriate table indexes, and stored in a temporary buffer that is continuously updated as SESAM is scanned. Inspection of the constructed SQL query of Figure 4E shows indeed that the table sequence is used only once and that no clause specifying the pattern is present: the table is

completely scanned and ALI itself performs the selection of the successive rows verifying the pattern. For the query in Figure 4E, which requires scanning 224 proteins or 79,234 rows, ALI provides the answer in 7 min 27 sec on a SUN 3/50 workstation with 4 megabytes of central memory, while the response


Fig. 3. Interfaces to SESAM. The KERNEL refers to the relational database core; it is primarily accessed in SQL (Structured Query Language), through a query processor. The commercial package SYBASE provides two user interfaces: ISQL, an interactive program that issues SQL queries, and DWB, a window based SQL interface. LOAD is the set of programs used to load into the database information extracted from the different data sources. Different commands implemented in the modeling package BRUGEL allow loading and checking data as well as querying the databases (see Fig. 5). The QUERY BUILDER automatically builds complete SQL queries from simple commands. Such commands can be issued through ALI, A command Line Interpreter, either interactively, or in batch mode. OTHER covers all the other dedicated interfaces that are under development, such as a module to create files having the Brookhaven format and containing specific data extracted from SESAM, and the specialized programs accessing SESAM. The arrowheads show the main flow of information between the layers.

time with the usual SQL queries is over 24 hr on the same hardware. Similar queries performed under an implementation on the faster SUN SPARC 2 machine (with 16 megabytes memory) are executed in about 40 sec via ALI, independently of the complexity of the pattern to be searched. Such performance together with the increased ease in formulating queries make it possible to interrogate SESAM directly and interactively in many applications.

SESAM can also be accessed by the molecular graphics package BRUGEL. This is done using specific BRUGEL commands that access the database either directly via SQL or through ALI and the QUERY BUILDER, as illustrated in Figure 5. The BRUGEL command DB-VALIDATE checks and, if necessary, corrects the contents of the database for individual protein entries. It reads the data for the protein entry from the Brookhaven Databank, performs all the validation steps previously described, and, after comparison, updates SESAM if necessary. This command is also used to compute and enter into SESAM many of the derived data such as dihedral angles, solvent accessible areas and volumes of residues and atoms, definitions of secondary structure elements, and their geometrical characteristics. DB-VALIDATE can be activated during the loading stage, or independently, during an interactive BRUGEL session. Another important and frequent use of the BRUGEL interface is to graphically display information stored in the database. For example, atomic coordinates of one or more proteins or structural fragments can be retrieved using the DB-COORD command (Fig. 5b) and then displayed using available options in BRUGEL. Subsets of atoms corresponding to secondary structure elements, to structural domains, to specified intervals of φ,ψ angles, etc. (Fig. 5c,d), can be selected and stored as logical objects that BRUGEL manipulates27 for display (different coloring) and computational purposes (evaluation of accessible surface areas, and energies). It is also possible to interactively create tables in BRUGEL that contain values retrieved from SESAM (Fig. 5e), and subsequently manipulate them as logical objects for analysis purposes.
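The single-pass pattern search attributed to ALI earlier in this section (one ordered scan of the table sequence, with the selection done on a temporary buffer instead of repeated self-joins) can be sketched in Python. The sketch is an assumption about the general technique, not a description of ALI's code: the function `scan_pattern`, the helper `make_rows`, and the toy data are hypothetical, and the per-position conditions follow the SQL query given in the legend of Fig. 4 (glycine at position 0, accessibility below 10 at position 2, helix at position 3, positive φ at position 5).

```python
from collections import deque
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan_pattern(rows: Iterable[Row], length: int,
                 conditions: Dict[int, Callable[[Row], bool]]) -> Iterator[List[Row]]:
    """Yield each run of `length` consecutive residues meeting all conditions.

    rows must arrive ordered by protein, conformation, and position,
    which is what a suitable index provides.
    """
    window: deque = deque(maxlen=length)  # the temporary buffer
    for row in rows:
        if window:
            last = window[-1]
            contiguous = (row["inst_id"] == last["inst_id"]
                          and row["conf_id"] == last["conf_id"]
                          and row["pos_in_chain"] == last["pos_in_chain"] + 1)
            if not contiguous:
                window.clear()  # never match across proteins or chain breaks
        window.append(row)
        if len(window) == length and all(
                test(window[i]) for i, test in conditions.items()):
            yield list(window)


def make_rows(inst_id: str, seq: str, acc, ss: str, phi) -> List[Row]:
    """Build toy rows with the field names used in the Fig. 4 legend."""
    return [{"inst_id": inst_id, "conf_id": 1, "pos_in_chain": i + 1,
             "res_code": seq[i], "r_surf_acc_solv": acc[i],
             "sec_struct": ss[i], "phi": phi[i]} for i in range(len(seq))]


pattern = {
    0: lambda r: r["res_code"] == "G",
    2: lambda r: r["r_surf_acc_solv"] < 10,
    3: lambda r: r["sec_struct"] == "H",
    5: lambda r: r["phi"] > 0,
}

# Invented data for two hypothetical entries.
rows = (make_rows("TOY1", "AGKMSGLA", [50, 40, 30, 5, 20, 60, 15, 80],
                  "CHHHHCCC", [-60, -60, -60, -60, -60, 60, 55, -70])
        + make_rows("TOY2", "GA", [50, 50], "CC", [-60, -60]))

hits = list(scan_pattern(rows, 6, pattern))
found = ["".join(str(r["res_code"]) for r in hit) for hit in hits]
```

Each row is read once whatever the number of conditions, so the cost of the scan does not grow with pattern complexity in the way a chain of self-joins does.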


[Fig. 4: screen output of an ALI session, panels A to E. Panels A to C show the lists returned by the column, protein, and condition commands. Panel D shows the query displayed by the build command: SELECT instance.inst_id, instance.name, sequence.res_hist_number FROM instance, sequence WHERE instance.inst_id = sequence.inst_id AND sequence.res_code = 'G' AND instance.inst_id IN ('1MBD','4HHB','1ECD'). Panel E shows the pattern search and the query displayed by the build command: SELECT sequence.inst_id, sequence.res_hist_number, sequence.res_code, sequence.pos_in_chain, sequence.r_surf_acc_solv, sequence.sec_struct, sequence.phi FROM sequence.]

Fig. 4. Accessing SESAM through ALI, A Line Interpreter. ALI commands (underlined text) are issued after the prompt >. (A) The command column inst_id,name requests the list of all the proteins identified by the Brookhaven code (field inst_id) and their name (field name). (B) The above list is restricted to three proteins (sperm whale myoglobin, human hemoglobin, chironomous hemoglobin) by the command protein 1mbd,4hhb,1ecd. (C) The command condition res_code = 'G' queries for the occurrences of glycine residues in the currently specified set of proteins, for which the position is also requested by the command column res_hist_number. (D) The build command allows the complete SQL query generated by ALI to be viewed. At this stage two tables are used, instance and sequence, from which the fields inst_id, name, and res_hist_number are extracted; the rows actually selected are those corresponding to the conditions imposed by the ALI commands condition and protein; note that the join condition, instance.inst_id = sequence.inst_id, has been generated automatically. (E) The command clear protein removes the restriction on the proteins to be scanned; the command clear column removes the list of fields to be viewed; the three condition commands specify the sequence pattern to be searched by restricting the values of specific fields at given positions along the sequence. Note that the command condition res_code = 'G' specified in (C) still holds, and that it could also be written as condition 0:res_code = 'G', which means that the first residue of the requested sequence pattern (position 0) must be a glycine. The following command, column 0:5:res_code ..., allows viewing, for each pattern occurrence in the database, also the five residues that follow the glycine (by their one-letter amino acid code), together with the corresponding secondary structure assignment.
For the sake of comparison we give, in the following, the query that needs to be formulated in classical SQL to extract the information specified in (E):
select a.inst_id, a.res_hist_number, a.res_code, b.res_code, c.res_code, d.res_code, e.res_code, f.res_code, a.sec_struct, b.sec_struct, c.sec_struct, d.sec_struct, e.sec_struct, f.sec_struct
from sequence a, sequence b, sequence c, sequence d, sequence e, sequence f
where a.inst_id = b.inst_id and b.inst_id = c.inst_id and c.inst_id = d.inst_id and d.inst_id = e.inst_id and e.inst_id = f.inst_id
and a.conf_id = b.conf_id and b.conf_id = c.conf_id and c.conf_id = d.conf_id and d.conf_id = e.conf_id and e.conf_id = f.conf_id
and b.res_pos_nr = a.res_pos_nr + 1 and c.res_pos_nr = a.res_pos_nr + 2 and d.res_pos_nr = a.res_pos_nr + 3 and e.res_pos_nr = a.res_pos_nr + 4 and f.res_pos_nr = a.res_pos_nr + 5
and b.pos_in_chain = a.pos_in_chain + 1 and c.pos_in_chain = a.pos_in_chain + 2 and d.pos_in_chain = a.pos_in_chain + 3 and e.pos_in_chain = a.pos_in_chain + 4 and f.pos_in_chain = a.pos_in_chain + 5
and a.res_code = 'G' and c.r_surf_acc_solv < 10 and d.sec_struct = 'H' and f.phi > 0

APPLICATIONS
SESAM has already been very instrumental in several studies. In these studies, data from SESAM were accessed either directly (mainly in interactive applications) or via transit on flat files. The latter procedure was used to gain efficiency in applications that require surveying a large number of proteins or only a limited subset of the data stored in SESAM. A good example of such an application is a study of the relationships between amino acid sequence and secondary structure in proteins. This study showed that characteristic relationships between short amino acid sequence patterns and protein secondary structures can be found by searching the database, and that sequence motifs of high predictive value can locally characterize secondary structure in proteins. It was, however, determined that at present our ability to find these motifs is limited by the size of the available databases. In another study it was shown that most amino acid sequence templates derived from recurrent short folding motifs in proteins are not specific enough to be predictive by themselves. In both studies information on sequence, atomic coordinates, and secondary structure assignments stored in the database was accessed either directly, in interactive surveys of the territory of a given sequence pattern in the database, or through transit on flat files in batch mode procedures that perform systematic surveys for many sequence patterns. Moreover, characteristic sequence-structure associations were stored in specific database tables along with information on their location in the different proteins, so that they can be readily used for prediction purposes.

More recently, an automatic method for classifying recurrent motifs in protein structures was proposed.39 This method applies a classical clustering algorithm that operates on distances between selected backbone atoms. In one application all protein fragments of fixed length (4, 5, 6, 7 residues, respectively) were clustered into only 4 structural classes. The relevance of such classification for protein structure prediction was tested23 by deriving amino acid patterns that characterize these classes and comparing their predictive values with those of patterns that characterize the usual 4 classes of secondary structure: α, β, turn, coil. In a second application, the clustering procedure was applied to give an exhaustive and objective description of highly specific families of local structure formed by protein fragments containing 7 residues and, more recently, 6 and 10 residues, respectively. This study required information on protein coordinates, amino acid sequence, values of φ,ψ angles, and secondary structure assignments. The distance based clustering was performed on selected backbone atoms whose coordinates were retrieved from the database and stored in flat files. In a subsequent step, fragment classes were further subdivided by assigning their φ,ψ angles to allowed regions of the Ramachandran map, resulting in a more discrete description of structural families. For this step, φ,ψ information was directly extracted from the database.

The specific families of local structure, containing on average 50 members and displaying appreciable sequence variability, are presently analyzed by interactive graphics using BRUGEL via the QUERY BUILDER. The aim is to obtain a detailed description of the interactions between residues in each fragment, and to rationalize the role of these interactions in determining observed fragment conformations, in an effort to generalize the approach described in the recent studies of antigen binding loops in the immunoglobulins. This requires, first, flexible attribution of complicated coloring schemes, performed using a combination of options in BRUGEL and the QUERY BUILDER, to differentiate between fragments belonging to different proteins whenever displayed simultaneously, to highlight secondary structure elements, or different physical properties of side chains. Next it is necessary to analyze contributions from different conformational energy items such as van der Waals, electrostatic, and H-bonds, in the isolated fragments as well as when the fragments are embedded in their parent protein structures. van der Waals and Coulombic terms are evaluated interactively using routines written in BRUGEL. Information on H-bonds is retrieved from the database or computed interactively by BRUGEL (whenever it is necessary to flexibly modify H-bond parameters). Surface areas of residues in a given fragment, buried through interactions with other residues, either in the fragment or elsewhere in the parent protein, are also analyzed. This entails retrieving solvent accessibility values stored in the database as well as performing additional computations interactively using BRUGEL after flexibly including or deleting relevant parts of the analyzed structures.

Another pertinent example is the recently developed automatic procedure for building a polyalanine backbone from Cα positions using fragments belonging to proteins of known structure. These procedures rely on comparing inter-Cα distances of fragments from proteins in the database to those in a guide segment. In the latest implementation of one of the procedures in BRUGEL, Cα positions from a subset of proteins specified by the user are interactively retrieved from SESAM and stored in a flat file. This file is then accessed by the fragment retrieval routine, which computes the inter-Cα distances on the fly, circumventing the problem of storing large distance matrices at only a minor cost in computer time. Finally, after all overlapping fragments that best fit the guide positions have been identified, BRUGEL returns to the database to retrieve the corresponding full backbone structures.

Other related procedures concern the remodeling of protein loops to accommodate insertions or deletions. Indeed, modeling the backbone portion can be performed using loop structures from proteins in the database by an analogous procedure to that described above.52 This is particularly effective when the database contains several protein structures with sequences homologous to the protein to be modeled, as seen from results obtained for immunoglobulins, and suggested by our own experience while building the three-dimensional structure of two different domains of the covalent polymer globin chain of Artemia.53 For example, domain E, was built using the crystal structure of sperm whale myoglobin as a template, based on a sequence alignment with 19% identity.
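The fragment retrieval step described above, which compares inter-Cα distances of database fragments with those of a guide segment and computes the distances on the fly, can be sketched as follows. The source does not give the scoring function, so the root-mean-square difference used here is an assumption, as are the function names and the synthetic coordinates; `math.dist` requires Python 3.8 or later.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(ca: Sequence[Point]) -> List[List[float]]:
    """All pairwise distances between the C-alpha positions of one segment."""
    return [[math.dist(a, b) for b in ca] for a in ca]


def distance_rms(guide: Sequence[Point], fragment: Sequence[Point]) -> float:
    """RMS difference between the inter-C-alpha distances of two segments
    of equal length (assumed score; independent of rotation and translation)."""
    dg, df = distance_matrix(guide), distance_matrix(fragment)
    n = len(guide)
    sq = [(dg[i][j] - df[i][j]) ** 2 for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(sq) / len(sq))


def best_fragments(guide: Sequence[Point], chain: Sequence[Point],
                   top: int = 1) -> List[Tuple[float, int]]:
    """Score every window of the chain against the guide; distances are
    computed on the fly, so no distance matrix needs to be stored."""
    n = len(guide)
    scored = [(distance_rms(guide, chain[s:s + n]), s)
              for s in range(len(chain) - n + 1)]
    return sorted(scored)[:top]


# Synthetic chain with non-uniform geometry (not real coordinates).
chain = [(float(i), 0.1 * i * i, 0.0) for i in range(10)]
# Guide segment: residues 2..5 of the chain, translated rigidly.
guide = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in chain[2:6]]

score, start = best_fragments(guide, chain)[0]
```

Because only internal distances are compared, the translated guide is recognized at `start == 2` without any superposition step.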
In two regions, between helices C and D and between helices F and G, the sequence of the Artemia domain is one residue shorter than myoglobin. These deletions were readily modeled using equivalent regions in the crystal structure of Chironomous globin IIIA (erythrocruorin). Fragments from these regions, chosen as a result of a systematic search through the Cα coordinates of the entire database (which contains globins as well as non-globin structures), turned out to fit best. The choice was made by requiring good geometric fit between fragment backbones and regions in myoglobin immediately flanking the deletions, and on fragment length, so as to retrieve loops precisely one residue shorter than in myoglobin. The approach described above is completely general and can be used to model deletions and insertions in proteins for which there are no homologous structures in the database, a requirement that may arise in protein design experiments by site-directed mutagenesis. Our tests showed, however, that the chances of picking the correct loop conformation de-

[Figure 5 diagram: the BRUGEL commands DB-VALIDATE (protein code, flags; returns validation), DB-COORD (protein code, optional limits; returns coordinates, secondary structures, coloration), and STELE (protein code, flags; returns masks for each structural element and for all helices, sheets, turns, etc., axes for helices and β-strands, coloration) reach the database through the query builder, SQL, and the kernel. Other labels in the diagram: fragments, dihedral angles, solvent accessibility, structure elements.]

Fig. 5. Accessing SESAM through the modeling package BRUGEL. Examples of BRUGEL commands that access the database are given. Each command is described in a separate box; the name of the command is given in uppercase. Lines beginning with (>) refer to information provided by the user; lines beginning with (
