HUMAN GENOME PROTEIN FUNCTION DATABASE

Dean K. Sorenson, Ph.D.
Dept. Medical Informatics, University of Utah
Salt Lake City, Utah

0195-4210/91/$5.00 © 1992 AMIA, Inc.

Abstract

A database which focuses on the normal functions of the currently known protein products of the human genome was constructed. Information is stored as text, figures, tables, and diagrams. The program contains built-in functions to modify, update, categorize, hyperlink, search, create reports, and establish links to other databases. The semi-automated categorization feature of the database program was used to classify these proteins in terms of biomedical functions.

Introduction

In recent years, human genetics has become increasingly important in human health, i.e. in diagnosis (e.g. cystic fibrosis, muscular dystrophy, and certain forms of cancer), screening for diseases, treatment/management of disease, laboratory testing, genetic counseling, and medicolegal testing. Computer databases are becoming an increasingly essential tool in managing the enormous amount of new information becoming available in the fields of molecular biology and human genetics. Such information needs to be accessed by many kinds of users, e.g., researchers studying genes or gene products or diseases associated with genetic aberrations, clinicians trying to diagnose or treat patients, social workers counseling parents or parents-to-be, and students in various health-related disciplines attempting to understand general principles and the "big picture".

There is presently an international effort to map the human genome, i.e. to determine the nucleotide sequence of the genes and eventually the total content of more than 3 billion base pairs [1]. The task of identifying and sequencing the 50,000-100,000 human genes is expected to be completed in about 15 years, even though fewer than 2000 of these genes are presently known [2,3,4]. The initial goal is the completion of a high-quality map to locate the genes that cause the 4000 or so hereditary diseases. A partial map has already helped to narrow the search for genes that cause some of the relatively common genetic diseases such as cystic fibrosis, muscular dystrophy, Huntington's disease, and some forms of cancer [4,5]. A number of computer sequence and/or genetics databases have been developed to keep track of nucleic acid and protein sequences and genetic information at the national and international levels. These include the Human Gene Mapping Library, the GenBank sequence database, and the Genome Data Base [6,7].

Although DNA and protein sequences are currently of great interest, it is generally the protein products which are of importance for diagnosis, testing, management, and the design of possible treatment regimens for genetic diseases. Many genetic diseases are presently known to result from abnormal protein products, e.g., sickle-cell anemia, phenylketonuria, familial hypercholesterolemia, muscular dystrophy, and cystic fibrosis. According to Nobel laureate Walter Gilbert, "once the sequences are known there will be a paradigm shift in molecular biology with the focus of interest changing to the normal and abnormal functions of the genes themselves or their protein products" [1].

One database useful for genetic researchers and clinicians alike is OMIM, the online version of Mendelian Inheritance in Man [4,7]. MIM is a comprehensive compendium of all known human gene assignments and associated inherited disorders or anomalies. OMIM is organized by genetic disorders or traits, and each OMIM entry includes a unique MIM number, the name of the disorder or trait, clinical characteristics, observed inheritance patterns, and bibliographic references. Many entries also include additional information such as chromosomal location, allelic variants, biochemical analysis of defective gene products, and molecular genetics data. OMIM is a rich source of genetic information, clinical correlations, citations to the literature, and information about genes, but it includes little information about normal protein functions, and its computer interface has a somewhat limited feature set.


Description of the Program

The program described herein (ProteinFunctions) is a useful adjunct to OMIM: it supplies additional synoptic information about the function and regulation of gene products, and the software itself offers enhanced functionality. This database is designed to be a convenient source of information for students, clinicians, and researchers in the health sciences. ProteinFunctions focuses primarily on normal functions of proteins. Many of the entries in MIM have no known protein product or are the subject of active research. A convenient source of up-to-date information on the functional aspects of the protein products should be of interest to researchers, clinicians, and students alike.

All MIM entries with fairly well-defined protein products are represented in ProteinFunctions. Every entry has a designated unique data record (called a "card" in HyperCard) which can contain any amount of information (limited only by space on the hard disk) and may include text, diagrams, tables, and digitized pictures. Each entry has synoptic information (if available) extracted from a large number of textbooks and journal articles; the source of the information is cited in all cases. Information can be added to, modified, deleted, saved, or printed at any time by the user. Keywords (e.g. abbreviations and aliases) can easily be assigned, and they are automatically indexed. The program has a utility which makes it easy to create categories of proteins. Categories are a way of grouping related proteins, e.g. enzymes, hormones, growth factors, onc proteins, receptors, and ion channels. Proteins can be members of multiple categories. Members of a category can easily be added, moved, or deleted. Categories and subcategories can be created, modified, or deleted at any time without affecting the functionality of the system; all indexes are updated automatically. All or part of a category can be browsed, saved, or printed.

In ProteinFunctions, information can be accessed in any arbitrary sequence or in a pre-determined sequence. Information can be extracted and accumulated during this navigation process, then saved or printed.

Unlike many biomedical databases presently available, ProteinFunctions includes diagrams, for example the activation sequence of coagulation, biochemical pathways such as the Krebs cycle, and endocrine maps showing the source and target tissues of various hormones. These diagrams illustrate functional relationships between proteins and thereby make the information easier to learn and remember. General diagrams can be used as graphical directories to peruse the database. More specific diagrams are also associated with individual proteins, e.g. biochemical reactions catalyzed by a given enzyme with associated chemical structures of reactants and products. Every entry in ProteinFunctions is automatically "hyperlinked". In other words, any entry can be accessed from all other entries by simply selecting its name or MIM number in the text of the original entry. A separate list of MIM numbers and MIM page numbers is provided as a means of identifying MIM entries by name. Each entry in ProteinFunctions has an associated list of aliases if they exist. A general list of aliases can be perused if a desired entry is not found in the database. Alternatively, a new entry can easily be entered into the database. Links to pre-existing databases (e.g. supporting literature or research data) can also be made easily by the user, so that information can be accessed or transferred directly in either direction.

ProteinFunctions provides a unique construct called "actions". Actions are formalized attributes assigned to a given protein (e.g. glucagon increases gluconeogenesis; IL-2 increases T cell proliferation). These attributes are definable by users to best serve their own purposes. Once actions are assigned, it is possible to ask questions such as: "Which hormones increase gluconeogenesis?"; "Which growth factors increase T cell proliferation?". These queries can be made with respect to the entire database or restricted to a category or set of categories of proteins.

Glossary terms can also be created, edited, or deleted dynamically in ProteinFunctions and can then be accessed from anywhere in the program. ProteinFunctions presently runs on Macintosh computers. The program was created with HyperCard (distributed with all Macintosh computers) and enhanced with a number of custom-made "external functions" written in C.
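The "actions" and category features described in this section can be illustrated with a minimal Python sketch. This is not the HyperCard implementation; the names Protein, query, and PROTEINS are hypothetical, and the category assignments for glucagon and IL-2 are assumptions inferred from the example questions in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical record: a protein with categories and assigned actions."""
    name: str
    categories: set[str] = field(default_factory=set)
    # Each action is a (direction, process) pair,
    # e.g. ("increases", "gluconeogenesis").
    actions: set[tuple[str, str]] = field(default_factory=set)


def query(proteins, direction, process, categories=None):
    """Return names of proteins with the given action.

    If `categories` is given, only proteins belonging to at least one of
    those categories are considered; otherwise the whole list is searched.
    """
    wanted = set(categories) if categories is not None else None
    hits = []
    for protein in proteins:
        if (direction, process) not in protein.actions:
            continue
        if wanted is not None and not (protein.categories & wanted):
            continue
        hits.append(protein.name)
    return sorted(hits)


# Illustrative data only; category membership is assumed, not taken from the database.
PROTEINS = [
    Protein("glucagon", {"Hormones"}, {("increases", "gluconeogenesis")}),
    Protein("IL-2", {"Growth factors"}, {("increases", "T cell proliferation")}),
]

# "Which hormones increase gluconeogenesis?"
print(query(PROTEINS, "increases", "gluconeogenesis", categories=["Hormones"]))
```

Passing categories=None corresponds to querying the entire database, while a list of category names restricts the search as the text describes.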


Experimental Methods and Results

In constructing the initial ProteinFunctions database, each of the entries in MIM was analyzed to determine if a protein product had been identified. Of the 1800+ genes catalogued in the 1990 OMIM, about 1500 were determined to have fairly well-defined protein products based on the description provided by MIM. This number is greater than the number of functional proteins since many of the protein products are subcomponents of multimeric proteins. Each protein was initially assigned to one primary category using the semi-automated categorizing feature of ProteinFunctions. The following distribution was obtained:

Primary Category                    Number
Adhesive proteins                   9
Binding proteins                    32
Cell cycle related proteins         3
Cell surface antigens NEC           90
Coagulation related proteins        40
Complement related proteins         20
Connective tissue proteins          28
Cytoskeletal proteins               25
Embryonic proteins                  7
Enzymes                             657
Growth factors                      45
HLA related proteins NEC            13
Hormones                            39
Immunoglobulins                     23
Inflammation related proteins       2
Inhibitor proteins NEC              31
Lipoproteins                        14
Membrane proteins NEC               2
Membrane transport proteins         32
Muscle related proteins             13
Neurologically related proteins     21
Nuclear proteins NEC                36
Onc proteins                        82
Phenotypic determinants NEC         3
Receptors                           97
Regulatory proteins NEC             5
Serum proteins                      24
Tissue specific proteins NEC        98
Translation related proteins        10
Unclassified proteins               38
Total Proteins                      1539

Enzyme Subcategory                  Number
Amino acid related enzymes          121
- Amino acid related NEC            88
- Aminoacyl tRNA synthetases        8
- Bilirubin related                 2
- Catecholamine related             5
- Creatine related                  3
- Porphyrin related                 8
- Urea cycle                        7
Carbohydrate related enzymes        177
- Carbohydrate related NEC          83
- Glycogen related                  11
- Glycolipid related                21
- Glycolysis/gluconeogenesis        40
- Pentose pathway                   6
- Krebs cycle                       12
- Proteoglycan related              4
Electrolyte related enzymes         53
- ATPases                           10
- Carbonic anhydrases               8
- Cytochromes                       18
- Electrolyte related NEC           3
- Phosphatases NEC                  14
Lipid related enzymes               81
- Fatty acid metabolism             11
- Lipid related NEC                 31
- Lipoprotein related               3
- Phospholipid related              4
- Sphingolipid related              9
- Steroid related                   19
- Triglyceride related              4
Nucleotide related enzymes          54
- Nucleotide related NEC            11
- Purine related                    29
- Pyrimidine related                14
Nucleic acid related enzymes        20
Protein related enzymes             83
- Peptide related                   13
- Protease related                  29
- Protein related NEC               34
- Protein kinases                   7
Unclassified enzymes                62
Vitamin related enzymes             6
Total Enzymes                       657
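As a consistency check on the enzyme subcategory counts reported above, the following Python sketch stores each group with its reported subtotal and its subcategory counts, then verifies that the parts add up. The counts are transcribed from the table; the data structure and the names ENZYME_TABLE, inconsistent_groups, and grand_total are hypothetical and only illustrate the arithmetic.

```python
# Each group maps to (reported subtotal, {subcategory: count}).
# Groups with no listed subcategories carry an empty dict.
ENZYME_TABLE = {
    "Amino acid related": (121, {
        "NEC": 88, "Aminoacyl tRNA synthetases": 8, "Bilirubin": 2,
        "Catecholamine": 5, "Creatine": 3, "Porphyrin": 8, "Urea cycle": 7,
    }),
    "Carbohydrate related": (177, {
        "NEC": 83, "Glycogen": 11, "Glycolipid": 21,
        "Glycolysis/gluconeogenesis": 40, "Pentose pathway": 6,
        "Krebs cycle": 12, "Proteoglycan": 4,
    }),
    "Electrolyte related": (53, {
        "ATPases": 10, "Carbonic anhydrases": 8, "Cytochromes": 18,
        "NEC": 3, "Phosphatases NEC": 14,
    }),
    "Lipid related": (81, {
        "Fatty acid metabolism": 11, "NEC": 31, "Lipoprotein": 3,
        "Phospholipid": 4, "Sphingolipid": 9, "Steroid": 19, "Triglyceride": 4,
    }),
    "Nucleotide related": (54, {"NEC": 11, "Purine": 29, "Pyrimidine": 14}),
    "Nucleic acid related": (20, {}),
    "Protein related": (83, {
        "Peptide": 13, "Protease": 29, "NEC": 34, "Protein kinases": 7,
    }),
    "Unclassified": (62, {}),
    "Vitamin related": (6, {}),
}

REPORTED_TOTAL_ENZYMES = 657


def inconsistent_groups(table):
    """Return the groups whose subcategory counts do not sum to the subtotal."""
    return [
        name
        for name, (subtotal, parts) in table.items()
        if parts and sum(parts.values()) != subtotal
    ]


def grand_total(table):
    """Sum the reported group subtotals."""
    return sum(subtotal for subtotal, _ in table.values())


print(inconsistent_groups(ENZYME_TABLE))
print(grand_total(ENZYME_TABLE) == REPORTED_TOTAL_ENZYMES)
```

With the counts as transcribed, every group subtotal matches its parts and the group subtotals add up to the reported total of 657 enzymes.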

Discussion

Obviously the primary classification is somewhat arbitrary and subject to change. However, it can easily be modified as desired using the ProteinFunctions program without changing the primary data. It can be seen that enzymes comprise, by far, the largest group of presently known proteins. Actually, the enzyme category is somewhat larger since some of the other primary categories such as "Coagulation related" and "Complement related" contain enzymes. Whether this will continue to be the case as more proteins are discovered is a matter of speculation. However, it should be noted that enzymes are more easily detected than most other kinds of proteins because of their catalytic nature.

Enzymology has long been the arcane domain of experimental biochemists. This is not surprising in light of the large number of human enzymes, not to mention the numerous additional ones in lower lifeforms. Enzymes have been formally classified by the International Union of Biochemistry (IUB) into 6 major classes by a four-part code [8]. The major classes are oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. This organization is primarily chemical in nature, in that it focuses on which chemical groups are transferred, which molecule is the donor or acceptor, and which bonds are formed or broken. This chemical classification is not sufficient, or not very useful, for many students and professionals in the health sciences.

What is presented here is a classification scheme which is superimposed on the existing nomenclature and which may be more meaningful or useful to those who are not professional enzymologists. This scheme attempts to be as systematic as possible, but at the same time focuses on the presently perceived most important biomedical functions (as opposed to purely chemical functions) of these enzymes. The functional categorization of enzymes is complicated by a number of confounding factors. There are different forms of the same enzyme (i.e. isoenzymes, proenzymes, zymogens). Also, many enzymes are composed of multiple copies of a single protein or of different protein subunits in various combinations. Enzymes can be structurally and functionally associated with cofactors (coenzymes, minerals, porphyrins) and may be covalently modified by carbohydrate, lipid, or nucleotide molecules to affect their location, membrane solubility, or activity. There is also a plethora of common names associated with enzymes, which has resulted from the fact that thousands of researchers have isolated enzymes from numerous sources (different lifeforms, tissues, and cell locations) and named them by whatever criteria seemed appropriate at the time.

However, most of the enzymes were subclassified satisfactorily in this study by making operational definitions and empirical rules as needed. For example, two of the most important biochemical pathways known are the degradation of glucose (glycolysis) and the synthesis of glucose from amino acids or lactic acid (gluconeogenesis). Although there are important differences between these two pathways, they share many common enzymes, and so the enzymes are classified together.

Creatine, porphyrin, and bilirubin related enzymes are all classified as primarily amino acid related since these compounds are all derived from amino acids. Monoamine oxidase is classified as amino acid related since its primary function is to inactivate the neurotransmitters epinephrine, norepinephrine, dopamine, and serotonin, all of which are amino acid derivatives. Acetylcholine esterase, which inactivates the neurotransmitter acetylcholine, is classified as amino acid related since choline is derived from the amino acids serine and methionine. HMG CoA is a product of carbohydrate metabolism, but this molecule is also the precursor of cholesterol and other steroids as well as the ketone bodies. Both of these groups of compounds are classified as lipids (based on their solubility in lipid solvents, in spite of the fact that ketone bodies are also fairly soluble in aqueous solvents). HMG CoA synthase is the enzyme that synthesizes HMG CoA from acetoacetyl CoA. This enzyme is classified as being primarily carbohydrate related, whereas HMG CoA reductase (which is the first enzyme in cholesterol synthesis) and HMG CoA lyase (which is the first enzyme in ketone body formation) are classified as lipid related.
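The contrast between the chemical IUB classification and the biomedical scheme superimposed on it can be sketched in Python as follows. The six class names and the biomedical categories come from the text; the EC numbers are not given in the paper and are added here from standard enzyme nomenclature as an assumption, and the names IUB_CLASSES, iub_class, EXAMPLES, and describe are hypothetical.

```python
# First digit of the four-part IUB code selects one of the six major classes.
IUB_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def iub_class(ec_number: str) -> str:
    """Return the major IUB class for a four-part code such as '1.4.3.4'."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-part code, got {ec_number!r}")
    return IUB_CLASSES[int(parts[0])]


# Enzyme name -> (EC number, biomedical category assigned in the text).
# EC numbers are an addition for illustration, not part of the original paper.
EXAMPLES = {
    "monoamine oxidase": ("1.4.3.4", "amino acid related"),
    "acetylcholine esterase": ("3.1.1.7", "amino acid related"),
    "HMG CoA reductase": ("1.1.1.34", "lipid related"),
    "HMG CoA lyase": ("4.1.3.4", "lipid related"),
}


def describe(name: str) -> tuple[str, str]:
    """Return (chemical IUB class, biomedical category) for an example enzyme."""
    ec_number, category = EXAMPLES[name]
    return iub_class(ec_number), category


for enzyme in EXAMPLES:
    print(enzyme, describe(enzyme))
```

The two HMG CoA enzymes fall into different chemical classes but the same biomedical category, which is the kind of regrouping the superimposed scheme is meant to provide.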

Carnitine acyltransferase is a membrane transport protein, but its primary function appears to be the transfer of fatty acids from one carrier to another as they are transferred into the mitochondria for the purpose of oxidation. Therefore, it is classified as being part of fatty acid metabolism. Hexosaminidase A (the deficient enzyme in Tay-Sachs disease) cleaves a sugar (i.e. carbohydrate) molecule from a ganglioside. It can be argued that the key functional properties of gangliosides (i.e. in determining antigenic properties such as blood group A vs B) are due mainly to the presence of their peripheral carbohydrate groups. By this criterion, hexosaminidase A is classified here as being primarily carbohydrate related. The fact that the pathology of Tay-Sachs disease primarily results from the lysosomal accumulation of lipid material is not used as a primary classification criterion since this result is not a normal process and Tay-Sachs is an extremely rare disease. Alpha L-fucosidase has a similar function to hexosaminidase A and is also categorized here as being primarily carbohydrate related. The associated protein, alpha L-fucosidase regulator, is only of functional significance in that it enhances the activity of the associated enzyme, so it is classified with the enzyme in spite of the fact that it is not an enzyme itself. Likewise, numerous inhibitors and activators are classified with the enzymes they modulate.

Similarly, enzymes involved in the processing of coenzymes (which are derived from vitamins) are classified, where possible, with the enzymes whose function they support. For example, tetrahydrofolate is a coenzyme derived from the B vitamin folic acid. 5-methyl tetrahydrofolate homocysteine methyltransferase is classified as amino acid related since homocysteine is an amino acid, whereas 5,10-methenyl tetrahydrofolate cyclohydrolase (which has to do with the activation of the coenzyme per se) is classified as vitamin related NEC (not elsewhere classified). UMP (uridine monophosphate) is a modified pyrimidine, but in combination with glucose or galactose (e.g., UDP-glucose), its functional significance is as a "carrier" or "activator" for the glucose residue. Consequently, UDP-glucose pyrophosphorylase is classified as "carbohydrate related". On the other hand, uracil DNA glycosylase is a nucleic acid repair enzyme and, as such, is classified as "nucleic acid related".

Similar rationales were used to make all the primary and enzyme secondary assignments, but they cannot be included here because of space restrictions. Also, most of the other primary categories have been subcategorized. Each of the "Unclassified" subcategories will be further subcategorized in the future, and new subcategories will no doubt be needed as new proteins are discovered. Although it is hoped that the proposed scheme will be found useful by others, it can easily be modified within the ProteinFunctions program to adapt to the specifications of the individual user.

The significance of ProteinFunctions to Medical Informatics is as follows: Biochemistry underlies essentially all physiological and pathological processes. Consequently, ProteinFunctions contains information of interest to almost all students and professionals in the health sciences. All current biochemistry textbooks discuss the functions of a relatively small percentage of the total number of known proteins. In most cases, it is not possible to determine the tissue and subcellular distribution of the protein or even whether the discussion relates to the human protein. Also, information that is presented is rarely linked directly to the primary literature. Existing multivolume compendia of enzymes are primarily concerned with methodologies and structure, not function. In most cases, there is no attempt to separate information about the human proteins. This is partly because information about the human proteins is only recently becoming available. This, in turn, is a result of the recent progress in molecular biology, isolation and preservation of human tissues, and the development of methods to culture certain kinds of human cells. The large amount of information about human proteins and its dynamic nature are the fundamental reasons underlying the need for a database such as ProteinFunctions.

References

1. Roberts L. A meeting of the minds on the genome project? Science 250:756-757, 1990.
2. Culliton BJ. Mapping Terra Incognita. Science 250:210-212, 1990.
3. Watson JD. The human genome project: past, present, and future. Science 248:44-49, 1990.
4. McKusick VA. Mendelian Inheritance in Man, 1990, Johns Hopkins Press.
5. Roberts L. Whatever happened to the genetic map? Science 247:281-282, 1990.
6. Stephens JC, ML Cavenaugh, MI Gradie, ML Mador, KK Kidd. Mapping the human genome: current status. Science 250:237-244, 1990.
7. Brunn CW, SE Kelley, RE Lucier, DD Marquette, TT Ying. The Genome Data Base: a genetic mapping and disease database to support the human genome project. Proceed. Amer. Med. Informatics. Assoc. p73, 1990.
8. Devlin TM. Textbook of Biochemistry with Clinical Correlations, 1986, p119-123, John Wiley & Sons.
