CABIOS

Vol. 8, no. 1, 1992, Pages 39-44

Clustering proteins into families using artificial neural networks

Edgardo A. Ferrán and Pascual Ferrara¹

Sanofi Elf Bio Recherches, Labège Innopole, BP 137, 31328 Labège Cedex, France

¹To whom reprint requests should be sent

Abstract

An artificial neural network was used to cluster proteins into families. The network, composed of 7×7 neurons, was trained with the Kohonen unsupervised learning algorithm using, as inputs, matrix patterns derived from the bipeptide composition of 447 proteins belonging to 13 different families. As a result of the training, and without any a priori indication of the number or composition of the expected families, the network self-organized the activation of its neurons into topologically ordered maps in which almost all the proteins (96.7%) were correctly clustered into the corresponding families. In a second computational experiment, a similar network was trained with one family of the previous learning set (76 cytochrome c sequences). The new neural map clustered these proteins into 25 different neurons (five in the first experiment), wherein phylogenetically related sequences were positioned close to each other. This result shows that the network can adapt the clustering resolution to the complexity of the learning set, a useful feature when working with an unknown number of clusters. Although the learning stage is time consuming, once the topological map is obtained, the classification of new proteins is very fast. Altogether, our results suggest that this novel approach may be a useful tool to organize the search for homologies in large macromolecular databases.

Introduction

The management and exploitation of DNA and protein databases are already subjects of major concern. Since the number of entries will increase enormously as a consequence of endeavours like the Human Genome Project (Watson, 1990), the need for better methods of organizing and searching for homologies in those databases has become evident. In the present work, we use artificial neural networks to cluster proteins into families according to their degree of sequence similarity.

Artificial neural networks are simplified models of the nervous system. The basic elements of these mathematical models, called formal neurons, are considered as processing units that have many input signals and produce one output signal (McCulloch and Pitts, 1943). The properties of the network depend highly on the synaptic efficacies that weight the neural connections. These values are gradually adjusted through a learning algorithm. Neural networks are usually considered as an archetype of parallel processing computers.

Multilayered feed-forward neural networks are currently used as a computational technique to infer rules from examples. The synaptic connections of these networks, for a given learning set of external input-output relations, are determined with a supervised learning algorithm (Le Cun, 1985; Rumelhart et al., 1986). This approach has been used to predict, for example, secondary structures of proteins (Qian and Sejnowski, 1988; Andreassen et al., 1990), immunoglobulin domains (Bengio and Pouliot, 1990) and coding regions and splice junctions in DNA (Lapedes et al., 1990). In a different approach, such as the one proposed by Kohonen (1982, 1988), the neural network self-organizes its activation states into topologically ordered maps, using an unsupervised learning method. These maps are the result of an information compression that retains only the most relevant common features among the input signals. In our case, since the number and composition of protein families are not known, the application of standard supervised methods is not appropriate. For this reason, we explored the classification of protein sequences using the unsupervised Kohonen algorithm.

System and methods

The program was written in FORTRAN 77 and implemented

© Oxford University Press

on a VAX 6310 under VMS. The protein sequences were obtained from the SwissProt database (release 14.0, 5/90). The link between the program and the database was made using the subroutine DBNextSeq of the Genetics Computer Group package (GCG v. 6.0), developed at the University of Wisconsin (Devereux et al., 1984).

Algorithm

We consider networks composed of one bidimensional output layer of N_x × N_y neurons. Each neuron receives a fixed number of input signals, carrying information derived from a protein sequence. The input signals are the 400 components x_kl of a 20×20 matrix X obtained from the bipeptide composition of the protein to be learned. Thus, x_kl is the normalized frequency of the bipeptide kl in the sequence (k and l are integer numbers between 1 and 20, which indicate each of the 20 possible different amino acids). These 20×20 matrices allow us to feed




Results

We have asked a network composed of 49 neurons (N_x = N_y = 7) to perform its own classification of a learning set of 447 sequences. This set includes proteins stored in the SwissProt database (release 14.0, 5/90) belonging to the following 13 families: hemoglobin α-chain (132 proteins); cytochrome c (76 proteins); insulin (10 insulin precursors, 5 proinsulins and 26 insulins); nef (23 gene products); pol (20 gene products); gag (22 gene products); env (30 gene products); rev (21 gene products); tat (22 gene products); vif (18 gene products); vpr (18 gene products); vpu (15 gene products) and vpx (9 gene products). The last 10 families group together gene products from different primate lentiviruses (HIV-1, HIV-2 and SIV). Fragments were not considered.

In a representative experiment, the training algorithm produced the final map shown in Figure 2. In this map, each neuron is shown with the number of homologous sequences for which it is the winner. The protein patterns are automatically clustered using 34 of the 49 available neurons. In most cases, the HIV-1 and HIV-2/SIV subfamilies of each gene product (sequence identity = 50%) are grouped into the same neuron or into neighboring ones. The only exception is the vif family, whose HIV-2/SIV subfamily is positioned rather far from the vif HIV-1 subfamily (sequence identity = 30%). The only case where patterns from different families have the same winner is neuron (2,7), where rev and tat gene products of HIV-2/SIV are clustered together. These proteins represent 3.3% of the learning set. The vpr and vpx families are associated with neighboring neurons (sequence identity = 25%). The vpx and vif gene products of the African green monkey (coded vpx$sivat and vif$sivat respectively in SwissProt) are generally classified apart from their corresponding families [neurons (2,5) and (4,6) in Figure 2]. In several experiments (not shown), we have indeed observed vpx$sivat and the vpr family to have the same winner. This agrees with the observation (Tristem et al., 1990) that vpx$sivat aligns well with all vpr gene products (sequence identity = 37%). It should also be noted that insulin, proinsulin and insulin-precursor patterns are classified into neighboring neurons, showing that fragments of a protein may also be clustered together. We have previously shown, using smaller learning sets of interleukin-1 precursors and interleukin-1 receptors, that a network trained with complete sequences may recognize short portions of the learned sequences (Ferrán and Ferrara, 1991b). On average, fragments representing 7.5% of the sequence can be sufficient to obtain a correct winner.
This average value should be taken only as indicative because, in some cases, 45% of the sequence is needed, depending on the distances between the learned proteins and the final synaptic vectors. Although in Figure 2 we have not indicated the corresponding species, the hemoglobin and cytochrome c families spread over 10 and 5 neurons, respectively, in a way that roughly resembles taxonomic classifications. Neuron (1,3), for example, groups together all 25 cytochrome c proteins belonging to plants and a green-alga cytochrome c. It can generally be observed that the number of neurons associated with each family is proportional to the number of proteins per family. The distances between all pairs of synaptic vectors of neurons associated with a given family are smaller than the distances between any of them and the synaptic vectors of neurons attached to other families. To analyze the dynamics of the clustering process, as well as to explore the ability of neural networks to adapt themselves to the complexity of the learning set, we asked a network of similar size to the previous one to classify only the 76 cytochrome c patterns. The evolution of the topological map during learning, in a representative computational experiment,


the network with a constant number of inputs, regardless of the protein length. This naive representation of the whole sequence information was successfully used for the classification of proteins by statistical techniques (Nakayama et al., 1988; van Heel, 1991). The neural network is trained with these patterns to produce a neural map in which related proteins are associated either with a single neuron or with neighboring ones.

In a typical training experiment, the 400 synaptic efficacies that weight the input signals from a protein pattern are the components of a synaptic vector associated with each neuron. We denote by m_ij the synaptic vector of the neuron positioned at site (i,j) of the output layer. All synaptic vector components are real numbers initially taken at random from the interval [0,1]. Both input patterns and synaptic vectors are normalized to unit vectors; thus, they may be visualized as points on the surface of a 400-dimensional hypersphere. This normalization of the input patterns avoids clustering proteins according to their number of amino acids. During learning, each protein pattern is presented as input to the network and the neuron having the synaptic vector closest to the protein pattern (the winner neuron) is selected. The learning procedure consists of changing the synaptic vectors of every neuron placed in a neighborhood of the winner, 'moving' them towards the position of the input vector on the hypersphere. These adjustments cover only a fraction α of the separation between the vectors. We consider square neighborhoods centered at the winner neuron (we take s neighboring neurons on each side of the winner site, whenever possible). All protein patterns of the learning set are processed by the network repeatedly, in the same sequential order. Each processing cycle of the whole learning set is called an epoch.
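For illustration, the encoding just described can be sketched as follows in Python (the actual program was written in FORTRAN 77; the function and variable names here are ours):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def bipeptide_pattern(sequence):
    """Build the 20x20 bipeptide-composition matrix of a sequence and
    return it flattened as a 400-component unit vector."""
    x = np.zeros((20, 20))
    for a, b in zip(sequence, sequence[1:]):     # every overlapping bipeptide kl
        x[AA_INDEX[a], AA_INDEX[b]] += 1.0
    x /= len(sequence) - 1                       # normalized bipeptide frequencies
    v = x.ravel()
    return v / np.linalg.norm(v)                 # point on the 400-dim hypersphere

pattern = bipeptide_pattern("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(pattern.shape)  # (400,)
```

Because the pattern is projected onto the unit hypersphere, two proteins with the same bipeptide proportions but different lengths yield the same input, which is what prevents clustering by sequence length.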
During learning, α decreases exponentially every Δt_α epochs [α(t + Δt_α) = a·α(t), where t is the epoch number and 0 < a < 1], and the winner neighborhood shrinks every Δt_s epochs, from the whole network down to the winner neuron alone. A summary of the algorithm, depicted as a flowchart, is given in Figure 1. A detailed description can be found elsewhere (Ferrán and Ferrara, 1991a,b).
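A minimal reimplementation of this training loop can be sketched in Python (not the original FORTRAN 77 program; the re-projection of updated vectors onto the unit hypersphere is our assumption, consistent with keeping synaptic vectors as unit vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_kohonen(patterns, nx=7, ny=7, epochs=500,
                  alpha0=0.9, a=0.9, dt_alpha=5, dt_s=50):
    """Kohonen training as described in the text: square shrinking
    neighborhoods and exponential decay of alpha every dt_alpha epochs."""
    dim = patterns.shape[1]
    m = rng.random((nx, ny, dim))                  # random efficacies in [0,1]
    m /= np.linalg.norm(m, axis=2, keepdims=True)  # unit synaptic vectors
    s = max(nx, ny)                                # neighborhood: whole network
    alpha = alpha0
    for t in range(1, epochs + 1):
        for x in patterns:                         # fixed sequential order
            # winner = neuron whose synaptic vector is closest to the pattern
            d = np.linalg.norm(m - x, axis=2)
            wi, wj = np.unravel_index(np.argmin(d), d.shape)
            # move neurons in the square neighborhood a fraction alpha toward x
            i0, i1 = max(wi - s, 0), min(wi + s, nx - 1)
            j0, j1 = max(wj - s, 0), min(wj + s, ny - 1)
            m[i0:i1+1, j0:j1+1] += alpha * (x - m[i0:i1+1, j0:j1+1])
            m[i0:i1+1, j0:j1+1] /= np.linalg.norm(
                m[i0:i1+1, j0:j1+1], axis=2, keepdims=True)
        if t % dt_alpha == 0:
            alpha *= a                             # alpha(t + dt_alpha) = a*alpha(t)
        if t % dt_s == 0:
            s = max(s - 1, 0)                      # shrink toward the winner only
    return m

# toy usage: 30 random unit patterns of dimension 400, short run
pats = rng.random((30, 400))
pats /= np.linalg.norm(pats, axis=1, keepdims=True)
synapses = train_kohonen(pats, epochs=20, dt_alpha=5, dt_s=10)
print(synapses.shape)   # (7, 7, 400)
```

With the values used in the paper's first experiment, the call would be train_kohonen(patterns, epochs=500, alpha0=0.9, a=0.9, dt_alpha=5, dt_s=50).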


[Figure 1, reconstructed from the flowchart text:
1. Define N_x, N_y, s, α(0), a, Δt_α and Δt_s. Give initial random values to the synaptic efficacies m_ij,kl(0).
2. Input the matrix pattern X of the n-th protein.
3. Determine the position (i*,j*) of the winner neuron: ||m_i*j* − X|| = Min_ij ||m_ij − X||.
4. Change the synaptic efficacies m_ij of the neurons located in the winner neighborhood.
5. Every Δt_s epochs, decrease the size of the winner neighborhood: s(t) = Max[s(t−1) − 1, 0].]

Fig. 1. Flow diagram describing the learning algorithm. m_ij,kl indicates the synaptic efficacy between the signal corresponding to the (k,l) component of the protein matrix and the neuron positioned at site (i,j) of the network. T is the number of epochs and N the number of sequences in the learning set. The other symbols are defined in the text.

showed that, after first being clustered into a single neuron, the proteins were classified into four groups: G1 (plants), G2 (fungi), G3 (protozoa and non-vertebrates) and G4 (vertebrates). Each group had as winner a different corner of the network. G3 included four phylogenetically misclassified sequences, corresponding to two algae and two fishes. In the following epochs, three of these misclassifications were corrected (a euglenoid alga sequence remained in this group), and the zone of influence of each group spread over the neighboring neurons of the corresponding corner. The final topological map shows that G1 covers the lower left region of the network, G3 the upper left and G4 the middle and the upper right, while G2

remains at the lower right corner (Figure 3). It should be noted that the protozoa and euglenoid alga sequences are finally positioned between the sequences of plants and those of non-vertebrates. As in the first case, the size of each family domain (i.e. the number of associated neurons) is proportional to the number of its members. Interestingly, in several cases all the patterns belonging to a given species are clustered into the same neuron [all fishes are grouped into neuron (4,4), primates into (5,7), green algae into (5,1), birds into (1,7), etc.]. Apparently wrong clusterings (from a taxonomic point of view) are due to the fact that some proteins have a greater degree of homology than expected from their phylogenetic classification. The cytochrome c of




[Figure 2: the 7×7 topological map; each neuron is labelled with a family abbreviation and the number of sequences for which it is the winner, e.g. Cyt (27), Hba (36), Nef.H1 (15) and Nef.H2/S (8).]

Fig. 2. Topological map for a learning set of 447 proteins. Only the number of homologous sequences that have each neuron as a winner is indicated. Hba, hemoglobin α-chain; Cyt, cytochrome c; PPIns, insulin precursor; PIns, proinsulin; Ins, insulin; H1, HIV-1 gene product; H2/S, HIV-2/SIV gene product. Learning proceeds during 500 epochs with α(0) = 0.9; a = 0.9; Δt_s = 50 and Δt_α = 5.

the California grey whale (Eutheria, Cetacea), the rabbit (Eutheria, Lagomorpha) and the Eastern grey kangaroo (Marsupialia) have 98% sequence identity, determined by the Needleman and Wunsch (1970) method, and are correctly classified into the same neuron (3,7). Similarly, the cytochrome c of the Eastern diamondback rattlesnake is closer to human cytochrome c (86% identity) than to that of the snapping turtle (79%). In Figure 3, proteins having the same winner are sorted according to the Euclidean distances between their input vectors and the synaptic vector. These distances can be used to classify these proteins further. For example, in neuron (1,1) the corresponding distances are: 0.26 (greenbottle fly), 0.28 (flesh fly), 0.29 (horn fly), 0.35 (tobacco hawkmoth), 0.42 (cynthia moth) and 0.58 (monsoon river prawn). These values separate the six proteins into three subgroups (three flies, two moths and one crustacean). Hierarchical trees of protein classification may be constructed by taking into account the distances between the final synaptic vectors. Work in this direction is in progress.

Discussion

We have used the unsupervised Kohonen learning algorithm to cluster protein sequences. Basically, this algorithm is a method to map data defined in a higher-dimensional space into a space of much lower dimensionality. This data

compression is performed in such a way that the mapping preserves the topological structure of the input data as much as possible, transforming complex similarity relations between the data in the high-dimensional space (here, the degrees of similarity between the protein sequences of the learning set) into much simpler distance relations in the low-dimensional space (the Euclidean distances between the synaptic vectors of the two-dimensional neural map). Thus, the algorithm provides a two-dimensional geometrical representation of the relationships between the bipeptide compositions of the protein sequences that have been learned. In fact, we have considered two-dimensional maps only to display the final results in a simple way. This corresponds to obtaining a two-dimensional representation of a multi-dimensional cluster analysis, but maps of higher dimensionality can be built in an analogous way.

In both of the above computational experiments, the network succeeded in clustering the proteins according to their sequence similarities, despite the fact that one experiment was performed with a heterogeneous learning set and the other with closely related proteins. These results point out the most remarkable difference from many non-hierarchical statistical approaches to clustering data, in which the number of expected classes of the partition usually has to be defined before the analysis (Auray et al., 1990). We have already obtained topological neural maps for all 1550 human proteins recorded in the SwissProt database (release 16.0, 10/90) and we are currently determining the optimal size




[Figure 3: final topological map for the 76 cytochrome c sequences; each neuron lists the species of its associated proteins, e.g. flies and moths in neuron (1,1), the birds (emu, ostrich, penguin, chicken, duck, pigeon) in a single neuron, and the plant sequences grouped in one region of the map.]
