G Model

ARTICLE IN PRESS

ARTMED-1403; No. of Pages 9

Artificial Intelligence in Medicine xxx (2015) xxx–xxx

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine journal homepage: www.elsevier.com/locate/aiim

Protein–protein interaction identification using a hybrid model

Yun Niu ∗ , Yuwei Wang

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, 29 Yudao Street, Qinhuaiqu, Nanjing, Jiangsu 210016, China

Article info

Article history: Received 15 May 2014; received in revised form 13 May 2015; accepted 15 May 2015.

Keywords: Relational similarity model; Word similarity model; Biomedical text mining; Protein–protein interaction

Abstract

Background: Most existing systems that identify protein–protein interactions (PPIs) in the literature make decisions solely on evidence within a single sentence and ignore the rich context of PPI descriptions in large corpora. Moreover, they often suffer from the heavy burden of manual annotation.

Methods: To address these problems, a new relational-similarity (RS)-based approach exploiting context in large-scale text is proposed. A basic RS model is first established to make initial predictions. Then word similarity matrices that are sensitive to the PPI identification task are constructed using a corpus-based approach. Finally, a hybrid model is developed to integrate the word similarity model with the basic RS model.

Results: The experimental results show that the basic RS model achieves F-scores much higher than a baseline of random guessing on both interactions (75.0% vs. 50.6%) and non-interactions (74.2% vs. 49.4%). The hybrid model further improves the F-score by about 2% on interactions and 3% on non-interactions.

Conclusion: The experimental evaluations conducted with PPIs in well-known databases showed the effectiveness of our approach, which explores context information in PPI identification. This investigation confirmed that, within the framework of relational similarity, the word similarity model relieves the data sparseness problem in similarity calculation.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Information on protein–protein interactions (PPIs) is crucial for understanding the functional role of individual proteins as well as entire biological processes. Although numerous PPIs have been manually curated into databases such as BioGRID [1], BIND [2], DIP [3], HPRD [4], IntAct [5] and MINT [6] by experts, information about many PPIs is still only available through the PubMed database. However, the amount of biomedical literature in PubMed grows rapidly, and complete coverage by manual curation is not practical. Therefore, mining PPIs from the literature has become increasingly important and has attracted a lot of research interest. The well-known BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology) challenge included a PPI detection task in two evaluations [7,8]. The primary goal of the task is to determine whether two target proteins interact.

Approaches for mining PPIs from biomedical text range from co-occurrence analysis to more sophisticated natural language processing systems. Co-occurrence analysis is the most

∗ Corresponding author. Tel.: +86 25 84896490. E-mail address: [email protected] (Y. Niu).

straightforward approach and generally results in high recall but low precision [9,10]. Some other approaches construct patterns specifying how an interaction is described in the literature and use them as rules to find PPIs [11–16]. Rule- or pattern-based approaches can increase precision but significantly lower recall. In addition, these rule sets are derived from training data and are therefore not always applicable to data they were not developed for [17,18]. In recent years, more and more approaches explore natural language processing technologies, with a preference for machine learning (ML) methods. Some approaches focus on identifying features that are helpful in PPI identification, including lexical, syntactic, and semantic features [19–24]. Others investigate various strategies for measuring the distance between two data points and explore them in kernel functions [25–31]. These ML approaches do not require manual construction of rules or patterns and often achieve better accuracy. However, they face some difficulties. Given two target proteins, these ML approaches determine whether they interact based on evidence within a rather small text span, typically a sentence in which the proteins co-occur. As in other information extraction tasks, PPI identification is defined as determining whether there is an interaction relation between any two proteins mentioned in a sentence, as in the following example.

http://dx.doi.org/10.1016/j.artmed.2015.05.003 0933-3657/© 2015 Elsevier B.V. All rights reserved.

Please cite this article in press as: Niu Y, Wang Y. Protein–protein interaction identification using a hybrid model. Artif Intell Med (2015), http://dx.doi.org/10.1016/j.artmed.2015.05.003


The screen identified interactions involving c-Cbl and two 14-3-3 isoforms, cytokeratin 18, human unconventional myosin IC, and a recently identified SH3 domain containing protein, SH3 P17.

In this sentence, three proteins are mentioned (marked in bold). The task is to determine whether there is an interaction between any two of them, i.e., which of the three pairs (c-Cbl, cytokeratin 18), (c-Cbl, SH3 P17), (cytokeratin 18, SH3 P17) are interactions. The decision is made solely on evidence within this sentence. These single-sentence-based approaches have several disadvantages.

Firstly, the complex syntactic structure of sentences often makes prediction very difficult. PPIs are complex biological processes, and multiple proteins playing various roles are often mentioned in the same sentence. Indeed, in the AImed dataset [11] of PubMed abstracts annotated by human experts with protein interactions, over 40% of the sentences have more than three protein mentions. To depict these roles, complex syntactic structures are often used in a sentence. As a result, the connections between two protein mentions are often implicit, which makes it difficult to determine their relationship. In the above sentence, there is a long distance (in terms of words) between c-Cbl and SH3 P17, and it would be difficult to derive a direct relation between them even through deep syntactic parsing of the sentence.

Secondly, the context of interactions is ignored in these approaches. Information in nearby sentences often provides the context of the interactions and thus could be very helpful in identifying the target interactions. However, this context is ignored in single-sentence-based approaches. In addition, an interaction may be reported and described by different pieces of research work and hence appear in various papers. All these descriptions provide valuable evidence for recognizing the target PPI.
Yet this information is not fully exploited in single-sentence-based approaches.

Thirdly, these ML approaches suffer from small training datasets. In a single-sentence-based approach, to build the training data, every protein pair appearing in a sentence has to be manually annotated as positive (interaction) or negative (non-interaction). This is very labor-intensive annotation work. As a result, these machine learning algorithms are usually trained on small datasets, which inevitably affects the accuracy and portability of the models.

To address these issues, we propose a novel approach that explores a corpus-based strategy to identify PPIs. Although there have been attempts to explore corpus-wide properties, they mostly exploit the frequency of interesting patterns [32,33]. In contrast, in the present work, relations between proteins are analyzed within the framework of relational similarity (RS) in natural language processing. In addition, a word similarity model derived from a large corpus is introduced to further improve the accuracy of the similarity calculation. Our method takes known PPIs in existing PPI databases (e.g., HPRD) as training data, and no extra annotation is required. The experimental results show that this approach achieves high accuracy and well-balanced precision and recall.

The rest of this paper is organized as follows: Section 2 introduces the relational similarity framework. The process of PPI identification using the basic RS model and the results are discussed in Section 3. In Section 4, we introduce the word similarity model to further improve the accuracy of PPI identification and analyze the results of the hybrid model in detail. Section 5 concludes the paper.

2. The relational similarity framework

Research on relational similarity (RS) in the field of natural language processing provides a unified framework for accurately recognizing relations in text.
Medin, Goldstone, and Gentner [34] describe relations as follows: relations are predicates taking two

or more arguments (e.g., X collides with Y, X is larger than Y), which are used to express abstract connections between objects. Most work on RS analysis tries to identify relations implied by word pairs by comparing the similarity of the target relation with some known relations [35–38]. Usually, distributional properties of relations are first extracted from large-scale text. These properties characterize the connections between the two involved words. Then, similarity measures are applied to calculate the similarity between the target relation and the known relations. The most similar one is used to label the relation between the two target words.

Our decision to perform PPI recognition within the RS framework is based on two considerations. First of all, interactions between proteins are typical semantic relations that match Medin's definition. More importantly, as discussed in the previous section, context information in a large corpus is crucial in determining whether two proteins interact. Within the RS framework, relations are indeed characterized by properties presented in large-scale text. This matches well with our intention to incorporate context in PPI recognition. Therefore, in the present work, we analyze PPIs from the viewpoint of relational similarity; the prediction is made upon the rich context information in a large corpus.

The RS framework contains three modules: collecting relation descriptions, relation representation, and similarity calculation. The first module obtains, from a large corpus, the collection of text that is likely to describe the relation between the two arguments. These descriptions can be phrases, sentences, paragraphs, etc. For example, Turney [35] selected 128 groups of phrases (e.g., X of Y, Y for X, X to Y) that contain the arguments (X, Y), while Nakov [36] used the set of sentences containing the two arguments.
In the relation representation module, vector space models are often used. Dimensions of the vectors correspond to properties characterizing the target relation. In the third module, appropriate similarity measures need to be designed and applied to calculate the distance between the target relation and the known relations. Finally, the target relation is labeled with the most similar known relation.

3. The basic relational similarity model in PPI recognition

3.1. System architecture

In the presented PPI recognition system, if two proteins interact, they form a positive pair; otherwise, they form a negative pair. To determine whether two proteins interact, we calculate the similarity between the target pair and the known positive pairs, and between the target pair and the known negative pairs, respectively. The target pair gets a positive label if it is more similar to the positive pairs and a negative label otherwise. Fig. 1 shows the architecture of the PPI identification system. The basic RS model is presented in the solid-line frame. As in the RS framework, our PPI recognition system contains three modules, marked by the dashed-line frames in Fig. 1; they are described in the following subsections. The word similarity model is in the wavy-line frame and will be discussed in Section 4.

3.2. Collecting relation descriptions

The whole of PubMed is used as the corpus from which descriptions of protein pairs are extracted. For a protein pair (p1, p2), we extract from PubMed all the sentences in which p1 and p2 co-occur, as they are likely to describe the relationship between p1 and p2. This set of sentences is regarded as the signature of (p1, p2). The signature is obtained in two steps.
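Collecting a signature (querying PubMed for abstracts in which the two proteins co-occur, then keeping only the sentences that mention both) might be sketched as follows. The query-building helper targets the public NCBI E-utilities endpoint; the function names and the naive substring matching are illustrative assumptions, not the paper's implementation:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(p1, p2, retmax=100000):
    """Build an esearch query for PubMed abstracts mentioning both
    proteins.  Synonyms are deliberately not added to the query,
    mirroring the paper's setup."""
    term = f"{p1} AND {p2}"
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "pubmed", "term": term, "retmax": retmax})

def signature_sentences(abstract_sentences, p1, p2):
    """Keep only sentences in which both protein names co-occur."""
    return [s for s in abstract_sentences if p1 in s and p2 in s]

# toy example: one matching sentence, one irrelevant sentence
sents = signature_sentences(
    ["c-Cbl binds SH3 P17 in vitro.", "Unrelated sentence."],
    "c-Cbl", "SH3 P17")
```

Fetching the abstract text itself would use the matching efetch endpoint; rate limiting and error handling are omitted here.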


Fig. 1. Process of PPI identification using a hybrid model.

(1) Collect abstracts in which p1 and p2 co-occur by querying PubMed with the two proteins. This is done using the E-utilities (esearch and efetch) through the PubMed API [39]. We use the names of the target proteins as the keywords, without trying to expand the query with their synonyms.

(2) Search for sentences that contain both p1 and p2 throughout these abstracts. Each abstract is processed using a sentence segmentation tool [40]. Then sentences that contain the target protein pair are retrieved.

At the end of this module, each protein pair is associated with its signature. The signature describes from various aspects how the two proteins are related, and thus provides abundant evidence that supports relation recognition.

3.3. Relation representation

We use a vector space model to represent the relation between p1 and p2. Dimensions of a vector are features that characterize

the relation, and the features are extracted from the signature of a pair. Since words are crucial in expressing the relation, we use unigrams (single words) in the signature files (stop words, single-character words and numbers are removed) as features. The weight of a feature is assigned using two measurements: binary values (0/1) and term frequency-inverse document frequency (tf.idf) values. Using binary values, the weight of a feature word is 1 if it appears in the signature file and 0 otherwise. In the second measurement, a feature word is weighted according to the importance of the word in the signature file, which is captured by the tf.idf value as shown in Eq. (1).

wi = tfi × log(N/dfi)    (1)

where tfi is the frequency of the ith feature in a signature file (the term frequency), dfi is the number of signature files in which the ith feature word appears (the document frequency), and N is the total number of signatures.
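A minimal sketch of this weighting over tokenized signatures, with each signature treated as one document in Eq. (1) (function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def tfidf_vectors(signatures):
    """Weight each token by tf * log(N / df), as in Eq. (1).

    `signatures` maps a protein-pair id to its token list
    (assumed already lowercased, with stop words removed)."""
    n = len(signatures)
    # document frequency: number of signatures containing each token
    df = Counter()
    for tokens in signatures.values():
        df.update(set(tokens))
    vectors = {}
    for pair, tokens in signatures.items():
        tf = Counter(tokens)
        vectors[pair] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vectors

# toy signatures for two hypothetical pairs
sigs = {
    ("p1", "p2"): ["binds", "kinase", "binds"],
    ("p3", "p4"): ["kinase", "inhibits"],
}
vecs = tfidf_vectors(sigs)
```

Note that a token appearing in every signature gets weight 0, which is the intended effect of the idf term.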


Table 1
Features used in the single-sentence-based classifier.

BWM1F: the first word to the left of prot1
BWM1L: the second word to the left of prot1
BWF: the leftmost word between prot1 and prot2
BWL: the rightmost word between prot1 and prot2
BWO: other words between prot1 and prot2
BWM2F: the first word to the right of prot2
BWM2L: the second word to the right of prot2
BWFL: there is only one word between prot1 and prot2
BWNULL: there is no word between prot1 and prot2
BWM1FL: there is only one word to the left of prot1
BWM2FL: there is only one word to the right of prot2

Table 2
Results of the single-sentence-based approach on AImed.

           Precision    Recall    F-Score
Positive   56.6         37.5      44.2
Negative   85.8         92.4      89.0

3.4. Relational similarity calculation

In this module, the vector of the target protein pair, which represents the relation between the two involved proteins, is compared to the vectors of known protein pairs. The most similar vector is found and the label of that relation (positive or negative) is assigned to the target pair. We use the cosine similarity metric to measure the distance between two relation vectors r1 and r2. Here r1 represents the relation between the two target proteins, and r2 represents the relation denoted by a known protein pair, as shown in Eq. (2).

r1 = (r1,1, . . ., r1,n),  r2 = (r2,1, . . ., r2,n)

cos θ = (Σi=1..n r1,i · r2,i) / (√(Σi=1..n (r1,i)²) · √(Σi=1..n (r2,i)²)) = (r1 · r2) / (‖r1‖ · ‖r2‖)    (2)
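Over sparse feature vectors, the cosine of Eq. (2) can be computed as follows (representing vectors as {feature: weight} dicts is an illustrative choice, not the paper's implementation):

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors given as
    {feature: weight} dicts, following Eq. (2)."""
    dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

# identical vectors have similarity 1, disjoint ones 0
a = {"binds": 1.0, "kinase": 1.0}
b = {"inhibits": 1.0}
```

The sparse representation matters here: signature vocabularies run to thousands of features, but each pair's vector has non-zero weight on only a few of them.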

3.5. The experimental setup

In the proposed approach, we do not need to annotate the relation between two proteins suggested by a specific sentence. Instead, the training data are acquired from PPI databases such as HPRD. In our experiment, to get the positive pairs, we first extracted all the protein pairs from HPRD. Then, these pairs were searched in PubMed, and those that appear in more than one abstract were kept, which leads to a positive set of 1420 pairs. To build a negative dataset, we took an approach commonly used in bioinformatics research. Firstly, we randomly paired proteins appearing in HPRD to form an initial set. Then, the pairs in the initial set that happen to be PPIs in HPRD were removed. Finally, every remaining pair was searched throughout PubMed, and those in which the two involved proteins co-occur in more than one sentence were added to the negative set. The result was a set of 1353 negative pairs. Therefore, there are 2773 pairs in total in the dataset.

In the experiments, following Turney [35] and Nakov [36], we performed leave-one-out cross-validation to test the basic RS model. One protein pair in the dataset was taken as the test pair and the others were used as training data. This process was repeated 2773 times so that each protein pair was tested once. Using the cosine similarity measure, we built a 1-nearest-neighbor (1NN) classifier, which tags the test pair with the label (positive or negative) of the protein pair in the training set that is most similar to it. When more than one protein pair had the maximum similarity value, we counted the number of positive pairs (Cpos) and negative pairs (Cneg) respectively and checked the ratio r = Cpos/Cneg. If r > 1, the test pair got a positive label; if r < 1, it got a negative label. Otherwise, it got the label of the second nearest pair.

3.6.
Results of the basic relational similarity model

Table 3 shows the results of the basic relational similarity model in detecting positive and negative protein pairs. In the following experiments, random labeling is taken as a baseline. The performance of the two feature weighting schemes is compared in the table. Since we make predictions based on signatures of protein pairs rather than single sentences, it is difficult to make a fair comparison with existing single-sentence-based approaches. Nonetheless, we report the results of a single-sentence-based classifier on our dataset, obtained in two steps. First, we trained a single-sentence-based classifier that determines whether two proteins in a sentence interact based solely on that sentence, as in related work. Then, for each target protein pair in our dataset, we applied this classifier to every sentence in its signature. If the target pair is classified as positive in at least one sentence, we label the pair as positive; otherwise, we label it as negative.

The experiment is described as follows. In the relational-similarity-based approach presented in this paper, protein pairs are represented by unigram features in the signatures. Similarly, we use the words neighboring the target protein pair in a sentence as features to build the single-sentence-based classifier. These features have been shown to be effective in this task; they are detailed in Table 1. We built the single-sentence-based classifier on the AImed corpus [11] using a support vector machine (SVM). In this corpus, interactions in a sentence are manually labeled, and AImed has been used to evaluate many PPI identification approaches. The five-fold cross-validation results on AImed using libsvm [41] with default parameters are shown in Table 2. We then used the whole AImed corpus as training data to train the single-sentence-based classifier and applied it to our dataset, as discussed above. The results are presented in Table 3.

As shown in the table, using binary weights, the basic RS model achieves a much higher F-score (about 25% improvement) than the random baseline in detecting both interactions and non-interactions. Furthermore, precision and recall are well balanced. Compared to the single-sentence-based approach, the basic RS model achieves much higher recall (+14.8%) in detecting positive pairs while precision is almost the same, i.e., the basic RS model is able to pick up more PPIs. On negative pairs, the basic RS model also achieves a higher F-score.

Table 3
Results of the basic RS model.

                          Precision    Recall    F-Score
Positive
  Random labeling         51.2         50.0      50.6
  Single-sentence-based   75.5         59.7      66.6
  Binary (0/1)            75.6         74.5      75.0
  tf.idf                  65.8         81.3      72.7
Negative
  Random labeling         48.8         50.0      49.4
  Single-sentence-based   65.3         79.7      71.8
  Binary (0/1)            73.6         74.7      74.2
  tf.idf                  73.9         55.6      63.5
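The 1NN decision rule of Section 3.5, including its tie-breaking scheme, can be sketched as follows (the sparse-vector cosine helper and the data layout are illustrative assumptions, not the paper's code):

```python
import math

def cosine(v1, v2):
    """Cosine similarity (Eq. (2)) between sparse {feature: weight} dicts."""
    dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify_1nn(target, training):
    """`training` is a list of ({feature: weight}, label) tuples,
    label being "pos" or "neg".  Ties at the maximum similarity are
    broken by majority vote; an exact tie falls back to the label of
    the next most similar pair, as described in Section 3.5."""
    scored = sorted(((cosine(target, v), lbl) for v, lbl in training),
                    reverse=True, key=lambda x: x[0])
    best = scored[0][0]
    nearest = [lbl for s, lbl in scored if s == best]
    c_pos = nearest.count("pos")
    c_neg = nearest.count("neg")
    if c_pos > c_neg:
        return "pos"
    if c_neg > c_pos:
        return "neg"
    # exact tie: take the label of the second-nearest pair
    for s, lbl in scored:
        if s < best:
            return lbl
    return "pos"  # degenerate fallback when every similarity ties
```

Leave-one-out cross-validation then simply calls `classify_1nn` once per pair, with that pair held out of `training`.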


Table 4
The number of positive and negative pairs having the largest similarity value greater than α.

α      Positive pairs, number (percentage)    Negative pairs, number (percentage)
0.3    866 (61.0%)                            682 (50.4%)
0.4    683 (48.1%)                            450 (33.3%)
0.5    499 (35.1%)                            316 (23.2%)
0.6    318 (22.4%)                            255 (18.8%)

The results clearly show that descriptions of PPIs extracted from the large corpus present some common patterns, and these patterns are usually not observed in the negative pairs. This commonality can be effectively captured by the proposed similarity-based model. Therefore, identifying PPIs by measuring the similarity of descriptions extracted from a large corpus is a direction worth further investigation. The tf.idf weighting scheme tends to label more pairs as interactions, and thus achieves better recall on the positive class. In contrast, using binary weights yields a less biased model in detecting interactions and non-interactions; it gets a better F-score on both the positive and the negative classes. Binary weights are used in all the following experiments.

Although every protein pair in the dataset has at least one nearest neighbor, the actual similarity value can be low. Table 4 shows the number of correctly classified positive and negative pairs whose similarity with the nearest neighbor is greater than a threshold α. As shown in the table, at every similarity level (indicated by α), the percentage of positive pairs is larger than that of the negative pairs, i.e., more positive pairs can find a neighbor that is genuinely similar to them. This result again reveals that descriptions of interactions are more similar to each other, while descriptions of non-interactions are more diverse.

We also evaluated the KNN classifier with different values of K. The results are shown in Fig. 2 (on the positive pairs) and Fig. 3 (on the negative pairs). For the positive pairs, as K increases, precision first goes up and then slowly decreases; the highest precision is achieved when K is 5. In contrast, recall first decreases and then slowly increases. For the negative pairs, recall first increases and then decreases slowly; the highest recall is achieved when K is 5. There is not much change in precision for the negative pairs.
This indicates that when K gets bigger, some negative pairs that happen to have a positive nearest neighbor are correctly labeled, while at the same time some true positive pairs are misclassified. When K gets even bigger, precision and recall approach their values at K = 1; as a result, the F-score also approaches that of the 1NN classifier. This is true for both positive and negative pairs. Generally, the F-score of the positive pairs does not change much as K increases, while that of the negative pairs is higher. In the next section, we attempt to improve the basic RS model by incorporating a word similarity model, which is also derived from PubMed.

Fig. 2. Results of the KNN classifier on positive pairs (precision, recall and F-score (%) plotted against K = 1, 3, 5, . . ., 17).

Fig. 3. Results of the KNN classifier on negative pairs (precision, recall and F-score (%) plotted against K = 1, 3, 5, . . ., 17).

4. The hybrid model

The performance of similarity-based approaches may be degraded by the sparsity of lexical features. In our basic RS model, even though more than 5000 features are extracted from the signature files of protein pairs, only a small number of them are relevant to a specific pair. For example, the words molecule and protein are features in the vector space model of protein pairs. The two words may appear separately in the signatures of two positive pairs A and B; hence, in the feature vector of each pair, only one of the two corresponding dimensions has a non-zero value. Therefore, although these two words are clearly semantically related, the connection between them would be missed in the similarity calculation. As a result, the similarity of the two pairs could be decreased. If the similarity value of its nearest neighbor is low, the confidence of the label a pair receives is also low. This issue is indeed observed with the basic RS model: we found that about 64.6% of the incorrectly classified protein pairs have a low nearest-neighbor similarity value (less than 0.35). This motivates our work to explore word similarity to enhance the basic RS model. More specifically, a word similarity model is built and applied to reveal the connections between semantically related features in the basic RS model. Thus, the absence of lexical features can be compensated for by taking these connections into account when calculating the similarity between protein pairs.

4.1. The word similarity model

4.1.1. A corpus-based strategy for word similarity calculation

A straightforward way to find similar words is to search existing resources such as dictionaries or thesauri. However, the disadvantages are also obvious. Firstly, words in these resources are usually general, and hence not sensitive to a specific domain or task.
For example, in the PPI identification task, words like inhibit, stimulate and induce are similar in the sense that they are often used to indicate the type of protein interaction, yet this connection is unlikely to be found in available dictionaries. Secondly, these resources are limited in coverage. This problem becomes more serious when dealing with a rapidly growing corpus such as PubMed, in which new words and new expressions appear every day. Therefore, we explore a corpus-based strategy to identify similar words. The hypothesis is that two words are similar if they occur in similar contexts. In this approach, a word is represented by its distributional profile [42,43], which is formed by the contexts of the word in a large-scale corpus. The context is defined as the words within a fixed-size window around the target word. The distributional profiles of words are then compared to each other to find those that are similar. In our experiment, PubMed is used to get the distributional profiles of the words appearing in the signatures of the protein pairs.
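Collecting distributional profiles with a fixed-size window can be sketched as follows (the paper uses a 5-word window over PubMed text; the tokenization and function names here are illustrative):

```python
from collections import Counter, defaultdict

def distributional_profiles(sentences, targets, window=5):
    """For each target word, count the words occurring within
    `window` positions of it across the corpus sentences."""
    profiles = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in targets:
                lo = max(0, i - window)
                # context excludes the target occurrence itself
                context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
                profiles[tok].update(context)
    return profiles

# toy corpus of pre-tokenized sentences
corpus = [["proteinA", "binds", "proteinB", "in", "vivo"],
          ["proteinC", "binds", "proteinD"]]
profs = distributional_profiles(corpus, {"binds"}, window=2)
```

Stop-word and number removal, applied in the paper before profiling, would simply filter `tokens` first.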


4.1.2. The target words in the word similarity model

The purpose of incorporating the word similarity model is to build connections between the feature words in the vectors of protein pairs, and thus to relieve the problem of data sparseness in calculating the similarity of these pairs. Therefore, we took all the words in the signatures (from which features are extracted) of the protein pairs in the dataset as the initial target words. These words were then grouped by their part-of-speech (POS) tags. Since nouns, verbs, adjectives, and adverbs are content words that actually convey meaning, they were kept as the target words and the other groups were removed. To get the POS tags, we conducted shallow parsing on the sentences in the signature files using the Apache OpenNLP syntactic analysis tool [44]. A word may belong to more than one group.

4.1.3. The word similarity matrix

We calculated the similarity of any two target words having the same POS tag. Four similarity matrices, corresponding to nouns, verbs, adjectives, and adverbs, were derived by the following steps. Take the verb group GV as an example.

S1: For each target word in GV, construct its distributional profile. We randomly extracted 1 GB of abstracts from PubMed and used them as the corpus from which to extract the context of a target word. The context is defined as the words within a 5-word window around each occurrence of the target word in the corpus (stop words, single-character words and numbers were removed). All the context words constitute the distributional profile of the target word.

S2: Represent the distributional profile using the vector space model. The distributional profiles of all target words in the same POS group form a profile set. All words in this set were used as features to build a co-occurrence vector that represents a target word. The weight of a feature was determined by the conditional probability P(w|w1), where w1 is the target word and w is the feature word. It was estimated as the frequency with which w and w1 co-occur divided by the frequency with which w1 appears in the 1-GB PubMed corpus. After that, a co-occurrence matrix A was established, in which each row is the co-occurrence vector of a target word in GV.

S3: Calculate the similarity of any two words within the same POS group. Since each target word is represented by its co-occurrence vector, the similarity between two target words can be measured by the cosine similarity metric (Eq. (2)). The result is a word similarity matrix B, in which rows and columns correspond to the target words in GV. The element Bij is the similarity value between the word at row i and the word at column j. To get matrix B, we first normalized matrix A to unit length. Then B can be calculated as A · AT. Since B is symmetric, only a |GV| × |GV| triangular part with |GV|(|GV| − 1)/2 similarity values needs to be stored.

Some results of the word similarity calculation are shown in Table 5. In the table, the words most similar to each target word, together with the similarity values, are listed in descending order. We can see, first of all, that the similarity between inflections of a word (e.g., bind, binds, bound) is high, which provides clear evidence that comparing distributional profiles of words is effective in gathering similar words. Secondly, as expected, the distributional profiles collected from PubMed are domain-specific. They characterize words by expressing the roles they play in various biological processes, and thus are very helpful in revealing semantic connections between words that are relevant to specific tasks.
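Steps S2 and S3 amount to row-normalizing the co-occurrence matrix and multiplying it by its transpose. A sketch in pure Python (the matrix contents are toy values; rows stand for target words, columns for profile features):

```python
import math

def word_similarity_matrix(A):
    """Given a co-occurrence matrix A (one row per target word, as a
    list of lists), normalize each row to unit length and return
    B = A_norm · A_norm^T, whose entry B[i][j] is the cosine
    similarity (Eq. (2)) of words i and j."""
    normed = []
    for row in A:
        n = math.sqrt(sum(x * x for x in row))
        normed.append([x / n for x in row] if n else row)
    size = len(A)
    dims = len(A[0])
    return [[sum(normed[i][k] * normed[j][k] for k in range(dims))
             for j in range(size)] for i in range(size)]

# words 0 and 1 share a profile; word 2 has a disjoint one
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = word_similarity_matrix(A)
```

Since B is symmetric with a unit diagonal, an implementation would store only its upper triangle, as noted in S3.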

Table 5
Results of word similarity calculation.

Target word   Most similar words and the similarity values
Binds         (interacts, 0.84649) (binding, 0.74441) (bind, 0.74436) (bound, 0.71035) (interacted, 0.70552) (interact, 0.70062)
Enhances      (enhanced, 0.90721) (enhance, 0.8555) (augments, 0.82522) (inhibits, 0.8201) (augmented, 0.81756) (suppresses, 0.80275) (affect, 0.80226)
Examine       (investigate, 0.96131) (determine, 0.91181) (evaluate, 0.90132) (explore, 0.897) (assess, 0.85986) (ascertain, 0.84071)
Inhibit       (stimulate, 0.85974) (inhibits, 0.83933) (enhance, 0.83764) (inhibited, 0.81549) (suppress, 0.79658) (induce, 0.78485) (interfere, 0.77744) (inhibition, 0.74436) (enhances, 0.74354) (augment, 0.73779) (affect, 0.7241) (enhanced, 0.72117) (inhibiting, 0.71708) (promote, 0.71179) (stimulates, 0.70733) (modulate, 0.70056) (suppresses, 0.69349)
Interact      (bind, 0.84152) (interacting, 0.82057) (interacts, 0.80558) (act, 0.762) (interacted, 0.754) (binds, 0.70062) (activate, 0.6779) (interactions, 0.67159)

As Table 5 shows, our similarity matrix indicates that bind, interact, and activate are very similar, because they are used in similar contexts describing similar biological processes, such as PPI. Inhibit, stimulate, enhance, suppress, and induce are regarded as similar words for the same reason, although some of them are antonyms in general dictionaries. Nonetheless, the similarity revealed by our matrix is clearly reasonable and crucial in PPI identification.

4.2. Exploring word similarity in PPI identification

In this section, the four POS-grouped word similarity matrices are incorporated into the basic RS model, as shown in the wavy-line area of Fig. 1. In the new hybrid model, a voting scheme is first used to make an initial prediction of the label of a protein pair. The votes are given separately by a 1NN classifier and a KNN (K > 1) classifier, both using the basic RS model. If the two votes agree, they are taken as the final decision. Otherwise, the weights of features in the vectors of protein pairs are adjusted according to the word similarity matrices, and the 1NN classifier is applied again to obtain the final prediction. Fig. 4 gives the detailed algorithm. Feature words of protein pairs were extracted from the signature files of all protein pairs in the dataset and grouped by their POS tags. In the similarity matrix of a POS group, each element is the similarity value of two feature words with the same POS tag. θ is a threshold determining whether a similarity value should be used to update the weight of a feature word. Various θ values are evaluated in the experiments in the next section.

4.3. Results and analysis

In this section, the hybrid model is evaluated and the results are compared to those of the basic RS model. In addition, another approach that uses the centroid of the training data to relieve the data sparseness problem is also evaluated.
In this approach, a centroid vector is calculated for the positive pairs and the negative pairs respectively by averaging the vectors of the pairs in each class. The label of a target protein pair is then the class of the nearest centroid. Table 6 shows the results of the nearest centroid method and the hybrid model on positive pairs when using different K in the KNN classifier at different similarity thresholds θ; Table 7 presents the results on negative pairs. The highest precision, recall, and F-score in every K-group are marked with an asterisk.

Please cite this article in press as: Niu Y, Wang Y. Protein–protein interaction identification using a hybrid model. Artif Intell Med (2015), http://dx.doi.org/10.1016/j.artmed.2015.05.003

Table 6
Results of the hybrid model on positive pairs.

Model              K   θ     Precision       Recall          F-score
Basic RS model     –   –     75.6            74.5            75.0
Nearest centroid   –   –     79.3            47.1            59.1
Hybrid model       3   0.9   77.6 (+2.0)     74.8 (+0.3)*    76.2 (+1.2)*
                   3   0.8   78.2 (+2.6)     73.8 (−0.7)     75.9 (+0.9)
                   3   0.7   78.5 (+2.9)*    73.6 (−0.9)     75.9 (+0.9)
                   5   0.9   77.5 (+1.9)     75.0 (+0.5)*    76.2 (+1.2)
                   5   0.8   78.7 (+3.1)     74.5 (+0)       76.5 (+1.5)*
                   5   0.7   79.1 (+3.5)*    74.0 (−0.5)     76.5 (+1.5)*
                   7   0.9   77.7 (+2.1)     75.5 (+1.0)*    76.6 (+1.6)
                   7   0.8   78.8 (+3.2)     75.1 (+0.6)     76.9 (+1.9)*
                   7   0.7   79.1 (+3.5)*    74.1 (−0.4)     76.5 (+1.5)
                   9   0.9   78.0 (+2.4)     75.1 (+0.6)*    76.5 (+1.5)
                   9   0.8   79.1 (+3.5)*    75.1 (+0.6)*    77.1 (+2.1)*
                   9   0.7   78.7 (+3.1)     73.5 (−1.0)     76.0 (+1.0)

Table 7
Results of the hybrid model on negative pairs.

Model              K   θ     Precision       Recall          F-score
Basic RS model     –   –     73.6            74.7            74.2
Nearest centroid   –   –     61.1            87.1            71.8
Hybrid model       3   0.9   74.5 (+0.9)*    77.3 (+2.6)     75.9 (+1.7)
                   3   0.8   74.0 (+0.4)     78.4 (+3.7)     76.2 (+2.0)
                   3   0.7   74.0 (+0.4)     78.8 (+4.1)*    76.3 (+2.1)*
                   5   0.9   74.6 (+1.0)*    77.2 (+2.5)     75.9 (+1.7)
                   5   0.8   74.6 (+1.0)*    78.8 (+4.1)     76.7 (+2.5)
                   5   0.7   74.4 (+0.8)     79.5 (+4.8)*    76.9 (+2.7)*
                   7   0.9   75.0 (+1.4)     77.3 (+2.6)     76.2 (+2.0)
                   7   0.8   75.1 (+1.5)*    78.8 (+4.1)     76.9 (+2.7)*
                   7   0.7   74.5 (+0.9)     79.3 (+4.6)*    76.8 (+2.6)
                   9   0.9   74.9 (+1.3)     77.8 (+3.1)     76.3 (+2.1)
                   9   0.8   75.2 (+1.6)*    79.2 (+4.5)*    77.1 (+2.9)*
                   9   0.7   74.0 (+0.4)     79.2 (+4.5)*    76.5 (+2.3)

As shown in the tables, the nearest centroid method tends to label a protein pair as negative, so recall of the positive class is quite low. The feature weights in the negative centroid vector are generally low, because negative pairs share fewer common features than positive pairs do. On the other hand, the vector representing a target pair is sparse. Hence, the similarity between the target pair and the negative centroid is more likely to be the higher one, which is probably the reason for the high recall of the negative class. The F-score of this method is worse than that of the basic RS model on both the positive and the negative class.

The F-score achieved by the hybrid model (K = 9, θ = 0.8) is significantly higher than that of the basic RS model (paired t-test, p < 0.05). Furthermore, the hybrid model achieves a better F-score in all cases on both the positive and the negative pairs. This clearly shows that incorporating the word similarity matrix leads to a better model of relational similarity calculation.

On the positive pairs (Table 6), compared to the precision achieved by the basic RS model (75.6%), the best precision achieved by the hybrid model is 79.1%. In three of the four K-groups (K = 3, 5, 7), the best precision is achieved when θ is 0.7, suggesting that 0.7 is a better choice if precision has higher priority in a PPI recognition task. The best recall achieved by the hybrid model is 75.5%, about 1 percentage point higher than that of the basic RS model. In all four K-groups, the best recall is achieved when θ is 0.9, suggesting that 0.9 is a better choice if recall is valued more. The best F-score is achieved when K = 9 and θ is 0.8, about 2.1 points higher than the baseline.

As shown in Table 7, on negative pairs the hybrid model outperforms the basic RS model in all cases. The highest precision is 75.2%, about 1.6 points higher than the basic model; the highest recall is 79.5%, about 4.8 points higher; and the largest improvement in F-score is about 2.9 points.
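As a concrete illustration, the nearest-centroid baseline described above can be sketched as follows. This is a hedged sketch under the assumption that pairs are represented as feature vectors in one shared space and compared by cosine similarity; the function names are hypothetical.

```python
import numpy as np

def nearest_centroid_label(target_vec, pos_vecs, neg_vecs):
    """Assign the class of the nearer centroid under cosine similarity.

    pos_vecs / neg_vecs: feature vectors of known interacting and
    non-interacting protein pairs, all in the same feature space.
    """
    def unit_centroid(vecs):
        c = np.mean(vecs, axis=0)        # average the class's pair vectors
        return c / np.linalg.norm(c)     # unit length for cosine comparison

    v = target_vec / np.linalg.norm(target_vec)
    sim_pos = float(v @ unit_centroid(pos_vecs))
    sim_neg = float(v @ unit_centroid(neg_vecs))
    return "positive" if sim_pos >= sim_neg else "negative"
```

With sparse target vectors and a low-weight negative centroid, this rule behaves as described in the analysis above: it drifts toward the negative class.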

Fig. 4. Algorithm for identifying PPIs using the hybrid model.
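Since Fig. 4 is not reproduced here, the following sketch shows one plausible reading of the voting-plus-weight-adjustment scheme of Section 4.2. The function names, and in particular the exact update rule in `adjust_weights` (propagating the weight of the most similar active feature), are assumptions rather than a transcription of the paper's algorithm.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def knn_label(target, train_vecs, train_labels, k):
    """Majority vote among the k nearest neighbors by cosine similarity."""
    sims = [cosine(target, t) for t in train_vecs]
    top = np.argsort(sims)[::-1][:k]
    votes = [train_labels[i] for i in top]
    return max(set(votes), key=votes.count)

def adjust_weights(vec, word_sim, theta=0.8):
    """Raise zero-weight features that are similar (>= theta) to active ones.

    word_sim: precomputed word similarity matrix over the feature words.
    """
    adjusted = vec.copy()
    active = np.nonzero(vec)[0]
    for j in np.where(vec == 0)[0]:
        sims = word_sim[j, active]
        if sims.size and sims.max() >= theta:
            adjusted[j] = vec[active[np.argmax(sims)]] * sims.max()
    return adjusted

def hybrid_predict(target, train_vecs, train_labels, word_sim, k=9, theta=0.8):
    v1 = knn_label(target, train_vecs, train_labels, 1)   # 1NN vote
    vk = knn_label(target, train_vecs, train_labels, k)   # KNN vote
    if v1 == vk:
        return v1                                         # agreement: final decision
    t = adjust_weights(target, word_sim, theta)           # disagreement: re-weight
    train = [adjust_weights(x, word_sim, theta) for x in train_vecs]
    return knn_label(t, train, train_labels, 1)           # final 1NN prediction
```

The threshold θ plays the role described in the text: only similarity values at or above θ trigger an update of a feature's weight.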


Table 8
Results on positive protein pairs (θ = 0.8).

K   Strategy     Precision   Recall   F-score
3   Strategy 1   78.2        73.8     75.9
3   Strategy 2   78.0        74.2     76.1
5   Strategy 1   78.7        74.5     76.5
5   Strategy 2   78.7        75.1     76.9
7   Strategy 1   78.8        75.1     76.9
7   Strategy 2   78.9        75.6     77.2
9   Strategy 1   79.1        75.1     77.1
9   Strategy 2   79.2        75.6     77.3

Table 9
Results on negative protein pairs (θ = 0.8).

K   Strategy     Precision   Recall   F-score
3   Strategy 1   74.0        78.4     76.2
3   Strategy 2   74.3        78.0     76.1
5   Strategy 1   74.6        78.8     76.7
5   Strategy 2   75.1        78.7     76.8
7   Strategy 1   75.1        78.8     76.9
7   Strategy 2   75.5        78.8     77.1
9   Strategy 1   75.2        79.2     77.1
9   Strategy 2   75.5        79.2     77.3

As shown in Tables 6 and 7, when θ is 0.8 the hybrid model achieves the best F-score in most K-groups.

In the weight-adjustment step of the PPI recognition algorithm (Fig. 4), any feature word whose weight is zero is considered for adjustment. In this process, we noticed that some feature words are seldom involved in portraying the relations. For example, words like describe, explain, and depict often appear in abstracts of journal articles merely to introduce the structure of the paper, so adjusting their weights according to the word similarity matrix could be misleading. On the other hand, some words are more likely to play a role in expressing the relationships. For example, the words neighboring protein mentions in a sentence are often part of the relation expressions, and adjusting their weights is more likely to benefit the hybrid model. Based on this observation, in the following experiment we selected a subset of the features as candidates for weight adjustment: from the signature files of the target protein pairs, we extracted the words that co-occur with the target proteins within a five-word window. Tables 8 and 9 show the results of exploiting this subset (strategy 2); for comparison, the results obtained by the algorithm in Fig. 4 are also shown (strategy 1). As the tables show, on positive pairs strategy 2 improves recall in all K-groups without lowering precision, so F-score in all groups is also improved. On negative pairs, strategy 2 improves precision without lowering recall and achieves a better F-score. Since the candidate set in strategy 2 (2632 words) is a subset of that in strategy 1 (4867 words), strategy 2 is also more efficient in similarity calculation. These results suggest that identifying a proper subset of the feature words as candidates for weight adjustment is a direction worth further exploration.
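Strategy 2's candidate selection, keeping only the words that appear within a five-word window of a protein mention, can be sketched as below. The tokenisation is simplified and the function name is hypothetical.

```python
def adjustment_candidates(sentences, protein_mentions, window=5):
    """Collect feature words that co-occur with a target protein
    within a five-word window (strategy 2's candidate set)."""
    candidates = set()
    mentions = {m.lower() for m in protein_mentions}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in mentions:
                # words up to `window` positions before and after the mention
                ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                candidates.update(w for w in ctx if w not in mentions)
    return candidates
```

Restricting weight adjustment to this smaller set both cuts the number of similarity lookups and, per the results above, avoids misleading updates from structural words such as describe or explain.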
5. Conclusions

This paper contributes to the research on mining PPIs from the literature, with a focus on addressing the difficulties of current single-sentence-based approaches. Specifically, we (1) propose a novel approach that recognizes PPIs within the framework of relational similarity, collecting evidence from the context of protein pairs in a large corpus; (2) construct word similarity matrices that are sensitive to the PPI identification task using a corpus-based approach; and (3) develop a hybrid model that integrates the word similarity model with the basic RS model, further improving performance. Experimental results show the effectiveness of the hybrid model, and the output of the system can be added to the PPI network directly. Moreover, the proposed approach takes known PPIs in existing PPI databases (e.g., HPRD) as the training data, so no extra annotation is required.

We believe that adopting the relational similarity framework in PPI identification has great potential. First, it would be natural to explore semi-supervised algorithms within the framework of relational similarity calculation. Semi-supervised learning algorithms have been successfully applied in biomedicine [45–47]; many of them leverage similarity between labeled data and a large amount of unlabeled data to improve classification accuracy [48,49]. For example, Zhu et al. [48] developed an algorithm that propagates labels from labeled data points to their neighbors in the feature space, under the hypothesis that similar data points tend to have the same label. Our hybrid model, which measures similarity between protein pairs, matches this hypothesis well, and our approach provides the similarity measurements that are essential in such algorithms. In PPI research the number of known PPIs is small while the number of unlabeled protein pairs is huge, so exploring semi-supervised algorithms that exploit both the small amount of labeled PPIs and the large amount of unlabeled protein pairs is a promising direction. Secondly, the RS-based approach could be extended to other biological relation detection tasks, e.g., gene–disease association detection, and even to tasks with many target classes, such as gene–function correlation identification. Our work on PPI detection serves as an example that should benefit similar tasks.

Several avenues might be explored in future work. In the proposed approach, feature weights are adjusted to relieve problems caused by the sparseness of features.
In the next step, we will investigate other strategies for this issue. As shown in our experiments, identifying an appropriate subset of features for weight adjustment could lead to a more efficient system with higher accuracy; we believe this is a direction worth further exploration.

Acknowledgments

This study is funded by the National Natural Science Foundation of China, grant #61202132. We gratefully acknowledge access to the U.S. National Library of Medicine (NLM) MEDLINE/PubMed database through the NLM Data Distribution Program, using the license code 1463NLM154.

References

[1] Chatr-aryamontri A, Breitkreutz B-J, Oughtred R, Boucher L, Heinicke S, Chen D, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2015;43:D470–8.
[2] Bader GD, Betel D, Hogue CW. BIND: the biomolecular interaction network database. Nucleic Acids Res 2003;31(1):248–50.
[3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res 2004;32(Suppl. 1):D449–51.
[4] Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human protein reference database: 2009 update. Nucleic Acids Res 2009;37(Suppl. 1):D767–72.
[5] Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, et al. IntAct: open source resource for molecular interaction data. Nucleic Acids Res 2007;35(Suppl. 1):D561–5.
[6] Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, et al. MINT: the molecular interaction database. Nucleic Acids Res 2007;35(Suppl. 1):D572–4.
[7] Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A. Overview of the protein–protein interaction annotation extraction task of BioCreative II. Genome Biol 2008;9:S4.
[8] Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A. An overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 2010;7(3):385–99.
[9] Albert S, Gaudan S, Knigge H, Raetsch A, Delgado A, Huhse B, et al. Computer-assisted generation of a protein-interaction database for nuclear receptors. Mol Endocrinol 2003;17(8):1555–67.


[10] Grimes G, Wen T, Mewissen M, Baxter RM, Moodie S, Beattie J, et al. PDQ Wizard: automated prioritization and characterization of gene and protein lists using biomedical literature. Bioinformatics 2006;22(16):2055–7.
[11] Bunescu R, Ge R, Kate RJ, Marcotte EM, Mooney RJ, Ramani AK, et al. Comparative experiments on learning information extractors for proteins and their interactions. Artif Intell Med 2005;33(2):139–55.
[12] Hao Y, Zhu X, Huang M, Li M. Discovering patterns to extract protein–protein interactions from the literature: Part II. Bioinformatics 2005;21(15):3294–300.
[13] Jang H, Lim J, Lim J-H, Park S-J, Lee K-C, Park S-H. Finding the evidence for protein–protein interactions from PubMed abstracts. Bioinformatics 2006;22(14):e220–6.
[14] Plake C, Hakenberg J, Leser U. Optimizing syntax patterns for discovering protein–protein interactions. In: Liebrock LM, editor. Proceedings of the 2005 ACM symposium on applied computing. New York: ACM; 2005. p. 195–201.
[15] Romano L, Kouylekov M, Szpektor I, Dagan I, Lavelli A. Investigating a generic paraphrase-based approach for relation extraction. In: McCarthy D, Wintner S, editors. Proceedings of the 11th conference of the European chapter of the Association for Computational Linguistics. Stroudsburg, PA: Morgan Kaufmann Publishers/ACL; 2006.
[16] Yakushiji A, Miyao Y, Tateisi Y, Tsujii J. Biomedical information extraction with predicate-argument structure patterns. In: Hahn U, Valencia A, editors. Proceedings of the first international symposium on semantic mining in biomedicine (SMBM), CEUR workshop proceedings. RWTH Aachen University; 2005.
[17] Ananiadou S, Kell DB, Tsujii J-I. Text mining and its potential applications in systems biology. Trends Biotechnol 2006;24(12):571–9.
[18] Kabiljo R, Clegg AB, Shepherd AJ. A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinform 2009;10(1):233.
[19] Rinaldi F, Schneider G, Kaljurand K, Clematide S, Vachon T, Romacker M. OntoGene in BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 2010;7(3):472–80.
[20] Mitsumori T, Murata M, Fukuda Y, Kouichi D, Hirohumi D. Extracting protein–protein interaction information from biomedical text with SVM. IEICE Trans Inf Syst 2006;89(8):2464–6.
[21] Giuliano C, Lavelli A, Romano L. Exploiting shallow linguistic information for relation extraction from biomedical literature. In: McCarthy D, Wintner S, editors. Proceedings of the 11th conference of the European chapter of the Association for Computational Linguistics. Stroudsburg, PA: Morgan Kaufmann Publishers/ACL; 2006. p. 98–113.
[22] Niu Y, Otasek D, Jurisica I. Evaluation of linguistic features useful in extraction of interactions from PubMed; application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics 2010;26(1):111–9.
[23] Bui Q-C, Katrenko S, Sloot PM. A hybrid approach to extract protein–protein interactions. Bioinformatics 2011;27(2):259–65.
[24] Qian W, Fu C, Cheng H. Semi-supervised method for extraction of protein–protein interactions using hybrid model. In: Werner B, editor. The third international conference on intelligent system design and engineering applications (ISDEA), IEEE-CPS; 2013. p. 1268–71.
[25] Miwa M, Sætre R, Miyao Y, Tsujii J. Protein–protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inf 2009;78(12):e39–46.
[26] Kim S, Yoon J, Yang J, Park S. Walk-weighted subsequence kernels for protein–protein interaction extraction. BMC Bioinform 2010;11(1):107.
[27] Chowdhury FM, Lavelli A, Moschitti A. A study on dependency tree kernels for automatic extraction of protein–protein interaction. In: Omnipress I, editor. Proceedings of the 10th workshop on biomedical natural language processing, BioNLP 2011. Stroudsburg, PA: Morgan Kaufmann Publishers/ACL; 2011. p. 124–33.
[28] Zhang Y, Lin H, Yang Z, Li Y. Neighborhood hash graph kernel for protein–protein interaction extraction. J Biomed Inform 2011;44(6):1086–92.


[29] Qian L, Zhou G. Tree kernel-based protein–protein interaction extraction from biomedical literature. J Biomed Inform 2012;45(3):535–43.
[30] Zhang S-W, Hao L-Y, Zhang T-H. Prediction of protein–protein interaction with pairwise kernel support vector machine. Int J Mol Sci 2014;15(2):3220–33.
[31] Chen P, Guo J, Yu Z, Wei S, Zhou F, Yan X. Protein–protein interaction extraction based on convex combination kernel function. J Comput Commun 2013;1:9.
[32] Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M. Discovering patterns to extract protein–protein interactions from full texts. Bioinformatics 2004;20(18):3604–12.
[33] Wren JD, Garner HR. Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 2004;20(2):191–8.
[34] Medin DL, Goldstone RL, Gentner D. Similarity involving attributes and relations: judgments of similarity and difference are not inverses. Psychol Sci 1990;1(1):64–9.
[35] Turney PD. Similarity of semantic relations. Comput Linguist 2006;32(3):379–416.
[36] Nakov P, Hearst MA. Solving relational similarity problems using the web as a corpus. In: McKeown K, Moore JD, Teufel S, Allan J, Furui S, editors. Proceedings of the 46th annual meeting of the Association for Computational Linguistics. Stroudsburg, PA: Morgan Kaufmann Publishers/ACL; 2008. p. 452–60.
[37] Bollegala DT, Matsuo Y, Ishizuka M. Measuring the similarity between implicit semantic relations from the web. In: Quemada J, León G, Maarek YS, Nejdl W, editors. Proceedings of the 18th international conference on world wide web. New York: ACM Press; 2009. p. 651–60.
[38] Toch E, Reinhartz-Berger I, Dori D. Humans, semantic services and similarity: a user study of semantic web services matching and composition. Web Semant: Sci Serv Agents World Wide Web 2011;9(1):16–28.
[39] PubMed E-utilities; 2010. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ [accessed May 2011].
[40] University of Illinois at Urbana-Champaign. Sentence segmentation tool; 2011. http://cogcomp.cs.illinois.edu/page/tools view/2 [accessed May 2011].
[41] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines; 2013. http://www.csie.ntu.edu.tw/cjlin/libsvm/ [accessed December 2014].
[42] Lin D. Automatic retrieval and clustering of similar words. In: Boitet C, Whitelock P, editors. 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics. Morgan Kaufmann Publishers/ACL; 1998. p. 768–74.
[43] Turney P. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Raedt LD, Flach PA, editors. Proceedings of the 12th European conference on machine learning. New York: Springer; 2001. p. 491–502.
[44] Apache Software Foundation. Apache OpenNLP 1.5.2-incubating; 2010. http://opennlp.apache.org/index.html [accessed December 2012].
[45] Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources. In: Proceedings of the international conference on intelligent systems for molecular biology (ISMB); 1999. p. 77–86.
[46] Rinaldi F, Schneider G, Clematide S. Relation mining experiments in the pharmacogenomics domain. J Biomed Inform 2012;45(5):851–61.
[47] Clematide S, Rinaldi F. Ranking relations between diseases, drugs and genes for a curation task. J Biomed Semant 2012;3(Suppl. 3):S5.
[48] Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: Fawcett T, Mishra N, editors. Proceedings of the 20th international conference on machine learning (ICML). Palo Alto, CA: AAAI Press; 2003. p. 912–9.
[49] Valentini G, Paccanaro A, Caniza H, Romero AE, Re M. An extensive analysis of disease–gene associations using network integration and fast kernel-based gene prioritization methods. Artif Intell Med 2014;61(2):63–78.
