This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2015.2420754, IEEE Transactions on NanoBioscience

more common interacting partners are more likely to interact [27]. The effectiveness of local-neighbor-based methods is hindered by the fact that PPI networks are typically very sparse in practice, which can make the local information unstable and inaccurate [12, 27]. Motivated by this issue, some researchers have proposed to improve the local-neighbor heuristics by taking into consideration longer-range relationships between protein nodes and even the global topological structure of PPI networks [12, 18, 19, 27-31]. Several approaches of this type are based on random walks on graphs [18, 32]. The accuracy of random-walk-based methods is reduced by hub nodes in the networks, i.e., proteins that have many associated interactions. To neutralize this unwanted effect, Lei et al. recently proposed the Random Walk with Resistance (RWS) approach, which caps the influence of hub nodes [12]. On the other hand, Zhu et al. proposed to identify false PPI links using a customized generative network model (RIGM) in which the scale-free property of the PPI network is explicitly modeled [33]. Alternatively, several latent feature models have been proposed for assessing PPI networks [19, 27-29, 33], in which the nodes of the studied PPI network are embedded into a Euclidean space. Several of these latent feature methods were developed based on the model assumption that PPI networks have a geometric structure [27, 34, 35], which has been experimentally demonstrated to accurately capture the statistical properties of real-world PPI networks [34, 36].

Under this assumption, if two vectors are close to each other in the embedding space, their associated proteins are hypothesized to have a large chance of interacting [34, 37]. Based on this assumption, Kuchaiev et al. [28] first constructed a pairwise distance matrix between protein nodes, and then used multidimensional scaling (MDS) to learn an embedding that preserves the distance matrix as well as possible. Finally, the PPIs are ranked according to the Euclidean distance between their learned feature representations, which follows naturally from the geometric assumption described above. Instead of MDS, You et al. [27] adopted isometric feature mapping (ISOMAP) [38], a manifold learning algorithm, to learn latent feature representations of proteins, and then used FSWeight to calculate a reliability index for each PPI. Experimental evaluation shows that the predictive performance of their approach is consistently competitive even when the PPI network under evaluation is sparse and large [27].

Despite the encouraging performance of current latent feature methods for pruning PPI networks, their effectiveness is affected by a severe flaw. Since these methods are specifically intended to denoise PPI networks, a desirable property would be insensitivity to the erroneous information in the networks. However, the previously proposed approaches mainly seek to preserve the noisy topological information of the PPI networks in the embedding space. For example, the parameters of RIGM are learned by fitting the model to the connectivity matrix using maximum a posteriori (MAP) estimation, which is problematic because the inherently noisy nature of the input PPI information is not fully taken into account and could contaminate the learning process. On the other hand, both MDS-GEO and ISOMAP-FSWeight rely on shortest paths on the graph to compute the distance measure, which could be significantly perturbed by spurious links in the network [18, 39].

In this paper we propose a novel approach called Leave-One-Out Logistic Metric Embedding (LOO-LME) for assessing the reliability of interactions. The learning process of LOO-LME can be divided into two parts. In the first part, we propose the logistic metric embedding (LME) method to map the nodes of a PPI network into a low-dimensional space while preserving its topological information. As shown by our experimental evaluation, by adaptively learning a metric embedding using maximum likelihood estimation (MLE), PPI networks can be mapped well into a low-dimensional metric space using LME. However, LME is not robust to noise and is prone to over-fitting, i.e., it has the same flaw as previous latent feature models for PPI assessment. In the second part of LOO-LME, we transform LME into an equivalent factorized discriminant formulation. Based on this relationship, we propose a leave-one-out-style (LOO) approach to deal with the noise in PPI networks. The experimental results show that our combined approach substantially outperforms previous methods on PPI assessment problems.

The remainder of this paper is organized as follows. Section II outlines the methodologies used in this paper. A variety of experimental results are presented in Section III. Finally, we provide some concluding remarks in Section IV.

II. THE PROPOSED METHOD

A. Logistic Metric Embedding (LME)

A PPI network can be naturally modeled as a neighborhood graph $G = (V, E)$, where the set of vertices $V = \{v_1, v_2, \ldots, v_n\}$ are the proteins, and the set of edges $E = \{e_{ij}\}$ indicates interaction relationships between the proteins. The main idea of LME is to learn a mapping $g: v_i \mapsto F(v_i) \in \mathbb{R}^{1 \times d}$ according to the geometric assumption of PPI networks, i.e., we would like the Euclidean distance between node pairs that are known to interact to be smaller than the distances corresponding to non-interacting pairs. Similar to prior approaches for social network analysis [26, 40], the probability of an interaction between two protein nodes is defined as a logistic function of their distance in the embedding space. Formally, we have:

$$p_{\text{interact}}(v_i, v_j) = p\left(e_{ij} \in E \mid F(V), b\right) = s\left(b - \|F(v_i) - F(v_j)\|_2^2\right) \tag{1}$$

where $s(z) = (1 + \exp(-z))^{-1}$ is the sigmoid function and $b$ is a bias term, and

$$p_{\text{non-interact}}(v_i, v_j) = 1 - p_{\text{interact}}(v_i, v_j) = s\left(\|F(v_i) - F(v_j)\|_2^2 - b\right) \tag{2}$$
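As a minimal sketch of the link probabilities in (1) and (2), the following Python snippet evaluates both quantities for a toy two-dimensional embedding. The function names, the embedding vectors and the bias value are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic function s(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def squared_distance(x, y):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - c) ** 2 for a, c in zip(x, y))


def p_interact(f_i, f_j, b):
    """Eq. (1): probability that two embedded proteins interact."""
    return sigmoid(b - squared_distance(f_i, f_j))


def p_non_interact(f_i, f_j, b):
    """Eq. (2): probability that two embedded proteins do not interact."""
    return sigmoid(squared_distance(f_i, f_j) - b)


# Hypothetical embedding vectors and bias, chosen only for illustration.
f1, f2, f3 = (0.0, 0.0), (0.5, 0.0), (3.0, 4.0)
b = 1.0

print(p_interact(f1, f2, b))  # nearby pair: probability above 0.5
print(p_interact(f1, f3, b))  # distant pair: probability close to 0
```

The bias $b$ acts as a squared-distance threshold in this sketch: pairs closer than $\sqrt{b}$ receive an interaction probability above 0.5, and pairs farther apart receive one below 0.5.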

The second equality of (2) holds because $s(z) + s(-z) = 1$. It is easy to verify that two nodes have a higher probability of interacting if they have similar latent representations, which matches the geometric assumption of PPI networks. We use maximum likelihood estimation (MLE) to optimize the parameters of our model, i.e., in LME we maximize the log-likelihood function:

$$\begin{aligned} L(F(V), b) &= \log\left(\prod_{e_{ij} \in E,\, i<j} p_{\text{interact}}(v_i, v_j) \prod_{e_{ij} \notin E,\, i<j} p_{\text{non-interact}}(v_i, v_j)\right) \\ &= \sum_{e_{ij} \in E,\, i<j} \log s\left(b - \|F(v_i) - F(v_j)\|_2^2\right) + \sum_{e_{ij} \notin E,\, i<j} \log s\left(\|F(v_i) - F(v_j)\|_2^2 - b\right) \end{aligned} \tag{3}$$

B. Leave-One-Out (LOO) Scoring

Once the nodes have been embedded into a low-dimensional metric space using LME, we can use this embedding to assign a suitable reliability score (RS) to each protein interaction $(v_i, v_j)$, $i < j$, in the PPI network. Since in LME we assume that the probability that two proteins interact is a monotonically decreasing function of their distance in the embedding space, we can simply compute the RS between $v_i$ and $v_j$ as:

$$RS_{\text{LME}}(v_i, v_j) = -\|F_{\text{opt}}(v_i) - F_{\text{opt}}(v_j)\|_2^2 \tag{4}$$

where

$$\left(F_{\text{opt}}(V), b_{\text{opt}}\right) = \left(\begin{bmatrix} F_{\text{opt}}(v_1) \\ \vdots \\ F_{\text{opt}}(v_n) \end{bmatrix}, b_{\text{opt}}\right) = \arg\max_{F(V),\, b} L(F(V), b) \tag{5}$$

However, because the noise levels inherent in current PPI networks are usually very high, our concern is that LME may not be robust to noise and may over-fit, which would make $RS_{\text{LME}}$ highly unreliable. An equivalent form of (3) gives more insight into this issue. First, we define a matrix $E^{ij} \in \mathbb{R}^{n \times n}$ for every protein pair $(v_i, v_j)$, $i < j$, such that it has only four non-zero entries: $E^{ij}_{ij} = E^{ij}_{ji} = -1$ and $E^{ij}_{ii} = E^{ij}_{jj} = 1$. Then we have

$$\|F(v_i) - F(v_j)\|_2^2 = \left\langle E^{ij}, F(V) F(V)^T \right\rangle \tag{6}$$

where the operation $\langle \cdot, \cdot \rangle$ is the inner product defined on the positive semidefinite cone: for any $A, B \succeq 0$, $\langle A, B \rangle = \operatorname{trace}(AB)$ [41]. Then $L(F(V), b)$ can be rewritten as

$$L(F(V), b) = \sum_{i<j} \log s\left(y_{ij}\left(b - \left\langle F(V) F(V)^T, E^{ij} \right\rangle\right)\right) \tag{7}$$

where $y_{ij} = y_{ji} = 1$ if $e_{ij} \in E$ and $-1$ otherwise. It is easy to see that the model in (7) has the form of a standard binary logistic discriminant model, and LME is equivalent to a two-stage procedure: first, every protein pair $(v_i, v_j)$ is mapped to $E^{ij}$ in the space of symmetric matrices, with $y_{ij}$ assigned as its label; then a logistic model is trained on the set $\{(E^{ij}, y_{ij}),\ 1 \le i \le n,\ i < j \le n\}$, with the additional constraint that the regression parameters should be factored as $F(V) F(V)^T$. This process is illustrated in Fig. 1.

Fig. 1. Alternative interpretation of LME: in the first stage, the protein pairs are mapped into the space of symmetric matrices; then a specially constrained logistic discriminant model is learned in the new space such that it can separate the links (blue edges) from the non-links (red edges).

Accordingly, false PPI links are equivalent to training samples with incorrectly assigned labels. Inspired by this insight, we propose a leave-one-out-style approach for identifying them: we exclude $(E^{ij}, y_{ij})$ from the estimation of $F(V)$ when evaluating the reliability of $(v_i, v_j)$. That is, we define false links to be those which have low probability under the cross-validated predictive distribution. This idea is formulated as

$$RS_{\text{LOO-LME}}(v_i, v_j) = -\left\|F_{\text{opt}}^{\sim(i,j)}(v_i) - F_{\text{opt}}^{\sim(i,j)}(v_j)\right\|_2^2 \tag{8}$$

where

$$\left(F_{\text{opt}}^{\sim(i,j)}(V), b_{\text{opt}}^{\sim(i,j)}\right) = \left(\begin{bmatrix} F_{\text{opt}}^{\sim(i,j)}(v_1) \\ \vdots \\ F_{\text{opt}}^{\sim(i,j)}(v_n) \end{bmatrix}, b_{\text{opt}}^{\sim(i,j)}\right) = \arg\max_{F(V),\, b} \sum_{k<l,\ (k,l) \ne (i,j)} \log s\left(y_{kl}\left(b - \left\langle F(V) F(V)^T, E^{kl} \right\rangle\right)\right) \tag{9}$$

This process is illustrated in Fig. 2. Intuitively, since in this way the assessment of a specific link is not affected by its possibly wrong label, the risk of over-fitting can be reduced. Although the approach described above might work in principle, a brute-force implementation requires solving $m$ independent large-scale graph embedding problems, one for each assessed link, which does not seem to allow for an efficient solution. We now propose a relaxation of (9) that is computationally tractable.

Fig. 2. Leave-one-out-style scoring procedure: $E^{24}$ is excluded from the estimation of the embedding when one assesses the reliability of the link $(v_2, v_4)$.

Assuming that the exclusion of $(E^{ij}, y_{ij})$ from the estimation of $F(V)$ would only change the embedding positions of $v_i$ and $v_j$, i.e., $b_{\text{opt}}^{\sim(i,j)} = b_{\text{opt}}$ and $F_{\text{opt}}^{\sim(i,j)}(v_k) = F_{\text{opt}}(v_k)$ for $k \ne i, j$, we can first obtain $F_{\text{opt}}(V)$ by maximizing (7). Then, for every link $(v_i, v_j)$, we only need to update $F_{\text{opt}}^{\sim(i,j)}(v_i)$ and $F_{\text{opt}}^{\sim(i,j)}(v_j)$ with the rest of the model parameters fixed as constants, and (9) is simplified as:

$$\left(F_{\text{opt}}^{\sim(i,j)}(v_i), F_{\text{opt}}^{\sim(i,j)}(v_j)\right) = \arg\max_{F(v_i),\, F(v_j)} \sum_{k \ne i, j} \left[\log s\left(y_{ki}\left(b_{\text{opt}} - \|F(v_i) - F_{\text{opt}}(v_k)\|_2^2\right)\right) + \log s\left(y_{kj}\left(b_{\text{opt}} - \|F(v_j) - F_{\text{opt}}(v_k)\|_2^2\right)\right)\right] \tag{10}$$

We see that the objective function in (10) is separable in the variables $F_{\text{opt}}^{\sim(i,j)}(v_i)$ and $F_{\text{opt}}^{\sim(i,j)}(v_j)$; it can thus be further decomposed into two smaller-scale problems:

$$\begin{aligned} F_{\text{opt}}^{\sim(i,j)}(v_i) &= \arg\max_{F(v_i)} L_{ij}(F(v_i)) = \arg\max_{F(v_i)} \sum_{k \ne i,\, k \ne j} \log s\left(y_{ki}\left(b_{\text{opt}} - \|F(v_i) - F_{\text{opt}}(v_k)\|_2^2\right)\right) \\ F_{\text{opt}}^{\sim(i,j)}(v_j) &= \arg\max_{F(v_j)} L_{ji}(F(v_j)) = \arg\max_{F(v_j)} \sum_{k \ne i,\, k \ne j} \log s\left(y_{kj}\left(b_{\text{opt}} - \|F(v_j) - F_{\text{opt}}(v_k)\|_2^2\right)\right) \end{aligned}$$

C. The Final Algorithm and Implementation Details

The gradient of the objective function $L$ defined above is

$$\begin{aligned} \nabla_{F(v_i)} L &= -\sum_{j \ne i} \frac{2 y_{ij} \exp\left(y_{ij}\left(\|F(v_i) - F(v_j)\|_2^2 - b\right)\right)}{1 + \exp\left(y_{ij}\left(\|F(v_i) - F(v_j)\|_2^2 - b\right)\right)} \left(F(v_i) - F(v_j)\right) \\ \nabla_{b} L &= \sum_{i<j} \frac{y_{ij} \exp\left(y_{ij}\left(\|F(v_i) - F(v_j)\|_2^2 - b\right)\right)}{1 + \exp\left(y_{ij}\left(\|F(v_i) - F(v_j)\|_2^2 - b\right)\right)} \end{aligned} \tag{11}$$

and the gradients of $L_{ij}$ and $L_{ji}$ are

$$\nabla_{F(v_i)} L_{ij} = -\sum_{k \ne i, j} \frac{2 y_{ik} \exp\left(y_{ik}\left(\|F(v_i) - F_{\text{opt}}(v_k)\|_2^2 - b_{\text{opt}}\right)\right)}{1 + \exp\left(y_{ik}\left(\|F(v_i) - F_{\text{opt}}(v_k)\|_2^2 - b_{\text{opt}}\right)\right)} \left(F(v_i) - F_{\text{opt}}(v_k)\right) \tag{12}$$

$$\nabla_{F(v_j)} L_{ji} = -\sum_{k \ne i, j} \frac{2 y_{jk} \exp\left(y_{jk}\left(\|F(v_j) - F_{\text{opt}}(v_k)\|_2^2 - b_{\text{opt}}\right)\right)}{1 + \exp\left(y_{jk}\left(\|F(v_j) - F_{\text{opt}}(v_k)\|_2^2 - b_{\text{opt}}\right)\right)} \left(F(v_j) - F_{\text{opt}}(v_k)\right) \tag{13}$$

All of the optimization problems involved in our method are unconstrained, and we solve them using the standard gradient ascent method, which terminates when an optimization step changes the objective function value by less than $1 \times 10^{-6}$. It is also worth noting that the per-link updates are independent of each other and can therefore be done in parallel.

III. EXPERIMENTS

In this section, we present experiments that evaluate the performance of our method on the PPI networks described below. All the tests were implemented in Matlab R2013b and run on a PC with an Intel® Core™ i7-3930K 6-core CPU and 16 GB of RAM.

A. Data Sources and Evaluation Metric

Three PPI datasets of S. cerevisiae derived from various scales and high-throughput technologies are used in our experiments: Tong [42], DIP [43], and Krogan [2]. For each PPI network, we remove self-links and redundant links from the dataset, and only the largest connected component is used for experimental evaluation. The characteristics of the PPI networks are summarized in Table I. We first use the receiver operating characteristic (ROC) curve as a metric to measure the success of embedding a PPI network into a low-dimensional metric space using our approach; we then evaluate its ability to identify false links in real PPI networks using Gene Ontology (GO) annotations [44].

TABLE I
STATISTICAL SUMMARY OF THE THREE PPI NETWORKS

PPI Dataset    Number of proteins    Number of interactions    Data source
Tong           2171                  7622                      [42]
DIP            4875                  17173                     [43]
Krogan         3645                  12934                     [2]

B. ROC Curve

Similar to the embedding stage of both ISOMAP-FSWeight and MDS-GEO, LME can be used to extract a feature representation for each protein according to the geometric assumption. In this subsection, the embedding quality of these three methods is compared using ROC curve analysis [27, 36]. The ROC curves are constructed by varying the distance cutoff $r$ from 0 to the maximum distance between any two embedding vectors of proteins. For each value of the distance cutoff $r$, we predict that only those protein pairs with embedding distance smaller than $r$ are truly interacting; the four possible predictive outcomes are listed in Table II.

TABLE II
CONTINGENCY TABLE

                                   $e_{ij} \in E$               $e_{ij} \notin E$
$\|F(v_i) - F(v_j)\|_2 < r$        True Positive Prediction     False Positive Prediction
$\|F(v_i) - F(v_j)\|_2 \ge r$      False Negative Prediction    True Negative Prediction

The ROC curve then plots the trade-off between the true positive rate and the false positive rate, with $r$ as the varying parameter [45]. From Fig. 3, we can see that the embedding accuracy of LME is significantly better than that of the other two methods. For example, the sensitivity and specificity of LME reach 96% and 97%, respectively, when the PPI network is embedded into an 8-dimensional space, while MDS-GEO only reaches 85% and 80%. A commonly used assessment metric for ROC curves is the area under the ROC curve (AUC) [36]. In Fig. 3, the AUC values are also plotted, with the embedding dimension as the varying parameter. We can see that the AUC value achieved by LME is consistently the best. Therefore, the PPI network is well modeled by a low-dimensional metric space using LME.
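The ROC construction described in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `roc_points` and `auc`, the toy embedding and the toy edge set are hypothetical, and the sketch assumes that the network contains at least one link and at least one non-link. Pairs are predicted as interacting when their embedding distance falls below the cutoff $r$, and the cutoff is swept over all observed distances.

```python
import itertools
import math


def roc_points(embedding, edges):
    """Sweep the distance cutoff r and return the (FPR, TPR) points of the ROC curve."""
    pairs = itertools.combinations(sorted(embedding), 2)
    # Each entry holds (embedding distance, True if the pair is a known link).
    scored = sorted(
        (math.dist(embedding[u], embedding[v]), (u, v) in edges) for u, v in pairs
    )
    n_pos = sum(1 for _, is_link in scored if is_link)
    n_neg = len(scored) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Raising r past each distinct distance turns the pairs at that distance
    # into positive predictions (true or false positives).
    for _, group in itertools.groupby(scored, key=lambda item: item[0]):
        for _, is_link in group:
            if is_link:
                tp += 1
            else:
                fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy network: two tight pairs of proteins that are far from each other.
embedding = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0), "d": (6.0, 0.0)}
edges = {("a", "b"), ("c", "d")}

points = roc_points(embedding, edges)
print(auc(points))
```

In this toy case every linked pair is closer than every non-linked pair, so the curve passes through the top-left corner; a noisier embedding would yield a lower AUC.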

Fig. 3. ROC curve analysis. Panels (a), (b) and (c) compare the ability to recover the Ho network using LME, MDS-GEO and ISOMAP-FSWeight, respectively, with embedding space dimensions of 2 to 8 (horizontal axis: 1-specificity, i.e., false positive rate; vertical axis: sensitivity, i.e., true positive rate). Panel (d) shows the area under the curve (AUC) comparison measuring the ability to recover the Tong network using embedding space dimensions of 1 to 20 (horizontal axis: number of dimensions; vertical axis: AUC).

C. GO Annotation

Although LME can accurately preserve the topological structure of PPI networks, the input PPI networks are usually very noisy, as mentioned above; therefore, this advantage by itself does not necessarily imply good performance in filtering unreliable PPIs. As in previous works [12, 27, 46], in this section we use Gene Ontology-based functional annotations [44] to evaluate the usefulness of LOO-LME for pruning false positive PPIs. Five state-of-the-art approaches for this task are considered as comparison baselines: RWS [12], RIGM [33], CD-Dist [23], ISOMAP-FSWeight [27] and MDS-GEO [47]. In order to demonstrate the effectiveness of the "LOO" stage of LOO-LME, both LME and LOO-LME are tested here for comparison.

According to the widely accepted 'guilt-by-association' (GBA) rule [48], a protein's properties are at least partially determined by its interacting partners in the PPI network, and protein pairs that interact tend to share similar properties such as cellular location, expression patterns and functional relationships [49, 50]. Alternatively, they could participate in the same process to produce certain phenotypes [49]. Therefore, it is commonly expected that "true positive" interacting protein pairs should have a statistically higher level of functional similarity [51].

Gene Ontology (GO) annotations [44] have been widely accepted as a standard data source for functional annotation information. GO is divided into three ontologies, cellular component (CC), biological process (BP) and molecular function (MF), each of which is structured as a directed acyclic graph (DAG). Several measures have been previously proposed for computing the functional similarity between proteins based on GO [52-56], and currently there is no clear consensus regarding which measure should be preferred overall [52]. In this paper, we use three commonly used GO-based measures proposed by Resnik [54], Lin [55], and Jiang and Conrath [56], together with all three parts of GO, to fully test the performance of various methods for increasing the reliability of PPI networks. The functional measures are further refined using the approach described in [52]. For each tested method, we rank the protein interactions by their scores from highest to lowest, and measure the average functional similarity of the top-ranked PPIs according to the three measures in the three GO domains. The experimental results on the three datasets are shown in Figs. 4-6.

Fig. 4. Performance comparison of various methods (LOO-LME, LME, RIGM, RWS, ISOMAP-FSWeight, MDS-GEO and CD-Dist) on the DIP network using GO annotations. The subplots cover the BP, CC and MF domains under the Resnik, Jiang and Lin measures. In each subplot, the vertical axis is the average score and the horizontal axis is the ratio of top-ranked PPIs.
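The evaluation protocol used in this subsection (rank the interactions by reliability score, keep a top-ranked fraction, and average their functional similarity) can be sketched as follows. The function name, the toy reliability scores and the similarity values are hypothetical placeholders; in the actual experiments the similarities would come from a GO-based measure such as Resnik's, which is not reimplemented here.

```python
def average_similarity_of_top(scores, similarity, ratio):
    """Mean functional similarity of the top `ratio` fraction of links, ranked by score."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Higher reliability score means a more trusted interaction.
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(ratio * len(ranked)))
    top = ranked[:keep]
    return sum(similarity[link] for link in top) / len(top)


# Hypothetical reliability scores (negative squared embedding distances).
scores = {
    ("a", "b"): -0.2,
    ("a", "c"): -3.0,
    ("b", "c"): -0.5,
    ("c", "d"): -4.0,
}
# Hypothetical functional similarity values for the same links.
similarity = {
    ("a", "b"): 6.0,
    ("a", "c"): 1.0,
    ("b", "c"): 5.0,
    ("c", "d"): 2.0,
}

print(average_similarity_of_top(scores, similarity, 0.5))  # top half of the links
print(average_similarity_of_top(scores, similarity, 1.0))  # all links
```

A scoring method that ranks reliable links first should give a higher average for small ratios than for the full network, which is the trend that each subplot of this kind of comparison displays.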

Fig. 5. Performance comparison of various methods (LOO-LME, LME, RIGM, RWS, ISOMAP-FSWeight, MDS-GEO and CD-Dist) on the Tong network using GO annotations. The subplots cover the BP, CC and MF domains under the Resnik, Jiang and Lin measures. In each subplot, the vertical axis is the average score and the horizontal axis is the ratio of top-ranked PPIs.

Fig. 6. Performance comparison of various methods (LOO-LME, LME, RIGM, RWS, ISOMAP-FSWeight, MDS-GEO and CD-Dist) on the Krogan network using GO annotations. The subplots cover the BP, CC and MF domains under the Resnik, Jiang and Lin measures. In each subplot, the vertical axis is the average score and the horizontal axis is the ratio of top-ranked PPIs.

As can be seen in Fig. 5, LOO-LME is the best at identifying protein pairs with high functional similarity in the Tong network, which can be considered to have a higher likelihood of physical interaction [33]. As more low-ranked interactions are removed, the average similarity of the remaining interactome increases at a faster rate than with the other methods. For example, according to Resnik's metric, the average BP similarity of the top 20% of PPIs ranked by LME could reach 7.4, while the corresponding performance of the best competing method (CD-Dist) is 6.3. For the Krogan and DIP networks, the conclusions are similar. On the whole, LOO-LME achieves the best performance compared with the other approaches for increasing the reliability of protein interactomes, which confirms the usefulness of our method.

IV. CONCLUSION

In this paper, we proposed a new geometric approach called Leave-One-Out Logistic Metric Embedding (LOO-LME) for assessing the reliability of interactions. Unlike previous approaches, which mainly seek to preserve the noisy topological information of the PPI networks in the embedding space, LOO-LME attempts to deal with the uncertainty in PPI networks directly using a leave-one-out-style approach. The experimental results show that LOO-LME substantially outperforms previous methods on PPI assessment problems. LOO-LME also substantially outperforms LME in the experimental evaluation, which suggests that it is promising to directly take into consideration the inherently noisy nature of the input PPI information during the modeling process. Our future work will focus on this aspect.

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

ACKNOWLEDGMENT This work was supported by the grants of the National Science Foundation of China, Nos. 61133010, 61373105, 61303111, 61411140249, 61402334, 61472282, 61472280, 61472173, 61373098 and 61272333, China Postdoctoral Science Foundation Grant, Nos. 2014M561513, and partly supported by the National High-Tech R&D Program (863) (2014AA021502 & 2015AA020101), and the grant from the Ph.D. Programs Foundation of Ministry of Education of China (No. 20120072110040), and the grant from the Outstanding Innovative Talent Program Foundation of Henan Province, No. 134200510025.

REFERENCES [1] A. C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, et al., "Proteome survey reveals modularity of the yeast cell machinery," Nature, vol. 440, pp. 631-636, 2006. [2] N. J. Krogan, G. Cagney, H. Y. Yu, G. Q. Zhong, X. H. Guo, A. Ignatchenko, et al., "Global landscape of protein complexes in the yeast saccharomyces cerevisiae," Nature, vol. 440, pp. 637-643, 2006. [3] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, et al., "Functional organization of the yeast proteome by systematic analysis of protein complexes," Nature, vol. 415, pp. 141-147, 2002. [4] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, "A comprehensive two-hybrid analysis to explore the yeast protein

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

6

interactome," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, pp. 4569-4574, 2001. Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, et al., "Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry," Nature, vol. 415, pp. 180-183, 2002. P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, et al., "A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae," Nature, vol. 403, pp. 623-627, 2000. L. Giot, J. S. Bader, C. Brouwer, A. Chaudhuri, B. Kuang, Y. Li, et al., "A protein interaction map of drosophila melanogaster," Science, vol. 302, pp. 1727-1736, 2003. J. De Las Rivas and C. Fontanillo, "Protein–Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome Networks," PLoS Comput Biol, vol. 6, p. e1000807, 2010. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, "A comprehensive two-hybrid analysis to explore the yeast protein interactome," Proceedings of the National Academy of Sciences, vol. 98, pp. 4569-4574, 2001. G. T. Hart, A. K. Ramani, and E. M. Marcotte, "How complete are current yeast and human protein-interaction networks," Genome Biol, vol. 7, p. 120, 2006. A. M. Edwards, B. Kus, R. Jansen, D. Greenbaum, J. Greenblatt, and M. Gerstein, "Bridging structural biology and genomics: assessing protein interaction data with known complexes," TRENDS in Genetics, vol. 18, pp. 529-536, 2002. C. Lei and J. Ruan, "A novel link prediction algorithm for reconstructing protein–protein interaction networks by topological similarity," Bioinformatics, vol. 29, pp. 355-364, 2013. M. Deng, K. Zhang, S. Mehta, T. Chen, and F. Sun, "Prediction of protein function using protein-protein interaction data," Journal of Computational Biology, vol. 10, pp. 947-960, 2003. G. Liu, J. Li, and L. 
Wong, "Assessing and predicting protein interactions using both local and global network topological metrics," Genome Informatics, vol. 22, pp. 138-149, 2008. R. Saito, H. Suzuki, and Y. Hayashizaki, "Interaction generality, a measurement to assess the reliability of a protein–protein interaction," Nucleic Acids Research, vol. 30, pp. 1163-1168, 2002. R. Saito, H. Suzuki, and Y. Hayashizaki, "Construction of reliable protein–protein interaction networks with a new interaction generality measure," Bioinformatics, vol. 19, pp. 756-763, 2003. F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, "Defining and identifying communities in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, pp. 2658-2663, 2004. Y. Fang, W. Benjamin, M. T. Sun, and K. Ramani, "Global geometric affinity for revealing high fidelity protein interaction network," Plos One, vol. 6, p. e19349, 2011. C. V. Cannistraci, G. Alanis-Lobato, and T. Ravasi, "Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding," Bioinformatics, vol. 29, pp. i199-i209, 2013. R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, et al., "Conserved patterns of protein interaction in multiple species," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, pp. 1974-1979, 2005. Z.-H. You, Y.-K. Lei, L. Zhu, J. Xia, and B. Wang, "Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis," BMC bioinformatics, vol. 14, p. S10, 2013. N. Natarajan and I. S. Dhillon, "Inductive matrix completion for predicting gene–disease associations," Bioinformatics, vol. 30, pp. i60-i68, 2014. C. Brun, F. Chevenet, D. Martin, J. Wojcik, A. Guenoche, and B. 
Jacq, "Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network," Genome Biology, vol. 5, pp. 6-6, 2003. H. N. Chua, W. K. Sung, and L. Wong, "Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions," Bioinformatics, vol. 22, pp. 1623-1630, 2006. J. Chen, W. Hsu, M. L. Lee, and S. K. Ng, "Discovering reliable protein interactions from high-throughput experimental data using network topology," Artificial Intelligence in Medicine, vol. 35, pp. 37-47, 2005. P. Sarkar, D. Chakrabarti, and A. W. Moore, "Theoretical justification of popular link prediction heuristics," in 23rd Conference on Learning Theory, COLT 2010, Haifa, Israel, 2010, pp. 295-307.



[27] Z. H. You, Y. K. Lei, J. Gui, D. S. Huang, and X. B. Zhou, "Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data," Bioinformatics, vol. 26, pp. 2744-2751, 2010.
[28] O. Kuchaiev, M. Rasajski, D. J. Higham, and N. Przulj, "Geometric de-noising of protein-protein interaction networks," PLoS Computational Biology, vol. 5, p. e1000454, 2009.
[29] L. Zhu, Z. H. You, and D. S. Huang, "Increasing the reliability of protein-protein interaction networks via non-convex semantic embedding," Neurocomputing, vol. 121, pp. 99-107, 2013.
[30] A. E. Raftery, X. Y. Niu, P. D. Hoff, and K. Y. Yeung, "Fast Inference for the Latent Space Network Model Using a Case-Control Approximate Likelihood," Journal of Computational and Graphical Statistics, vol. 21, pp. 901-919, 2012.
[31] Y. Hulovatyy, R. W. Solava, and T. Milenkovic, "Revealing Missing Parts of the Interactome via Link Prediction," PLoS ONE, vol. 9, 2014.
[32] F. Fouss, A. Pirotte, J. M. Renders, and M. Saerens, "Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation," IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 355-369, 2007.
[33] Y. Zhu, X.-F. Zhang, D.-Q. Dai, and M.-Y. Wu, "Identifying Spurious Interactions and Predicting Missing Interactions in the Protein-Protein Interaction Networks via a Generative Network Model," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, pp. 219-225, 2013.
[34] N. Przulj, D. G. Corneil, and I. Jurisica, "Modeling interactome: scale-free or geometric?," Bioinformatics, vol. 20, pp. 3508-3515, 2004.
[35] L. Zhu, Z.-H. You, D.-S. Huang, and B. Wang, "t-LSE: A Novel Robust Geometric Approach for Modeling Protein-Protein Interaction Networks," PLoS ONE, vol. 8, p. e58368, 2013.
[36] D. J. Higham, M. Rasajski, and N. Przulj, "Fitting a geometric graph to a protein-protein interaction network," Bioinformatics, vol. 24, pp. 1093-1099, 2008.
[37] T. Milenkovic, J. Lai, and N. Przulj, "Graphcrunch: a tool for large network analyses," BMC Bioinformatics, vol. 9, p. 70, 2008.
[38] J. B. Tenenbaum, V. De Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, 2000.
[39] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, "Inferring networks of diffusion and influence," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, p. 21, 2012.
[40] P. D. Hoff, A. E. Raftery, and M. S. Handcock, "Latent space approaches to social network analysis," Journal of the American Statistical Association, vol. 97, pp. 1090-1098, 2002.
[41] F. Alizadeh, "Interior point methods in semidefinite programming with applications to combinatorial optimization," SIAM Journal on Optimization, vol. 5, pp. 13-51, 1995.
[42] A. H. Y. Tong, B. Drees, G. Nardelli, G. D. Bader, B. Brannetti, L. Castagnoli, et al., "A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules," Science, vol. 295, pp. 321-324, 2002.
[43] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, "The database of interacting proteins: 2004 update," Nucleic Acids Research, vol. 32, pp. D449-D451, 2004.
[44] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, et al., "Gene Ontology: tool for the unification of biology," Nature Genetics, vol. 25, pp. 25-29, 2000.
[45] T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Beijing: Publishing House of Electronics Industry, 2004.
[46] H. N. Chua and L. Wong, "Increasing the reliability of protein interactomes," Drug Discovery Today, vol. 13, pp. 652-658, 2008.
[47] O. Kuchaiev, M. Rasajski, D. J. Higham, and N. Przulj, "Geometric De-noising of Protein-Protein Interaction Networks," PLoS Computational Biology, vol. 5, p. e1000454, 2009.
[48] S. Oliver, "Guilt-by-association goes global," Nature, vol. 403, pp. 601-603, 2000.
[49] R. Mani, R. P. St. Onge, J. L. Hartman, G. Giaever, and F. P. Roth, "Defining genetic interaction," Proceedings of the National Academy of Sciences, vol. 105, pp. 3461-3466, 2008.
[50] M. Dreze, A.-R. Carvunis, B. Charloteaux, M. Galli, S. J. Pevzner, M. Tasan, et al., "Evidence for network evolution in an Arabidopsis interactome map," Science, vol. 333, pp. 601-607, 2011.


[51] C. Pesquita, D. Faria, H. Bastos, A. E. Ferreira, A. O. Falcão, and F. M. Couto, "Metrics for GO based protein semantic similarity: a systematic evaluation," BMC Bioinformatics, vol. 9, p. S4, 2008.
[52] H. Yang, T. Nepusz, and A. Paccanaro, "Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty," Bioinformatics, vol. 28, pp. 1383-1389, 2012.
[53] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen, "A new method to measure the semantic similarity of GO terms," Bioinformatics, vol. 23, pp. 1274-1281, 2007.
[54] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research, vol. 11, pp. 95-130, 1999.
[55] D. Lin, "An information-theoretic definition of similarity," in Proceedings of Machine Learning (ICML-98), 24-27 July 1998, San Francisco, CA, USA, 1998, pp. 296-304.
[56] J. Jiang and D. Conrath, "Multi-word complex concept retrieval via lexical semantic similarity," in Proceedings 1999 International Conference on Information Intelligence and Systems, 31 Oct.-3 Nov. 1999, Los Alamitos, CA, USA, 1999, pp. 407-414.

Lin Zhu obtained his Ph.D. degree in Pattern Recognition and Intelligent System from the University of Science and Technology of China (USTC), Hefei, China, in 2013. He is now a postdoc in the College of Electronics and Information Engineering, Tongji University, China. His research interests include latent feature learning, dimensionality reduction, and large-scale learning.

Su-Ping Deng received the Ph.D. degree from Tongji University, China, in 2012. Currently, Dr. Deng is a postdoc in the College of Electronics and Information Engineering, Tongji University, China. She is mainly interested in computational biology and bioinformatics.

De-Shuang Huang received the Ph.D. degree in electronic engineering from Xidian University, Xi'an, China, in 1993. In 2000, he joined the Institute of Intelligent Machines, Chinese Academy of Sciences, as a recipient of the “Hundred Talents Program of CAS”. In September 2011, he joined Tongji University as a Chaired Professor. At present, he is the director of the Institute of Machine Learning and Systems Biology, Tongji University. Dr. Huang is currently a Fellow of the International Association for Pattern Recognition (IAPR) and a senior member of the IEEE and the International Neural Network Society. He has published over 170 journal papers. In 1996, he published a book entitled “Systematic Theory of Neural Networks for Pattern Recognition” (in Chinese), which won the Second-Class Prize of the 8th Excellent High Technology Books of China, and in 2001 and 2009 he published two further books, entitled “Intelligent Signal Processing Technique for High Resolution Radars” (in Chinese) and “The Study of Data Mining Methods for Gene Expression Profiles” (in Chinese), respectively. His current research interests include bioinformatics, pattern recognition, and neural networks.


