Statistical Approaches for the Construction and Interpretation of Human Protein-Protein Interaction Network.

Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 5313050, 7 pages http://dx.doi.org/10.1155/2016/5313050

Research Article

Yang Hu,1 Ying Zhang,2 Jun Ren,1 Yadong Wang,3 Zhenzhen Wang,4 and Jun Zhang2

1 School of Life Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2 Department of Pharmacy, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150088, China
3 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
4 College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China

Correspondence should be addressed to Yadong Wang; [email protected], Zhenzhen Wang; [email protected], and Jun Zhang; [email protected]

Received 3 June 2016; Revised 23 July 2016; Accepted 1 August 2016

Academic Editor: Qin Ma

Copyright © 2016 Yang Hu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The overall goal is to establish a reliable human protein-protein interaction (PPI) network and to develop computational tools that characterize the network and the role of individual proteins in the context of the network topology and their expression status. A novel feature of our approach is that we assign a confidence measure to each derived interacting pair and account for this confidence in our network analysis. We integrated experimental data to infer the human PPI network. Our model treats the true interaction status (yes versus no) of any given pair of human proteins as a latent variable whose value is not observed. The experimental data are manifestations of the interaction status and provide evidence as to the likelihood of the interaction. The confidence of an interaction depends on the strength and consistency of the evidence.

1. Introduction

Individual proteins rarely perform their biological functions alone; they act in biological processes by interacting with other proteins [1]. An interaction between two proteins usually means either that they cooperate to perform a biological function or that they are in direct physical contact [2]. Most of the important molecular processes in the cell, such as DNA replication, are carried out by large protein complexes, and these complexes are formed through interactions between proteins. The study of PPIs is therefore considered a central problem in the proteomics of living cells. Because interactions between proteins are dynamic, the impact of the surrounding environment should also be taken into account. The study of the human PPI network can not only enhance the understanding of disease but also provide a theoretical foundation for finding new treatments.

With the continuous development of high-throughput experimental technology, a growing number of interactions between human proteins have been confirmed by a variety of experimental methods, and many kinds of biological interaction networks have been investigated [3–7]. However, current high-throughput experimental techniques suffer from high error rates: different experimental methods may yield different results, and even different research groups using the same experimental method cannot guarantee exactly the same result. It is therefore urgent to integrate data from different biological experiments, and even from different species, to construct a highly credible PPI network.

In this paper, a Bayesian hierarchical model of the human PPI network was constructed from a variety of sources of protein interaction data. A Monte Carlo expectation maximization algorithm was used to estimate the parameters of the model. The confidence of each protein interaction was then calculated from the Bayesian model, which yielded a high-confidence human PPI network. Thereafter, the role of intrinsically disordered proteins (IDPs) in the high-confidence PPI network was investigated. First, functional modules were obtained by clustering the high-confidence PPI network based on its topology. We then identified the functional modules that were significantly associated with IDPs, analysed the effect of IDPs in these modules, and searched for associations between these modules and diseases.

Figure 1: Overall scheme to construct the human protein-protein interaction network. The interaction status of a given pair of human proteins and of their homologs in other organisms is unobserved (dashed boxes), whereas the experimental data and genomic features are observed evidence (solid boxes). Solid arrows represent model hierarchy and dashed arrows represent inference steps. The boxes of the scheme are: experimental data of human (Y2H, MPC, literature); interaction status of human protein pairs (yes versus no); interaction status of the corresponding homolog pairs in other organisms (yes versus no); experimental data of other organisms (Y2H, MPC, literature); and genomic features (GO, coexpression, ...).

Table 1: Data sets or databases used to construct the human protein-protein interaction network.

Method | Organism | Reference
Y2H | Human | Stelzl et al. [8]
Y2H | Human | Rual et al. [9]
MPC | Human | Ewing et al. [10]
Literature | Human | HPRD [11], http://www.hprd.org/
Y2H | Yeast | Ito et al. [12]
Y2H | Yeast | Uetz et al. [13]
MPC | Yeast | Gavin et al. [14]
MPC | Yeast | Ho et al. [15]
MPC | Yeast | Gavin et al. [16]
MPC | Yeast | Krogan et al. [17]
Literature | Multiple | IntAct [18], http://www.ebi.ac.uk/intact/
Literature | Multiple | MIPS [19], http://mips.gsf.de/proj/ppi/
Multiple | Multiple | DIP [20], http://dip.doe-mbi.ucla.edu

2. Materials and Methods

2.1. Data Collection. Table 1 lists the experimental data used for the construction of the human PPI network [8–20]. Note that the literature (text mining) sources represent most of the low-throughput experimental studies of individual protein-protein interactions. Because the result of one experiment may be recorded in multiple databases, we eliminate this type of redundancy. It should be emphasized that MPC experiments report their results as protein complexes rather than as pairwise protein-protein interactions. Since proteins located in the same complex might not interact with one another directly, we account for this factor in our model.

2.2. Statistical Modeling of Various Data Sources. The overall scheme of our approach is illustrated in Figure 1. We use an empirical Bayes approach to integrate the various sources of evidence. Let $Z_{ij}$ be a binary indicator such that $Z_{ij} = 1$ if human proteins $i$ and $j$ have a direct physical interaction and $Z_{ij} = 0$ otherwise. Hence, $Z_{ij}$ is the true interaction status, which is not observed. To infer $Z_{ij}$, we specify an individual model for each type of observed data and integrate the evidence to compute the probability that $Z_{ij} = 1$.

2.2.1. Human Y2H Data. Several mechanisms can lead to the expression of the reporter gene in a Y2H experiment, so an observed interaction does not necessarily reflect a true interaction. In our model, we consider the following mechanisms: (a) true interaction; (b) self-activation; and (c) unknown process. Let $Y_{ij}$ be a binary indicator such that $Y_{ij} = 1$ if proteins $i$ and $j$ are observed to interact in a Y2H experiment and $Y_{ij} = 0$ otherwise. Then $Y_{ij} = 1$ only if at least one of the three mechanisms above is functional. Let $X_i = 1$ if protein $i$ is a self-activating protein and $X_i = 0$ otherwise. We define

$$\alpha_I = \Pr[\text{(a) is functional} \mid Z_{ij} = 1], \tag{1}$$

$$\alpha_S = \Pr[\text{(b) is functional} \mid X_i + X_j > 0], \tag{2}$$

$$\alpha_U = \Pr[\text{(c) is functional}]. \tag{3}$$

Then we have

$$\Pr[Y_{ij} = 1 \mid Z, X] = 1 - (1 - \alpha_I)^{Z_{ij}} (1 - \alpha_S)^{X_i + X_j} (1 - \alpha_U). \tag{4}$$

In other words, treating the mechanisms as independent, a positive readout is absent only when every available mechanism fails.
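As an illustration of equation (4), the following minimal Python sketch evaluates the probability of a positive Y2H readout. The function name `y2h_positive_prob` and the variable names are hypothetical choices made here for illustration; the parameter values are those reported for the high-throughput Y2H model in Table 2.

```python
def y2h_positive_prob(z_ij: int, x_i: int, x_j: int,
                      alpha_i: float, alpha_s: float, alpha_u: float) -> float:
    """Pr[Y_ij = 1 | Z, X] as written in equation (4)."""
    # Probability that all three mechanisms stay silent at the same time.
    all_silent = ((1.0 - alpha_i) ** z_ij
                  * (1.0 - alpha_s) ** (x_i + x_j)
                  * (1.0 - alpha_u))
    return 1.0 - all_silent


# Values from the "High-throughput Y2H" column of Table 2.
ALPHA_I, ALPHA_S, ALPHA_U = 0.658, 0.426, 4.5e-3

p_background = y2h_positive_prob(0, 0, 0, ALPHA_I, ALPHA_S, ALPHA_U)
p_true_pair = y2h_positive_prob(1, 0, 0, ALPHA_I, ALPHA_S, ALPHA_U)
print(f"non-interacting pair, no self-activator: {p_background:.4f}")
print(f"interacting pair, no self-activator:     {p_true_pair:.4f}")
```

With these values a non-interacting pair without self-activators is reported positive only through the unknown process (probability 0.0045), whereas a truly interacting pair is reported with probability of about 0.66.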

2.2.2. Human MPC Data. An MPC experiment reveals protein complexes rather than individual pairwise PPIs. We say that protein B is an $n$-step neighbour of protein A if the shortest path between A and B in the PPI network has length $n$. We conjecture that the bait will mostly fish out its 1-step neighbours, whereas 2-step neighbours and more distant proteins (at least three steps away) are only occasionally observed. Hence, we define the following parameters for the bait proteins:

$$\Pr[\text{1-step neighbour is observed}] = \psi_1, \qquad \Pr[\text{2-step neighbour is observed}] = \psi_2. \tag{5}$$

Let $C_k$ be the set of proteins in the complex corresponding to bait protein $k$. Denote by $n_k^{(1)}$ and $n_k^{(2)}$ the sets of 1-step and 2-step neighbours of bait protein $k$ under a given value of $Z$. Then the probability of observing $C_k$ can be written as

$$\Pr[C_k \mid Z] = \psi_1^{|n_k^{(1)} \cap C_k|} (1 - \psi_1)^{|n_k^{(1)} \setminus C_k|} \, \psi_2^{|n_k^{(2)} \cap C_k|} (1 - \psi_2)^{|n_k^{(2)} \setminus C_k|}, \tag{6}$$

where $|\cdot|$ is the function that maps a set to its size.

2.2.3. Literature Data on Human PPI. Let $H_{ij}$ be the reported interaction status of proteins $i$ and $j$. We account for the false positive rate ($\gamma_0$) and the false negative rate ($1 - \gamma_1$):

$$\Pr[H_{ij} = 1 \mid Z_{ij}] = \gamma_1^{Z_{ij}} \gamma_0^{1 - Z_{ij}}. \tag{7}$$

2.2.4. Data from Other Organisms. We also collect $(Y^*, C^*)$ from other organisms, with the corresponding unobserved variables denoted by $(Z^*, X^*)$. Models similar to those for $(Y, C, L)$ can be used for inference of $(Z^*, X^*)$. To connect $(Z^*, X^*)$ to $(Z, X)$, we consider the following models:

$$\Pr[Z^*_{i'j'} = 1 \mid Z_{ij}] = [\Delta_1(J_{ii',jj'}; \phi_1)]^{Z_{ij}} \, [\Delta_0(J_{ii',jj'}; \phi_0)]^{1 - Z_{ij}},$$

$$\Pr[X^*_{i'} = 1 \mid X_i] = [\Omega_1(I_{ii'}; \lambda_1)]^{X_i} \, [\Omega_0(I_{ii'}; \lambda_0)]^{1 - X_i}, \tag{8}$$

where $J_{ii',jj'}$ is the joint sequence identity between $i$ and $i'$ and between $j$ and $j'$, and $I_{ii'}$ is the sequence identity between $i$ and $i'$; $\Delta_1$, $\Delta_0$, $\Omega_1$, and $\Omega_0$ are functions of the joint or individual sequence identities with parameters $\phi_1$, $\phi_0$, $\lambda_1$, and $\lambda_0$, which can be modeled by a parametric structure.

Figure 2: The optimization of $Q_N$ and $Q_S$ for different $\varepsilon$. The red line and the green line correspond to $Q_N$ and $Q_S$, respectively. (Horizontal axis: $\varepsilon$, from 0.0 to 1.0; vertical axis: normalized score, from 0.0 to 1.0.)

2.3. Construction of Hierarchical Bayesian Model. So far we have introduced the distribution models for the experimental data and genomic features, conditional on the values of $Z$ and $X$. To complete the model, we also need to specify the distributions of $Z$ and $X$, which can be modeled with independent Bernoulli distributions:

$$\Pr(Z_{ij} = 1) = \rho, \qquad \Pr(X_i = 1) = r.$$
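To make the role of the prior concrete, the following Python sketch combines the prior $\rho$ with the Y2H model (4) and the literature model (7) to obtain the posterior probability of interaction for a single protein pair. It is a deliberate simplification and not the full hierarchical model: self-activation is ignored ($X_i = X_j = 0$), MPC and homolog data are left out, and the observations are assumed conditionally independent given $Z_{ij}$. The function name and the values chosen for $\gamma_1$ and $\gamma_0$ are hypothetical; $\rho$, $\alpha_I$, and $\alpha_U$ are taken from the high-throughput Y2H column of Table 2.

```python
from typing import Sequence


def posterior_interaction(y_obs: Sequence[int], h_obs: Sequence[int],
                          rho: float, alpha_i: float, alpha_u: float,
                          gamma1: float, gamma0: float) -> float:
    """Pr[Z_ij = 1 | Y2H replicates, literature reports] for one pair.

    Simplified setting: no self-activation, no MPC or homolog evidence.
    """
    def likelihood(z: int) -> float:
        p_y = 1.0 - (1.0 - alpha_i) ** z * (1.0 - alpha_u)  # eq. (4), X = 0
        p_h = gamma1 ** z * gamma0 ** (1 - z)               # eq. (7)
        value = 1.0
        for y in y_obs:
            value *= p_y if y else 1.0 - p_y
        for h in h_obs:
            value *= p_h if h else 1.0 - p_h
        return value

    joint_1 = rho * likelihood(1)          # prior times likelihood, Z = 1
    joint_0 = (1.0 - rho) * likelihood(0)  # prior times likelihood, Z = 0
    return joint_1 / (joint_1 + joint_0)


RHO, ALPHA_I, ALPHA_U = 6.8e-3, 0.658, 4.5e-3  # Table 2, Y2H column
GAMMA1, GAMMA0 = 0.8, 1e-3                     # hypothetical literature rates

p_y2h_only = posterior_interaction([1], [], RHO, ALPHA_I, ALPHA_U, GAMMA1, GAMMA0)
p_y2h_and_lit = posterior_interaction([1], [1], RHO, ALPHA_I, ALPHA_U, GAMMA1, GAMMA0)
print(f"one positive Y2H result:           {p_y2h_only:.3f}")
print(f"positive Y2H plus literature hit:  {p_y2h_and_lit:.3f}")
```

Under these assumptions a single positive Y2H result raises the probability of interaction from the prior of 0.0068 to close to one half, and an agreeing literature report raises it further, which illustrates how the confidence depends on the strength and consistency of the evidence.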

With the observed data and the unobserved variables, we can infer the posterior probability of $Z$ using the EM algorithm. Note that there are multiple organisms, and multiple data sets for some of the organisms; different parameters are used to account for the differences among the data. The complete-data likelihood of our model factorizes as in (9), and substituting the component models introduced above into each factor gives the expanded form (10):

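$$
\begin{aligned}
L_C(\Psi) &= f(H, Y, W, Z, X, L, Y^*, W^*, Z^*, X^* \mid \theta) \\
&= f(H \mid Z, \theta)\, f(Y \mid Z, X, \theta)\, f(W \mid Z, \theta)\, f(L \mid Z, \theta) \\
&\quad \cdot f(Y^* \mid Z^*, X^*, \theta)\, f(W^* \mid Z^*, \theta)\, f(Z^*, X^* \mid Z, X, \theta)\, f(Z, X \mid \theta) \\
&= \prod_{(i,j) \in S_H} f(H_{ij} \mid Z_{ij}, \theta) \cdot \prod_{(i,j) \in S_Y} \Big[ \prod_{t=1}^{r_{ij}} f(Y_{ij}^{(t)} \mid Z_{ij}, X_i, X_j, \theta) \Big] \\
&\quad \cdot \prod_{(i,j) \in S_M} \Big[ \prod_{t=1}^{e_{ij}} f(W_{ij}^{i(t)} \mid Z_{ij}, \theta) \prod_{t=1}^{e_{ji}} f(W_{ij}^{j(t)} \mid Z_{ij}, \theta) \Big] \prod_{(i,j) \in S_L} f(L_{ij} \mid Z_{ij}, \theta) \\
&\quad \cdot \prod_{(i,j) \in S_Y^*} \Big[ \prod_{t=1}^{r_{ij}^*} f(Y_{ij}^{(t)*} \mid Z_{ij}^*, X_i^*, X_j^*, \theta) \Big] \prod_{(i,j) \in S_M^*} \Big[ \prod_{t=1}^{e_{ij}^*} f(W_{ij}^{i(t)*} \mid Z_{ij}^*, \theta) \prod_{t=1}^{e_{ji}^*} f(W_{ij}^{j(t)*} \mid Z_{ij}^*, \theta) \Big] \\
&\quad \cdot \prod_{(i,j) \in S^*} f(Z_{ij}^* \mid Z_{ij}, \theta) \prod_{i \in S_Y^*} f(X_i^* \mid X_i, \theta) \prod_{(i,j) \in S} f(Z_{ij} \mid \theta) \prod_{i \in S_Y} f(X_i \mid \theta)
\end{aligned} \tag{9}
$$

$$
\begin{aligned}
&= \prod_{(i,j) \in S_H} \big(\gamma_1^{Z_{ij}} \gamma_0^{1 - Z_{ij}}\big)^{H_{ij}} \big(1 - \gamma_1^{Z_{ij}} \gamma_0^{1 - Z_{ij}}\big)^{1 - H_{ij}} \\
&\quad \cdot \prod_{(i,j) \in S_Y} \big[1 - (1 - \alpha_I)^{Z_{ij}} (1 - \alpha_S)^{X_i + X_j} (1 - \alpha_U)\big]^{Y_{ij}^{+}} \big[(1 - \alpha_I)^{Z_{ij}} (1 - \alpha_S)^{X_i + X_j} (1 - \alpha_U)\big]^{Y_{ij}^{\#}} \\
&\quad \cdot \prod_{(i,j) \in S_M} \big[\psi_1^{Z_{ij} W_{ij}^{+}} (1 - \psi_1)^{Z_{ij} W_{ij}^{\#}} \times \psi_2^{(1 - Z_{ij}) W_{ij}^{+}} (1 - \psi_2)^{(1 - Z_{ij}) W_{ij}^{\#}}\big] \\
&\quad \cdot \prod_{(i,j) \in S_L} \big(\beta_1^{Z_{ij}} \beta_0^{1 - Z_{ij}}\big)^{L_{ij}} \big(1 - \beta_1^{Z_{ij}} \beta_0^{1 - Z_{ij}}\big)^{1 - L_{ij}} \\
&\quad \cdot \prod_{(i,j) \in S_Y^*} \big[1 - (1 - \alpha_I)^{Z_{ij}^*} (1 - \alpha_S)^{X_i^* + X_j^*} (1 - \alpha_U)\big]^{Y_{ij}^{+*}} \big[(1 - \alpha_I)^{Z_{ij}^*} (1 - \alpha_S)^{X_i^* + X_j^*} (1 - \alpha_U)\big]^{Y_{ij}^{\#*}} \\
&\quad \cdot \prod_{(i,j) \in S_M^*} \big[\psi_1^{Z_{ij}^* W_{ij}^{+*}} (1 - \psi_1)^{Z_{ij}^* W_{ij}^{\#*}} \times \psi_2^{(1 - Z_{ij}^*) W_{ij}^{+*}} (1 - \psi_2)^{(1 - Z_{ij}^*) W_{ij}^{\#*}}\big] \\
&\quad \cdot \prod_{(i,j) \in S^*} \big(\phi_1^{Z_{ij}} \phi_0^{1 - Z_{ij}}\big)^{Z_{ij}^*} \big(1 - \phi_1^{Z_{ij}} \phi_0^{1 - Z_{ij}}\big)^{1 - Z_{ij}^*} \prod_{i \in S_Y^*} \big(\lambda_1^{X_i} \lambda_0^{1 - X_i}\big)^{X_i^*} \big(1 - \lambda_1^{X_i} \lambda_0^{1 - X_i}\big)^{1 - X_i^*} \\
&\quad \cdot \prod_{(i,j) \in S} \rho^{Z_{ij}} (1 - \rho)^{1 - Z_{ij}} \prod_{i \in S_Y} r^{X_i} (1 - r)^{1 - X_i},
\end{aligned} \tag{10}
$$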

where the parameter vector is $\theta = \{\rho, r, \alpha_I, \alpha_S, \alpha_U, \psi_1, \psi_2, \gamma_1, \gamma_0, \beta_1, \beta_0, \phi_1, \phi_0, \lambda_1, \lambda_0\}$.

2.4. Monte Carlo Expectation Maximization for Parameter Estimation. In this model, the true values of the latent variables and the model parameters cannot be estimated directly. To estimate them effectively, we used a Monte Carlo expectation maximization algorithm for parameter estimation from incomplete data, as illustrated in Algorithm 1. In the $E$-step of Algorithm 1, we use Gibbs sampling to draw $(Z, X, Z^*, X^*)$ in turn from $f(Z, X, Z^*, X^* \mid H, Y, W, L, Y^*, W^*, \hat{\theta}_0)$, and the sampling is repeated until estimates of the missing data are obtained. Then, in the $M$-step of Algorithm 1, the parameter vector $\theta = \{\gamma_1, \gamma_0, \alpha_I, \alpha_S, \alpha_U, \beta_1, \beta_0, \phi_1, \phi_0, \lambda_1, \lambda_0\}$ is estimated by greedy hill climbing. Finally, the iteration stops once diff $\le 0.01$, that is, when the loop condition diff $> 0.01$ of Algorithm 1 no longer holds.

3. Results

All protein names were mapped to Entrez IDs. This gave 32540 proteins, with 144603 interactions among them.

3.1. Construction of the Human PPI Network with Reliable Confidence Measure. Four models were established separately using the high-throughput Y2H experimental data, the high-throughput MPC experimental data, the human PPI data, and all the PPI data. The comparison among these four models is listed in Table 2. After estimating the parameter vector $\theta$ by Monte Carlo EM, we recalculated the posterior probability of $Z$, $\Pr[Z \mid H, Y, W, L, Y^*, W^*]$, from $\theta$ and the observed values $H, Y, W, L, Y^*, W^*$. For each protein pair, we considered the interaction reliable if $\Pr[Z_{ij} = 1 \mid H, Y, W, L, Y^*, W^*] > 0.8$. This gave 48361 PPIs with a reliable confidence measure among 23286 proteins.

3.2. Characterization of Network and Roles of IDPs Based on Network Topology. We analysed the role of IDPs in the human PPI network with reliable confidence measure. An IDP was defined as a protein with a continuous intrinsically disordered region longer than 40 amino acids. After prediction, 8735 IDPs were identified among the 23286 proteins. First, the human PPI network was cut into subnetworks, or modules, by SCAN, which obtains modules based on the similarity between common neighbours. We then used modularity and similarity-based modularity as metrics. Modularity is a statistical measure of the quality of a network clustering, defined as

$$Q_N = \sum_{s=1}^{N_C} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right], \tag{11}$$

where $N_C$ is the number of clusters, $L$ is the number of edges in the network, $l_s$ is the number of edges within the $s$th module, and $d_s$ is the sum of the degrees of all the nodes in the $s$th module. The best clustering can be obtained by optimizing $Q_N$. Similarity-based modularity supplements the modularity and is defined as

$$Q_S = \sum_{i=1}^{N_C} \left[ \frac{IS_i}{TS} - \left( \frac{DS_i}{2TS} \right)^2 \right]. \tag{12}$$
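The modularity $Q_N$ of equation (11) can be evaluated for any given partition with a few lines of Python. The sketch below is illustrative only: the function name `modularity`, the edge-list representation, and the toy graph are assumptions made here, and the network is taken to be undirected and unweighted. It does not implement SCAN or the similarity-based variant $Q_S$.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """Q_N of equation (11) for an undirected, unweighted edge list."""
    total_edges = len(edges)         # L
    inner = defaultdict(int)         # l_s: edges inside module s
    degree_sum = defaultdict(int)    # d_s: summed degree of module s
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            inner[module_of[u]] += 1
    return sum(inner[s] / total_edges
               - (degree_sum[s] / (2 * total_edges)) ** 2
               for s in degree_sum)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in two_modules}

print(f"two triangles as separate modules: {modularity(toy_edges, two_modules):.4f}")
print(f"everything in one module:          {modularity(toy_edges, one_module):.4f}")
```

Splitting the toy graph into its two triangles gives $Q_N = 2(3/7 - 1/4) \approx 0.357$, whereas placing all nodes in one module gives $Q_N = 0$, so the measure favours the partition that follows the community structure.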

    i = 0; initialize the parameters
    while (diff > 0.01) {
        // E-step
        j = 0
        while (j ≤ T) {
            Sample $X^{(j+1)}$ from $f(Z^{(j)}, X, Z^{*(j)}, X^{*(j)} \mid H, Y, W, L, Y^*, W^*, \hat{\theta}_i)$
            Sample $Z^{(j+1)}$ from $f(Z, X^{(j+1)}, Z^{*(j)}, X^{*(j)} \mid H, Y, W, L, Y^*, W^*, \hat{\theta}_i)$
            Sample $X^{*(j+1)}$ from $f(Z^{(j+1)}, X^{(j+1)}, Z^{*(j)}, X^* \mid H, Y, W, L, Y^*, W^*, \hat{\theta}_i)$
            Sample $Z^{*(j+1)}$ from $f(Z^{(j+1)}, X^{(j+1)}, Z^*, X^{*(j+1)} \mid H, Y, W, L, Y^*, W^*, \hat{\theta}_i)$
            j = j + 1
        }
        Calculate the Q function:
            $\hat{Q}(\theta \mid \theta^{(i)}, H, Y, W, L, Y^*, W^*) = \frac{1}{T} \sum_{m=1}^{T} \log L_c(\theta \mid H, Y, W, L, Y^*, W^*, Z^{(m)}, X^{(m)}, Z^{(m)*}, X^{(m)*})$
        // M-step
        $\hat{\rho}^{(i+1)} = \frac{1}{T} \sum_{m=1}^{T} \left( \frac{\sum_{(i,j) \in S} Z_{ij}^{(m)}}{|S|} \right)$
        $\hat{r}^{(i+1)} = \frac{1}{T} \sum_{m=1}^{T} \left( \frac{\sum_{i \in S_Y} X_i^{(m)}}{|S_Y|} \right)$
        $\hat{\psi}_1^{(i+1)} = \frac{1}{T} \sum_{m=1}^{T} \left( \frac{\sum_{(i,j) \in S_M} Z_{ij}^{(m)} \left( \sum_{k=1}^{e_{ij}} W_{ij}^{ik} + \sum_{k=1}^{e_{ji}} W_{ij}^{jk} \right) + \sum_{(i,j) \in S_M^*} Z_{ij}^{*(m)} \left( \sum_{k=1}^{e_{ij}^*} W_{ij}^{ik*} + \sum_{k=1}^{e_{ji}^*} W_{ij}^{jk*} \right)}{\sum_{(i,j) \in S_M} Z_{ij}^{(m)} (e_{ij} + e_{ji}) + \sum_{(i,j) \in S_M^*} Z_{ij}^{*(m)} (e_{ij}^* + e_{ji}^*)} \right)$
        $\hat{\psi}_2^{(i+1)} = \frac{1}{T} \sum_{m=1}^{T} \left( \frac{\sum_{(i,j) \in S_M} (1 - Z_{ij}^{(m)}) \left( \sum_{k=1}^{e_{ij}} W_{ij}^{ik} + \sum_{k=1}^{e_{ji}} W_{ij}^{jk} \right) + \sum_{(i,j) \in S_M^*} (1 - Z_{ij}^{*(m)}) \left( \sum_{k=1}^{e_{ij}^*} W_{ij}^{ik*} + \sum_{k=1}^{e_{ji}^*} W_{ij}^{jk*} \right)}{\sum_{(i,j) \in S_M} (1 - Z_{ij}^{(m)}) (e_{ij} + e_{ji}) + \sum_{(i,j) \in S_M^*} (1 - Z_{ij}^{*(m)}) (e_{ij}^* + e_{ji}^*)} \right)$
        // Greedy hill climbing for the remaining parameters
        k = 0; change_0 = 0.01
        $\theta_k = \theta^{(i)} = \{\alpha_I^{(i)}, \alpha_S^{(i)}, \alpha_U^{(i)}, \gamma_1^{(i)}, \gamma_0^{(i)}, \beta_1^{(i)}, \beta_0^{(i)}, \phi_1^{(i)}, \phi_0^{(i)}, \lambda_1^{(i)}, \lambda_0^{(i)}\}$
        while (1) {
            $\alpha_I^{k+1} = \arg\max Q(\alpha_I, \alpha_S^{k}, \alpha_U^{k})$
            $\alpha_S^{k+1} = \arg\max Q(\alpha_I^{k+1}, \alpha_S, \alpha_U^{k})$
            $\alpha_U^{k+1} = \arg\max Q(\alpha_I^{k+1}, \alpha_S^{k+1}, \alpha_U)$
            change_{k+1} = $\hat{Q}(\theta_{k+1}) - \hat{Q}(\theta_k)$
            if (abs(change_{k+1}) < abs(change_k / 20)) break
            k = k + 1
        }
        diff = $|\hat{Q}(\theta^{(i+1)}) - \hat{Q}(\theta^{(i)})| \, / \, \hat{Q}(\theta^{(i)})$
        i = i + 1
        T = T * 1.1
    }

Algorithm 1: Monte Carlo expectation maximization for parameter estimation.

As shown in Figure 2, the modularity decreased monotonically from a position near zero and could not be maximized, whereas the similarity-based modularity reached its maximum at the threshold $\varepsilon = 0.61$. With $\varepsilon = 0.61$, the reliable human PPI network was cut into 241 modules. At the significance level $\alpha = 0.05$, the $p$ value of each module was calculated by the formula below:

$$p\text{-value} = \sum_{i=m}^{n} \frac{\binom{M}{i} \binom{N - M}{n - i}}{\binom{N}{n}}, \tag{13}$$

where $N$ is the number of all the proteins and $M$ is the number of all the IDPs. Of the 241 modules, 33 were significantly associated with IDPs.
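The tail probability in equation (13) is straightforward to compute exactly with integer arithmetic. In the Python sketch below, $n$ is read as the number of proteins in the module and $m$ as the number of IDPs among them; this reading is an assumption made here, since the text defines only $N$ and $M$. The function name `idp_enrichment_pvalue` and the example module are hypothetical.

```python
from math import comb


def idp_enrichment_pvalue(N: int, M: int, n: int, m: int) -> float:
    """Upper tail of the hypergeometric distribution, as in equation (13).

    N: all proteins, M: all IDPs,
    n: proteins in the module (assumed), m: IDPs in the module (assumed).
    """
    upper = min(n, M)  # terms with i > M vanish because comb(M, i) == 0
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)


# Small check that can be verified by hand: 4 IDPs among 10 proteins,
# module of size 3 consisting only of IDPs.
print(idp_enrichment_pvalue(N=10, M=4, n=3, m=3))

# Network-scale call with the reported totals and a hypothetical module.
p_module = idp_enrichment_pvalue(N=23286, M=8735, n=50, m=30)
print(p_module < 0.05)
```

For the small example the value is $\binom{4}{3}/\binom{10}{3} = 4/120$. Python integers have arbitrary precision, so the binomial coefficients at network scale are evaluated exactly before the final division.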

Table 2: Comparison of parameters based on different data.

Parameters | High-throughput Y2H | High-throughput MPC | Human PPI data | All PPI data
$\rho$ | 6.8 × 10^-3 | 1.9 × 10^-3 | 6.1 × 10^-3 | 1.4 × 10^-2
$r$ | 7.7 × 10^-5 | - | 5.3 × 10^-5 | 8.9 × 10^-5
$\alpha_I$ | 0.658 | - | 0.543 | 0.933
$\alpha_S$ | 0.426 | - | 0.496 | 0.852
$\alpha_U$ | 4.5 × 10^-3 | - | 9.7 × 10^-4 | 0.007
$\psi_1$ | - | 0.738 | 0.755 | 0.809
$\psi_2$ | - | 0.623 | 0.764 | 0.788

Figure 3: A reliable subnetwork for the HeLa cell. Circles correspond to IDPs, and the shade of grey corresponds to the length of the intrinsically disordered region of the IDP. (In the figure, nodes are labelled with gene symbols and edges with the posterior probabilities of interaction.)

However, because the functional modules were obtained from the network topology alone, we also analysed the modules against known diseases. The overlap between the PPIs in the HeLa cell and a functional module highly related to IDPs is shown in Figure 3. The weight of each edge is the posterior probability of the true value of $Z$. When a node with more than 5 neighbours was defined as a hub node in this subnetwork, 69% of the hub nodes were IDPs. This supports the view that IDPs readily become hub nodes of the protein interaction network owing to the flexibility of their structure, and it reveals an important role of IDPs in the regulation of the cervical cancer HeLa cell.
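The hub analysis described above amounts to counting neighbours and checking membership in the IDP set. The Python sketch below shows one way to do this, under the assumption that the subnetwork is available as an undirected edge list; the function name `hub_idp_fraction` and the toy node names are hypothetical and do not correspond to proteins in Figure 3.

```python
def hub_idp_fraction(edges, idps, min_neighbours=5):
    """Fraction of hub nodes (more than min_neighbours neighbours) that are IDPs."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    hubs = [node for node, nbrs in neighbours.items()
            if len(nbrs) > min_neighbours]
    if not hubs:
        return 0.0
    return sum(node in idps for node in hubs) / len(hubs)


# Toy subnetwork: two star-shaped hubs with six neighbours each.
toy_edges = ([("hub1", f"p{k}") for k in range(6)]
             + [("hub2", f"q{k}") for k in range(6)])
toy_idps = {"hub1", "p0"}

print(hub_idp_fraction(toy_edges, toy_idps))
```

Neighbour sets are used instead of raw edge counts so that an interaction listed twice in the edge list does not inflate the degree of a node.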

4. Discussion

Our model is novel in the following respects. First, it integrates Y2H and MPC data in a cohesive and unified model that connects the two types of data through the unobserved true status of direct physical interaction, $Z$. Second, the model allows a natural calculation of the confidence of each interacting pair via the posterior probability. This is a critical measurement for downstream analysis and is accounted for there. To our knowledge, no previous study has considered uncertainty in PPI network analysis. The inference of the interaction probability involves a large number of latent variables, and their combinatorial effects make it impractical to compute the expectation of the missing variables analytically during the $E$-step. It is likely that the various data sets carry different amounts of information regarding the true interaction status. Hence, the inference can be made by appropriately weighting data of various types instead of treating them equally, which can be achieved by setting parameter constraints.

Competing Interests

The authors confirm that there is no conflict of interests related to the content of this article.

Authors’ Contributions

Yang Hu and Ying Zhang contributed equally to this work.

References

[1] C. Wu, F. Zhang, X. Li et al., “Composite functional module inference: detecting cooperation between transcriptional regulation and protein interaction by mantel test,” BMC Systems Biology, vol. 4, article 82, 2010.
[2] Y.-A. Huang, Z.-H. You, X. Chen, K. Chan, and X. Luo, “Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding,” BMC Bioinformatics, vol. 17, article 184, 2016.
[3] X. Chen, “Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA,” Scientific Reports, vol. 5, Article ID 13186, 2015.
[4] X. Zeng, X. Zhang, and Q. Zou, “Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 193–203, 2016.
[5] Q. Zou, J. Li, Q. Hong et al., “Prediction of microRNA-disease associations based on social network analysis methods,” BioMed Research International, vol. 2015, Article ID 810514, 9 pages, 2015.
[6] Q. Zou, J. Li, L. Song, X. Zeng, and G. Wang, “Similarity computation strategies in the microRNA-disease network: a survey,” Briefings in Functional Genomics, vol. 15, no. 1, pp. 55–64, 2016.
[7] F. Zhang, B. Gao, L. Xu et al., “Allele-specific behavior of molecular networks: understanding small-molecule drug response in yeast,” PLoS ONE, vol. 8, no. 1, Article ID e53581, 2013.
[8] U. Stelzl, U. Worm, M. Lalowski et al., “A human protein-protein interaction network: a resource for annotating the proteome,” Cell, vol. 122, no. 6, pp. 957–968, 2005.
[9] J.-F. Rual, K. Venkatesan, T. Hao et al., “Towards a proteome-scale map of the human protein-protein interaction network,” Nature, vol. 437, no. 7062, pp. 1173–1178, 2005.
[10] R. M. Ewing, P. Chu, F. Elisma et al., “Large-scale mapping of human protein-protein interactions by mass spectrometry,” Molecular Systems Biology, vol. 3, article 89, 2007.
[11] T. S. Keshava Prasad, R. Goel, K. Kandasamy et al., “Human Protein Reference Database—2009 update,” Nucleic Acids Research, vol. 37, supplement 1, pp. D767–D772, 2009.
[12] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 8, pp. 4569–4574, 2001.
[13] P. Uetz, L. Giot, G. Cagney et al., “A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae,” Nature, vol. 403, no. 6770, pp. 623–627, 2000.
[14] A.-C. Gavin, P. Aloy, P. Grandi et al., “Proteome survey reveals modularity of the yeast cell machinery,” Nature, vol. 440, no. 7084, pp. 631–636, 2006.
[15] Y. Ho, A. Gruhler, A. Heilbut et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.
[16] A.-C. Gavin, M. Bösche, R. Krause et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes,” Nature, vol. 415, no. 6868, pp. 141–147, 2002.
[17] N. J. Krogan, G. Cagney, H. Yu et al., “Global landscape of protein complexes in the yeast Saccharomyces cerevisiae,” Nature, vol. 440, no. 7084, pp. 637–643, 2006.
[18] S. Kerrien, Y. Alam-Faruque, B. Aranda et al., “IntAct—open source resource for molecular interaction data,” Nucleic Acids Research, vol. 35, no. 1, pp. D561–D565, 2007.
[19] H. W. Mewes, D. Frishman, U. Güldener et al., “MIPS: a database for genomes and protein sequences,” Nucleic Acids Research, vol. 30, no. 1, pp. 31–34, 2002.
[20] I. Xenarios, Ł. Salwínski, X. J. Duan, P. Higney, S.-M. Kim, and D. Eisenberg, “DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Research, vol. 30, no. 1, pp. 303–305, 2002.
