Briefings in Bioinformatics, 17(4), 2016, 686–695 doi: 10.1093/bib/bbv065 Advance Access Publication Date: 6 August 2015 Paper

Algorithms for modeling global and context-specific functional relationship networks

Fan Zhu, Bharat Panwar and Yuanfang Guan

Corresponding author: Yuanfang Guan, 2044 Palmer Commons, 100 Washtenaw Ave., Ann Arbor, MI 48109; E-mail: [email protected]

Abstract

Functional genomics has enormous potential to facilitate our understanding of normal and disease-specific physiology. In the past decade, intensive research efforts have focused on modeling functional relationship networks, which summarize the probability of gene co-functionality relationships. Such modeling can be based either on expression data alone or on heterogeneous data integration. Numerous methods have been deployed to infer functional relationship networks, but most of them target global (non-context-specific) networks. However, functional relationships are expected to be constantly reprogrammed across tissues and biological processes. Thus, advanced methods have been developed that target tissue-specific or developmental stage-specific networks. This article brings together the state-of-the-art functional relationship network modeling methods, emphasizes the need for heterogeneous genomic data integration and context-specific network modeling and outlines future directions for functional relationship networks.

Key words: functional relationship network, inference, predictions

Fan Zhu is a research fellow at the Department of Computational Medicine and Bioinformatics, University of Michigan. His research interests include dynamic network modeling and clinical outcome prediction. Bharat Panwar is a postdoctoral fellow at the Department of Computational Medicine and Bioinformatics, University of Michigan. His research interests are in the areas of functional genomics and machine learning. Yuanfang Guan is an assistant professor at the Department of Computational Medicine and Bioinformatics, University of Michigan. Her research interests include network biology, data integration and clinical informatics.

Submitted: 26 March 2015; Received (in revised form): 13 July 2015

© The Author 2015. Published by Oxford University Press. For Permissions, please email: [email protected]

Introduction

The idea that genes do not work independently but act together is fundamental to the systems biology era [1, 2]. The conceptual framework for integrating diverse functional genomics data was developed over a decade ago [3, 4]. Modeling functional relationship networks is invaluable [5–11] for extending our understanding of gene functions, pathways and systems-level properties of an organism. It helps to interpret diverse biological processes systematically and offers a critical complement to the reductionist focus of modern biology [12].

The objective of functional relationship network modeling is to infer, or reverse-engineer, the co-functionality between any two genes in the genome. However, inferring biological networks is an underdetermined problem, as the number of gene pairs often greatly exceeds the number of independent measurements. Functional relationship networks can be depicted as graphs, where genes/transcripts are nodes and interactions are edges. An edge in such a network can be derived from any kind of similarity between two genes. For example, the functional relationships in many biological processes are expected to follow a scale-free topology [13], i.e. a sparse graph with a few highly connected nodes. Another important feature of functional relationship networks is that genes cluster into separate modules, where nodes have a high connection density with other nodes within the same module but sparse connections with the rest of the nodes [14]. Modeling and understanding gene modules play important roles in systems biology. Existing reviews of network inference methods are also available [15, 16].

To solve this underdetermined network inference problem, different strategies have been adopted in developing models and algorithms. One common solution is to use co-expression-based network algorithms to model gene co-functionality networks under the guilt-by-association principle. Applying these well-studied methods to public data sets has significantly improved our systematic understanding of the genome [2, 17]. Because of the increasing availability of high-throughput data, many efforts have been devoted to inferring functional relationship networks using heterogeneous genomic data sets. Furthermore, because functional relationships in different organisms or biological processes are expected to be rewired constantly [18–20], context-specific networks have been introduced to model functional relationships in the biological context of interest.

There are many tools available for analyzing and visualizing gene–function relationship networks, including but not limited to Cytoscape [21], VisANT [22], IMP [23], MORPHIN [24] and GeneMANIA [25]. Furthermore, standard languages, e.g. SBML [26] and BioPAX [27], have been developed to represent biochemical reaction networks, which include but are not limited to co-functionality relationships.

While inferring gene regulatory networks is also important, we focus on co-expression-based network inference methods targeting genes that participate in the same biological process or a common pathway. In this article, we review methods for inferring functional relationship networks with an emphasis on context-specific networks. Focusing on co-expression relationships, we first introduce generic network modeling methods and then compare them with context-specific network modeling methods, such as those for tissue-specific and developmental stage-specific networks.

Global network algorithms

There are many existing methods for inferring gene co-functionality networks. The publications mentioned in this article are only the tip of the iceberg, because it is not possible to cover the complete literature, but we believe that they are representative of the development of the field. The methods can be classified according to three major categories: (i) classification type, i.e. supervised, semi-supervised and unsupervised; (ii) input data set number, i.e. integrative and single data set-based (non-integrative); and (iii) specificity, i.e. global (non-context-specific), tissue-specific and developmental stage-specific. Figure 1 shows the categorization of the methods discussed in this article. In this section, several representative global network methods are discussed. Here, a global network represents generalized relationships without reference to any particular condition(s).

Weighted correlation network analysis

Weighted correlation network analysis (WGCNA [28]) is an R package that calculates a functional relationship network based on weighted gene co-expression network analysis. WGCNA can be used to find gene clusters/modules, summarize clusters using hub genes and compare clusters with each other or with external sample traits. Since its development, it has become a very popular tool for analyzing expression data, especially among biologists.

WGCNA specifies an adjacency matrix $A$, which is a symmetric matrix with elements in [0, 1]. The component $A_{i,j}$ represents the network connection strength between nodes $i$ and $j$. The adjacency matrix $A$ is calculated from the co-expression similarity $S$, which is a signed matrix by default and is defined as:

$$S_{i,j} = \mathrm{cor}(x_i, x_j) \qquad (1)$$

WGCNA also implements alternative measures of co-expression, such as biweight midcorrelation and Spearman correlation. Furthermore, a signed co-expression similarity is also possible to keep track of the sign of the co-expression information. For a weighted network, the adjacency matrix $A$ is defined by raising the co-expression similarity to a power:

$$A_{i,j} = S_{i,j}^{\beta} \qquad (2)$$

where $\beta \geq 1$. Equation (2) implies that the adjacency $A_{i,j}$ is proportional to the similarity $S_{i,j}$ on a logarithmic scale. The optimal range of $\beta$ is 5–10, as suggested by the authors of WGCNA. Furthermore, considering that genes in the same pathway can be up-regulated and down-regulated simultaneously, e.g. in inhibition relationships, activated and deactivated genes can also be functionally related. This can be handled by allowing anti-correlated relationships. If anti-correlated genes are considered similar, e.g. when inhibition relationships are of interest, $\beta$ should be set to an even number (the default value is 6). If anti-correlated genes are considered not functionally related, one common solution is to set $\beta$ to an odd number, because the negatively correlated pairs will then be removed in the following step. For an unweighted network, a threshold parameter $\tau$ can be applied to the adjacency matrix $A$ to enable binary predictions:

$$A_{i,j} = \begin{cases} 1 & \text{if } S_{i,j} \geq \tau \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The threshold parameter $\tau$ can be determined automatically using WGCNA’s built-in functions. The parameters $\beta$ in Equation (2) and $\tau$ in Equation (3) are chosen by the user so that the output network remains scale free, e.g. by applying the approximate scale-free criterion [29]. Instead of using the adjacency matrix directly, clustering can be applied to the adjacency matrix to separate all genes into distinct clusters. Genes within the same cluster can also be considered functionally related.
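The three equations above can be illustrated with a minimal Python sketch. This is not WGCNA itself (which is an R package); the function names, the toy gene names and the expression values are hypothetical, and Pearson correlation is assumed as the similarity measure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_matrix(profiles):
    """Equation (1): S[i][j] = cor(x_i, x_j), keeping the sign."""
    genes = list(profiles)
    S = [[pearson(profiles[g], profiles[h]) for h in genes] for g in genes]
    return genes, S

def weighted_adjacency(S, beta=6):
    """Equation (2): element-wise power of the similarity."""
    return [[s ** beta for s in row] for row in S]

def unweighted_adjacency(S, tau):
    """Equation (3): 1 if S[i][j] >= tau, otherwise 0."""
    return [[1 if s >= tau else 0 for s in row] for row in S]

# Hypothetical toy profiles: geneB tracks geneA, geneC mirrors geneA.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.1, 6.2, 7.9, 10.0],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
}
genes, S = similarity_matrix(profiles)
A_even = weighted_adjacency(S, beta=6)    # anti-correlated pair becomes strongly connected
A_odd = weighted_adjacency(S, beta=5)     # anti-correlated pair keeps a negative value
A_bin = unweighted_adjacency(S, tau=0.8)  # tau = 0.8 is an arbitrary illustrative threshold
```

With an even power the perfectly anti-correlated pair (geneA, geneC) receives the same adjacency as a perfectly correlated pair, whereas with an odd power it stays negative and is dropped once a positive threshold is applied, which mirrors the even/odd discussion in the text.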

Context likelihood of relatedness algorithm

The context likelihood of relatedness (CLR) algorithm [30] is an unsupervised network inference method. It is another popular tool that is used extensively by biologists. CLR is an extension of the relevance networks approach [31]. When determining transcriptional regulatory interactions, CLR can benefit from combining transcriptional profiles of an organism across diverse conditions. CLR uses mutual information (MI) to evaluate the similarity between the expression profiles of two genes. The MI is defined as:

$$\mathrm{MI}(X; Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} \qquad (4)$$

where $X$ and $Y$ represent a transcription factor and its target gene, $x_i$ and $y_j$ represent particular expression levels and $P(x_i, y_j)$ is the probability that $X = x_i$ and $Y = y_j$. Pearson correlation can be used as an alternative to MI and, based on the CLR authors' evaluation, delivers similar (slightly worse) results.
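As an illustration, the MI of Equation (4) and the z-score-based background correction that CLR applies to each MI value (Equation 5) can be sketched as follows. This is a simplified sketch rather than the published implementation: equal-width binning, the use of each gene's own MI values as an empirical background and the clipping of negative z-scores to zero are assumptions made here, and all names and numbers are hypothetical.

```python
import math
from collections import Counter

def discretize(values, n_bins=3):
    """Equal-width binning of a continuous expression profile (an assumed estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x, y):
    """Equation (4), with probabilities estimated from joint bin counts."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def clr_scores(mi):
    """Equation (5): sqrt(Zx^2 + Zy^2), z-scores taken against each gene's MI values."""
    n = len(mi)
    stats = []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]
        mu = sum(row) / len(row)
        sd = math.sqrt(sum((v - mu) ** 2 for v in row) / len(row)) or 1.0
        stats.append((mu, sd))
    score = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                zi = max(0.0, (mi[i][j] - stats[i][0]) / stats[i][1])
                zj = max(0.0, (mi[i][j] - stats[j][0]) / stats[j][1])
                score[i][j] = math.sqrt(zi ** 2 + zj ** 2)
    return score

# Identical discrete profiles share all their information; independent ones share none.
mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

# Hypothetical symmetric MI matrix for four genes; pair (0, 1) stands out from its background.
mi_matrix = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.15, 0.2],
    [0.1, 0.15, 0.0, 0.05],
    [0.1, 0.2, 0.05, 0.0],
]
scores = clr_scores(mi_matrix)
```

The point of the background correction is that an MI value is judged relative to what is typical for the two genes involved, so a gene that has moderately high MI with everything does not dominate the network.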


Figure 1. Example of existing methods for predicting functional relationship networks.

This figure shows representative gene network inference methods. Methods can be categorized by specificity (global, tissue-specific and developmental stage-specific), modeling methods (unsupervised, semi-supervised and supervised) or whether they are integrative.

For each gene pair, CLR then estimates the likelihood of the MI score by comparing the MI value with a background distribution of MI values in a null model. The final form of the CLR likelihood is:

$$f(X, Y) = \sqrt{Z_X^2 + Z_Y^2} \qquad (5)$$

where $Z_X$ and $Z_Y$ are the z-scores of $\mathrm{MI}(X; Y)$ in the marginal distributions of $X$ and $Y$, respectively. Other possible background distributions include the generalized extreme-value distribution, the Rayleigh distribution and an empirical distribution, depending on the input data.

Supervised inference of regulatory networks

Another state-of-the-art machine learning algorithm, the support vector machine (SVM), is an excellent candidate for building predictive models of co-functionality networks. SIRENE (supervised inference of regulatory networks) is a supervised method for the inference of gene networks using SVM [32]. SIRENE focuses on inferring gene regulatory relationships, which are one type of co-functionality relationship. An Escherichia coli network is used as an example. SIRENE requires two types of input: (i) a compendium of expression profiles; and (ii) a list of established relationships. In the case of the E. coli network, the known regulatory relationships are collected from a public database, e.g. RegulonDB [33].

SVM is a maximal margin classifier that maximizes the distance to the nearest correctly classified examples to optimize classification performance. SVM tries to minimize the cost function:

$$\frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i \qquad (6)$$

where $w$ defines the hyperplane that separates the positive and negative examples, $\xi_i$ represents the degree of misclassification of sample $i$ and $C$ is a constant that is optimized empirically. Based on computational cross-validation, the SIRENE paper reports significantly better performance than CLR, i.e. SIRENE predicts six times more known regulatory relationships than CLR at 60% precision. Furthermore, the community-based Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge also focuses on the regulatory network inference problem, and numerous regulatory network inference methods have been developed [34–39]. Based on a review article by Maetschke et al. [17], an SVM-based method is the best-performing method for predicting gene regulatory networks among these methods.
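To make the cost function in Equation (6) concrete, the following sketch evaluates it for a given linear separator. It only computes the objective and does not train an SVM or reproduce SIRENE; the bias term `b`, the hinge form of the slack variables and the toy data (expression profiles of genes labeled as known targets, +1, or non-targets, -1, of one hypothetical transcription factor) are assumptions for illustration.

```python
def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM cost 0.5 * ||w||^2 + C * sum_i xi_i.

    The slack xi_i = max(0, 1 - y_i * (w . x_i + b)) is zero for examples that
    lie on the correct side of the margin and grows with the violation.
    """
    margin_term = 0.5 * sum(wk * wk for wk in w)
    slacks = [
        max(0.0, 1.0 - yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]
    return margin_term + C * sum(slacks), slacks

# Hypothetical two-condition expression profiles for three genes.
X = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
y = [1, -1, 1]          # +1: known target of the regulator, -1: non-target
w, b = [1.0, -1.0], 0.0

cost, slacks = svm_objective(w, b, X, y, C=0.5)
```

Here the first two genes are classified with a comfortable margin and contribute no slack, while the third gene is a labeled target that falls on the wrong side and contributes a slack of 2. A larger C would penalize this violation more heavily relative to the margin term.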

Bayesian network

The Bayesian network is a supervised learning method that is popular in the field of functional relationship networks. It has been used successfully to integrate diverse data sources to infer functional networks [3, 4, 40–42], as well as in many other works, including but not limited to [5–7, 43–48]. With the increasing availability of high-throughput data, modeling functional relationship networks by integrating hundreds or thousands of data sets using Bayesian network-based methods has become possible. Furthermore, because many data sets target different genes and have missing values, Bayesian approaches are well suited to this setting.

Figure 2 illustrates the workflow of using a Bayesian network to infer a gene co-functionality network. The first step is to collect a positive gold standard that includes confirmed gene co-functionality relationships. The gold standard can be retrieved from multiple publicly available databases, including but not limited to Gene Ontology (GO) terms [49], BioCyc [50], KEGG [51], Textpresso [52], HPRD [53], Pfam [54], Reactome [55], PID [56] and WormBase [57, 58]. Protein interaction data sets, such as BioGRID [59], BIND [60], DIP [61], IntAct [62], MINT [63] and MIPS [64], are also suitable candidates, either as input data sets or as a source of the positive gold standard. However, accuracy and reliability differ between databases and even within a single database [65].


A negative gold standard refers to pairs of genes that are not functionally connected. However, databases that define negative relationships are not available, so the negative gold standard is usually generated randomly from gene pairs that share no co-annotation.

The second step is to calculate the posterior probability of functional relationships within each data set. The co-functionality of two genes is assessed using base learners such as co-expression-based methods. To ensure comparability between data sets, the correlation coefficients can be further normalized using Fisher’s z-transform. The posterior of each data set can also be calculated using a unified scoring scheme, the log-likelihood score (LLS) [4]:

$$\mathrm{LLS} = \ln\left( \frac{P(L \mid K) \,/\, \neg P(L \mid K)}{P(L) \,/\, \neg P(L)} \right) \qquad (7)$$

where $P(L \mid K)$ and $\neg P(L \mid K)$ are the frequencies of linkages $L$ observed in the given data set $K$ between annotated genes in the same pathway and in different pathways, respectively.

Figure 2. A workflow of using Bayesian network to predict gene co-functionality networks.

This figure is a standard workflow that is used to predict co-functionality networks using Bayesian network. The gold standard is first established by collecting gene co-annotation data from the public database. It could be directly used (in the case of global networks) or refined to fit the context of interest (tissue-/stage-specific networks). Bayesian network is used to measure the quality and accuracy of each data set, and to integrate multiple data sets together using global, tissue-specific or stage-specific gold standard as example.


$P(L)$ and $\neg P(L)$ represent the total frequencies of linkages between all annotated genes in the same pathway and in different pathways, respectively, which are the prior expectations.

The third step is to integrate the results from the individual data sets into the final network:

$$P(FR_{i,j} = 1 \mid E^1_{i,j}, E^2_{i,j}, \ldots, E^n_{i,j}) = \frac{1}{Z}\, P(FR_{i,j} = 1) \prod_{k=1}^{n} P(E^k_{i,j} \mid FR_{i,j} = 1) \qquad (8)$$

where $FR_{i,j} = 1$ represents a gene pair that is functionally related, $E^k_{i,j}$ stands for the score for genes $i$ and $j$ calculated by the base learner in data set $k$ and $Z$ is a normalization factor. Intuitively, the probability $P(FR_{i,j} = 1 \mid E^1_{i,j}, E^2_{i,j}, \ldots, E^n_{i,j})$ denotes how likely genes $i$ and $j$ are to participate in the same biological process, given all the existing data sets.

Equation (8) relies on the input data sets being independent, which may be inappropriate for many biological data sets. As a result, conditional dependence between data sets is a major factor affecting the performance of naïve Bayesian integration [66–68]. Thus, MI is introduced to minimize the damage caused by violations of independence. The co-functionality probability is adjusted as:

$$P'(E^k_{i,j} \mid FR_{i,j} = 1) = \frac{1}{1 + \alpha_k}\, P(E^k_{i,j} \mid FR_{i,j} = 1) + \frac{0.5\,\alpha_k}{1 + \alpha_k} \qquad (9)$$

where $\alpha_k$ is the ratio of the sum of the MI between data set $k$ and all other data sets to the entropy of data set $k$. When $\alpha_k$ is small, i.e. data set $k$ contains information that is highly independent of the other data sets, $P$ is the dominant contributor to $P'$; if $\alpha_k$ is large, i.e. data set $k$ contains information that is highly redundant with the other data sets, $P'$ approaches 0.5 and becomes irrelevant to $P$, which means that data set $k$ contributes almost nothing to the final posterior probability. Sleipnir is one of the tools for inferring co-functionality networks using Bayesian networks [67].
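The scoring and integration steps in Equations (7)–(9) can be sketched in a few lines of Python. This is an illustrative sketch, not the Sleipnir implementation: the function names are hypothetical, the normalization factor Z is assumed to be the sum over the related and unrelated cases, and the same shrinkage is assumed to apply to the likelihood under FR = 0, which the text does not state explicitly.

```python
import math

def log_likelihood_score(same_k, diff_k, same_all, diff_all):
    """Equation (7): log of the same-/different-pathway odds in data set K over the prior odds."""
    return math.log((same_k / diff_k) / (same_all / diff_all))

def shrink_likelihood(p, alpha):
    """Equation (9): pull a likelihood toward 0.5 as the redundancy alpha grows."""
    return p / (1.0 + alpha) + 0.5 * alpha / (1.0 + alpha)

def naive_bayes_posterior(prior, lik_related, lik_unrelated, alphas=None):
    """Equation (8) for one gene pair.

    lik_related[k]   = P(E^k | FR = 1), lik_unrelated[k] = P(E^k | FR = 0).
    Z is computed as the sum of the unnormalized scores for FR = 1 and FR = 0.
    """
    alphas = alphas or [0.0] * len(lik_related)
    related, unrelated = prior, 1.0 - prior
    for p1, p0, a in zip(lik_related, lik_unrelated, alphas):
        related *= shrink_likelihood(p1, a)
        unrelated *= shrink_likelihood(p0, a)
    return related / (related + unrelated)

# Hypothetical numbers: a data set in which linked pairs are four times more
# likely to share a pathway than expected from the prior.
lls = log_likelihood_score(0.8, 0.2, 0.5, 0.5)

# One informative data set raises the posterior above the prior ...
post_independent = naive_bayes_posterior(0.05, [0.8], [0.2])
# ... but the same data set is ignored when it is almost fully redundant.
post_redundant = naive_bayes_posterior(0.05, [0.8], [0.2], alphas=[1e9])
```

With a redundancy of zero the shrinkage leaves the likelihood untouched; as the redundancy grows the likelihoods under both hypotheses converge to 0.5 and the posterior falls back to the prior, which is the behavior described for Equation (9).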

Gaussian graphical models

Gaussian graphical models (GGMs; [69–71]) are another popular class of methods for inferring gene co-functionality networks [72–77]. GGMs are undirected graphical models that can be used to identify conditional independence relations. The inference of GGMs is based on an estimate of the covariance matrix of a multivariate Gaussian distribution $N(\mu, \Sigma)$. Let $K = \Sigma^{-1}$ be the concentration matrix of the distribution, with elements $k_{ij}$. The partial correlation coefficient between genes $i$ and $j$ is defined as:

$$\rho_{ij} = -\frac{k_{ij}}{\sqrt{k_{ii}\,k_{jj}}} \qquad (10)$$

The partial correlation coefficient matrix gives the correlation between two genes after removing the influence of the other genes. A typical GGM can be generated with the following procedure [78]: (i) compute the covariance matrix from the given data; (ii) calculate the partial correlation matrix using Equation (10); and (iii) remove edge (i, j) if its partial correlation coefficient is small. One of the advantages of GGMs over correlation-based methods stems from the observation that a high correlation coefficient does not necessarily indicate a strong connection (as almost all genes could be correlated), whereas a zero correlation coefficient is a strong indicator of independence. However, similar to Bayesian network-based and other methods, GGMs suffer from the p >> n problem (p is the number of parameters and n is the number of training samples), which makes it challenging to find a stable solution. GGMs are also covered in several reviews [15, 16].
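The three-step procedure can be sketched as follows. This is a minimal sketch for small, well-conditioned problems only: it inverts the covariance matrix directly with Gauss-Jordan elimination, which is exactly what fails in the p >> n setting where regularized estimators are needed. The function names, the cutoff and the toy covariance matrix are hypothetical.

```python
import math

def covariance_matrix(samples):
    """Step (i): sample covariance, with rows as samples and columns as genes."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
             for j in range(p)] for i in range(p)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def partial_correlations(cov):
    """Step (ii): Equation (10) applied to the concentration matrix K = inverse(cov)."""
    K = invert(cov)
    p = len(K)
    return [[1.0 if i == j else -K[i][j] / math.sqrt(K[i][i] * K[j][j])
             for j in range(p)] for i in range(p)]

def ggm_edges(pcor, cutoff):
    """Step (iii): keep edge (i, j) only if |partial correlation| >= cutoff."""
    p = len(pcor)
    return [(i, j) for i in range(p) for j in range(i + 1, p) if abs(pcor[i][j]) >= cutoff]

# Hypothetical covariance of a chain 0 - 1 - 2: genes 0 and 2 are marginally
# correlated (0.25 / 0.75), but only through gene 1.
cov = [[0.75, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 0.75]]
pcor = partial_correlations(cov)
edges = ggm_edges(pcor, cutoff=0.1)
```

In this toy example the marginal correlation between genes 0 and 2 is clearly non-zero, yet their partial correlation vanishes once gene 1 is accounted for, so the GGM keeps only the two direct edges of the chain.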

Context-specific network algorithms

The global algorithms described in the previous section focus on networks at the species level. One major limitation of these approaches is that functional relationships in different organisms are expected to be constantly reprogrammed in different biological contexts [18–20]. To address this limitation, Myers and Troyanskaya developed an algorithm that models context-specific networks in 2007 [44]. Since then, significant effort has been put into context-specific networks, e.g. process-specific networks [68], tissue-specific networks based on Bayesian networks [79], tissue-specific networks based on SVM [80], tissue-specific networks based on Gaussian mixture models [81] and developmental stage-specific networks [82, 83]. Another popular type of method not discussed in this article is sub-network extraction [84–89], which identifies connected sets of genes that are significantly differentially expressed under particular subsets of conditions.

Process-specific network

Huttenhower et al. [68] developed biological process-specific networks covering 229 biological processes. A total of 665 data sets were used in this project, including microarray experiments and protein interaction databases. In the Bayesian network, the gold standard is formed from the gene pairs that satisfy at least one of two criteria: (i) both genes are co-annotated to the process of interest in GO; or (ii) one of the two genes is annotated to the process of interest, and the pair is not co-annotated in other processes. In their experiments, five novel genes (AP3B1, ATP6AP1, BLOC1S1, LAMP2 and RAB11A) that were predicted to be active in macro-autophagy were confirmed experimentally. A web-based implementation based on this work, HEFalMp (Human Experimental/Functional Mapper), is available.

Tissue-specific network: Bayesian network

Guan et al. [79] established a tissue-specific network based on the assumption that both genes in a pair need to be expressed in the tissue of interest to be functionally related in the same biological process. Thus, the gold standard introduced in the section ‘Bayesian network’, which serves as a tool to evaluate how accurate and relevant each data set is in the context of interest, needs to be refined according to the tissue of interest. In this algorithm, a positive gold standard pair must satisfy two criteria: (i) it must exist in the nonspecific gold standard, i.e. both genes must appear in the same GO term, KEGG pathway, etc.; and (ii) both genes in the pair must be expressed in the tissue of interest [90, 91]. This specific type of gold standard adjusts the models to the relationships that are relevant to the context of interest. Figure 2 illustrates the workflow of modeling context-specific networks using a Bayesian network.
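The two criteria for a tissue-specific positive gold standard can be expressed as a simple filter. The sketch below is only an illustration of that refinement step; the annotation terms, gene identifiers and the set of tissue-expressed genes are hypothetical placeholders, not data from the cited study.

```python
from itertools import combinations

def global_positive_pairs(annotations):
    """Criterion (i): gene pairs that share at least one term (e.g. a GO term or KEGG pathway)."""
    pairs = set()
    for genes in annotations.values():
        pairs.update(frozenset(pair) for pair in combinations(sorted(genes), 2))
    return pairs

def tissue_specific_positive_pairs(global_pairs, expressed_in_tissue):
    """Criterion (ii): keep a global pair only if both genes are expressed in the tissue."""
    return {pair for pair in global_pairs if pair <= expressed_in_tissue}

# Hypothetical annotations and tissue expression calls.
annotations = {
    "term1": {"g1", "g2", "g3"},
    "term2": {"g3", "g4"},
}
expressed_in_tissue = {"g1", "g2", "g4"}

global_pairs = global_positive_pairs(annotations)
tissue_pairs = tissue_specific_positive_pairs(global_pairs, expressed_in_tissue)
```

In this toy case four pairs are co-annotated globally, but only the pair (g1, g2) survives the tissue filter, because g3 is not expressed in the tissue of interest. The refined gold standard is then used in place of the global one when weighting each input data set.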

Based on the evaluation by Guan et al., the tissue-specific networks show a 20% improvement in predictive power over the global networks. Taking the cerebellum-specific network as an example, there is evidence supporting the top genes predicted to be critical to ataxia. For example, RBFOX1 (physical interaction with ATXN2 [92]), SORBS1 (physical interaction with ATXN7 [92]) and PLCB4 (double knockout confirmed in mice) are among the top genes predicted as ataxia-causing genes by the cerebellum-specific network, whereas they cannot be identified by the global network.

Tissue-specific network: SVM

Chikina et al. created a tissue-specific network using SVM [80]. In this study, a neuron-specific network was generated as an example. The method combined 53 microarray data sets into a large compendium of 916 samples. A rank-based statistic was used to evaluate the level of under- or overexpression of genes associated with given microarray experiments against 2872 genes known to be expressed in a particular tissue. SVM was then applied to establish the networks. SVM can locate hidden tissue-specific expression patterns that are present in only a few samples of the combined data set. Intuitively, SVM automatically identifies the expression patterns in the combined data set whose combination maximally separates genes expressed in the tissues of interest from all others. In their evaluation of performance across all the major tissues of the worm, the method has a precision higher than 90% for the top 30 predictions, whereas the baseline precision for random prediction is only 5%.

Developmental tissue-specific network with extra annotation

The network modeled by Pop et al. [82] focuses on Arabidopsis thaliana tissue-specific networks, with 60 data sets as input. This approach is based on a naïve Bayesian network, but it uses an extra source to determine the tissue-specific gold standard. The tissue-specific gold standard is defined using two criteria: (i) both genes in a pair must be co-annotated to the same GO term; and (ii) both genes in the pair must be co-annotated to the same tissue of interest in the Plant Ontology. This method can also be used to predict stage-specific networks, given a stage-specific gene annotation list. In their cross-validation evaluation, 74% of the context-specific networks perform better than the global network. Furthermore, more than half (55%) of the context-specific networks have an area under the curve >0.63, whereas the baseline for random prediction is 0.5.

Developmental stage-specific network with stage-specific data set

Although tissue-specific networks are more accurate than global networks when inferring functional relationships in the tissues of interest, they still have limitations. One major limitation of both tissue-specific and global networks is that they are static, whereas functional relationships are expected to be constantly reprogrammed during differentiation, even within the same tissue [93–95]. To address this limitation, developmental stage-specific networks [83, 96] and other networks [97, 98] have been developed to model such dynamic changes.


In the algorithm by Zhu et al. 2014 [83], human erythroid cell differentiation is used as an example because of the convenience of isolating a synchronized erythroid cell lineage. As the erythroid differentiation-specific data are collected from one public data set [99, 100], only a limited number of samples (i.e. eight) is available. This algorithm refines the global gold standard into a stage-specific one and infers the network using a Bayesian network. The critical step is to collect expression data defined at a specific developmental stage for a defined cell lineage. This has now become achievable in many tissues because of advances in single-cell RNA-seq technologies.

This algorithm is based on the assumption that genes need to be both expressed and co-expressed to be functionally related in the developmental stage of interest. It generates positive gold standard pairs using three criteria: (i) both genes in a pair must exist in the nonspecific gold standard; (ii) both genes in the pair must be expressed in the developmental stage of interest; and (iii) the expression levels of the two genes must be highly correlated in the stage of interest. The first two criteria are similar to those of the tissue-specific network, while the third criterion ensures that the gold standard is adjusted to fit the specific developmental stage.

In comparison with WGCNA and CLR, this algorithm can handle stage-specific cases with a small number of samples (e.g. in their methodology paper, the authors used only four developmental stage-specific samples to infer the human erythroid cell differentiation network), whereas both WGCNA and CLR rely heavily on the number of input samples (e.g. more than a hundred samples are used in their methodology papers). The major advantage of this stage-specific network over WGCNA and CLR is that it does not infer a network directly from the stage-specific data set. Instead, it uses the stage-specific data set as a reference to examine and weigh public data sets. The data sets that are accurate and relevant to the developmental stage of interest are upweighted.

In their evaluation, the erythroid cell differentiation stage-specific network also shows a significant performance improvement over the nonspecific networks. Many erythroid-related genes that are identified automatically by the stage-specific network are also supported by the existing literature. Figure 3, which is modified from Zhu et al. 2014 [83], is an example of how developmental stage-specific networks can improve prediction accuracy over global networks. For example, in their comparison, the well-known and confirmed co-functionality relationships of hemoglobin beta (HBB)–hemoglobin alpha 1 and HBB–hemoglobin alpha 2 have co-functionality probabilities of only 0.141 and 0.121, respectively, in the global network. These are only slightly higher than the probability of random prediction, which is 0.05. These relationships are automatically emphasized in the erythroid developmental stage-specific network, with probabilities of 0.618 and 0.974, respectively.

Under some circumstances, relationships are delayed, e.g. gene A is regulated by gene B after some delay, which makes the two genes correlated with each other with a time lag. This is one type of stage-specific relationship and can be inferred by algorithms that take the delays into consideration [101, 102].
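The three criteria for a stage-specific positive gold standard can be sketched as a filter over the global gold standard. This is an illustrative sketch only: the expression cutoff, the correlation cutoff, the use of Pearson correlation and of the maximum expression value to call a gene "expressed" are assumptions made here, and the gene names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return sxy / denom if denom else 0.0

def stage_specific_positive_pairs(global_pairs, stage_expression,
                                  min_expression=1.0, min_correlation=0.8):
    """Keep a pair if it is (i) in the global gold standard, (ii) made of two genes
    expressed in the stage-specific samples and (iii) co-expressed across them."""
    selected = set()
    for pair in global_pairs:
        g, h = sorted(pair)
        if g not in stage_expression or h not in stage_expression:
            continue
        x, y = stage_expression[g], stage_expression[h]
        both_expressed = max(x) >= min_expression and max(y) >= min_expression
        if both_expressed and pearson(x, y) >= min_correlation:
            selected.add(pair)
    return selected

# Hypothetical global gold standard and four stage-specific samples per gene.
global_pairs = {
    frozenset({"geneA", "geneB"}),   # expressed and co-expressed
    frozenset({"geneA", "geneX"}),   # geneX is not expressed in this stage
    frozenset({"geneB", "geneY"}),   # expressed but anti-correlated
}
stage_expression = {
    "geneA": [5.0, 8.0, 11.0, 14.0],
    "geneB": [4.0, 7.0, 9.0, 13.0],
    "geneX": [0.1, 0.2, 0.1, 0.2],
    "geneY": [9.0, 6.0, 4.0, 1.0],
}
stage_pairs = stage_specific_positive_pairs(global_pairs, stage_expression)
```

Only the pair (geneA, geneB) passes all three criteria in this toy example. In the approach described above, such a refined gold standard is not used to build the network directly but to judge how relevant each public data set is to the developmental stage of interest.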

Conclusions and future directions Functional relationship networks could complement the reductionist focus of modern biology for understanding diverse

692

|

Zhu et al.

Figure 3. The developmental stage-specific network identifies the known and novel erythroid-related genes.

This figure is modified from Figures 3 and 4, Zhu et al. 2014. Subfigures (a) and (b) are for well-studied erythroid gene HBB. Subfigures (c) and (d) show the global and stage-specific network for a novel gene OSBP2 (oxys-terol binding proteins 2). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

biological processes in an organism. Numerous methods have been developed to infer gene co-functional relationship networks. To make full use of existing prior information (e.g. GO terms, pathways), supervised and semi-supervised models are needed to infer the networks. At the same time, the accuracy of (semi-)supervised methods depends heavily on the quality of that prior information. Single data set-based (non-integrative) methods are strongly limited by the number of samples and the data quality of the training data compendium. With the increasing availability of high-throughput data, integrative methods have become good candidates. Integrative methods benefit substantially from automatically adjusting the weight of each input data set according to its quality and relevance. Thus, the most recent network approaches are mainly (semi-)supervised and integrative. Because functional relationships are constantly reprogrammed across tissues, context-specific networks provide a better view of the relationships in the tissues of interest. Tissue-specific approaches assume that tissue-related genes follow specific expression patterns, and they use this assumption to assess the quality of each data set and its relevance to the tissue of interest. Furthermore, functional relationships keep rewiring across biological stages, even within the same tissue. Stage-specific networks are therefore necessary to handle this situation. These networks may use different assumptions to identify the developmental stage-specific genes, e.g. genes that are co-expressed in the data set from the developmental stage, or genes that are co-annotated to a developmental stage of interest. A large amount of genomic data is now available for human disease, e.g. cancer data from The Cancer Genome Atlas project [103]. These disease-specific data are particularly useful for inferring disease-specific gene co-functionality networks. In the next-generation sequencing era, the field of network biology needs to advance accordingly, and context-specific networks can provide better insight into functional relationships in a particular context. Although this article focuses only on co-expression relationship networks, many other important networks exist, such as causal/regulatory networks [104, 105], metabolic networks [88, 106] and microbiome networks [107, 108]. Furthermore, heterogeneous models could benefit from diverse combinations of data, e.g. expression profiles, protein–protein interaction information and phenotypic information. This article describes representative methods in the field of modeling functional relationship networks; it is not possible to include all methods here. Generally, more specific networks result in more accurate inference that better reflects the real functional relationships at the tissue/stage of interest. With the development of single-cell sequencing, it is envisioned that developmental stage-specific networks can be applied to more biological systems.
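The idea of weighting each input data set by its quality and relevance can be illustrated with a minimal sketch. This is not one of the reviewed methods: it replaces Bayesian integration with a much simpler stand-in, scoring each data set by how much higher its mean co-expression is for gold-standard (co-annotated) gene pairs than for background pairs. The function names, the toy expression matrices and the gold-standard pairs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def coexpression(expr):
    """Pearson correlation between the gene rows of a genes x samples matrix."""
    return np.corrcoef(expr)


def dataset_weight(corr, gold_pairs, background_pairs):
    """Illustrative quality score (an assumption, not a published method):
    mean correlation of gold-standard pairs minus mean correlation of
    background pairs, clipped at zero."""
    pos = float(np.mean([corr[i, j] for i, j in gold_pairs]))
    neg = float(np.mean([corr[i, j] for i, j in background_pairs]))
    return max(pos - neg, 0.0)


def integrate(datasets, gold_pairs, background_pairs):
    """Combine per-data-set co-expression matrices into one weighted network."""
    corrs = [coexpression(d) for d in datasets]
    weights = np.array(
        [dataset_weight(c, gold_pairs, background_pairs) for c in corrs]
    )
    if weights.sum() == 0:
        # No data set separates the gold standard: fall back to equal weights.
        weights = np.ones(len(corrs))
    weights = weights / weights.sum()
    network = sum(w * c for w, c in zip(weights, corrs))
    return network, weights


# Toy data: 4 genes x 6 samples. In the first data set the gold-standard
# pairs (0, 1) and (2, 3) are co-expressed; in the second they are not.
informative = np.array([
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [1, -1, 1, -1, 1, -1],
    [2, -2, 2, -2, 2, -2],
], dtype=float)
uninformative = np.array([
    [1, 2, 3, 4, 5, 6],
    [1, -1, 1, -1, 1, -1],
    [2, 4, 6, 8, 10, 12],
    [2, -2, 2, -2, 2, -2],
], dtype=float)

gold_pairs = [(0, 1), (2, 3)]                        # hypothetical co-annotated pairs
background_pairs = [(0, 2), (0, 3), (1, 2), (1, 3)]  # hypothetical unrelated pairs

network, weights = integrate(
    [informative, uninformative], gold_pairs, background_pairs
)
print(weights)
```

Under the same assumptions, a context-specific variant would call `integrate` with a gold standard restricted to gene pairs believed to be related in the tissue or stage of interest, so that each data set is weighted by its relevance to that context rather than by its global quality.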


Key Points

• Functional relationship networks provide valuable information to facilitate our understanding of normal and disease-specific physiology.
• Numerous methods have been developed to model the gene co-functionality networks. These methods can be categorized into supervised and unsupervised, integrative and non-integrative and specific and global ones.
• We reviewed the representative state-of-the-art co-functionality network inference methods, with an emphasis on the context-specific network modeling methods.
• Context-specific networks can capture the relationship rewiring at different tissues and biological processes. They are more accurate than global networks when modeling the functional relationships at the tissues/developmental stages of interest.

Funding


NSF 1452656, NIH 1R21NS082212-01, EU-FP VII Systems Biology of Rare Disease, NIH University of Michigan O’Brien Kidney Translational Core Center.


References

1. Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 1961;3(3):318–56.
2. De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol 2010;8(10):717–29.
3. Troyanskaya OG, Dolinski K, Owen AB, et al. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 2003;100(14):8348–53.
4. Lee I, Date SV, Adai AT, et al. A probabilistic functional network of yeast genes. Science 2004;306(5701):1555–8.
5. Linghu B, Snitkin ES, Hu Z, et al. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol 2009;10(9):R91.
6. Lee I, Ambaru B, Thakkar P, et al. Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol 2010a;28(2):149–56.
7. Lee I, Lehner B, Vavouri T, et al. Predicting genetic modifier loci using functional gene networks. Genome Res 2010b;20(8):1143–53.
8. Lee I, Blom UM, Wang PI, et al. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 2011;21(7):1109–21.
9. Hwang S, Rhee SY, Marcotte EM, et al. Systematic prediction of gene function in Arabidopsis thaliana using a probabilistic functional gene network. Nat Protoc 2011;6(9):1429–42.
10. Wang PI, Hwang S, Kincaid RP, et al. RIDDLE: reflective diffusion and local extension reveal functional associations for unannotated gene sets via proximity in a gene network. Genome Biol 2012;13(12):R125.
11. Dutkowski J, Kramer M, Surma MA, et al. A gene ontology inferred from molecular networks. Nat Biotechnol 2013;31(1):38–45.
12. Stuart JM, Segal E, Koller D, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003;302(5643):249–55.
13. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet 2004;5(2):101–13.
14. Wang Y, Huang H. Review on statistical methods for gene network reconstruction using expression data. J Theor Biol 2014;362:53–61.
15. Aittokallio T, Schwikowski B. Graph-based methods for analysing networks in cell biology. Brief Bioinform 2006;7(3):243–55.
16. Markowetz F, Spang R. Inferring cellular networks–a review. BMC Bioinformatics 2007;8(Suppl 6):S5.
17. Maetschke SR, Madhamshettiwar PB, Davis MJ, et al. Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Brief Bioinform 2014;15:195–211.
18. Winter EE, Goodstadt L, Ponting CP. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res 2004;14(1):54–61.
19. Chao EC, Lipkin SM. Molecular models for the tissue specificity of DNA mismatch repair-deficient carcinogenesis. Nucleic Acids Res 2006;34(3):840–52.
20. Lage K, Hansen NT, Karlberg EO, et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci USA 2008;105(52):20870–5.
21. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2011;27(3):431–2.
22. Hu Z, Hung JH, Wang Y, et al. VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res 2009;37:W115–21.
23. Wong AK, Park CY, Greene CS, et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2012;40(W1):W484–90.
24. Hwang S, Kim E, Yang S, et al. MORPHIN: a web tool for human disease research by projecting model organism biology onto a human integrated gene network. Nucleic Acids Res 2014;42:W147–53.
25. Warde-Farley D, Donaldson SL, Comes O, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 2010;38(Suppl 2):W214–20.
26. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19(4):524–31.
27. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28(9):935–42.
28. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.
29. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005;4(1):Article17.
30. Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007;5(1):e8.
31. Eisen MB, Spellman PT, Brown PO, et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95:14863–8.
32. Mordelet F, Vert JP. SIRENE: supervised inference of regulatory networks. Bioinformatics 2008;24(16):i76–82.
33. Gama-Castro S, Jiménez-Jacinto V, Peralta-Gil M, et al. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008;36(Suppl 1):D120–4.
34. Stolovitzky G, Monroe D, Califano A. Dialogue on reverse-engineering assessment and methods. Ann N Y Acad Sci 2007;1115(1):1–22.
35. Stolovitzky G, Prill RJ, Califano A. Lessons from the DREAM2 Challenges. Ann N Y Acad Sci 2009;1158(1):159–95.
36. Marbach D, Prill RJ, Schaffter T, et al. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci USA 2010;107(14):6286–91.
37. Marbach D, Costello JC, Küffner R, et al. Wisdom of crowds for robust gene network inference. Nat Methods 2012;9(8):796–804.
38. Prill RJ, Marbach D, Saez-Rodriguez J, et al. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 2010;5(2):e9202.
39. Prill RJ, Saez-Rodriguez J, Alexopoulos LG, et al. Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci Signal 2011;4(189):mr7.
40. Franke L, Bakel HV, Fokkens L, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006;78(6):1011–25.
41. Guan Y, Myers CL, Lu R, et al. A genomewide functional network for the laboratory mouse. PLoS Comput Biol 2008;4(9):e1000165.
42. Lee I, Lehner B, Crombie C, et al. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet 2008;40(2):181–8.
43. Lee I, Li Z, Marcotte EM. An improved, bias-reduced probabilistic functional gene network of baker’s yeast, Saccharomyces cerevisiae. PLoS One 2007;2(10):e988.
44. Myers CL, Troyanskaya OG. Context-sensitive data integration and prediction of biological networks. Bioinformatics 2007;23(17):2322–30.
45. Kim WK, Krumpelman C, Marcotte EM. Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy. Genome Biol 2008;9(Suppl 1):S5.
46. Myers CL, Chiriac C, Troyanskaya OG. Discovering biological networks from diverse functional genomic data. In: Protein Networks and Pathway Analysis. Springer, 2009, 157–75.
47. Guan Y, Ackert-Bicknell CL, Kell B, et al. Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Comput Biol 2010;6(11):e1000991.
48. van Steensel B, Braunschweig U, Filion GJ, et al. Bayesian network analysis of targeting interactions in chromatin. Genome Res 2010;20(2):190–200.
49. Consortium GO. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32(Suppl 1):D258–61.
50. Karp PD, Ouzounis CA, Moore-Kochlacs C, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005;33(19):6083–9.
51. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28(1):27–30.
52. Müller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004;2(11):e309.
53. Mishra GR, Suresh M, Kumaran K, et al. Human protein reference database—2006 update. Nucleic Acids Res 2006;34(Suppl 1):D411–14.
54. Finn RD, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Res 2010;38(Suppl 1):D211–22.
55. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Suppl 1):D428–32.
56. Schaefer CF, Anthony K, Krupa S, et al. PID: the pathway interaction database. Nucleic Acids Res 2009;37(Suppl 1):D674–9.
57. Stein L, Sternberg P, Durbin R, et al. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res 2001;29(1):82–6.
58. Harris TW, Baran J, Bieri T, et al. WormBase 2014: new views of curated biology. Nucleic Acids Res 2014;42(D1):D789–93.
59. Stark C, Breitkreutz BJ, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006;34(Suppl 1):D535–9.
60. Alfarano C, Andrade C, Anthony K, et al. The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res 2005;33(Suppl 1):D418–24.
61. Xenarios I, Salwinski L, Duan XJ, et al. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002;30(1):303–5.
62. Kerrien S, Alam-Faruque Y, Aranda B, et al. IntAct—open source resource for molecular interaction data. Nucleic Acids Res 2007;35(Suppl 1):D561–5.
63. Ceol A, Aryamontri AC, Licata L, et al. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res 2009;38:D532–9.
64. Mewes HW, Frishman D, Güldener U, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002;30(1):31–4.
65. Lee I, Marcotte EM. Effects of functional bias on supervised learning of a gene network model. In: Computational Systems Biology. Springer, 2009, Vol. 541, pp. 463–75.
66. Huttenhower C, Hibbs M, Myers C, et al. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 2006;22(23):2890–7.
67. Huttenhower C, Schroeder M, Chikina MD, et al. The Sleipnir library for computational functional genomics. Bioinformatics 2008;24(13):1559–61.
68. Huttenhower C, Haley EM, Hibbs MA, et al. Exploring the human genome with functional maps. Genome Res 2009;19(6):1093–106.
69. Lauritzen SL. Graphical Models. Oxford University Press, 1996.
70. Edwards D. Introduction to Graphical Modelling. Springer Science & Business Media, 2000.
71. Hastie T, Tibshirani R, Friedman J, et al. The Elements of Statistical Learning. Springer, 2009.
72. Friedman N. Inferring cellular networks using probabilistic graphical models. Science 2004;303(5659):799–805.
73. Wille A, Zimmermann P, Vranová E, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol 2004;5(11):R92.
74. Ma S, Gong Q, Bohnert HJ. An Arabidopsis gene network based on the graphical Gaussian model. Genome Res 2007;17(11):1614–25.
75. Shimamura T, Imoto S, Yamaguchi R, et al. Weighted lasso in graphical Gaussian modeling for large gene network estimation based on microarray data. Genome Inform 2007;19:142–53.
76. Dittrich MT, Klau GW, Rosenwald A, et al. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics 2008;24(13):i223–31.
77. Krumsiek J, Suhre K, Illig T, et al. Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Syst Biol 2011;5(1):21.
78. Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics 2006;22(20):2523–31.
79. Guan Y, Gorenshteyn D, Burmeister M, et al. Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol 2012;8(9):e1002694.
80. Chikina MD, Huttenhower C, Murphy CT, et al. Global prediction of tissue-specific gene expression and context-dependent gene networks in Caenorhabditis elegans. PLoS Comput Biol 2009;5(6):e1000417.
81. Lahti L, Knuuttila JE, Kaski S. Global modeling of transcriptional responses in interaction networks. Bioinformatics 2010;26(21):2713–20.
82. Pop A, Huttenhower C, Iyer-Pascuzzi A, et al. Integrated functional networks of process, tissue, and developmental stage specific interactions in Arabidopsis thaliana. BMC Syst Biol 2010;4(1):180.
83. Zhu F, Shi L, Li H, et al. Modeling dynamic functional relationship networks and application to ex vivo human erythroid differentiation. Bioinformatics 2014;30:3325–33.
84. Ideker T, Ozier O, Schwikowski B, et al. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002;18(Suppl 1):S233–40.
85. Noirel J, Ow SY, Sanguinetti G, et al. Automated extraction of meaningful pathways from quantitative proteomics data. Brief Funct Genomic Proteomic 2008;7(2):136–46.
86. Sanguinetti G, Noirel J, Wright PC. MMG: a probabilistic tool to identify submodules of metabolic pathways. Bioinformatics 2008;24(8):1078–84.
87. Beisser D, Klau GW, Dandekar T, et al. BioNet: an R-Package for the functional analysis of biological networks. Bioinformatics 2010;26(8):1129–30.
88. Faust K, Dupont P, Callut J, et al. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 2010;26(9):1211–18.
89. Liu B, Pop M. MetaPath: Identifying Differentially Abundant Metabolic Pathways in Metagenomic Datasets. BMC proceedings, BioMed Central Ltd, 2011.
90. Son CG, Bilke S, Davis S, et al. Database of mRNA gene expression profiles of multiple human organs. Genome Res 2005;15(3):443–50.
91. Shlomi T, Cabili MN, Herrgård MJ, et al. Network-based prediction of human tissue-specific metabolism. Nat Biotechnol 2008;26(9):1003–10.
92. Brusco A, Gellera C, Cagnoli C, et al. Molecular genetics of hereditary spinocerebellar ataxia: mutation analysis of spinocerebellar ataxia genes and CAG/CTG repeat expansion detection in 225 Italian families. Arch Neurol 2004;61(5):727–33.
93. Park SH, Zarrinpar A, Lim WA. Rewiring MAP kinase pathways using alternative scaffold assembly mechanisms. Science 2003;299(5609):1061–4.
94. Bandyopadhyay S, Mehta M, Kuo D, et al. Rewiring of genetic networks in response to DNA damage. Science 2010;330(6009):1385–9.
95. Kim J, Kim I, Yang JS, et al. Rewiring of PDZ domain-ligand interaction network contributed to eukaryotic evolution. PLoS Genet 2012;8(2):e1002510.
96. Zhu F, Shi L, Engel JD, et al. Regulatory network inferred using expression data of small sample size: application and validation in erythroid system. Bioinformatics 2015;31:2537–44.
97. Orr SJ, Boutz DR, Wang R, et al. Proteomic and protein interaction network analysis of human T lymphocytes during cell-cycle entry. Mol Syst Biol 2012;8(1):573.
98. Xiong J, Zhou T, Khanin R. A Kalman-filter based approach to identification of time-varying gene regulatory networks. PLoS One 2013;8(10):e74571.
99. Shi L, Lin YH, Sierant M, et al. Developmental transcriptome analysis of human erythropoiesis. Hum Mol Genet 2014a;23:4528–42.
100. Shi L, Sierant M, Gurdziel K, et al. Biased, non-equivalent gene-proximal and -distal binding motifs of orphan nuclear receptor TR4 in primary human erythroid cells. PLoS Genet 2014b;10(5):e1004339.
101. Schmitt WA, Raab RM, Stephanopoulos G. Elucidation of gene interaction networks through time-lagged correlation analysis of transcriptional data. Genome Res 2004;14(8):1654–63.
102. Zou M, Conzen SD. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005;21(1):71–9.
103. Weinstein JN, Collisson EA, Mills GB, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet 2013;45(10):1113–20.
104. Guelzim N, Bottani S, Bourgine P, et al. Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 2002;31(1):60–3.
105. Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nat Genet 2004;36(5):492–6.
106. Herrgard MJ, Swainston N, Dobson P, et al. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat Biotechnol 2008;26(10):1155–60.
107. Faust K, Raes J. Microbial interactions: from networks to models. Nat Rev Microbiol 2012;10(8):538–50.
108. Faust K, Sathirapongsasuti JF, Izard J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol 2012;8(7):e1002606.
