Journal of Microbiological Methods 116 (2015) 44–52


A global network-based protocol for functional inference of hypothetical proteins in Synechocystis sp. PCC 6803

Lianju Gao, Guangsheng Pei, Lei Chen, Weiwen Zhang ⁎

Laboratory of Synthetic Microbiology, School of Chemical Engineering & Technology, Tianjin University, Tianjin 300072, PR China
Key Laboratory of Systems Bioengineering, Ministry of Education of China, Tianjin 300072, PR China
SynBio Platform, Collaborative Innovation Center of Chemical Science & Engineering, Tianjin, PR China

Article info

Article history: Received 19 April 2015; Received in revised form 24 June 2015; Accepted 25 June 2015; Available online 3 July 2015

Keywords: Hypothetical proteins; Network; Global algorithm; Operon; Synechocystis

Abstract

Functional inference of hypothetical proteins (HPs) is a significant task in the post-genomic era. We describe here a network-based protocol for functional inference of HPs using experimental transcriptomic, proteomic, and protein–protein interaction (PPI) datasets. The protocol includes two steps: i) co-expression networks were constructed using large proteomic or transcriptomic datasets of Synechocystis sp. PCC 6803 under various stress conditions, and then combined with a Synechocystis PPI network to generate bi-colored networks that include both annotated proteins and HPs; ii) a global algorithm was adapted to the bi-colored networks for functional inference of HPs. The algorithm ranked the associations between genes/proteins and known GO functional categories, and assumed that the top-ranked HP for each GO functional category might have a function related to that category. We applied the protocol to all HPs of the model cyanobacterium Synechocystis, and were able to assign putative functions to 122 HPs that had not previously been functionally characterized. Finally, the functional inference was validated against known operon information, and the results showed that more than 70% of the HPs tested could be validated correctly. The study provides a new protocol to integrate different types of OMICS datasets for functional inference of HPs, and could be useful in gaining new insights into Synechocystis metabolism.

© 2015 Elsevier B.V. All rights reserved.

⁎ Corresponding author at: Laboratory of Synthetic Microbiology, School of Chemical Engineering & Technology, Tianjin University, Tianjin 300072, PR China. E-mail address: [email protected] (W. Zhang).

http://dx.doi.org/10.1016/j.mimet.2015.06.013
0167-7012/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Hypothetical proteins (HPs) are proteins that are predicted from nucleic acid sequences but whose existence has not been demonstrated by any experimental evidence (Lubec et al., 2005). They constitute about 25–40% of all open reading frames in well-studied model microorganisms, such as Escherichia coli (Galperin and Koonin, 2004), and this proportion can be as high as 50% in the genomes of less-studied microorganisms (Doerks et al., 2012). Since HPs make up a considerable fraction of microbial proteomes, it is possible that they serve important and even novel biological roles (Adams et al., 2007; Desler et al., 2009; Eisenstein et al., 2000). In fact, it has become clear that the existence of a large number of HPs in genomic data has significantly restricted efforts to decipher microbial metabolism, raising an urgent need for new experimental and computational methods to decode or infer the functions of HPs (Mazandu and Mulder, 2012).

Cyanobacteria are autotrophic prokaryotes that perform oxygenic photosynthesis similar to that performed by higher plants (Rippka et al., 1979). According to a recent survey of CyanoBase (http://genome.microbedb.jp/cyanobase), 30–60% of the putative proteins are HPs in various cyanobacterial genomes. The model cyanobacterial species Synechocystis sp. PCC 6803 (hereafter Synechocystis) was the first phototrophic organism to be sequenced (Kaneko et al., 1995, 1996), and it has been studied extensively (Govindjee, 2011). However, even after many years of effort to improve its genome annotation, nearly 33% of all putative proteins in the Synechocystis genome are still annotated as HPs (Qiao et al., 2013a). Meanwhile, more evidence is emerging that HPs may play important physiological roles in Synechocystis. For example, HP Slr1799 was found to be involved in the response to salt stress (Karandashova et al., 2002), and chloroplast HP Slr0374 is involved in the regulation of CO2 utilization in Synechocystis (Jiang et al., 2015).

Since experimental characterization of protein function cannot keep pace with the vast number of HPs already present in databases (Liolios et al., 2010), computational annotation has been proposed as a useful and practical means of inferring the functions of HPs and providing functional clues for further experimental validation (Radivojac et al., 2013). Among the various computational methods, the network-based approach has attracted significant attention for deciphering potential function because it does not depend on sequence similarity (Kourmpetis et al., 2010).


Two types of network-based approaches have been established previously: the direct annotation scheme and the module-assisted scheme (Sharan et al., 2007). The direct annotation scheme is guided by the principle that proteins lying closer to one another in the network are more likely to have similar functions (Deng et al., 2003; Karaoz et al., 2004; Letovsky and Kasif, 2003; Mostafavi et al., 2008; Vazquez et al., 2003), while the module-assisted scheme defines protein modules based on the connectivity of proteins in the network and then assigns possible functions to proteins based on the associated member proteins with known functions (Arnau et al., 2005; Becker et al., 2012). So far, a number of algorithms have been developed for both direct and module-assisted function prediction, and some of them have been implemented and supported with graphical interfaces (Sharan et al., 2007). In a simple comparison of the two approaches, Sharan et al. (2007) applied a simple neighbor-counting method and the more involved module-assisted molecular complex detection (MCODE) algorithm to protein–protein interaction (PPI) datasets, and found that the direct method had a higher specificity in predicting functions than the module-assisted one. Recently, a new direct annotation method was developed and used to annotate the functions of long non-coding RNAs, and the inferred functions closely matched those reported in the literature (Guo et al., 2012).

Using this method as a core, we describe in this paper a network-based protocol for functional inference of HPs using experimental OMICS datasets and existing PPI datasets. To integrate different types of datasets (i.e., genomics, transcriptomics and proteomics), bi-colored biological networks were first constructed using either transcriptomic or proteomic data, along with PPI data. A global propagation algorithm was then applied to the networks to infer the functions of HPs (Guo et al., 2012). The analysis resulted in functional inference of 122 HPs in the Synechocystis genome, which were then validated using operon information. The study provides a new protocol to integrate different types of experimental OMICS datasets for functional inference of HPs.

2. Methods

2.1. Data sources

2.1.1. Proteomic data

A total of eleven iTRAQ LC–MS/MS proteomic datasets of Synechocystis were obtained from previous studies. For the first five datasets, Synechocystis was grown under ethanol (1.50%, v/v), butanol (0.20%, v/v), hexane (0.80%, v/v), salt stress (4%, w/v) and nitrogen starvation conditions, respectively. For each condition, cells were harvested for proteomic analysis at two time points (i.e., 24 and 48 h), corresponding to the middle-exponential and exponential-stationary transition phases of the growth time course (Huang et al., 2013; Liu et al., 2012; Qiao et al., 2012, 2013b; Tian et al., 2013). In the remaining six datasets, knockout mutants of Synechocystis were grown under ethanol (1.60%, v/v), butanol (0.25%, v/v), cadmium (4.0 μM), pH 5.0, pH 12.0, and salt stress (4%, w/v) conditions, and cells were harvested at either 36 or 48 h (Chen et al., 2014; Qiao et al., 2012, 2013b; Ren et al., 2014; Tian et al., 2013). Mass spectrometry analysis was performed using an AB SCIEX TripleTOF™ 5600 mass spectrometer (AB SCIEX, Framingham, MA) coupled with an online micro-flow HPLC system (Shimadzu Co, Kyoto, Japan), as described previously (Unwin et al., 2010). For details regarding the experimental design and quality control of the data, please refer to the previous publications (Huang et al., 2013; Liu et al., 2012; Qiao et al., 2012, 2013b; Ren et al., 2014; Tian et al., 2013).

2.1.2. Transcriptomic data

Five RNA-seq transcriptomic datasets of Synechocystis were obtained from previous studies. Cells were collected from five stress conditions: 0.2% butanol (v/v), 1.5% ethanol (v/v), 0.8% hexane (v/v), 4.0% NaCl (w/v), and nitrogen starvation. For each condition, cells were harvested at three time points (i.e., 24, 48 and 72 h) for transcriptomic analysis. For details regarding the experimental design and quality control of the data, please refer to the previous publications (Huang et al., 2013; Liu et al., 2012; Qiao et al., 2013b; Wang et al., 2012; Zhu et al., 2013).

2.1.3. PPI dataset and annotation information

A PPI dataset of Synechocystis was downloaded from the STRING database (http://www.string-db.org/) (Jensen et al., 2009). In the STRING database, several types of evidence for an association, including genomic context, high-throughput experiments, conserved co-expression and previous biological knowledge, are used to calculate a single combined_score for each pair of associated genes. In this study, the combined_scores, which indicate a higher confidence than any single type of evidence, were used to construct the PPI network so as to cover potential protein–protein connections (Szklarczyk et al., 2011). To describe protein function, we used the classification scheme provided by the biological process (BP) ontology of the Gene Ontology (GO) Consortium (Ashburner et al., 2000; Lægreid et al., 2003). The known 'gene2go' associations in the Synechocystis genome were downloaded from the CyanoBase database (Nakamura et al., 1998).

2.2. Missing value estimation

To improve the imputation of missing proteomic values, three imputation methods were first implemented and evaluated: the K nearest neighbors (KNN) algorithm (Thirumahal and Patil, 2014), the local least squares imputation (LLSimpute) method (Kim et al., 2005), and predictive mean matching (PMM), an imputation method based on chained equations (Souverein et al., 2006). To compare the error rates of the methods, a set of values was randomly chosen from a proteomic dataset and removed to generate an incomplete proteomic dataset at a given missing rate. Because the real values are known, the estimation error can be calculated. The methods were evaluated according to their normalized root-mean-square error (NRMSE) values:


$$ \mathrm{NRMSE} = \frac{\sqrt{\dfrac{1}{N}\sum_{j=1}^{N}\left(y_j - \hat{y}_j\right)^{2}}}{\sigma_y} $$

where $y_j$ is the real value, $\hat{y}_j$ is the estimated value, and $\sigma_y$ is the standard deviation of the N true values (Kim et al., 2005). NRMSE values typically fall between 0.0 and 1.0, and a smaller NRMSE value means a higher accuracy. This evaluation criterion has been applied in several previous studies (Feten et al., 2005; Meng et al., 2014).

2.3. Construction of bi-colored networks

The procedure to construct a co-expression network includes: i) the raw proteomic or transcriptomic data were normalized by converting them into ratios of each condition versus its control, followed by log2 transformation; ii) correlation values, defined as the Pearson correlation coefficient, were calculated for all possible gene/protein pairs; iii) the correlation coefficients were normalized using a Min–Max linear normalization algorithm developed previously (Guo et al., 2012), and the co-expression network was then constructed, in which the edge weights represent Pearson correlation coefficients and the nodes represent genes/proteins (Pei et al., 2014); iv) a correlation coefficient cutoff was applied to the co-expression network, so that only gene/protein pairs with a correlation coefficient higher than the cutoff were considered connected. As biological networks behave like scale-free networks (Tsoi et al., 2014), the distribution of connections follows a power-law relationship. To select a suitable correlation coefficient cutoff for the co-expression networks, the node degree distribution in the bi-colored network was calculated. We compared networks constructed using different correlation coefficient cutoffs by considering the number of proteins in the network and the R² of the fitting curves, in order to generate networks with a better topology.

To obtain the final PPI network for constructing the bi-colored networks, edges that contain at least one protein present in the co-expression network were kept, and the PPI network weights were re-calculated according to step iii described above. To generate a weighted bi-colored network, the proteomics- or transcriptomics-based co-expression network was combined with the PPI network by calculating the edge weights of the bi-colored network with the following formula (Von Mering et al., 2005):

$$ w_{i,j}^{\text{bi-colored}} = 1 - \left(1 - w_{i,j}^{\text{co-expr}}\right)\left(1 - w_{i,j}^{\text{ppi}}\right) $$

Here, $w_{i,j}^{\text{bi-colored}}$, $w_{i,j}^{\text{co-expr}}$ and $w_{i,j}^{\text{ppi}}$ denote the weighted relation in the bi-colored network, the weighted co-expression and the weighted protein interaction between gene/protein i and gene/protein j, respectively. 'Bi-colored' means that the networks include two different colors of nodes (i.e., annotated proteins and HPs) and two different colors of edges (i.e., co-expression and PPI). Following this procedure, three bi-colored networks were independently established and used for HP functional inference: the first was constructed using proteomic and PPI data, the second using transcriptomic and PPI data, and the third using proteomic, transcriptomic and PPI data.
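The construction steps of Section 2.3 can be illustrated with a minimal sketch. It is not the authors' implementation (the paper mentions R and a Perl script), and it rests on several assumptions: the function names `coexpression_weights` and `combine` are hypothetical, the Min–Max rescaling is assumed to run over all off-diagonal correlations, the cutoff is assumed to apply to the raw Pearson coefficient, and the input values are toy numbers rather than Synechocystis data.

```python
import numpy as np

def coexpression_weights(log_ratios, cutoff):
    """Weighted co-expression adjacency from a (genes x conditions) matrix
    of log2(condition / control) ratios (step i is assumed already done)."""
    r = np.corrcoef(log_ratios)                   # step ii: pairwise Pearson r
    off_diag = ~np.eye(r.shape[0], dtype=bool)    # ignore self-correlations
    lo, hi = r[off_diag].min(), r[off_diag].max()
    # step iii: Min-Max rescaling to [0, 1] (assumed to run over all pairs)
    scaled = (r - lo) / (hi - lo) if hi > lo else np.zeros_like(r)
    # step iv: keep only pairs whose raw correlation exceeds the cutoff
    return np.where(off_diag & (r > cutoff), scaled, 0.0)

def combine(w_coexpr, w_ppi):
    """Bi-colored edge weight: 1 - (1 - w_coexpr) * (1 - w_ppi)."""
    return 1.0 - (1.0 - w_coexpr) * (1.0 - w_ppi)

# Toy log2 ratios for four genes/proteins under four conditions (made up).
log_ratios = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
])

# Toy PPI weights already scaled to [0, 1] (made up).
w_ppi = np.zeros((4, 4))
w_ppi[0, 1] = w_ppi[1, 0] = 0.4
w_ppi[0, 3] = w_ppi[3, 0] = 0.8

w_coexpr = coexpression_weights(log_ratios, cutoff=0.80)
w_bi = combine(w_coexpr, w_ppi)
print(np.round(w_bi, 3))
```

In this toy case, node 3 has no co-expression edge above the cutoff but still enters the combined network through its PPI edge, which mirrors how the integration step brings additional HPs into the bi-colored network.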

2.4. Global network-flow prediction algorithm

The global propagation algorithm developed by Guo et al. (2012) was applied to the bi-colored networks for functional inference of HPs. We denote the bi-colored network as a graph G = (V, E, w), where V is a set of nodes, E is a set of edges connecting pairs of nodes, and the weight function w denotes the reliability of each edge. In the bi-colored networks, the nodes represent genes/proteins and the edges represent interactions. The process ranks the associations between genes/proteins and all known GO function categories, and assumes that the top-ranked gene/protein is related to the functional category. Here, a prediction function S is applied to compute the association scores between genes/proteins and functions. Its expression is:

$$ S_{vf} = 0.618 \times \left[ \sum_{u \in N(v)} S_{uf}\, w'(v,u) \right] + 0.382 \times T_{vf} $$

where f is the function category, v and u represent genes/proteins, the constants weight the relative importance of the global and local constraints, w′ is a normalized form of w (please refer to Vanunu et al., 2010 for the detailed process), and $T_{vf}$ is the prior knowledge score. If $n_{vf}$ denotes the number of neighboring genes of gene v that are annotated with f, and N(v) is the number of neighboring genes, $T_{vf}$ is given by

$$ T_{vf} = \begin{cases} \dfrac{n_{vf}}{N(v)}, & \text{if } N(v) \geq 5 \\[6pt] e^{N(v)-5} \times \dfrac{n_{vf}}{N(v)}, & \text{otherwise} \end{cases} $$

Finally, an iterative process (Vanunu et al., 2010) was applied to compute the prediction function S as follows:

$$ S^{t} := 0.618 \times S^{t-1} \times w' + 0.382 \times T $$

where $S^{1} := T$. The iterative computation stops when the mean square deviation between $S^{t+1}$ and $S^{t}$ is less than or equal to 1.0E−5, as described previously (Jebara et al., 2009; Zhu and Goldberg, 2009). Finally, all the genes/proteins were ranked according to their association scores with a given function category, and the top-ranked HP was annotated with that functional category.

2.5. Operon analysis

The functional inference of HPs obtained through the protocol was validated using known biological knowledge, namely operon information. Based on the principle that HPs may share the same function with other functionally known genes in the same operon (Taboada et al., 2012), we compared the HP functional inferences from the protocol with the Synechocystis operons acquired from the Prokaryotic Operon DataBase (ProOpDB) (http://operons.ibt.unam.mx/OperonPredictor/) (Taboada et al., 2010).

3. Results and discussion

3.1. Imputing missing values

Proteomic data are typically incomplete due to experimental or technical limitations (Nie et al., 2006), which could lead to a biased biological interpretation. To address the issue, imputation of missing proteomic values has been proposed as an important step in fully utilizing experimental proteomic data (Nie et al., 2007), and several methodologies have been established for this purpose (Aittokallio, 2009). In this work, we first evaluated three methods of imputing missing values using the Synechocystis datasets: the KNN imputation method, the LLS imputation algorithm, and predictive mean matching (PMM), which have previously been applied to incomplete microarray data (Troyanskaya et al., 2001). These methods were chosen on the basis of previous results: the KNN imputation method is less sensitive to the exact parameters used (Thirumahal and Patil, 2014); the LLS imputation algorithm predicts missing values using a least squares formulation over the neighborhood columns and the non-missing entries, and is a well-established and widely applied algorithm (Bose et al., 2013); and PMM belongs to the family of multiple imputation methods, which has become an important and influential approach in the statistical analysis of incomplete data. The key advantages of PMM are flexibility and generality; furthermore, the imputation model and the substantive model are kept separate (Kenward and Carpenter, 2007).

Using NRMSE as the evaluation index, we compared the three methods on a randomly generated incomplete proteomic dataset using R statistical software (Buuren and Groothuis-Oudshoorn, 2011). We conducted imputation for the incomplete proteomic dataset at missing rates of 1%, 5%, 10%, 15%, 20%, 25%, and 30%, with all parameters selected automatically. The analysis was repeated five times to assess the estimation ability of the tested methods. Fig. 1 shows the calculated NRMSE values against the different missing rates for the three methods. The results showed that the PMM-based method provided the best imputation for missing values in our test proteomic dataset. In addition, the NRMSE values varied with the missing rate (i.e., 1 to 30%), with better imputation achieved at low missing rates (1–10%) (Fig. 1), which is consistent with an earlier result that LLS imputation methods generally worked more accurately on datasets with a missing rate below 5% (Kim et al., 2005). Taking the above results into consideration, we imputed missing data for 5% more proteins in the Synechocystis proteomic dataset used in this study, corresponding to 304 more proteins than in the original proteomic dataset, and used the imputed dataset for the following analysis.

Fig. 1. Comparison of three imputation methods (i.e., KNN impute, LLS and PMM) for incomplete proteomics datasets.

3.2. Construction of bi-colored network

The network approach has proved to be a significant tool for inferring the functions of HPs (Sharan et al., 2007). To decipher the functions of the Synechocystis HPs using a network-based approach, we first constructed

co-expression networks using either the proteomic or the transcriptomic dataset to determine protein–protein or gene–gene associations, respectively. The workflow of the network construction is illustrated in Fig. 2. Briefly:

i) Co-expression networks were constructed based directly on pairwise or low-order conditional pairwise association measures, such as correlation or mutual information, to infer the connectivity between genes or proteins (Xuan et al., 2012). To filter out associations with low correlation, we calculated the degree distribution at different correlation cutoffs (i.e., 0.70, 0.75, 0.80, 0.85 and 0.90). The results showed that the degree distribution of the constructed network followed the power-law rule, with slightly different goodness of fit. In terms of the R² of the fitting curve, a correlation coefficient cutoff of 0.80 was selected for the network based on the proteomics data, which contained 916 proteins (Fig. 3). Similarly, 0.98 (2128 genes) and 0.90 (1010 proteins) were chosen as cutoffs for the networks constructed using transcriptomics data and proteomics–transcriptomics data, respectively (data not shown);

ii) The thresholded correlation matrix was transformed into a co-expression network using a Perl script. The nodes in the networks are genes/proteins, while the links between them (edges) represent co-expression;

iii) The protein interactions downloaded from the STRING database were integrated protein interactions, including both experimentally confirmed and predicted interactions. To cover more associations, the combined_score was used for the construction of the PPI network. The resulting Synechocystis PPI network contained a total of 3561 proteins and 226,983 distinct interactions;

iv) The PPI network was integrated with the Synechocystis proteomic/transcriptomic networks to establish the bi-colored networks. The three integrated networks (i.e., proteomics–PPI, transcriptomics–PPI, and proteomics–transcriptomics–PPI) contained a total of 3358, 3513 and 3374 genes/proteins, respectively. The numbers of HPs and functionally known proteins are shown for each of the bi-colored networks (Fig. 4). Compared with the co-expression networks, more HPs were included in the final bi-colored networks, demonstrating that integrating different types of data achieves better coverage of the HPs and increased power of functional inference. In addition, the advantage of bi-colored networks can be attributed to their better connectivity compared with co-expression networks (McDermott et al., 2012).

By constructing the bi-colored networks, we were able to integrate measurements from different molecular levels and obtain a wider view of the system under study (Zhang et al., 2010), which may provide a better foundation for functional inference of HPs.

Fig. 2. A scheme of the workflow for the protocol. In the integrated bi-colored networks, blue nodes represent known proteins and red nodes represent hypothetical proteins. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Comparison of node degree distributions for five different correlation coefficient cutoffs on the network constructed using proteomics and PPI data. rcutoff: correlation coefficient; R²: coefficient of determination of the fit; N: number of proteins in the co-expression network.
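Once a bi-colored network is available, the propagation scheme of Section 2.4 can be run on it. The sketch below is illustrative only and is not the implementation of Guo et al. (2012). Assumptions: w′ is taken to be the symmetric degree normalization w(v,u)/sqrt(d(v)·d(u)), following the description in Vanunu et al. (2010); the prior score is damped by e^(N(v)−5) only when N(v) < 5; the function names (`prior_scores`, `normalize`, `propagate`, `top_hp_per_function`) and the five-node toy network are hypothetical.

```python
import numpy as np

ALPHA, BETA = 0.618, 0.382   # global / local weights quoted in Section 2.4

def prior_scores(adj, annotated):
    """T[v, f]: fraction of v's neighbors annotated with f,
    damped by e^(N(v) - 5) when v has fewer than five neighbors."""
    neigh = (adj > 0).astype(float)
    n_v = neigh.sum(axis=1)                        # N(v)
    n_vf = neigh @ annotated                       # n_vf
    frac = n_vf / np.maximum(n_v, 1.0)[:, None]    # 0 for isolated nodes
    damp = np.exp(np.minimum(n_v - 5.0, 0.0))      # equals 1 when N(v) >= 5
    return frac * damp[:, None]

def normalize(adj):
    """Assumed normalization: w'(v, u) = w(v, u) / sqrt(d(v) * d(u))."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def propagate(adj, annotated, tol=1e-5, max_iter=1000):
    """Iterate S <- 0.618 * w' S + 0.382 * T, starting from S = T, until the
    mean square deviation between successive iterates is <= tol."""
    w_norm = normalize(adj)
    t = prior_scores(adj, annotated)
    s = t.copy()
    for _ in range(max_iter):
        s_next = ALPHA * (w_norm @ s) + BETA * t
        converged = np.mean((s_next - s) ** 2) <= tol
        s = s_next
        if converged:
            break
    return s

def top_hp_per_function(scores, is_hp):
    """Index of the highest-scoring HP for each function category."""
    masked = np.where(is_hp[:, None], scores, -np.inf)
    return masked.argmax(axis=0)

# Toy network: nodes 0-2 are annotated proteins, nodes 3-4 are HPs.
adj = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (0, 3, 0.8), (1, 3, 0.9), (2, 4, 0.9)]:
    adj[i, j] = adj[j, i] = w
# Rows are nodes, columns are two hypothetical function categories.
annotated = np.array([[1, 0], [1, 0], [0, 1], [0, 0], [0, 0]], dtype=float)
is_hp = np.array([False, False, False, True, True])

scores = propagate(adj, annotated)
top = top_hp_per_function(scores, is_hp)
print(top.tolist())
```

In the toy network, the HP connected to the two proteins annotated with the first category ranks highest for that category, and the HP attached to the protein carrying the second category ranks highest for the second.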

3.3. Functional inference of HPs

The global algorithm is based on a functional flow that exploits the underlying structure of protein interaction maps in order to infer protein function, taking advantage of both network topology and some measure of locality (Nabieva et al., 2005). In an early study, Vazquez et al. (2003) applied the approach to the Saccharomyces cerevisiae protein–protein interaction network, which resulted in multiple functional assignments of HPs. In a recent study, Guo et al. (2012) applied a global propagation algorithm, which exploits local and global topological properties of every node, to infer probable functions for long non-coding RNAs, and the results showed that 94.9% of the long non-coding RNAs in the maximum connected component of the network were functionally characterized. Using the BP GO functional terms, we ranked all genes/proteins according to their association scores with regard to every known function category, and then assumed that the top-ranked gene/protein of a functional category could be assigned the corresponding function. The association scores between each of the genes/proteins and all functional categories were calculated considering the global and local constraints imposed by the network topology, as described in the Methods section.

The analysis showed that 62, 70 and 59 HPs were assigned putative functions based on their top ranking against various functional categories in the proteomics–PPI, transcriptomics–PPI, and proteomics–transcriptomics–PPI bi-colored networks, respectively. In total, the protocol allowed functional inference for 122 HPs in the Synechocystis genome (Suppl. Table 1). However, it is worth noting that a large number of HPs still could not be assigned functions: a total of 1658, 1795 and 1676 HPs remained unannotated in the proteomics–PPI, transcriptomics–PPI, and proteomics–transcriptomics–PPI bi-colored networks, respectively, which may be due to the size and quality of the databases used for network construction and functional inference.

Based on Suppl. Table 1, a Venn diagram showed that 59.8% (73/122), 40.2% (49/122) and 16.4% (20/122) of the HP functional inferences were obtained from exactly one, at least two, and all three bi-colored networks, respectively (Fig. 5). More importantly, for the HPs whose functions were inferred through more than one bi-colored network, the functional inferences were consistent with each other, suggesting good quality of the functional inference. The only exception seen in Suppl. Table 1 was Slr0423, whose function was found to be related to superoxide metabolic process (GO: 0006801) based on both the proteomics–PPI and proteomics–transcriptomics–PPI bi-colored networks, but related to trehalose biosynthetic process (GO: 0005992) and glucosylglycerol biosynthetic process (GO: 0051473) based on the transcriptomics–PPI bi-colored network, which may be due to the multiple GO associations of the gene. Nevertheless, all three of these biological processes (i.e., GO: 0006801, GO: 0005992 and GO: 0051473) are known to be involved in stress response, especially salt stress response, in Synechocystis (Hagemann, 2011); it is thus highly possible that the function of HP Slr0423 is related to stress response. Using different types of experimental OMICS datasets in various bi-colored networks not only improves the accuracy of functional inference for HPs, as the analyses can cross-validate each other, but also increases the number of HPs that can be assigned putative functions.

Fig. 4. Overview of protein coverage through different networks. A) Protein distribution in the network established by proteomics and PPI data; B) protein distribution in the network established by transcriptomics and PPI data; C) protein distribution in the network established by proteomics, transcriptomics and PPI data.

Fig. 5. Overview of HPs annotated by different bi-colored networks.

3.4. Validation of the functional inferences

To verify the HP functions inferred by the protocol described in this paper, we adopted operon analysis (Doerks et al., 2012). Among the 122 HPs to which we were able to assign putative functions, 34 were located in operons in which at least one gene has been functionally annotated in various databases. To verify the functional inferences, we first assigned GO functions to HPs according to the GO annotation of the functionally known genes in the same operon, and then compared the GO function assigned by this approach with that assigned through our network-based protocol. The results showed that 24 out of 34 HPs had identical GO annotations, indicating that approximately 70.6% of these HPs could be validated by operon information (Table 1). In addition, among the 10 HPs with different GO annotations from the two approaches, we found a clear relationship between the GO annotations for several HPs. For example, HP Sll1160 was associated with ubiquinone biosynthetic process (GO: 0006744) by the network-based inference, while protein Sll1159 in the same operon has the function of cell redox homeostasis (GO: 0045454). In cyanobacteria, ubiquinone might be used as an electron carrier in the respiratory electron transport chain to maintain cell redox homeostasis (Friedrich and Scheide, 2000). As another example, HP Slr0723 was inferred to have a putative function related to rRNA modification (GO: 0000154) by the network-based approach, while another protein in the same operon, Slr0722, is known to function in rRNA processing (GO: 0006364) (Table 1).
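The operon comparison described above amounts to a simple set operation, sketched here under stated assumptions. The function name `validate_by_operon` is hypothetical, "identical GO annotation" is approximated as sharing at least one GO term with an annotated operon partner, and all identifiers other than Sll1160/Sll1159 are placeholders rather than real genes or GO terms.

```python
def validate_by_operon(inferred, operons, known_go):
    """Compare network-inferred GO terms of HPs with the GO terms of
    annotated genes in the same operon.

    inferred : {hp_id: set of GO ids inferred by the network protocol}
    operons  : list of sets of gene ids, one set per operon
    known_go : {gene_id: set of GO ids} for functionally known genes
    Returns (testable, agree): HPs with at least one annotated operon
    partner, and the subset whose inferred GO terms overlap the partners'.
    """
    testable, agree = [], []
    for hp, go_pred in inferred.items():
        partners = set()
        for operon in operons:
            if hp in operon:
                partners |= operon - {hp}
        go_operon = set()
        for gene in partners:
            go_operon |= known_go.get(gene, set())
        if not go_operon:
            continue                      # no annotated partner: not testable
        testable.append(hp)
        if go_pred & go_operon:           # assumed criterion: shared GO term
            agree.append(hp)
    return testable, agree

# Sll1160 / Sll1159 are taken from the text; the other entries are placeholders.
inferred = {
    "Sll1160": {"GO:0006744"},
    "HP_A": {"GO:placeholder1"},
    "HP_B": {"GO:placeholder2"},
}
operons = [{"Sll1160", "Sll1159"}, {"HP_A", "GENE_A"}, {"HP_B", "HP_C"}]
known_go = {
    "Sll1159": {"GO:0045454"},
    "GENE_A": {"GO:placeholder1", "GO:placeholder3"},
}

testable, agree = validate_by_operon(inferred, operons, known_go)
print(testable, agree, f"{100 * len(agree) / len(testable):.1f}%")
```

A strict term-overlap test such as this would count Sll1160 as a disagreement even though the two annotations are biologically related, which is why the related-but-different cases are discussed separately in the text.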

3.5. Putative functions of selected Synechocystis HPs

Among the 122 HPs with inferred functions, several were found to have putative functions related to key metabolism of cyanobacteria. For example, HP Sll0410 was assigned a putative function in the regulation of nitrogen utilization (GO: 0006808). In various biotechnological applications, nitrogen utilization has been an important factor affecting primary photosynthetic production and the accumulation of fatty acids in cyanobacterial cells (de Loura et al., 1987; Krasikov et al., 2012). HP Slr0476 was found to be related to the cellular SOS response (GO: 0009432). Several proteins (i.e., Sll1626, Sll1225) involved in SOS responses in Synechocystis have been found previously (Kamei et al., 2001; Oliveira and Lindblad, 2005), and the discovery of a new SOS response protein may be worth further investigation. HP Slr0642 was found to participate in tetrahydrofolate biosynthetic process (GO: 0046654), consistent with an earlier study showing that Synechocystis Slr0642 conferred the ability to transport folates and folate analogs when expressed in E. coli (Klaus et al., 2005). Sll1939 and Slr0650 were assigned a putative function related to the regulation of the cell cycle (GO: 0051726), which may be worth further investigation given the importance of the cell cycle in cyanobacteria (Yang et al., 2010).


Table 1. The detailed functions of inferred hypothetical proteins by operon validation.

Columns: HPs | GO | Biological process annotation by network approach | No. of genes in operon | Functionally known proteins in the same operon | Biological process annotation by operon approach

Sll0085 Sll0496 Sll0721 Sll0847

GO: 0006875 GO: 0006421 GO: 0009404 GO: 0006270GO: 0006275 GO: 0046348 GO: 0006069 GO: 0009249 GO: 0009249 GO: 0006082 GO: 0006566 GO: 0045005

4 2 3 2

Sll0086 Sll0495 Sll0720 Sll0848

+ + + +

3 3 3 3 2 2 3

Sll0861 Sll0990 Sll1187 Sll1187 Sll1299 Sll1760 Sll1772

+ + + + + + +

5

Slr0090, Slr0091

+

4

Slr0194

+

4 2 3

Slr0379 Slr0597 Slr0853

+ + +

2

Slr1133

+

3

Slr1182

+

2 3 2 2 3

Slr1197 Slr1364 Slr1718 Slr1840 Slr2001

+ + + + +

Ssr2333 GO: 0015684 Sll0176 GO: 0016481 Sll0487 GO: 0015797

Cellular metal ion homeostasis Asparaginyl-tRNA aminoacylation Toxin metabolic process DNA replication initiation regulation of DNA replication Amino sugar catabolic process Ethanol oxidation Protein lipoylation Protein lipoylation Organic acid metabolic process Threonine metabolic process Maintenance of fidelity during DNA-dependent DNA replication Cellular aldehyde metabolic process Aromatic amino acid family metabolic process Pentose-phosphate shunt, non-oxidative branch dTDP biosynthetic process IMP biosynthetic process N-terminal protein amino acid acetylation Arginine biosynthetic process via ornithine C-terminal protein amino acid methylation DNA mediated transformation Cofactor metabolic process Coenzyme M biosynthetic process Organic acid phosphorylation Cellular macromolecule metabolic process Ferrous iron transport Negative regulation of transcription Mannitol transport

2 2 3

Slr1392 Sll0177 Sll0485, Sll0486

Sll0793 Sll1160 Slr0095 Slr0723

GO: 0006825 GO: 0006744 GO: 0006419 GO: 0000154

Copper ion transport Ubiquinone biosynthetic process Alanyl-tRNA aminoacylation rRNA modification

2 2 2 2

Sll0792 Sll1159 Slr0093 Slr0722

Slr0823

GO: 0006450

Regulation of translational fidelity

2

Slr0822

Slr1419 Slr1577

GO: 0006534 GO: 0051276

Cysteine metabolic process Chromosome organization

2 3

Slr1420 Slr1575

Slr1690

GO: 0019363

Pyridine nucleotide biosynthetic process

3

Slr1689, Slr1691

+ DNA recombination (GO: 0006310) Regulation of transcription, DNA-dependent (GO: 0006355) Two-component signal transduction system (phosphorelay) (GO: 0000160) Rhythmic process (GO: 0048511) Regulation of transcription, DNA-dependent (GO: 0006355) Cell redox homeostasis (GO: 0045454) Protein folding (GO: 0006457) Activation of protein kinase C activity by G-protein coupled receptor protein signaling pathway (GO: 0007205) rRNA processing (GO: 0006364) ATP biosynthetic process (GO: 0006754) Cation transport (GO: 0006812) Metabolic process (GO: 0008152) Carbohydrate metabolic process (GO: 0005975) Regulation of protein amino acid phosphorylation (GO: 0001932) Transmembrane transport (GO: 0055085) Base-excision repair (GO: 0006284) DNA repair (GO: 0006281) Nucleotide-excision repair (GO: 0006289) NAD biosynthetic process (GO: 0009435) Nitrogen compound metabolic process (GO: 0006807)

Sll0863 Sll0992 Sll1186 Sll1188 Sll1298 Sll1761 Sll1771 Slr0092

GO: 0006081 GO: 0009072

Slr0196

GO: 0009052

Slr0380 Slr0596 Slr0852

GO: 0006233 GO: 0006188 GO: 0006474

Slr1134

GO: 0042450

Slr1183

GO: 0006481

Slr1196 Slr1365 Slr1719 Slr1839 Slr2003

GO: 0009294 GO: 0051186 GO: 0019295 GO: 0031388 GO: 0044260

“+” represents the same biological process annotation as those from the network-based protocol.
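As a rough cross-check of the validation rate, the rows of Table 1 can be tallied directly. The short sketch below assumes that Table 1 lists every HP subjected to operon validation and that Ssr2333 is the last row marked “+” (both read from the extracted table layout, not stated explicitly in the text); the variable names are illustrative only.

```python
# HPs in Table 1 whose operon-based annotation is marked "+"
# (same biological process as the network-based inference).
agree = [
    "Sll0085", "Sll0496", "Sll0721", "Sll0847", "Sll0863", "Sll0992",
    "Sll1186", "Sll1188", "Sll1298", "Sll1761", "Sll1771", "Slr0092",
    "Slr0196", "Slr0380", "Slr0596", "Slr0852", "Slr1134", "Slr1183",
    "Slr1196", "Slr1365", "Slr1719", "Slr1839", "Slr2003", "Ssr2333",
]

# HPs in Table 1 for which the operon approach gave a different annotation.
differ = [
    "Sll0176", "Sll0487", "Sll0793", "Sll1160", "Slr0095",
    "Slr0723", "Slr0823", "Slr1419", "Slr1577", "Slr1690",
]

total = len(agree) + len(differ)
fraction = len(agree) / total
print(f"{len(agree)}/{total} HPs agree ({fraction:.1%})")
```

Under these assumptions the tally is 24 of 34 HPs (about 70.6%), which appears consistent with the statement that more than 70% of the HPs could be validated by operon information.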

4. Conclusions

HPs pose a great challenge not only to genome annotation but also to in-depth biological interpretation. In this study, we described a network-based protocol for global detection and functional inference of HPs, and applied it to the model cyanobacterium Synechocystis. The core of the protocol is the construction of bi-colored networks from different types of OMICS datasets, followed by functional inference using a global propagation algorithm. With the protocol, we were able to assign GO-based putative annotations to 122 HPs in the Synechocystis genome. In addition, using operon information, more than 70% of the HPs annotated by this approach could be correctly validated. It is worth noting that, because the network construction and functional inference were constrained by the experimental data used (i.e., data quality) (Valencia, 2005), the HP functional predictions presented in this study still need further computational or experimental validation. Nevertheless, the analysis provides a good foundation for guiding the experimental design of high-efficiency validation. Finally, with minor modifications, the protocol can also be applied to the functional inference of HPs in other microbial genomes.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.mimet.2015.06.013.
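To make the global propagation step summarized in the Conclusions more concrete, the following is a minimal sketch on a toy bi-colored network. This section does not give the update rule, so the form used here, F ← αW′F + (1 − α)Y with symmetric degree normalization, in the style of the network propagation of Vanunu et al. (2010), is an assumption and not necessarily the authors' exact algorithm. The function name `propagate`, the parameter values, and the node names (`A*` for annotated proteins, `H*` for HPs) are hypothetical.

```python
def propagate(adj, seeds, alpha=0.5, iters=200):
    """Iterate F <- alpha * W_norm * F + (1 - alpha) * Y and return node scores.

    adj   : dict mapping node -> {neighbour: edge weight}, symmetric
    seeds : annotated proteins carrying the GO term of interest (prior Y = 1)
    """
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    prior = {n: 1.0 if n in seeds else 0.0 for n in adj}
    score = dict(prior)
    for _ in range(iters):
        score = {
            n: alpha * sum(w / (degree[n] * degree[m]) ** 0.5 * score[m]
                           for m, w in adj[n].items())
               + (1 - alpha) * prior[n]
            for n in adj
        }
    return score


# Hypothetical toy network: edges stand for co-expression or PPI links.
adj = {
    "A1": {"H1": 1.0},
    "A2": {"H1": 1.0},
    "A3": {"H2": 1.0},
    "H1": {"A1": 1.0, "A2": 1.0, "H2": 0.5},
    "H2": {"A3": 1.0, "H1": 0.5},
}
go_members = {"A1", "A2"}   # annotated proteins assigned to one GO category
hps = {"H1", "H2"}          # hypothetical proteins to be ranked

scores = propagate(adj, go_members)
top_hp = max(hps, key=scores.get)   # top-ranked HP for this GO category
print(top_hp, round(scores[top_hp], 3))
```

In this toy case H1, which is directly linked to both annotated members of the GO category, ranks above H2, mirroring the protocol's assumption that the top-ranked HP for a GO category may have a related function.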


Acknowledgments

The research was supported by grants from the National Basic Research Program of China (“973” program, project No. 2011CBA00803 and No. 2014CB745101), the National High-tech R&D Program (“863” program, project No. 2012AA02A707), and the Natural Science Foundation of China (NSFC) (No. 31470217), and the Tianjin Municipal Science and Technology Commission.

References

Adams, M.A., Suits, M.D., Zheng, J., Jia, Z., 2007. Piecing together the structure-function puzzle: experiences in structure-based functional annotation of hypothetical proteins. Proteomics 7, 2920–2932.
Aittokallio, T., 2009. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief. Bioinf. 11, 253–264.
Arnau, V., Mars, S., Marín, I., 2005. Iterative cluster analysis of protein interaction data. Bioinformatics 21, 364–378.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.
Becker, E., Robisson, B., Chapple, C.E., Guénoche, A., Brun, C., 2012. Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics 28, 84–90.
Bose, S., Das, C., Chakraborty, A., Chattopadhyay, S., 2013. Effectiveness of different partition based clustering algorithms for estimation of missing values in microarray gene expression data. Adv. Comput. Inf. Technol. 177, 37–47.
Buuren, S., Groothuis-Oudshoorn, K., 2011. MICE: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–67.
Chen, L., Zhu, Y., Song, Z., Wang, J., Zhang, W., 2014. An orphan response regulator Sll0649 involved in cadmium tolerance and metal homeostasis in photosynthetic Synechocystis sp. PCC 6803. J. Proteomics 103, 87–102.
de Loura, I.C., Dubacq, J.P., Thomas, J.C., 1987. The effects of nitrogen deficiency on pigments and lipids of cyanobacteria. Plant Physiol. 83, 838–843.
Deng, M., Zhang, K., Mehta, S., Chen, T., Sun, F., 2003. Prediction of protein function using protein–protein interaction data. J. Comput. Biol. 10, 947–960.
Desler, C., Suravajhala, P., Sanderhoff, M., Rasmussen, M., Rasmussen, L.J., 2009. In Silico screening for functional candidates amongst hypothetical proteins. BMC Bioinf. 10, 289.
Doerks, T., Van Noort, V., Minguez, P., Bork, P., 2012. Annotation of the M. tuberculosis hypothetical orfeome: adding functional information to more than half of the uncharacterized proteins. PLoS ONE 7, e34302.
Eisenstein, E., Gilliland, G.L., Herzberg, O., Moult, J., Orban, J., Poljak, R.J., Banerjei, L., Richardson, D., Howard, A.J., 2000. Biological function made crystal clear-annotation of hypothetical proteins via structural genomics. Curr. Opin. Biotechnol. 11, 25–30.
Feten, G., Almøy, T., Aastveit, A.H., 2005. Prediction of missing values in microarray and use of mixed models to evaluate the predictors. Stat. Appl. Genet. Mol. Biol. 4, Article 10.
Friedrich, T., Scheide, D., 2000. The respiratory complex I of bacteria, archaea and eukarya and its module common with membrane-bound multisubunit hydrogenases. FEBS Lett. 479, 1–5.
Galperin, M.Y., Koonin, E.V., 2004. ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study. Nucleic Acids Res. 32, 5452–5463.
Govindjee, D.S., 2011. Adventures with cyanobacteria: a personal perspective. Front. Plant Sci. 2, 28.
Guo, X., Gao, L., Liao, Q., Xiao, H., Ma, X., Yang, X., Luo, H., Zhao, G., Bu, D., Jiao, F., 2012. Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks. Nucleic Acids Res. 41, e35.
Hagemann, M., 2011. Molecular biology of cyanobacterial salt acclimation. FEMS Microbiol. Rev. 35, 87–123.
Huang, S., Chen, L., Te, R., Qiao, J., Wang, J., Zhang, W., 2013. Complementary iTRAQ proteomics and RNA-seq transcriptomics reveal multiple levels of regulation in response to nitrogen starvation in Synechocystis sp. PCC 6803. Mol. BioSyst. 9, 2565–2574.
Jebara, T., Wang, J., Chang, S.-F., 2009. Graph construction and b-matching for semi-supervised learning. Proc. 26th Ann. Inter. Conf. Mach. Lear. ACM, pp. 441–448.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M., 2009. STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416.
Jiang, H.-B., Song, W.-Y., Cheng, H.-M., Qiu, B.-S., 2015. The hypothetical protein Ycf46 is involved in regulation of CO2 utilization in the cyanobacterium Synechocystis sp. PCC 6803. Planta 241, 145–155.
Kamei, A., Hihara, Y., Yoshihara, S., Geng, X., Kanehisa, M., Ikeuchi, M., 2001. Functional analysis of lexA-like gene, sll1626 in Synechocystis sp. PCC 6803 using DNA microarray. Plant Cell Physiol. Suppl. (32), S95.
Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M., Sasamoto, S., Kimura, T., Hosouchi, T., Matsuno, A., Muraki, A., Nakazaki, N., Naruo, K., Okumura, S., Shimpo, S., Takeuchi, C., Wada, T., Watanabe, A., Yamada, M., Yasuda, M., Tabata, S., 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3, 109–136.
Kaneko, T., Tanaka, A., Sato, S., Kotani, H., Sazuka, T., Miyajima, N., Sugiura, M., Tabata, S., 1995. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb region from map positions 64% to 92% of the genome. DNA Res. 2, 191–198.


Karandashova, I., Elanskaya, I., Marin, K., Vinnemeier, J., Hagemann, M., 2002. Identification of genes essential for growth at high salt concentrations using salt-sensitive mutants of the cyanobacterium Synechocystis sp. strain PCC 6803. Curr. Microbiol. 44, 184–188.
Karaoz, U., Murali, T., Letovsky, S., Zheng, Y., Ding, C., Cantor, C.R., Kasif, S., 2004. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl. Acad. Sci. U. S. A. 101, 2888–2893.
Kenward, M.G., Carpenter, J., 2007. Multiple imputation: current perspectives. Stat. Methods Med. Res. 16, 199–218.
Kim, H., Golub, G.H., Park, H., 2005. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21, 187–198.
Klaus, S.M., Kunji, E.R., Bozzo, G.G., Noiriel, A., de la Garza, R.D., Basset, G.J., Ravanel, S., Rebeille, F., Gregory 3rd, J.F., Hanson, A.D., 2005. Higher plant plastids and cyanobacteria have folate carriers related to those of trypanosomatids. J. Biol. Chem. 280, 38457–38463.
Kourmpetis, Y.A., Van Dijk, A.D., Bink, M.C., van Ham, R.C., ter Braak, C.J., 2010. Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS ONE 5, e9293.
Krasikov, V., Aguirre von Wobeser, E., Dekker, H.L., Huisman, J., Matthijs, H.C., 2012. Time-series resolution of gradual nitrogen starvation and its impact on photosynthesis in the cyanobacterium Synechocystis PCC 6803. Physiol. Plant. 145, 426–439.
Lægreid, A., Hvidsten, T.R., Midelfart, H., Komorowski, J., Sandvik, A.K., 2003. Predicting gene ontology biological process from temporal gene expression patterns. Genome Res. 13, 965–979.
Letovsky, S., Kasif, S., 2003. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19, i197–i204.
Liolios, K., Chen, I.-M.A., Mavromatis, K., Tavernarakis, N., Hugenholtz, P., Markowitz, V.M., Kyrpides, N.C., 2010. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 38, D346–D354.
Liu, J., Chen, L., Wang, J., Qiao, J., Zhang, W., 2012. Proteomic analysis reveals resistance mechanism against biofuel hexane in Synechocystis sp. PCC 6803. Biotechnol. Biofuels 5, 68.
Lubec, G., Afjehi-Sadat, L., Yang, J.-W., John, J.P.P., 2005. Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog. Neurobiol. 77, 90–127.
Mazandu, G.K., Mulder, N.J., 2012. Function prediction and analysis of Mycobacterium tuberculosis hypothetical proteins. Int. J. Mol. Sci. 13, 7283–7302.
McDermott, J.E., Diamond, D.L., Corley, C., Rasmussen, A.L., Katze, M.G., Waters, K.M., 2012. Topological analysis of protein co-abundance networks identifies novel host targets important for HCV infection and pathogenesis. BMC Syst. Biol. 6, 28.
Meng, F., Cai, C., Yan, H., 2014. A bicluster-based Bayesian principal component analysis method for microarray missing value estimation. IEEE J. Biomed. Health Inform. 18, 863–871.
Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C., Morris, Q., 2008. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4.
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M., 2005. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21, i302–i310.
Nakamura, Y., Kaneko, T., Hirosawa, M., Miyajima, N., Tabata, S., 1998. CyanoBase, a www database containing the complete nucleotide sequence of the genome of Synechocystis sp. strain PCC6803. Nucleic Acids Res. 26, 63–67.
Nie, L., Wu, G., Brockman, F.J., Zhang, W., 2006. Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated Poisson regression models to predict abundance of undetected proteins. Bioinformatics 22, 1641–1647.
Nie, L., Wu, G., Culley, D.E., Scholten, J.C., Zhang, W., 2007. Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications. Crit. Rev. Biotechnol. 27, 63–75.
Oliveira, P., Lindblad, P., 2005. LexA, a transcription regulator binding in the promoter region of the bidirectional hydrogenase in the cyanobacterium Synechocystis sp PCC 6803. FEMS Microbiol. Lett. 251, 59–66.
Pei, G., Chen, L., Wang, J., Qiao, J., Zhang, W., 2014. Protein network signatures associated with exogenous biofuels treatments in cyanobacterium Synechocystis sp. PCC 6803. Front. Bioeng. Biotechnol. 2, 48.
Qiao, J., Wang, J., Chen, L., Tian, X., Huang, S., Ren, X., Zhang, W., 2012. Quantitative iTRAQ LC–MS/MS proteomics reveals metabolic responses to biofuel ethanol in cyanobacterial Synechocystis sp. PCC 6803. J. Proteome Res. 11, 5286–5300.
Qiao, J., Shao, M., Chen, L., Wang, J., Wu, G., Tian, X., Liu, J., Huang, S., Zhang, W., 2013a. Systematic characterization of hypothetical proteins in Synechocystis sp. PCC 6803 reveals proteins functionally relevant to stress responses. Gene 512, 6–15.
Qiao, J., Huang, S., Te, R., Wang, J., Chen, L., Zhang, W., 2013b. Integrated proteomic and transcriptomic analysis reveals novel genes and regulatory mechanisms involved in salt stress responses in Synechocystis sp. PCC 6803. Appl. Microbiol. Biotechnol. 97, 8253–8264.
Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., Wittkop, T., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., 2013. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227.
Ren, Q., Shi, M., Chen, L., Wang, J., Zhang, W., 2014. Integrated proteomic and metabolomic characterization of a novel two-component response regulator Slr 1909 involved in acid tolerance in Synechocystis sp. PCC 6803. J. Proteomics 109, 76–89.
Rippka, R., Deruelles, J., Waterbury, J.B., Herdman, M., Stanier, R.Y., 1979. Generic assignments, strain histories and properties of pure cultures of cyanobacteria. J. Gen. Microbiol. 111, 1–61.
Sharan, R., Ulitsky, I., Shamir, R., 2007. Network-based prediction of protein function. Mol. Syst. Biol. 3, 88.


Souverein, O., Zwinderman, A., Tanck, M., 2006. Multiple imputation of missing genotype data for unrelated individuals. Ann. Hum. Genet. 70, 372–381.
Szklarczyk, D., Franceschini, A., Kuhn, M., Simonovic, M., Roth, A., Minguez, P., Doerks, T., Stark, M., Muller, J., Bork, P., 2011. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39, D561–D568.
Taboada, B., Ciria, R., Martinez-Guerrero, C.E., Merino, E., 2012. ProOpDB: prokaryotic operon database. Nucleic Acids Res. 40, D627–D631.
Taboada, B., Verde, C., Merino, E., 2010. High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res. 38, e130.
Thirumahal, R., Patil, D.A., 2014. KNN and ARL based imputation to estimate missing values. Indones. J. Electr. Eng. Inform. (IJEEI) 2, 119–124.
Tian, X., Chen, L., Wang, J., Qiao, J., Zhang, W., 2013. Quantitative proteomics reveals dynamic responses of Synechocystis sp. PCC 6803 to next-generation biofuel butanol. J. Proteomics 78, 326–345.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525.
Tsoi, L.C., Elder, J.T., Abecasis, G.R., 2014. Graphical algorithm for integration of genetic and biological data: proof of principle using psoriasis as a model. Bioinformatics http://dx.doi.org/10.1093/bioinformatics/btu799.
Unwin, R.D., Griffiths, J.R., Whetton, A.D., 2010. Simultaneous analysis of relative protein expression levels across multiple samples using iTRAQ isobaric tags with 2D nano LC-MS/MS. Nat. Protoc. 5, 1574–1582.
Valencia, A., 2005. Automatic annotation of protein function. Curr. Opin. Struct. Biol. 15, 267–274.

Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., Sharan, R., 2010. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641.
Vazquez, A., Flammini, A., Maritan, A., Vespignani, A., 2003. Global protein function prediction from protein–protein interaction networks. Nat. Biotechnol. 21, 697–700.
Von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., Jouffre, N., Huynen, M.A., Bork, P., 2005. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–D437.
Wang, J., Chen, L., Huang, S., Liu, J., Ren, X., Tian, X., Qiao, J., Zhang, W., 2012. RNA-seq based identification and mutant validation of gene targets related to ethanol resistance in cyanobacterial Synechocystis sp. PCC 6803. Biotechnol. Biofuels 5, 89.
Xuan, N., Chetty, M., Coppel, R., Wangikar, P.P., 2012. Gene regulatory network modeling via global optimization of high-order dynamic bayesian network. BMC Bioinf. 13, 131.
Yang, Q., Pando, B.F., Dong, G., Golden, S.S., van Oudenaarden, A., 2010. Circadian gating of the cell cycle revealed in single cyanobacterial cells. Science 327, 1522–1526.
Zhang, W., Li, F., Nie, L., 2010. Integrating multiple ‘omics’ analysis for microbial biology: application and methodologies. Microbiology 156, 287–301.
Zhu, H., Ren, X., Wang, J., Song, Z., Shi, M., Qiao, J., Tian, X., Liu, J., Chen, L., Zhang, W., 2013. Integrated OMICS guided engineering of biofuel butanol-tolerance in photosynthetic Synechocystis sp. PCC 6803. Biotechnol. Biofuels 6, 106.
Zhu, X., Goldberg, A.B., 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, pp. 1–130.
