Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎



Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Human proteins characterization with subcellular localizations

Lei Yang a, Yingli Lv a, Tao Li b, Yongchun Zuo c,*, Wei Jiang a,**

a College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, PR China
b College of Life Science, Inner Mongolia Agricultural University, Hohhot 010018, PR China
c The National Research Center for Animal Transgenic Biotechnology, Inner Mongolia University, Hohhot 010021, PR China

HIGHLIGHTS

• Many properties are used to study human proteins in different subcellular localizations.
• These properties are systematically compared for human proteins of different subcellular localizations.
• Significant differences are found in human proteins in the different categories.

ARTICLE INFO

Article history: Received 9 February 2014; received in revised form 4 May 2014; accepted 5 May 2014

Keywords: Topological properties; Biological properties; Codon usage bias; Expression level; Physicochemical properties

ABSTRACT

Proteins are responsible for performing the vast majority of cellular functions which are critical to a cell's survival. The knowledge of the subcellular localization of proteins can provide valuable information about their molecular functions. Therefore, one of the fundamental goals in cell biology and proteomics is to analyze the subcellular localizations and functions of these proteins. Recent large-scale human genomics and proteomics studies have made it possible to characterize human proteins at a subcellular localization level. In this study, according to the annotation in Swiss-Prot, 8842 human proteins were classified into seven subcellular localizations. Human proteins in the seven subcellular localizations were compared by using topological properties, biological properties, codon usage indices, mRNA expression levels, protein complexity and physicochemical properties. All these properties were found to be significantly different in the seven categories. In addition, based on these properties and pseudo-amino acid compositions, a machine learning classifier was built for the prediction of protein subcellular localization. The study presented here was an attempt to address the aforementioned properties for comparing human proteins of different subcellular localizations. We hope our findings presented in this study may provide important help for the prediction of protein subcellular localization and for understanding the general function of human proteins in cells. © 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel./fax: +86 471 5227683.
** Corresponding author. Tel.: +86 451 8666 9617; fax: +86 451 8661 9617.
E-mail addresses: [email protected] (Y. Zuo), [email protected] (W. Jiang).

http://dx.doi.org/10.1016/j.jtbi.2014.05.008
0022-5193/© 2014 Elsevier Ltd. All rights reserved.

Please cite this article as: Yang, L., et al., Human proteins characterization with subcellular localizations. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.05.008

1. Introduction

Cells are highly compartmentalized, limiting most proteins to specific subcellular localizations or organelles, which are specialized to carry out different biological functions. The cell nucleus contains most of the genetic material that governs all functions of the cell. The mitochondria play an important role in supplying cellular energy to the eukaryotic cell. They are also involved in signaling, cellular differentiation, cell death and cell growth (McBride et al., 2006). Cytoplasm comprises the jelly-like material enclosed within the cell membrane. The cytoplasm is about 80% water and usually colorless (Luby-Phelps, 1999). It takes up most of the cell volume, filling the cell and serving as a 'molecular soup' in which all of the cell's organelles are suspended (Chou and Shen, 2007). The cell membrane functions as a boundary layer that separates the interior of the cell from the outside environment while protecting the cell from external injury. It is involved in many cellular processes such as ion conductivity, cell adhesion and cell signaling. The endoplasmic reticulum transports proteins that are marked with an address tag, called a signal sequence, throughout the cell. Together with the ribosomes, it is also responsible for synthesizing proteins. Secreted proteins, which are released by cells, include many hormones, enzymes, toxins and antimicrobial peptides.

Comprehensive knowledge of the location of proteins within cellular microenvironments is critical for understanding their functions and interactions. Thus, knowing the subcellular localization of a protein gives us valuable information about its function. For example, the knowledge of protein subcellular localizations is vitally important for targeting in cancer therapy. The experimental determination of subcellular localizations of large numbers of proteins, coupled with many high-throughput experimental techniques, has made systematic analyses of proteins in specific subcellular localizations feasible. Numerous efforts have been made to develop methods for predicting protein subcellular localization based on sequence information (Chou, 2001; Chou and Elrod, 1999; Chou and Cai, 2002; Zhou and Doctor, 2003). More information on the prediction of subcellular localization is given in two comprehensive review papers (Chou and Shen, 2007; Nakai, 2000). Moreover, many new algorithms have been established for identifying subcellular localization in recent years (Chou and Shen, 2010a, 2010b; Chou et al., 2011, 2012; Fan and Li, 2012; Mei, 2012; Wan et al., 2013; Wu et al., 2011, 2012; Xiao et al., 2011a, 2011b). Although various experimental techniques and computational approaches have been developed and used to identify subcellular localizations of proteins (Chou and Shen, 2007; Dreger, 2003; Gygi et al., 1999; Nakai, 2000; Tsien, 1998), to date only a few attempts have been made to globally analyze proteins in different subcellular localizations (Drawid et al., 2000; Ghaemmaghami et al., 2003; Martin and MacNeill, 2004; Wang et al., 2013).

The rapid identification of genome-wide human protein–protein interaction (PPI) networks has provided many new insights into the understanding of proteins directly from the PPI network (Rual et al., 2005; Stelzl et al., 2005). However, most PPI networks are too complex to be easily understood. This problem can be addressed by investigating the topological properties of PPI networks using graph-theoretical concepts.
In recent years, many groups of proteins have been evaluated by their topological properties in PPI networks (Goh et al., 2007; Han et al., 2013a, 2013b; Hwang et al., 2009; Kotlyar et al., 2012; Wang et al., 2011; Xu and Li, 2006; Yıldırım et al., 2007; Zhu et al., 2009).

Codon usage bias is usually defined as differences in the frequency of occurrence of synonymous codons in the coding regions of genomic sequences. Codon usage bias is possible because of the redundancy of the genetic code, which allows differential use of synonymous codons (Kurland, 1991). Investigations of complete genome sequences of diverse species found that biased usage of synonymous codons may result from various factors (Ermolaeva, 2001; Rocha, 2004), such as gene length, composition bias, recombination rates and RNA stability (Moriyama and Powell, 1998; Powell and Moriyama, 1997). Genome-wide investigations of codon usage bias patterns, their causes and consequences, and the identification of evolutionary selective forces are of significant importance in genome biology. Codon usage bias is also a strong species-specific statistic with numerous applications, such as gene prediction (Burge and Karlin, 1997). However, until recently, neither network-based topological analysis nor codon usage bias had been applied to human proteins at the subcellular localization level. The recent advent of experiments that measure biological properties and gene expression levels on a genome-wide scale allows a comprehensive view of gene activity patterns in cells.

In this study, the human PPI network was obtained from the Online Predicted Human Interaction Database (OPHID) (Brown and Jurisica, 2005), and 11,952 proteins were extracted from this network.
Using the Swiss-Prot classification (Bairoch and Boeckmann, 1991), these proteins were classified into seven categories: (1) cytoplasmic proteins, (2) membrane proteins, (3) mitochondrial proteins, (4) secreted proteins, (5) nuclear proteins, (6) endoplasmic reticulum proteins and (7) 'unknown' proteins. With high-quality data from the OPHID, Swiss-Prot (Bairoch and Boeckmann, 1991), OGEE (Chen et al., 2012), OMIM (Hamosh et al., 2005), GO (Ashburner et al., 2000), KEGG (Kanehisa et al., 2004), Prosite (Hulo et al., 2006), Pfam (Bateman et al., 2004), TRANSFAC (Matys et al., 2003), CORUM (Ruepp et al., 2008), Ensembl (Hubbard et al., 2002) and BioGPS (Wu et al., 2013) databases, we present a computational analysis workflow aimed at characterizing human proteins at a subcellular localization level. The workflow uses topological properties, biological properties, codon usage indices, gene expression levels, protein complexity and physicochemical properties. Based on the analysis, significant differences were found in all properties among the seven categories.

According to a recent comprehensive review (Chou, 2011), to establish a truly useful analysis method or statistical predictor for a protein system, the following procedures need to be considered: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly webserver for the predictor that is accessible to the public. Below, we describe how these steps were handled.

2. Materials and methods

2.1. Datasets

Human PPI datasets were downloaded from OPHID (version 1.95) (Brown and Jurisica, 2005). To obtain high-quality human PPIs, only literature-curated human PPIs were used. After removing self-loops and duplicate edges, the final network comprised 12,265 nodes and 61,170 interactions. This network contains 228 connected components, and the main component consists of 11,952 proteins and 61,081 interactions. A connected component is a subnetwork in which all nodes are connected to one another, and the main component is the largest connected component of the network. The remaining 313 proteins are separated into 227 smaller components of one to five nodes each. Because the topological properties cannot be calculated for proteins outside the main component, only the main component is considered in this study.

We obtained the subcellular localization of each protein from Swiss-Prot (release 2013_03) (Bairoch and Boeckmann, 1991). As is well known, a protein may be singleplex (a single subcellular localization) or multiplex (multiple subcellular localizations). The web-servers iLoc-Euk (Chou et al., 2011), iLoc-Hum (Chou et al., 2012), iLoc-Plant (Wu et al., 2011), iLoc-Gpos (Wu et al., 2012), iLoc-Gneg (Xiao et al., 2011a), and iLoc-Virus (Xiao et al., 2011b) can be used to cope with the multiple location problem in eukaryotic, human, plant, Gram-positive, Gram-negative, and virus proteins, respectively. More detailed information is given in a review article (Chou, 2013). Our dataset contains both singleplex and multiplex proteins; however, only the singleplex proteins are used, and any protein annotated with multiple subcellular localizations was removed. Of the 11,952 proteins in the PPI network, 5926 proteins have a unique subcellular localization (singleplex proteins).
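The network preparation described in Section 2.1 (removing self-loops and duplicate edges, then keeping only the main component) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' pipeline: the function names and the toy edge list are hypothetical, and the real OPHID file format is not assumed here.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map, dropping self-loops and duplicate edges."""
    adj = {}
    for u, v in edges:
        # Register both nodes so proteins with only a self-loop stay as isolated nodes.
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u == v:
            continue  # self-loop: ignored
        adj[u].add(v)  # sets make duplicate edges collapse automatically
        adj[v].add(u)
    return adj


def connected_components(adj):
    """Return a list of node sets, one per connected component (breadth-first search)."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def main_component(adj):
    """Restrict the graph to its largest connected component."""
    largest = max(connected_components(adj), key=len)
    return {node: adj[node] & largest for node in largest}


# Hypothetical toy interactions (not real OPHID data).
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "C"), ("D", "E"), ("F", "F")]
adj = build_graph(toy_edges)
main = main_component(adj)
print(sorted(main))  # ['A', 'B', 'C']
```

On the toy input, the graph has three components ({A, B, C}, {D, E} and the isolated node F), and only the first is retained, mirroring how the 313 proteins outside the main component are excluded.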
Using the Swiss-Prot classification, the 5926 proteins were classified into six categories: (1) 1178 cytoplasmic proteins, (2) 1913 membrane proteins, (3) 266 mitochondrial proteins, (4) 709 secreted proteins, (5) 1795 nuclear proteins, and (6) 65 endoplasmic reticulum proteins. Of the 11,952 proteins, 2916 proteins were classified into the 'unknown' category due to the lack of clear annotation in Swiss-Prot (Fig. 1).

Fig. 1. Subcellular fractionation for human proteins. The percentages of 1178 cytoplasmic proteins, 1913 membrane proteins, 266 mitochondrial proteins, 709 secreted proteins, 1795 nuclear proteins, 65 endoplasmic reticulum proteins, and 2916 'unknown' proteins are shown in this figure.

To remove homologous sequences from a benchmark dataset, a cutoff threshold of 25% is commonly imposed (Chou and Shen, 2007, 2008, 2010a, 2010b), excluding proteins that have 25% or greater sequence identity to any other protein in the same subset. However, in this study we did not use such a stringent criterion because the currently available data do not allow it; otherwise, the number of proteins in some subsets would be too small to have statistical significance. The benchmark dataset for protein subcellular localization prediction was built from the 1178 cytoplasmic proteins, 1913 membrane proteins, 266 mitochondrial proteins, 709 secreted proteins, and 1795 nuclear proteins. The 65 endoplasmic reticulum proteins were not included in the benchmark dataset because this group contains too few proteins. The CD-HIT software (Fu et al., 2012) was used to remove sequences that have more than 30% sequence identity. The final benchmark dataset consists of 928 cytoplasmic proteins, 1129 membrane proteins, 242 mitochondrial proteins, 442 secreted proteins, and 1082 nuclear proteins. None of the proteins included in the dataset has more than 30% sequence identity to any other protein in the same subcellular location.

The human essential genes were downloaded from the OGEE database (build: 304) (Chen et al., 2012). Because conditionally essential genes are essential only under certain circumstances (Joyce et al., 2006), these genes were not used in this study. The protein product of an essential gene was regarded as an essential protein. There were 1528 human essential genes in the OGEE database, and 1292 protein products of human essential genes were mapped into the PPI network.

Lists of known hereditary disease genes were obtained from the Online Mendelian Inheritance in Man (OMIM) Morbid Map on May 28, 2013 (Hamosh et al., 2005). In May 2013, the OMIM Morbid Map table contained 4992 disorders and 8652 disease-related genes. The protein product of a disease gene was regarded as a disease protein. The disease genes were mapped into the main component network, and the total number of such gene products contained in the PPI network was 2030.
The Gene Ontology annotations for human genes were downloaded from the GO slims set (release 13-5-28) (Ashburner et al., 2000). The Prosite dataset (Hulo et al., 2006), Pfam dataset (Bateman et al., 2004) and KEGG dataset (Kanehisa et al., 2004) were obtained from org.Hs.eg.db (version 2.80) using the R software (version 3.0.2). The transcription factors and their target genes were downloaded from TRANSFAC (release 2013.3) (Matys et al., 2003). The human protein complexes were downloaded from the CORUM database (released in February 2012) (Ruepp et al., 2008). The earliest expression stage during development, the phyletic ages of genes, the Ka/Ks ratio (Hurst, 2002) and the number of homologs in the same genome were downloaded from the OGEE database (build: 304).

A total of 20,265 human protein sequences were obtained from Swiss-Prot (release 2013_03) (Bairoch and Boeckmann, 1991). All annotated coding sequences (CDSs) for humans were obtained from the Ensembl database (Homo_sapiens.GRCh37.74) (Hubbard et al., 2002). To improve the quality of sequences and minimize sampling errors, any CDSs under 300 nucleotides in length were removed, as well as any CDSs not beginning with a start codon (ATG) or not terminating with a stop codon (TAA/TAG/TGA). Any CDSs that contain internal stop codons were also removed. A total of 85,623 human coding sequences met the above criteria and were selected for further analysis.

2.2. Topological properties

In this study, the following topological properties are calculated to characterize proteins in the PPI network (Table 1). The simplest characteristic of a node is its degree, which is defined as the number of nodes directly connected to a given node i. In PPI networks, the degree of a protein correlates with the essentiality of the protein (Jeong et al., 2001), and this measure can be used to identify hub proteins or important proteins. Proteins with high degrees are thought to play an important role in biological networks. The degrees of essential genes, drug targets, disease genes and toxin targets were significantly higher than those of other proteins in the PPI networks (Goh et al., 2007; Xu and Li, 2006; Yang et al., 2014; Yıldırım et al., 2007; Zhang et al., 2010; Zhu et al., 2009). The average shortest path (ASP) is defined as the average shortest-path length between a node and all the other nodes in the PPI network. It is a measure of the efficiency of information or mass transport on a network.
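To make the two node-level measures just defined concrete, the sketch below computes the degree and a per-node average shortest path on a small unweighted graph using breadth-first search. It is an illustrative example only: the adjacency map is a hypothetical toy network, and the average is taken over all other reachable nodes, which is an assumption about normalisation and may differ from the exact formula listed in Table 1.

```python
from collections import deque


def degree(adj, node):
    """Degree k_i: the number of nodes directly connected to the node."""
    return len(adj[node])


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_shortest_path(adj, node):
    """Mean shortest-path length from node to every other reachable node.

    Assumes a connected graph with at least two nodes (as in the main component).
    """
    dist = shortest_path_lengths(adj, node)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others)


# Hypothetical star-shaped toy network with hub B.
toy = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B"}}
print(degree(toy, "B"))                 # 3
print(average_shortest_path(toy, "B"))  # 1.0
```

In the toy star network the hub B has the highest degree and the smallest ASP, which is the pattern the text associates with hub proteins.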
Table 1 Topological feature set.

| Name | Definition | Description |
| Degree | $k_i$ | The number of links to node $i$ |
| Clustering coefficient | $2n_i/(k_i(k_i-1))$ | $n_i$ is the number of links between all neighbors of node $i$ |
| Topological coefficient | $\frac{1}{k_i \lvert M_i \rvert} \sum_{j \in M_i} C(i,j)$ | $C(i,j)$ is the number of nodes that link to both node $i$ and node $j$; $M_i$ is the set of nodes that share neighbors with node $i$ |
| Betweenness | $\left(\sum_{j \in N} L(i,j)\right) / \left((n-1)(n-2)/2\right)$ | $L(i,j)$ is the length of the shortest path between nodes $i$ and $j$ |
| Core number | $K$ | Node $i$ has the k-core number $k$ if, when all nodes of degree less than $k$ are removed from the graph at iteration $k$, node $i$ has no link to any other node in the graph |
| ASP | $\sum_{j \in N} d_{ij} / (\lvert N \rvert \lvert N-1 \rvert)$ | $N$ denotes the set of all nodes in the PPI network; $d_{ij}$ denotes the shortest path between node $i$ and node $j$ |
| BN | $\sum_{s \in N} p_s(i)$ | $T_s$ is the shortest path tree rooted at $s$. If more than $\lvert V(T_s) \rvert /4$ paths from $s$ to other nodes in $T_s$ meet at node $i$, $p_s(i)=1$; otherwise $p_s(i)=0$ |
| EPC | $\frac{1}{\lvert N \rvert} \sum_{i \in N} \langle \delta_{ij} \rangle$ | When nodes $i$ and $j$ are connected, $\delta_{ij}=1$; otherwise $\delta_{ij}=0$. $\langle \delta_{ij} \rangle$ is the ensemble average of $\delta_{ij}$ |
| MNC | $\lvert \mathrm{MNC}(i) \rvert$ | $N(i)$ is the set of neighbors of node $i$ and $G[N(i)]$ is the subgraph induced by $N(i)$; the maximum connected component of $G[N(i)]$ is defined as MNC($i$) |
| DMNC | $E/N^{e}$ | For node $i$, $N$ is the node number and $E$ is the edge number of MNC($i$). In this study, $e$ is 1.7 |
| ADEG/ADDG | $\sum_{j \in M} d_{ij} / \lvert M \rvert$ | $d_{ij}$ is the shortest path between node $i$ and node $j$; $M$ denotes the set of all essential genes (ADEG) or all disease genes (ADDG) |
| 1N essential gene/disease gene | $k_i^{t}/k_i$ | $k_i^{t}$ is the number of links between node $i$ and essential genes or disease genes |

The clustering coefficient was first proposed by Watts and Strogatz in 1998 for measuring the modularity of the neighborhood of a node (Watts and Strogatz, 1998). It measures how the neighbors of a node are interconnected. The relationship between the clustering coefficient and modular structure has been investigated by several authors (Ravasz et al., 2002). Proteins with higher clustering coefficients are more likely to be involved in more protein complexes (Kotlyar et al., 2012). Besides the clustering coefficient, the topological coefficient is used to measure the topological characteristics of the proteins in the PPI network (Goldberg and Roth, 2003). The topological coefficient is a relative measure of the tendency of the nodes in the network to have shared interaction partners with other nodes.

Betweenness is one of the most important topological properties of a network for determining the bottleneck nodes (Freeman, 1982). It is defined as the number of shortest paths from all vertices to all others that pass through a given node. Proteins with larger betweenness represent the bottlenecks in the PPI networks or biological pathways they participate in (Hwang et al., 2009). In the regulatory network, proteins with higher betweenness have a much higher tendency to be essential proteins (Yu et al., 2007).

The k-core analysis is an iterative process in which the nodes are gradually removed from the graph, starting with the least connected (Wachi et al., 2005; Wuchty and Almaas, 2005). At each iteration k, all nodes with a degree lower than k are removed from the graph. This results in a series of subgraphs that gradually reveal the backbone of the original network. Larger k-core values correspond to proteins with larger degree, a more important position within the network structure and membership of the backbone of the proteome.

In addition, four topological properties, bottleneck (BN) (Przulj et al., 2004; Yu et al., 2007), edge percolation component (EPC) (Chin and Samanta, 2003), maximum neighborhood component (MNC) and density of maximum neighborhood component (DMNC), are calculated by Hubba (Lin et al., 2008). In an interaction network, a tree (T_v) can be constructed from the shortest paths starting from a node v.
In this tree, node v is defined as the root, and the weight of a node w is defined as the number of shortest paths starting from node v that pass through node w. If the weight of w is equal to or greater than n/4, where n is the number of nodes in T_v, node w is defined as a bottleneck. The score of node w, BN(w), is defined as the number of nodes v such that w is a bottleneck node in T_v. For EPC, each edge linking nodes i and j in an interaction network G is assigned a random probability p_ij. After eliminating the edges with random probability greater than p, a new network G′ is constructed. In G′, δ_ij = 1 if nodes i and j are connected, and δ_ij = 0 otherwise. The percolation correlation c_ij of nodes i and j is defined as the ensemble average of δ_ij. MNC(i) of node i is defined as the size of the maximum connected component of the subnetwork induced by N(i), where N(i) is the set of nodes adjacent to i. For DMNC(i), N is the node number and E is the edge number of MNC(i), and the score of node i is defined as E/N^e, where e is 1.7.

In addition, the average distance to essential genes (ADEG) (Zhu et al., 2009) was defined as the average shortest distance between a protein and all essential genes in the PPI network. The average distance to disease genes (ADDG) was defined in a similar way. The 1N essential gene index and 1N disease gene index (Xu and Li, 2006) of node i were defined as the proportion of essential genes and disease genes among all its neighbors, respectively (Table 1). Because essential genes and disease genes are more likely to interact with each other, the 1N essential gene index and 1N disease gene index can be used as the essentiality index and the disease gene index, respectively.

In this study, the common gene ontology index (CGOI) (Hwang et al., 2009) was defined as follows:

$$\mathrm{CGOI}(i) = \sum_{j} \left( d_{ij}^{B} + d_{ij}^{C} + d_{ij}^{M} \right)$$

where j represents any node adjacent to node i, and $d_{ij}^{B}$, $d_{ij}^{C}$ and $d_{ij}^{M}$ are the deepest ontology depths of the common function shared by i and j in the categories of biological process (B), cellular component (C) and molecular function (M), respectively. CGOI was used to measure the amount of common GO annotations of adjacent nodes. Based on the Gene Ontology annotation, each edge was assigned a score according to the ontology depth of the shared annotation between the two connected nodes.

In a similar way, the common KEGG index (CKI), common family index (CFI), common motif index (CMI), common transcription factor index (CTFI) and common protein complex index (CPCI) were defined as follows:

$$\mathrm{CKI}(i) = \sum_{j} k_{ij}, \quad \mathrm{CFI}(i) = \sum_{j} f_{ij}, \quad \mathrm{CMI}(i) = \sum_{j} m_{ij}, \quad \mathrm{CTFI}(i) = \sum_{j} tf_{ij}, \quad \mathrm{CPCI}(i) = \sum_{j} pc_{ij}$$

where j represents any node adjacent to node i, and $k_{ij}$, $f_{ij}$, $m_{ij}$, $tf_{ij}$ and $pc_{ij}$ are the numbers of common pathways, protein families, motifs, transcription factors and protein complexes shared by i and j in the KEGG, Pfam, Prosite, TRANSFAC and CORUM databases, respectively.
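The neighbor-based indices defined in this section share one pattern: sum (or count) something over the nodes adjacent to node i. The following sketch illustrates that pattern for a CKI-style common-annotation index and for the 1N index. It is a hedged illustration under stated assumptions: the function names, the toy network and the pathway labels are hypothetical, and the annotation map stands in for whichever database (KEGG, Pfam, Prosite, TRANSFAC or CORUM) is being used.

```python
def common_index(adj, annotations, node):
    """Sum, over all neighbours j of node, the number of annotations shared by node and j.

    With pathway annotations this corresponds to CKI(i) = sum_j k_ij; the same
    function applies to families, motifs, transcription factors or complexes.
    """
    own = annotations.get(node, set())
    return sum(len(own & annotations.get(nb, set())) for nb in adj[node])


def one_n_index(adj, marked, node):
    """1N index: the proportion of neighbours of node that belong to the marked set
    (for example, essential genes or disease genes)."""
    neighbours = adj[node]
    if not neighbours:
        return 0.0  # assumption: an isolated node gets an index of zero
    return len(neighbours & marked) / len(neighbours)


# Hypothetical toy network and pathway annotations (labels are made up).
adj = {"P1": {"P2", "P3"}, "P2": {"P1"}, "P3": {"P1"}}
pathways = {
    "P1": {"pathA", "pathB"},
    "P2": {"pathA", "pathB"},
    "P3": {"pathB", "pathC"},
}
essential = {"P2"}

print(common_index(adj, pathways, "P1"))  # 3  (2 shared with P2 + 1 shared with P3)
print(one_n_index(adj, essential, "P1"))  # 0.5
```

Keeping the annotation source as a parameter reflects the way the text defines CFI, CMI, CTFI and CPCI "in a similar way" to CKI.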


2.3. Biological properties

The GO score of a gene was defined as the total of all ontology depths of the gene. The number of KEGG pathways in which a gene was involved, the number of Prosite motifs contained in a protein, the number of protein complexes in which a protein was involved, the number of protein families to which a protein belonged and the number of regulating transcription factors were also used in this study. The properties of each gene were regarded as the properties of its gene product. All the detailed information is summarized in Table 2.

2.4. Codon usage indices

The codon adaptation index (CAI) (Sharp and Li, 1987), codon bias index (CBI) (Bennetzen and Hall, 1982), frequency of optimal codons (Fop) (Ikemura, 1981), effective number of codons (Nc) (Wright, 1990), GC content (GC) and GC content of silent third codon positions (GC3s) were calculated using the freeware program CodonW 1.4. The CDSs were mapped to gene symbols. If multiple CDSs corresponded to the same gene, the values were averaged. Finally, 22,174 genes with values calculated by CodonW 1.4 were used in this study. The codon usage indices of each gene were regarded as the indices of its gene product. There were 1161 cytoplasmic proteins, 64 endoplasmic reticulum proteins, 1881 membrane proteins, 263 mitochondrial proteins, 1770 nuclear proteins, 672 secreted proteins and 1696 unknown proteins that had codon usage indices.
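As a rough illustration of two of the simpler codon usage indices and of the per-gene averaging step, the sketch below computes GC content and an approximate GC3s for a coding sequence, then averages values over CDSs that map to the same gene. This is not CodonW and makes no claim to reproduce its output: treating GC3s as the G+C fraction at third positions after excluding ATG, TGG and stop codons is a simplifying assumption, and the sequence and identifiers are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
NO_SYNONYM_CODONS = {"ATG", "TGG"}  # Met and Trp have no synonymous alternatives


def codons(cds):
    """Split a coding sequence into complete codons (any trailing bases are ignored)."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def gc_content(cds):
    """Fraction of G and C bases over the whole sequence."""
    cds = cds.upper()
    return sum(base in "GC" for base in cds) / len(cds)


def gc3s(cds):
    """Approximate GC3s: G+C fraction at the third position of codons that have synonyms."""
    silent = [c for c in codons(cds)
              if c not in STOP_CODONS and c not in NO_SYNONYM_CODONS]
    if not silent:
        return float("nan")
    return sum(c[2] in "GC" for c in silent) / len(silent)


def average_per_gene(value_by_cds, gene_of):
    """Average an index over all CDSs that map to the same gene symbol."""
    grouped = {}
    for cds_id, value in value_by_cds.items():
        grouped.setdefault(gene_of[cds_id], []).append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}


# Hypothetical toy CDS: ATG GCC AAA TGG TTC TAA
toy_cds = "ATGGCCAAATGGTTCTAA"
print(gc_content(toy_cds))  # 7 of 18 bases are G or C
print(gc3s(toy_cds))        # 2 of the 3 counted codons (GCC, AAA, TTC) end in G or C

per_gene = average_per_gene({"t1": 0.4, "t2": 0.6, "t3": 0.5},
                            {"t1": "G1", "t2": "G1", "t3": "G2"})
```

For real analyses the values would come from CodonW itself; the sketch only shows the shape of the computation and of the CDS-to-gene averaging.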

2.5. Expression level (EL)

The mRNA expression profile (GSE1133) was downloaded from the BioGPS database (Wu et al., 2013). A log2 transformation was applied to all of the mRNA expression values. Probe sets in the expression profile were mapped to gene symbols. If multiple probe sets corresponded to the same gene, the expression values of these probe sets were averaged. Finally, the expression levels of 18,215 genes in 15 different tissues (heart, kidney, liver, lung, lymph node, ovary, prostate, skin, testis, thyroid, uterus, whole blood, whole brain, colon and small intestine) were used. The expression levels of each gene were regarded as the expression levels of its gene product. There were 1069 cytoplasmic proteins, 57 endoplasmic reticulum proteins, 1749 membrane proteins, 243 mitochondrial proteins, 1650 nuclear proteins, 662 secreted proteins and 1526 unknown proteins that had expression information.
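The preprocessing in Section 2.5 (log2 transformation, then averaging probe sets that map to the same gene) can be sketched as below. This is a minimal example under stated assumptions: the probe identifiers, gene symbols and values are hypothetical, all expression values are assumed to be strictly positive, and probes without a gene mapping are simply skipped.

```python
import math


def gene_expression(probe_values, probe_to_gene):
    """Log2-transform probe-level values, then average probes mapping to the same gene.

    probe_values:  {probe_id: [value per tissue]}, values assumed > 0
    probe_to_gene: {probe_id: gene_symbol}
    Returns {gene_symbol: [mean log2 value per tissue]}.
    """
    rows_by_gene = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unmapped probe set: skipped in this sketch
        logged = [math.log2(v) for v in values]
        rows_by_gene.setdefault(gene, []).append(logged)
    # zip(*rows) walks the tissues column by column across the probes of one gene.
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


# Hypothetical probe sets measured in two tissues.
probe_values = {"p1": [8, 16], "p2": [32, 64], "p3": [4, 4], "p4": [2, 2]}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}

levels = gene_expression(probe_values, probe_to_gene)
print(levels)  # {'G1': [4.0, 5.0], 'G2': [2.0, 2.0]}
```

Note that the transformation is applied before averaging, following the order given in the text; averaging first and transforming afterwards would generally give different values.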

Table 2 The biological properties and their corresponding numbers in different subcellular localizations.

| Property | Cyt | ER | Mem | Mit | Nuc | Sec | Unknown |
| GO score | 1176 | 65 | 1892 | 265 | 1793 | 708 | 1531 |
| KEGG number | 462 | 33 | 981 | 113 | 483 | 336 | 643 |
| Family number | 1637 | 56 | 1752 | 243 | 1596 | 654 | 1637 |
| Motif number | 842 | 45 | 1366 | 167 | 1294 | 574 | 1257 |
| TF number | 134 | 3 | 318 | 19 | 253 | 250 | 129 |
| Complex number | 208 | 7 | 244 | 67 | 660 | 58 | 256 |
| Ka/Ks ratio | 574 | 35 | 1291 | 95 | 745 | 380 | 769 |
| Homologous gene number | 585 | 35 | 1306 | 96 | 753 | 390 | 789 |
| Phyletic age | 1124 | 60 | 1798 | 252 | 1701 | 672 | 1650 |
| Expression stage | 805 | 47 | 1282 | 168 | 1264 | 414 | 1132 |

Cyt: cytoplasm; ER: endoplasmic reticulum; Mem: membrane; Mit: mitochondrion; Nuc: nucleus; Sec: secreted.

Table 3 Five physicochemical indices of 20 amino acids.

| Amino acid | Code | Accessibility | Exposed | Flexibility | Hydrophilicity | Polarity |
| Alanine | A | 16 | 15 | 0.357 | -0.5 | 8.1 |
| Cysteine | C | 168 | 5 | 0.346 | -1.0 | 5.5 |
| Aspartate | D | -78 | 50 | 0.511 | 3.0 | 13.0 |
| Glutamate | E | -106 | 55 | 0.497 | 3.0 | 12.3 |
| Phenylalanine | F | 189 | 10 | 0.314 | -2.5 | 5.2 |
| Glycine | G | -13 | 10 | 0.544 | 0.0 | 9.0 |
| Histidine | H | 50 | 34 | 0.323 | -0.5 | 10.4 |
| Isoleucine | I | 151 | 13 | 0.462 | -1.8 | 5.2 |
| Lysine | K | -141 | 85 | 0.466 | 3.0 | 11.3 |
| Leucine | L | 145 | 16 | 0.365 | -1.8 | 4.9 |
| Methionine | M | 124 | 20 | 0.295 | -1.3 | 5.7 |
| Asparagine | N | -74 | 49 | 0.463 | 0.2 | 11.6 |
| Proline | P | -20 | 45 | 0.509 | 0.0 | 8.0 |
| Glutamine | Q | -73 | 56 | 0.493 | 0.2 | 10.5 |
| Arginine | R | -70 | 67 | 0.529 | 3.0 | 10.5 |
| Serine | S | 70 | 32 | 0.507 | 0.3 | 9.2 |
| Threonine | T | -38 | 32 | 0.444 | -0.4 | 8.6 |
| Valine | V | 123 | 14 | 0.386 | -1.5 | 5.9 |
| Tryptophan | W | 145 | 17 | 0.305 | -3.4 | 5.4 |
| Tyrosine | Y | 53 | 41 | 0.420 | -2.3 | 6.2 |


2.6. Protein complexity and physicochemical properties

According to the measure of diversity (Laxton, 1978), the protein complexity is defined as follows:

$$PC(N) = N \ln N - \sum_{i=1}^{20} n_i \ln n_i$$

where $N$ is the length of the protein sequence and $n_i$ is the absolute frequency of the $i$th amino acid in the protein sequence. If $n_i$ equals zero, then $n_i \ln n_i = 0$.

Because protein sequences are composed of 20 amino acids with different physicochemical properties, recognizing the properties of each amino acid is important for studying protein structures and functions. In this study, five physicochemical indices extracted from the AAindex database (Kawashima et al., 2008) were used: accessibility (Biou et al., 1988), exposed surface (Janin et al., 1978), flexibility (Bhaskaran and Ponnuswamy, 1988), hydrophilicity (Hopp and Woods, 1981) and polarity (Grantham, 1974). The five sets of physicochemical indices of the 20 amino acids are shown in Table 3. The accessibility of each protein is defined as follows:

$$A = \sum_{i=1}^{20} n_i a_i$$

where $n_i$ is the absolute frequency of the $i$th amino acid in the protein sequence and $a_i$ is the corresponding accessibility value of the $i$th amino acid. The exposed surface, flexibility, hydrophilicity and polarity of each protein are defined similarly. Each property consists of 20 values assigned to the amino acid residues on the basis of their relative propensity to possess the property. These properties were available for 1178 cytoplasmic proteins, 65 endoplasmic reticulum proteins, 1913 membrane proteins, 266 mitochondrial proteins, 1795 nuclear proteins, 709 secreted proteins and 1762 unknown proteins.

2.7. Excess retention

The excess retention (Wuchty, 2004; Wuchty and Almaas, 2005) of proteins with subcellular localization $A$ is defined as follows:

$$ER_k^A = \frac{N_k^A / N_k}{N^A / N}$$

where $N_k^A$ is the number of proteins with subcellular localization $A$ and core number $\geq k$, $N_k$ is the total number of proteins with core number $\geq k$, $N^A$ is the number of proteins with subcellular localization $A$ in the whole network, and $N$ is the total number of proteins in the PPI network.

2.8. Statistical analysis

Correlations were estimated using non-parametric statistics (Spearman's rank correlation coefficients). The Shapiro–Wilk (SW) test was used to test the normality of the distributions. Because the variables did not follow a normal distribution in any of the protein categories, the non-parametric Kruskal–Wallis (KW) test was used to compare the protein categories. Statistical analysis was done using the freely available R software, version 3.0.2. Because the KW test implemented in R reports an exact P-value only when it is larger than 2.20E−16, all P-values below this threshold are reported as <2.20E−16.

2.9. Support vector machine

The support vector machine (SVM) was first proposed by Vapnik and his co-workers (Cortes and Vapnik, 1995). It is based on the structural risk minimization principle for minimizing both training and generalization error. In this study, the publicly available LIBSVM software (Chang and Lin, 2011), developed by Professor Chih-Jen Lin, is used to predict the subcellular localization of human proteins.

2.10. Evaluation of methods

To assess the accuracy of our predictor, the overall prediction accuracy (Acc) is calculated:

$$Acc = \sum_i TP_i / N$$

where $TP_i$ denotes the number of correctly recognized proteins of the $i$th subcellular localization and $N$ is the total number of proteins. In this study, the jackknife test is used to evaluate the performance of our predictor. During the jackknife test, each protein is singled out in turn as a test sample, and the remaining proteins are used as the training set to calculate the test sample's membership and to predict its class.
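To make the rank-based statistics of Section 2.8 concrete, the following minimal sketch computes Spearman's rank correlation and the Kruskal–Wallis H statistic (with the usual tie correction) in plain Python. It is an illustration only: the study itself used R, the function names here are hypothetical, and the P-value (obtained by comparing H with a chi-square distribution with one fewer degrees of freedom than the number of groups) is not computed.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def kruskal_wallis_h(groups):
    """H statistic for a list of groups (each a list of numbers)."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, offset = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        total += rank_sum ** 2 / len(group)
        offset += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    return h / correction

# Hypothetical toy data: three well-separated groups.
toy_h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Degenerate inputs (a constant variable for Spearman, or all pooled values identical for KW) lead to a division by zero and are deliberately left unguarded to keep the sketch short.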

3. Results and discussion

3.1. Analysis of topological properties

In this study, we analyze the topological properties of proteins with different subcellular localizations in the PPI network; the results are shown in Table 4 and Figs. S1–S3. Degree is the number of nearest neighbors of a node, and high-degree nodes are considered to play an important role in a network. We first investigate the differences in degree among the seven protein categories, which are significant (P<2.20E−16). Cytoplasmic proteins have the highest average degree, followed by nuclear proteins, whereas mitochondrial proteins have the lowest average degree. The distribution and variance of the degrees in the seven subcellular localizations are shown in Table 5. The cytoplasmic category contains some hub proteins with extremely high degree, and its variance is the highest among the seven protein groups. These results indicate that the high average degree of cytoplasmic proteins is caused by a few hub proteins with extremely high degree. The high degree of nuclear proteins is easy to understand: because most biological processes required for cell viability take place in the nucleus, more proteins interact with nuclear proteins.

We also investigated the differences in core number among the seven protein categories, which are again significant (P<2.20E−16). The cumulative fraction distribution of the core number for the nuclear proteins stands out from those of the other categories (Fig. S3): nuclear proteins have the highest average core number among the seven categories, which indicates that they are more likely to be located at the backbone of the protein network than the proteins of the other six categories.
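As an illustration of the two quantities compared above, the sketch below computes node degrees and core numbers for a small undirected graph by repeatedly peeling off the lowest-degree nodes. The toy network and the function names are hypothetical; the paper does not state which software was used for these calculations.

```python
def degrees(adjacency):
    """Degree of each node in an undirected graph given as {node: set(neighbors)}."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}

def core_numbers(adjacency):
    """Core number of each node: the largest k such that the node is in the k-core."""
    remaining = {node: set(neighbors) for node, neighbors in adjacency.items()}
    core = {}
    k = 0
    while remaining:
        # The smallest remaining degree determines the next shell to peel.
        k = max(k, min(len(nbrs) for nbrs in remaining.values()))
        peel = [node for node, nbrs in remaining.items() if len(nbrs) <= k]
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already removed via an earlier entry
            core[node] = k
            for neighbor in remaining.pop(node):
                remaining[neighbor].discard(node)
                if len(remaining[neighbor]) <= k:
                    peel.append(neighbor)
    return core

# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to A.
toy_ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
toy_degrees = degrees(toy_ppi)
toy_cores = core_numbers(toy_ppi)
```

In this toy network A has the highest degree, but A, B and C share the same core number because they form the innermost 2-core, while the pendant node D belongs only to the 1-core. This is the sense in which a high core number marks membership of the network backbone rather than merely a large number of partners.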
In addition, the excess retention of the seven categories of proteins in k-cores 1–18 of the human protein interaction network is plotted (Fig. 2). The excess retention of nuclear proteins displays the slowest decrease with the k-cores, whereas the excess retention of secreted proteins displays the quickest decrease with the k-cores. This plot clearly demonstrates


Table 4
The topological properties among different subcellular localizations.

Property      Cyt        ER         Mem        Mit        Nuc        Sec        Unknown    P-value
Degree        11.862     7.277      6.834      4.989      11.455     6.031      7.594      <2.20E−16
Core number   5.727      4.077      4.252      3.617      6.305      3.709      5.256      <2.20E−16
CC            0.099      0.098      0.110      0.050      0.145      0.110      0.119      <2.20E−16
TC            0.191      0.207      0.200      0.217      0.183      0.216      0.191      <7.16E−06
ASP           3.896      4.157      4.132      4.173      3.879      4.322      4.120      <2.20E−16
Betweenness   2.627E+04  1.571E+04  1.012E+04  5.348E+03  1.481E+04  1.026E+04  8.483E+03  <2.20E−16
BN            2.874      2.200      1.852      1.470      2.088      1.898      1.668      <2.20E−16
EPC           33.287     17.347     18.835     12.601     31.703     9.828      18.875     <2.20E−16
MNC           6.519      3.062      3.602      1.906      6.981      2.673      4.730      <2.20E−16
DMNC          0.158      0.164      0.142      0.092      0.204      0.136      0.155      <2.20E−16
1N EG index   0.377      0.266      0.365      0.278      0.391      0.388      0.344      <2.20E−16
1N DG index   0.237      0.300      0.285      0.228      0.232      0.372      0.202      <2.20E−16
ADEG          3.668      3.932      3.893      3.956      3.652      4.071      3.899      <2.20E−16
ADDG          3.815      4.071      4.035      4.100      3.804      4.206      4.045      <2.20E−16
CGOI          83.490     30.831     57.944     22.444     159.714    49.788     33.810     <2.20E−16
CKI           2.971      0.800      2.973      0.891      2.847      2.903      1.911      <2.20E−16
CFI           0.786      0.262      0.737      0.128      1.093      0.612      0.370      <2.20E−16
CMI           1.315      0.292      1.054      0.117      1.774      1.295      0.588      <7.16E−06
CTFI          0.074      0.000      0.121      0.015      0.145      0.522      0.032      <2.20E−16
CPCI          0.779      0.123      0.539      0.320      6.018      0.166      0.435      <2.20E−16

Cyt: cytoplasm; ER: endoplasmic reticulum; Mem: membrane; Mit: mitochondrion; Nuc: nucleus; Sec: secreted; CC: clustering coefficient; TC: topological coefficient; 1N EG index: 1N essential gene index; 1N DG index: 1N disease gene index.

Table 5
The distribution (%) and variance of the degrees between different subcellular localizations.

Degree                 1–5     6–10    11–15   16–20   21–30   31–40   41–50   51–100  >100    Variance
Cytoplasm              52.50   16.60   10.00   6.40    6.00    3.40    1.40    2.20    1.40    597.82
Endoplasmic reticulum  66.20   15.40   10.80   1.50    1.50    0.00    1.50    3.10    0.00    159.48
Membrane               65.20   17.10   6.70    4.00    4.20    1.50    0.60    0.60    0.20    107.27
Mitochondrion          72.90   16.20   5.60    3.00    1.10    0.80    0.00    0.40    0.00    47.24
Nucleus                47.50   18.80   10.30   6.40    8.70    3.70    2.10    2.20    0.40    230.91
Secreted               66.10   19.60   5.90    3.70    2.80    1.10    0.30    0.40    0.00    66.35
Unknown                69.80   13.80   5.50    3.10    2.80    1.30    0.50    3.10    0.30    250.52

that among all subcellular localizations, nuclear proteins are more likely to be hub proteins in the PPI network, whereas secreted proteins are more likely to be peripheral proteins.

In undirected networks, the clustering coefficient of a node is defined as the ratio of the observed edges between its neighbors to all the possible edges between those neighbors. It indicates how close the local neighborhood of a node is to being a complete subgraph (clique). In the PPI network, proteins in closely connected modules tend to have higher clustering coefficients, whereas inter-modular proteins tend to have lower clustering coefficients. In this study, there is a significant category difference between the clustering coefficients (P<2.20E−16). The average clustering coefficients of nuclear and mitochondrial proteins are 0.145 and 0.050, respectively, the highest and the lowest values in this study. These results indicate that nuclear proteins are more likely to participate in closely connected modules in the PPI network, whereas mitochondrial proteins are more likely to be inter-modular proteins.

Besides the clustering coefficient, the topological coefficient is used to measure the topological characteristics of the proteins in the PPI network. The topological coefficient is a relative measure of the tendency of a node to share interaction partners with other nodes. The KW test is performed to determine the difference among proteins in different subcellular localizations (P<2.20E−16). In contrast to the results for the clustering coefficient, the average topological coefficients of nuclear and mitochondrial proteins are 0.183 and 0.217, respectively, the lowest and the highest values in this study.

ASP is defined as the average shortest path between a node and all the other nodes in the PPI network. The average ASPs for cytoplasmic, endoplasmic reticulum, membrane, mitochondrial, nuclear, secreted and unknown proteins are 3.896, 4.157, 4.132, 4.173, 3.879, 4.322 and 4.120, respectively. The average ASP of nuclear proteins is therefore the lowest among the seven categories (P<2.20E−16), whereas the highest average ASP is observed for secreted proteins. These results suggest that nuclear proteins communicate quickly in the PPI network, whereas secreted proteins communicate slowly.

Betweenness is a centrality measure of a node within a network. It is defined as the number of shortest paths from all vertices to all others that pass through that node, and can be regarded as a measure of the influence a node has over the spread of information through the network. The betweenness of cytoplasmic proteins is significantly higher than that of proteins in the other six localizations (P<2.20E−16). The average betweenness of cytoplasmic proteins is about five times as large as that of mitochondrial proteins, which have the lowest value (Table 4). This is expected, as betweenness tends to be highly correlated with degree.
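The clustering coefficient and ASP definitions used above can be illustrated on a small undirected graph. The sketch below is a minimal plain-Python version under two stated assumptions: nodes with fewer than two neighbors are given a clustering coefficient of zero, and ASP is averaged over the nodes reachable from the source. The toy network and function names are hypothetical.

```python
from collections import deque

def clustering_coefficient(adjacency, node):
    """Observed edges among the neighbors of node divided by all possible such edges."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two neighbors
    links = sum(1 for u in neighbors for v in neighbors
                if u < v and v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

def average_shortest_path(adjacency, source):
    """Mean breadth-first-search distance from source to every other reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    others = [d for node, d in distance.items() if node != source]
    return sum(others) / len(others) if others else 0.0

# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to A.
toy_ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
cc_a = clustering_coefficient(toy_ppi, "A")
asp_d = average_shortest_path(toy_ppi, "D")
```

Node B sits inside a fully connected neighborhood and has a clustering coefficient of 1, whereas the central node A has the shortest average path to the rest of the network and the pendant node D the longest. The node labels must be mutually comparable (here, strings) because each neighbor pair is counted once via u < v.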


Fig. 2. Excess retention with the k-cores. The excess retention of nuclear proteins displays the slowest decrease with the k-cores, whereas the excess retention of secreted proteins displays the quickest decrease with the k-cores. This plot clearly demonstrates that among all subcellular localizations, nuclear proteins are more likely to be hub proteins in the PPI network, whereas secreted proteins are more likely to be peripheral proteins.
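The excess retention plotted in Fig. 2 follows the definition in Section 2.7: the fraction of proteins of localization A among proteins with core number at least k, divided by the fraction of localization A in the whole network. A minimal sketch is given below; the six proteins, their core numbers and their localizations are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def excess_retention(core_number, localization, category, k):
    """ER_k^A = (N_k^A / N_k) / (N^A / N), using proteins with core number >= k."""
    proteins = list(core_number)
    n_total = len(proteins)
    n_category = sum(1 for p in proteins if localization[p] == category)
    in_core = [p for p in proteins if core_number[p] >= k]
    if not in_core or n_category == 0:
        return float("nan")  # undefined when the k-core or the category is empty
    n_core_category = sum(1 for p in in_core if localization[p] == category)
    return (n_core_category / len(in_core)) / (n_category / n_total)

# Hypothetical toy data: core numbers and localizations of six proteins.
toy_core = {"P1": 3, "P2": 3, "P3": 2, "P4": 1, "P5": 1, "P6": 1}
toy_loc = {"P1": "Nuc", "P2": "Nuc", "P3": "Sec",
           "P4": "Sec", "P5": "Nuc", "P6": "Sec"}

er_nuc = [excess_retention(toy_core, toy_loc, "Nuc", k) for k in (1, 2, 3)]
er_sec = [excess_retention(toy_core, toy_loc, "Sec", k) for k in (1, 2, 3)]
```

A value above 1 means the category is over-represented in the k-core relative to the whole network, and a value below 1 means it is depleted. In the toy data the nuclear category becomes enriched as k increases while the secreted category drops to zero, which mirrors the qualitative contrast described for Fig. 2.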

In this work, BN, EPC, MNC and DMNC are used to explore and identify hubs in the PPI network. There are significant differences in BN, EPC, MNC and DMNC among the seven categories. Based on the comparative results, we find that cytoplasmic and nuclear proteins are more likely to be hubs and bottlenecks in the PPI network. These results also confirm that proteins with larger degree and larger betweenness tend to be hubs and bottlenecks in the PPI network.

ADEG is also compared among the seven subcellular categories and is found to differ significantly. The average ADEG of nuclear proteins is 3.652, the lowest among the seven categories, which indicates that nuclear proteins have a shorter average distance to essential genes. Similar results are obtained for ADDG, a measure of the average distance to disease genes. In addition, the 1N essential gene index of nuclear proteins is significantly higher than those of the other six categories (P<2.20E−16), indicating that a larger proportion of the neighbors of a nuclear protein are essential genes. In contrast, the 1N disease gene index is highest for secreted proteins, indicating that a larger proportion of the neighbors of a secreted protein are disease genes.

Here, CGOI, CKI, CFI, CMI, CTFI and CPCI are used to measure the number of common GO functions, KEGG pathways, protein families, motifs, transcription factors and protein complexes shared by adjacent nodes. We compare these six common indices among the seven categories and find significant differences in all of them; the results are shown in Table 4 and

Fig. 3. Two-dimensional hierarchical clustering of 20 topological properties for 11,952 proteins. In this figure, CC indicates clustering coefficient, TC indicates topological coefficient, 1N EG index indicates 1N essential gene index, and 1N DG index indicates 1N disease gene index.

Figs. S1–S3. Intriguingly, the CGOI, CFI, CMI and CPCI of nuclear proteins are the highest among the seven categories. In contrast, mitochondrial proteins have the lowest CGOI, CFI and CMI. Membrane proteins have the highest average CKI, and the difference between the seven categories is significant. We also find that the CTFI of secreted proteins is significantly higher than those of the other six categories (P<2.20E−16).

To further examine whether the topological properties are partitioned into different clusters, a hierarchical cluster analysis (Eisen et al., 1998) is performed on the 20 topological properties (Fig. 3). The properties fall into different clusters, and for most of the clusters the members of the same cluster behave consistently (Table 4 and Fig. 3). As shown in Fig. 3, ADEG and ADDG cluster together, and this cluster reflects the average distance to other functional gene groups. Betweenness and BN are two topological measures for determining the bottleneck nodes in the network. Proteins that participate in many biological pathways have higher betweenness and BN values than proteins that do not, so these two features cluster together. The cluster of degree and MNC reflects the essentiality of the proteins or the important nodes in the network, the cluster of core number and EPC reflects the backbone of the network, and the cluster of CC and DMNC reflects the modularity of the proteins. Prosite motifs and Pfam protein families are two indices of protein conservation, so the cluster of CMI and CFI reflects the common conserved motifs of adjacent proteins.

In addition, ADEG, ADDG and ASP are grouped into one cluster, defined as cluster 1; degree, MNC, CGOI, betweenness, BN, CKI, CTFI, CPCI, CFI, CMI, core number and EPC are grouped into another cluster, defined as cluster 2. Cluster 1 reflects the average distance to other nodes and the efficiency of information or mass transport in the PPI network; cluster 2 reflects the important or essential nodes in the PPI network.
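The grouping of properties into clusters can be illustrated with a small agglomerative clustering sketch. It assumes average linkage and a distance of one minus the Pearson correlation, which is a common choice for this kind of analysis; the settings actually used for Fig. 3 are not stated in the text. For brevity the profiles below are the seven category means of five properties copied from Table 4, not the per-protein values clustered in Fig. 3, so the result is only indicative.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[(a, b)] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Category means (Cyt, ER, Mem, Mit, Nuc, Sec, Unknown) copied from Table 4.
table4_means = {
    "ADEG":   [3.668, 3.932, 3.893, 3.956, 3.652, 4.071, 3.899],
    "ADDG":   [3.815, 4.071, 4.035, 4.100, 3.804, 4.206, 4.045],
    "ASP":    [3.896, 4.157, 4.132, 4.173, 3.879, 4.322, 4.120],
    "Degree": [11.862, 7.277, 6.834, 4.989, 11.455, 6.031, 7.594],
    "MNC":    [6.519, 3.062, 3.602, 1.906, 6.981, 2.673, 4.730],
}
merges = average_linkage(table4_means)
```

With these inputs the three distance-based measures (ADEG, ADDG and ASP) merge with one another and degree merges with MNC before the two groups are finally joined, which agrees with the assignment of these properties to cluster 1 and cluster 2 described above. A constant profile would cause a division by zero in the correlation and is not handled in this sketch.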


3.2. Analysis of biological properties

In this work, we analyze the biological properties of proteins with different subcellular localizations; all the results are shown in Table S1 and Figs. S4–S6. The GO score provides a measure of the importance of a particular protein. There is a statistically significant difference in the GO scores of the seven protein categories. The average GO scores for cytoplasmic, endoplasmic reticulum, membrane, mitochondrial, nuclear, secreted and unknown proteins are 3.508E+04, 2.484E+04, 2.890E+04, 8.243E+04, 7.304E+04, 3.591E+04 and 5.239E+04, respectively. Therefore, the GO score of mitochondrial proteins is the highest among the seven categories, whereas the lowest GO score is observed for endoplasmic reticulum proteins.

Secreted proteins include many enzymes, toxins, hormones and antimicrobial peptides. They are present in the extracellular space and are often accessible to various drug delivery mechanisms. Based on this, we expect secreted proteins to be involved in more KEGG pathways. Investigation of the KEGG pathway numbers shows that there is a significant difference among the seven protein categories (P<2.20E−16). The average KEGG number is higher for secreted proteins than for proteins in the other six categories, whereas the average KEGG number of endoplasmic reticulum proteins is the lowest.

We also analyze the family number and motif number of proteins with different subcellular localizations. Cytoplasmic proteins are involved in more protein families than the other protein categories, and the difference is significant (P<2.20E−16). There is also a significant category difference in the motif number: secreted proteins have the highest average motif number, followed by cytoplasmic, nuclear, membrane, unknown, endoplasmic reticulum and mitochondrial proteins.

The seven protein categories are further characterized by the number of transcription factors (TFs) determined for each category. There is a significant category difference in the TF number (P<2.20E−16). Secreted proteins tend to be regulated by the highest number of transcription factors, whereas endoplasmic reticulum proteins tend to be regulated by the lowest number. The average TF numbers of secreted and endoplasmic reticulum proteins are 4.028 and 2.000, respectively, which means that secreted proteins have on average 2.01 times as many TFs as endoplasmic reticulum proteins. TRANSFAC Professional is the most comprehensive database containing published data on eukaryotic transcription factors, their experimentally proven binding sites, and regulated genes. There is no obvious reason why the binding sites of secreted proteins should have been investigated experimentally by more researchers than those of other proteins. Therefore, these results are unlikely to result from dataset bias or investigation bias.

In the PPI network, nuclear proteins have the highest clustering coefficients, suggesting that they are more likely to be involved in protein complexes (Kotlyar et al., 2012). Nuclear proteins are indeed involved in the largest number of protein complexes, and the difference in complex numbers among the seven categories is significant (P<2.20E−16). This result is consistent with the analysis of Li et al. (2013), who found that most of the protein complexes in the CORUM database comprised nuclear proteins.

Protein evolutionary rate (Ka/Ks) can be used to measure the conservation of genes. Our analysis shows that significant differences exist between the Ka/Ks ratios for different subcellular localizations (P<2.20E−16). The average Ka/Ks ratios for each


category are calculated and compared between the seven categories. Our results suggest that secreted proteins are under more relaxed selective pressure than the other six categories. Subcellular localization is an important aspect of protein function, and there have been conflicting reports about the correlation between protein subcellular localization and evolutionary rate. Some previous findings suggested that extracellular proteins were under more relaxed selective pressure than cytoplasmic proteins and nuclear proteins (Wang et al., 2013). The difference between our analysis and some previous studies may be explained by two factors. First, different datasets were used in our study and in previous studies. Second, the evolution of genes may follow different patterns in the human lineage, leading to differences in the Ka/Ks ratios obtained from human–rodent or human–primate alignments.

Phyletic age describes the evolutionary origin of a gene, defined by the evolutionarily most distant species in which homologs can be found. Many structural, evolutionary and functional features of genes are correlated with phyletic age. The differences in phyletic age of the seven protein categories are also investigated, and there is a significant category difference between phyletic ages (P<2.20E−16). In multicellular organisms, embryonic development is a tightly regulated chain of events. Knocking out genes expressed in early developmental stages is more likely to be lethal because more downstream events are affected. Duplicated genes, or duplicates, are genes whose protein sequences are homologous to those of other genes in the same genome. The number of homologous genes in the same genome is defined as the homologous gene number. Finally, investigation of the expression stage and homologous gene number shows that there are significant differences in both among the seven protein categories.

3.3. Analysis of codon usage indices

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. Codon usage bias can therefore provide a measure of the importance of a particular gene. In this study, we analyze six codon usage indices for each investigated gene from its coding region. For all codon usage indices, KW tests are applied to compare the seven protein categories, and significant differences are found in all six indices (Table S2 and Figs. S7–S9). The CAI is one of the codon usage indices. It quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set. The average CAIs for cytoplasmic, endoplasmic reticulum, membrane, mitochondrial, nuclear, secreted and unknown proteins are 0.394, 0.375, 0.435, 0.370, 0.393, 0.438 and 0.397, respectively. Therefore, the CAI of secreted proteins is the highest among the seven categories, whereas the lowest CAI is observed for mitochondrial proteins.

The Spearman correlations between CAI and the other five codon usage indices are calculated to investigate whether these indices are correlated with CAI. As shown in Fig. 4, there are four positive correlations, between CAI and CBI, Fop, GC3s and GC, and one negative correlation, between CAI and Nc. All these correlations are statistically significant, and we therefore expect that the CBI, Fop, GC3s and GC of secreted proteins would also be the highest. The average CBI, Fop, GC3s and GC of secreted proteins are indeed higher than those of proteins in the other six categories. The effective number of codons (Nc) is a measure of codon usage bias that ranges from 20 to 61. A gene has Nc equal to 20 when only one codon is used per amino acid, and thus it shows the


Fig. 4. CAI versus (A) CBI, (B) Fop, (C) Nc, (D) GC3s and (E) GC for 18,612 human proteins. The line shown is the linear regression. In all cases, evidence for a significant correlation is obtained.

strongest codon usage bias, whereas a gene has Nc equal to 61 when all the codons are used with equal frequency. There is a clear negative correlation between CAI and Nc (Fig. 4). Accordingly, the Nc of secreted proteins is the lowest among the seven categories (P<2.20E−16), whereas the highest Nc is observed for endoplasmic reticulum proteins. These results indicate that secreted proteins have the strongest codon usage bias and endoplasmic reticulum proteins the weakest, which is consistent with the above observations.

3.4. Analysis of expression level

The availability of vast public gene-array datasets gave us an opportunity to examine mRNA expression levels in 15 different tissues. To explore whether proteins in different subcellular localizations differ in their mRNA expression levels, we calculated the mean mRNA expression level of each of the seven protein categories in each of the 15 tissues, together with the P-value of the category comparison; the results are shown in Table S3 and Figs. S10–S12. There are significant category differences between the expression levels. As shown in Table S3, high expression levels are observed for mitochondrial and endoplasmic reticulum proteins, and low expression levels for membrane and secreted proteins. Figs. S11 and S12 show the violin plots and cumulative fraction

distributions of expression levels for each of the subcellular localizations. The violin plots indicate that each category shows an appreciable spread of expression levels, and the cumulative fraction distributions for the categories with high expression levels stand out from those of the other compartments. In the heart, the average expression levels for cytoplasmic, endoplasmic reticulum, membrane, mitochondrial, nuclear, secreted and unknown proteins are 4.727, 4.970, 3.968, 5.285, 4.170, 3.920 and 4.574, respectively. Therefore, mitochondrial proteins have the highest expression level, whereas secreted proteins have the lowest. Similar results are obtained for the other 14 tissues, which shows that our results are largely consistent across tissues: the overall trend of high expression levels for endoplasmic reticulum and mitochondrial proteins and low expression levels for secreted and membrane proteins is observed consistently in the 15 tissues. The results presented in this study are similar to those of Drawid et al. (2000), who found high expression levels for cytoplasmic proteins, low expression levels for nuclear and membrane proteins and middling expression levels for secreted proteins in yeast. Drawid et al. also found that proteins in the ‘transcription’ and ‘transport’ categories had lower expression levels in yeast. Proteins involved in transcription are often located in the nucleus and proteins involved in


Table 6
The success rates in predicting subcellular locations for 3823 human proteins.

Subcellular location   Success rates (%)
Cytoplasm              543/928 = 58.51
Membrane               768/1129 = 68.02
Mitochondrion          76/242 = 31.41
Nucleus                754/1082 = 69.69
Secreted               259/442 = 58.59
Overall Acc            2400/3823 = 62.78

extracellular transport are often located in the cell membrane. These may be the reasons for the lower expression levels of nuclear and membrane proteins in our study. In addition, we suspect the correlation between expression level and subcellular localization may be related to the volumes of the various subcellular compartments. For example, the cytoplasm has much more space for proteins than the other compartments. The expression level for freely diffusing proteins destined for larger compartments may need to be higher than for smaller ones to achieve the same effective concentration (Drawid et al., 2000). In many bacteria and small eukaryotes, CAI was used as a quantitative way of predicting the expression level of a gene. In this study, we want to know whether there is a correlation between the CAI and mRNA expression level. We choose this index on the basis that CAI has been shown to correlate highly to expression levels in yeast and it seems to perform better than other indices, like CBI, Nc or Fop. Therefore, the Spearman correlations between CAI and mRNA expression level in 15 tissues are calculated to investigate whether CAI can be used to represent expression level of humans (Fig. S13). Fig. S13 shows the relationship between the CAI values and mRNA expression levels for 15 different tissues. It can clearly be seen that for genes in 15 different tissues there are no significant correlations between CAI and expression level. The relationship between CAI and mRNA expression level also agrees with difference results obtained in six codon usage indices and mRNA expression levels of the proteins in seven categories. 3.5. Analysis of the protein complexity and physicochemical properties The protein complexity, solvent accessibility, exposed surface, flexibility, hydrophilicity and polarity are also investigated, and all the results are shown in Table S4 and Figs. S14–S16. There are significant category differences among these properties. 
Intriguingly, cytoplasmic proteins have, on average, the highest physicochemical property values among the seven subcellular localizations. In contrast, in most cases, proteins with a mitochondrial localization have relatively low values.

3.6. Protein subcellular localization prediction

In order to predict the subcellular location of a protein, it is very important to choose a set of reasonable information parameters from the protein sequence. The concept of Chou’s pseudo-amino acid composition has stimulated a series of studies to improve the prediction quality in various areas (Chou, 2001, 2011; Chou and Elrod, 1999; Chou and Cai, 2002; Yuan et al., 2013; Zuo and Li, 2009, 2010; Zuo et al., 2013). So, in this study, 400 dipeptides (Chen and Li, 2007a, 2007b; Lin, 2008; Lin and Li, 2007; Yang and Li, 2009), 20 topological properties, 10 biological properties, 6 codon usage indices, 15 mRNA expression levels, protein complexity and 5 physicochemical properties are selected

to test the performance of our classifier for protein subcellular localization prediction. In statistical prediction, the following three methods are often used to examine the power of a predictor: the independent dataset test, the sub-sampling test and the jackknife test. Of these three, the jackknife test is deemed the most effective and objective, and it is not subject to the overtraining problem; hence it has been used by more and more investigators (Chou, 2005; Chou et al., 1997, 2000; Huang et al., 2012; Li et al., 2012) to examine the power of various prediction methods. Therefore, the jackknife test is performed on the benchmark dataset to examine the power of our classifier; all the results are shown in Table 6. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors, we shall make efforts in our future work to establish a user-friendly, publicly accessible web-server for the predictor to enhance its practical value. We also plan to incorporate more features, such as GO features and sequential evolution information (Chou et al., 2011, 2012; Wu et al., 2011; Xiao et al., 2011b), to further improve the prediction results of our classifier in future work.
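The two ingredients described in this section, the 400 dipeptide features and the jackknife (leave-one-out) test, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a simple nearest-neighbour rule stands in for the SVM classifier, only the dipeptide features are used, and the sequences and labels are invented toy data.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered pairs of the 20 standard amino acids: 20 x 20 = 400 dipeptides.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence."""
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        idx = INDEX.get(first + second)
        if idx is not None:  # skip pairs containing non-standard residues
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [count / total for count in counts]


def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (NOT the SVM of the paper): label of the
    training vector closest to the query in squared Euclidean distance."""
    def squared_distance(vector):
        return sum((a - b) ** 2 for a, b in zip(vector, query))
    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i]))
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Leave-one-out test: each sample is predicted by a model built
    from all the other samples; returns the fraction predicted correctly."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical toy benchmark (sequences and labels are made up).
sequences = ["KKRKKRKK", "RKKRKKRK", "LLVLLVLL", "VLLVLLVL"]
labels = ["nucleus", "nucleus", "membrane", "membrane"]
features = [dipeptide_composition(seq) for seq in sequences]
accuracy = jackknife_accuracy(features, labels)
print(f"Jackknife accuracy on toy data: {accuracy:.2f}")
```

Because every sample is left out exactly once, the jackknife result for a given benchmark dataset does not depend on a random split. Any other classifier can be passed through the `predict` argument, provided it accepts the training vectors, training labels and a query vector.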

4. Conclusion

In this work, we have used topological properties, biological properties, codon usage indices, mRNA expression levels, protein complexity and physicochemical properties to analyze proteins at the subcellular localization level. To the best of our knowledge, this is the first time these properties have been systematically compared for human proteins of different subcellular localizations. We found clear statistical differences in all properties among the seven subcellular localizations. In the PPI network, most of the topological properties show that nuclear proteins tend to have the most important indices and play an important role in the human PPI network. Mitochondrial proteins show the highest average GO score. Secreted proteins are involved in more KEGG pathways, tend to be regulated by the highest number of transcription factors and have the highest average motif number. Cytoplasmic proteins are involved in more protein families. Other biological properties, such as the Ka/Ks ratio and phyletic age, were also investigated, and significant differences were found between proteins in the seven categories. In the analysis of the six codon usage indices for the seven protein categories, secreted proteins have the highest codon usage indices. In addition, our results show that, in humans, expression levels are clearly correlated with the subcellular localization of the corresponding protein, and the highest expression levels are observed for mitochondrial proteins. Finally, cytoplasmic proteins show the highest protein complexity and the highest physicochemical properties.

However, some questions remain unanswered. It is important to further investigate the reasons behind these results, and more biological data from databases such as TRANSFAC and KEGG are needed. Additionally, although the present study addressed the evolutionary pressure on proteins through parameters such as the Ka/Ks ratio and CAI, other parameters, such as the number of genetic and physical protein interactions, the fitness consequences of gene knockout and the protein sequence length, are also important evolutionary determinants. In addition, based on 20 topological properties, 10 biological properties, 6 codon usage indices, 15 mRNA expression levels, protein complexity, 5 physicochemical properties, and 400 pseudo-amino acid compositions, an SVM classifier is applied for the prediction of protein subcellular localization. We anticipate that these parameters will become important future topics in genomics. Finally, these findings should contribute to elucidating the functional mechanisms of a biological system, and these results may provide

us with useful help for protein subcellular localization prediction. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors, as pointed out in Chou and Shen (2009) and Lin and Lapointe (2013) and demonstrated in a series of recent publications (Chen et al., 2013; Fan et al., 2014; Guo et al., 2014; Liu et al., 2013; Min et al., 2013; Qiu et al., 2014; Xiao et al., 2013; Xu et al., 2013), we shall make efforts in our future work to provide a web-server for the method presented in this paper.

Acknowledgments

This work was supported by the Scientific Research Fund of Heilongjiang Provincial Health Department (No. 2012797), the Heilongjiang Postdoctoral Funds for Scientific Research Initiation (No. LBHQ11042) and the Program for Young Talents of Science and Technology in Harbin (No. 2013RFQXJ057).

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at: http://dx.doi.org/10.1016/j.jtbi.2014.05.008.

References

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., 2000. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.
Bairoch, A., Boeckmann, B., 1991. The Swiss-Prot protein sequence data bank. Nucleic Acids Res. 19, 2247–2249.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L.L., 2004. The Pfam protein families database. Nucleic Acids Res. 32, D138–D141.
Bennetzen, J.L., Hall, B., 1982. Codon selection in yeast. J. Biol. Chem. 257, 3026–3031.
Bhaskaran, R., Ponnuswamy, P., 1988. Positional flexibilities of amino acid residues in globular proteins. Int. J. Pept. Protein Res. 32, 241–255.
Biou, V., Gibrat, J., Levin, J., Robson, B., Garnier, J., 1988. Secondary structure prediction: combination of three different methods. Protein Eng. 2, 185–191.
Brown, K.R., Jurisica, I., 2005. Online predicted human interaction database. Bioinformatics 21, 2076–2082.
Burge, C., Karlin, S., 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.
Chang, C.C., Lin, C.J., 2011. LIBSVM: a library for support vector machines. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, W., Feng, P.M., Lin, H., Chou, K.C., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68.
Chen, W.H., Minguez, P., Lercher, M.J., Bork, P., 2012. OGEE: an online gene essentiality database. Nucleic Acids Res. 40, D901–D906.
Chen, Y.L., Li, Q.Z., 2007a. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J. Theor. Biol. 248, 377–381.
Chen, Y.L., Li, Q.Z., 2007b. Prediction of the subcellular location of apoptosis proteins. J. Theor. Biol. 245, 775–783.
Chin, C.S., Samanta, M.P., 2003. Global snapshot of a protein interaction network-percolation based approach. Bioinformatics 19, 2413–2419.
Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255.
Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247.
Chou, K.C., 2013. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. 9, 1092–1100.
Chou, K.C., Elrod, D.W., 1999. Protein subcellular location prediction. Protein Eng. 12, 107–118.
Chou, K.C., Cai, Y.D., 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277, 45765–45769.
Chou, K.C., Shen, H.B., 2007. Recent progress in protein subcellular location prediction. Anal. Biochem. 370, 1–16.
Chou, K.C., Shen, H.B., 2008. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc. 3, 153–162.
Chou, K.C., Shen, H.B., 2009. Recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 1, 63–92.
Chou, K.C., Shen, H.B., 2010a. Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS One 5, e11335.
Chou, K.C., Shen, H.B., 2010b. A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0. PLoS One 5, e9931.
Chou, K.C., Jones, D., Heinrikson, R.L., 1997. Prediction of the tertiary structure and substrate binding site of caspase-8. FEBS Lett. 419, 49–54.
Chou, K.C., Tomasselli, A.G., Heinrikson, R.L., 2000. Prediction of the tertiary structure of a caspase-9/inhibitor complex. FEBS Lett. 470, 249–256.
Chou, K.C., Wu, Z.C., Xiao, X., 2011. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One 6, e18258.
Chou, K.C., Wu, Z.C., Xiao, X., 2012. iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst. 8, 629–641.
Cortes, C., Vapnik, V., 1995. Support vector networks. Mach. Learn. 20, 273–297.
Drawid, A., Jansen, R., Gerstein, M., 2000. Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet. 16, 426–430.
Dreger, M., 2003. Subcellular proteomics. Mass Spectrom. Rev. 22, 27–56.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.
Ermolaeva, M.D., 2001. Synonymous codon usage in bacteria. Curr. Issues Mol. Biol. 3, 91–97.
Fan, G.L., Li, Q.Z., 2012. Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol. 304, 88–95.
Fan, Y.N., Xiao, X., Min, J.L., Chou, K.C., 2014. iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking. Int. J. Mol. Sci. 15, 4915–4937.
Freeman, L.C., 1982. Centered graphs and the structure of ego networks. Math. Soc. Sci. 3, 291–304.
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152.
Ghaemmaghami, S., Huh, W.K., Bower, K., Howson, R.W., Belle, A., Dephoure, N., O’Shea, E.K., Weissman, J.S., 2003. Global analysis of protein expression in yeast. Nature 425, 737–741.
Goh, K.I., Cusick, M.E., Valle, D., Childs, B., Vidal, M., Barabasi, A.L., 2007. The human disease network. Proc. Natl. Acad. Sci. USA 104, 8685–8690.
Goldberg, D.S., Roth, F.P., 2003. Assessing experimentally derived interactions in a small world. Proc. Natl. Acad. Sci. USA 100, 4372–4376.
Grantham, R., 1974. Amino acid difference formula to help explain protein evolution. Science 185, 862–864.
Guo, S.H., Deng, E.Z., Xu, L.Q., Ding, H., Lin, H., Chen, W., Chou, K.C., 2014. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, Epub ahead of print.
Gygi, S.P., Rist, B., Gerber, S.A., Turecek, F., Gelb, M.H., Aebersold, R., 1999. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999.
Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., McKusick, V.A., 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517.
Han, H.W., Bae, S.H., Jung, Y.H., Moon, J., 2013a. Genome-wide characterization of the relationship between essential and TATA-containing genes. FEBS Lett. 587, 444–451.
Han, H.W., Ohn, J.H., Moon, J., Kim, J.H., 2013b. Yin and Yang of disease genes and death genes between reciprocally scale-free biological networks. Nucleic Acids Res. 41, 9209–9217.
Hopp, T.P., Woods, K.R., 1981. Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. USA 78, 3824–3828.
Huang, T., Zhang, J., Xu, Z.P., Hu, L.-L., Chen, L., Shao, J.L., Zhang, L., Kong, X.Y., Cai, Y.D., Chou, K.C., 2012. Deciphering the effects of gene deletion on yeast longevity using network and machine learning approaches. Biochimie 94, 1017–1025.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., 2002. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41.
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J., 2006. The PROSITE database. Nucleic Acids Res. 34, D227–D230.
Hurst, L.D., 2002. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 18, 486–487.
Hwang, Y.C., Lin, C.C., Chang, J.Y., Mori, H., Juan, H.F., Huang, H.C., 2009. Predicting essential genes based on network and sequence analysis. Mol. Biosyst. 5, 1672–1678.
Ikemura, T., 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409.
Janin, J., Wodak, S., Levitt, M., Maigret, B., 1978. Conformation of amino acid side-chains in proteins. J. Mol. Biol. 125, 357–386.
Jeong, H., Mason, S.P., Barabási, A., Oltvai, Z.N., 2001. Lethality and centrality in protein networks. Nature 411, 41–42.
Joyce, A.R., Reed, J.L., White, A., Edwards, R., Osterman, A., Baba, T., Mori, H., Lesley, S.A., Palsson, B.Ø., Agarwalla, S., 2006. Experimental and computational assessment of conditionally essential genes in Escherichia coli. J. Bacteriol. 188, 8259–8271.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M., 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277–D280.
Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., Kanehisa, M., 2008. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202–D205.
Kotlyar, M., Fortney, K., Jurisica, I., 2012. Network-based characterization of drug-regulated genes, drug targets, and toxicity. Methods 57, 499–507.
Kurland, C., 1991. Codon bias and gene expression. FEBS Lett. 285, 165–169.
Laxton, R., 1978. The measure of diversity. J. Theor. Biol. 70, 51–67.
Li, B.Q., Hu, L.L., Niu, S., Cai, Y.D., Chou, K.C., 2012. Predict and analyze S-nitrosylation modification sites with the mRMR and IFS approaches. J. Proteomics 75, 1654–1665.
Li, Z.C., Lai, Y.H., Chen, L.L., Chen, C., Xie, Y., Dai, Z., Zou, X.Y., 2013. Identifying subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm. Mol. Biosyst. 9, 658–667.
Lin, C.Y., Chin, C.H., Wu, H.H., Chen, S.H., Ho, C.W., Ko, M.T., 2008. Hubba: hub objects analyzer a framework of interactome hubs identification for network biology. Nucleic Acids Res. 36, W438–W443.
Lin, H., 2008. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. J. Theor. Biol. 252, 350–356.
Lin, H., Li, Q.Z., 2007. Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochem. Biophys. Res. Commun. 354, 548–551.
Lin, S.X., Lapointe, J., 2013. Theoretical and experimental biology in one. J. Biomed. Sci. Eng. 6, 435–442.
Liu, B., Zhang, D., Xu, R., Xu, J., Wang, X., Chen, Q., Dong, Q., Chou, K.C., 2013. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 30, 472–479.
Luby-Phelps, K., 1999. Cytoarchitecture and physical properties of cytoplasm: volume, viscosity, diffusion, intracellular surface area. Int. Rev. Cytol. 192, 189–221.
Martin, I.V., MacNeill, S.A., 2004. Functional analysis of subcellular localization and protein–protein interaction sequences in the essential DNA ligase I protein of fission yeast. Nucleic Acids Res. 32, 632–642.
Matys, V., Fricke, E., Geffers, R., Gößling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., 2003. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374–378.
McBride, H.M., Neuspiel, M., Wasiak, S., 2006. Mitochondria: more than just a powerhouse. Curr. Biol. 16, R551–R560.
Mei, S., 2012. Predicting plant protein subcellular multi-localization by Chou’s PseAAC formulation based multi-label homolog knowledge transfer learning. J. Theor. Biol. 310, 80–87.
Min, J.L., Xiao, X., Chou, K.C., 2013. iEzy-Drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. Biomed. Res. Int., 701317.
Moriyama, E.N., Powell, J.R., 1998. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res. 26, 3188–3193.
Nakai, K., 2000. Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54, 277–344.
Powell, J.R., Moriyama, E.N., 1997. Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94, 7784–7790.
Przulj, N., Wigle, D.A., Jurisica, I., 2004. Functional topology in a network of protein interactions. Bioinformatics 20, 340–348.
Qiu, W.R., Xiao, X., Chou, K.C., 2014. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int. J. Mol. Sci. 15, 1746–1766.
Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., Barabási, A., 2002. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555.
Rocha, E.P., 2004. Codon usage bias from tRNA’s point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 14, 2279–2286.
Rual, J.F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., Ayivi-Guedehoussou, N., 2005. Towards a proteome-scale map of the human protein–protein interaction network. Nature 437, 1173–1178.
Ruepp, A., Brauner, B., Dunger-Kaltenbach, I., Frishman, G., Montrone, C., Stransky, M., Waegele, B., Schmidt, T., Doudieu, O.N., Stümpflen, V., 2008. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 36, D646–D650.
Sharp, P.M., Li, W.H., 1987. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295.

Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H., Goehler, H., Stroedicke, M., Zenkner, M., Schoenherr, A., Koeppen, S., 2005. A human protein–protein interaction network: a resource for annotating the proteome. Cell 122, 957–968.
Tsien, R.Y., 1998. The green fluorescent protein. Annu. Rev. Biochem. 67, 509–544.
Wachi, S., Yoneda, K., Wu, R., 2005. Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 21, 4205–4208.
Wan, S., Mak, M.W., Kung, S.Y., 2013. GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou’s pseudo-amino acid composition. J. Theor. Biol. 323, 40–48.
Wang, C., Jiang, W., Li, W., Lian, B., Chen, X., Hua, L., Lin, H., Li, D., Li, X., Liu, Z., 2011. Topological properties of the drug targets regulated by microRNA in human protein–protein interaction network. J. Drug Target. 19, 354–364.
Wang, X., Wang, R., Zhang, Y., Zhang, H., 2013. Evolutionary survey of druggable protein targets with respect to their subcellular localizations. Genome Biol. Evol. 5, 1291–1297.
Watts, D.J., Strogatz, S.H., 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442.
Wright, F., 1990. The ‘effective number of codons’ used in a gene. Gene 87, 23–29.
Wu, C., Macleod, I., Su, A.I., 2013. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res. 41, D561–D565.
Wu, Z.C., Xiao, X., Chou, K.C., 2011. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Mol. Biosyst. 7, 3287–3297.
Wu, Z.C., Xiao, X., Chou, K.C., 2012. iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex Gram-positive bacterial proteins. Protein Pept. Lett. 19, 4–14.
Wuchty, S., 2004. Evolution and topology in the yeast protein interaction network. Genome Res. 14, 1310–1314.
Wuchty, S., Almaas, E., 2005. Peeling the yeast protein network. Proteomics 5, 444–449.
Xiao, X., Wu, Z.C., Chou, K.C., 2011a. A multi-label classifier for predicting the subcellular localization of Gram-negative bacterial proteins with both single and multiple sites. PLoS One 6, e20592.
Xiao, X., Wu, Z.C., Chou, K.C., 2011b. iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol. 284, 42–51.
Xiao, X., Min, J.L., Wang, P., Chou, K.C., 2013. iCDI-PseFpt: identify the channel–drug interaction in cellular networking with PseAAC and molecular fingerprints. J. Theor. Biol. 337, 71–79.
Xu, J., Li, Y., 2006. Discovering disease-genes by topological features in human protein–protein interaction network. Bioinformatics 22, 2800–2805.
Xu, Y., Shao, X.J., Wu, L.Y., Deng, N.Y., Chou, K.C., 2013. iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1, e171.
Yang, L., Li, Q.Z., 2009. Prediction of presynaptic and postsynaptic neurotoxins by the increment of diversity. Toxicol. In Vitro 23, 346–348.
Yang, L., Wang, J., Wang, H., Lv, Y., Zuo, Y., Jiang, W., 2014. Analysis and identification of toxin targets by topological properties in protein–protein interaction network. J. Theor. Biol. 349, 82–91.
Yıldırım, M.A., Goh, K.I., Cusick, M.E., Barabási, A.L., Vidal, M., 2007. Drug–target network. Nat. Biotechnol. 25, 1119–1126.
Yu, H., Kim, P.M., Sprecher, E., Trifonov, V., Gerstein, M., 2007. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput. Biol. 3, e59.
Yuan, L.F., Ding, C., Guo, S.H., Ding, H., Chen, W., Lin, H., 2013. Prediction of the types of ion channel-targeted conotoxins based on radial basis function network. Toxicol. In Vitro 27, 852–856.
Zhang, L., Hu, K., Tang, Y., 2010. Predicting disease-related genes by topological similarity in human protein–protein interaction network. Cent. Eur. J. Phys. 8, 672–682.
Zhou, G.P., Doctor, K., 2003. Subcellular location prediction of apoptosis proteins. Proteins: Struct. Funct. Genet. 50, 44–48.
Zhu, M., Gao, L., Li, X., Liu, Z., Xu, C., Yan, Y., Walker, E., Jiang, W., Su, B., Chen, X., 2009. The analysis of the drug-targets based on the topological properties in the human protein–protein interaction network. J. Drug Target. 17, 524–532.
Zuo, Y.C., Li, Q.Z., 2009. Using reduced amino acid composition to predict defensin family and subfamily: integrating similarity measure and structural alphabet. Peptides 30, 1788–1793.
Zuo, Y.C., Li, Q.Z., 2010. Using K-minimum increment of diversity to predict secretory proteins of malaria parasite based on groupings of amino acids. Amino Acids 38, 859–867.
Zuo, Y.C., Chen, W., Fan, G.L., Li, Q.Z., 2013. A similarity distance of diversity measure for discriminating mesophilic and thermophilic proteins. Amino Acids 44, 573–580.
