Global organization of a binding site network gives insight into evolution and structure-function relationships of proteins.

www.nature.com/scientificreports

OPEN

Received: 26 April 2017 Accepted: 7 August 2017 Published: xx xx xxxx

Global organization of a binding site network gives insight into evolution and structure-function relationships of proteins Juyong Lee 1,2, Janez Konc3,4, Dušanka Janežič3 & Bernard R. Brooks2 The global organization of protein binding sites is analyzed by constructing a weighted network of binding sites based on their structural similarities and detecting communities of structurally similar binding sites based on the minimum description length principle. The analysis reveals that there are two central binding site communities that play the roles of the network hubs of smaller peripheral communities. The sizes of communities follow a power-law distribution, which indicates that the binding sites included in larger communities may be older and have been evolutionary structural scaffolds of more recent ones. Structurally similar binding sites in the same community bind to diverse ligands promiscuously and they are also embedded in diverse domain structures. Understanding the general principles of binding site interplay will pave the way for improved drug design and protein design. Ligand binding sites are responsible for various biological processes of proteins such as signal transduction and enzymatic activity. Understanding the characteristics of binding sites is essential in drug discovery and protein engineering. Still, there are many open questions that should be addressed to better understand binding sites. Are there common motifs of binding site structures? How much are binding sites similar or different quantitatively? How do binding sites evolve? Do they follow the same or different evolutionary pathways from protein domain structures? How do binding site similarities relate to known global structural, that is, fold similarities? Exploiting the vast structural data available in the Protein Data Bank (PDB)1, many binding site structure comparison algorithms and databases were developed2–10. However, previous studies were mainly focused on finding individual similar structures, and did not explore the global organization of binding site similarities4, 6–9. There have been studies to find clusters of known binding sites6, 7, 11–16. However, most of them focused on certain protein families and did not address the global and evolutionary relationships between the representative binding site structures. In contrast to binding site structures, protein domains have been investigated extensively since domains have long been considered as basic protein structural units that are stable, function and evolve17. There are many approaches to classify the protein domain universe systematically, such as SCOP18 and CATH19, that provide global hierarchical organizations of domain structures. To visualize the complex global relationships between domains the whole protein domain space has been projected onto two- and three-dimensional maps 20–23. Alternatively, the domain space has been represented as a network in which nodes are protein domains, and connections are drawn between domains that have similar folds24, 25. The three-dimensional projection of the domain space shows that all domain structures can be clustered into four clusters, which approximately correspond to the four SCOP classes, i.e., all alpha, all beta, alpha + beta, and alpha/beta classes20–22. Another study proposes that the alpha/beta class forms a densely populated and functionally diverse core region of the protein domain universe23. A recent network-based approach shows that alpha/beta domains form a large connected domain network, whereas all-alpha, all-beta, and alpha + beta domains form smaller and disconnected networks24. 1

Department of Chemistry, Kangwon National University, 1 Kangwondaehak-gil, Chuncheon, 24341, Republic of Korea. 2Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, 20892, United States. 3Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, SI-6000, Koper, Slovenia. 4National Institute of Chemistry, Hajdrihova 19, SI-1000, Ljubljana, Slovenia. Juyong Lee and Janez Konc contributed equally to this work. Correspondence and requests for materials should be addressed to J.L. (email: [email protected]) Scientific REPOrTs | 7: 11652 | DOI:10.1038/s41598-017-10412-z

1

www.nature.com/scientificreports/

Figure 1. Flow diagram for binding site community analysis. It is still an open question whether the domain is the optimal level to describe and classify protein structures26. Several studies suggest the existence of highly conserved and frequently occurring subdomain level motifs27–30. Assuming that subdomain level local structures, such as binding sites, are the basic units of protein structure evolution, additional questions can be asked. What is the process by which the binding sites evolve and how does this process differ from the evolution of domain structures? Although convergent31 and divergent32 evolution of binding sites has been described, a mechanism that can encompass both evolutionary pathways has not been found yet. To help find answers to these questions, we analyzed global relationships between domains and binding sites. In this study, we generated a global weighted network of all known binding sites based on their structural similarities calculated using ProBiS2–4, 10. Since the generated networks have large numbers of nodes and edges, they are incomprehensible at first glance. To reveal the hidden global organization of the networks, we reduced its complexity using a community detection approach that finds a subset of nodes that are more densely connected than the rest of the network. This approach, used in a protein-protein network improves the accuracy of protein function prediction33–35. We found that the size distribution of binding site communities follow a power-law distribution, implying that the largest communities may be the motifs of the most ancient binding sites and they many have served as structural scaffolds for other binding sites. We also found that functional diversity of proteins is independent of the binding site community size, indicating that binding sites are tightly coupled to protein function. We believe that the global binding site network communities can contribute to development of new approaches to function prediction, drug discoveries, and binding site design.

Results and Discussion

We generated a weighted binding site network from an all-to-all similarity comparison of binding site structures extracted from non-redundant sets of protein structures. In the binding site network, nodes are binding site structures and similar sites are connected by edges. Figure 1 presents the workflow in a sequence of 5 steps: 1) we generated the non-redundant protein structure databases using sequence identity cutoffs of 40, 70, and 90%; 2) non-redundant binding site structures were extracted from the databases; 3) structural similarities between all binding sites were calculated using the ProBiS2, 3; 4) the similarity scores were normalized into z-scores and the pairs of binding site pairs whose z-score is higher than a pre-defined threshold were connected by edges and 5) structurally similar binding sites were classified into communities to reduce the complexity of the network. Using the identified binding site communities, we investigated their size distribution, functional enrichment, and their relationships with ligands and domain structures.

Binding site network is continuous. To identify the continuity of the binding site network, we con-

structed networks with z-score thresholds from 1.5 to 4.0 with an interval of 0.5, calculated the fraction of binding sites included in the largest connected network and counted the number of separate networks (Fig. S1). We identified that most binding sites are connected as a single continuous network until z-score reaches 2.5. At this z-score, only the 3% highest similarities are considered and 53% of the nodes belong to the largest connected network. This suggests that binding site structure space is continuous until this level of statistical significance. When more stringent z-scores are used, the network becomes scattered into small and disconnected networks. When z-score is 3.0, the largest connected network includes only 9.5% of the nodes and all binding sites are divided into 1,176 separate networks. For further analysis, we used the largest connected network generated with a sequence identity cutoff of 70% and a z-score threshold of 1.5, which contains 15,789 binding sites (99.8% of all non-redundant binding sites) and 547,305 edges.

Community structure of the binding-site network reveals the global structure of binding sites. To reduce the complexity of the network, we detected the binding sites communities, i.e., groups of

binding sites that are more closely related to each other than to the rest, using the Infomap algorithm36, 37. This procedure leads to the weighted network, in which nodes correspond to binding site communities and the widths Scientific REPOrTs | 7: 11652 | DOI:10.1038/s41598-017-10412-z

2

www.nature.com/scientificreports/

Figure 2. Binding site community network. The 39 highest similarities between binding site communities and associated 20 binding site communities using ProBiS are displayed. A node corresponds to a binding site community and its size is proportional to the number of included binding sites and the bigger nodes correspond to higher ranked communities. Node shade represents the aggregated structural similarity between binding sites in the community. Edge width is proportional to the structural similarities between communities. Node label, e g., C1.HEM.CLA, is composed of the community rank (C1 is the community of rank one) according to the number of the included binding sites, and of the PDB codes of the two most populated ligands (HEM stands for heme, CLA is chloropyll a). The binding site communities shown in this network contain 43.3% of all nonredundant existing binding sites in the PDB database. The ligand IDs associated with binding site communities from C1 to C10 are listed as follows: CIT – citric acid, AKG – alpha-ketoglutaric acid, CLA – chloropyll a, HEM – heme, GDP – guanosine-5′-diphosphate, ADP - adenosine-5′-diphosphate, IPE – isopentenyl pyrophosphate, POP – pyrophosphate 2−, AP5 - bis(adenosine)-5′-pentaphosphate, NAD - nicotinamide adenine dinucleotide, NAP - nicotinamide adenine dinucleotide phosphate, ANP - phosphoaminophosphonic acid-adenylate ester, ATP - adenosine-5′-triphosphate, SAH - S-adenosyl-L-homocysteine, SAM - S-adenosylmethionine, FAD flavin adenine dinucleotide, HEC – heme C. The full list of community detection results as well as the rest of ligand IDs and their associated names are listed in Supplementary Information. of edges represent their structural similarity (Fig. 2). This community network reveals that larger communities are located at the core region of the network and smaller communities are at peripheral region. The two largest communities (C1.HEM.CLA and C2.CIT.AKG, which are most enriched with citric acid and heme molecules) are connected with most communities indicating that they play roles of network hubs. The largest communities show significant structural similarities between themselves. The 30 highest edge weights are observed between the 31 largest communities (Fig. S2). Only a few significant inter-community similarities are not connected with C1.HEM.CLA or C2.CIT.AKG. The network of binding site communities reveals the relationships between binding sites. It is noticeable that C2.CIT.AKG has a strong connection with C4.IPE.UDP and many connections with other communities, indicating that the structure of C2.CIT.AKG may be close to the structural scaffold of many binding sites. It is also identified that a group of four communities, which interact with ligands containing adenosine, C6.NAD.NAP, C9.FAD. NAD, C8.SAH.SAM, and C11.NAD.NAP, are closely inter-connected. This suggests that these communities may be structurally diversified due to functional reasons although they interact with similar ligands. Thus, comparing the difference between these communities may provide insight into how protein structures have evolved to make similar ligands interact with various proteins in a distinctive way. To test the robustness of our findings, we generated another binding site network using a different binding site comparison program G-LoSA38 and performed the community detection on the network (Fig. S3). The largest binding site community is most enriched with heme molecules. It is also identical that the five largest communities are also the most strongly connected. To verify that the similarity between community structures obtained with ProBiS and G-LoSA are statistically significant, we calculated the normalized mutual information (NMI) between two community structures and compared it with the values obtained with randomly permutated communities. The NMI value of the two original community structures was 0.337 while that of the random permutations was 0.1226 ± 0.0012, resulting in a P-value

Senescence gives insights into the morphogenetic evolution of anamniotes.

A global view of network of lncRNAs and their binding proteins.

Mutational analysis gives insight into substrate preferences of a nucleotidyl cyclase from Mycobacterium avium.

Structural insight into the DNA-binding mode of the primosomal proteins PriA, PriB, and DnaT.

Cyanobacterial evolution: fresh insight into ancient questions.

Organization of chlorophyll biosynthesis and insertion of chlorophyll into the chlorophyll-binding proteins in chloroplasts.

Evolution and structural organization of the C proteins of paramyxovirinae.

An insight into the global burden of skin diseases.

Strongyloidiasis--an insight into its global prevalence and management.

A Flexible Binding Site Architecture Provides New Insights into CcpA Global Regulation in Gram-Positive Bacteria.

Dynamics of Transcription Factor Binding Site Evolution.

Behaviors of cavefish offer insight into developmental evolution.

Global relationships in fluctuation and response in adaptive evolution.

Marsupial genome sequences: providing insight into evolution and disease.

Self-organization of network dynamics into local quantized states.

Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design.

Mouse matrix metalloprotease-1a (Mmp1a) gives new insight into MMP function.

Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure.

From local to global changes in proteins: a network view.

Identification of major zinc-binding proteins from a marine cyanobacterium: insight into metal uptake in oligotrophic environments.

Directed evolution provides insight into conformational substrate sampling by SrtA.

Key network approach reveals new insight into Alzheimer's disease.

Insight into the molecular mechanism of a herbal injection by integrating network pharmacology and in vitro.

Genetic comparisons yield insight into the evolution of enamel thickness during human evolution.