DrugNet: network-based drug-disease prioritization by integrating heterogeneous data.

Accepted Manuscript

Title: DrugNet: network-based drug-disease prioritization by integrating heterogeneous data
Authors: Víctor Martínez, Carmen Navarro, Carlos Cano, Waldo Fajardo, Armando Blanco
PII: S0933-3657(14)00144-4
DOI: http://dx.doi.org/10.1016/j.artmed.2014.11.003
Reference: ARTMED 1379
To appear in: ARTMED
Received date: 17-12-2013
Revised date: 5-11-2014
Accepted date: 12-11-2014

Please cite this article as: Víctor Martínez, Carmen Navarro, Carlos Cano, Waldo Fajardo, Armando Blanco, DrugNet: network-based drug-disease prioritization by integrating heterogeneous data, Artificial Intelligence In Medicine (2015), http://dx.doi.org/10.1016/j.artmed.2014.11.003

Víctor Martínez, Carmen Navarro, Carlos Cano, Waldo Fajardo, Armando Blanco

Department of Computer Science and Artificial Intelligence, University of Granada, C/ Daniel Saucedo Aranda S.N., 18071 Granada, Spain

Abstract

Objective: Computational drug repositioning can lead to a considerable reduction in cost and time in any drug development process. Recent approaches have addressed the network-based nature of biological information for performing complex prioritization tasks. In this work, we propose a new methodology based on heterogeneous network prioritization that can aid researchers in the drug repositioning process.

Methods: We have developed DrugNet, a new methodology for drug-disease and disease-drug prioritization. Our approach is based on a network-based prioritization method called ProphNet which has recently been developed by the authors. ProphNet is able to integrate data from complex networks involving a wide range of types of elements and interactions. In this work, we built a network of interconnected drugs, proteins and diseases and applied DrugNet to different types of tests for drug repositioning.

Results: We tested the performance of our approach on different validation tests, including cross validation and tests based on real clinical trials. DrugNet achieved a mean AUC value of 0.9552 ± 0.0015 in 5-fold cross validation tests, and a mean AUC value of 0.8364 for tests based on recent clinical trials (phases 0-4) not present in our data. These results suggest that DrugNet could be very useful for discovering new drug uses. We also studied specific cases of particular interest, proving the benefits of heterogeneous data integration in this problem.

Conclusions: Our methodology suggests that new drugs can be repositioned by generating ranked lists of drugs based on a given disease query or vice versa. Our study shows that the simultaneous integration of information about diseases, drugs and targets can lead to a significant improvement in drug repositioning tasks. DrugNet is available as a web tool from http://genome2.ugr.es/drugnet/ (accessed: 23 September 2014). Matlab source code is also available on the website.

Keywords: drug repositioning, network-based prioritization, disease networks, data integration, flow propagation

Preprint submitted to Artificial Intelligence in Medicine, November 5, 2014

1. Introduction

The development of a new drug is a risky process, lasting approximately 15 years and costing between $800 million and $1 billion [1]. While the cost of drug development is rising, the number of approved drugs has remained constant in recent years [2]. Finding new applications for already commercialized drugs can, however, help the pharmaceutical industry reduce costs and overcome some of the obstacles of this difficult process. Since commercialized drugs have already passed various clinical trials, they are very likely to succeed in terms of safety. Finding new indications for already commercialized drugs is a field known as drug repositioning [3].

Sildenafil (Viagra) and Minoxidil are classic examples of what can be achieved with drug repositioning. Sildenafil, originally developed to improve coronary blood flow, was repurposed to treat erectile dysfunction with great commercial success. Minoxidil was marketed to treat high blood pressure and was later relaunched to treat androgenic alopecia. Given these precedents, it is not surprising that in 2009, 30% of approved drugs were repositioned drugs for treating new indications [3, 4].

Drug repositioning can bring many advantages to the drug development pipeline. A recent report indicates that launching a reformulation of an existing drug is almost 5 times more expensive than relaunching a repositioned drug [5].

Furthermore, repositioned drugs have already passed safety tests, while around 30% of new drug failures are due to safety issues found in clinical trials [5]. Another key aspect of drug repositioning is the potential of drug candidates which, although having been found safe in man, have been discontinued from Phase 2 or 3 clinical development for lack of efficacy in the therapeutic indication for which they had originally been conceived [6].

Although the ultimate validation of the repositioned drug is the most relevant and expensive stage in drug repositioning, finding promising candidate drugs in early research stages might be key for a successful process. In silico methods (for a review, see [1]) are powerful tools that enable implicit or unknown relations to be found between drugs and targets. Since these methods are decision support systems that prioritize hypotheses to suggest promising candidates, experts can focus on these candidates and discuss their potential and the viability of their validation.

Existing approaches for in silico drug repositioning can be grouped into two main categories: methods which focus on composition (either the chemical or molecular features of the drug) and those based on exploiting knowledge about the diseases, their underlying processes or their symptomatology.

In terms of the first category, these methods may relate drugs based on shared quantitative chemical features from both drugs and targets [7]. Gene expression profiles are broadly used in this area [8–12].

Another way to relate drugs and targets is to simulate direct physical interaction between drug compounds and targets, a process which is also known as molecular docking (for a review, see [13]).

Many of these approaches usually result in new inferred single drug-target relations. However, the efficacy of many drugs relies on their interaction with several targets [14]. Polypharmacology [15] attempts to unveil drug interactions with multiple targets. Since polypharmacological approaches aim to discover the unknown off-targets for existing drugs [15], computational drug repositioning is also applied to this area [16]. However, although many current efforts focus on the target polypharmacology of selected pathways or gene families, the polypharmacological drug phenotype response can be predicted only on the basis of the quantitative and systematic analysis of drug-protein interactions on a proteome-wide scale. Therefore, new computational tools are required to analyze how multitarget drugs perturb target-disease networks on a proteome-wide scale [17].

Disease-focused proposals, on the other hand, attempt to relate drugs and diseases based on existing knowledge about symptomatology, known treatments or pathological information. For example, Promiscuous [18] and [19] relate diseases and drugs by considering drug side-effect information, since side effects are associated with certain symptomatology, which is sometimes shared with diseases. Other methods such as PREDICT identify drug-disease and disease-disease relations using text mining techniques [20].

Chiang et al. [21] base their proposal on the ‘guilt-by-association’ (GBA) principle. This principle states that biological entities showing a similar behaviour or sharing properties are more likely to be related. The method proposed by Chiang et al. assumes that two diseases are related when a similar treatment is prescribed for both (i.e. prescriptions share a considerable subset of drugs).

Methods based on the GBA principle [22–24] represent relations between different biological elements such as genes, proteins, drugs or diseases. These relationships are usually directly extracted from biological databases.

Nevertheless, relations between biological entities are complex and mostly network-structured, and it is not yet clear which of the previous approaches is best, if any [25]. Although molecular-based proposals are powerful, the chemical basis of a number of diseases and the targets of many drugs are still unclear. Molecular docking methods require the 3D structure of the interacting molecules to be well resolved and available, which is not always the case.

In addition, many diseases affect different tissues and organ systems, making it difficult to infer drug-target relations based on a single molecular activity profile. Finally, there are phenotypes that cannot be easily predicted solely from chemical features. Symptomatology-based approaches, on the other hand, fit better when there is a lack of knowledge about the molecular processes underlying the query disease but there are clinical data about observations in patients affected by the query disease. This approach is therefore usually applied to identify drugs to treat newly discovered, poorly studied and rare diseases (i.e. diseases that affect a small percentage of the population) where only the symptoms of the disease are known. Symptomatology-based approaches complement other approaches which are based purely on genomic data.

Despite the number of different approaches, it seems clear that due to the network-shaped nature of biological information, graph-based approaches are well suited for extracting knowledge from existing biological data [26]. This is also reflected in the wide range of network-based approaches existing in the drug repositioning field [7, 9, 10, 27, 28]. Although these methods output novel individual drug-target relations, they present limitations when applied to complex multifactorial diseases that depend on many targets [14]. Moreover, as far as we are aware, there are currently no approaches that take into account relationships between diseases (disease networks) in order to perform drug-disease prioritization. These relationships have proved critical for understanding the multifactorial nature of complex disorders [29].

In this work, we propose DrugNet, a network-based drug repositioning prioritization method that simultaneously integrates information on diseases, drugs and targets (proteins) to perform drug-disease and disease-drug prioritization.

Our approach is based on ProphNet, a general network-based prioritization tool which has shown excellent results for gene-disease prioritization in our previous work [30]. ProphNet implements a propagation flow algorithm that enables biological entities to be prioritized according to their interconnections in complex networks. In this work, we applied ProphNet on two different complex networks: one comprising interconnected drugs and diseases and the other with drugs, diseases and proteins (shown in Figure 1).

We validated the proposed methodology and network configurations by performing various tests. First, we ran two leave-one-out (LOO) variants: the classical approach and a new approach to reduce the influence of the direct neighborhood on the results. A 5-fold cross-validation test was also performed to prove the robustness of our method. Furthermore, we extracted a collection of drug-disease relations for drugs currently under clinical trial and checked how well our method prioritized them. Finally, we studied DrugNet predictions for certain drugs that had recently been repositioned to new uses which were not explicitly represented in our data.

This document is organized as follows: Section 2 outlines the proposed methodology; Section 3 details the data sources used to build the different networks used in this study; Section 4 presents the validation tests performed, including tests with real clinical trials, and explains the results; and finally Section 5 examines our conclusions and future lines of research.

2. Methods

ProphNet [30] uses complex networks of interactions or similarities between biological elements (such as proteins, drugs or diseases) to rank the nodes in these networks according to their distance to a query set of nodes. With this aim, ProphNet implements a flow propagation algorithm that takes into account the entire network topology to compute the global distance measurements.

In order to apply ProphNet for drug repositioning, we first need to define and build the data networks to which the algorithm is applied. This data representation considers one network for each type of entity (e.g. one network modelling protein-protein interactions, one for drug-drug relations, etc.). Each network node v represents a biological entity (e.g. a drug, a disease or a protein) labelled with a value Ψ(v). Nodes in the networks are connected by weighted arcs representing interactions or relationships between the connected pair of nodes. A high weight indicates a high similarity or strong interaction between the connected nodes. There are two types of networks: networks which represent relations or interactions between elements of the same type (e.g. protein-protein networks) and networks which represent relations or interactions between elements of two different types (e.g. drug-disease networks).

Each network is represented as a normalized adjacency matrix. The normalization takes into account the degree of each node to limit the impact of hub nodes in the prioritization process. Each adjacency matrix A is normalized as:

\[ A_{norm} = D_G^{1} \, A \, D_G^{2}, \]

where \(D_G^{1}\) and \(D_G^{2}\) are diagonal matrices where each component is defined as:

\[ (D_G^{1})_{jj} = \frac{1}{\sqrt{\sum_{k=1}^{c} A_{jk}}}, \quad j = 1, \dots, r \]

\[ (D_G^{2})_{kk} = \frac{1}{\sqrt{\sum_{j=1}^{r} A_{jk}}}, \quad k = 1, \dots, c. \]
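As a rough illustration of this normalization, the following Python sketch scales each entry of A by the inverse square roots of its row sum and its column sum. The function name `normalize_adjacency`, the toy matrix and the treatment of zero-degree nodes (their entries are left at zero) are assumptions made for this example and are not taken from the DrugNet source code.

```python
import math


def normalize_adjacency(A):
    """Return D1 * A * D2, with D1_jj = 1/sqrt(row sum j) and D2_kk = 1/sqrt(column sum k)."""
    r, c = len(A), len(A[0])
    row_sums = [sum(A[j]) for j in range(r)]
    col_sums = [sum(A[j][k] for j in range(r)) for k in range(c)]
    # Assumption: nodes with no connections keep zero entries instead of dividing by zero.
    d1 = [1.0 / math.sqrt(s) if s > 0 else 0.0 for s in row_sums]
    d2 = [1.0 / math.sqrt(s) if s > 0 else 0.0 for s in col_sums]
    return [[d1[j] * A[j][k] * d2[k] for k in range(c)] for j in range(r)]


# Toy network: node 0 is a hub connected to nodes 1 and 2.
A = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
A_norm = normalize_adjacency(A)
```

In this toy case each hub edge is down-weighted from 1 to 1/sqrt(2), which is the intended effect of limiting the influence of highly connected nodes.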

These interconnected networks define a global graph. Our goal is to measure the relationship degree between two sets of nodes (called the query set and target set, respectively) from two different networks (called the query network and target network, respectively) in this global graph. The query set (Q) is provided by the user as the input (e.g. a set of drugs or diseases of interest) while the target set (T) is iteratively established by ProphNet (to find the most strongly related diseases or drugs, respectively). Nodes in Q are initially set to Ψ(v) = 1/|Q| ∀v ∈ Q and nodes in T to Ψ(v) = 1 ∀v ∈ T. The remaining nodes are initially set to zero. The value Ψ(v) assigned to a node indicates the similarity of the node to those in T or in Q, depending on whether v belongs to the target network or to any other network, respectively.

We define a path connecting the query network and the target network as a path of networks (not a path of nodes) which makes it possible to get from Q to T. Two propagation operations are defined: “propagation within a network” and “network-to-network propagation”. The first operation enables node values to be propagated within a specified network using the propagation flow algorithm [31, 32]. This algorithm is performed by iteratively applying

\[ x_{i+1} = \alpha M x_i + (1 - \alpha) x_0, \]

where α is a parameter which determines the importance of the prior information in the network, M is the normalized adjacency matrix of the network, x_i is a vector representing network node values at iteration i and x_0 is the vector of initial node values (the prior information). The normalization previously described for M guarantees the convergence of this iterative process [32]. We can consider this propagation to be an iterative process in which each node pumps prior information to its neighbors.
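The iteration for propagation within a network can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the value alpha = 0.5, the stopping tolerance, the iteration cap and the three-node matrix are all chosen for illustration only, since the text does not specify them.

```python
import math


def propagate_within_network(M, x0, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate x_{i+1} = alpha * M * x_i + (1 - alpha) * x_0 until the values stop changing."""
    n = len(x0)
    x = list(x0)
    for _ in range(max_iter):
        nxt = [alpha * sum(M[i][j] * x[j] for j in range(n)) + (1 - alpha) * x0[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Degree-normalized matrix of a toy star network (node 0 is the hub).
a = 1 / math.sqrt(2)
M = [[0.0, a, a],
     [a, 0.0, 0.0],
     [a, 0.0, 0.0]]
x0 = [0.0, 1.0, 0.0]  # node 1 is the only query node
x = propagate_within_network(M, x0, alpha=0.5)
```

After convergence the query node keeps the highest value, the hub receives part of it directly, and node 2 receives a smaller share through the hub, which matches the intuition of nodes pumping prior information to their neighbors.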

The second operation (“network-to-network propagation”) enables values to be propagated from the current network to the following network in the path by assigning to each node v from the following network a value computed as:

\[ \Psi(v) = \frac{\sum_{x \in \operatorname{neig}(v)} \Psi(x)}{|\operatorname{neig}(v)|}, \]

where neig(v) is the set of nodes from the current network which are connected with node v in the following network. This operation pumps information from the current network to the next network in the path.

Initially, node values in the query set are propagated within the query network using the “propagation within a network” operation. The same process is performed in the target network to propagate values from the target set of nodes, obtaining a vector t which compiles the values of all the nodes in the target network after this propagation. Let p be the number of paths of networks connecting the query network to the target network. The following steps of the propagation process consist in alternately applying the “network-to-network propagation” and the “propagation within a network” operations to the networks along the p different paths, until all the networks adjacent to the target network have been reached.

Finally, the vectors representing the node values of the adjacent networks for each path are multiplied by the normalized adjacency matrix of the network connecting the adjacent network with the target network. A set of vectors, each of the same length as the number of nodes in the target network, is obtained in this step. These vectors are concatenated to obtain a single vector \(\hat{x}\). A vector \(\hat{t}\) is also obtained as the result of concatenating p times the vector t.

The similarity between the query set and the target set is measured by correlating vectors \(\hat{x}\) and \(\hat{t}\):

\[ s = \operatorname{corr}(\hat{x}, \hat{t}), \]

where corr is Pearson’s correlation. The value s is used as a score to determine the degree of relationship between the query and target sets.
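The averaging rule for network-to-network propagation and the final Pearson score can be illustrated as follows. This is a simplified sketch, not the ProphNet implementation: it uses a single path, skips the multiplication by the normalized adjacency matrix and the concatenation over paths, and all node names and values (`d1`, `p1`, the vector `t_hat`) are hypothetical.

```python
import math


def network_to_network(values, links):
    """Give each node v of the next network the mean value of its neighbours in the current one.

    values: dict mapping current-network nodes to their value
    links:  dict mapping next-network nodes to the list of their current-network neighbours
    """
    return {v: (sum(values[x] for x in neig) / len(neig) if neig else 0.0)
            for v, neig in links.items()}


def pearson(x, t):
    """Pearson correlation of two equally long vectors (0.0 if either is constant)."""
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    cov = sum((a - mx) * (b - mt) for a, b in zip(x, t))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    st = math.sqrt(sum((b - mt) ** 2 for b in t))
    return cov / (sx * st) if sx > 0 and st > 0 else 0.0


drug_values = {"d1": 0.6, "d2": 0.2}
protein_links = {"p1": ["d1"], "p2": ["d1", "d2"], "p3": ["d2"]}
protein_values = network_to_network(drug_values, protein_links)

# Hypothetical vectors standing in for the concatenated x-hat and t-hat.
x_hat = [protein_values["p1"], protein_values["p2"], protein_values["p3"]]
t_hat = [1.0, 0.5, 0.0]
s = pearson(x_hat, t_hat)
```

Because the two toy vectors decrease linearly together, the score is 1, the highest possible degree of relationship under this measure.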

In order to score each entity in the target network according to the degree of relationship to the query set, each node from the target network is iteratively set as the target set and its score is computed using the method described above. Since the target network may comprise a large number of nodes (even thousands) and the “propagation within a network” is a computationally expensive operation, this iterative process could significantly increase the computational cost of the algorithm. It can, however, be optimized by precomputing the result of the propagation in the target network for each node. Finally, prioritized lists are obtained by sorting all the target node scores in decreasing order. The complete pseudocode of the algorithm is shown in Algorithm 1.

In order to illustrate our method, Figure 2 represents different steps of a simplified prioritization example. The figure shows two different runs on the same data network using the same target set but different query sets. Triangles represent drugs, rounded squares represent proteins and circles represent diseases. The width of each edge represents the strength of the relation or interaction and the color of the node represents its assigned score, ranging from 0 (white) to 1 (black). The query set (with only one node in this case) and target set (iteratively selected by the algorithm) are initially set to 1 (node initialization). Drug node values are propagated from the query nodes within the drug subnetwork. The same step is performed in the target network with the target set (propagation within a network). All the paths from the query network to the target network are computed (only two in this case) and values from the query network are propagated along these paths by alternating the two propagation operations as described above (network-to-network propagation and propagation within a network). Finally, the score is computed as the similarity between the values of the nodes in the target network and their direct neighbors in adjacent networks (node value correlation). The first example (A) shows a case with high similarity as opposed to the second example (B).


3. Materials

The drug network has been built using DrugBank 3.0 drug entries [33]. We computed the similarity between every pair of drugs using Lin’s node-based similarity [34] on the annotations for the drugs. The anatomical therapeutic chemical (ATC) codes are a classification system for drugs controlled by the WHO Collaborating Centre for Drug Statistics Methodology (WHOCC). This system divides drugs into different groups according to the organ or system on which they act and/or their therapeutic and chemical characteristics. Based on the implicit ontology defined by the ATC codes, and given that a drug can have several annotated ATC codes, the semantic similarity between drug i and drug j is computed as:

\[ \operatorname{similarity}(i, j) = 1 - \frac{\sum_{k=1}^{|C^i|} Lin_{dist}(c_k^i, c_z^j)}{|C^i|}, \]

where

\[ Lin_{dist}(u, v) = 1 - Lin_{sim}(u, v), \]

with \(Lin_{sim}\) as Lin’s semantic similarity, \(C^i = \{c_1^i, \dots, c_{|C^i|}^i\}\) and \(z = \arg\min_{x \in C^j} Lin_{dist}(c_k^i, x)\). Only approved drugs with at least one ATC code assigned were considered. The resultant drug network contains 1490 drugs.
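The aggregation over ATC codes can be sketched as below: every code of drug i is paired with its closest code of drug j, and the resulting distances are averaged. The code-level measure is passed in as a function. Note that `toy_sim` is only a hypothetical stand-in based on shared code prefixes and is not Lin's information-content measure. The ATC-like strings are illustrative and are not claimed to describe any particular drug.

```python
def drug_similarity(codes_i, codes_j, code_sim):
    """1 - mean over the codes of drug i of the distance to the closest code of drug j."""
    total = 0.0
    for c in codes_i:
        total += min(1.0 - code_sim(c, x) for x in codes_j)
    return 1.0 - total / len(codes_i)


def toy_sim(u, v):
    """Stand-in similarity (NOT Lin's measure): shared prefix length over the 7-character code."""
    shared = 0
    for a, b in zip(u, v):
        if a != b:
            break
        shared += 1
    return shared / 7.0


same = drug_similarity(["A01AA01"], ["A01AA01"], toy_sim)
mixed = drug_similarity(["A01AA01", "B01AC06"], ["A01AB03"], toy_sim)
reverse = drug_similarity(["A01AB03"], ["A01AA01", "B01AC06"], toy_sim)
```

As written, the formula averages only over the codes of drug i, so in this sketch the value for (i, j) can differ from the value for (j, i), as `mixed` and `reverse` show.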

The disease network contains 4517 diseases and it was constructed using the disease ontology (DO) [35]. In order to avoid redundancy among disease nodes in our network, only leaf nodes are considered. The similarity between disease i and disease j is computed as \(Lin_{sim}(i, j)\).

The protein network was extracted from BioGRID 3.2 [36]. A network of 18107 proteins with 136867 unique interactions was obtained.

The drug-disease network was derived from the indications field in DrugBank. A disease is associated with one specific drug if its name is contained in the indications field for the drug. All variants of the name of the disease and permutations of the tokens that form the name were considered. A total of 1008 relations were extracted. These relationships can be downloaded from the supplementary material.
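One way to picture this name matching is the following sketch, which checks whether any token permutation of a disease name appears in a free-text indication. The tokenization (lowercasing, dropping punctuation), the whole-word matching and the cap on the number of tokens are assumptions made for this example; the text does not describe these details.

```python
import re
from itertools import permutations


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def name_variants(name, max_tokens=5):
    """All token orderings of a name; long names are kept as-is to avoid a factorial blow-up."""
    toks = tokens(name)
    if not toks:
        return set()
    if len(toks) > max_tokens:
        return {" ".join(toks)}
    return {" ".join(p) for p in permutations(toks)}


def mentions(disease_names, indication_text):
    """True if any variant of any given name occurs as whole words in the indication text."""
    padded = " " + " ".join(tokens(indication_text)) + " "
    return any(f" {variant} " in padded
               for name in disease_names
               for variant in name_variants(name))


hit = mentions(["pulmonary hypertension"], "Indicated for hypertension, pulmonary, in adults")
miss = mentions(["asthma"], "Indicated for hypertension")
```

Passing several names per disease (for example its synonyms) covers the name variants mentioned in the text.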

The drug-protein relations were extracted by matching protein symbols and synonyms from the protein network with targets annotated in DrugBank and 4026 relations were obtained.

The disease-protein associations were directly extracted from the disease and gene annotations (DGA) [37] and 11658 relations were obtained.

4. Results and Discussion

DrugNet was tested on two different network configurations: the first considered drugs and diseases, and the second considered drug, disease and protein networks.

Different validation tests were applied to measure the performance of our approach with the two configurations: firstly, two different variants of LOO validation tests were applied; and secondly, a 5-fold cross-validation for the best configuration selected in the LOO tests was performed. Finally, we studied how well DrugNet predicted new, recently discovered drug uses (reported in literature and clinical trials) which were not present in our data networks.

4.1. Cross-validation tests

Leave-one-out validations were run to determine the performance of DrugNet for the two different network configurations under study. LOO tests consisted of 1008 test cases (one for each explicit drug-disease relationship in the global graph). Two LOO variants were considered: standard LOO and advanced LOO. The standard LOO test was performed for each test case by removing one explicit drug-disease relation from the global graph, taking the drug as the query set and checking the resultant disease ranking to measure performance. The advanced LOO test removed all drug-disease direct relations in which either the drug or the disease of interest was involved. The advanced LOO test is more demanding and attempts to avoid excessive redundancy based on communities of associated drugs and diseases.

ROC curves [38, 39] were plotted for each LOO validation test. A ROC curve was created by plotting the fraction of true positives out of the positives against the fraction of false positives out of the negatives at various threshold settings. A true positive occurred when the rank of the case disease was below the threshold. A false positive occurred when a disease not in the case was ranked below the threshold. ROC curves are a useful tool for observing the performance of a binary classifier system as its discrimination threshold is varied. By using ROC curves, relative results can be compared in order to obtain an optimal prioritization model.
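The difference between the two LOO variants comes down to which drug-disease relations are hidden before a test case is run. A minimal sketch, assuming the relations are stored as a set of (drug, disease) pairs with hypothetical identifiers:

```python
def standard_loo(relations, drug, disease):
    """Hide only the drug-disease relation under test."""
    return relations - {(drug, disease)}


def advanced_loo(relations, drug, disease):
    """Hide every direct relation that involves either the test drug or the test disease."""
    return {(d, s) for (d, s) in relations if d != drug and s != disease}


relations = {("d1", "s1"), ("d1", "s2"), ("d2", "s1"), ("d2", "s3")}
kept_standard = standard_loo(relations, "d1", "s1")
kept_advanced = advanced_loo(relations, "d1", "s1")
```

In the advanced variant only ("d2", "s3") survives, so the prediction for the pair (d1, s1) has to rely on the similarity networks and, where present, the protein network rather than on neighbouring drug-disease relations.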

The area under the ROC curve (AUC value) was also computed to quantify gains. This value is equal to the probability that the classifier will rank a randomly chosen positive instance better than a randomly chosen negative one. This measure discards some of the information conveyed by the ROC curves but enables quantitative comparisons to be performed.
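The probabilistic reading of the AUC can be computed directly by comparing every positive with every negative. The sketch below assumes that a higher score means a better rank and counts ties as half a win, which is a common convention and not something stated in the text. The node names and scores are invented for the example.

```python
def auc(scores, positives):
    """Fraction of (positive, negative) pairs in which the positive has the higher score."""
    pos = [s for node, s in scores.items() if node in positives]
    neg = [s for node, s in scores.items() if node not in positives]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
value = auc(scores, positives={"a", "c"})
```

Here three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.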

The accuracy of the ranking for the two different network configurations was measured. Results were summarized using ROC curves in Figure 3. Both configurations obtained a very high AUC value in the standard LOO test: 0.9504 for drugs-diseases and 0.9579 for drugs-proteins-diseases. The true benefits of adding the protein network can be seen in the advanced LOO test: 0.8041 for drugs-diseases alone and 0.8692 for drugs-proteins-diseases. The addition of protein-protein interactions together with protein-drug and protein-disease relationships markedly increased the robustness of the method in the absence of drug-disease relations that could provide relevant information for prioritization. The differences in terms of ranking values between the two configurations were statistically significant (p-values computed with t-tests and corrected with Bonferroni). The results therefore showed that DrugNet achieved the best performance in drug repositioning when applied to a network integrating drugs, diseases and targets. These results highlight the strong relation between similar drugs and similar diseases sharing targets.

Five independent 5-fold cross-validation tests were also performed for the drug-disease-protein network since this configuration was shown to provide significantly better results in the previous tests. The results are shown in Figure 4. The partitions for each test were randomly generated. The different runs reported very good, similar results, supporting the robustness of our method. The mean AUC value obtained was 0.9552 ± 0.0015. Furthermore, the box plot showed that the AUC values for most results were close to 1, while fewer cases performed much worse and reduced average performance. A detailed study is needed to determine the cause of the poor performance of the method for these outliers.

4.2. Clinical trials

In order to validate the results obtained by the best network configuration for drug repositioning in real cases, DrugNet was applied to prioritize relations not explicitly present in our data and derived from clinical trials at ClinicalTrials.gov.

ClinicalTrials.gov is a service of the U.S. National Institutes of Health with a database of clinical studies performed on humans around the world. A clinical trial is an interventional study where participants receive specific interventions, in our case a specific set of drugs or a placebo. Researchers monitor each participant so as to determine drug safety and efficacy. There are five phases to each clinical trial, from Phase 0 (exploratory studies with limited exposure to the drug) to Phase 4 (studies occurring after drug approval).

We retrieved 11992 drug-disease relations from ClinicalTrials.gov which were not explicitly present in our datasets. The diseases in clinical trial entries were matched to the disease nodes in our disease network using the method described in Section 3 (Materials). Drug-disease relations were then extracted by searching for disease names and synonyms in the clinical trial condition field. Drugs were directly matched by searching for the drug name in the intervention field. Only 8217 of these relations were unique drug-disease pairs, since repeated drug-disease relations were reported for different phases of the study.

Table 1 summarizes DrugNet results and Figure 5 shows the test ROC curve. Drugs in the earlier stages of clinical trials have a high risk of failure due to toxicity or lack of efficacy. In fact, a very high proportion of drugs being developed (around 90%) fail during Phase 1 of the clinical trials [40], when a large amount of money has already been invested in their development. As the table shows, our results were better for clinical trials in more advanced phases. Predictions made by our approach therefore seem reliable and could considerably reduce costs.

4.3. Case studies

In order to illustrate how our method can be used in real environments, we applied our methodology to find new uses for already approved drugs. We queried for recently repositioned drugs and checked the rank of the disease for which the query drug had recently begun to be used. It is important to note that these new uses were not explicitly specified in our data by drug-disease relationships. This study revealed some very interesting and promising results.

Methotrexate is an antimetabolite and antifolate drug initially used in the treatment of cancer. In recent years, various studies have shown that it is also effective in the treatment of Crohn’s disease [41]. Crohn’s disease ranked 17th in DrugNet results.

Colesevelam was originally developed as a low-density lipoprotein cholesterol (LDL-C) lowering agent for patients with primary hyperlipidemia. Further studies proved that colesevelam could be used in Type 2 diabetes mellitus to lower the rate of hypoglycemia [42]. Hypoglycemia ranked 10th in DrugNet for Colesevelam.

Gabapentin was initially developed to treat epilepsy and is currently also being used to reduce neuropathic pain. Years of studies have shown that it is also useful in anxiety disorders [43]. Our method ranked generalized anxiety disorder 8th.

Cisplatin has been used to treat different types of cancers such as testicular, bladder, ovarian and lung cancer, but not to routinely treat breast cancer. However, recent studies have proven that this drug reduces BRCA1 expression in triple-negative breast cancers [44]. Our method ranked breast lymphoma 19th and malignant breast melanoma 20th.

Donepezil is a centrally acting reversible acetylcholinesterase inhibitor which has mainly been used for the treatment of Alzheimer’s disease. It has recently been successfully used to treat Parkinson’s disease [45]. Our method ranked Parkinson’s disease 6th.

Risperidone is an atypical antipsychotic drug used to treat schizophrenia. Recent studies have shown how it can help reduce agitation, psychosis, sleep disturbance and rapid cycling in obsessive-compulsive disorder patients [46]. Our method ranked obsessive-compulsive disorder 7th.

4.4. Comparison with other methods

We performed a manual validation of interesting cases by comparing our results with those reported by PREDICT (Table S2, Supplementary Material) [20]. This comparison was not without limitations. Firstly, a comprehensive automated comparison was not possible since PREDICT is not available, either as source code or as a web tool, so it cannot be run on demand for a given user query. Secondly, the use of different databases for building the networks (e.g. Disease Ontology or OMIM for the disease network) produces terminology issues, since the terms derived from these databases do not always match. We therefore carried out an exhaustive, manual review of the literature in order to find evidence supporting the results obtained by both PREDICT and DrugNet for a limited set of queries. In the following sections, we shall present some of the results that suggest that DrugNet reported better rankings for new drug-disease repositioning hypotheses.

We first queried PREDICT and DrugNet for new drug uses. For example, PREDICT only suggests Nedocromil to treat atopic dermatitis. This hypothesis is validated in the literature [47] and tallies with DrugNet’s top ranked prediction. However, DrugNet suggests other diseases may also be treated with Nedocromil. For example, bronchitis [48] and allergic rhinitis [49] are ranked 2nd and 3rd by DrugNet, respectively. Something similar occurs for Tiludronate. While PREDICT only suggests a variant of acroosteolysis with osteoporosis as a potential target of Tiludronate, DrugNet suggests osteoporosis [50] (ranked 7th), Paget’s disease of bone [51] (ranked 1st), osteoarthritis [52] (ranked 3rd) or hypercalcemia [53] (ranked 5th), among others. DrugNet therefore seems to be able to find more valid uses for the drug queries.

We also tested PREDICT and DrugNet to reposition drugs for a given disease query of interest. For example, PREDICT suggested Propylthiouracil, Methimazole, Meprobamate, Tizanidine and Cyclobenzaprine for treating myxedema. Among these, DrugNet only suggested Methimazole (ranked 3rd) and Propylthiouracil (7th), which were shown to be related to myxedema [54], but not the other drugs, for which we found no evidence of any relation with myxedema in the literature. In addition, other drugs suggested by DrugNet for treating myxedema are Levothyroxine [55] (ranked 1st), Desmopressin [56] (ranked 5th) and Vasopressin [57] (ranked 8th).

A more exhaustive comparison between prioritization tools remains an open issue, since it requires the authors of this pioneering work to make their tools publicly available.

te

4.5. Limitations of our approach

Although the data sources we integrated into DrugNet were the most comprehensive we could find to fit our purposes, we performed various tests to check whether DrugNet presented limitations in terms of biases inherently present in these sources.

We first checked whether certain drug categories (ATC codes in our case) performed better than others. To this end, we carried out two cross-validation tests (leave-one-out and “advanced” leave-one-out tests) for the drug-target-disease network and measured the performance of our method when ranking the target disease after querying for a drug from a specific ATC category. These results are available in the Supplementary Material File 1.

These results showed a better performance in the LOO test for drugs with the ATC codes “Respiratory system”, “Musculo-skeletal system” and “Nervous system”. Although the node degree and other properties of the drugs in these categories were not different from others (see Table 1, Supplementary Material), these categories exhibited a higher number of “disease to drugs” connections (number of links connecting the target disease with different drug nodes), which might contribute to this better performance. Since we query the system for a drug of interest, the propagation flow goes from drugs to diseases (targets), so the more connections a target presents (higher “disease to drugs”), the higher the information signal it gathers from the drug network. It should be noted that this is not the case for the feature “drug to diseases” (number of links connecting a query drug with different disease nodes), since a higher value of this feature implies that the signal is split among more recipients (diseases), so it does not correlate with better results.
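The two features discussed above can be computed directly from a list of drug-disease links. The following Python sketch is illustrative only: the function name `degree_features` and the toy links are assumptions made for this example and are not part of DrugNet.

```python
from collections import defaultdict


def degree_features(edges):
    """Count distinct partners per drug and per disease.

    `edges` is an iterable of (drug, disease) pairs.
    Returns two dicts: "drug to diseases" and "disease to drugs".
    """
    drug_to_diseases = defaultdict(set)
    disease_to_drugs = defaultdict(set)
    for drug, disease in edges:
        drug_to_diseases[drug].add(disease)
        disease_to_drugs[disease].add(drug)
    return (
        {drug: len(ds) for drug, ds in drug_to_diseases.items()},
        {disease: len(ds) for disease, ds in disease_to_drugs.items()},
    )


# Toy drug-disease links, invented for illustration only.
toy_edges = [
    ("drug_a", "disease_x"),
    ("drug_b", "disease_x"),
    ("drug_c", "disease_x"),
    ("drug_a", "disease_y"),
]
drug_deg, disease_deg = degree_features(toy_edges)
```

In this toy case `disease_x` has a “disease to drugs” value of 3, so it would gather signal from three drug nodes, whereas `drug_a` has a “drug to diseases” value of 2, so its signal would be split between two diseases.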

We also ran an advanced LOO test in order to avoid excessive redundancy arising from the existence of communities of associated drugs and diseases. For the advanced LOO test, the best performing categories were “antiparasitic products, insecticides and repellents”, “nervous system” and “antiinfectives for systemic use”. It is worth mentioning that there was a significant difference between results for the simple and advanced LOO tests. This is due to the fact that drugs from ATC categories with more connections between drugs and diseases (high values of “drug to diseases” and “disease to drugs”) are more penalized in the advanced LOO test, since these connections are removed from the network. This is the case for “respiratory system”, which showed the best performance in the simple LOO tests and dropped significantly in the advanced LOO tests (from an average ranking position of 75 in the LOO tests to an average position of 493 in the advanced LOO tests, see Table 1, Supplementary Material). This category showed one of the highest numbers of connections between drugs and diseases (average sum of 28 connections), and these connections were removed in the cross-validation performed by the advanced LOO tests, which explains this drop in performance. On the other hand, the average drop for “antiparasitic products, insecticides and repellents” was only 70 positions in the advanced LOO ranking with respect to the simple LOO results, making it the best performing category for advanced LOO tests, since this category of drugs only presented an average of 5 connections between drugs and diseases. Thus, the removal of these connections did not have as significant an impact as in the previous case.
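Under one reading of the explanation above, the advanced LOO test hides not only the link under test but every other drug-disease link that touches the query drug or the target disease. The Python sketch below contrasts the two settings under that assumption; the function names and toy links are hypothetical, and the exact protocol is the one defined in the methods, not this code.

```python
def simple_loo(edges, drug, disease):
    """Simple LOO: hide only the drug-disease link being tested."""
    return [edge for edge in edges if edge != (drug, disease)]


def advanced_loo(edges, drug, disease):
    """Assumed advanced LOO: hide every link of the query drug and of the
    target disease, so neighbouring links cannot reveal the held-out one."""
    return [(d, s) for d, s in edges if d != drug and s != disease]


# Toy drug-disease links, invented for illustration only.
edges = [
    ("drug_a", "disease_x"),  # link under test
    ("drug_a", "disease_y"),
    ("drug_b", "disease_x"),
    ("drug_b", "disease_z"),
]
kept_simple = simple_loo(edges, "drug_a", "disease_x")
kept_advanced = advanced_loo(edges, "drug_a", "disease_x")
```

A drug or disease with many links loses more edges in the second setting than in the first, which is consistent with the larger drop reported for highly connected categories.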

There was also a bias against the drugs associated with the ATC code “Various”, as shown in these results. This was to be expected since drugs in this category are more heterogeneous. Drugs in “Systemic hormonal preparations, excluding sex hormones and insulins” also obtained poor repositioning results, probably due to the significantly high value of the property “drug to diseases” (27.8 when the average value was 5.72). This implies that drugs in this category are widely used for many different diseases, and thus the information signal is spread over many diseases and the repositioning result is worse for these drugs.

We also studied possible biases which were not associated with the drug ATC code but with general topological features in other networks, such as node degree, number of shared targets for one disease-drug pair, number of drug-disease connections, etc. These experiments are available in the Supplementary Material File 2. Results suggested that the strongest bias was associated with nodes showing a high “disease to drugs” degree, i.e. a high number of connections between the target disease and the drugs. This correlation was negative, which means that the higher the number of connections between the target disease and the drugs, the better the ranking result (the lower the position in the ranking). The previous justification for the impact of the feature “disease to drugs” also applies to this case.

4.6. How to extend the data network

ProphNet enables the integration of any number of networks. For a new network of biological entities to be effectively integrated into an existing system, ProphNet requires this new network (its nodes) to have connections to (the nodes in) two other networks in the system, so there is at least one path between the query and target networks which crosses the new network. If these conditions do not apply, the new knowledge can also be integrated into ProphNet by aggregating this knowledge into existing edges from any network in the system. This aggregation can be modelled by different operators, and experimentation is required to achieve the best performance. In our case, DrugNet does not aggregate information from different sources into the same links, since each source derives a different subnetwork or its interconnections.
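The integration condition above can be checked on a small graph whose nodes are whole networks and whose edges indicate that two networks are interconnected. The Python sketch below is a minimal illustration under that modelling assumption; `simple_paths`, `contributes` and the network names are hypothetical and are not part of ProphNet.

```python
def simple_paths(links, start, goal, path=None):
    """Yield every simple path between two networks in the network-level graph."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in sorted(links.get(start, ())):
        if nxt not in path:
            yield from simple_paths(links, nxt, goal, path)


def contributes(links, new_net, query, target):
    """True if at least one query-to-target path crosses `new_net`."""
    return any(new_net in p for p in simple_paths(links, query, target))


# Network-level graph: which networks have interconnections (toy example).
links = {
    "drug": {"protein", "disease", "pathway", "side_effect"},
    "protein": {"drug", "disease"},
    "disease": {"drug", "protein", "pathway"},
    "pathway": {"drug", "disease"},
    "side_effect": {"drug"},
}
pathway_ok = contributes(links, "pathway", "drug", "disease")
side_effect_ok = contributes(links, "side_effect", "drug", "disease")
```

In this toy configuration a pathway network linked to both drugs and diseases lies on a query-to-target path, whereas a side-effect network linked only to drugs does not, so its information would have to be aggregated into existing edges instead.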

We shall illustrate how the information is integrated with two examples. Firstly, let us assume that we are interested in integrating disease pathways into our system [58]. Since these pathways can link drugs, proteins and diseases [59], a pathway network could be connected to these other networks, creating a new path to propagate information from the query to the target network. From these data, we could also derive pathway-pathway connections by defining a similarity measure between pathways. Different features can be used to build this network, such as the number of shared proteins between pathways. In this case, knowledge about disease pathways could therefore be integrated as a new network into the system.
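A pathway-pathway network based on shared proteins could be sketched as follows. The text only mentions the number of shared proteins; dividing by the size of the union (the Jaccard index) is an assumption made here so that weights fall in [0, 1], and the function name and toy pathways are hypothetical.

```python
def shared_protein_similarity(pathways):
    """Return {(p, q): weight} for pathway pairs sharing at least one protein.

    `pathways` maps a pathway name to the set of its proteins. The weight is
    the number of shared proteins divided by the size of the union (Jaccard).
    """
    names = sorted(pathways)
    weights = {}
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            shared = pathways[p] & pathways[q]
            if shared:
                union = pathways[p] | pathways[q]
                weights[(p, q)] = len(shared) / len(union)
    return weights


# Toy pathways, invented for illustration only.
toy_pathways = {
    "pathway_1": {"P1", "P2", "P3"},
    "pathway_2": {"P2", "P3", "P4"},
    "pathway_3": {"P5"},
}
pathway_edges = shared_protein_similarity(toy_pathways)
```

Here `pathway_1` and `pathway_2` share two of four distinct proteins and receive a weight of 0.5, while `pathway_3` shares no protein and stays unconnected.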

Let us now consider the case of information about the drug’s side effects. As far as we know, this information can only relate to drugs. This limits its integrability, since even if it were possible to build a new network interconnecting side effects, it would not be possible to connect its elements to elements from other networks. However, this information can still be integrated in the existing network by using side-effect information to refine the drug similarity network. A simple approach would be to increase the similarity between two drugs as they share a larger number of side effects.

4.7. Web tool

In order to make our method available to the scientific community, we have developed an easy-to-use web application. This tool allows different types of queries for drug-disease or disease-drug repositioning to be performed. Multiple input elements can be specified at the same time for complex queries. The prioritization process takes only a few seconds. This process retrieves a sorted list of elements with their relative rank, assigned score and z-score. Interesting target proteins linking drugs and diseases are also shown (if any). Results can also be obtained as an xls file (Microsoft Excel). In order to help users focus on the most interesting cases, the tool marks statistically significant elements based on Grubbs’ test for outliers [60]. A critical score with a significance level of 0.05 is computed for each search result and used as a significance threshold for calculated z-scores. In addition, results that are explicitly present in the data networks are differentiated from new relationships suggested by DrugNet, in order to allow the user to focus on previously unknown results of potential interest.
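For reference, the standard form of the Grubbs critical value can be written as follows. This is the textbook two-sided formulation; the source does not state whether the web tool uses the one-sided or the two-sided variant, so that choice is an assumption here, as are the symbols $s_i$, $\bar{s}$ and $\sigma_s$.

```latex
z_i = \frac{s_i - \bar{s}}{\sigma_s},
\qquad
G_{\mathrm{crit}} = \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N-2+t^{2}_{\alpha/(2N),\,N-2}}}
```

Here $s_i$ is the score of the $i$-th of $N$ ranked elements, $\bar{s}$ and $\sigma_s$ are the sample mean and standard deviation of the scores, and $t_{\alpha/(2N),\,N-2}$ is the upper critical value of Student's $t$ distribution with $N-2$ degrees of freedom (for a one-sided test, $\alpha/(2N)$ is replaced by $\alpha/N$). With $\alpha = 0.05$, an element would be marked as significant when $z_i > G_{\mathrm{crit}}$.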

DrugNet is freely available without prior registration from http://genome2.ugr.es/drugnet/ (Accessed: 23 September 2014). The MATLAB source code is also available on this site.

5. Conclusions

This work presents DrugNet, a new methodology for in silico drug repositioning. Our approach is based on two main ideas. The first is that biological entities interact with each other in a networked, intricate way. Consequently, any element should be observed as a connected entity interacting with its environment, rather than as an isolated element. The second is that biological information is diverse, growing and heterogeneous. Choosing the right data source for each experiment is a difficult yet extremely important step. In this case, integrating information about targets and their relationships with drugs and diseases enabled putative relations to be inferred that could not be retrieved using drug-disease relationships alone.

Furthermore, DrugNet is able to perform both drug-disease and disease-drug prioritization. This means that drugs can be queried for new indications, which can be an interesting approach in many cases, for instance when a drug has been archived due to a lack of efficacy. Our method can also perform disease-drug prioritization, which addresses the drug repositioning problem from a different perspective, providing researchers with a tool that can suggest treatments for rare diseases, where no effective drug has been found, or for improving existing treatments. Our tests did indeed reveal that DrugNet was able to identify interesting new uses for drugs by predicting relationships which had already been documented in literature or clinical trials but which were not explicitly present in our dataset.

Results showed that this approach can help identify previously unknown drug applications extremely reliably in real situations and so these methods may potentially save a large amount of resources in the drug development pipeline. Despite the fact that the safety of repositioned drugs is not usually a cause for concern since they have already undergone various safety tests before commercialization, our method generally suggests drugs that are already in advanced stages of clinical trials. This suggests that drugs repositioned by our approach are very likely not only to be safe but also effective.

Future work will include the addition of new networks such as a network of side effects and the integration of new similarity measures in a network such as the Tanimoto coefficient [61] to compute the chemical similarity between a pair of drugs. The potential of disease pathways as a source of information for building disease networks is also promising [58]. We believe that ATC-based semantic similarity combined with chemical similarity could improve the obtained results for drug repositioning. Finally, although state-of-the-art techniques have been used to implement DrugNet, such as the flow propagation algorithm, other algorithms can be tested to improve the performance of DrugNet.
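As an illustration of the Tanimoto coefficient mentioned as future work, the Python sketch below computes it for two binary fingerprints represented as sets of "on" bit positions. The representation, the function name and the convention for two empty fingerprints are assumptions for this example; how fingerprints would be derived from chemical structures is outside its scope.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is given as a collection of the positions set to 1.
    T = c / (a + b - c), where c is the number of shared "on" bits.
    """
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Toy fingerprints, invented for illustration only.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
```

The two toy fingerprints share two of six distinct bits, giving a coefficient of 1/3; identical non-empty fingerprints give 1.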


Acknowledgments

This work has been carried out as part of projects PI-0710-2013 of the Junta de Andalucia, Sevilla and TIN2013-41990-R of DGICT, Madrid. It was also supported by Plan Propio de Investigacion, University of Granada.

References

[1] J. Dudley, T. Deshpande, A. Butte, Exploiting drug-disease relationships for computational drug repositioning, Brief Bioinform 12 (4) (2011) 303–311.

[2] T. Ashburn, K. Thor, Drug repositioning: identifying and developing new uses for existing drugs, Nat Rev Drug Discov 3 (8) (2004) 673–683.

[3] D. Sardana, C. Zhu, M. Zhang, R. C. Gudivada, L. Yang, A. G. Jegga, Drug repositioning for orphan diseases, Briefings in bioinformatics 12 (4) (2011) 346–356.

[4] A. I. Graul, L. Sorbera, P. Pina, M. Tell, E. Cruces, E. Rosa, et al., The year’s new drugs & biologics-2009., Drug News Perspect 23 (1) (2010) 7–36.

[5] A. Persidis, The benefits of drug repositioning, Drug Discovery 12 (2011) 9–12.

[6] H. A. Mucke, Drug repositioning: extracting added value from prior r&d investments, Tech. rep., Cambridge Healthtech Institute (2010).


[7] F. Cheng, C. Liu, J. Jiang, W. Lu, W. Li, G. Liu, et al., Prediction of drug-target interactions and drug repositioning via network-based inference, PLoS Computational Biology 8 (5) (2012) e1002503.

[8] F. Iorio, R. Bosotti, E. Scacheri, V. Belcastro, P. Mithbaokar, R. Ferriero, et al., Discovery of drug mode of action and drug repositioning from transcriptional responses, Proceedings of the National Academy of Sciences 107 (33) (2010) 14621–14626.

[9] G. Hu, P. Agarwal, Human disease-drug network based on genomic expression profiles, PLoS One 4 (8) (2009) e6536.

[10] D. Emig, A. Ivliev, O. Pustovalova, L. Lancashire, S. Bureeva, Y. Nikolsky, et al., Drug target prediction and repositioning using an integrated network-based approach, PloS one 8 (4) (2013) e60618.

[11] F. Iorio, T. Rittman, H. Ge, M. Menden, J. Saez-Rodriguez, Transcriptional data: a new gateway to drug repositioning?, Drug discovery today 18 (7) (2013) 350–357.

[12] J. Lamb, E. Crawford, D. Peck, J. Modell, I. Blat, M. Wrobel, et al., The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease, Science 313 (5795) (2006) 1929–1935.

[13] D.-L. Ma, D. S.-H. Chan, C.-H. Leung, Drug repositioning by structure-based virtual screening, Chemical Society Reviews 42 (5) (2013) 2130–2141.

[14] J. L. Medina-Franco, M. A. Giulianotti, G. S. Welmaker, R. A. Houghten, Shifting from the single to the multitarget paradigm in drug discovery, Drug discovery today 18 (9) (2013) 495–501.

[15] A. S. Reddy, S. Zhang, Polypharmacology: drug discovery for the future, Expert review of clinical pharmacology 6 (1) (2013) 41–47.

[16] J. Achenbach, P. Tiikkainen, L. Franke, E. Proschak, Computational tools for polypharmacology and repurposing, Future Medicinal Chemistry 3 (8) (2011) 961–968.

[17] L. Xie, L. Xie, S. L. Kinnings, P. E. Bourne, Novel computational approaches to polypharmacology as a means to define responses to individual drugs, Annual review of pharmacology and toxicology 52 (2012) 361–379.

[18] J. von Eichborn, M. Murgueitio, M. Dunkel, S. Koerner, P. Bourne, R. Preissner, PROMISCUOUS: a database for network-based drug-repositioning, Nucleic Acids Res 39 (suppl 1) (2011) D1060–D1066.

[19] L. Yang, P. Agarwal, Systematic drug repositioning based on clinical side-effects, PloS one 6 (12) (2011) e28025.

[20] A. Gottlieb, G. Stein, E. Ruppin, R. Sharan, PREDICT: a method for inferring novel drug indications with application to personalized medicine, Mol Syst Biol 7 (1) (2011) 496.

[21] A. Chiang, A. Butte, Systematic evaluation of drug–disease relationships to identify leads for novel drug uses, Int J Clin Pharmacol Ther 86 (5) (2009) 507–510.

[22] L. Aravind, Guilt by association: contextual information in genome analysis, Genome Research 10 (8) (2000) 1074–1077.

[23] J. Quackenbush, Microarrays–guilt by association, Science 302 (5643) (2003) 240–241.

[24] M. G. Walker, W. Volkmuth, T. M. Klingler, Pharmaceutical target discovery using guilt-by-association: schizophrenia and Parkinson’s disease genes, in: Proc Int Conf Intell Syst Mol Biol, 1999, pp. 282–286.

[25] D. Swinney, J. Anthony, How were new medicines discovered?, Nat Rev Drug Discov 10 (7) (2011) 507–519.

[26] S. Hasan, B. K. Bonde, N. S. Buchan, M. D. Hall, Network analysis has diverse roles in drug discovery, Drug discovery today 17 (15) (2012) 869–874.

[27] S. Mathur, D. Dinakarpandian, Drug repositioning using disease associated biological processes and network analysis of drug targets, in: AMIA Annual Symposium Proceedings, Vol. 2011, American Medical Informatics Association, 2011, p. 305.

[28] X. Chen, M.-X. Liu, G.-Y. Yan, Drug–target interaction prediction by random walk on the heterogeneous network, Molecular BioSystems 8 (7) (2012) 1970–1978.

[29] L. Diaz-Beltran, C. Cano, D. P. Wall, F. J. Esteban, Systems biology as a comparative approach to understand complex gene expression in neurological diseases, Behavioral Sciences 3 (2) (2013) 253–272.

[30] V. Martínez, C. Cano, A. Blanco, ProphNet: a generic prioritization method through propagation of information, BMC bioinformatics 15 (Suppl 1) (2014) S5.

[31] O. Vanunu, R. Sharan, A propagation based algorithm for inferring gene-disease associations, in: Proceedings of German Conference on Bioinformatics, Citeseer, 2008, pp. 54–52.

[32] S. Navlakha, C. Kingsford, The power of protein interaction networks for associating genes with diseases, Bioinformatics 26 (8) (2010) 1057–1063.

[33] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, et al., DrugBank 3.0: a comprehensive resource for omics research on drugs, Nucleic Acids Res 39 (suppl 1) (2011) D1035–D1041.

[34] D. Lin, An information-theoretic definition of similarity, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, 1998, pp. 296–304.

[35] L. M. Schriml, C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, et al., Disease ontology: a backbone for disease semantic integration, Nucleic acids research 40 (D1) (2012) D940–D946.

[36] A. Chatr-aryamontri, B.-J. Breitkreutz, S. Heinicke, L. Boucher, A. Winter, C. Stark, et al., The BioGRID interaction database: 2013 update, Nucleic acids research 41 (D1) (2013) D816–D823.

[37] K. Peng, W. Xu, J. Zheng, K. Huang, H. Wang, J. Tong, et al., The disease and gene annotations (DGA): an annotation resource for human disease, Nucleic acids research 41 (D1) (2013) D553–D560.

[38] J. Beck, E. Shultz, The use of relative operating characteristic (ROC) curves in test performance evaluation, Arch Pathol Lab Med 110 (1) (1986) 13–20.

[39] M. H. Zweig, G. Campbell, Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine, Clinical chemistry 39 (4) (1993) 561–577.

[40] A. Krantz, Diversification of the drug discovery process, Nature Biotechnology 16 (13) (1998) 1294–1294.

[41] B. G. Feagan, J. Rochon, R. N. Fedorak, E. J. Irvine, G. Wild, L. Sutherland, et al., Methotrexate for the treatment of Crohn’s disease, New England Journal of Medicine 332 (5) (1995) 292–297.

[42] V. A. Fonseca, Y. Handelsman, B. Staels, Colesevelam lowers glucose and lipid levels in type 2 diabetes: the clinical evidence, Diabetes, Obesity and Metabolism 12 (5) (2010) 384–392.

[43] M. H. Pollack, J. Matthews, E. L. Scott, Gabapentin as a potential treatment for anxiety disorders, American Journal of Psychiatry 155 (7) (1998) 992–993.

[44] D. P. Silver, A. L. Richardson, A. C. Eklund, Z. C. Wang, Z. Szallasi, Q. Li, et al., Efficacy of neoadjuvant cisplatin in triple-negative breast cancer, Journal of Clinical Oncology 28 (7) (2010) 1145–1153.

[45] D. Aarsland, K. Laake, J. Larsen, C. Janvin, Donepezil for cognitive impairment in Parkinson’s disease: a randomised controlled study, Journal of Neurology, Neurosurgery & Psychiatry 72 (6) (2002) 708–712.

[46] F. M. Jacobsen, et al., Risperidone in the treatment of affective illness and obsessive-compulsive disorder, The Journal of clinical psychiatry 56 (9) (1995) 423.

[47] H. Van Bever, W. Stevens, Nedocromil sodium cream in the treatment of atopic dermatitis, European journal of pediatrics 149 (1) (1989) 74–74.

[48] H. Williams, Multi-centre clinical trial of nedocromil sodium in reversible obstructive airways disease in adults: a general practitioner collaborative study, Current Medical Research and Opinion 11 (7) (1989) 417–426.

[49] R. L. Mabry, Topical pharmacotherapy for allergic rhinitis: nedocromil, American journal of otolaryngology 14 (6) (1993) 379–381.

[50] C. Chesnut III, Tiludronate: development as an osteoporosis therapy, Bone 17 (5) (1995) S517–S519.

[51] M. McClung, C. Tou, N. Goldstein, C. Picot, Tiludronate therapy for Paget’s disease of bone, Bone 17 (5) (1995) S493–S496.

[52] M. Moreau, P. Rialland, J.-P. Pelletier, J. Martel-Pelletier, D. Lajeunesse, C. Boileau, et al., Tiludronate treatment improves structural changes and symptoms of osteoarthritis in the canine anterior cruciate ligament model, Arthritis Res Ther 13 (3) (2011) R98.

[53] J. Dumon, A. Magritte, J.-J. Body, Efficacy and safety of the bisphosphonate tiludronate for the treatment of tumor-associated hypercalcemia, Bone and mineral 15 (3) (1991) 257–266.

[54] A. Heufelder, B. Wenzel, R. Bahn, Methimazole and propylthiouracil inhibit the oxygen free radical-induced expression of a 72 kilodalton heat shock protein in Graves’ retroocular fibroblasts, Journal of Clinical Endocrinology & Metabolism 74 (4) (1992) 737–742.

[55] S. J. Mandel, G. A. Brent, P. R. Larsen, Levothyroxine therapy in patients with thyroid disease, Annals of Internal Medicine 119 (6) (1993) 492–502.

[56] E. M. T. Erfurth, U.-B. C. Ericsson, K. Egervall, S. R. Lethagen, Effect of acute desmopressin and of long-term thyroxine replacement on haemostasis in hypothyroidism, Clinical endocrinology 42 (4) (1995) 373–378.

[57] W. R. Skowsky, T. A. Kikuchi, The role of vasopressin in the impaired water excretion of myxedema, The American journal of medicine 64 (4) (1978) 613–621.

[58] H. Mei, T. Xia, G. Feng, J. Zhu, S. M. Lin, Y. Qiu, Opportunities in systems biology to discover mechanisms and repurpose drugs for CNS diseases, Drug discovery today 17 (21) (2012) 1208–1216.

[59] A. P. Davis, C. G. Murphy, R. Johnson, J. M. Lay, K. Lennon-Hopkins, C. Saraceni-Richards, et al., The comparative toxicogenomics database: update 2013, Nucleic acids research (2012) gks994.

[60] F. E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1) (1969) 1–21.

[61] T. T. Tanimoto, IBM internal report, Nov 17 (1957) 1957.

Algorithm 1 prioritize(G: global graph, Q: query set, Dq: query network, Dt: target network)

 1: Propagate values within Dq
 2: P: Compute the list of paths from Dq to Dt in G
 3: for each path p_i = {p_i1, ..., p_ij, ..., p_il} in P do
 4:     for each network p_ij in the path p_i from p_i1 to p_i(l−1) do
 5:         Propagate values from p_ij to p_i(j+1)
 6:         Propagate values within p_i(j+1)
 7:     end for
 8:     Store the values of D_i(l−1) after propagation through path p_i as x̂_i(l−1)
 9: end for
10: for each entity e ∈ Vt in the target network Dt do
11:     Set target set T = {e}
12:     Propagate values within Dt
13:     Compute correlation coefficient s_e using the stored x̂_i(l−1) for each path p_i
14: end for
15: L: Sort all entities e ∈ Vt by their s_e values in descending order
16: return L
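Algorithm 1 relies on a "propagate values within a network" step and on a correlation coefficient. As a hedged illustration, the Python sketch below uses a common flow-propagation update, f ← αWf + (1 − α)y, with a degree-normalized adjacency matrix W, together with the Pearson coefficient. Whether ProphNet uses exactly this update, this normalization or this value of α is an assumption here, and the function names and the toy chain network are hypothetical.

```python
import math


def normalize(adj):
    """Degree-normalize a symmetric adjacency matrix: w_ij / sqrt(d_i * d_j)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def propagate(adj, prior, alpha=0.5, iters=50):
    """Iterate f <- alpha * W f + (1 - alpha) * prior (assumed update rule)."""
    w = normalize(adj)
    n = len(prior)
    f = list(prior)
    for _ in range(iters):
        f = [
            alpha * sum(w[i][j] * f[j] for j in range(n))
            + (1 - alpha) * prior[i]
            for i in range(n)
        ]
    return f


def pearson(x, y):
    """Pearson correlation of two equally long value lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy chain network 0 - 1 - 2 - 3 with the query placed on node 0.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate(chain, [1.0, 0.0, 0.0, 0.0])
```

In this toy run the propagated value decreases with distance from the query node, which is the behaviour the prioritization relies on when comparing values propagated from the query with values propagated from each candidate target.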


Clinical trials phase    Case count    AUC
N/A                      1993          0.8106
Phase 0                  105           0.8976
Phase 1                  2050          0.8176
Phase 1–Phase 2          1154          0.8333
Phase 2                  3307          0.8232
Phase 2–Phase 3          402           0.8108
Phase 3                  1771          0.8652
Phase 4                  1210          0.9109
All phases               11992         0.8364

Table 1: Results obtained for DrugNet on drug-disease prioritization tasks using recent and undergoing clinical trials. The first column shows the phase of the study, the second column the number of studies in this phase and the third column the obtained AUC values.
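The figures in Table 1 can be cross-checked with simple arithmetic. The Python sketch below assumes that the "All phases" AUC is the case-weighted mean of the per-phase AUCs; this relation is not stated in the source and is used here only as a consistency check on the values listed in the table.

```python
# (phase, case count, AUC) for each per-phase row of Table 1.
rows = [
    ("N/A", 1993, 0.8106),
    ("Phase 0", 105, 0.8976),
    ("Phase 1", 2050, 0.8176),
    ("Phase 1-Phase 2", 1154, 0.8333),
    ("Phase 2", 3307, 0.8232),
    ("Phase 2-Phase 3", 402, 0.8108),
    ("Phase 3", 1771, 0.8652),
    ("Phase 4", 1210, 0.9109),
]

# Total number of cases across phases.
total_cases = sum(count for _, count, _ in rows)

# Case-weighted mean AUC (assumed definition of the "All phases" value).
weighted_auc = sum(count * auc for _, count, auc in rows) / total_cases
```

Under this assumption, the per-phase counts add up to the 11992 cases reported for all phases, and the weighted mean rounds to the reported overall AUC of 0.8364.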

Figure 1: Graphical representation of a toy sample network in DrugNet. The global network comprises different interconnected subnetworks: drug network, protein network and disease network.

Figure 2: Illustration of the main steps of the algorithm on two different runs (A and B) using the same data networks, the same disease as the target but different drugs as the query. Figure A shows an example where DrugNet scores the relationship between the query and the target with a high similarity value. Figure B, meanwhile, shows an example with a low similarity score.

[ROC plot: sensitivity versus 1−specificity. Curves: Standard LOO, drugs and diseases; Standard LOO, drugs, proteins and diseases; Advanced LOO, drugs and diseases; Advanced LOO, drugs, proteins and diseases.]

Figure 3: ROC curves for LOO tests using two different network configurations.

[Box plots: AUC (vertical axis) for each of the tests 1 to 5 (horizontal axis).]

Figure 4: Box plots of five 5-fold independent tests. It can be seen that most of the population is well prioritized while a small group of outliers decreases the average performance.

[ROC plot: sensitivity versus 1−specificity. Curve: drugs, proteins and diseases.]

Figure 5: ROC curve for clinical trials tests with the drugs-proteins-diseases network configuration.
