Combining automatic table classification and relationship extraction in extracting anticancer drug-side effect pairs from full-text articles.

Journal of Biomedical Informatics xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin

Combining automatic table classification and relationship extraction in extracting anticancer drug–side effect pairs from full-text articles Rong Xu a,⇑, QuanQiu Wang b,⇑ a b

Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106, United States ThinTek, LLC, Palo Alto, CA 94306, United States

a r t i c l e

i n f o

Article history: Received 30 May 2014 Accepted 3 October 2014 Available online xxxx Keywords: Text mining Information extraction Cancer drug side effect Drug discovery Drug repositioning Drug toxicity prediction

a b s t r a c t Anticancer drug-associated side effect knowledge often exists in multiple heterogeneous and complementary data sources. A comprehensive anticancer drug–side effect (drug–SE) relationship knowledge base is important for computation-based drug target discovery, drug toxicity predication and drug repositioning. In this study, we present a two-step approach by combining table classification and relationship extraction to extract drug–SE pairs from a large number of high-profile oncological full-text articles. The data consists of 31,255 tables downloaded from the Journal of Oncology (JCO). We first trained a statistical classifier to classify tables into SE-related and -unrelated categories. We then extracted drug–SE pairs from SE-related tables. We compared drug side effect knowledge extracted from JCO tables to that derived from FDA drug labels. Finally, we systematically analyzed relationships between anti-cancer drug-associated side effects and drug-associated gene targets, metabolism genes, and disease indications. The statistical table classifier is effective in classifying tables into SE-related and -unrelated (precision: 0.711; recall: 0.941; F1: 0.810). We extracted a total of 26,918 drug–SE pairs from SE-related tables with a precision of 0.605, a recall of 0.460, and a F1 of 0.520. Drug–SE pairs extracted from JCO tables is largely complementary to those derived from FDA drug labels; as many as 84.7% of the pairs extracted from JCO tables have not been included a side effect database constructed from FDA drug labels. Side effects associated with anticancer drugs positively correlate with drug target genes, drug metabolism genes, and disease indications. Ó 2014 Elsevier Inc. All rights reserved.

1. Introduction Drug-induced side effects are observable phenotypes of drugs manifested at the level of the whole body system and are mediated by a drug interacting with its on- or off-targets through a cascade of downstream pathway perturbations. Systematic and integrated approaches to studying drug-associated side effects have the potential to illuminate the complex pathways of drug-induced toxicities, allowing for the identification of novel drug targets, prediction of unknown drug toxicities, and repositioning of existing drugs for new disease indications. Computational approaches to drug target discovery, unknown drug toxicities prediction, and drug repositioning have primarily relied on drug molecular structures or functions such as chemical structure, molecular activity, and molecular docking [2,7,11,14,17,18,20,26,34,35]. These computational approaches largely depend on the availability of drug molecular structure or function knowledge bases. Systems approaches would greatly ben⇑ Corresponding authors. E-mail addresses: [email protected] (R. Xu), [email protected] (Q. Wang).

efit from the vast amount of higher-level clinical phenotype data such as observed drug-related side effects [3,5,6,13]. It has been increasingly recognized that similar side effects of seemingly unrelated drugs can be caused by their common off-targets and that drugs with similar side effects are likely to share molecular targets [5]. Therefore, systems approaches to studying the phenotypic relationships among drugs and integrating the high-level drug phenotypic data with lower-level genetic and chemical data will allow for a better understanding of drug toxicities. Current systems approaches to studying phenotypic relationships among drugs rely exclusively on information extracted from FDA drug labels [3,5,17]. It was recently demonstrated that 39% of serious events associated with targeted anticancer drugs are not reported in clinical trials and 49% are not described in initial FDA drug labeling [21]. For the successful development of phenotype-driven systems approaches to understanding drug-associated side effects, the availability of a comprehensive and machine-understandable drug–side effect (SE) relationship knowledge base is critical. Drug–SE relationship extraction and mining from multiple heterogeneous and complementary data sources is an active research area. Kuhn et al. developed text mining approaches in constructing

http://dx.doi.org/10.1016/j.jbi.2014.10.002 1532-0464/Ó 2014 Elsevier Inc. All rights reserved.

Please cite this article in press as: Xu R, Wang Q. Combining automatic table classification and relationship extraction in extracting anticancer drug–side effect pairs from full-text articles. J Biomed Inform (2014), http://dx.doi.org/10.1016/j.jbi.2014.10.002

2

R. Xu, Q. Wang / Journal of Biomedical Informatics xxx (2014) xxx–xxx

a side effect resource (SIDER) from FDA drug labels [15]. Currently, SIDER represents the best source of computable drug side effect association knowledge. Systems approaches to studying this information have led to the prediction of several new drug targets [5,17]. The FDA Adverse Event Reporting System (FAERS) is the spontaneous reporting system overseen by the U.S. FDA and the main resources for post-marketing drug safety surveillance. Mining drug–side effect (drug–SE) relationships from FAERS is a highly active research area. Data mining algorithms such as disproportionality analysis, correlation analysis, multivariate regression, and signal ranking and filtering leveraging external knowledge have been developed to detect adverse drug signals from FAERS [1,12,23,28,29]. Another important information source of drug–SE associations is the vast amount of published biomedical literature. Currently, more than 22 million biomedical abstracts are publicly available on MEDLINE, making it a rich source of side effect information for drugs at all clinical stages, including drugs in pre-marketing clinical trials, post-marketing clinical case reports and clinical trials. Statistical, machine learning and signal ranking approaches have been developed in extracting drug–SE pairs from free-text MEDLINE abstracts [9,22,27,30,31]. Recently, researchers began to explore other data sources for mining drug–SE associations. For example, patient EHRs have emerged as a promising resource for post-marketing drug adverse event discovery [8,19,32]. Health information available on the web and web search log data can also provide valuable information on drug side effects [16,33]. In summary, drug side effect association knowledge exists in multiple heterogeneous and complementary data sources with different formats. The effectiveness of automatic approaches in extracting drug side effect knowledge depends on both data sources and the targeted drugs or SEs. Currently, there exist no universally effective approaches to extract this knowledge from different data sources. While the main data sources used in previous studies for drug– SE relationship extraction include FDA drug labels, abstracts of published biomedical literature and post-marketing drug safety surveillance systems, there exists a large body of untapped drug side effect knowledge that remains buried in full-text articles. In this study, we developed automatic approaches to extract anticancer drug–SE pairs from the Journal of Clinical Oncology (JCO), specifically the tables imbedded in the full text articles. JCO was established in 1983 and is the official journal of the American Society of Clinical Oncology and the leading journal in oncology. JCO articles include a variety of cancer-related research articles, including clinical trials reporting drug efficacy and toxicity in cancer patients, trial reports evaluating the effectiveness of biomarkers, clinical case reports, and meta-analysis studies, among other article types. JCO articles not only include pivotal clinical trials that have led to drug approval, but also trials that are still in investigational stages, and even failed trials. Side effect knowledge of drugs at different clinical stages is crucial to our understanding of the molecular mechanisms underlying the observed toxicities.

2. Approach In this pilot study, we focused on extracting drug–SE pairs from tables contained in full-text JCO articles. Unlike the full-text portions of JCO articles, the tables often summarize important information such as patient characteristics, treatment outcomes, and toxicities, with each table often containing only one type of information. We downloaded a total of 13,855 full text articles from JCO and extracted 31,255 tables from the downloaded articles. Since not all tables are related to drug side effects, we first developed a support vector machine classifier to classify tables into SE-related or -unrelated categories. We then extracted drug–SE pairs from

the SE-related tables. The precisions of the input lexicons (both drug lexicon and SE lexicon) are critically important for subsequent relationship extraction. We created a clean drug lexicon and a clean SE lexicon through intense manual curation efforts. Using the clean lexicons, we extracted drug–SE pairs from classified tables using dictionary-based approaches. In order to compare our study to existing research efforts in building drug–SE relationship knowledge base, we compared drug–SE pairs that we extracted from JCO tables to those from SIDER, currently the best drug side effect association database constructed from FDA package inserts [15]. To show the potential of extracted drug–SE pairs in developing systems approaches to understand the molecular mechanisms underlying the observed drug phenotypes (toxicities), we linked drugs to their corresponding gene targets, metabolism genes, and disease indications and systematically studied the correlations between drug phenotypes and their known targets, metabolism genes, and disease indications. To the best of our knowledge, this is the first study in combining table classification and relationship extraction in order to extract drug–SE pairs from full-text articles. 3. Data and methods The system consists of the following steps: (1) we downloaded JCO articles and extracted tables from downloaded articles; (2) we created clean drug and SE lexicons; (3) we classifies tables into drug toxicity-related and non-related categories; (4) we extracted drug–SE pairs from classified tables using manually curated, clean drug and SE lexicons; (5) we compared drug–SE pairs extracted from JCO tables to those derived from FDA drug labels; and (6) we systematically analyzed the relationships between cancer drug-associated side effects with drug gene targets, drug metabolism, and disease indications (Fig. 1). 3.1. Download JCO articles and extract tables We downloaded a total of 13,855 JCO full text articles (1.5 GB) published between 1983 and 2013 from the Case Western Reserve University Intranet. We then extracted a total of 31,255 tables from these downloaded JCO articles. A typical JCO clinical trial paper often contains multiple tables describing trial participants, treatment response, survival, prognostic factors, treatment costs, or treatment-related adverse events. While some articles contain no tables, others include many tables. For example, the article entitled ‘‘Procarbazine, Lomustine, and Vincristine (PCV) Chemotherapy for Anaplastic Astrocytoma: A Retrospective Review of Radiation Therapy Oncology Group Protocols Comparing Survival With Carmustine or PCV Adjuvant Chemotherapy’’ (PMID 10550132) contains as many as 14 tables. The content associated with each extracted table includes the article title, table content, and table legend. While the side effect information is included in the table content and table legends, the drug information is often included in the article titles. A typical example of a SE-related table is shown in Fig. 2. We used the publicly available information retrieval library Lucene (http://lucene.apache.org) to create a search engine with indices created on article titles, table contents, and table legends. Each table was assigned a unique identification number. 3.2. Create clean drug and SE lexicons 3.2.1. Create clean anticancer drug lexicon We first compiled a comprehensive drug lexicon from the Unified Medical Language System (UMLS) (2011AB version) [4] using terms with the following semantic types: ‘‘Pharmacologic



3

Fig. 1. Experiment flowchart.

Substance’’ (258,863 terms), ‘‘Clinical Drug’’ (514,212 terms), ‘‘Antibiotic’’ (10,423 terms), ‘‘Steroid’’ (21,193 terms), and ‘‘Vitamin’’ (3630 terms). Note that many terms from the last three categories were not included in the category ‘‘Pharmacologic Substance.’’ This drug lexicon consists of 791,848 terms, which includes not only FDA-approved but also investigational drugs, as well as failed drugs. Inclusion of both investigational and failed drugs in the drug lexicon is important in extracting drug–SE pairs from JCO articles. The initial drug lexicon contains both cancer drugs and noncancer drugs. To facilitate our task of building a cancer-specific

drug–SE knowledge base from JCO articles, we created a cancerspecific drug lexicon from the initial drug lexicon through a combination of automatic filtering and manual curation. First, we filtered the initial drug lexicon with JCO tables, including both table titles and legends. We used each drug term as a search query to the local search engine. Terms that did not appear in any tables were removed. This filtering step removed many terms such as ‘‘zinc cl/b-caine/cetylpyrd cl unidentified’’ and ‘‘zinc finger and btb domain containing 7c protein, human.’’ The filtering significantly decreased the size of the drug lexicon, which was important for our subsequent manual curation effort. In addition, this filtered

Fig. 2. A typical example table that contains drug-associated adverse event information. Drug information was contained in the article title.


4


drug lexicon was highly enriched with anti-cancer drugs since only drugs that appeared in JCO articles were included. The filtered lexicon consisted of 4988 drugs. We then manually curated the filtered lexicon by removing terms that are too general, ambiguous, or mis-classified such as ‘‘drug’’, ‘‘medicine’’, ‘‘link,’’ ‘‘alerts,’’ ‘‘align,’’ and ‘‘unknown.’’ The final clean and cancer-specific drug lexicon consists of 4256 terms. 3.2.2. Clean SE lexicon An accurate and comprehensive SE lexicon is critical for drug– SE relationship extraction. In our previous studies, we have manually curated a clean SE lexicon [28–31]. The clean SE lexicon consists of 49,625 terms. 3.3. Classify tables into drug SE-related and -unrelated 3.3.1. Table classifier training From the downloaded tables, we removed tables containing no drug or SE terms since no drug–SE pairs would be extracted from these tables. We randomly selected 700 tables from remanning tables and manually classified these tables into positives or drug–SE-related (380 tables) and negatives (320 tables). We then trained an SVM classifiers with Weka [10] using these manually classified tables as training data. The SVM-based table classifier used polynomial kernel, bag-of-words feature, TF-IDF weighting, stemming and stopwords-removal. The bag-of-words feature was used since it is often the case that the appearance of one specific word such as toxicity or adverse can determine whether or not a sentence is drug–SE-related. The 10-fold cross validation was used in training the table classifier. During the training and testing of the SVM classifier, we experimented with different ways to process the tables: (1) tables with html tags; (2) tables without html tags; and (3) tables with html tags and removed table titles. 3.3.2. Evaluation To test the performance of table classification, we randomly selected 100 JCO full-text articles and extracted a total of 375 tables from these articles, an average of 3.75 tables per article. We first manually classified these tables into SE-related and nonrelated. Among the 375 tables, 34 were positives, which was less than one per article. This indicates that most JCO tables are not SE-related; therefore, finding SE-related tables before drug–SE relationship extraction is critical. We used the manually classified 375 tables as our gold standard to evaluate the performance of the SVM table classifier. 3.4. Extract drug–SE pair from classified tables

curation. From these SE-related tables, we extracted a total of 588 distinct drug–SE pairs, representing 32 drugs and 205 SEs. We evaluated our algorithm in three different ways to investigate how table classification contributes to the subsequent relationship extraction: (1) apply relationship extraction algorithm to all 375 unclassified tables; (2) use the SVM classifier to classify these tables first, and then apply the algorithm to those positively classified tables; and (3) apply the algorithm to the 34 tables that were manually classified as positives. Standard precision, recall, F1, FPR (false positive rate) and FNF (false negative rate) were calculated. 3.5. Compare drug–SE pairs extracted from JCO tables to ones derived from FDA drug labels In order to compare our study with existing research efforts in building drug–SE relationship databases, we compared drug–SE pairs that we extracted from JCO tables to those from SIDER, which represents the best drug side effect association database constructed from FDA drug labeling so far. We downloaded a total of 100,049 known drug–SE pairs from SIDER and filtered out drug– SE pairs if the drugs did not appear in the cancer-specific drug lexicon we constructed above. After the filtering, the SIDER dataset consisted of 51,341 pairs. We evaluated coverages of targeted anticancer drug-associated side effects in these two databases. Specifically, we investigated whether JCO tables contained additional side effect information for these innovative cancer drugs, which have not been captured in FDA drug labeling. Targeted cancer drugs control cancer cell growth by interfering with specific molecular targets involved in tumor growth and progression. While targeted anticancer drugs dramatically improve cancer treatment outcomes, they are often associated with unexpectedly high toxicities. We obtained a list of 45 targeted cancer drugs from the National Cancer Institute (NCI) (http://www.cancer.gov/cancertopics/factsheet/Therapy/targeted). We filtered drug–SE pairs extracted from JCO tables and from SIDER with the targeted cancer drug list and compared the overlap of this knowledge between these two data sources. 3.6. Analyze the correlation between drug side effects and drug targets, metabolism and indications We investigated the potential value of anticancer drug–SE pairs that were extracted from JCO tables to our understanding of the underlying drug molecular targets, drug toxicity prediction, and drug repurposing. Specifically, we investigated whether drug–drug pairs that share side effects also tend to share target genes, drug metabolism genes, or disease indications.

3.4.1. Drug–SE pair extraction We first classified all JCO tables using the trained table classifier and recorded the article IDs for positively classified tables. We used each drug from the drug lexicon as search query to the local JCO table search engine. If a drug appeared in a table title, table content, or a table legend, the drug and the article ID were recorded. Then we used each term from the SE lexicons as a search query to the local JCO table search engine. Unlike our search for drug terms, we only searched the table contents for SE terms. Table titles and legends were not used since they often contain drug–disease treatment relationships (as shown in Fig. 2). SE terms along with the IDs of the tables in which they appeared were recorded. Drug–SE pairs (along with article IDs) were extracted by joining the article IDs.

3.6.1. Correlation with drug targets We extracted a total of 10,478 drug–gene pairs from DrugBank [25], representing 3454 drugs and 1816 genes. For drug–drug pairs that share SEs at different cutoffs, we calculated the average number of shared gene targets.

3.4.2. Evaluation Among the 375 tables extracted from the 100 randomly selected JCO articles, 34 were SE-related as determined by manual

3.6.3. Correlation with drug indications We extracted a total of 52,066 drug–disease pairs from ClinicalTrials.gov (http://www.clinicaltrials.gov/), a registry of federally

3.6.2. Correlation with drug metabolizing genes Similar to the correlation analysis above, we analyzed the correlation between shared metabolism genes and overlapping SEs between drugs. We obtained from PharmGKB [24] a total of 4399 drug–gene pairs with subtypes of pharmacokinetics (PK) or pharmacodynamics (PD), representing 637 drugs and 859 genes. For drug–drug pairs that share SEs at different cutoffs, we calculated the average number of shared metabolism genes.


5


and privately supported clinical trials conducted in the United States and around the world (www.clinicaltrials.gov). The drug– disease pairs contain 2035 drugs and 9591 diseases. For drug–drug pairs that share SEs at different cutoffs, we calculated the average number of shared disease indications. 4. Results 4.1. The SVM-based table classifier is effective in categorizing tables into SE-related and -unrelated We trained the SVM table classifier on 700 manually classified tables and tested it on a separate set of 375 tables extracted from 100 randomly selected JCO articles. We trained and evaluated the classification performance on tables processed in three different ways: tables with HTML tags, tables without HTML tags, and tables with neither HTML tags nor titles. Table 1 listed the precisions, recalls, F1 measures, false positive rates (FPRs), and false negative rates (FNRs) of the table classification task. As shown in the table, the classifier achieved a precision of 0.711, recall of 0.941, and an F1 value of 0.810 when it was trained and evaluated using tables with HTML tags removed. Both the false positive and false negative rates were low. 4.2. Drug–SE relationship extraction from classified tables We manually extracted 588 drug–SE pairs from 34 positive tables from a total of 375 tables extracted from the randomly selected 100 JCO articles. Using these pairs as gold standard, we compared the performance of pair extraction from SVM-classified tables to those from manually classified tables and from unclassified tables. As shown in Table 2, compared to relationship extraction from unclassified tables (all 375 tables), drug–SE relationship extraction from SVM-classified tables had significantly better precision (0.605 vs. 0.491), while recalls were similar (0.469 vs. 0.471). Compared to relationship extraction from manually classified tables, extraction from SVM-classified tables had comparable precisions (0.605 vs. 0.684) and recalls (0.469 vs. 0.471). In summary, drug–SE relationship extraction from automatically classified tables had achieved comparable performance to that from manually classified tables and significantly improved performance compared to that from unclassified tables. These results demonstrated that table classification before drug–SE relationship extraction was necessary and our table classifier was effective in finding SE-related tables from all downloaded tables. However, both precisions and recalls are still modest for both SVM- and manually-classified tables. Based on our observations, many SE-related tables contain drug–disease pairs that are not related to drug side effects. For instance, many SE-related tables contain disease terms that are used to describe patient disease characteristics, co-morbidities, or exclusion criteria. Currently, we are developing approaches to further improve the precision by automatically filtering out drug–disease treatment pairs, co-morbidities and other noises. The modest recall achieved in our study (0.469) indicates that the coverage of the underlying drug or SE lexicon is not perfect. For instance, many SE terms appeared in

Table 2 Precisions, recalls, and F1s of drug–SE extraction from three types of tables: positive tables as determined by SVM classifier, positive tables as determined manually, and all 375 tables. Highlighted are best performances achieved for each type of tables. Table classification

Drug Source

Precision

Recall

F1

SVM classified tables

Title All

0.605 0.490

0.469 0.476

0.529 0.483

Manually classified tables

Title All

0.684 0.554

0.471 0.478

0.558 0.513

All (unclassified)

Title All

0.491 0.288

0.471 0.478

0.481 0.359

the tables but were not included in the SE lexicon that we built based on MedDRA. Currently, we are developing approaches to build a more comprehensive SE lexicon. 4.3. Compare drug–SE pairs extracted from JCO tables to ones derived from FDA drug labels From all tables classified as positives, we extracted a total of 26,918 drug–SE pairs, representing 520 anticancer drugs and 1198 SE terms. From SIDER, we obtained a total of 58,454 anticancer drug–SE pairs using the same drug lexicon that we created, representing 470 drugs and 3431 SE terms (Table 3). Drug–SE pairs extracted from JCO tables included more drugs than those from FDA drug labels, which was expected since JCO articles included not only FDA-approved drugs but also experimental and failed drugs while FDA drug labels contain exclusively approved drugs. Since targeted anticancer drugs were relatively new than other anticancer drugs, we compared the coverages of drug–SE pairs for these innovative drugs between the two data resources. Drug–SE pairs extracted from JCO tables contained 31 of the 45 targeted cancer drugs that we downloaded from NCI. These 31 targeted cancer drugs were associated with 1,984 drug–SE pairs in JCO tables. The SIDER database contains 15 of the 45 targeted cancer drugs and 14 of these 15 drugs also appeared in JCO tables. We then calculated the overlap between drug–SE pairs extracted from JCO tables and those from SIDER. As shown in Table 4, 15.3% of drug–SE pairs extracted from JCO tables appeared on FDA drug labels and 3.8% of pairs derived from FDA drug labels also appeared in JCO tables. We observed similar trends for targeted cancer drugs. These results indicate that drug toxicity knowl-

Table 3 Drug–SE pairs extracted from JCO tables vs. pairs extracted from FDA drug labels. Drug type

Source

Drugs (n)

SEs (n)

Pairs (n)

All cancer drugs

JCO tables FDA drug labels

520 470

1198 3431

14,516 58,454

Targeted cancer drugs


31 15

584 837

1,984 2,419

Table 4 Overlap of drug–SE pairs between JCO tables and FDA drug labels: percent of drug–SE pairs from source 1 (column) that also appeared in source 2 (row). Drug type

Table 1 Table classification performance. The best performance was achieved when HTML tags were removed (highlighted). Table preprocessing

Precision

Recall

F1

FPR

FNR

With HTML tags Without HTML tags Without HTML tags + no title

0.322 0.711 0.704

0.853 0.941 0.911

0.467 0.810 0.794

0.178 0.038 0.038

0.147 0.058 0.088

JCO tables (%)

FDA drug labels (%)

All


100 15.3

3.8 100

Targeted cancer drugs


100 27.9

22.9 100


6


edge in these two data sources is largely complementary. While FDA drug labels contained known side effect information for commercial drugs from many data resources including published clinical trials and post-market drug safety surveillance systems, JCO tables contained side effect information for many cancer drugs that are still in experimental stages. Therefore, in order to build a comprehensive drug–SE relationship knowledge base, drug–SE pairs from multiple complementary data sources such as JCO articles, FDA drug labels, biomedical literature, and post-market drug safety surveillance systems are needed. 4.4. Anticancer drug-associated side effects positively correlate with drug-associated targeted genes, metabolizing genes, and disease indications Our ultimate goal in building a cancer-specific drug–SE relationship knowledge base is to develop computational approaches to discover novel cancer drug targets, predict unknown drug side effects in cancer patients, and find new indications for existing drugs. To demonstrate the potential of drug–SE pairs extracted from JCO tables in these applications, we investigated whether drug pairs that shared side effects (SEs) also tended to share gene targets, metabolism genes, and disease indications. We investigated whether any differences existed between targeted anticancer drug–SE pairs and all anticancer drug–SE pairs. Different from traditional cytotoxic chemotherapy drugs that kill both tumor and healthy cells, targeted cancer drugs control cancer cell growth by interfering with specific molecular targets that are predominantly expression on tumor cells. Therefore, side effects for targeted cancer drugs are presumably mostly mechanism-based and should correlate more strongly with their gene targets. We showed that cancer drug-associated side effects positively correlated with drug targets (Fig. 3). For all drug–drug combination, the average number of shared gene targets was 0.608, and the number significantly increased to 0.817 for drug–drug pairs that shared at least 30 SEs and to 1.028 for drug–drug pairs that shared at least 50 SEs. The positive correlation was stronger for targeted cancer drugs. For targeted cancer drugs, the average number of shared gene targets significantly increased from 1.649 for all drug–drug combinations to 2.741 for pairs that shared at least 50 SEs. The stronger correlation between side effects and gene targets for targeted cancer drugs may be due to the fact that many side effects for targeted cancer drugs are caused by the on-target effects (by the drug binding to ‘on-target’ in normal cells). On the other hand, side effects for non-specific cytotoxic chemotherapy drugs are similar even though the drugs can have different targets. For example, irinotecan is a cytotoxic DNA topoisomerase I inhibitor primarily used in the treatment of colorectal cancer. Cisplatin is

Fig. 3. Correlation between shared SEs and shared target genes.

a platinum-compound chemotherapy drug that acts as an alkylating agent. Even though these two drugs have different targets, they share many systemic side effects such as alopecia and neutropenia. Therefore, it is less likely to predict drug targets based on shared side effects for chemotherapy drugs. Both drug targets and metabolism genes are involved in druginduced side effects. We showed that cancer drug-associated side effects strongly correlated with drug metabolism genes (Fig. 4). Unlike the correlation with gene targets, the correlation with drug metabolism genes was weaker for targeted agents than for all cancer drugs. For all cancer drugs, the average number of shared metabolism genes increased from 0.977 for all drug–drug combinations to 4.33 for drug–drug pairs that shared at least 50 SEs. For targeted cancer drugs, the average number of shared metabolism genes increased from 0.438 for all drug–drug combinations to 1.069 for drug–drug pairs that shared at least 50 SEs. The observed positive correlation between side effects and drug metabolism genes indicates that computational models for predicting drug toxicities in cancer patients needs to incorporate not only known drug on-target genes, but also drug metabolism genes. The observed phenotypic drug side effects strongly correlated with drug indications (Fig. 5). For all cancer drugs, the average number of shared disease indications was 6.859 for all drug–drug combinations; the number significantly increased to 24.33 for drug–drug pairs that shared at least 20 SEs, and to 46.15 for pairs that shared at least 50 SEs. The correlation with disease indications was weaker for targeted cancer drugs. For targeted cancer drugs, the average number of shared disease indications was 17.88 for all drug combinations and the number increased to 25.68 for drug pairs that shared at least 30 SEs and to 37.42 for drug pairs that shared at least 50 SEs. The positive correlation between cancer drug-associated side effects and their disease indications indicates that we may repurpose existing drugs for new cancer treatments or cancer drugs for non-cancer treatments based on the clinically observed drug-associated side effects. 5. Discussion We presented a system that combined both table classification and relationship extraction to effectively extract a large number of drug–SE pairs from high profile oncological articles. Nonetheless, further improvements to our study can be made in the future. First, while tables in JCO articles summarize important side effects for specific studies, it remains unknown whether they captured all

Fig. 4. Correlation between shared SEs and shared metabolism genes.



7

in cancer patients and can also aid in the prediction of unknown drug side effects and the repurposing of existing drugs for new disease indications. Funding RX is funded by CWRU/Cleveland Clinic CTSA Grant (UL1 RR024989), the Training grant in Computational Genomic Epidemiology of Cancer (CoGEC) (R25 CA094186-06), and the Grant #IRG-91-022-18 to the Case Comprehensive Cancer Center from the American Cancer Society. QW is partly funded by ThinTek LLC. Acknowledgements

Fig. 5. Correlation between shared SEs and shared disease indications.

the relevant side effects reported in JCO articles. Some side effects may be discussed in articles, but not summarized in tables. Currently, we are developing approaches to extract drug–SE pairs from the text parts of JCO articles. Second, even with manually curated drug lexicons and SE lexicons, the precision and recall of drug–SE pair extraction from SVM-classified and even manually-classified tables are still modest (precision: 0.605; recall: 0.469). The modest precisions indicate that SE-related tables still contain many nondrug–SE pairs. The modest recalls indicate that the coverages of the drug lexicon and especially the SE lexicon need to be improved. Third, the extracted drug–SE pairs are only for individual drugs. In reality, cancer drugs are often used with other drugs in specific combinations. Certain side effects may occur only when specific drug–drug combinations are used. Extracting side effects associated with drug combinations is a challenging task and related work is scant. Fourth, many drug toxicity-specific tables included frequency or prevalence information for reported side effects. For tailoring treatments to specific cancer patients, both the presence and the prevalence of drug-associated side effects are important. Currently, we are exploring the structures of the tables to automatically extract this information. Fifth, even though JCO is the leading journal in oncology, many other high profile medical journals such as JAMA (The Journal of the American Medical Association) and Lancet also include clinical trial reports for cancer drugs and other drugs. Currently, we are obtaining the full-text articles from these journals to extract broad types of drug–SE relationships from high-profile clinical trial reports. 6. Conclusions In this study, we presented a system that combined both table classification and relationship extraction to extract anticancer drug–SE pairs from tables imbedded in full-text oncological articles. We demonstrated that the statistical table classifier was effective in classifying tables into SE-related and non-related categories (precision: 0.711; recall: 0.941; F1: 0.810). The drug–SE pair extraction from classified tables achieved a precision of 0.605, a recall of 0.460, and an F1 of 0.520. We showed that cancer drugrelated toxicity knowledge contained in JCO tables was largely complementary to that captured in FDA drug labeling. We showed that side effects associated with cancer drugs positively correlated with drug ‘on-target’ genes, drug metabolism genes, and disease indications. Therefore, computational approaches to studying these drug–SE pairs extracted from JCO tables, in combination with pairs extracted from other data sources, can provide insight into the molecular mechanisms underlying many observed side effects

Xu and Wang have jointly conceived the idea, designed and implemented the algorithms and prepared the manuscript. All authors read and approved the final manuscript. References [1] Bate A, Evans SJW. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidem Drug Safety 2009;18(6):427–36. [2] Besnard J, Ruda GF, Setola V, Abecassis K, Rodriguiz RM, Huang XP, et al. Automated design of ligands to polypharmacological profiles. Nature 2012;492(7428):215–20. [3] Cami A, Arnold A, Manzi S, Reis B. Predicting adverse drug events using pharmacological network models. Sci Transl Med 2011;3(114):114ra27. [4] Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucl Acid Res 2004;32(suppl 1):D267–70. [5] Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science 2008;321(5886):263–6. [6] Duran-Frigola M, Aloy P. Recycling side-effects into clinical markers for drug repositioning. Genome Med 2012;4(3). [7] Fliri AF, Loging WT, Thadeio PF, Volkmann RA. Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nature Chem Biol 2005;1(7):389–97. [8] Friedman C. Discovering novel adverse drug events using natural language processing and mining of the electronic health record. In: Artificial intelligence in medicine. Berlin Heidelberg: Springer; 2009. p. 1–5. [9] Gurulingappa H, Mateen?Rajput A, Toldo L. Extraction of potential adverse drug events from medical case reports. J Biomed Semantics 2012;3(1):15. [10] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor Newslett 2009;11(1):10–8. [11] Hammann F, Gutmann H, Vogt N, Helma C, Drewe J. Prediction of adverse drug reactions using decision tree modeling. Clin Pharmacol Therapeut 2010;88(1):52–9. [12] Harpaz R, DuMouchel W, Shah NH, Madigan D, Ryan P, Friedman C. Novel data-mining methodologies for adverse drug event discovery and analysis. Clin Pharmacol Therapeut 2012;91(6):1010–21. [13] Hurle MR, Yang L, Xie Q, Rajpal DK, Sanseau P, Agarwal P. Computational drug repositioning: from data to therapeutics. Clinical Pharmacology & Therapeutics 2013;93(4):335–41. http://dx.doi.org/10.1038/clpt.2013.1. [14] Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, et al. Predicting new molecular targets for known drugs. Nature 2009;462(7270):175–81. [15] Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6(1). [16] Leaman R, Wojtulewicz L, Sullivan R, Skariah A, Yang J, Gonzalez G. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In: Proceedings of the 2010 workshop on biomedical natural language processing. Association for Computational Linguistics; 2010. p. 117–25. [17] Lounkine E, Keiser MJ, Whitebread S, Mikhailov D, Hamon J, Jenkins JL, et al. Large-scale prediction and testing of drug activity on side-effect targets. Nature 2012;486(7403):361–7. [18] Pauwels E, Stoven V, Yamanishi Y. Predicting drug side-effect profiles: a chemical fragment-based approach. BMC Bioinform 2011;12(1):169. [19] Platt R, Wilson M, Chan KA, Benner JS, Marchibroda J, McClellan M. The new sentinel network improving the evidence of medical-product safety. New Engl J Med 2009;361(7):645–7. [20] Pouliot Y, Chiang AP, Butte AJ. Predicting adverse drug reactions using publicly available PubChem BioAssay data. Clin Pharmacol Therapeut 2011;90(1):90–9. [21] Seruga B et al. Reporting of serious adverse drug reactions of targeted anticancer agents in pivotal phase III clinical trials. J Clin Oncol 2011;29:174–85. [22] Shetty KD, Dalal SR. Using information mining of the medical literature to improve drug safety. J Am Med Inform Assoc 2011;18(5):668–74. [23] Tatonetti NP, Patrick PY, Daneshjou R, Altman RB. Data-driven prediction of drug effects and interactions. Sci Transl Med 2012;4(125):125ra31.


8


[24] Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong Li, Sangkuhl K, Thorn CF, et al. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Therapeut 2012;92(4):414–7. [25] Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucl Acids Res 2006;34(suppl 1):D668–72. [26] Xie L, Evangelidis T, Xie L, Bourne PE. Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol 2011;7(4):e1002037. [27] Xu R, Wang Q. Automatic construction and integrated analysis of a cancer drug side effect knowledge base. J Am Med Inform Assoc 2013. http://dx.doi.org/ 10.1136/amiajnl-2012-001584. [28] Xu R, Wang Q. Automatic signal prioritizing and filtering approaches in detecting post-marketing cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS). J Biomed Informs 2014:171–7. [29] Xu R, Wang Q. Large-scale combining signals from both biomedical literature and the FDA adverse event reporting system (FAERS) to improve postmarketing drug safety signal detection. BMC Bioinform 2014;15:17. http:// dx.doi.org/10.1186/10.1186/1471-2105-15-17. [30] Xu R, Wang Q. Automatic construction of a large-scale and accurate drug-sideeffect association knowledge base from biomedical literature. J Biomed Inform

[31]

[32]

[33]

[34]

[35]

2014. http://dx.doi.org/10.1016/j.jbi.2014.05.013. pii: S1532-0464(14)001385. Xu R, Wang Q. Comparing a knowledge-driven approach to a supervised machine learning approach in constructing a large-scale drug–side effect relationship knowledge base for computational drug discovery. In: International symposium on bioinformatics research and applications (ISBRA) 2014 conference; 2014. Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc 2009;16(3):328–37. White RW, Tatonetti NP, Shah NH, Altman RB, Horvitz E. Web-scale pharmacovigilance: listening to signals from the crowd. J Am Med Inform Assoc 2013;20(3):404–8. Yang L, Wang K, Chen J, Jegga AG, Luo H, Shi L, et al. Exploring off-targets and off-systems for adverse drug reactions via chemical-protein interactomeclozapine-induced agranulocytosis as a case study. PLoS Comput Biol 2011;7(3):e1002016. Yildirim MA, Goh KI, Cusick ME, Barabsi AL, Vidal M. Drugtarget network. Nature Biotechnol 2007;25(10):1119.


Extracting biomedical events from pairs of text entities.

The Extraction, Anticancer Effect, Bioavailability, and Nanotechnology of Baicalin.

Combining feature extraction and classification for fNIRS BCIs by regularized least squares optimization.

Automatic carotid centerline extraction from three-dimensional ultrasound Doppler images.

Automatic extraction of drug indications from FDA drug labels.

Automatic extraction of endocranial surfaces from CT images of crania.

Automatic lymphoma classification with sentence subgraph mining from pathology reports.

Development and validation of a classification approach for extracting severity automatically from electronic health records.

Classification of current anticancer immunotherapies.

Learning while extracting: 'pacing' lessons from the world of lead extraction.

The Relationship Between Stress and Coping in Table Tennis.

Automatic detection and classification of artifacts in single-channel EEG.

PDF text classification to leverage information extraction from publication reports.

Towards the automatic classification of neurons.

PASTEC: an automatic transposable element classification tool.

Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords.

Automatic classification of the cats vigilance state.

Automatic cervical cell segmentation and classification in Pap smears.

Classification of finger pairs from one hand based on spectral features in human EEG.

Automatic Extraction of Appendix from Ultrasonography with Self-Organizing Map and Shape-Brightness Pattern Learning.

Developing an Intelligent Automatic Appendix Extraction Method from Ultrasonography Based on Fuzzy ART and Image Processing.

Ultrasound-assisted extraction of polysaccharides from Artemisia selengensis Turcz and its antioxidant and anticancer activities.

EXIF Custom: Automatic image metadata extraction for Scratchpads and Drupal.

Automatic bone segmentation and bone-cartilage interface extraction for the shoulder joint from magnetic resonance images.