Accepted Manuscript

Title: An ensemble method for extracting adverse drug events from social media
Authors: Jing Liu, Songzheng Zhao, Xiaodi Zhang
PII: S0933-3657(15)30037-3
DOI: http://dx.doi.org/10.1016/j.artmed.2016.05.004
Reference: ARTMED 1463
To appear in: Artificial Intelligence in Medicine
Received date: 7-10-2015
Revised date: 20-5-2016
Accepted date: 27-5-2016

Please cite this article as: Liu Jing, Zhao Songzheng, Zhang Xiaodi. An ensemble method for extracting adverse drug events from social media. Artificial Intelligence in Medicine. http://dx.doi.org/10.1016/j.artmed.2016.05.004

An ensemble method for extracting adverse drug events from social media

Jing Liu*, Songzheng Zhao, Xiaodi Zhang

School of Management, Northwestern Polytechnical University, Xi’an, Shaanxi 710072, PR China

*Corresponding author.
Email address: [email protected]
Tel: +86 15991765831
School of Management, Northwestern Polytechnical University, 127 West Youyi Road, Xi'an, Shaanxi 710072, PR China


Highlights

1. We propose a relation extraction system that uses natural language processing techniques to distinguish between adverse drug events (ADEs) and non-ADEs on social media.
2. We develop a feature-based method to explore various lexical, syntactic, and semantic features, and we investigate the effectiveness of feature selection and analyze the contributions of different features.
3. We evaluate the effectiveness of four well-known kernels to investigate whether kernel-based methods can effectively extract ADEs from social media.
4. By adopting different combination methods, we propose several classifier ensembles to further enhance ADE extraction capabilities.

Abstract

Objective: Because adverse drug events (ADEs) are a serious health problem and a leading cause of death, it is of vital importance to identify them correctly and in a timely manner. With the development of Web 2.0, social media has become a large data source for information on ADEs. The objective of this study is to develop a relation extraction system that uses natural language processing techniques to effectively distinguish between ADEs and non-ADEs in informal texts on social media.

Methods and Materials: We develop a feature-based approach that utilizes various lexical, syntactic, and semantic features. Information-gain-based feature selection is performed to address the high dimensionality of the features. Then, we evaluate the effectiveness of four well-known kernel-based approaches (i.e., subset tree kernel, tree kernel, shortest dependency path kernel, and all-paths graph kernel) and several ensembles that are generated by adopting different combination methods (i.e., majority voting, weighted averaging, and stacked generalization). All of the approaches are tested on three data sets: two health-related discussion forums and one general social networking site (i.e., Twitter).

Results: When investigating the contribution of each feature subset, the feature-based approach attains its best area under the receiver operating characteristic curve (AUC) values of 78.6%, 72.2%, and 79.2% on the three data sets. Among the individual methods, we attain the best AUC values of 82.1%, 73.2%, and 77.0% using the subset tree kernel, the shortest dependency path kernel, and the feature-based approach on the three data sets, respectively. Using classifier ensembles, we achieve the best AUC values of 84.5%, 77.3%, and 84.5% on the three data sets, outperforming the baselines.

Conclusions: Our experimental results indicate that ADE extraction from social media can benefit from feature selection. With respect to the effectiveness of different feature subsets, lexical features and semantic features enhance the ADE extraction capability. Kernel-based approaches, which avoid the feature sparsity issue, are well suited to the ADE extraction problem. Combining different individual classifiers using suitable combination methods can further enhance the ADE extraction effectiveness.

Keywords: relation extraction; feature-based approach; feature selection; kernel-based approaches; social media; adverse drug event extraction


1. Introduction

An adverse drug event (ADE) is defined as any injury due to a medication [1]. This injury can be caused by a medication error, an off-label usage of a medication, or a recommended usage of a medication as per its prescription or label [2]. An adverse drug reaction (ADR), a subtype of an ADE, refers to an unintended response to a drug when it is used at the recommended dosage [3]. ADEs are a crucial public health concern, and they can result in increased hospitalizations, morbidity, and even mortality [4,5]. For example, it is estimated that ADEs affect approximately 2 million inpatients each year in the United States [6] and can lead to prolonged hospital stays. In terms of outpatients, ADEs are annually responsible for approximately 125,000 hospital admissions, 1 million emergency visits, and over 3.5 million physician visits in the United States [7]. ADEs can also result in reputation damage for pharmaceutical companies and major financial losses for countries (e.g., approximately $660 million per year in Australia due to an estimated 190,000 medication-related hospital admissions [8]). Therefore, detecting ADEs accurately and in a timely manner is important for stakeholders (e.g., patients, physicians, pharmaceutical companies, and regulatory authorities).

Although ADEs can be identified by pre-marketing clinical trials, such trials are limited because they involve selected populations and are constrained in scale and duration [4,9]. Therefore, major drug safety risks may remain when a drug reaches the market, making post-marketing surveillance the paramount avenue for detecting ADEs that are associated with a drug [10]. Currently, in the United States, post-marketing drug safety monitoring relies primarily on the US Food and Drug Administration (FDA) adverse event reporting system (FAERS), which is a passive system populated by voluntary ADE reports from healthcare professionals, pharmaceutical manufacturers, and patients. However, studies have shown that FAERS significantly underestimates the number of ADE cases [11,12] and is incapable of detecting ADEs in a timely manner [12].

In recent years, social media has emerged as an under-explored data source for extracting ADEs. An increasing number of patients (e.g., one-fourth of people with chronic diseases [13]) are turning to social media to seek information, obtain advice, voice concerns, and share experiences concerning drugs [14]. The objective of this paper is to develop a system for extracting ADEs from user-generated content (UGC) on social media using advanced natural language processing (NLP) techniques and machine learning algorithms. Such a system could augment the current passive systems (e.g., FAERS), thereby aiding regulatory authorities in drug-safety decision making (such as drug recalls, market withdrawals, and safety alerts) as well as reducing legal and monetary loss risks for pharmaceutical companies. Specifically, this study aims to utilize relation extraction methods, which can recognize relationships between two entities in unstructured text, to distinguish the ADE relationship between identified drug entities and event entities from other relations with good performance. We develop a feature-based method that leverages various lexical, syntactic, and semantic features. We also explore the effectiveness of the feature-based method, four well-known kernel-based approaches, and combinations of these methods. Extensive experiments are performed using two health-related discussion forums and one general social networking site (i.e., Twitter) to investigate the feasibility of our proposed system.

2. Related work

2.1. Extracting adverse drug events from social media

ADE extraction research has utilized various data sources, such as spontaneous reporting systems [15], clinical notes [16], and electronic health records [4,17]. However, recent ADE extraction studies have turned to social media, such as microblogs (e.g., Twitter [10,11,18-21]) and health-related discussion forums, including DailyStrength [9,10,21,22], MedHelp [5,23,24], AskaPatient [20,25,26], and the American Diabetes Association [26,27]. To conduct automatic ADE extraction from social media, Jiang and Zheng [18] adopted a lexicon-based approach to extract the event entities. However, this study failed to distinguish between ADEs and other types of events (e.g., drug indications). To fill this research gap, several studies classified the recognized events and endeavored to separate ADEs from other event types. For example, Leaman et al. [9] implemented a rule-based filtering method to remove other relations. Nikfarjam and Gonzalez [22] generated frequent language patterns for expressing ADEs using association rule mining. Bian et al. [11], Sarker and Gonzalez [10], and Nikfarjam et al. [21] explored different deep linguistic features, such as textual features, semantic features, sentiment-related features, and the embedding cluster number feature. Nevertheless, these studies were not able to automatically specify the drug that the identified ADEs were associated with. To address this problem, multiple studies formulated ADE extraction as a relation extraction task to detect the specific relationship between the drug entities and event entities, i.e., that "the event is caused by the drug". Co-occurrence, a relation extraction method that is easy to implement, has dominated prior studies [5,19,28-30]. However, co-occurrence generally suffers from low precision. Therefore, recent studies have turned to more sophisticated relation extraction approaches. For example, Liu et al. [23,27] adopted the shortest dependency path kernel [31] and achieved an f-measure of 66.9%.

2.2. Relation extraction methods

Relation extraction aims at recognizing relationships between two entities in unstructured text and has gained considerable attention in the biomedical domain, for example, for protein-protein interaction (PPI) extraction [32,33] and drug-drug interaction (DDI) detection [34,35]. Relation extraction has been conducted on various corpora derived from the biomedical literature or from newspapers. There are three main categories of relation extraction approaches: co-occurrence, rule-based, and statistical learning methods [36]. The co-occurrence approach is prone to low precision because it assumes that two entities are somehow related if they are mentioned together [37]. Rule-based approaches are prone to low recall because the generated rules can fail to correctly recognize instances that are expressed in uncovered patterns [38,39]. Statistical learning methods, which include feature-based methods and kernel-based methods, generally recast relation extraction as a classification problem and have achieved great success. Feature-based approaches leverage various lexical, syntactic, and semantic features, and thus time- and effort-consuming feature engineering is required [40,41]. Kernel-based methods, on the other hand, can address high- (or even infinite-) dimensional features implicitly by directly computing the inner product of the compared instances [42] via a kernel function. Multiple effective kernels that explore different feature spaces have been proposed. The tree kernel [43], subset tree kernel [44], shortest dependency path kernel [31], and all-paths graph kernel [45] leverage, respectively, shallow parse tree information, the syntax tree representation, the shortest path connecting two entities in a dependency structure, and the dependency structure together with linear surface information. Derivatives of these approaches have been developed to overcome the limitations of the original kernels. For example, the cosine kernel [46], edit distance kernel [46], walk kernel [47], walk-weighted kernel [48], and dependency trigram kernel [41] were proposed to overcome the constraint of the shortest dependency path kernel, i.e., the requirement that the two compared shortest dependency paths have the same length. Each kernel utilizes a different portion of a sentence's structure [33] and has its own benefits and drawbacks [32]. To take advantage of each kernel and retrieve various important types of information, prior studies have proposed multiple hybrid methods that combine different individual methods, and they indicate that relation extraction capabilities can be enhanced in most cases [32-36,40].


2.3. Research gaps and questions

Several prior studies have formulated ADE extraction from social media as a relation extraction problem and can output "drug/adverse-event" pairs. Although statistical learning methods have achieved great success in the healthcare domain (e.g., PPI extraction and DDI detection), to the best of our knowledge, few studies have utilized these methods to extract ADEs from social media [23,27]. Social media is a very challenging data mining environment; it abounds in misspellings, colloquial terms, abbreviations, and novel/creative phrases. These intrinsic characteristics of UGC on social media can result in a high-dimensional feature space. However, to the best of our knowledge, few studies on ADE extraction from social media have performed feature selection. Kernel-based methods can implicitly address high-dimensional features without enumerating all of the features. This property makes kernel-based methods promising for addressing social media data. Considering the characteristics of social media, it is necessary to re-evaluate the effectiveness of these relation extraction approaches, which were designed for grammatically well-formed texts. Based on these gaps, three research questions should be addressed:
(1) Can feature-based relation extraction methods effectively extract ADEs from social media? Can feature selection enhance the predictive capability? Which features are useful?
(2) Are kernel-based approaches suitable for extracting ADEs from social media?
(3) Can combinations of different methods enhance the effectiveness of extracting ADEs from social media?

3. Research design

3.1. Data collection, preprocessing, and entity recognition

Three data sets are investigated in this study. Two of these are sourced from health-related discussion forums and were developed in-house. The third data set, which was derived from the general social networking site Twitter, is publicly available. In this subsection, we focus largely on the data collection, preprocessing, and entity recognition for the two discussion-forum-sourced data sets.

To collect data from discussion forums, web pages are first downloaded using an automatic web crawler. Subsequently, a regular-expression-based parser is developed to extract specific fields, including post id, user profile, topic title, topic id, board id, URL, and post content. In total, 261,464 and 31,081 posts are collected from two well-known health-related discussion forums. Following social media data collection, necessary data preprocessing is conducted. Specifically, we use regular expressions to identify and then remove URLs, duplicated punctuation, and personally identifiable information, including telephone numbers and social security numbers (SSNs). Casefolding is also conducted.

For the two discussion-forum-sourced data sets, the basic unit for ADE extraction is the sentence. Although many ADE relations can hold across sentences, extracting ADE relations across sentences is substantially more challenging than within sentences [16], for several reasons. First, many drug-event pairs without any relation can be derived, which further exacerbates the class imbalance problem. Second, the sentences that constitute one post can have different focuses [27]. Therefore, posts are segmented into sentences by conducting sentence boundary identification using OpenNLP1. For the Twitter data set, we take the tweet as the basic unit because annotation is conducted at the tweet level in the publicly available data.

A recent survey on ADE detection has revealed that lexicon-based approaches have been the most popular [49]. As a prerequisite for relation extraction, in this study, drug and event entity recognition from discussion forums is conducted by applying the lexicon-based approach, as per Liu and Chen [27]. Given a sentence, we first utilize MetaMap2 to identify medical entities by mapping the biomedical text to the unified medical language system (UMLS)3, an ontology that encompasses a large volume of health and biomedical vocabularies. A filtering step is subsequently applied using FAERS, which is a widely used database for pharmacovigilance research. Specifically, we compare the terms identified by UMLS with each drug or event term in FAERS, and we filter out those terms that never appear in FAERS. More specifically, casefolding is conducted, and special characters (e.g., "-", " ") in phrases are replaced with "_" for all of the compared terms. Considering the abundance of colloquial expressions generated by lay persons on social media, for each entity that survives the filtering procedure, we augment our lexicon with its corresponding terms in the consumer health vocabulary (CHV)4. These unofficial terms are used to identify entities in UGC on social media by conducting wildcard matching. For example, "lose a lot of weight" is identified by comparing a window of tokens from UGC with the term "lose weight".

To evaluate the entity recognition effectiveness, a domain expert is asked to manually annotate 600 sentences, among which 250 sentences contain both drug and event entities and 350 sentences have only one type of entity (drug or event) or no entity. In total, 321 drugs and 343 events are annotated. The lexicon-based approach for medical entity recognition achieves f-measures of 87.33% and 76.16% for drug recognition and event recognition, respectively. This performance is deemed satisfactory considering the time- and effort-consuming entity annotation process that ADE relation extraction based on perfectly annotated corpora would require. We select the sentences in which a drug and an event co-occur for subsequent relation extraction.

3.2. Individual methods

To distinguish ADEs from other relations, in section 3.2.1, we propose a feature-based approach that leverages various lexical, syntactic, and semantic features. Then, in section 3.2.2, four well-known kernels are described.

3.2.1. Feature-based approach for ADE extraction

Each relation instance can be represented as a feature vector

x = (f1, f2, ..., fn), which indicates the presence of each feature captured from the sentence. In this study, we first extract an initial set of features. Then, we perform information-gain-based feature selection considering the high feature dimensionality, which results from the characteristics of UGC on social media, such as the diversity of expression and the abundance of misspellings, non-standard/colloquial terms, and creative phrases [10].
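For illustration, the information-gain computation and the ε-based selection described above can be sketched as follows. This is a minimal, self-contained example with a hypothetical toy data set and binary features; it is not the implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_col, labels):
    """Entropy reduction obtained by conditioning the labels on one feature."""
    n = len(labels)
    ig = entropy(labels)
    for value in set(feature_col):
        subset = [y for f, y in zip(feature_col, labels) if f == value]
        ig -= (len(subset) / n) * entropy(subset)
    return ig

def select_features(X, y, epsilon):
    """Keep the top-epsilon proportion of features ranked by information gain.
    X: list of binary feature vectors; y: relation labels (1 = ADE, 0 = non-ADE)."""
    num_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: gains[j], reverse=True)
    k = max(1, int(epsilon * num_features))
    return sorted(ranked[:k])

# Toy instances: feature 0 correlates perfectly with the ADE label, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
print(select_features(X, y, epsilon=0.5))  # → [0]
```

With ε = 0.5, half of the features are retained; here the informative feature 0 (IG = 1 bit) survives while the noise feature (IG = 0) is discarded.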

(1) Feature extraction

1) Lexical features

Lexical features can be divided into three categories: words, positions, and overlap. Giuliano et al. [50] hypothesize that if a relationship is asserted, then the words that appear between and around the two entities carry a large amount of information. Therefore, we extract all of the words between the drug and event entities and the N words surrounding the drug and event entities, with N set to 3, as per prior studies [51]. These features use n-grams, which include unigrams, bigrams, and trigrams [10]. For the discussion-forum-sourced data sets, we remove stop words (e.g., "and", "the") for unigrams because of their weak discriminative ability, while for the Twitter data set, we retain the stop words for all n-grams, as per Sarker and Gonzalez [10]. We replace several special characters (e.g., "*") with "_" and numbers with "D" to alleviate data sparseness. The relative positions of the two entity types (i.e., drug and event) [52] can convey useful information. We also investigate overlap information, which includes the number of tokens, the number of other drugs, and the number of other events between the two entities under consideration [53,54].

2) Syntactic features

Syntactic features can capture the latent information that is inherent in a sentence. The investigated features concentrate on the part-of-speech (POS) tags and the shortest path connecting the two entities under consideration, in either the syntax parse tree or the dependency structure. We explore the POS tags that correspond to the words in the lexical feature set, with the same settings. Features generated over the syntax parse tree include the following: trigrams over the shortest path labels (SPW), in which the two entities are replaced with "Entity1" and "Entity2"; the lowest common ancestor (LCA) of the two entities; the length of the shortest path (#SP); the length of the left sub-path (#SPL) that connects the first entity to the LCA; and the length of the right sub-path (#SPR) that connects the LCA to the second entity. With respect to dependency-based features, the shortest dependency path is conceived to encode the most important information in the dependency structure [31]. It can provide compact representations and decrease the influence of irrelevant or noisy information on social media. Trigrams of the alternating sequence of vertices and edges along the path, i.e., v-walks and e-walks (SDPW), are captured [47]. We also investigate the length of the path (#SDP) and the predicate node, to which edges with different directions are connected [47]. POS walks, generalized POS walks, the POS predicate, and the generalized POS predicate are included to alleviate data sparseness. For all of the feature-based and kernel-based methods investigated in this study, the POS tags, syntax parse tree, and collapsed dependency structure of instances are generated using the Stanford Parser5.

3) Semantic features

Semantic features have shown strong discriminative ability for ADE extraction from social media [10,11] by incorporating domain-specific and context-dependent knowledge. We consider two groups of semantic features: features that indicate other relation types and UMLS semantic types. UMLS semantic type features are used only on the two discussion-forum-sourced data sets in this study; for the Twitter-sourced data set, we use the annotated semantic information. The objective of the relation-type-indicating features is to differentiate the ADE relationship between drug and event entities from other types of relationships: drug indication, negation, and prevention. The drug indication relationship holds if the drug is prescribed for the event; this type of relation can be flagged by matching drug-event pairs against well-documented indications in FAERS. The negation relationship is asserted if no such event accompanies the drug; we utilize NegEx6 to identify negation. The prevention relationship indicates that the drug can be preventative against the event; several trigger words are manually collected to indicate the prevention relationship in this study. The medical concepts can be generalized into broad categories (UMLS semantic groups and semantic types). We use MetaMap to identify the UMLS semantic types and groups to which each entity belongs. Entities recognized by matching with CHV are labeled as type "chv". The number of features explored in each feature subset is shown in Table 1. Table 2 presents a summary of the features generated for the example instance "Beta blockers [Drug] gave me terrible headaches [Event] as well".

[Insert Tables 1 and 2 here]
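To make the lexical feature subset concrete, the following sketch extracts the between-entity words, the ±3-word context window, the relative entity order, and an overlap count for the example instance above. It is a simplified illustration: the feature-name prefixes (`uni=`, `bi=`, etc.) are hypothetical and not the exact naming scheme used in our system.

```python
def lexical_features(tokens, drug_span, event_span, window=3):
    """Sketch of the lexical feature subset. Spans are (start, end) token
    indices; returns a set of string-valued features."""
    (a0, a1), (b0, b1) = sorted([drug_span, event_span])
    between = tokens[a1:b0]                       # words between the entities
    context = tokens[max(0, a0 - window):a0] + tokens[b1:b1 + window]
    feats = {f"uni={w}" for w in between + context}
    feats |= {f"bi={u}_{v}" for u, v in zip(between, between[1:])}
    # relative position of the two entity types
    feats.add("drug_before_event" if drug_span < event_span else "event_before_drug")
    # overlap information: number of tokens between the entities
    feats.add(f"#tokens_between={len(between)}")
    return feats

tokens = "beta blockers gave me terrible headaches as well".split()
feats = lexical_features(tokens, drug_span=(0, 2), event_span=(4, 6))
# feats contains e.g. "uni=gave", "bi=gave_me", "drug_before_event", "#tokens_between=2"
```

A full implementation would additionally generate trigrams, apply stop-word removal on the discussion-forum data, and count other drug/event mentions between the pair, as described above.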

(2) Feature selection


Feature selection aims to eliminate the redundant features that carry little or no predictive information and to identify a subset of informative features [55]. The sparseness of the feature space and the representational richness on social media necessitate feature selection to remove noise and redundancy [20]. Among the diverse feature selection methods, information gain (IG) has been widely used because of its computational efficiency and effectiveness [55]. IG is a univariate measure of the entropy reduction provided by a specific attribute across classes [56]. The IG for a specific attribute can be calculated by subtracting the conditional entropy of the classes given the feature from the overall entropy across the drug-event relationship classes. To determine the final number of utilized features, i.e., the IG threshold, we introduce a parameter ε (defined as the proportion of selected features to total features) and investigate different values of this parameter.

3.2.2. Kernel-based approaches for ADE extraction

Kernel-based relation extraction methods measure the similarity between instances using a kernel function K: X × X → [0, ∞], which maps a pair of objects (x, y) to a similarity score K(x, y) by implicitly calculating the inner product of the input features without enumerating all of the features [42]. To illustrate how a new instance is predicted, we take support vector machines (SVM), a prime example of a statistical learning algorithm that utilizes kernels [57], as an example. Similar to the discriminant function in linearly separable problems, when a kernel function is used, the new discriminant function is as follows:

f'(x) = sgn(Σ_{i=1}^{n} a_i y_i K(x_i, x) + b)    (1)

where x_i is the representation of the ith training data point in the high-dimensional space, y_i is the class label of x_i, x is the new instance, K(x_i, x) is the similarity of the two instances calculated using the defined kernel function, and n is the number of training data points. To predict the label y of x, y = 1 if f'(x) > 0, and y = −1 if f'(x) < 0.

Kernels can compute the similarity of two instances implicitly. Therefore, they can avoid time- and effort-consuming feature engineering and sidestep the sparsity issue that results from the diverse language patterns on social media. We investigate four kernel-based approaches for ADE extraction: the subset tree kernel, tree kernel, shortest dependency path kernel, and all-paths graph kernel.

(1) Subset tree kernel

As a convolution kernel, the subset tree kernel (SST) calculates the similarity of two parse trees by computing the number of common sub-trees [44,58]. The grammatical rules must be retained, which means that for a tree node n, either all or none of its children must be included in the resulting substructure [59]. The convolution tree kernel K(T1, T2) is defined as:

K(T1, T2) = Σ_{n1 ∈ N1, n2 ∈ N2} Δ(n1, n2)    (2)

where Ti is the ith parse tree, Ni represents the set of tree nodes in Ti, and ni denotes a tree node in Ni. Here, Δ(n1, n2) computes the number of common sub-trees rooted at n1 and n2 in a recursive way, as follows:

1) If the context-free productions at n1 and n2 are different, then Δ(n1, n2) = 0; otherwise, go to 2).
2) If both n1 and n2 are pre-terminals (i.e., they have only leaf children), then Δ(n1, n2) = λ; otherwise, go to 3).
3) Δ(n1, n2) is calculated recursively as:

Δ(n1, n2) = λ Π_{k=1}^{#ch(n1)} (1 + Δ(ch(n1, k), ch(n2, k)))    (3)

where #ch(n1) is the number of children of n1, ch(n, k) is the kth child of node n, and λ is the decay factor that scales the relative importance of sub-trees of different sizes. We select the path-enclosed tree (PT) structure [60] of the syntactic parse tree as the instance representation for SST, as shown in Fig. 1.

[Insert Figure 1 here]
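The recursive Δ computation in equations (2) and (3) can be illustrated with a toy parse fragment. This is a minimal sketch, not our experimental implementation; the tuple encoding of trees and the decay value λ = 0.4 are chosen purely for illustration.

```python
# A node is (label, child, child, ...); a leaf (word) is a plain string.
def production(node):
    """The context-free production expanding this node, e.g. NP -> DT NN."""
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def is_preterminal(node):
    _, *children = node
    return all(isinstance(c, str) for c in children)

def delta(n1, n2, lam=0.4):
    """Delta(n1, n2): decay-weighted count of common sub-trees rooted at n1, n2."""
    if isinstance(n1, str) or isinstance(n2, str):
        return 0.0                       # bare leaves root no sub-trees
    if production(n1) != production(n2):
        return 0.0                       # rule 1: different productions
    if is_preterminal(n1) and is_preterminal(n2):
        return lam                       # rule 2: matching pre-terminals
    prod = lam                           # rule 3: recurse over aligned children
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1.0 + delta(c1, c2, lam)
    return prod

def sst_kernel(t1, t2, lam=0.4):
    """K(T1, T2) = sum of Delta over all node pairs (equation (2))."""
    def nodes(t):
        return [] if isinstance(t, str) else [t] + [n for c in t[1:] for n in nodes(c)]
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

t = ("NP", ("DT", "the"), ("NN", "headache"))
print(round(sst_kernel(t, t, lam=0.4), 3))  # → 1.584
```

For the self-similarity above, the two pre-terminal pairs contribute λ = 0.4 each and the NP pair contributes λ(1 + λ)² = 0.784, so K = 1.584.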

(2) Tree kernel

The tree kernel [43] is defined based on the nodes' attributes, such as the word, POS, and entity type. We define two nodal functions: a matching function and a similarity function. The matching function determines whether two nodes match by comparing a subset of their attributes. If two nodes match, their similarity is then computed by comparing the other attributes. We define the tree kernel K_T(T1, T2) to compute the similarity of two parse trees, T1 and T2, considering both the similarity of the parent nodes and the similarity of their children:

K_T(T1, T2) = { 0,                                    if the two trees' parent nodes do not match
              { s(T1.p, T2.p) + K_c(T1.c, T2.c),      otherwise    (4)

where s is the nodal similarity function, Ti.p denotes the parent (root) node of Ti, and Ti.c denotes its children. K_c is computed in terms of the children subsequence similarities. We choose to implement the contiguous tree kernel rather than the sparse tree kernel because of its relatively smaller computational cost [43]. With respect to the feature spaces, we also investigate PT (see Fig. 1).

(3) Shortest dependency path kernel

The shortest dependency path kernel (SDP) [31] hypothesizes that the shortest dependency path between two entities in a dependency graph carries the most important information. The collapsed dependency structure of the sample sentence is shown in Fig. 2.

[Insert Figure 2 here]

The dependency path is a sequence of words with arrows that indicate the orientation of each dependency [31]. To alleviate the data sparsity that results from its lexicalized nature, we also consider word classes, including the POS, generalized POS, and entity type. The set of features at each position is defined as a Cartesian product over these classes. To compute the similarity of two shortest dependency paths x and y, the SDP kernel function is defined as:

K_SDP(x, y) = { 0,                          if m ≠ n
              { Π_{i=1}^{n} c(x_i, y_i),    if m = n    (5)

where x_i denotes the set of features at position i in x (and likewise for y_i), c(x_i, y_i) is the number of common elements between x_i and y_i, and m and n denote the lengths of the two shortest dependency paths.

(4) All-paths graph kernel

The all-paths graph kernel (APG) [45] is defined over a weighted and directed dependency graph that consists of two sub-graphs: a dependency structure sub-graph and a linear order of words sub-graph. The dependency structure sub-graph (DSS) represents the dependency structure of the sentence. It includes two categories of vertices: token and dependency. A token vertex is represented by its text and POS, while a dependency vertex is labeled with the dependency type. The labels of the vertices that correspond to the two entities under investigation are blinded as "DRUG" and "EVENT" to improve the generalization capability. To follow the shortest dependency path hypothesis [31] and emphasize the vertices on the shortest path, two actions are performed: differentiating the labels of the nodes on the shortest path and assigning different weights to the edges on the shortest path. The linear order of words sub-graph (LOS) is a linear structure of the sentence. Each token vertex is labeled with its text, POS, and relative position to the entity pair of interest. The same weight is assigned to each edge that connects a word to its succeeding word. The graph representation is illustrated in Fig. 3.

[Insert Figure 3 here]

For the calculation, we use two types of matrices: an edge matrix A ∈ R^{|V|×|V|} and a label matrix L ∈ R^{|V|×|L|}. Here, V represents the set of vertices, and L is the set of possible labels of the vertices. A_{ij} gives the edge weight if an edge connecting v_i ∈ V and v_j ∈ V exists, and 0 otherwise. The weight of the edges on the shortest dependency path is predefined to be 0.9, while the other edges are assigned a weight of 0.3. L_{ij} = 1 if the i-th vertex has the j-th label, and 0 otherwise. A graph matrix G can be calculated using the Neumann series:

G = Lᵀ((I − A)⁻¹ − I)L    (6)

The matrix G sums up the weights of all possible paths between any pair of vertices; each entry represents the strength of a connection. Given two instances represented as graph matrices G and G′, the graph kernel is defined as:

K(G, G′) = Σ_{i=1}^{|L|} Σ_{j=1}^{|L|} G_{ij} G′_{ij}    (7)

3.3. Ensemble of classifiers
Each aforementioned individual method utilizes different aspects of the sentence and computes the similarity of instances in a different way, and each presents different advantages and disadvantages. For example, the feature-based method incorporates various types of features, but its syntactic features could be insufficient to represent the syntactic structures of sentences [41,61]. Both the tree kernel and SST address the syntax parse tree information; however, the tree kernel is constrained by requiring the compared nodes to be at the same layer, while it is a challenge for SST to represent the parse tree precisely and concisely [62]. SDP focuses on the shortest path that connects two entities in the dependency structure, but it is constrained by the requirement that the shortest paths have the same length. APG considers both the dependency graph and the word features; however, it misses the semantic features, which have shown strong discriminative capability for ADE extraction [10,11].

To make full use of each method's strengths and its ability to complement the other methods, an ensemble of classifiers can be generated and investigated, in light of the previously demonstrated ability of ensembles to enhance predictive capability in many application domains [32,35]. A classifier ensemble is a set of classifiers whose outputs are merged to reach a consensus and a final decision, so that the aggregated output outperforms any single classifier [63-65]. Ensemble methods can reduce the variance error without increasing the bias error [64], thereby enhancing the generalization capability. Classifier ensembles are built in two steps: generating a set of diverse base classifiers (also referred to as base learners) and combining their individual decisions to achieve a final hypothesis [63,65].

3.3.1. Base learners
In our study, the feature-based approach, tree kernel, SST, SDP, and APG are all base learners. To determine the optimal combination of base learners, it is necessary to conduct ensemble selection; as Jurek et al. [63] noted, “not all of the generated models contribute during the final decision-making process”, and as Zhou et al. demonstrated, “Many may be better than all” [66]. Therefore, selecting the optimal group of member classifiers could improve both the accuracy and efficiency of a classifier ensemble [63]. The selection techniques can be divided into two general methodologies: static selection and dynamic selection [63]. Static selection is easy to implement; the N-best approach, which selects the N classifiers with the best performance, represents the simplest selection technique. The most

popular criteria for ensemble selection are performance and diversity [63]. In this study, we derive two groups of classifier ensembles by fusing the three best-performing classifiers and by selecting the three classifiers that have the highest diversity. As the diversity measure, we select the non-pairwise entropy measure E [67].

3.3.2. Combination methods
There are various combination methods, such as averaging and meta-learning [63,64]. We apply three fusion strategies: majority voting, weighted averaging, and stacked generalization [68].

(1) Majority voting
Voting is the simplest method for combining predictions from multiple classifiers. Majority voting selects the class that obtains the highest number of votes [69] as the final prediction for an unlabeled instance.

(2) Weighted averaging
Several ensembles assign the same weight to the individual classifiers [33,36,40], and they sometimes fail to achieve the best performance [33]. Therefore, some effort has been made to assign higher weights to the individual classifiers that perform better [32,34,35]. The new similarity measure produced by combining the individual similarities using weighted averaging can be calculated as:

K(X, X′) = Σ_{m=1}^{M} σ_m K_m(X, X′)    (8)

Σ_{m=1}^{M} σ_m = 1,  σ_m ∈ [0, 1]    (9)

where K_m is the normalized output of each classifier, M denotes the number of utilized classifiers, and σ_m is the weight assigned to K_m according to its performance.

(3) Stacked generalization
Stacked generalization (or stacking) provides a mechanism for utilizing the collective discriminatory ability of a set of heterogeneous classifiers [56,65,68]. The main idea of stacking is to train a meta-level classifier using the predictions generated by the base-level models as input attributes [35,64,70]. The meta-level classifier can therefore exploit the diversity of the base-level classifiers’ outputs [70] and learn from the errors generated by the base-level classifiers [35]. Figure 4 presents the working process of stacked generalization. The original training data set Dtrain is randomly split into J disjoint blocks Dj of almost equal size and the same class distribution to conduct J-fold cross-validation. In the j-th fold, the base-level classifiers are trained on D^(−j) = Dtrain \ Dj

and subsequently utilized to classify all

of the instances in Dj. The training set for the meta-level classifier is augmented by incorporating each instance in Dj, whose input vector is composed of the outputs generated by the base-level classifiers together with the original class value. The meta-level model is generated by applying a meta-level algorithm to the level-1 training data. To generate the final decision, the meta-level model is applied to the original testing instances, whose input vectors have the same composition as the level-1 training data.

[Insert Figure 4 here]
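The three combination strategies can be sketched as follows (a hedged illustration: the scores, weights, blocks, and the trivial base learner are invented, and the stacking function only builds the level-1 training data rather than fitting a meta-level model):

```python
from collections import Counter

def majority_vote(labels):
    """Select the class with the highest number of votes."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(scores, weights):
    """Eq. (8): weighted sum of normalized outputs; the weights must
    satisfy Eq. (9), i.e., sum to one and lie in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, scores))

def stacking_features(blocks, base_learners):
    """Level-1 data for stacking: for each block j, train each base
    learner on the remaining blocks and record its predictions on Dj."""
    meta = []
    for j, held_out in enumerate(blocks):
        train = [inst for k, blk in enumerate(blocks) if k != j
                 for inst in blk]
        models = [fit(train) for fit in base_learners]
        for x, y in held_out:
            meta.append(([m(x) for m in models], y))
    return meta

print(majority_vote(["ADE", "non-ADE", "ADE"]))                      # ADE
print(round(weighted_average([0.9, 0.4, 0.7], [0.5, 0.2, 0.3]), 2))  # 0.74
blocks = [[("x1", 1), ("x2", 0)], [("x3", 1)]]
trivial = [lambda train: (lambda x: 1)]  # a trivial stand-in base learner
print(len(stacking_features(blocks, trivial)))                       # 3
```

In the stacking sketch, every original training instance contributes exactly one level-1 instance, whose attributes are the base-level predictions, mirroring the J-fold procedure described above.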


3.4. Classification and evaluation
We implement most individual methods using SVM, except for the APG kernel, which utilizes the regularized least squares classifier (RLS). For stacked generalization, we investigate the effectiveness of multi-response linear regression (MLR) as the meta-level algorithm. To evaluate the effectiveness of each feature subset in the feature-based method, the individual kernels, and the combinations of classifiers using majority voting, weighted averaging, and stacked generalization, we adopt standard machine learning evaluation metrics, including the accuracy and f-measure. We also report the results in terms of the area under the receiver operating characteristic curve (AUC) [71], which is not affected by the class distributions in the data sets.
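The AUC admits a simple rank-based formulation: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counting one half (a minimal sketch with illustrative scores):

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a random positive instance is
    scored above a random negative one (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.4, 0.2]  # illustrative classifier outputs
labels = [1, 1, 0, 1, 0]            # 1 = ADE, 0 = non-ADE
print(auc(scores, labels))          # 5 correctly ranked pairs out of 6
```

Because the statistic depends only on how positives are ranked against negatives, it is unaffected by the class distribution, which is the reason we report it alongside accuracy and f-measure.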

4. Experimental setup
4.1. Test bed
To verify the generalization ability of the methods, we used them to extract ADEs associated with both general drugs and specific diseases. Although several annotated data sets are publicly available for extracting ADEs from social media, these corpora address a diverse range of drugs that are not specific to a domain or disease [49]. Therefore, we developed two data sets related to two representative diseases, i.e., diabetes and heart disease, for ADE extraction research. The first data set consisted of 2,200,557 sentences taken from MedHelp7 and related to heart disease (referred to as “HD”). The second comprised 61,226 sentences sourced from Diabetes forums8 and was


related to diabetes (referred to as “DF”). We also conducted our research on a publicly available annotated corpus9, which was taken from Twitter10 and consisted of 815 instances (referred to as “TW”). To support the supervised machine learning techniques, a randomly selected subset of each discussion-forum-sourced data set (1,300 instances in HD; 500 instances in DF) was chosen for manual annotation. The selected instances were annotated by three postgraduate students majoring in biomedical informatics under the guidance of a domain expert. We categorized the relations between drug entities and event entities into two classes: ADE and non-ADE. If an instance contained any entity that was mistakenly recognized, the instance was labeled as “non-ADE”. The instances taken from the HD set were assigned evenly to the annotators, while the 500 instances taken from the DF set were annotated by all three annotators to measure the inter-annotator agreement (IAA). We calculated the IAA for all three annotator pairs using the Kappa statistic. An average Kappa of 0.718 was obtained, which implies substantial agreement among the annotators. Following the IAA calculation, disagreements were resolved by the domain expert to construct the gold standard corpus for the DF data set. Some statistical information on the three test beds is shown in Table 3.

[Insert Table 3 here]
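The Kappa statistic for one annotator pair can be sketched as follows (the annotations shown are illustrative, not drawn from our corpus); we averaged this measure over all three pairs:

```python
def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed
    pe = sum((a.count(c) / n) * (b.count(c) / n)
             for c in set(a) | set(b))                         # by chance
    return (po - pe) / (1 - pe)

# Illustrative annotations from two of the three annotators:
ann1 = ["ADE", "ADE", "non-ADE", "ADE", "non-ADE", "non-ADE"]
ann2 = ["ADE", "non-ADE", "non-ADE", "ADE", "non-ADE", "ADE"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.333
```

Averaging this pairwise statistic over all annotator pairs yields the reported overall agreement.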


4.2. Experiments
In this research, we conducted binary classification to separate ADE relationships from non-ADE relationships. For all of the experiments, we divided each of the three test beds into two parts: a training set (80%) and a testing set (20%). The two parts had a class distribution similar to that of the full data sets. To avoid information leakage, all of the experiments were conducted at the document level, which means that instances derived from the same sentence were not allowed to appear in both the training set and the testing set. To mitigate the class imbalance, we applied an over-sampling method [72] to the training set, while retaining the original class distribution in the testing set. In the feature-based method, all of the extracted features originated exclusively from the training data. Additionally, 10-fold cross-validation was performed to conduct feature selection and parameter tuning on the training set. Following these procedures, models were trained with the best parameter settings on the full training set and applied to the testing set for evaluation.

For comparison purposes, we re-implemented two methods that have been used for ADE extraction from social media, i.e., co-occurrence and a hybrid method that combines SDP-kernel-based relation detection with semantic-filter-based relation classification [23,27]. Co-occurrence was selected as one baseline because of its popularity, while the hybrid method was chosen as the other baseline because it uses statistical learning to detect the relations and its ADE extraction pipeline is similar to ours. Subsequently, we performed three


experiments. The first experiment was intended to determine the number of utilized features and compare the contribution of each feature subset in the feature-based method. The second experiment compared five individual methods: the feature-based method, tree kernel, SST, SDP, and APG. The final experiment aimed to investigate whether combinations of individual methods can enhance the generalization capability in comparison to the individual constituent methods. The proposed feature-based method was implemented using the linear kernel in the SVMlight 11 package. We used the tree kernel toolkit12 for the implementation of SST and the all-paths graph kernel13 for APG. The tree kernel and SDP were implemented using customized kernel functions that were incorporated into the SVMlight package. To find the best parameter settings, we conducted a coarse-grained grid search on several important parameters. The parameter selection strategy and the best parameter settings are shown in Table 4.

[Insert Table 4 here]
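The over-sampling applied to the training set (Section 4.2) can be sketched as simple random duplication of minority-class instances (a hedged illustration; [72] may prescribe a different sampling scheme, and the instances below are placeholders):

```python
import random

def oversample(instances, labels, seed=42):
    """Randomly duplicate minority-class instances until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out += [(x, y) for x in xs + extra]
    rng.shuffle(out)
    return out

# Placeholder sentences with an imbalanced class distribution:
train = oversample(["s1", "s2", "s3", "s4"],
                   ["ADE", "non-ADE", "non-ADE", "non-ADE"])
print(len(train))  # 6: the ADE class is duplicated up to 3 instances
```

Only the training split is rebalanced this way; the testing split keeps its original class distribution, as described above.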

5. Results and discussion
In this section, we present and analyze the experimental results. The highest results in each experiment are boldfaced, and the best performances across all of the experiments are both boldfaced and underlined.

5.1. Contribution of each feature subset in the feature-based method
As shown in Fig. 5, the feature-based method attains the best f-measure of 82.5%,


81.5%, and 94.2% when ε = 0.1, ε = 0.2, and ε = 0.08 (1,270, 1,528, and 1,209 features are selected) for the HD, DF, and TW data sets, respectively. These small proportions of retained features verify the necessity of feature selection for ADE extraction from social media. The selected features constitute a new feature set (labeled “full”) for each data set.

[Insert Figure 5 here]
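The IG-based selection with threshold ε can be sketched as follows (a minimal illustration; the feature names and the information-gain computation over binary presence indicators are simplified assumptions):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(feature_on, labels):
    """IG = H(Y) - H(Y | feature), for a binary presence indicator."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_on, labels) if f == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

labels = [1, 1, 1, 0, 0, 0]                    # toy ADE / non-ADE labels
features = {                                   # hypothetical feature columns
    "sdpw:drug->cause->event": [True, True, True, False, False, False],
    "unigram:take":            [True, False, True, True, False, True],
}
eps = 0.1
selected = [f for f, col in features.items() if info_gain(col, labels) > eps]
print(selected)  # only the discriminative SDPW feature survives
```

As noted below, IG is a univariate measure: each feature is scored against the class in isolation, so redundancy between features is not penalized.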

To investigate the contribution of each feature subset, we adopt a leave-one-out strategy: a specific feature subset is removed from the full feature set to observe the change in performance. The results are shown in Table 5. Lexical features tend to be useful across the three data sets, because the AUC decreases when they are removed. However, introducing the POS features deteriorates the classification effectiveness: the AUC is unchanged or improves on each data set when the POS features are removed. This finding is perhaps due to potential redundancy between the POS features and other features. Although IG-based feature selection is conducted, IG is a univariate measure and cannot address feature interactions. The effectiveness of the dependency features is revealed on the DF and TW sets, where there are obvious AUC drops when the dependency features are removed. The syntax tree features present weak discriminative capability for the HD and DF sets, which could result from deteriorated syntactic parsing performance on the long sentences in discussion forums. When

the POS features and syntax tree features are removed simultaneously, we observe enhanced effectiveness in most cases (e.g., AUC improvement on the HD and DF data sets; f-measure improvement on the DF and TW data sets). When the syntactic feature subset (i.e., the union of the POS set, dependency set, and syntax tree set) is removed, decreased AUC values are attained on the DF and TW data sets, which indicates the effectiveness of the syntactic features. In terms of the semantic features, their effectiveness is demonstrated by the observation that the performance decreases on the three data sets when the semantic features are removed from the full feature set. [Insert Table 5 here]

5.2. Effectiveness of individual methods
The effectiveness of each individual method is shown in Table 6. In the feature-based method, we use the features remaining after the POS features and syntax tree features are removed. The feature-based method appears to be superior. For example, it attains the best AUC of 77.0% on the TW set, the best accuracy of 68.0% on the DF set, and the best f-measure of 58.7% as well as the second-best AUC on the HD set. This good predictive capability benefits from the incorporation of various types of information. The SDP kernel usually yields better performance than the other kernels. It achieves the best f-measure of 66.1% and 93.8% on the DF set and TW set, respectively. This finding could be because the shortest dependency path encodes the sentence information compactly and concisely,


thereby decreasing the negative effect of irrelevant information or noise in social media text. In contrast, although the APG considers additional information, it presents inadequate classification capability compared to the SDP. Among the investigated individual methods, the tree kernel performs worst on all of the data sets in most cases. As shown in Table 6, all of the investigated individual methods except the tree kernel outperform the baselines in most cases.

[Insert Table 6 here]

5.3. Effectiveness of ensembles
Table 7 illustrates the effectiveness of the investigated ensembles, which differ in their base learners or combination methods. The “performance criterion” represents the combination of the three classifiers that rank in the top three in terms of the AUC, while the “diversity criterion” is the combination of the three classifiers that have the highest diversity. As shown in Table 7, the best AUC values over all three data sets are attained by the weighted averaging strategy. For the HD and TW data sets, the ensembles that combine all five individual methods achieve the best AUC of 84.5%. For the DF set, the best AUC of 77.3% is achieved by combining the three classifiers (feature-based learner, SDP, and SST) selected based on the performance criterion.

[Insert Table 7 here]


5.3.1. Performance comparison of different combination methods
As shown in Table 7, no combination method performs best consistently [73]. The weighted averaging method delivers the best performance in most cases on the HD set. Majority voting attains the best f-measure on the DF set. Stacking yields the best performance in most cases on the TW set. Majority voting performs worst in terms of the AUC in comparison to the other combination methods for the ensembles derived based on the performance and diversity criteria. The inferiority of majority voting lies in its constraint that two independent conditions (i.e., accuracy and diversity) must be satisfied simultaneously to achieve a good ensemble [74].

5.3.2. Performance comparison between the ensembles and individual methods
Figure 6 illustrates the AUC comparison between the ensembles and their respective individual constituent methods. As shown in the figure, the ensembles that use weighted averaging and stacking achieve an improved AUC in comparison to their respective individual constituent methods in eight and six out of nine cases, respectively. However, majority voting fails to attain an improved AUC compared to the individual methods in most cases.

[Insert Figure 6 here]

5.3.3. Performance comparison between the ensembles and baselines
Figure 7 illustrates the AUC comparison between the ensembles that combine all five individual methods in our study and the baselines (i.e., the co-occurrence


method and hybrid method combining SDP-based relation detection with semantic-filter-based relation classification [27]). As shown in Fig. 7, the ensembles in our study significantly outperform the baselines.

[Insert Figure 7 here]

5.4. Discussion
To better understand the characteristics of the texts on social media (discussion forums and Twitter), several data statistics are calculated over the three data sets (HD, DF, and TW) (see Table 8). These statistics include the number of sentences per post, the number of tokens per instance (sentence or tweet), the drug (event) type/token ratios in the annotated data sets, and the n-gram (n = 2, 3)/unigram ratio in the selected features.

[Insert Table 8 here]

On average, each post consists of 8.42 sentences in the HD set, which makes the above-mentioned inter-sentential relation extraction challenging. The average instance length in the discussion forums is greater than that in the tweets (for example, 28 tokens versus 19 tokens per instance in the HD set and TW set, respectively). The drug (event) type/token ratios can, to some extent, reveal the lexical diversity of references to drugs and events. Events exhibit more lexical variation than drugs on all three data sets, especially on the


TW set. The linguistic diversity of the event mentions is attributed to either spelling errors or colloquial expressions [23]. Figure 8 illustrates the composition of the features selected via IG-based feature selection and shows that SDPW features (i.e., the alternating sequence of vertices and edges along the shortest dependency path) occupy the largest percentage on all three data sets. This finding implies that the expression patterns in social media are diverse at the syntactic level. The n-gram (n = 2, 3)/unigram ratio shown in Table 8 reflects, to some extent, linguistic diversity at the lexical level. As shown in Table 8, the ratio on the TW set is higher than those on the HD and DF sets. The more diverse speech patterns on Twitter than in discussion forums are attributed to Twitter's length constraints, which give rise to various creative expressions/symbols and non-standard abbreviations. The characteristics of informal text on social media can give rise to representational richness issues and feature sparsity. Figure 9 illustrates the frequency distribution of the selected features. As shown in Fig. 9, only approximately 20% of the features occur more than three times in the DF set, and the proportion is even below 10% in the TW set.

[Insert Figures 8 and 9 here]

We further analyze the selected semantic features and find that the feature that indicates “drug indication” is selected over all of the three data sets. This finding implies that external knowledge bases can effectively detect drug indication relations, thereby enhancing the ADE identification capability.

6. Conclusions and future directions
Social media websites are under-explored data sources for extracting ADEs. In this study, we developed a system that is capable of effectively distinguishing between ADEs and non-ADEs (e.g., drug indications) in informal text on social media. A relation extraction system was proposed and implemented using advanced NLP techniques. Specifically, we proposed a feature-based method that explored various lexical, syntactic, and semantic features. In particular, we performed IG-based feature selection. The experimental results suggested that feature selection can significantly enhance the ADE classification capability. This finding could have implications for social intelligence, considering the feature sparsity and representational richness issues that accompany social media data. In terms of the contributions of different features, our experimental results revealed that lexical features and especially semantic features can enhance the ADE extraction capability, thus offering the possibility of further improving ADE extraction effectiveness by resorting to external knowledge bases in the biomedical domain.

We also explored the possibility of utilizing kernels for ADE extraction from social media. The experimental results showed that kernel-based methods can effectively extract ADEs from social media; kernels avoid time- and effort-consuming feature engineering as well as the sparsity problem that results from the intrinsic characteristics of social media. We adopted different combination methods to develop several ensembles that delivered improved effectiveness compared to the individual constituent methods

and the baseline methods, which provides more reliable decision-making support for stakeholders such as patients, physicians, pharmaceutical companies, and regulatory authorities. In future work, we intend to apply wrapper-based feature selection methods to further remove feature redundancy and noise. We also plan to utilize other ensemble learning methods, such as random forest and random subspace, to address the high-dimensional feature space. Furthermore, we would like to use semi-supervised learning methods to make full use of the unlabeled data on social media.

Acknowledgments This work is partially supported by the National Natural Science Foundation of China (No. 71172124), the Specialized Research Fund for the Doctoral Program of Higher Education of China (No.20116102110036), the Shaanxi Province Soft Science Research Project (No.2015KRM021), and the Humanity and Social Science Youth Foundation of the Ministry of Education of China (No.12YJC630051).

References
[1] Bates DW, Cullen DJ, Laird N, Petersen LA, Small SD, Servi D, et al. Incidence of adverse drug events and potential adverse drug events: implications for prevention. JAMA. 1995;274:29-34.
[2] Karimi S, Wang C, Metke-Jimenez A, Gaire R, Paris C. Text and data mining techniques in adverse drug reaction detection. ACM Comput Surv. 2015;47:56.
[3] World Health Organization. International drug monitoring: the role of national centres. Report of a WHO meeting. 1972.
[4] Ji Y, Ying H, Dews P, Mansour A, Tran J, Miller RE, et al. A potential causal association mining algorithm for screening adverse drug reactions in postmarketing surveillance. IEEE T Inf Technol Biomed. 2011;15:428-37.
[5] Yang CC, Yang H, Jiang L. Postmarketing drug safety surveillance using publicly available health-consumer-contributed content in social media. ACM Trans Manage Inf Syst. 2014;5:2.
[6] Segura-Bedmar I, Martínez P, Herrero-Zazo M. Lessons learnt from the DDIExtraction-2013 shared

task. J Biomed Inform. 2014;51:152-64. [7] http://health.gov/hcq/ade.asp. (Accessed: 1 March 2016). [8] Roughead EE, Semple SJ. Medication safety in acute care in Australia: where are we now? Part 1: a review of the extent and causes of medication problems 2002–2008. Australia and New Zealand Health Policy. 2009;6:18. [9] Leaman R, Wojtulewicz L, Sullivan R, Skariah A, Yang J, Gonzalez G. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In: K. Bretonnel Cohen DD-F, Sophia Ananiadou, John Pestian, Jun'ichi Tsujii, Bonnie Webber, editor. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden: ACL; 2010. p. 117-25. [10] Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform. 2015;53:196-207. [11] Bian J, Topaloglu U, Yu F. Towards large-scale twitter mining for drug-related adverse events. In: Christopher C. Yang HC, Howard Wactlar, Carlo Combi, Xuning Tang, editor. Proceedings of the 2012 international workshop on Smart health and wellbeing. Maui, HI, USA: ACM; 2012. p. 25-32. [12] Yang CC, Yang H, Jiang L, Zhang M. Social media mining for drug safety signal detection. Proceedings of the 2012 international workshop on Smart health and wellbeing. Maui, HI, USA: ACM; 2012. p. 33-40. [13] Andreu-Perez J, Poon CCY, Merrifield RD, Wong STC, Yang G-Z. Big Data for Health. IEEE J Biomed Health Inform. 2015;19:1193-208. [14] Abbasi A, Adjeroh D, Dredze M, Paul MJ, Zahedi FM, Zhao H, et al. Social Media Analytics for Smart Health. IEEE Intell Syst. 2014;29:60-80. [15] Xu R, Wang Q. Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection. BMC Bioinformatics. 2014;15:17. [16] Henriksson A, Kvist M, Dalianis H, Duneld M. 
Identifying Adverse Drug Event Information in Clinical Notes with Distributional Semantic Representations of Context. J Biomed Inform. 2015;57:333-49. [17] Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inf Assoc. 2009;16:328-37. [18] Jiang K, Zheng Y. Mining twitter data for potential drug effects: Springer Berlin Heidelberg; 2013. [19] Freifeld CC, Brownstein JS, Menone CM, Bao W, Filice R, Kass-Hout T, et al. Digital drug safety surveillance: monitoring pharmaceutical products in Twitter. Drug Saf. 2014;37:343-50. [20] Sharif H, Zaffar F, Abbasi A, Zimbra D. Detecting adverse drug reactions using a sentiment classification framework. In: Chaitanya Baru PK, Kai Hwang, L.W. Chang, Merce Crosas, Howard Lander, Arcot Rajasekar, Stanley Ahalt, Tom Carsey, Justin Zhan, Srinivas Aluru, Yixin Chen, Longbing Cao, editor. 2014 ASE BIGDATA/SOCIALCOM/CYBERSECURITY Conference. Stanford University, Stanford, CA, USA: USA: ASE; 2014. [21] Nikfarjam A, Sarker A, O’Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inf Assoc. 2015;22:671-81. [22] Nikfarjam A, Gonzalez GH. Pattern mining for extraction of mentions of adverse drug reactions from user comments. In: Evans RS, editor. Proceeding of 2011 AMIA Annual Symposium. Bethesda,


MD, USA: American Medical Informatics Association; 2011. p. 1019-26. [23] Liu X, Liu J, Chen H. Identifying Adverse Drug Events from Health Social Media: A Case Study on Heart Disease Discussion Forums. In: Xiaolong Zheng DZ, Hsinchun Chen, Yong Zhang, Daniel B. Neill, editor. Proceedings of the 2014 international conference on Smart Health. Beijing, China: Springer; 2014. p. 25-36. [24] Yang M, Kiang M, Shang W. Filtering big data from social media–Building an early warning system for adverse drug reactions. J Biomed Inform. 2015;54:230-40. [25] Metke-Jimenez A, Karimi S. Concept extraction to identify adverse drug reactions in medical forums: A comparison of algorithms. arXiv preprint arXiv:150406936. 2015. [26] Karimi S, Metke-Jimenez A, Kemp M, Wang C. Cadec: A corpus of adverse drug event annotations. J Biomed Inform. 2015;55:73-81. [27] Liu X, Chen H. AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In: Zeng D, Yang, C.C., Tseng, V.S., Xing, C., Chen, H., Wang, F.-Y., Zheng, X., editor. Proceedings of the 2013 international conference on Smart Health. Beijing, China: Springer; 2013. p. 134-50. [28] Benton A, Ungar L, Hill S, Hennessy S, Mao J, Chung A, et al. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. J Biomed Inform. 2011;44:989-96. [29] Mao JJ, Chung A, Benton A, Hill S, Ungar L, Leonard CE, et al. Online discussion of drug side effects and discontinuation among breast cancer survivors. Pharmacoepidemiol Drug Saf. 2013;22:256-62. [30] Segura-Bedmar I, de la Pena S, Martınez P. Extracting drug indications and adverse drug reactions from Spanish health social media. In: Tsujii KCaDD-FaSAaJ-i, editor. Proceedings of the 2014 Workshop on Biomedical Natural Language Processing. Baltimore, Maryland USA: ACL; 2014. p. 98-106. [31] Bunescu RC, Mooney RJ. A shortest path dependency kernel for relation extraction. In: Mooney RJ, editor. 
Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada: ACL; 2005. p. 724-31. [32] Yang Z, Tang N, Zhang X, Lin H, Li Y, Yang Z. Multiple kernel learning in protein–protein interaction extraction from biomedical literature. Artif Intell Med. 2011;51:163-73. [33] Miwa M, Sætre R, Miyao Y, Tsujii Ji. Protein–protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inform. 2009;78:e39-e46. [34] Chowdhury MFM, Lavelli A. FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In: Suresh Manandhar DY, editor. Proceedings of the seventh International Workshop on Semantic Evaluation. Atlanta, Georgia, USA: ACL; 2013. p. 53. [35] He L, Yang Z, Zhao Z, Lin H, Li Y. Extracting drug-drug interaction from the biomedical literature using a stacked generalization-based approach. PLoS One. 2013;8:e65814. [36] Li J, Zhang Z, Li X, Chen H. Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Technol. 2008;59:756-69. [37] Bunescu R, Mooney R, Ramani A, Marcotte E. Integrating co-occurrence statistics with information extraction for robust retrieval of protein interactions from Medline. In: Karin Verspoor KBC, Ben Goertzel, Interjeet Mani, editor. Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis. New York, NY: ACL; 2006. p. 49-56.


[38] Thomas J, Milward D, Ouzounis C, Pulman S, Carroll M. Automatic extraction of protein interactions from scientific abstracts. In: Altman RB, Dunker AK, Hunter L, Klein TE, editors. Pacific Symposium on Biocomputing. Hawaii, USA: World Scientific; 2000. p. 538-49.
[39] Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M. Discovering patterns to extract protein–protein interactions from full texts. Bioinformatics. 2004;20:3604-12.
[40] Chowdhury MFM, Lavelli A. Combining tree structures, flat features and patterns for biomedical relation extraction. In: Daelemans W, editor. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: ACL; 2012. p. 420-9.
[41] Choi M, Kim H. Social relation extraction from texts using a support-vector-machine-based dependency trigram kernel. Inf Process Manage. 2013;49:303-11.
[42] Vapnik VN. Statistical learning theory. New York: Wiley; 1998.
[43] Zelenko D, Aone C, Richardella A. Kernel methods for relation extraction. J Mach Learn Res. 2003;3:1083-106.
[44] Collins M, Duffy N. Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z, editors. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press; 2001. p. 625-32.
[45] Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, Salakoski T. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics. 2008;9:S2.
[46] Erkan G, Özgür A, Radev DR. Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Eisner J, editor. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Prague, Czech Republic: ACL; 2007. p. 228-37.
[47] Kim S, Yoon J, Yang J. Kernel approaches for genic interaction extraction. Bioinformatics. 2008;24:118-26.
[48] Kim S, Yoon J, Yang J, Park S. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics. 2010;11:107.
[49] Sarker A, Ginn R, Nikfarjam A, O'Connor K, Smith K, Jayaraman S, et al. Utilizing social media data for pharmacovigilance: a review. J Biomed Inform. 2015;54:202-12.
[50] Giuliano C, Lavelli A, Romano L. Exploiting shallow linguistic information for relation extraction from biomedical literature. In: McCarthy D, Wintner S, editors. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento, Italy: ACL; 2006. p. 401-8.
[51] Segura-Bedmar I, Martinez P, de Pablo-Sánchez C. Using a shallow linguistic kernel for drug–drug interaction extraction. J Biomed Inform. 2011;44:789-804.
[52] Zhang P, Li W, Hou Y, Song D. Developing position structure-based framework for Chinese entity relation extraction. Asian Lang Inform Process. 2011;10:14.
[53] Kambhatla N. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. Barcelona, Spain: ACL; 2004. p. 22.
[54] Zhou G, Zhang M. Extracting relation information from text documents by exploring various types of knowledge. Inf Process Manage. 2007;43:969-82.
[55] Dash M, Liu H. Feature selection for classification. Intell Data Anal. 1997;1:131-56.
[56] Abbasi A, Albrecht C, Vance A, Hansen J. Metafraud: a meta-learning framework for detecting financial fraud. MIS Q. 2012;36:1293-327.


[57] Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.
[58] Moschitti A. Making tree kernels practical for natural language learning. In: McCarthy D, Wintner S, editors. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento, Italy: ACL; 2006. p. 113-20.
[59] Tikk D, Thomas P, Palaga P, Hakenberg J, Leser U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput Biol. 2010;6:e1000837.
[60] Zhang M, Zhou G, Aw A. Exploring syntactic structured features over parse trees for relation extraction using kernel methods. Inf Process Manage. 2008;44:687-701.
[61] Zhou G-D, Zhu Q-M. Kernel-based semantic relation detection and classification via enriched parse tree structure. J Comput Sci Technol. 2011;26:45-56.
[62] Qian L, Zhou G. Tree kernel-based protein–protein interaction extraction from biomedical literature. J Biomed Inform. 2012;45:535-43.
[63] Jurek A, Bi Y, Wu S, Nugent C. A survey of commonly used ensemble-based classification techniques. Knowl Eng Rev. 2014;29:551-81.
[64] Rokach L. Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Comput Stat Data Anal. 2009;53:4046-72.
[65] Sesmero MP, Ledezma AI, Sanchis A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wiley Interdiscip Rev Data Min Knowl Discov. 2015;5:21-34.
[66] Zhou Z-H, Wu J, Tang W. Ensembling neural networks: many could be better than all. Artif Intell. 2002;137:239-63.
[67] Kuncheva LI, Whitaker CJ. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn. 2003;51:181-207.
[68] Wolpert DH. Stacked generalization. Neural Networks. 1992;5:241-59.
[69] Ali KM, Pazzani MJ. Error reduction through learning multiple descriptions. Mach Learn. 1996;24:173-202.
[70] Sigletos G, Paliouras G, Spyropoulos CD, Hatzopoulos M. Combining information extraction systems using voting and stacked generalization. J Mach Learn Res. 2005;6:1751-82.
[71] Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30:1145-59.
[72] He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21:1263-84.
[73] Zhou Z-H. Ensemble methods: foundations and algorithms. USA: CRC Press; 2012.
[74] Wang G, Sun J, Ma J, Xu K, Gu J. Sentiment classification: The contribution of ensemble learning. Decis Support Syst. 2014;57:77-93.


Tables

Table 1. The feature-based method: number of features explored

Feature subset     HD       DF      TW
Lexical            5,303    3,000   7,386
Syntactic
  POS              2,732    1,591   2,576
  Syntax tree      478      358     474
  Dependency       4,144    2,677   4,563
Semantic           24       22      107
Total              12,681   7,648   15,106

Table 2. A summary of features that are generated for the example instance "Beta Blockers gave me terrible headaches as well"

Feature subset  Feature type  Feature value
Lexical         Words         Unigram_between: gave, me, terrible; Unigram_after: well; Bigram_between: gave_me, me_terrible; Trigram_between: gave_me_terrible
                Position      drug_before
                Overlap       number of tokens: 3
POS             POS           Unigram_between: VBD, PRP, JJ; Unigram_after: RB; Bigram_between: VBD_PRP, PRP_JJ; Trigram_between: VBD_PRP_JJ
Syntax tree     SPW           Entity1_NNS_NP; NNS_NP_S; NP_S_VP; S_VP_NP; VP_NP_NNS; NP_NNS_Entity2
                #SP           8
                LCA           S
                #SPR          4
                #SPL          3
Dependency      #SDP          5
                SDPW          headaches_dobj_gave, dobj_gave_nsubj, gave_nsubj_Beta_Blockers; NNS_dobj_VBD, dobj_VBD_nsubj, VBD_nsubj_NNS; Noun_dobj_Verb, dobj_Verb_nsubj, Verb_nsubj_Noun
                Predicate     word: gave; POS: VBD; generalized POS: Verb
Semantic        Indication    No corresponding feature
                Negation      No corresponding feature
                Prevention    No corresponding feature
                UMLS group    CHEM_drug; DISO_event
                UMLS type     phsu_drug; sosy_event
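The lexical features in Table 2 can be illustrated with a short sketch. The helper below (`lexical_ngram_features`, a hypothetical name, not the authors' implementation) emits uni-/bi-/trigram features from the tokens between the drug and event mentions, plus the position feature, for the example instance.

```python
# Hypothetical sketch of the lexical n-gram feature generation
# summarized in Table 2; not the authors' implementation.
def lexical_ngram_features(tokens, drug_span, event_span):
    """Emit uni-/bi-/trigram features for tokens between the two entities."""
    start = min(drug_span[1], event_span[1])   # end of the first entity
    end = max(drug_span[0], event_span[0])     # start of the second entity
    between = tokens[start:end]
    feats = [f"Unigram_between:{t}" for t in between]
    feats += [f"Bigram_between:{a}_{b}" for a, b in zip(between, between[1:])]
    feats += [f"Trigram_between:{a}_{b}_{c}"
              for a, b, c in zip(between, between[1:], between[2:])]
    # Position feature: does the drug mention precede the event mention?
    feats.append("drug_before" if drug_span[0] < event_span[0] else "drug_after")
    return feats

tokens = ["Beta_Blockers", "gave", "me", "terrible", "headaches", "as", "well"]
feats = lexical_ngram_features(tokens, drug_span=(0, 1), event_span=(4, 5))
# feats includes "Unigram_between:gave", "Bigram_between:gave_me",
# "Trigram_between:gave_me_terrible", and "drug_before", matching Table 2.
```

The same windowing (before/between/after the entity pair) applies to the POS n-gram features, with POS tags substituted for word tokens.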


Table 3. Statistical information of the three test beds

Data set           #Instances  #Sentences  #Posts   #Drugs  #Drug-tokens  #Events  #Event-tokens  #ADE  #Non-ADE
HD  Total          19,162      2,200,557   261,464  172     20,683        1,058    23,694         -     -
    Training set   1,044       865         841      98      927           235      952            258   786
    Testing set    256         210         205      45      227           90       234            61    195
DF  Total          4,068       61,226      31,087   461     2,721         343      2,436          -     -
    Training set   400         372         365      63      390           115      382            186   214
    Testing set    100         96          89       25      100           53       97             46    54
TW  Training set   605         -           -        56      730           652      987            530   75
    Testing set    210         -           -        37      259           259      324            178   32

1. #Drugs is the number of different drugs; #Drug-tokens is the number of times a drug occurs in sentences. For example, if a drug appears in five different sentences, then #Drugs is 1 while #Drug-tokens is 5.
2. We annotated only a portion of the total instances in the HD and DF sets; therefore, #ADE and #Non-ADE are "-" for the "Total" data sets.
3. We conducted ADE extraction at the tweet level on the TW set. We did not conduct sentence boundary detection; therefore, #Sentences is "-" for the TW set. Posts are a concept specific to discussion forums; therefore, #Posts is "-" for the TW set.


Table 4. Parameter selection strategy and results

Feature-based method
  Explored values: ε: 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.4, 0.6, 0.8, 1;
                   0.015625, 0.0625, 0.25, 1, 4, 8, 16, 128, 256, 512; 0.5, 1, 2
  Best setting(s): HD: 0.1; 0.0625; 1. DF: 0.2; 0.25; 1. TW: 0.08; 0.0625; 1

SST
  Explored values: 0.015625, 0.0625, 0.25, 1, 4, 8, 16, 128, 256, 512; 0.5, 1, 2;
                   λ: 0.2, 0.4, 0.6, 0.8
  Best setting(s): HD: 0.25; 2; 0.4. DF: 1; 2; 0.4. TW: 0.015625; 2; 0.4

Tree kernel
  Explored values: 0.015625, 0.0625, 0.25, 1, 4, 8, 16, 128, 256, 512; 0.5, 1, 2;
                   λ: 0.2, 0.4, 0.6, 0.8
  Best setting(s): HD: 1; 2; 0.4. DF: 0.25; 2; 0.4. TW: 0.015625; 2; 0.4

SDP
  Explored values: 0.015625, 0.0625, 0.25, 1, 4, 8, 16, 128, 256, 512; 0.5, 1, 2
  Best setting(s): HD: 0.015625; 1. DF: 0.0625; 2. TW: 0.0625; 2

APG
  Explored values: number of vectors: 500, 2000; vector: linearized, normalized
  Best setting(s): HD/DF/TW: 500; normalized

Table 5. Leave-one-out classification performance (%) on the three data sets

                     HD                           DF                           TW
Features             Accuracy  F-measure  AUC     Accuracy  F-measure  AUC     Accuracy  F-measure  AUC
Full                 80.5      59.0       77.3    65.0      60.7       68.1    84.2      91.4       77.9
-Lexical             77.3      52.5       73.5    67.0      64.5       67.5    84.2      91.2       74.9
-POS                 77.7      57.1       77.3    65.0      61.5       68.3    86.1      92.4       79.2
-Dependency          81.6      61.2       78.6    62.0      55.8       62.0    83.3      90.8       71.3
-Syntax tree         81.6      59.8       78.2    65.0      62.4       69.6    85.2      92.0       74.7
-(POS+Syntax tree)   79.7      58.7       78.4    68.0      65.2       72.2    86.6      92.7       77.0
-Syntactic           76.2      59.1       78.5    62.0      58.7       66.0    85.2      92.0       72.1
-Semantic            78.5      56.0       76.5    60.0      57.4       62.8    85.2      92.0       73.7
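The leave-one-out analysis in Table 5 removes one feature subset at a time, retrains, and re-scores. A minimal sketch of that loop (with `train_and_evaluate` as a hypothetical stand-in for training the classifier and scoring it on a test bed):

```python
# Minimal sketch of the leave-one-out ablation loop behind Table 5;
# train_and_evaluate is a hypothetical stand-in for training the model
# on a feature set and returning its evaluation score(s).
SUBSETS = ["Lexical", "POS", "Syntax tree", "Dependency", "Semantic"]

def ablation(features_by_subset, train_and_evaluate):
    # "Full" uses every feature subset together.
    results = {"Full": train_and_evaluate(sum(features_by_subset.values(), []))}
    # Each "-X" row drops exactly one subset and keeps the rest.
    for left_out in SUBSETS:
        kept = [f for s in SUBSETS if s != left_out
                for f in features_by_subset[s]]
        results[f"-{left_out}"] = train_and_evaluate(kept)
    return results
```

A "-X" score higher than "Full" (e.g. -Dependency on HD) indicates that subset hurt performance on that test bed, which is what motivates the -(POS+Syntax tree) configuration.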

Table 6. Performances (%) of baselines and individual methods

                    HD                           DF                           TW
                    Accuracy  F-measure  AUC     Accuracy  F-measure  AUC     Accuracy  F-measure  AUC
Baselines
  Co-occurrence     23.8      38.5       50.0    46.0      63.0       50.0    85.2      92.0       50.0
  Hybrid method     71.1      54.3       71.5    52.0      58.6       53.6    80.4      88.5       67.1
Individual methods
  Feature-based     79.7      58.7       78.4    68.0      65.2       72.2    86.6      92.7       77.0
  Tree kernel       50.8      42.2       65.6    45.0      61.5       58.1    84.2      91.4       59.3
  SST               53.5      48.5       82.1    64.0      66.0       70.6    85.2      92.0       71.0
  SDP               64.5      52.4       74.7    58.0      66.1       73.2    89.0      93.8       74.4
  APG               75.8      53.0       74.8    58.0      61.1       66.1    84.8      92.0       70.5


Table 7. Performance (%) of ensembles

                       HD                           DF                           TW
                       Accuracy  F-measure  AUC     Accuracy  F-measure  AUC     Accuracy  F-measure  AUC
Combining all five individual methods
  Majority voting      71.1      56.0       81.6    64.0      69.5       76.5    85.2      92.0       68.0
  Weighted averaging   82.0      67.1       84.5    63.0      68.4       74.4    85.2      92.0       84.5
  Stacking             80.5      57.6       79.1    65.0      63.2       70.4    93.9      93.8       83.9
Performance criterion (HD: Feature-based, SDP, APG*; DF: Feature-based, SST, SDP; TW: Feature-based, SST, SDP)
  Majority voting      78.5      59.3       80.3    65.0      69.6       73.0    87.0      92.5       68.6
  Weighted averaging   84.4      64.9       81.8    61.0      65.5       77.3    86.5      92.2       81.9
  Stacking             83.6      59.6       81.7    64.0      62.5       74.6    93.9      93.8       82.0
Diversity criterion (HD: Feature-based, Tree, Subset; DF: Feature-based, Tree, SDP; TW: Tree, SDP, APG)
  Majority voting      63.7      52.3       78.2    57.0      66.1       69.7    85.2      92.0       66.3
  Weighted averaging   71.5      56.8       82.7    78.1      62.2       70.7    84.2      91.4       83.4
  Stacking             79.3      52.3       78.7    65.0      63.9       75.0    93.0      93.6       84.1

* In the ensemble derived based on the performance criterion on the HD set, SST is not selected because of its poor F-measure (48.5%).

Table 8. Several data statistics for the three data sets

                                      HD     DF     TW     Remark
#Sentences per post                   8.42   1.97   -
#Tokens per instance                  28     21     19
Ratio (#Drugs / #Drug-tokens)         0.09   0.15   0.12   For annotated data
Ratio (#Events / #Event-tokens)       0.23   0.28   0.68   For annotated data
Ratio (#n-grams (n=2,3) / #unigrams)  1.56   1.34   3.30   For selected features
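The two simplest combination rules compared in Table 7 can be sketched as follows; this is a toy illustration, not the paper's implementation. Majority voting takes the label predicted by most base classifiers, while weighted averaging combines each classifier's positive-class score with a per-classifier weight before thresholding.

```python
# Toy sketch of the majority-voting and weighted-averaging combination
# rules from Table 7; the label names, weights, and threshold are
# illustrative assumptions, not the paper's settings.
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most base classifiers."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(scores, weights, threshold=0.5):
    """Combine positive-class scores with per-classifier weights."""
    total = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return "ADE" if total >= threshold else "non-ADE"

print(majority_vote(["ADE", "ADE", "non-ADE"]))            # ADE
print(weighted_average([0.9, 0.4, 0.7], [2.0, 1.0, 1.0]))  # ADE
```

Stacking, the third rule, instead trains a meta-classifier whose inputs are the base classifiers' outputs on held-out data (Wolpert [68]).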

Figure captions

Fig.1. The path-enclosed tree (PT) structure for SST

Fig.2. Collapsed dependency structure of a sentence

Fig.3. Graph representation generated from the sentence “Beta_Blockers gave me terrible headaches as well”. The drug-event pair under consideration is marked as “DRUG” and “EVENT”. The shortest dependency path is shown in bold. In (a) the DSS sub-graph, all of the vertices and edges are emphasized using a post-tag (IP). In (b) the LOS sub-graph, the specialized tags are (B)efore, (M)iddle and (A)fter


Fig.4. Working process of stacked generalization

Fig.5. Performance of the feature-based method using different numbers of features


Fig.6. AUC comparison between the ensembles and their individual constituent methods 6(a) Combining all five individual methods 6(b) Combining the three classifiers whose AUC values rank as the top three 6(c) Combining the three classifiers that have the highest diversity


Fig.7. AUC comparison between the ensembles and baselines

Fig.8. Composition of selected features


Fig.9. Feature frequency distribution in the selected feature set


Footnotes

1. http://opennlp.apache.org/ (Accessed: 3 March 2016)
2. http://metamap.nlm.nih.gov (Accessed: 3 March 2016)
3. http://www.nlm.nih.gov/research/umls/ (Accessed: 3 March 2016)
4. http://www.consumerhealthvocab.org/ (Accessed: 15 May 2015)
5. http://nlp.stanford.edu/software/lex-parser.shtml (Accessed: 3 March 2016)
6. https://code.google.com/p/negex (Accessed: 3 March 2016)
7. http://www.medhelp.org/ (Accessed: 16 January 2014)
8. http://www.diabetesforums.com/index.php/index.html (Accessed: 2 February 2014)
9. http://diego.asu.edu/Publications/ADRMine.html (Accessed: 3 March 2016)
10. http://www.twitter.com/ (Accessed: 26 January 2016)
11. http://www.cs.cornell.edu/People/tj/svm_light/ (Accessed: 2 February 2016)
12. http://disi.unitn.it/moschitti/Tree-Kernel.htm (Accessed: 2 February 2016)
13. http://mars.cs.utu.fi/PPICorpora/GraphKernel.html (Accessed: 2 February 2016)

