Methods 74 (2015) 36–46


Question answering for Biology
Mariana Neves a,*, Ulf Leser b

a Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany
b Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, Germany


Article history: Received 28 April 2014; received in revised form 3 September 2014; accepted 21 October 2014; available online 28 October 2014.
Keywords: Question answering; Biomedicine; Natural language processing; Data integration

Abstract
Biologists often pose queries to search engines and biological databases to obtain answers related to ongoing experiments. This is known to be a time-consuming, and sometimes frustrating, task in which more than one query is posed and many databases are consulted to arrive at possible answers for a single fact. Question answering is an alternative to this process: it allows queries to be posed as questions, integrates various resources of different nature and returns an exact answer to the user. We survey current solutions on question answering for Biology, present an overview of the methods which are usually employed and give insights on how to boost the performance of systems in this domain.
© 2014 Elsevier Inc. All rights reserved.

1. Introduction

When planning or analyzing experiments, scientists look for related and previous findings in the literature to obtain external evidence on current observations. For instance, biologists often seek information regarding genes/proteins (biomarkers) expressed in a particular cell or tissue of a particular organism in the scope of a particular disease. Finding published answers to such questions requires dealing with a variety of synonyms for the genes and diseases and posing queries to different databases and search engines. Further, it often also involves screening hundreds of publications or data records returned for the queries. The task of searching for relevant information in a collection of documents, such as web pages using search engines or scientific publications using PubMed (http://www.ncbi.nlm.nih.gov/pubmed), is generally called information retrieval (IR) [1]. In IR, queries are usually expressed in terms of a few keywords, and answering a query does not usually take into account synonyms, i.e., when a certain concept has more than one name, or homonyms, i.e., when the same name refers to more than one concept. IR systems typically return a list of documents potentially relevant to the query, including related metadata (e.g., journal name and year of publication) and snippets of text matching the query keywords.

In contrast to IR, question answering (QA) [2,3] aims to support finding information in a collection of unstructured and structured data, e.g., texts and databases, respectively. Furthermore, QA systems take questions expressed in natural language (e.g., English) and generate a precise answer by linguistically and semantically processing both the questions and the data sources under consideration. In particular, a question answering system differs from an IR system in three main aspects (cf. Table 1): (1) queries can be posed using natural language instead of keywords; (2) results do not consist of passages but are generated according to what has been specifically requested, be it a single answer or a short summary; (3) answers are based on the integration of data from textual documents as well as from a variety of knowledge sources. The first aspect aims to facilitate usage for non-IR experts, i.e., users do not need to be concerned about how to best pose a query to receive a precise answer. For instance, when asking about the participation of a certain gene in a pathway, e.g., the gene p53 in the WNT-signaling cascade, users would usually write both terms in the search field of a search engine. If the answer is not found in any of the top-ranked documents, the user could consider entering synonyms for both the gene (e.g., ‘‘TP53’’) and the pathway (e.g., ‘‘WNT signaling pathway’’). In order to cope with this problem, some IR systems allow the use of ontological terms instead of keywords for a more precise retrieval of relevant documents. For instance, GoPubMed (http://gopubmed.org) automatically suggests candidate terms from the Medical Subject Headings (MeSH) and the Gene Ontology as keywords are typed. However, understanding ontological concepts is not straightforward for scientists who are not familiar with them.


Table 1
Main differences between QA and IR systems.

Feature    Question answering                            Information retrieval
Query      Question in natural language                  Set of keywords and/or concepts
Results    One or more exact answers                     List of candidate documents
Answer     Based on multiple documents and resources     Based on passages from multiple documents

The use of natural language is a more intuitive way to inquire for information, by posing questions (how, what, when, where, which, who, etc.) or requests (show me, tell me, etc.). For instance, for the example above, users could simply write the question ‘‘Is p53 part of the WNT-signaling cascade?’’. Of course, allowing free-form questions requires very advanced natural language processing (NLP) techniques. The second characteristic of QA systems is to provide precise answers instead of only presenting potentially relevant documents. When using IR systems, figuring out the answer to a query requires reading the documents returned by the system. QA systems strive to simply return the answer ‘‘No’’ to the question above along with a list of references that give evidence for this answer. This requires QA systems to perform a deep linguistic analysis of both the question and the potentially relevant passages, also considering the meaning of terms. Not only must synonyms, hypernyms and hyponyms be considered during answer construction, but disambiguation of entities should also be performed whenever necessary, such as figuring out whether the word ‘‘WNT’’ refers to part of the pathway name or to a mention of a member of the WNT gene family. Third, QA is not limited to textual resources and can include the integration of data resources by converting natural language questions to an appropriate query language for searching for answers in databases, for instance in RDF triples [4]. Data extracted from different sources need to be assembled into a coherent single answer by exploring interlinks, dealing with contradictions and joining equal or equivalent answers. Currently, the conversion of biomedical natural language questions to RDF triples is being evaluated in the BioASQ challenge (cf. Section 3.2.1), and question answering over linked data for three biomedical databases is being assessed in one of the Question Answering over Linked Data (QALD) shared tasks (cf. Section 3.2.4). Further, a prototype of the LODQA system [5] (cf. Section 3.1.2) converts questions to SPARQL queries for submission to the BioPortal endpoint. The technology behind information systems has evolved from simple Boolean keyword-based queries to complex linguistic processing of both the question and textual passages. Fig. 1 shows an overview of the evolution of these techniques and illustrates the complexity of question answering systems. Many of the current information systems available for querying PubMed implement some of these techniques (cf. survey in [6]). Question answering has been successfully applied in other domains, and examples of such systems are START (http://start.csail.mit.edu/) and Wolfram Alpha (http://www.wolframalpha.com/). Recent interest in question answering has also been motivated by IBM's Watson system [7], which beat human participants in a game show. Various researchers advocate that QA systems can provide many benefits to the biological domain and expect that these systems can boost scientific productivity [8]. Indeed, a study carried out with physicians showed that they do trust the answers provided by QA systems [9]. However, the Life Sciences also pose many challenges to QA systems, especially: (1) a highly complex and large terminology, (2) the exponential growth of data and hundreds of on-line databases, and (3) a high degree of contradictions.


Often, answering a question not only requires identifying relevant facts in a single document or database, but also merging parts of the answers from distinct sources. Nevertheless, research on question answering in Biology is still scarce, in contrast to the medical domain (cf. Section 4). The first community-based challenge which included a task related to biomedical question answering took place in 2006 and 2007 and consisted of the evaluation of passage retrieval, restricted to topics related to Genomics (cf. TREC Genomics in Section 3.2.3). Later on, in 2012 and 2013, the Question Answering for Machine Reading Evaluation (QA4MRE) Alzheimer Disease challenge assessed systems on the machine reading task, which consists of multiple-choice questions related to a single document (cf. Section 3.2.2). The use of RDF in biomedical QA tasks is currently being evaluated in the QALD challenge (cf. Section 3.2.4), and the most comprehensive challenge related to biomedical QA so far, BioASQ (cf. Section 3.2.1), has been running since 2013. In this work, we present an overview of question answering systems and techniques for the biological domain. In the next section, we give an overview of the most common components of a question answering system. Section 3 describes current systems and the results obtained in shared tasks. Section 4 discusses the state of the art of QA systems for the medical domain and gives insights on which improvements could be achieved in the biological field in the near future.

2. Fundamental techniques

Question answering systems are usually composed of three steps [10,11] (Fig. 2): question processing, candidate processing and answer processing. The first step receives the input entered by the user, i.e., a natural language question, and includes pre-processing of the question, identification of the question type and of the type of answer required (e.g., the entity type), and building an input for the next step. In the candidate processing step, relevant documents, passages or raw data are retrieved and ranked according to their relevance to the question. Finally, the answer processing step receives the retrieved text passages and data items and builds the final answer by extracting and merging information from different sources. In general, more techniques and reusable software components are available for the first two steps, based on years of research work on NLP and IR. The last step, on the other hand, requires techniques specific to QA, which is a rather recent field in comparison to NLP and IR, and might involve dealing with data of different nature and sources. Researchers usually classify questions into three types: yes/no, factoid/list and definitional/summarization questions. The yes/no questions are the simplest ones, as the only two possible answers are known beforehand: ‘‘yes’’ or ‘‘no’’. Factoid and list questions expect one or a list of short facts in return, such as a named entity (e.g., a gene, a disease) or an amount (e.g., the number of mutations for a certain gene). The main difference between factoid and list questions is that the first expects a single answer while the second allows a list of them. Finally, the summary and definition questions expect a summary or short passage in return. For instance, the expected answer for the questions ‘‘What is a stem cell?’’ or ‘‘How does the mitosis of a cell work?’’ should be a short text summary, e.g., up to 10 sentences.
Although this kind of answer might seem similar to the ones provided by traditional IR systems, in QA systems summaries are automatically constructed specifically for the query at hand and may contain text extracts from different documents or data sources. Further, as opposed to the text snippets returned by IR systems, these summaries can be viewed as a single answer instead of passages derived from hundreds of documents with inconsistent definitions.


Fig. 1. Time-line of the evolution of information retrieval techniques.

Fig. 2. Typical architecture of a question answering system.
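To make the architecture of Fig. 2 concrete, the sketch below wires the three steps into a deliberately naive end-to-end toy in Python; every function, heuristic and document in it is an invented illustration rather than part of any system discussed in this survey, and the following subsections refine each step.

```python
def process_question(question: str) -> dict:
    """Question processing: roughly classify the question and build an IR-style query."""
    qtype = "yesno" if question.lower().split()[0] in {"is", "are", "does", "do"} else "factoid_list"
    keywords = [w.strip("?,.").lower() for w in question.split() if len(w) > 3]
    return {"type": qtype, "keywords": keywords}


def retrieve_candidates(processed: dict, collection: list) -> list:
    """Candidate processing: rank documents by the number of shared keywords."""
    keywords = set(processed["keywords"])
    scored = [(len(keywords & set(doc.lower().split())), doc) for doc in collection]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]


def process_answer(processed: dict, candidates: list) -> str:
    """Answer processing: derive a (trivial) final answer from the candidates."""
    if processed["type"] == "yesno":
        return "yes" if candidates else "no"
    return candidates[0] if candidates else "no answer found"


def answer(question: str, collection: list) -> str:
    processed = process_question(question)
    candidates = retrieve_candidates(processed, collection)
    return process_answer(processed, candidates)


# Toy collection of invented passages
print(answer("Is thrombophilia related to increased risk of miscarriage?",
             ["Thrombophilia increases the risk of miscarriage.",
              "Dabigatran is a direct thrombin inhibitor."]))
```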

In this section we present a detailed description of the methods employed in these steps, along with practical examples based on the dataset available from the BioASQ challenge (cf. Section 3.2.1), which includes more than 300 manually curated questions along with the expected answers and possible text passages relevant for the answer. For each step, we show the output produced by some of the procedures which are described. Three questions have been selected from the BioASQ dataset for this purpose, one for each of the question types (cf. Table 2). We also present in this section an overview of the evaluation metrics that are usually employed for assessing the many components of question answering systems.

Table 2
Examples of questions for each type. Questions were retrieved from the training dataset made available for the BioASQ challenge (cf. Section 3.2).

Question type        Example
Yes/no               ‘‘Is thrombophilia related to increased risk of miscarriage?’’
Factoid/list         ‘‘What is the function of the mammalian gene Irg1?’’
Definition/summary   ‘‘What is the mechanism of action of anticoagulant medication Dabigatran?’’

2.1. Linguistic and semantic pre-processing

Linguistic and semantic pre-processing steps are common procedures which are usually carried out on the question and on the candidate text passages, during the question processing and answer processing steps, respectively. The linguistic processing can be composed of some of the following steps: sentence splitting, tokenization, part-of-speech tagging, chunking and parsing. Sentence splitting is the task of separating a text into its sentences. The tokenization procedure separates each sentence into the smallest units to be considered by the system, i.e., words or tokens. The part-of-speech tagging step assigns a syntactic tag to each word of the sentence, such as whether it is a verb in the present or past tense, a singular noun, or an adjective. This information can be used in a variety of ways, such as filtering out some categories of words (e.g., prepositions or articles), and is also necessary as input for chunking tools. Chunking, also known as shallow parsing, consists of splitting a sentence into syntactic groups, such as noun phrases and verb phrases. Finally, the parsing step performs a syntactic analysis of the sentence and provides information regarding the syntactic associations between the words in a sentence. Another common linguistic technique used in QA systems is semantic role labeling (SRL), which consists of identifying semantic arguments and associated predicates (e.g., verbs) in a sentence, followed by their classification into pre-defined roles, usually based on the FrameNet lexical database [12]. For instance, for the factoid question in Table 2, ‘‘What is the function of the mammalian gene Irg1?’’, ‘‘function’’ could be identified as a predicate for the agent ‘‘gene Irg1’’, while the object (the function of this gene) is missing. An extension of FrameNet with links to biomedical ontologies has been carried out in BioFrameNet [13], and BIOSMILE [14] is a semantic role labeler specific to the biomedical domain. Examples for most of the procedures above were generated using the Stanford parser demo tool (http://nlp.stanford.edu:8080/parser/index.jsp) and are shown in Fig. 3.

Fig. 3. Linguistic pre-processing for one question (A). Tokenization and part-of-speech tags are shown in ‘‘B’’: ‘‘WP’’ corresponds to a Wh-pronoun, ‘‘VBZ’’ to a verb in the 3rd person singular present, ‘‘DT’’ to a determiner, etc. The parse tree is shown in ‘‘C’’ and includes the root of the tree (ROOT), the question phrase (WHNP), two noun phrases (NP), etc. The dependency tree is shown in ‘‘D’’: ‘‘cop’’ refers to the relation between a complement and the copular verb (e.g., to be and to become), ‘‘det’’ is a relation between a determiner (the) and a noun (function), ‘‘nsubj’’ is the nominal subject relation, etc.

The semantic processing usually includes exploring the meaning of the words, identifying entities and finding semantic relationships between entities. The meaning of the words can be individually explored using resources such as WordNet [15], which provides lexical categories of words (e.g., verb or noun) as well as synonyms. Term identification or named-entity recognition (NER) consists of precisely finding mentions of certain entity types, such as gene or disease names [16,17], or terms pertaining to particular ontologies or terminologies, such as MeSH, the Gene Ontology or the Unified Medical Language System (UMLS) [18]. An extension to this task is entity mention normalization (EMN), which consists of mapping the previously extracted mention to a particular entry in an ontology, terminology or database. Both tasks support QA in the identification of the required entity type and by providing synonyms, hypernyms and hyponyms. Examples of NER/EMN tools are GNAT [16] (genes/proteins), ChemSpot [19] (chemical compounds), Linnaeus [20] (species) and Metamap [18] (UMLS concepts). Fig. 4 shows the identification of some named entities for one of the questions in Table 2 using the Becas tool (http://bioinformatics.ua.pt/becas/) [21].

Fig. 4. Identification of semantic entities and their normalization using the Becas tool. Here two chemical compounds (‘‘anticoagulant’’ and ‘‘Dabigatran’’) and one biological process (‘‘mechanism of action’’) have been extracted from the question and mapped to the corresponding identifiers in the CHEBI and NCI terminologies (in blue).
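As a hands-on illustration of the linguistic pre-processing described above (cf. Fig. 3), the sketch below runs sentence splitting, tokenization, part-of-speech tagging, chunking and dependency parsing with the spaCy library; it assumes the general-domain en_core_web_sm model is installed, so the exact tags may differ from the Stanford parser output in Fig. 3 and from biomedical-specific tools.

```python
import spacy

# Load a general-domain English pipeline (assumed to be installed via
# `python -m spacy download en_core_web_sm`); biomedical text would normally
# call for a domain-adapted model.
nlp = spacy.load("en_core_web_sm")

question = "What is the function of the mammalian gene Irg1?"
doc = nlp(question)

# Sentence splitting
for sent in doc.sents:
    print("Sentence:", sent.text)

# Tokenization, part-of-speech tags and dependency relations
for token in doc:
    # token.tag_ is the fine-grained tag (e.g., WP, VBZ, DT as in Fig. 3B);
    # token.dep_ is the dependency label (e.g., nsubj, det as in Fig. 3D).
    print(f"{token.text}\t{token.tag_}\t{token.dep_}\t(head: {token.head.text})")

# Noun chunks approximate the noun phrases produced by shallow parsing (chunking)
print([chunk.text for chunk in doc.noun_chunks])
```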

2.2. Question processing

The question entered by the user is processed by the system to infer what is being requested and to build an appropriate input for the candidate retrieval step. This consists of pre-processing the question (cf. Section 2.1), identifying the question type and the expected type of answer, and constructing IR-style queries. First, the given question is classified into one of the three types usually considered by QA system developers: yes/no, factoid/list or definitional/summary questions (cf. examples in Table 2).


This is usually carried out by checking whether the question contains Wh question particles or auxiliary verbs (e.g., be, do and modal verbs), the latter being indicative of yes/no questions. However, the complexity of natural language poses many challenges to this task, as some questions which at first seem to belong to a certain type can turn out to be of a different one. For instance, the question ‘‘Is Rheumatoid Arthritis more common in men or women?’’ from the BioASQ training dataset is a factoid question which accepts a limited set of answers: ‘‘women’’, ‘‘men’’, or ‘‘neither’’. However, the way it is constructed, starting with the ‘‘to be’’ verb, may lead to it being wrongly classified as a yes/no question. In fact, only a small change in the question, e.g., substituting ‘‘than’’ for ‘‘or’’, would indeed make it a yes/no question. The identification of the answer expected by a question is usually carried out based on the question type and its linguistic and semantic information. A question previously identified as yes/no requires one of these two answers. In contrast, a factoid question expects a particular answer type, such as an entity type or a number. For the factoid question in Table 2, the expected answer is the function of the gene, or more specifically, for instance, one or more particular terms from the Molecular Function subset of the Gene Ontology [22]. The analysis is usually carried out by taking into account the Wh particle, the parsing information, pre-defined patterns and domain-specific manually created rules. For instance, in the biological domain, questions starting with the ‘‘where’’ particle usually expect entity types such as cells, tissues, organs and cell components. Previous work for other domains has developed taxonomies of questions in which all possible expected questions for a certain domain are identified and organized in a hierarchical structure. For instance, the many types of questions common in the clinical domain have been studied [23], and the development of the PICO framework has improved solutions for question answering in the medical domain [24] (cf. Section 4). A preliminary taxonomy of questions for the biological domain has been explored in [25] (cf. Section 3.2.2). The construction of queries depends on the nature of the resources which are going to be accessed. For textual resources, a query composed of keywords derived from the original question is utilized. The single words which compose the question are used as building blocks to create the keywords, but chunks or named entities can alternatively be used as well.


In this step, terms can be expanded based on previously compiled dictionaries of synonyms. Queries can also be composed of particular identifiers from pre-defined ontologies or terminologies. For instance, the semantic search engine GeneView [26] allows querying for the human species using the query ‘‘SPECIES:9606’’ instead of ‘‘human’’, and thus automatically considers the synonyms available for this organism (e.g., ‘‘Homo sapiens’’). Table 3 shows examples of possible queries derived from each of the questions in Table 2. The treatment of non-textual resources is diverse. A database which contains data in RDF format (e.g., BioPortal [27]) will expect a SPARQL (http://www.w3.org/TR/rdf-sparql-query/) query. Question answering over RDF has been explored for other domains [28] and also evaluated during community-based efforts, such as the QALD challenge (cf. Section 3.2.4). Translating a question to SPARQL makes use of both the linguistic and the semantic pre-processing information (cf. Section 2.1) and requires the elements of the query to be first mapped to concepts of the database. For instance, the gene ‘‘Irg1’’ and the word ‘‘function’’ in one of the questions in Table 2 could be mapped to the terms ‘‘C561544’’ (IRG1 protein, human) and ‘‘Q000502’’ (physiology), respectively, in the MeSH terminology. When querying databases in other formats, systems need to use the particular query language necessary for accessing the data. For instance, when querying a database stored in MySQL, questions must be converted to SQL using approaches similar to those for SPARQL.
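The following sketch illustrates the question-processing step with a rule-based type classifier and a Boolean keyword query builder in the style of the queries shown in Table 3; the word lists and the synonym dictionary are small invented examples, not the resources used by any of the systems discussed here.

```python
import re

WH_WORDS = {"what", "which", "who", "where", "when", "how", "why", "list", "name"}
AUX_VERBS = {"is", "are", "does", "do", "can", "has", "have", "was", "were"}
SUMMARY_CUES = ("what is the mechanism", "what is the role", "describe", "how does")

# Illustrative synonym dictionary; a real system would derive this from
# resources such as MeSH, the Gene Ontology or UMLS.
SYNONYMS = {
    "function": ["functional"],
    "mammalian": ["mammal"],
    "gene": ["genetic"],
    "irg1": ["IRG1"],
}


def classify(question: str) -> str:
    """Very rough question-type detection based on the leading tokens."""
    q = question.lower().strip()
    first = q.split()[0]
    if any(q.startswith(cue) for cue in SUMMARY_CUES):
        return "summary"
    if first in AUX_VERBS:
        return "yesno"          # caveat: fails on e.g. "Is X more common in men or women?"
    if first in WH_WORDS:
        return "factoid_list"
    return "unknown"


def build_query(question: str) -> str:
    """Build a Boolean keyword query with synonym expansion (cf. Table 3)."""
    stopwords = {"what", "is", "the", "of", "a", "an", "to", "in", "it"}
    terms = [t for t in re.findall(r"[A-Za-z0-9]+", question.lower()) if t not in stopwords]
    groups = []
    for term in terms:
        variants = [term] + SYNONYMS.get(term, [])
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)


question = "What is the function of the mammalian gene Irg1?"
print(classify(question))      # factoid_list
print(build_query(question))   # (function OR functional) AND (mammalian OR mammal) AND ...
```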

2.3. Candidate retrieval

In this step, candidates are retrieved based on the query constructed in the previous step. The output of this step is a list of relevant documents along with confidence scores, when accessing textual resources, or a list of candidate answers, when querying database resources. Retrieval of candidate answers is usually carried out based on typical information retrieval or database querying techniques. For this purpose, a QA system relies on an existing IR system, such as Google or PubMed, or on a background collection of documents, which is usually indexed and then ranked based on the degree of relevance of each candidate to the query. The indexing process consists of extracting all words or terms from the documents and building a single structure (an index) from which they can be quickly retrieved through their words or terms. There is a variety of tools which can be used for this purpose, such as Lucene/Solr and Indri. During candidate retrieval, a score is calculated based on information retrieval techniques [1]. For instance, the TF-IDF metric calculates scores based both on the number of terms shared by the query and the candidate document and on the total number of documents in which a certain term appears, and thus measures the relevance of a document for a certain query with respect to the others in the collection. Often, more than one query is run per question using distinct approaches to improve the recall of the system, e.g., by considering synonyms for the keywords or changing the ‘‘AND’’ connector to ‘‘OR’’ to get more matches. Depending on the background collection which is being queried, the queries can also restrict matches to documents or passages which contain particular entity types or relationships (called type coercion [29]). Fig. 5 shows an example of a candidate document retrieved for one of the questions in Table 2 using the GeneView search engine.
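As an illustration of the scoring described above, the sketch below ranks a toy in-memory collection with scikit-learn's TF-IDF vectorizer and cosine similarity, standing in for a full index built with Lucene/Solr or Indri; the documents are invented and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy background collection (invented abstracts); a real system would index
# millions of PubMed abstracts with Lucene/Solr or Indri instead.
documents = [
    "Irg1 is highly expressed in mammalian macrophages during inflammation.",
    "To investigate the function of Irg1 during implantation, expression was profiled.",
    "Dabigatran is an anticoagulant that acts as a direct thrombin inhibitor.",
    "Thrombophilia has been associated with an increased risk of miscarriage.",
]

query = "function of the mammalian gene Irg1"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)      # TF-IDF matrix of the collection
query_vector = vectorizer.transform([query])           # same vocabulary as the collection

scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Rank candidate documents by decreasing relevance score
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {documents[idx]}")
```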

2.4. Answer processing

The answer processing step is the most challenging component of a QA system, as this is when the precise answer has to be extracted from the candidates. The output of this component should be a simple answer to the question. This might require merging information from distinct sources (e.g., texts and RDF data), generating summaries and dealing with missing data, contradictions, uncertainty, etc. When dealing with text passages, linguistic and semantic pre-processing (cf. Section 2.1 above) is first performed. If the expected answer type is known, approaches might prioritize passages which contain annotations of the expected types [29], in the case of factoid questions.


Table 3
Examples of queries for each of the questions presented in Table 2. Queries include synonyms for the terms (in parentheses and separated by ‘‘OR’’) and require matching of all terms (thus the ‘‘AND’’ connector).

Question: ‘‘Is thrombophilia related to increased risk of miscarriage?’’
Query: (thrombophilia OR hypercoagulability) AND (relate OR association) AND (increase OR heighten OR augment OR raise) AND (risk OR potential) AND (miscarriage OR abortion)

Question: ‘‘What is the function of the mammalian gene Irg1?’’
Query: (function OR functional) AND (mammalian OR mammal) AND (gene OR genetic) AND (Irg1 OR IRG1)

Question: ‘‘What is the mechanism of action of anticoagulant medication Dabigatran?’’
Query: Mechanism AND action AND (anticoagulant OR thinner OR anti-coagulants OR anti-coagulant) AND Dabigatran

Fig. 5. Abstract of a document (PMID 14500577) which has been retrieved by GeneView for the question ‘‘What is the function of the mammalian gene Irg1?’’ when using the query ‘‘(function OR functional) OR (mammalian OR mammal) OR (gene OR genetic) AND (Irg1 OR IRG1)’’. Two passages are relevant here for the identification of the answer: the one starting with ‘‘To investigate the function of Irg1 during implantation, . . .’’, which demonstrates that the function of the gene was the focus of the study, and the last sentence of the abstract, where the results of the experiments are stated.

Other common methods include calculating a similarity measure between the words and the parse tree of the question and those of the passages, and prioritizing candidate answers which appear in more than one passage [25]. Deciding the answer for a yes/no question usually depends on the number of candidate passages which have been retrieved for each of the two options. Passages which contain few of the keywords might be an indication that the answer ‘‘No’’ should be returned. Dealing with contradictions may include voting or weighting answers by the trust put on the sources, similar to the impact factor of a publishing journal. Finally, the answer processing step can optionally include a component for validating the chosen answer(s), for instance by querying search engines (e.g., Google or PubMed) and further analyzing the number and/or the text of the passages which are returned.
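The aggregation heuristics sketched above can be illustrated as follows; the voting scheme, the keyword-overlap thresholds and the helper names are assumptions for illustration and not the method of any particular system.

```python
from collections import Counter


def rank_factoid_answers(candidates_per_passage):
    """Score each candidate answer by the number of passages it appears in.

    `candidates_per_passage` holds, for every retrieved passage, the entity
    mentions of the expected answer type found in it.
    """
    votes = Counter()
    for mentions in candidates_per_passage:
        for answer in set(mentions):        # count each passage at most once
            votes[answer] += 1
    return votes.most_common()


def decide_yes_no(passages, keywords, min_passages=3, min_overlap=0.5):
    """Answer 'yes' only if enough passages share enough of the question keywords."""
    keywords = {k.lower() for k in keywords}
    supporting = 0
    for passage in passages:
        tokens = set(passage.lower().split())
        overlap = len(keywords & tokens) / max(len(keywords), 1)
        if overlap >= min_overlap:
            supporting += 1
    return "yes" if supporting >= min_passages else "no"


# Toy usage with invented passages and candidate mentions
passages = [
    "Thrombophilia is associated with an increased risk of miscarriage.",
    "Women with thrombophilia showed a higher miscarriage risk in this cohort.",
    "Anticoagulant treatment reduced miscarriage rates in thrombophilia patients.",
]
print(decide_yes_no(passages, ["thrombophilia", "risk", "miscarriage"]))
print(rank_factoid_answers([["Irg1", "Acod1"], ["Acod1"], ["Irg1", "Acod1"]]))
```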


2.5. Evaluation


Evaluation of QA systems is usually performed based on a manually created set of questions and answers, or by manually posing questions directly to the Web interface and validating the returned answer. The metrics used for the evaluation depend on the type of question. Yes/no questions can easily be assessed by comparing the expected answer to the returned one. Factoid and list questions can be evaluated based on the exact answers, i.e., allowing no additional wrong text, or based on the overlap between the given and the expected answer. In the first case, measures such as precision and recall are used, which assess the fraction of returned answers that are correct and the fraction of correct answers that the system was able to return, respectively. The Mean Average Precision (MAP) is the most common metric for evaluating ranked answers and consists of calculating the arithmetic average of the precision for each of the relevant items in a ranked list, i.e., it takes into account the position of the item in the list. An overview of these metrics can be found in [30]. Definitional and summarization questions are usually evaluated by manual review of the answer by experts or through automatic computation of the similarity between the expected and the returned summaries using the ROUGE [31] metric, which consists of calculating the proportion of words (or groups of 2, 3 or 4 words) shared by these two summaries.
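These metrics are straightforward to compute; the sketch below implements precision, recall, average precision (averaged over questions to obtain MAP) and a ROUGE-N style bigram overlap on toy data, following the usual textbook definitions rather than any official evaluation script.

```python
def precision_recall(returned, expected):
    returned, expected = set(returned), set(expected)
    tp = len(returned & expected)
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(expected) if expected else 0.0
    return precision, recall


def average_precision(ranked, expected):
    """Average of the precision values at each rank where a correct item occurs."""
    expected = set(expected)
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in expected:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(expected) if expected else 0.0


def mean_average_precision(runs):
    """MAP over several questions; `runs` is a list of (ranked, expected) pairs."""
    return sum(average_precision(r, e) for r, e in runs) / len(runs)


def rouge_n(candidate, reference, n=2):
    """Fraction of reference n-grams that also occur in the candidate summary."""
    def ngrams(text):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ref = ngrams(reference)
    cand = set(ngrams(candidate))
    return sum(1 for g in ref if g in cand) / len(ref) if ref else 0.0


# Toy examples: exact answers, a ranked answer list and a generated vs. expected summary
print(precision_recall(returned=["TP53", "IRG1"], expected=["IRG1"]))
print(average_precision(["TP53", "BRCA1", "IRG1"], expected={"TP53", "IRG1"}))
print(mean_average_precision([(["TP53", "BRCA1", "IRG1"], {"TP53", "IRG1"})]))
print(rouge_n("dabigatran is a direct thrombin inhibitor",
              "dabigatran acts as a direct thrombin inhibitor"))
```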

3. State-of-the-art

Question answering for Biology is a very recent topic and recent development focuses on the improvement of the underlying algorithms and on the evaluation and comparison of systems. Currently, only one system (cf. Section 3.1.1) and one prototype (cf. Section 3.1.2) are actually functioning. Work has been specifically boosted by two community-based shared tasks (cf. Sections 3.2.1 and 3.2.2). In this section we present a brief summary of the available systems, the recent shared task initiatives and other interesting research works on question answering for the biological domain. Table 4 presents a summary of the methods used by each tool for the many components of the three question answering steps.


Table 4
Summary of the methods used in each QA step (cf. Section 2) for the state-of-the-art systems. Abbreviations: NLP: natural language processing; NER: named-entity recognition; QuesClass: question classification; QuerExpan: query expansion; DocRetr: document retrieval; DatRetr: data retrieval; AnswExtr: answer extraction; AnswVal: answer validation. The dash symbol ‘‘–’’ indicates that the component was not necessary for the task for which the system was developed or applied. Columns list the systems (EAGLi, LODQA, Wishart, MCTeam, Weissenborn et al., Attardi et al., Gleize et al., Lin et al.); rows list the components of the three steps (question processing: NLP, NER, QuesClass, QuerExpan; candidate retrieval: DocRetr, DatRetr; answer processing: AnswExtr, AnswVal). Cells mark which components each system implements and the external tools or resources it relies on (e.g., OntoFinder and BioPortal for LODQA, PolySearch for Wishart, Metamap for MCTeam and Weissenborn et al., GoPubMed for Weissenborn et al., PubMed for EAGLi, and the Genia tools for Lin et al.).

3.1. On-line systems

3.1.1. EAGLi
As far as we know, EAGLi (http://eagl.unige.ch/EAGLi) is the only available question answering system not focused on the medical domain. It allows queries to be posed in natural language; relevant document passages are returned with the keywords highlighted in the text, and answers are ordered according to confidence scores (top, medium, low). The current architecture of the system is not fully documented, but the publications which describe the participation of the system in previous challenges (TREC, QA4MRE) report that it seems to rely only on PubMed abstracts and includes extraction methods for a variety of semantic concepts, such as Gene Ontology terms, MeSH terms and proteins (SwissProt) [32], as well as protein–protein interactions [33]. Question processing starts with the syntactic parsing of the question, followed by patterns which identify the subject of the question [34]. Query expansion is carried out by adding synonyms to the original terms, although the authors do not provide much detail on this step, and stemming and stopword removal are used to reduce the size of the query. Relevant text passages are retrieved by posing queries to PubMed as well as to a local copy of it using the EasyIR engine [34]. For the latter, a weighting scheme based on the experiments carried out in [35] is used, although it is not clearly specified by the authors. A version of the system was evaluated on a corpus of 200 questions derived from data found in UniProt and DrugBank, and answers were based on MeSH concepts. The precision for the answers ranged from 0.55 to 0.67 and the recall from 0.68 to 0.78 when considering the top ten answers. The passage retrieval component of EAGLi has been evaluated in both editions of the TREC Genomics task (cf. Section 3.2.3 below). For the 2007 edition [32], the team explored semantic features based on a variety of terminologies, such as GPSDB, the Gene Ontology, SwissProt, MeSH and ICD. The system assigns semantic resources to each pre-defined answer type (cf. Section 3.2.3). For instance, MeSH was assigned to questions which expected diseases or drugs as answers. EAGLi has also participated in the 2013 edition of the QA4MRE challenge (cf. Section 3.2.2). Questions were converted to queries and documents were returned based on Boolean search methods, i.e., by checking whether or not a document contains the terms in the query [33]. The chosen answer was the one whose query returned the highest number of documents. The top scoring approach obtained correct answers for 5 questions out of a collection of 40.

3.1.2. LODQA
The open source project LODQA (http://lodqa.dbcls.jp/, Linked Open Data Question Answering) is a very interesting recent work. It builds the first natural language interface for querying open linked data and converts questions posed in natural language to queries in the SPARQL language.


It performs linguistic analysis, utilizes the OntoFinder tool to match noun phrases in the query to terms in the BioPortal ontology repository, and then poses SPARQL queries to the BioPortal endpoint. The tool is still under development and only a preliminary evaluation has been carried out so far, using only the SNOMED-CT ontology [36]. The Web tool allows posing questions to the system, but it is still very unstable and limited to some particular ontologies.

3.2. Challenges and shared tasks

3.2.1. BioASQ
The BioASQ challenge (http://bioasq.org) is part of an EU-funded project which aims to boost the state-of-the-art performance of biomedical question answering. It took place for the first time in 2013 and a second edition has run this year (2014). The challenges include factoid, list, yes/no and summary-based questions and are composed of two phases:
(A) participants are given a set of questions and are asked to return related concepts from pre-defined resources (GO, MeSH, DO, SwissProt and Jochem), relevant documents and text passages from Medline or PubMed Central full-text documents, and RDF triples from selected resources;
(B) given the gold-standard input returned from phase A (concepts, passages and triples), participants need to return an exact (short) answer and the so-called ideal answer (a short summary which complements the exact answer).
A training dataset composed of more than 300 questions is available, including all the results required for both phases A and B. The organizers have made available web services for querying relevant PubMed abstracts and PubMed Central full-text documents to support document retrieval. Evaluation was carried out on batches of 100 questions and participants could submit a ranked list of up to 100 items for each of the required outputs (i.e., document identifiers, text passages, ontological concepts, RDF triples, exact answers, etc.). The organizers compared the performance of systems based on precision, recall and MAP (cf. Section 2.5). In the 2013 edition, participants were evaluated on three batches of test datasets, each one composed of roughly 100 questions. Three teams participated in this task and we give a short description of them below. The 2014 edition of the challenge [37] consisted of five batches of 100 questions and had seven participating teams. The organizers report an improvement in the results in comparison to the previous year and, although descriptions of the participating teams are yet to be published, the results are already publicly available (http://bioasq.lip6.fr/results/2b/phaseA/; http://bioasq.lip6.fr/results/2b/phaseB/). The results have varied over the many datasets (batches), but the top performing MAP scores for phase A were 0.31, 0.068, 0.67 and 0.085 for document retrieval, passage retrieval, concept mapping and RDF triples retrieval, respectively.


As for phase B, the top performing results for exact answers reached up to 1.0, 0.44 and 0.44 accuracy for yes/no, factoid and list questions, respectively, while ideal answers (short summaries) reached a Rouge-2 value of up to 0.20 (Rouge-2 evaluates bigrams, i.e., groups of two words, shared by the gold-standard and the generated summaries; cf. Section 2.5).
The Wishart system [38] used a semantic-based approach in which noun phrases extracted from the questions were mapped to terminologies of different entity types (e.g., diseases, genes, drugs) and then expanded using the PolySearch tool [39]. The terms from the question were posed as queries to this tool, and a list of associated biomedical entities was retrieved and used to rank relevant sentences, giving priority to those that contained entities which matched the ones in this list. The same approach is used for ranking the relevant sentences provided in phase B, and a summary is built for the ideal answer based on the top-ranking sentences. Their approach ranked third in the challenge. The MCTeam team relied on the MetaMap tool [18] for extracting UMLS concepts from the question and for building the query to be posed to the BioASQ services. Full texts were retrieved from the latter and indexed using the Indri tool, from which the sentences were ranked. They obtained a minimum MAP of 3.875. The system developed by [29] makes use of the UMLS metathesaurus and Metamap to extract biomedical concepts and combines several scoring schemes to assess the answers. It considers factoid and list questions and discerns three types: what/which questions, where-questions and decision questions. Questions are converted to queries and then posed to the GoPubMed search engine, which returns candidate passages from the documents. The authors show an evaluation of the system only for the factoid questions (12 of them) and report an accuracy of almost 55%.

3.2.2. QA4MRE Alzheimer Disease
The QA4MRE (Question Answering for Machine Reading Evaluation) Alzheimer Disease task took place in two editions (2012 [40] and 2013 [41]) and aimed to foster the development of solutions for machine reading. A set of full-text documents was given along with ten questions, each of which had five possible answers. This constitutes a particular type of factoid question that is less complex than typical factoid questions, as systems can use the possible answers to query for candidates and do not need to create the final answer but just choose the most probable one. Indeed, most of the systems approached the task by constructing five queries per question, one for each of the possible answers. A background collection composed of thousands of abstracts and full texts previously annotated with linguistic and semantic annotations, such as genes, proteins, mutations and anatomical parts, was made available to the participants. The answers to the questions are always present in the respective document, although sometimes the use of the background collection and of external resources, such as ontologies or other terminologies, is necessary to infer the correct answer. Questions could be of three degrees of difficulty (simple, medium or complex), depending on whether the answer could be found verbatim in the text (simple), required the use of lexical or semantic dictionaries (medium) or even reasoning (complex). The approach of the top-performing team in the 2012 challenge [42] included the expansion of the words in the documents with semantic knowledge, such as synonyms, hypernyms and relationships. Entity extraction was carried out with a tool trained on the Genia corpus, and relationships were extracted based on the parse trees of the sentences. The second best performing team [43] used the Indri IR tool for indexing the background collection (cf. Section 2.3) and considered two settings: (1) using only the documents to which the questions referred, or (2) also using the background collection of documents; the latter resulted in their best performance.


Another group [44] explored the hierarchical structure of the UMLS terminology, along with a comparison between the dependency trees (cf. ‘‘D’’ in Fig. 3) of the question and of the relevant passages. The approach used by the best performing team [25] in 2013 included the creation of a question typology to assign to every question one of the following answer types: cause/reason, method/manner, opinion, definition and thematic. The question analysis step also made use of rules related to the syntactic structure of the question, tools for analyzing its parse tree and lists of manually compiled words for each of the answer types.

3.2.3. TREC Genomics
The TREC Genomics challenge took place in 2006 and 2007 and aimed to evaluate passage retrieval as part of question answering systems, with a focus on particular entity types, such as genes, proteins, diseases, mutations and pathways. Participants were provided with a collection of 160,000 full-text documents from about 49 journals related to genomics, and the evaluation was carried out at three levels: the documents, the passages and the aspects (concepts) they contain. Evaluation was carried out manually by judges who were presented with the top scoring 1000 passages for each topic. The challenge allowed three types of submission: manual, automatic and interactive, the latter allowing some feedback from the user during the automatic processing. In the 2007 edition, 27 groups submitted results for this challenge. The best performing team was the one from the National Library of Medicine [45], and their system made use of the SemRep tool for deriving semantic relations among UMLS concepts previously extracted by the Metamap tool. SemRep was applied to the 1000 top documents returned by an ensemble of four search engines. This team obtained top MAP scores of 0.11, 0.26 and 0.33 for the passage, aspect and document evaluation, respectively. The second best scoring team [46] utilized an approach which includes the generation of orthographic variants and the combination of various IR models. Their system achieved performances of 0.098, 0.22 and 0.29 for passage, aspect and document retrieval, respectively.

3.2.4. QALD
The recent fourth edition of the QALD challenge (http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task2&q=4) includes a task focused on the biomedical domain which consists of the evaluation of question answering over three databases: SIDER, Diseasome and DrugBank. The preliminary report on the challenge [37] describes the participation of teams in this task and the difficulty of matching the concepts from the vocabularies to the question text.

3.3. Other research work
In [47], the authors describe a QA system for computing answers to questions related to protein–protein interactions. They used a pipeline composed of named-entity recognition, semantic role labeling, question classification and query expansion. They considered four types of named entities (protein, cell, DNA and RNA) and used the Genia Tagger [48] for this purpose. Query expansion was carried out using linguistic resources, such as WordNet and the Longman dictionary of the English language. Potential answers were ranked using a linear model over features related to entity types, semantic role labeling information, etc. The system was evaluated on a dataset derived from the TREC Genomics track (cf. Section 3.2.3) and promising results were reported. Some tools originally developed for other domains have also been successfully adapted to the biomedical domain.


The general-purpose summarizer system Squash has been adapted into the QA-oriented multi-document summarizer BioSquash [49] and applied to questions which require a summary as an answer. Its architecture includes the GENIA ontology [48] for identifying biomedical entities, as well as the UMLS terminology and WordNet to identify concepts and relations between them. The system was evaluated on 18 questions retrieved from the TREC Genomics track (cf. Section 3.2.3). Finally, the ExtrAns question answering system was adapted to the biomedical domain in [50]; however, no evaluation of the tool was reported.

4. Discussion

We presented an overview of the architecture of question answering systems, of the common methods used for the individual steps during answer processing and of the state of the art within the biological domain. Mature solutions are still scarce, but improvements in this field are promising according to various past and present challenges. In this section we discuss aspects which we find necessary to improve performance, as well as resources which are still underused. For comparison, we also give an overview of QA for the clinical domain, a field which is close to Biology and from which it is possible to derive new insights.

4.1. Boosting question answering for Biology

Every QA system depends on a comprehensive collection of documents from which answers are retrieved. Building and indexing such a collection is time consuming and demands large computational resources. Further, well performing tools for QA have shown that working with a semantically enriched collection of documents might improve results, but this means integrating and running a variety of tools for named-entity and relationship extraction on the collection, which is quite a tedious task [51]. This poses some obstacles for research groups working in this domain, as opposed to question answering based on Web pages, where commercial search engines (e.g., Google) can be leveraged. PubMed cannot easily be used for the same purpose, as it is limited to abstracts and the results are ordered chronologically instead of by relevance to the query. Although several advanced information systems have been developed for querying PubMed (cf. survey in [6]), practically all of them focus on the end-user and are not intended to be accessed automatically through web services. A web service based on the GoPubMed tool [52] has been made available for the BioASQ challenge (cf. Section 3.2.1), but only to be used for the duration of the event. The research community cannot rely on a single resource, and more similar services are necessary to support the development of QA systems, especially ones which already contain semantic annotations, such as GeneView [26]. Finally, building a comprehensive collection of biological documents is still a difficult task, and text mining is restricted to the PubMed abstracts and to the small PubMed Central Open Access collection of almost 860,000 full texts (as of August 2014), a small subset of the 24 million citations in PubMed. Although agreements on using PDF files have been obtained with some publishers [53], this is a time-consuming task which requires contacting each publisher separately and obtaining permission just for a restricted use of the publications. Current solutions for biological question answering make little use of domain knowledge, especially with respect to biological ontologies. For instance, early research works focused only on the Genia corpus [50,48], while recent initiatives, such as the BioASQ challenge (cf. Section 3.2.1), have assessed concepts related to only five ontologies and databases (Gene Ontology, MeSH, DO, Jochem and SwissProt). Also the methods behind EAGLi are restricted to only a few ontologies (cf. Section 3.1.1). However, the biological domain comprises many more concepts which are not included in any of these resources.

Though the consideration of a few terminologies might be adequate for specific use cases, they might not be comprehensive enough to make the system useful for the wider biomedical community. Existing solutions for biomedical question answering have so far either focused on extracting answers from textual documents or from open linked data resources, but not on merging answers coming from both of them. This is certainly a challenging issue, as it consists of dealing with data of different nature. However, the integration of such various data can indeed be possible when they are represented using common ontologies and terminologies, and it is already being carried out in other domains [54]. Regarding possible improvements of the methods, the best performing teams in the TREC Genomics challenge (cf. Section 3.2.3) have shown that the use of more than one search engine may boost the performance of the passage retrieval step. Using an ensemble of methods is indeed a common practice in other biomedical natural language processing tasks, such as extracting drug–drug interactions [55]. As discussed above, such an approach could be carried out if more Web services were available for this purpose, such as the one currently available for the BioASQ challenge. However, building multiple indexed instances of the biomedical document collections by a single research group requires huge computational resources. Finally, question answering in Biology will not move forward without a clear understanding of the users' needs and the involvement of these users. Community-based initiatives need to be created to connect developers of QA systems and biologists, such as what is currently being carried out for biocuration [56]. In such a scenario, system developers would be able to learn what users expect from QA systems and which resources and terminologies should be prioritized, and to receive feedback through hands-on experiments with the current solutions. However, this also involves the participation of biologists in creating domain-specific resources, such as question taxonomies and gold-standard corpora.

4.2. Differences between QA in Biology and Medicine

Question answering for the medical domain has received much more attention from the research community than in the biological area. The reasons can be various. The medical domain is more restricted and focuses on only one species, the human being, while Biology studies all living species, whether mammals, insects or plants. Further, consolidated manually curated terminologies, such as the UMLS, provide a comprehensive description of many medical phenomena, while terminologies and ontologies for the biological field are still developing and scattered. Clinical question answering is of importance for both patients and medical professionals (physicians, nurses, etc.), as well as of more commercial interest, e.g., for hospitals and other health institutions. Finally, even automatically derived sets of questions are easier to obtain in the medical domain, as there are a handful of on-line portals (e.g., WebMD, http://www.webmd.com/) where users have been posing their questions. On the other hand, Medicine also poses challenges to QA systems, especially regarding the use of trustworthy sources [57], as most of the content available on the Web is not reliable. Good surveys on QA in the medical domain can be found in [57,11]. The good performance of QA solutions for Medicine also lies in resources which have been developed specifically for it, e.g., the use of the PICO (Problem, Patient/Population, Intervention, Comparison, and Outcome) framework recommended by Evidence-Based Medicine (EBM) [58]. Within the scope of this framework, taxonomies of questions have been developed [59,23], including the classification of questions according to predefined classes, such as clinical or non-clinical, evidence or no-evidence.


A taxonomy of questions for a particular domain can indeed support QA systems by restricting which kinds of questions are to be processed and which types of answers are expected to be returned. We are aware of two QA systems available for the medical domain: askHERMES (http://www.askhermes.org/) [60], also called MedQA, and HONQA (http://www.hon.ch/QA/) [61]. Their functional aspects have been reviewed in [3], which also included EAGLi. HONQA was developed by the Health On the Net Foundation (HON) and supports QA for the medical domain in three languages: English, French and Italian. Answers are extracted from pages certified by the HON foundation but also from Medline. askHERMES performs searches in a variety of biomedical collections: Medline abstracts, PubMed Central full texts, eMedicine documents, clinical guidelines and Wikipedia pages.

5. Conclusions

In this survey we presented an overview of question answering for Biology. We discussed the desired functionalities that such a system should provide to users, as opposed to classical information retrieval systems, e.g., PubMed. We provided an overview of the methods behind previous solutions in this field, including practical examples for a better understanding of the processes. Finally, we discussed the current state of the art of question answering systems for the biological domain, including available web-based systems, evaluation datasets, and past and current community-based initiatives to boost research in this area. Question answering for biomedicine is still in its infancy, but a couple of systems (EAGLi and LODQA) are already becoming available, and we believe that the current community-based challenges (BioASQ and QALD) will boost the performance of current systems and allow the development of new solutions. Certainly, high-quality and comprehensive benchmarks are crucial resources for both training and evaluating systems. Biomedicine is an important field and it is already attracting the attention of commercial QA applications, such as the adaptation of the IBM Watson system to this domain [62–64]. Further, there is a need for question answering systems given the deluge of data and textual documents in the field, the demand for answers in a short time, and the need to integrate data from different sources and formats. Although many initiatives are available for querying biomedical databases and ontologies using natural language (e.g., [5,65]), we are not aware of initiatives for merging answers derived from different sources, a feature which is already available in other domains (e.g., [54]), and we believe it is just a matter of time for it to become a reality also in biomedicine.

References

[1] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2008. [2] L. Hirschman, R. Gaizauskas, Nat. Lang. Eng. 7 (2001) 275–300. [3] M. Bauer, D. Berleant, Human Genomics 6 (2012) 1–4. [4] V. Lopez et al., Scaling up question-answering to linked data, in: P. Cimiano, H. Pinto (Eds.), Knowledge Engineering and Management by the Masses, Lecture Notes in Computer Science, 6317, Springer, Berlin, Heidelberg, 2010, pp. 193–210. [5] K.B. Cohen, J.-D. Kim, Evaluation of SPARQL query generation from natural language questions, in: Natural Language Processing and Linked Open Data, 2013, pp. 3–7. [6] Z. Lu, Database 2011 (2011). [7] D.A. Ferrucci et al., AI Magazine 31 (2010) 59–79. [8] J.D. Wren, Bioinformatics 27 (2011) 2025–2026. [9] H. Yu et al., J. Biomed. Inform. 40 (2007) 236–251.



[10] D. Jurafsky, J.H. Martin, Speech and Language Processing, second ed., Prentice Hall International, 2013.
[11] S.J. Athenikos, H. Han, Comput. Methods Programs Biomed. 99 (2010) 1–24.
[12] C.F. Baker, C.J. Fillmore, J.B. Lowe, The Berkeley FrameNet project, in: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 1, ACL '98, Association for Computational Linguistics, Stroudsburg, PA, USA, 1998, pp. 86–90.
[13] A. Dolbey, M. Ellsworth, J. Scheffczyk, BioFrameNet: a domain-specific FrameNet extension with links to biomedical ontologies, in: Proceedings of KR-MED, 2006.
[14] R. Tsai et al., BMC Bioinform. 8 (2007) 325.
[15] G.A. Miller, Commun. ACM 38 (1995) 39–41.
[16] J. Hakenberg, C. Plake, R. Leaman, M. Schroeder, G. Gonzalez, Bioinformatics 24 (2008) i126–i132.
[17] R. Leaman, R. Islamaj Doğan, Z. Lu, Bioinformatics 29 (2013) 2909–2917.
[18] A.R. Aronson, F.-M. Lang, J. Am. Med. Inform. Assoc. 17 (2010) 229–236.
[19] T. Rocktäschel, M. Weidlich, U. Leser, Bioinformatics (2012).
[20] M. Gerner, G. Nenadic, C. Bergman, BMC Bioinform. 11 (2010) 85.
[21] T. Nunes, D. Campos, S. Matos, J.L. Oliveira, Bioinformatics 29 (2013) 1915–1916.
[22] T.G.O. Consortium, Nucleic Acids Res. 41 (2013) D530–D535.
[23] J.W. Ely et al., Br. Med. J. 321 (2000) 429–432.
[24] C. Schardt, M. Adams, T. Owens, S. Keitz, P. Fontelo, BMC Med. Inform. Decis. Mak. 7 (2007) 16.
[25] M. Gleize et al., Selecting answers with structured lexical expansion and discourse relations: LIMSI's participation at QA4MRE 2013, in: Fourth International Conference of the CLEF Initiative, CLEF, 2013.
[26] P. Thomas, J. Starlinger, A. Vowinkel, S. Arzt, U. Leser, Nucleic Acids Res. 40 (2012) W585–W591.
[27] N.F. Noy et al., Nucleic Acids Res. 37 (2009) W170–W173.
[28] R. Huang, L. Zou, Natural language question answering over RDF data, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, 2013, pp. 1289–1290.
[29] D. Weissenborn, G. Tsatsaronis, M. Schroeder, Answering factoid questions in the biomedical domain, in: A.-C.N. Ngomo, G. Paliouras (Eds.), Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering, Conference and Labs of the Evaluation Forum 2013 (CLEF 2013), 2013.
[30] M.-D. Olvera-Lobo, J. Gutiérrez-Artacho, J. Inform. Sci. (2011).
[31] C.-Y. Lin, ROUGE: a package for automatic evaluation of summaries, in: Proc. ACL Workshop on Text Summarization Branches Out, 2004, p. 10.
[32] J. Gobeill, I. Tbahriti, F. Ehrler, P. Ruch, Vocabulary-driven passage retrieval for question-answering in genomics, in: Proceedings of the 16th Text REtrieval Conference (TREC), Maryland, USA, 2007.
[33] D. Vishnyakova, F. Gobeill, P. Ruch, Using a question-answering approach in machine reading task of biomedical texts about the Alzheimer disease, in: Fourth International Conference of the CLEF Initiative, CLEF, 2013.
[34] J. Gobeill et al., Question answering for biology and medicine, in: 9th International Conference on Information Technology and Applications in Biomedicine (ITAB 2009), 2009, pp. 1–5.
[35] P. Ruch, Bioinformatics 22 (2006) 658–664.
[36] J.-D. Kim, K.B. Cohen, Natural language query processing for SPARQL generation: a prototype system for SNOMED-CT, in: Proceedings of BioLINK SIG 2013, ISMB, USA, 2013.
[37] A. Peñas, C. Unger, A.-C. Ngomo, Overview of CLEF question answering track 2014, in: E. Kanoulas (Ed.), Information Access Evaluation. Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 8685, Springer International Publishing, 2014, pp. 3000–3060.
[38] Y. Liu, The University of Alberta participation in the BioASQ challenge: the Wishart system, in: Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering, Conference and Labs of the Evaluation Forum 2013 (CLEF 2013), 2013.
[39] D. Cheng et al., Nucleic Acids Res. 36 (2008) W399–W405.
[40] R. Morante, M. Krallinger, A. Valencia, W. Daelemans, Machine reading of biomedical texts about Alzheimer's disease, in: CLEF (Online Working Notes/Labs/Workshop), 2012.
[41] R. Morante, M. Krallinger, A. Valencia, W. Daelemans, Machine reading of biomedical texts about Alzheimer's disease, in: CLEF (Online Working Notes/Labs/Workshop), 2013.
[42] G. Attardi, L. Atzori, M. Simi, Index expansion for machine reading and question answering, in: Third International Conference of the CLEF Initiative, CLEF, 2012.
[43] S. Bhattacharya, L. Toldo, Question answering for Alzheimer disease using information retrieval, in: Third International Conference of the CLEF Initiative, CLEF, 2012.
[44] D. Martinez, A. MacKinlay, D. Molla-Aliod, L. Cavedon, K. Verspoor, Simple similarity-based question answering strategies for biomedical text, in: Third International Conference of the CLEF Initiative, CLEF, 2012.
[45] D. Demner-Fushman et al., Combining resources to find answers to biomedical questions, in: TREC, 2007.
[46] C. Fautsch, J. Savoy, IR-specific searches at TREC 2007: genomics & blog experiments, in: TREC, 2007.
[47] R.T.K. Lin et al., Biological question answering with syntactic and semantic feature matching and an improved mean reciprocal ranking measurement, in: IRI, IEEE Systems, Man, and Cybernetics Society, 2008, pp. 184–189.

[48] J.-D. Kim, T. Ohta, Y. Tateisi, J. Tsujii, Bioinformatics 19 (2003) i180–i182.
[49] Z. Shi et al., Question answering summarization of multiple biomedical documents, in: Z. Kobti, D. Wu (Eds.), Canadian Conference on AI, Lecture Notes in Computer Science, vol. 4509, Springer, 2014, pp. 284–295.
[50] F. Rinaldi, J. Dowdall, G. Schneider, A. Persidis, Answering questions in the genomics domain, in: Proceedings of the ACL 2004 Workshop on Question Answering in Restricted Domains, 2004, pp. 46–53.
[51] P. Thomas, J. Starlinger, U. Leser, Experiences from developing the domain-specific entity search engine GeneView, in: 15. GI-Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW), 2013.
[52] A. Doms, M. Schroeder, Nucleic Acids Res. 33 (2005) W783–W786.
[53] R. Van Noorden, Trouble at the text mine, Nature 483 (2012) 134–135. http://dx.doi.org/10.1038/483134a.
[54] S. Tartir, I. Arpinar, B. Mcknight, SemanticQA: exploiting semantic associations for cross-document question answering, in: Fourth International Symposium on Innovation in Information Communication Technology (ISIICT), 2011, pp. 1–6.
[55] P. Thomas, M. Neves, T. Rocktäschel, U. Leser, WBI-DDI: drug–drug interaction extraction using majority voting, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 628–635.
[56] C.N. Arighi et al., Database 2013 (2013).
[57] P. Zweigenbaum, Question answering in biomedicine, in: Proc. EACL 2003 Workshop on NLP for Question Answering, Budapest, 2003.
[58] D.L. Sackett, S.E. Straus, W.S. Richardson, W. Rosenberg, R.B. Haynes, Evidence-Based Medicine: How to Practice and Teach EBM, second ed., Churchill Livingstone, 2000.
[59] G.R. Bergus, C.S. Randall, S.D. Sinift, D.M. Rosenthal, Arch. Fam. Med. 9 (2000) 541–547.
[60] Y. Cao et al., J. Biomed. Inform. 44 (2011) 277–288.
[61] S. Cruchet, A. Gaudinat, T. Rindflesch, C. Boyer, What about trust in the question answering world?, in: Proceedings of the AMIA Annual Symposium, San Francisco, USA, 2009, pp. 1–5.
[62] E. Strickland, E. Guy, IEEE Spectrum 50 (2013) 42–45.
[63] J.L. Malin, J. Oncol. Pract. 9 (2013) 155–157.
[64] M.G. Zauderer et al., Piloting IBM Watson oncology within Memorial Sloan Kettering's regional network, in: Journal of Clinical Oncology, 2014 ASCO Annual Meeting Abstracts, vol. 32, 2014.
[65] A. Ben Abacha, P. Zweigenbaum, Medical question answering: translating medical questions into SPARQL queries, in: Proceedings of the Second ACM SIGHIT International Health Informatics Symposium, IHI '12, ACM, New York, NY, USA, 2012, pp. 41–50.
