Griffon et al. BMC Medical Informatics and Decision Making (2016) 16:101 DOI 10.1186/s12911-016-0333-0


Open Access

Searching for rare diseases in PubMed: a blind comparison of Orphanet expert query and query based on terminological knowledge

N. Griffon1,2*, M. Schuers1,3, F. Dhombres2,4, T. Merabti1, G. Kerdelhué1, L. Rollin1,5 and S. J. Darmoni1,2

Abstract

Background: Despite international initiatives like Orphanet, it remains difficult to find up-to-date information about rare diseases. The aim of this study is to propose an exhaustive set of queries for PubMed based on terminological knowledge and to evaluate it against the expert queries provided by the most frequently used resource in Europe: Orphanet.

Methods: Four rare disease terminologies (MeSH, OMIM, HPO and HRDO) were manually mapped to each other, permitting the automatic creation of expanded terminological queries for rare diseases. For 30 rare diseases, 30 citations retrieved by the Orphanet expert query and/or the query based on terminological knowledge were assessed for relevance by two independent reviewers unaware of the query's origin. An adjudication procedure was used to resolve any discrepancy. Precision, relative recall and F-measure were all computed.

Results: For each Orphanet rare disease (n = 8982), there was a corresponding terminological query, in contrast with only 2284 queries provided by Orphanet. Only 553 citations were evaluated because some queries returned 0 or only a few hits. There was no significant difference in precision between the Orpha query and the terminological query (0.61 vs 0.52, respectively; p = 0.13). Nevertheless, terminological queries retrieved more citations more often than Orpha queries (0.57 vs. 0.33; p = 0.01). Interestingly, Orpha queries seemed to retrieve older citations than terminological queries (p < 0.0001).

Conclusion: The terminological queries proposed in this study are now available for all rare diseases. They may be a useful tool for both precision-oriented and recall-oriented literature searches.

Keywords: PubMed, Rare diseases, Bibliography as topic, Terminology as topic

Background

There is currently no consensus definition of a rare disease: in Europe, a disease is considered rare if it affects fewer than 1 in 2000 citizens, while in the United States of America (USA), the threshold was set at 200,000 in the entire population [1] (approximately 1 in 1600 according to the US Census Bureau [2]).

* Correspondence: [email protected]
1 Department of Biomedical Informatics, Rouen University Hospital, TIBS, LITIS EA 4108, Rouen University, 76031 Rouen Cedex, France
2 INSERM, U1142, LIMICS, 75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06 UMR_S 1142, LIMICS, 75006, Paris, France; Univ Paris 13, Sorbonne Paris Cité, LIMICS (UMR_S 1142), 93430, Villetaneuse, France
Full list of author information is available at the end of the article

These coarse definitions lead to major heterogeneity among rare diseases:
• Most genetic diseases are rare diseases, but some infectious diseases, cancers and auto-immune diseases are also rare.
• They may occur at any point in life.
• There are geographical variations: a disease may be rare in one country (like periodic disease in France) but quite frequent in another (periodic disease in Armenia).

© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


• Some are well known and have been described for a number of years, whereas others have been discovered recently and information about them is scarce.

Furthermore, under these definitions 5000 to 8000 rare diseases are known, which leads to the “paradox of rarity”: each disease is rare, but patients with rare diseases are numerous. Having a clear vision of the prevalence of rare diseases is not an easy task. Nevertheless, it is commonly accepted that approximately 5 to 10 % of the population suffer from rare diseases (8–9 % in the USA [1], 6–8 % in the European Union [3]). In both regions, this corresponds to approximately 30,000,000 patients suffering from a rare disease, making it a real public health concern [4]. This heterogeneity and frequency of rare diseases translate into numerous different situations in which information is needed:
• Finding a physician with adequate experience may be easy when a reference center exists, but can be a real difficulty if care pathways are not identified [5].
• Providing medical care for patients with a rare disease is a difficult task for physicians, even when the care episode does not concern the rare disease itself.
• Writing a systematic review about a rare disease, or doing a short review in order to write a research article, requires querying one or more bibliographic databases with as many relevant keywords as possible [6].

It seems of public health importance to provide all these participants with the appropriate tools to easily retrieve relevant information about rare diseases. PubMed is one of the most popular search engines to access medical literature. It searches the MEDLINE bibliographic database, which gathers a large part of biomedical scientific articles, and some other minor resources [7]. MEDLINE is indexed using the MeSH® thesaurus. Although PubMed theoretically allows access to the literature about rare diseases, including the most recent scientific discoveries, the combination of the following elements may hinder users:
• the relative novelty of MeSH terms for rare diseases [8]. Until 2010, MeSH contained only a few rare diseases, so citations pertaining to rare diseases published before 2010 are not indexed precisely for rare diseases. Since that date, 10,354 rare diseases, as defined by the Office of Rare Diseases Research (ORDR) [9], have been introduced in MeSH (source MeSH 2014),
• the delay in article MeSH-indexing in PubMed [10], which can be several weeks to several months, according to the importance of the journal, and
• the lack of knowledge about MeSH among health professionals and lay-persons [11].

It is therefore difficult for physicians, and even more so for patients, to query PubMed in an effective way, and especially to find an article about rare diseases published before 2010 or in recent months. Several institutions (Genetic and Rare Diseases Information Center [12] and Orphanet [13]) already gather information about rare diseases on their websites, including a brief summary, clinical information and many links to other resources. Sometimes a link to an expert-based PubMed query is provided, limiting the user's task to assessing citation relevance. Nevertheless, in the case of Orphanet these queries do not always take advantage of all the MeSH/PubMed functionalities and they are far from providing a comprehensive coverage of all rare diseases. Moreover, the methodology used to establish these queries is not disclosed on the Orphanet website. The aim of this study was to propose a set of queries linked to each rare disease term in Orphanet and to evaluate these queries against those developed by Orphanet.

Methods

PubMed overview

PubMed is the bibliographic database most frequently used by biomedical scientists throughout the world. It therefore constitutes a standard in terms of information retrieval. MEDLINE is the major component of PubMed, gathering almost 90 % of the 26 million PubMed citations. MEDLINE curators assign to each citation a list of MeSH terms that describe it with a controlled level of granularity. The atomic unit of MeSH is the MeSH concept, a class of synonymous terms, i.e. all terms gathered in a MeSH concept are true synonyms. MeSH concepts closely related to each other in meaning may be gathered in a MeSH descriptor (MeSH D) or a MeSH supplementary concept (MeSH SC), one of them being the preferred concept and the others being narrower than, broader than or related to the preferred one. Both MeSH D and MeSH SC are used to index citations, but they differ in several ways. First, MeSH SC are quite specific terms: they are used to index chemicals, drugs, and other concepts such as rare diseases. Second, MeSH SC, unlike MeSH D, are not classified: they are only linked to one or more MeSH D, usually broader, by a specific relationship. Lastly, there are many more MeSH SC (≈200,000) than MeSH D (≈27,000). PubMed users may specify which search field they want to use in their query with operators written between brackets. Table 1 presents some operators and their meaning.


Table 1 Some operators used in PubMed

Operator | Meaning
[ti] | The term is considered as a free text keyword and searched for in title
 | The term is considered as a free text keyword and searched for in abstract
[mh] | The term, a MeSH descriptor, and all the terms it subsumes, are searched for in MeSH indexing
[majr] | The term, a MeSH descriptor, and all the terms it subsumes, are searched for in MeSH major indexing
[nm] | The term, a MeSH supplementary concept, is searched for in MeSH indexing
[tw] | The term is considered as a free text keyword and searched for in multiple fields of PubMed citation (title, abstract, MeSH indexing, other keywords etc.)

PubMed queries

Orpha queries

Orphanet PubMed queries were manually created by Orphanet experts. These queries are available on the Orphanet web site, on each disease page (for the diseases that have an Orphanet PubMed query). For example, for the Orphanet concept “retroperitoneal fibrosis”, the PubMed query is: retroperitoneal fibrosis[majr] OR Retroperitoneal fibrosis[ti]. For the Orphanet concept “Blount disease”, the query is: Blount disease[tw] OR tibia vara[tw].
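As an illustration of how such a query string can be submitted programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch service from the retroperitoneal fibrosis example. It is only a sketch: the function name `esearch_url` and the choice of `retmax=10` are assumptions made for this example, and nothing here describes how Orphanet or the authors actually ran their queries. No network call is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query: str, retmax: int = 10) -> str:
    """Return an ESearch URL for a PubMed query (hypothetical helper)."""
    params = {
        "db": "pubmed",     # search the PubMed database
        "term": query,      # the query, field tags such as [majr] included
        "retmax": retmax,   # maximum number of PMIDs to return
        "retmode": "json",  # ask for a JSON response
    }
    # urlencode percent-encodes brackets and spaces in the query.
    return f"{ESEARCH}?{urlencode(params)}"


orpha_query = "retroperitoneal fibrosis[majr] OR Retroperitoneal fibrosis[ti]"
url = esearch_url(orpha_query)

# Decoding the URL gives back the original query unchanged.
decoded_term = parse_qs(urlsplit(url).query)["term"][0]
print(url)
print(decoded_term == orpha_query)
```

The field tags travel inside the `term` parameter, so the same helper would work for either type of query discussed in this article.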


Terminological queries

In addition to the MeSH thesaurus, several other terminologies and ontologies are available on rare diseases: (a) a formal ontology named HRDO (Human Rare Disease Ontology) [14], developed based on the Orphanet classification and available in five European languages: English, French, German, Spanish and Portuguese; (b) the Online Mendelian Inheritance in Man (OMIM) database, developed at Johns Hopkins University [15]; (c) the Human Phenotype Ontology (HPO), a formal ontology that allows phenotypic information in medical publications and databases to be described unambiguously [16]. The HPO is freely available. One of the authors (SJD) created exact match mappings between MeSH, OMIM, HPO and HRDO based on suggestions from an algorithm combining natural language processing and a conceptual approach [17, 18]. Exact match mapping means that the two concepts are real synonyms (e.g. the “Absent corpus callosum cataract immunodeficiency” MeSH concept and the “Vici syndrome” HRDO disease). Using these alignments, PubMed queries are created automatically, according to a published algorithm [19]. The algorithm output depends on the type of MeSH term mapped to: MeSH concept, MeSH SC or MeSH D (see Table 2 for examples):

a) If the HRDO concept is mapped to a MeSH Descriptor, the query structure is as follows: Disease[MH] OR Disease[TW] OR Synonyms Disease MeSH Descriptor[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] (if an exact match mapping exists between HRDO concept and OMIM concept) OR Synonyms Disease HPO[TW] (if an exact match mapping exists between HRDO concept and HPO concept)

Table 2 Examples of queries according to the type of the MeSH term mapped to the HRDO concept

Type of mapped MeSH term: MeSH descriptor
HRDO concept example: “retinal dystrophy”
MeSH part of the query: “retinal dystrophies”[MH] OR “retinal dystrophies”[TW] OR “dystrophies, retinal”[TW] OR “dystrophy, retinal”[TW] OR “retinal dystrophy”[TW] OR
HRDO part of the query: “Retinal dystrophy”[TW] OR
OMIM part of the query: -
HPO part of the query: “Retinal dystrophy”[TW]

Type of mapped MeSH term: MeSH supplementary concept
HRDO concept example: “Omenn syndrome”
MeSH part of the query: “reticuloendotheliosis, familial, with eosinophilia”[NM] OR “reticuloendotheliosis, familial, with eosinophilia”[TW] OR “severe combined immunodeficiency with hypereosinophilia”[TW] OR
HRDO part of the query: “Omenn syndrome”[TW] OR “Combined immunodeficiency with hypereosinophilia”[TW] OR
OMIM part of the query: “Omenn syndrome”[TW]

Type of mapped MeSH term: MeSH concept
HRDO concept example: “Charcot-Marie-Tooth disease, type ib”
MeSH part of the query: “Charcot-Marie-Tooth disease, type ib”[TW] OR “1B, HMSN”[TW] OR “1Bs, HMSN”[TW] OR
HRDO part of the query: “Charcot-Marie-Tooth disease type 1B”[TW] OR “CMT1B”[TW] OR
OMIM part of the query: “Charcot-marie-tooth disease, demyelinating, type 1b”[TW]

Type of mapped MeSH term: Not a MeSH term
HRDO concept example: “Isolated oxycephaly”
HRDO part of the query: “Isolated oxycephaly”[TW] OR “Turricephaly”[TW] OR “Nonsyndromic oxycephaly”[TW] OR

Each block contains one example of a PubMed query corresponding to the HRDO concept given as “HRDO concept example”. Each “part of the query” entry gathers all the synonyms for the considered disease in one terminology. The final queries are composed of every synonym from every terminology, linked by “OR”. The final PubMed query for the Isolated oxycephaly disease is: “Turricephaly”[TW] OR “Nonsyndromic oxycephaly”[TW] OR “Isolated oxycephaly”[TW]. The last “OR” “turricephaly” is redundant. In this case, the final query is deducible from only one terminology (HRDO)
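The way the Table 2 examples are assembled (a field tag chosen from the type of the mapped MeSH term, then every synonym from every terminology joined by OR, with redundant terms dropped) can be sketched as follows. This is a minimal illustration and not the published algorithm [19]: the function name, the dictionary layout and the case-insensitive deduplication rule are assumptions, and the HPO synonym in the second example is hypothetical.

```python
# Field tag for the preferred MeSH label, by type of mapped MeSH term.
FIELD_BY_MESH_TYPE = {"descriptor": "MH", "supplementary": "NM"}
SOURCES = ("MeSH", "HRDO", "OMIM", "HPO")


def build_query(mesh_type, mesh_label, synonyms):
    """Join the MeSH label and all synonyms into one OR query.

    mesh_type: "descriptor", "supplementary", "concept" or None.
    mesh_label: preferred MeSH label, or None when there is no mapping.
    synonyms: dict mapping a source name to a list of terms.
    """
    clauses, seen = [], set()

    def add(term, field):
        # Assumption: duplicates are detected case-insensitively per field.
        key = (term.casefold(), field)
        if key not in seen:
            seen.add(key)
            clauses.append(f'"{term}"[{field}]')

    if mesh_label is not None:
        if mesh_type in FIELD_BY_MESH_TYPE:
            add(mesh_label, FIELD_BY_MESH_TYPE[mesh_type])
        add(mesh_label, "TW")
    for source in SOURCES:
        for term in synonyms.get(source, ()):
            add(term, "TW")
    return " OR ".join(clauses)


omenn = build_query(
    "supplementary",
    "reticuloendotheliosis, familial, with eosinophilia",
    {
        "MeSH": ["severe combined immunodeficiency with hypereosinophilia"],
        "HRDO": ["Omenn syndrome",
                 "Combined immunodeficiency with hypereosinophilia"],
        "OMIM": ["Omenn syndrome"],  # duplicate of the HRDO term, dropped
    },
)

oxycephaly = build_query(
    None,
    None,
    {
        "HRDO": ["Isolated oxycephaly", "Turricephaly",
                 "Nonsyndromic oxycephaly"],
        "HPO": ["turricephaly"],  # hypothetical redundant synonym
    },
)
print(omenn)
print(oxycephaly)
```

Dropping duplicates does not change what PubMed retrieves; it only keeps the generated query short.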


b) If the HRDO concept is mapped to a MeSH Supplementary Concept, the query structure is as follows: Disease[NM] OR Disease[TW] OR Synonyms Disease MeSH Supplementary Concept[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]

c) If the HRDO concept is mapped to a MeSH Concept, the query structure is as follows: Disease[TW] OR Synonyms Disease MeSH Concept[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]

d) If the HRDO concept is not mapped to the MeSH thesaurus, the query structure is as follows: Disease[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]

Relevance evaluation

Thirty rare diseases were randomly selected from the subset with both an Orphanet query and a terminological query. The selected rare diseases are listed in Table 4 (at the end of the document). Diseases with a prevalence higher than 1/2000 were considered as not rare. One author (GK) gathered the first ten citations retrieved (PubMed “recently added” ranking), for each rare disease, using the following queries:

\[ Q_1 = Q_{\mathrm{Orpha}} \;\mathrm{AND}\; Q_{\mathrm{Term}} \quad (1) \]

\[ Q_2 = Q_{\mathrm{Orpha}} \;\mathrm{NOT}\; Q_{\mathrm{Term}} \quad (2) \]

\[ Q_3 = Q_{\mathrm{Term}} \;\mathrm{NOT}\; Q_{\mathrm{Orpha}} \quad (3) \]

where $Q_{\mathrm{Orpha}}$ is the Orpha query and $Q_{\mathrm{Term}}$ the terminological query. Therefore, Q1 retrieved citations common to both the Orpha and the terminological query, Q2 retrieved citations specific to the Orpha query and Q3 retrieved citations specific to the terminological query. GK then hid the retrieving query, so that the evaluators were blinded to the type of query. The anonymised citations were split between four physicians (FD, LR, MS and NG) in such a way that (i) each citation was evaluated twice and (ii) each evaluator shared one third of their evaluations with each of the other three evaluators. Evaluators had to answer the following question for each citation: “Does the article directly concern the disease?” In case of disagreement, a third evaluator evaluated the citation and the discrepancy was resolved by consensus. More information regarding relevance evaluation is available in Additional file 1.
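The three evaluation queries and the double-evaluation design described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the round-robin assignment is only one simple scheme that satisfies the two stated constraints; the source does not say how the citations were actually distributed.

```python
from collections import Counter
from itertools import combinations, cycle


def split_queries(q_orpha, q_term):
    """Combine two PubMed queries into common and specific parts."""
    return {
        "Q1": f"({q_orpha}) AND ({q_term})",  # common to both queries
        "Q2": f"({q_orpha}) NOT ({q_term})",  # specific to the Orpha query
        "Q3": f"({q_term}) NOT ({q_orpha})",  # specific to the terminological query
    }


def assign_pairs(citation_ids, evaluators=("FD", "LR", "MS", "NG")):
    """Give each citation to two evaluators, cycling over all pairs.

    Four evaluators form six pairs and each evaluator belongs to three
    of them, so each one shares a third of their load with each colleague
    (exactly so when the number of citations is a multiple of six).
    """
    pairs = cycle(combinations(evaluators, 2))
    return dict(zip(citation_ids, pairs))


queries = split_queries("Blount disease[tw] OR tibia vara[tw]",
                        '"tibia vara"[TW]')  # toy terminological query
assignment = assign_pairs(range(12))          # 12 hypothetical citations

load = Counter(e for pair in assignment.values() for e in pair)
shared = Counter(frozenset(pair) for pair in assignment.values())
print(queries["Q2"])
print(dict(load))
```

The parentheses around each operand matter: both source queries contain OR, and without grouping the AND and NOT operators would not apply to the whole query.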


Statistical analysis

Agreement between evaluators was measured by kappa. HRDO rare diseases may be split into two sub-populations: terms with an Orpha query and terms without one. These two sub-populations were compared according to available determinants to ensure generalizability. For each rare disease, it is possible to estimate the precision ($p_i$) of each query (Q1, Q2, Q3; see Eq. 4).

\[ p_i = \frac{n(\mathrm{rel}_{Q_i})}{n(\mathrm{eval}_{Q_i})} \quad (4) \]

where $n(\mathrm{rel}_{Q_i})$ and $n(\mathrm{eval}_{Q_i})$ are the number of relevant citations and the number of evaluated citations for query i, respectively. Orpha queries and terminological queries were compared according to micro average precision, number and publication date of retrieved citations, and use of MeSH terms. Non-parametric tests were used: Fisher's test for qualitative variables (micro average precision and MeSH use) and the Wilcoxon and Kruskal-Wallis tests for quantitative variables (number of citations and date). The Dunn test allows pairwise comparison after Kruskal-Wallis.
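The precision of Eq. 4, its micro average over diseases and the inter-evaluator agreement can be sketched as below. This is a minimal sketch: the counts and ratings are invented for illustration, the function names are hypothetical, and the use of Cohen's kappa for two raters is an assumption, since the text only says "kappa".

```python
def precision(n_rel, n_eval):
    """Eq. 4: relevant citations over evaluated citations."""
    return n_rel / n_eval if n_eval else None  # undefined when nothing was evaluated


def micro_average_precision(counts):
    """Pool (n_rel, n_eval) pairs over all diseases before dividing."""
    total_rel = sum(rel for rel, _ in counts)
    total_eval = sum(ev for _, ev in counts)
    return total_rel / total_eval if total_eval else None


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters judging the same citations."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    labels = set(ratings_a) | set(ratings_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical (relevant, evaluated) counts for one query over three diseases.
counts = [(8, 10), (3, 5), (0, 0)]
micro = micro_average_precision(counts)   # (8 + 3) / (10 + 5)

# Hypothetical relevance judgements (1 = relevant) from two evaluators.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(micro, kappa)
```

Micro averaging weights each citation equally, so a disease whose query returned no citation simply contributes nothing to either sum.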

Results

HRDO, in its 09/11/2013 version, inventories 9060 diseases and groups of diseases. Seventy-eight were not considered as rare diseases because their prevalence, as specified by Orphanet, was above the European threshold, so the study considered only the remaining 8982 rare diseases. Table 3 lists the number of alignments created or validated by SJD. Only 2284 HRDO rare diseases have a manually validated Orphanet query (25.4 %). A terminological query is generated for each disease in Orphanet (whether rare or not). Orpha queries and terminological queries respectively retrieved 0 citations in 5 (
