Informatics for Health and Social Care, 2015; 40(1): 45–66 © Informa UK Ltd. ISSN: 1753-8157 print / 1753-8165 online DOI: 10.3109/17538157.2013.872109

A framework for automatic information quality ranking of diabetes websites

Rahime Belen Sağlam and Tugba Taskaya Temizel
Department of Information Systems, Informatics Institute, Middle East Technical University, Ankara, Turkey

Objective: When searching for particular medical information on the internet, the challenge lies in distinguishing the websites that are relevant to the topic and contain accurate information. In this article, we propose a framework that automatically identifies and ranks diabetes websites according to their relevance and information quality based on the website content. Design: The proposed framework ranks diabetes websites according to their content quality, relevance and evidence-based medicine. The framework combines information retrieval techniques with a lexical resource based on SentiWordNet, making it possible to work with biased and untrusted websites while, at the same time, ensuring content relevance. Measurement: The evaluation measurements used were Pearson correlation, true positives, false positives and accuracy. We tested the framework with a benchmark data set consisting of 55 websites with varying degrees of information quality problems. Results: The proposed framework gives good results that are comparable with the non-automated information quality measuring approaches in the literature. The correlation between the results of the proposed automated framework and the ground truth is 0.68 on average with p < 0.001, which is greater than that of the other automated methods proposed in the literature (average r score of 0.33).

Keywords: Biased content, diabetes, information quality, quality assessment

INTRODUCTION
Since the emergence of the internet, online content has become one of the main sources of information relating to health issues. However, due to the rapidly increasing number of health websites, information consumers are now in search of unbiased, accurate, relevant and up-to-date information. The main tool for members of the public to obtain this information is search engines, which usually return highly popular websites that were created not by medical experts and professional organizations but rather by companies or patients that are biased towards a specific medication or alternative treatment. Thus, medical information seekers are often unable to judge whether the information is valid and/or of high quality. This situation sometimes results in users ceasing the treatment program prescribed by their practitioners or using medications without their practitioners’ knowledge. Recent statistics demonstrate the seriousness of the problem. The Porter Novelli

Correspondence: Tugba Taskaya Temizel, Department of Information Systems, Informatics Institute, Middle East Technical University, Universiteler Mah., Dumlupinar Bulvari, 06800 Ankara, Turkey. E-mail: [email protected]


EuroPNStyles survey showed that 65% of the people in Europe use the internet when they want information about a medical query (1). Another study revealed that, in order to improve health outcomes particularly for chronic illnesses, many physicians encourage patients to be more involved in their medical care by researching their condition on the internet. However, other medical practitioners warn their patients against misleading information on the internet, which can heighten patient anxiety and lead to cyberchondria (2). The majority of the studies in the literature have proposed manual information quality assessment guidelines. In these guidelines, several evaluation criteria such as the authors’ credibility, citations and last update time are gathered in the form of a questionnaire, and domain experts then assess websites based on these criteria. However, these methods require domain knowledge, and the manual assessment of a particular website takes a significant amount of time. They do not take into consideration the relevance of the information to a search query, and they are not evidence-based assessment techniques (3–9). Although both information relevance and quality are handled in a few studies, the proposed methods are vulnerable to biased or even promotional content created with commercial intent or written amateurishly (10). This article describes an automated framework designed to rank websites that are related to the diagnosis, treatment and control of diabetes. The ranking is based on the degree of information quality, taking into account the relevance of the information and the features of the website. In the study, a quantitative research strategy was followed, and the results generated by the framework were compared with manual scores produced under an evidence-based medicine (EBM) approach.
EBM is the conscientious, explicit and judicious use of the current best evidence in making decisions about individual patients’ care by gathering the best available external clinical evidence from systematic research (11). It aims to ensure that medical decisions are evidence based, integrating both individual clinical expertise and the best external evidence. There is no generic gold standard (a definitive and decisive ultimate standard in medicine) which is applicable to a great number of different health-related websites and in accordance with EBM; different gold standards have been reported for different health issues. The manual scores used for the comparison in this study were obtained using a gold standard for EBM on diabetes provided by the American Diabetes Association (ADA). Encouraging results were obtained from the current study, and they compare favorably with the methods presented in the literature. In contrast with previous studies, irrelevant web pages within websites were not eliminated manually; all the pages were assessed automatically in order to calculate the overall score of the website, taking into consideration the relevance to diabetes of each web page. In addition, biased information was taken into account in the quality scoring in this study, whereas other methods in the literature ignored this criterion. The contributions of this article are summarized as providing:

a document quality ranking method for diabetes websites which takes into account the content quality, bias characteristics and information relevance of the website.




a biased website identification method based on SentiWordNet.

The rest of the article is organized as follows. The section titled ‘‘Background’’ discusses the existing studies. The section titled ‘‘Method’’ defines the data set, experimental settings and proposed methodology. The section titled ‘‘Results’’ reports on the experimental results and ‘‘Conclusion and discussion’’ concludes with a summary and discussion of future work.

BACKGROUND
The era of research on the assessment of information quality on the web coincides with the realization that biased information has an impact on the quality of search engine results. The certification of websites is one of the well-known information quality assessment methods, with the most widely used being HONcode (12) and URAC (13). Website providers who wish to display such logos on their websites are required to implement the principles of the certification organizations. Several companies provide similar third-party rating services, such as the Internet Content Rating Association (ICRA) (14). However, these logos are not seen on the majority of health websites (15). Self-regulation is another method in which questionnaires or similar means are provided in order to aid online health consumers to self-assess the quality of a website in a valid and reliable way. Examples of this method are DISCERN (16) and NETSCORING (17). DISCERN provides a brief questionnaire which is designed to help users judge the quality of information in relation to treatment choices (16). The DISCERN ratings have been found to be significantly correlated with evidence-based quality ratings (18). However, it has been reported that this tool requires a significant amount of time to complete since the user has to respond to a series of generic questions for each website (18). NETSCORING lists criteria such as credibility, content, ethics and accessibility, which are rated on a scale of 0–9; the sum of these ratings gives the overall score of a site (17). Both DISCERN and NETSCORING are often used by national health portals; furthermore, website providers also utilize these tools in the development of new websites to ensure a high level of information quality. The cost of these tools to the providers and developers is very low, since the assessment burden falls on the information seekers.
However, the significant amount of time and effort that information seekers must expend tends to reduce the use of these tools and their benefit (19). Another important strategy is the use of contextual features (e.g. source, date, author information, presentation and layout) and/or content features (e.g. specific words and number of words) of the websites. Table 1 presents a summary of 10 widely accepted information quality frameworks collated from research on information systems. In 2007, Wang and Zhenkai (15) proposed one of the preliminary studies on the automatic detection of the contextual quality indicators of health information on the web. They developed the Automatic Indicator Detection Tool (AIDT), which detected the presence of indicators of information quality dimensions. However, the researchers did not make any further improvements


Table 1. The features identified as indicators of information quality on medical websites in the literature.

Kunst et al. (2002) (3): Source, currency, evidence hierarchy.
Frické and Fallis (2002) (4): Displaying the HONcode logo, having an organization domain, displaying a copyright.
Frické et al. (2005) (5): Inlinks to the main page of a site, unbiased presentation of information.
McInerney and Nora (2005) (6): Site last update date.
Griffiths and Christensen (2005) (18): DISCERN score.
Lewiecki et al. (2006) (7): URL suffix (.com, .edu).
Schwartz et al. (2006) (8): Endorsement of the site by a government agency or a professional organization.
Sillence et al. (2007) (9): Inappropriate name for the website; complex, busy layout; lack of navigation aids; pop-up adverts; too much text; poor search facilities/indexes; irrelevant or inappropriate content.
Khazaal et al. (2008) (20): DISCERN, HON label.
Barnes et al. (2009) (21): BWQC score, DISCERN score, having an editorial board, affiliation to a professional organization.

to the tool to produce a quality ranking score. The AIDT gave an idea about the ‘‘information quality’’ to a certain degree, but it gave no information about ‘‘information relevance’’. One of the most significant contributions to web content quality research was Discovery Challenge 2010 (22). In the competition, the participants were given three tasks: (i) a classification task (categorization of web contents as spam, news, commercial, educational, discussion lists, personal, neutral, biased, trust and quality); (ii) a quality task (quality ranking of contents for English sites); and (iii) a multilingual quality task (quality ranking for German and French sites). Participants were provided with a data set that consisted of sample web hosts from Europe with training and testing samples. The content of the websites was not specific to a domain. The data set provided in the competition included training labels, URLs and hyperlinks, content-based and link-based web spam features, term frequencies and natural language processing features. In the competition, the performance of the proposed methods in terms of biased web page identification was low. The work of Geng et al. (23) ranked first in the quality task. They utilized the statistical content features, the page- and host-level link features given by the organization committee, and TF-IDF features. The researchers reported an accuracy of 0.936 in assessing the content quality. However, the accuracy in detecting trustworthiness was reported as 0.526 and bias as 0.606, which is notably lower than the reported accuracy for the assessment of content quality. This methodology has a serious limitation for use on medical websites since trustworthiness and accuracy are the most important quality dimensions in this area. Another important limitation is that the authors did not take into consideration information relevance and domain-specific content features, which are critical in health websites.


In the literature, only one study was found in which both relevance and quality are handled. Griffiths et al. (24) exploited EBM in assessing health-related content, providing a framework that returned results that were not only relevant to the query but also in accordance with the EBM guidelines. However, their proposed framework was not able to detect any biased content on websites since this was outside the scope of their method. As mentioned earlier, there is no common gold information quality standard which is applicable to all types of health websites. In their work, Griffiths et al. (24) used the guideline for websites concerning the mental health issue of depression provided by the Centre for Evidence-Based Medicine at the University of Oxford. Another study which utilized a gold standard for EBM was conducted on diabetes, using the standard provided by the ADA (25). Seidman et al. (10) employed the latter standard to develop a conceptual framework to manually assess the accuracy and comprehensiveness of websites about diabetes. In their study, Seidman et al. used the Direct Hit search engine to track the most popular sites returned by search terms which included ‘‘diabetes’’ as a keyword.
In order to evaluate the websites selected by Direct Hit, the researchers created a data abstraction tool in the form of a questionnaire which included comprehensive evaluation criteria such as: an explanation of the methods (whether the site gives an explanation of the process for generating its health content, and information pertaining to the author(s) and their affiliations, credentials and contact details); the validity of the methods (whether assertions are supported by referenced material and material on the site has gone through peer review); currency of information (whether the site gives an explanation of the process for updating its health content, whether each web page indicates the date of last update and whether the page has been updated within the last 6 months); and accuracy of information (whether the site explains Type 1 (lack of insulin), Type 2 (insulin does not work effectively) and the main secondary causes). In addition to these evaluation criteria, Seidman et al. included additional background data such as sponsorship information (advertisement versus no advertisement, profit versus not-for-profit, academic versus non-academic and governmental versus private). In the evaluation, the external reviewers scored each website as shown in Table 2. Each website received 1 point for each criterion that was satisfied. In order to assess how much agreement existed between the reviewers on each criterion, Kappa statistics and Lin’s concordance correlation were used. It was reported that the average time required to assess each site was 30.26 min, which is quite high and not feasible for ordinary users. Recent studies have assessed medical websites and social media content in terms of their subjectivity. Denecke (26) analyzed medical blogs in order to distinguish affective and informative content within a blog while disregarding the medical relevance.
In the study, affective posts are defined as posts that do not contain any medical content but rather the thoughts, feelings or experiences of treatments, diseases and medications. In order to determine the proportion of affective content, SentiWordNet (27) was employed. SentiWordNet is a freely available lexical resource in which each synset of WordNet (28) is associated with three numerical scores describing the objectivity, positivity and negativity levels of the terms contained in the synset.


Table 2. The websites that were assessed in the study undertaken by Seidman et al. (10).

No.  Website                                       Overall quality score  Type  Web pages  Size
 1   http://healthlink.mcw.edu/article                     50             GN      1051     36 MB
 2   http://my.webmd.com/index                             63             GN       521     29 MB
 3   http://www.banting.com/                               59             DB         9     48 KB
 4   http://www.bbc.co.uk/health/diabetes                  50             GN      1138     63 MB
 5   http://www.bddiabetes.com/                            31             DB       387      9 MB
 6   http://www.defeatdiabetes.org/                        45             DB       271      8 MB
 7   http://www.dhfs.state.wi.us/health/diabetes           78             GN       142      2 MB
 8   http://www.diabetes.ca/                               88             DB      1110     48 MB
 9   http://www.diabetes.about.com                         49             DB       510     12 MB
10   http://www.diabetes.org/                              85             DB       757     59 MB
11   http://www.diabetesaustralia.com.au/                  71             DB        21    488 KB
12   http://www.diabetesnet.com/                           65             DB       547     40 MB
13   http://www.diabetesnews.com/                          65             DB        99      3 MB
14   http://www.diabetesohio.org/                          61             DB        63      3 MB
15   http://www.diabetic.org.uk/                           73             DB        33    668 MB
16   http://www.docguide.com/                              56             GN      2575     89 MB
17   http://www.dr-diabetes.com/                           28             DB         9    152 KB
18   http://www.drkoop.com/                                75             GN      2477    105 MB
19   http://www.drmirkin.com/diabetes                      41             GN       122      1 MB
20   http://www.endocrineweb.com/diabetes                  33             GN       739     26 MB
21   http://www.evms.edu/diabetes                          43             DB       521     12 MB
22   http://www.focusondiabetes.com/                       65             DB         1     20 KB
23   http://www.healingwell.com/                           53             GN      3185     88 MB
24   http://www.health.state.ut.us/cfhs                    35             GN        41    780 KB
25   http://www.healthtalk.com/den/index                   41             GN      4414    214 MB
26   http://www.idcpublishing.com/                         41             DB       140      5 MB
27   http://www.idf.org/                                   47             DB       402     35 MB
28   http://www.joslin.harvard.edu/                        73             DB       739     18 MB
29   http://www.lillydiabetes.com/                         50             DB        31      1 MB
30   http://www.mayoclinic.com/                            80             GN       217      9 MB
31   http://www.merck.com/pubs/mmanual_home                56             GN      1333     71 MB
32   http://www.msdiabetes.org/                            27             DB       165      3 MB
33   http://www.musc.edu/diabetes                          18             GN       236      6 MB
34   http://www.netdoctor.co.uk/                           73             GN      2067     96 MB
35   http://www.niddk.nih.gov/                             65             GN       586     19 MB
36   http://www.nzgg.org.nz/library                        42             GN      1349     73 MB
37   http://www.onlinemedinfo.com/                         60             GN       744     25 MB
38   http://www.sddiabetes.net/                            74             DB         4     88 KB
39   http://www.staff.ncl.ac.uk/philip.home                63             DB         6    380 KB
40   http://www.umassmed.edu/diabeteshandbook              68             GN       311     14 MB
41   http://uphs.upenn.edu/health                          68             GN       694     20 MB

We used the same websites for our experiments. A total of 29 767 web pages were processed for the experiments. The overall quality score is the score given to each website by the domain experts. Website type shows the type of the website: if it is a generic website where issues other than diabetes are also discussed, it is labeled GN (generic); if the website is diabetes specific, it is labeled DB (diabetes). The number of web pages and the size denote the number of web pages and their total size for each downloaded website.


In order to determine the proportion of affective content, the average scores for each category of adjectives were computed for each post. If the positivity score is higher than the negativity score, the post is classified as positive; if the positivity score is lower than the negativity score, it is considered negative; otherwise, it is labeled as objective. Denecke (29) studied the applicability of the polarity scores for sentiment classification on documents belonging to different domains. Machine-learning based classification was utilized with 17 attributes identified in each text, such as the ‘‘average polarity score triples for adjectives, nouns and verbs’’, the ‘‘frequency of positive and negative words’’ and the ‘‘frequency of nouns, verbs and adjectives’’. The number of sentences, question marks and exclamation marks was also taken into account. The feature set was then utilized by a Simple Logistic classifier. The model constructed for each domain was then tested on different domains. In the machine-learning based approach, the accuracy was reported to vary between 66% and 82%. Denecke (30) focused on quantifying the diversity of medical social media data using measures derived from affective and factual content analysis. She utilized four measures, including the degree of factual and affective content and the diversity of semantic types and categories. In the study, it was suggested that users were more satisfied with a search result set which covered several aspects of a topic rather than one handling only a single aspect of the topic. Diversity was assessed using high-quality online sources which comprised both physician- and patient-written blogs about a broad range of topics related to health. The diversity assessment accuracy was reported to be 86.5%. In the literature, researchers have typically focused on the indicators of ‘‘quality’’ in the content and evaluated website content as unqualified if these indicators were not present.
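As a minimal sketch of this classification rule, the following assumes per-term (positivity, negativity) scores obtained from a SentiWordNet-style lookup; the `lexicon` dictionary here is a hypothetical stand-in for illustration, not SentiWordNet's actual interface:

```python
def classify_post(adjectives, lexicon):
    """Label a post as positive, negative or objective from the
    average positivity/negativity of its adjectives."""
    scored = [lexicon[w] for w in adjectives if w in lexicon]
    if not scored:
        return "objective"
    avg_pos = sum(p for p, _ in scored) / len(scored)
    avg_neg = sum(n for _, n in scored) / len(scored)
    if avg_pos > avg_neg:
        return "positive"
    if avg_pos < avg_neg:
        return "negative"
    return "objective"

# Hypothetical (positivity, negativity) scores for illustration only.
lexicon = {"miraculous": (0.625, 0.0), "painful": (0.0, 0.75)}
print(classify_post(["miraculous", "amazing"], lexicon))  # -> positive
```

Terms absent from the lexicon are simply skipped, so a post with no sentiment-bearing adjectives defaults to objective.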
They also took into consideration only the websites that were already above a quality level, thus eliminating the websites that exhibit biased information. Biased content is the most insidious case for information retrieval techniques since it may contain many terms about a topic and yet promote a product or be amateurishly written, probably as a memoir. Information on websites may be deliberately falsified to profit from advertorial content, or unprofessional bloggers may write in a biased way about their experiences relating to a health condition. Several websites including biased information exist on the web, and domain experts suggest that these should be avoided by readers searching for health information (31). Some clues to biased content are a sensational writing style (many exclamation points, or comments such as ‘‘I developed this site after my heart attack’’ rather than ‘‘This page on heart attack was developed by health professionals from the American Heart Association’’). Some of the terms frequently used on biased websites are ‘‘breakthrough’’, ‘‘secret ingredient’’ and ‘‘miracle’’. The motivation for the current study comes from the similarity between the features of affective content given by (26) and the features of biased content of which information consumers have been warned to be aware (32). In addition, we focused on information relevance techniques that take into account the content relevance, in particular those referred to in the study by Griffiths et al. (24).
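To illustrate, a crude detector for these clues might look like the following sketch; the term list comes from the warnings above, while the function name and the exclamation-mark threshold are arbitrary assumptions of ours, not part of any cited method:

```python
# Illustrative sketch only: flag text that shows the biased-content
# clues described above (sensational terms, many exclamation points).
SENSATIONAL_TERMS = {"breakthrough", "secret ingredient", "miracle"}

def looks_biased(text: str, max_exclaim_ratio: float = 0.02) -> bool:
    lowered = text.lower()
    # Clue 1: presence of typical sensational marketing terms.
    if any(term in lowered for term in SENSATIONAL_TERMS):
        return True
    # Clue 2: unusually high density of exclamation marks.
    words = lowered.split()
    if words and text.count("!") / len(words) > max_exclaim_ratio:
        return True
    return False

print(looks_biased("A miracle cure, order now!!!"))              # True
print(looks_biased("This page was developed by professionals.")) # False
```

A real system would, as the article argues, rely on learned sentiment features rather than a fixed term list.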


METHODS
In this study, an information quality framework for diabetes-related websites is proposed. In the framework, a website is assessed as having high quality if it includes the recommendations of the ADA summarized in the work of Seidman et al. (10). The greater the number of recommendations that the website covers, the higher the rank it is awarded. High-quality websites are also expected to be written in formal language using a professional writing style. In the evaluation, indicators of bias are also considered, and the rank of a website is penalized if it is detected as biased. The penalty score is 40 over 100, in line with the Discovery Challenge 2010 in which the quality score was decreased by 2 over 5 when bias was detected (22). The detection of bias is accomplished using the subjectivity detection approach proposed by Denecke (29), but this method was adapted to the context with changes in the lexical resource. For the information relevance, the method presented by Griffiths et al. (24) was adapted to the framework together with a new fusion method.

Data set
Our data set consists of the websites collected and scored by Seidman et al. (10) and the biased websites, listed in Table 3, that we collected manually; the latter were scored according to the same guidelines as given in ref. (10) by a medical domain expert who is a pathology specialist. In order to select the biased websites, the criteria published by MedlinePlus (32), such as a sensational writing style, claims about remedies that cure a variety of illnesses or promise quick, miraculous results, and the existence of commercial advertisements, were taken into consideration. We used the websites that were scored in the research carried out by Seidman et al. (10) since, in their study, the data set was constructed systematically taking into account various information quality problems and can also be considered representative of thousands of real-world websites about diabetes.
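The penalty rule described above can be sketched as follows; the function name and the clamping at zero are our assumptions for illustration, as the article only specifies the 40-point deduction on a 0–100 scale:

```python
# Sketch of the bias penalty: a website's quality score (0-100) is
# reduced by 40 points when the bias detector flags it. Clamping at
# zero is an assumption, not stated in the article.
BIAS_PENALTY = 40

def apply_penalty(quality_score: float, is_biased: bool) -> float:
    if is_biased:
        return max(0.0, quality_score - BIAS_PENALTY)
    return quality_score

print(apply_penalty(85.0, True))   # 45.0
print(apply_penalty(85.0, False))  # 85.0
```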
In their study, the websites were collected by querying a specific search term (i.e. diabetes) in the Direct Hit search engine. The collected websites were manually scored between 0 and 100 by two domain experts based on the quality criteria prepared for Type 2 diabetes. The websites addressing only Type 1 diabetes or ‘‘juvenile diabetes’’ were excluded because some of the comprehensiveness criteria in the study by (10) could not be applied to Type 2 diabetes. Also, the sites that only included news or did not offer general diabetes content were eliminated. At the end of these assessments, Seidman et al. scored 90 websites; however, in the current article, only 41 websites were utilized since some of the links were broken, some were password protected, some were not allowed to be crawled or only a single page from the site could be retrieved. The total number of web pages for each crawled website varied from 10 to 3500. Table 2 shows the list of the crawled websites. Some websites are diabetes specific (DB) and others provide general health information about several diseases, so they were labeled generic (GN). In general health portals, Seidman et al. (10) scored only the diabetes-related web pages. In the experiments, since we aim to generate an overall information quality score for each website, we processed all the web pages from each website. Since the quality and relevance of each web page might differ in a

Table 3. The biased websites collected manually for the experiments.

No.  Link of biased website                                               Web pages  Size
 1   http://www.selfhelprecordings.com/diabetes/help-with-diabietes.asp       85      2 MB
 2   http://www.your-diabetes.com/diabetes-supply.html                        88      1 MB
 3   http://shiningstarmiracles.wordpress.com/category/diabetes/             287     12 MB
 4   http://www.holisticonline.com/Remedies/Diabetes/                        606     13 MB
 5   http://www.diabetes-daily-care.com/index.html                            38    812 KB
 6   http://prevent-diabetes.net/order.php                                    20    404 KB
 7   http://www.diabetesdaily.com/forum/blogs/glodee/5936-incredibleendocrinologist  1022  52 MB
 8   http://www.diabetes-supply.com/home.asp                                  16    612 KB
 9   http://www.antioch.com.sg/well/testimon/muniandy/                         1     12 KB
10   http://www.miraclesforyou.org                                            36    964 KB
11   http://christianblogs.christianet.com                                  1713     73 MB
12   http://www.d-mom.com/                                                   358     13 MB
13   http://www.richardsearley.com                                           242     14 MB
14   http://www.hanselman.com/blog/HackingDiabetes.aspx                      242     14 MB
     Total web pages processed                                              4754


website, we scored each web page and calculated the average. Equation (6) shows how we calculated the average score.

Experimental settings
There were two objectives of the experiments:
(1) The assessment of the performance of the proposed framework in identifying the biased websites.
(2) The assessment of the performance of the proposed framework in ranking the websites according to their information quality content and relevance.

To achieve these objectives, 5-fold cross-validation was carried out since we had a limited number of websites. Table 5 gives the list of sets used in each experiment, and Table 4 gives the details of each set. The proposed methodology was executed separately for each experiment. We applied the processes described under the training phase and the evaluation phase in Figure 1 to the training and testing data sets, respectively. The training data set was used to determine the terms, and their weights, that are highly relevant to high-quality websites but not to low-quality websites. To achieve this, we used manually marked websites to construct the quality-related terms. These terms were later utilized on the testing data set to calculate a quality score. We then calculated the true positives, false positives and accuracy for each experiment. We also utilized the Pearson correlation to measure the agreement of the proposed framework’s ranking with the rankings scored manually by the domain experts. If the correlation is high and positive, it indicates that the ranking is in line with the ground truth information. In the training phases, all the websites given by Seidman et al. (10) (see Table 2) were labeled as unbiased since none of them contained biased content.

Table 4. Data sets used in the 5-fold cross-validation.

                       Set 1  Set 2  Set 3  Set 4  Set 5  Total
Biased websites            3      3      3      3      2     14
Low-quality websites       4      4      4      4      5     21
High-quality websites      4      4      4      4      4     20
Total                     11     11     11     11     11     55

Table 5. The list of sets used in the experiments.

Experiment    Training                        Testing
Experiment 1  Set 2 + Set 3 + Set 4 + Set 5   Set 1
Experiment 2  Set 1 + Set 3 + Set 4 + Set 5   Set 2
Experiment 3  Set 1 + Set 2 + Set 4 + Set 5   Set 3
Experiment 4  Set 1 + Set 2 + Set 3 + Set 5   Set 4
Experiment 5  Set 1 + Set 2 + Set 3 + Set 4   Set 5

Tools
RapidMiner is a well-known, widely used open source system for data mining, text mining and predictive analytics (33). We chose this tool for this study since
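The Pearson correlation used to compare the framework's ranking with the experts' scores can be computed as in the following self-contained sketch; the score lists are made-up illustrative values, not the paper's data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative scores only: framework output vs. expert ground truth
# for five hypothetical websites. A value near +1 means the automated
# ranking agrees closely with the manual ranking.
framework = [62, 40, 88, 55, 71]
experts = [60, 35, 90, 50, 80]
print(round(pearson(framework, experts), 2))
```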

Figure 1. Proposed methodology.


it is stable and powerful due to its flexibility in process design (34). Terrier, an open source search engine, was used for the information retrieval tasks (35). We used the Porter stemmer algorithm (36), a commonly known and widely used stemmer for English, in order to reduce all words with the same root to a common form, the stem, by removing derivational and inflectional affixes (37). The part of speech (POS) of each word in a sentence was identified by TreeTagger (38).

Methodology
The proposed methodology is based on sentiment analysis, relevance feedback and information retrieval techniques. Figure 1 shows the overall framework. First, the crawled web pages were pre-processed for the sentiment analysis and relevance feedback algorithms in both the training and testing data sets. The HTML tags were removed from the content. Second, the content was tokenized, and all upper-case letters were transformed into lower-case letters. Third, common words (stop words) such as a, an and the were removed using RapidMiner’s built-in tool (the filter stopwords option). Since biased websites significantly affect the information quality in a negative way, we initially aimed to identify such content by carrying out a sentiment analysis. Then, the quality queries were generated from the unbiased web pages in the training set, and they were applied to the remaining collection in the test set.
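The pre-processing steps above can be sketched as follows. The paper performs them with RapidMiner's built-in operators plus the Porter stemmer and TreeTagger; this is a plain-Python simplification with a tiny illustrative stop-word list and no stemming or POS tagging:

```python
import re

# Simplified pre-processing: strip HTML tags, tokenize, lowercase,
# and remove stop words. The stop-word list is illustrative only.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is"}

def preprocess(html: str) -> list:
    text = re.sub(r"<[^>]+>", " ", html)             # remove HTML tags
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # tokenize + lowercase
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The Treatment of Diabetes</p>"))
# -> ['treatment', 'diabetes']
```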


Detection of bias using sentiment analysis
Biased information in the health domain is defined as falsified content that is written in deliberately obscure or unscientific-sounding language, contains unrealistic health claims and promises quick, dramatic, miraculous results (32). The U.S. National Library of Medicine at the National Institutes of Health warns information consumers against believing websites that claim that a specific remedy, a so-called ‘‘breakthrough medication’’, will cure a variety of illnesses (32). It is also advised that non-commercial websites should be considered more trustworthy resources, and if a drug is recommended by name, information seekers should check whether the company that manufactures or sells the drug has provided that information. In the proposed methodology, biased information was identified by sentiment analysis with the help of the method proposed by Denecke (29). Sentiment analysis focuses on classifying a document as subjective or objective and, for a subjective (opinionated) document, classifying it as expressing a positive, negative or neutral opinion. The method proposed by Denecke was revised for the research problem as follows: the negativity scores of the commerce-related synsets are neutral in SentiWordNet. Since the purpose of SentiWordNet is to aid in the identification of subjectivity but not bias, both the positivity and negativity of commerce-related synsets such as money, payment, shopping or credit card are zero. In order to favor non-commercial content, the negativity scores of these synsets were assigned a value of 1. These terms were selected using the relevance feedback algorithm on the biased websites. A total of 11 terms, as shown in Table 6, were selected using this method. These terms were expanded to synsets with SentiWordNet. In ref. (26), individual blog posts were classified only by considering the presence of adjectives in the content.
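A sketch of this lexicon adjustment follows; the dictionary layout and function name are our assumptions for illustration, since SentiWordNet itself stores scores per synset rather than per surface term:

```python
# Commerce-related terms are neutral in SentiWordNet, so their
# negativity is raised to 1.0 to penalize commercial content.
COMMERCE_TERMS = ["bill", "buy", "sponsor", "charge", "shop", "pay",
                  "money", "dollar", "credit card", "price", "cost"]

def adjust_lexicon(lexicon: dict) -> dict:
    """Return a copy of a term -> (positivity, negativity) lexicon in
    which every commerce-related term gets negativity 1.0."""
    adjusted = dict(lexicon)
    for term in COMMERCE_TERMS:
        pos, _neg = adjusted.get(term, (0.0, 0.0))
        adjusted[term] = (pos, 1.0)
    return adjusted

lex = {"money": (0.0, 0.0), "good": (0.75, 0.0)}
print(adjust_lexicon(lex)["money"])  # (0.0, 1.0)
```

Non-commerce entries pass through unchanged, so the rest of the sentiment scoring is unaffected.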
In our experiments, we aimed to classify different types of websites about diabetes. However, the adjective-only approach of (26) did not produce satisfactory results in identifying biased content, as all the websites were labeled as objective: the ratio of the subjectivity score to the number of sentiment-bearing terms in large-volume websites was very small. As a consequence, the machine-learning based classification method proposed in (29) was applied.

Table 6. Commerce related terms: bill, buy, sponsor, charge, shop, pay, money, dollar, credit card, price, cost.

In each experiment, 17 attributes were identified to be used by a machine learning classifier. These attributes are average polarity score triples for adjectives, nouns and verbs (nine attributes),


frequency of positive and negative words (two attributes), frequency of nouns, verbs and adjectives (three attributes), and the number of sentences, question marks and exclamation marks (three attributes). We then computed the values of these features for each web page and used multiple linear regression to determine their weights for differentiating biased websites from unbiased sites (shown on the left of Figure 1 for the training phase).

Learning quality measuring queries with relevance feedback

In the relevance feedback approach introduced in ref. (24), a complex query consisting of weighted words and phrases is automatically generated by comparing term frequency distributions in relevant and irrelevant documents. The assumption here is that terms in the relevance query occur frequently in the relevant text but rarely otherwise. The resulting query is used by the text retrieval system to compute relevance scores for documents. In this study, this method was used to learn a "quality" query from sets of high- and low-quality webpages, so a quality query comprises terms that appear frequently in high-quality websites but rarely in low-quality websites. A website was categorized as high quality if it was given a score > 60 out of 100; otherwise, it was accepted as low quality. The generated quality queries are then run on the websites to obtain quality scores for each webpage using the Terrier search engine (see the right side of Figure 1 for the training phase). To obtain a quality measuring query, the candidate terms were first determined. These terms were obtained using:

- the list of words and phrases extracted from the list of quality criteria for diabetes websites given in (10);
- the top search terms for diabetes, as shown in Table 7.
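Each candidate term gathered this way is then weighted with the Robertson-Sparck Jones formula of Equation (1). A minimal sketch, assuming the natural logarithm (the published weights may use a different log base, so the numbers below are purely illustrative):

```python
import math

def rsj_weight(r: int, n: int, R: int, N: int) -> float:
    """Robertson-Sparck Jones weight (Equation (1)).

    r: relevant (high-quality) documents containing the term
    n: all documents containing the term
    R: relevant documents in the collection
    N: all documents in the collection
    The 0.5 terms smooth zero counts.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term in 8 of 10 high-quality documents but only 1 of 10 low-quality
# documents gets a strongly positive weight:
print(round(rsj_weight(r=8, n=9, R=10, N=20), 2))  # 3.07
```

A term that is equally common in both classes gets a weight near zero, which is exactly the discriminative behavior the quality query relies on.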

The frequency of each word and phrase was then computed for all the web pages in each website. Using these candidate terms and phrases, the weights were computed using the Robertson-Sparck Jones (RSJ) formula given in Equation (1). Here, the purpose of weighting is to assign high values to discriminating terms in high-quality web pages:

F = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}    (1)

The definitions of the RSJ symbols are shown in Table 8. In this study, the relevant documents are the high-quality documents, whereas the non-relevant documents are the low-quality websites. The quality query terms were selected by computing the Term Selection Value (TSV) (39) for each candidate term. The terms were ranked in descending order, and the terms that ranked above a certain threshold were selected. Table 9 lists the terms and their weights computed by the RSJ formula given in Equation (1) in the five experiments. For example, Query 1 and Query 2 were formed using the training data set in Experiment 1 and Experiment 2,

Table 7. Query term candidates gathered from top search terms: diabetes symptoms, ketones, gestational diabetes, type 1 diabetes, hyperglycemia, hypoglycemia, alcohol and diabetes, diabetic neuropathy, pre-diabetes, diabetic ketoacidosis.

Table 8. Definitions for the RSJ formula.

n              Number of documents that contain the term
r              Number of relevant documents that contain the term
n - r          Number of non-relevant documents that contain the term
R              Number of relevant documents
N              Number of documents in the collection
R - r          Number of relevant documents that do not contain the term
N - n          Number of documents that do not contain the term
N - R          Number of non-relevant documents
N - n - R + r  Number of non-relevant documents that do not contain the term

respectively (see Table 5). In all the experiments, the terms that had a weight higher than 1.71 were selected and used in the remainder of the experiments. The threshold was selected by observing the cut-off point in the weight distribution.

Quality scoring

In this step, the quality queries selected in the previous step were run using the Terrier search engine on the test data sets. The engine returned a score for each webpage, and the score of a website was computed by calculating the average score of its web pages, in accordance with the method proposed in (24). Okapi BM25, which was proposed by Robertson et al. (40), was utilized for the retrieval task. The formula takes into account the tf (term frequency) and idf (inverse document frequency) values and the relative length of a document. tf is the number of occurrences of a term in a document. Here, the aim is to compute a score between a query term t and a document d, based on the weight of t in d. The simplest approach is to assign a weight equal to the number of occurrences of term t in document d. This weighting scheme is referred to as term frequency and is denoted tf_{t,d}, where the subscript t denotes the term and d the document. However, certain terms have little or no discriminating power in determining relevance, yet they can still have a high tf_{t,d}. For instance, a collection of web pages on diabetes is likely to have the term "diabetes" on almost every page; consequently, it should not have a high weight in the corpus. To tackle this problem, the inverse document frequency idf_{t,D} given in Equation (2) is used:

idf_{t,D} = \log \frac{N}{df_{t,D}}    (2)


Table 9. Terms and their weights.

Terms                     Query 1   Query 2   Query 3   Query 4   Query 5
blood_glucose_test        2.85      1.71      1.71      2.28      2.28
diabetes_prevention       2.85      2.28      2.28      2.69      2.69
treatment_type            2.85      2.28      1.71      1.71      2.85
treatment_type_diabetes   2.85      2.28      1.71      1.71      2.85
healthy_eating            1.88      1.89      1.89      3.56      2.43
Obesity                   1.88      3.02      –         1.89      2.17
Acromegaly                1.71      1.71      1.71      –         1.71
diabetes_obesity          1.71      1.71      –         1.71      1.71
diabetic_complications    1.71      1.71      –         1.71      1.71
eating_disorders          1.71      1.71      1.71      –         1.71
family_history            1.71      1.71      –         1.71      1.71
insulin_administration    1.71      1.71      1.71      1.71      –
meal_planning             1.71      2.28      2.69      2.69      2.69
diabetes_pregnancy        –         1.71      –         –         –
eating_healthy            –         1.71      1.71      1.71      1.71
insulin_dependent         –         1.71      –         –         –
meal_planning guide       –         1.71      1.71      1.71      1.71
Pre-diabetes              –         1.71      –         –         –
Medications               –         –         3.02      –         –
foot_care                 –         –         2.28      –         1.89
endocrine_disorders       –         –         1.71      –         –

Terms which were not extracted or had a weight significantly lower than 1.71 are marked as "–".

where df_{t,D} denotes the number of documents in the collection that contain the term t. Thus, the idf of a rare term is expected to be high, whereas the idf of a frequent term is likely to be low. An optimal weight is given to a term by combining the term frequency and inverse document frequency into a composite weight for each term in each document. The tf-idf weighting scheme assigns a weight to a term as follows:

tfidf_{t,D} = tf_{t,D} \times idf_{t,D}    (3)

In the term frequency calculation, tf values tend to be larger in longer documents. Since a term is more likely to appear in a longer document, this causes a potential bias towards longer documents, as they will be assigned higher scores. To compensate for that effect, Robertson derived the document length normalization (40). Okapi BM25 computes the score for each word or word phrase in the query as follows:

w^{BM25}_{t,D} = Qw_t \times \frac{tf_{t,D} \times \log\frac{N - n_t + 0.5}{n_t + 0.5}}{2 \times \left(0.25 + 0.75 \times \frac{l_d}{avdl}\right) + tf_{t,D}}    (4)

where Qw_t is the weight of the term t in the query, N is the total number of documents in the corpus, n_t is the number of documents containing t, l_d is the document length, avdl is the average document length in the collection, and \log\frac{N - n_t + 0.5}{n_t + 0.5} is the idf value of the term t.


The final score for a webpage is the sum of all the term weights:

S_{BM25}(D, Q) = \sum_{t \in Q} w^{BM25}_{t,D}    (5)

Finally, the tool returns a score for each webpage. The average of the scores of the individual web pages is computed to determine the final score of the website. In the training phase, the overall site scores are calculated as follows:

S_q = \lambda \times \bar{Q}_{score} + (1 - \lambda) \times Norm(\ln|Q|)    (6)

Here, \bar{Q}_{score} is the mean quality score obtained by Terrier, and Norm(\ln|Q|) is the normalized value of \ln|Q|, obtained by dividing \ln|Q| by \ln|Q_{max}|, where Q_{max} is the maximum number of retrieved documents per site. In the training phase, we have the quality scores of each website (S_q) given by domain experts. As we have calculated S_{BM25}(D, Q) with Terrier and Norm(\ln|Q|), the next step is to find an optimal \lambda which maximizes the correlation between S_q and the right-hand side of the equation. This parameter adjusts the balance between the average document score and the coverage of a site. To achieve this, all \lambda values between 0 and 1 were tried with an increment of 0.01. The value which maximized the correlation between the computed site scores and the manually assigned scores S_q was chosen in each training set. In the five experiments, the maximum correlations were achieved when \lambda was nearly 0.8.
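The grid search just described (sweep the combination parameter from 0 to 1 in 0.01 steps, keep the value with the highest Pearson correlation against the expert scores) can be sketched as follows. All data in the usage test are hypothetical.

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def best_lambda(q_scores, norm_sizes, expert_scores):
    """Sweep the combination parameter in [0, 1] with step 0.01 and return
    the value maximising correlation with the expert-assigned scores."""
    best, best_r = 0.0, -1.0
    for step in range(101):
        lam = step / 100
        combined = [lam * q + (1 - lam) * s
                    for q, s in zip(q_scores, norm_sizes)]
        r = pearson(combined, expert_scores)
        if r > best_r:
            best, best_r = lam, r
    return best, best_r
```

With toy inputs where the Terrier scores already track the expert scores perfectly, the sweep settles on the pure-query-score end of the range, as expected.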

RESULTS

In this section, the performance of the proposed framework in terms of accuracy in biased website detection and quality scoring is presented. For bias content detection, the algorithm produced the results shown in Table 10. In the evaluation, five sets of experiments were performed, each containing 11 websites. Of 14 biased websites, 11 were detected correctly as biased, whereas 3 unbiased websites were misclassified as biased. Consequently, a total of 6 out of 55 websites were misclassified, giving an accuracy of 89%. In the detection of bias content, the best results were obtained in Experiment 3, in which all the biased contents were detected and no unbiased contents were misclassified. The worst results were obtained in Experiments 1 and 2, in which the accuracy was 0.82. The biased websites which were not detected by the proposed system, and to which as a consequence no penalty score was applied, obtained high scores in the quality scoring, so the correlation between the computed and actual scores decreased dramatically.

For evaluation purposes, the Pearson correlation between the quality scores given by the domain experts and the scores generated by the proposed framework was calculated and compared with the other techniques. In Table 11, the first row shows the result of the proposed framework. The second row gives the result of the study by Griffiths et al. (24) and clearly


illustrates the limitation of their method. The last row shows the limitation of using only the keywords and weights of Sentiwordnet in biased website detection: when we did not modify the weights of the proposed terms as presented in Table 9, a lower r value was obtained. The scatter plot in Figure 2 illustrates the degree of correlation between these variables and suggests that the correlation is positive. In the five test sets, the best results were obtained in Experiment 4, in which the correlation was 0.82. In each test set, the correlations were significantly higher in the proposed framework.

In order to evaluate the effect of the proposed changes made to Sentiwordnet, an attempt was made to identify the biased content using the original Sentiwordnet. As a result, we obtained lower accuracy values compared to the proposed method, as shown in Table 12. The reason is that, since commercial content provides one of the most important pieces of evidence for the identification of biased content in the health domain, some of the term weights in Sentiwordnet need to be penalized in order to disfavor commercial content. The correlation between the actual scores and the computed scores based on the original Sentiwordnet is also lower than that of the proposed framework, as given in Table 11.

The low correlations between the actual and computed scores in certain test sets necessitated further analysis. The main characteristic of the websites in Experiment 1 was that most of them were not diabetes-specific but provided general health content about several issues. For example, bbc.co.uk, webmd.com,

Table 10. Results in bias content detection.

Experiment     Testing set   Biased websites   Detected as biased   True positives   False positives   Accuracy
Experiment 1   Set 1         3                 3                    2                1                 0.82
Experiment 2   Set 2         3                 3                    2                1                 0.82
Experiment 3   Set 3         3                 3                    3                0                 1.00
Experiment 4   Set 4         3                 4                    3                1                 0.91
Experiment 5   Set 5         2                 1                    1                0                 0.91
Total                        14                14                   11               3                 0.89

Table 11. The results of the experiments.

                                     Experiment 1   Experiment 2   Experiment 3   Experiment 4   Experiment 5
r (p < 0.001) (proposed framework)   0.59           0.62           0.73           0.82           0.62
Griffiths et al. technique (24)      0.38           0.26           0.32           0.36           0.35
r with original Sentiwordnet         0.49           0.36           0.33           0.81           0.41


Figure 2. The correlation between the actual scores and the computed scores for the proposed framework within five sets.

Table 12. Results in bias content detection with original Sentiwordnet.

Experiment     Testing set   Biased websites   Detected as biased   True positives   False positives   Accuracy
Experiment 1   Set 1         3                 3                    2                1                 0.82
Experiment 2   Set 2         3                 2                    1                1                 0.73
Experiment 3   Set 3         3                 4                    1                3                 0.55
Experiment 4   Set 4         3                 3                    3                0                 1.00
Experiment 5   Set 5         2                 1                    0                1                 0.70
Total                        14                13                   7                6                 0.76

dhfs.state.wi.us and healthlink.mcw.edu are in this set. Consequently, the scores of many webpages unrelated to diabetes were included in the computation, which averages the scores of all of a website's pages even when a page contained only a link to a diabetes-related web page (Equation (6)). In order to examine the effect of website characteristics on Equation (6), a new experiment was designed. The websites were placed in two groups, general-purpose websites and diabetes-specific websites, and 16 websites were randomly selected from each group (see Table 13). The five biased websites were added to both test sets, quality query 4 (in Table 9) was run against the collections, and the Pearson correlation was computed between the computed and actual scores. A lower correlation of 0.25 was observed in the test set containing general websites. In large-volume websites, Terrier retrieved >1000 pages, most of which had very low scores close to zero. These low-scored webpages were those in which diabetes was not the concern, but which contained links to diabetes-related webpages matching diabetes-specific terms in the quality query. The low scores lowered the average, and all such websites were scored between 30 and 40 out of 100 by the system. In diabetes-specific websites, low scores close to zero were still retrieved; however, they were fewer in number. These pages were generally "About us" or


Table 13. General websites and diabetes-specific websites.

                        General websites   Diabetes-specific websites
Biased websites         5                  5
Low-quality websites    8                  8
High-quality websites   8                  8
Total                   21                 21

Table 14. Results from general websites and diabetes-specific websites.

                General websites   Diabetes-specific websites
r (p < 0.001)   0.25               0.76

"Contact us" pages. The correlation was computed as 0.76 (p < 0.001). Both results are given in Table 14.
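The dilution effect described above can be illustrated with a toy computation: averaging page scores over a whole site drags down general-purpose portals, whose many non-diabetes pages score near zero. All numbers below are hypothetical.

```python
# Site score as the mean of its page scores, as in the averaging step above.
def site_score(page_scores: list[float]) -> float:
    return sum(page_scores) / len(page_scores)

diabetes_site = [72.0, 65.0, 80.0, 5.0]        # one low-scoring "Contact us" page
general_portal = [72.0, 65.0] + [2.0] * 38     # many unrelated, near-zero pages

print(round(site_score(diabetes_site), 1))     # 55.5
print(round(site_score(general_portal), 1))    # 5.3
```

Even though both sites contain the same high-quality diabetes pages, the portal's average collapses, which is exactly the behavior that lowered the correlation for general websites.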

CONCLUSION AND DISCUSSION

Assessing the information quality of health websites automatically is challenging, since many issues such as accuracy, bias, information relevance and timeliness must be taken into consideration simultaneously. This article proposes a framework which aims to provide a better identification and ranking of diabetes websites according to EBM. As a consequence, it aims to help both health professionals and individual information consumers find high-quality websites automatically through EBM. The framework can be provided under a software-as-a-service model and can be utilized as a stand-alone service for diabetes-related web page searches. It also has the potential to be integrated into existing general web search engines. Previous approaches in the literature are either manual or limited in addressing wide-ranging information quality problems. The results showed that the proposed framework had a significantly higher r value compared to the other techniques (the average over all the experiments was 0.68, compared to an average r of 0.33 for the other technique). It also identified the biased websites with high accuracy. The high correlation between the manual EBM-based ranking carried out by the domain experts and our proposed framework suggests that the method is able to generate a successful ranking according to EBM.

Although the results of the study are promising, there are still some limitations. First, health portals that include information about various health issues are automatically penalized by this method when the quality scores of the websites are computed. Accordingly, simply averaging the scores of a website's individual webpages should be avoided, since it favors websites that offer only diabetes-related content. Second, assigning a dynamic penalty score to a website in relation to the ratio of its biased content over the whole content may also produce better results.
Although the static penalty score affected the results in a positive way, a linear static combination is not an ideal solution, since the same penalty score is applied both to websites containing a small number of bias-related terms and to those containing a high number of them; we found that the algorithm mainly failed to differentiate between such websites. Future work can investigate the challenging task, in the health domain, of automatically finding the candidate quality terms and phrases at the beginning of the framework. It is possible to obtain these terms and phrases from the high-quality documents using information extraction techniques.

DECLARATION OF INTEREST The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

REFERENCES

1. Lambert V. Finding health information on the internet. [Online]. Available from: http://www.telegraph.co.uk/health/wellbeing/8066878/Finding-health-information-on-the-internet.html [last accessed 15 Oct 2010].
2. Moyer CS. Cyberchondria: the one diagnosis patients miss. American Medical News, 30 Jan 2012. [Online]. Available from: http://www.ama-assn.org/amednews/2012/01/30/hll10130.htm [last accessed 30 Oct 2012].
3. Kunst H, Groot D, Latthe PM, et al. Accuracy of information on apparently credible websites: survey of five common health topics. BMJ 2002;324:581–2.
4. Fricke M, Fallis D. Indicators of accuracy for answers to ready reference questions on the Internet. J Am Soc Inform Sci Technol 2003;55:238–45.
5. Fricke M, Fallis D, Jones M, Luszko GM. Consumer health information on the Internet about carpal tunnel syndrome: indicators of accuracy. Am J Med 2005;118:168–74.
6. McInerney CR, Bird NJ. Assessing Website quality in context: retrieving information about genetically modified food on the Web. Inform Res 2005;10:paper 213.
7. Thorpe B, Kiebzak G, Chavez J, et al. Assessment of osteoporosis-website quality. Osteopor Int 2006;617:741–52.
8. Schwartz KL, Roe T, Northrup J, et al. Family medicine patients' use of the internet for health information: a MetroNet study. J Am Board Fam Med 2006;19:39–45.
9. Sillence E, Briggs P, Harris PR, Fishwick L. How do patients evaluate and make use of online health information? Soc Sci Med 2007;64:1853–62.
10. Seidman J, Steinwachs D, Rubin H. Design and testing of a tool for evaluating the quality of diabetes consumer-information Web sites. J Med Internet Res 2003;5:e30. doi:10.2196/jmir.5.4.e30.
11. Sackett DL, Rosenberg VM, Gray J, et al. Evidence based medicine: what it is and what it isn't. Br Med J 1996;312:71–2.
12. Health on the Net Foundation. [Online]. Available from: http://www.hon.ch/index.html [last accessed 23 Aug 2010].
13. URAC. [Online]. Available from: https://www.urac.org/ [last accessed 13 Nov 2012].
14. Resnick P, Miller J. PICS: internet access controls without censorship. Commun ACM 1996;39:87–93.
15. Wang Y, Liu Z. Automatic detecting indicators for quality of health information on the Web. Int J Med Inform 2007;76:575–82.
16. University of Oxford, Division of Public Health and Primary Health Care. DISCERN: quality criteria for consumer health information, 2010. [Online]. Available from: http://www.discern.org.uk [last accessed 20 Nov 2010].
17. Centrale Sante. Net Scoring. [Online]. Available from: http://www.chu-rouen.fr/netscoring/netscoringeng.html [last accessed 4 Jun 2011].
18. Griffiths KM, Christensen H. Website quality indicators for consumers. J Med Internet Res 2005;7:e55. doi:10.2196/jmir.7.5.e55.
19. Wilson P. How to find the good and avoid the bad or ugly: a short guide to tools for rating quality of health information on the internet. BMJ 2002;324:598.
20. Khazaal Y, Fernandez S, Cochand S, et al. Quality of web-based information on social phobia: a cross-sectional study. Depress Anxiety 2008;25:461–5.
21. Barnes C, Harvey R, Wilde A, et al. Review of the quality of information on bipolar disorder on the Internet. Austr N Z J Psychiatry 2009;43:934–45.
22. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Discovery Challenge 2010. [Online]. Available from: http://www.ecmlpkdd2010.org/indexd7fa.html?md=articles&id=2041&lg=eng [last accessed 28 Oct 2010].
23. Geng G, Jin X, Zhang X, Zhang D. Evaluating web content quality via multi-scale features. Proceedings of the ECML/PKDD, Barcelona, Spain; 2010.
24. Griffiths K, Tang T, Hawking D, Christensen H. Automated assessment of the quality of depression websites. J Med Internet Res 2005;7:e59. doi:10.2196/jmir.7.5.e59.
25. American Diabetes Association. Home page. [Online]. Available from: http://www.diabetes.org/ [last accessed 5 Jun 2011].
26. Denecke K. Accessing medical experiences and information. European Conference on Artificial Intelligence, Workshop on Mining Social Data; 2008.
27. Esuli A, Sebastiani F. Sentiwordnet: a publicly available lexical resource for opinion mining. Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06); 2006.
28. Princeton University. Wordnet. [Online]. Available from: http://wordnet.princeton.edu/ [last accessed 14 Feb 2013].
29. Denecke K. Are SentiWordNet scores suited for multi-domain sentiment classification? Fourth International Conference on Digital Information Management (ICDIM 2009); 2009.
30. Denecke K. An architecture for diversity-aware search for medical web content. Methods Inform Med 2012;51:549–56.
31. National Institutes of Health. MedlinePlus. [Online]. Available from: http://www.nlm.nih.gov/medlineplus/ [last accessed 6 Jun 2011].
32. National Institutes of Health. MedlinePlus Guide to Healthy Web Surfing. [Online]. Available from: http://www.nlm.nih.gov/medlineplus/healthywebsurfing.html [last accessed 6 Jun 2011].
33. RapidMiner. [Online]. Available from: http://rapid-i.com/content/view/181/190/lang,en/ [last accessed 3 Dec 2011].
34. Data Mining Tools Used Poll. KDnuggets, 2013. [Online]. Available from: http://www.kdnuggets.com/polls/2009/data-mining-tools-used.htm [last accessed 15 Oct 2013].
35. University of Glasgow School of Computing Science. Terrier IR Platform v3.5. [Online]. Available from: http://terrier.org/ [last accessed 14 Feb 2013].
36. Porter M. The Porter Stemming Algorithm, Jan 2006. [Online]. Available from: http://tartarus.org/martin/PorterStemmer/ [last accessed 14 Feb 2013].
37. Willett P. The Porter stemming algorithm: then and now. Program Electron Libr Inform Syst 2006;40:219–23.
38. Schmid H. TreeTagger. [Online]. Available from: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ [last accessed 2 Feb 2012].
39. Robertson S. Documentation note on term selection for query expansion. J Document 1990;46:359–64.
40. Robertson S, Walker S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag New York; 1994. pp. 232–41.

