Full Text Clustering and Relationship Network Analysis of Biomedical Publications

Renchu Guan 1,2, Chen Yang 3, Maurizio Marchese 4, Yanchun Liang 1, Xiaohu Shi 1*

1 College of Computer Science and Technology, Jilin University, Changchun, China; 2 State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun, China; 3 College of Earth Sciences, Jilin University, Changchun, China; 4 Department of Information Engineering and Computer Science, University of Trento, Trento, Italy

Abstract: Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, the Cosine Coefficient is computed on a sub-space spanned by only two vectors, instead of computing the Euclidean distance within the space of all vectors. A strategy and algorithm for Semi-supervised Affinity Propagation (SSAP) are then introduced to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that, by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. By constructing a directed relationship network and distribution matrix for the clustering results, overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.

Citation: Guan R, Yang C, Marchese M, Liang Y, Shi X (2014) Full Text Clustering and Relationship Network Analysis of Biomedical Publications. PLoS ONE 9(9): e108847. doi:10.1371/journal.pone.0108847

Editor: Fengfeng Zhou, Shenzhen Institutes of Advanced Technology, China

Received June 14, 2014; Accepted August 26, 2014; Published September 24, 2014

Copyright: © 2014 Guan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files.

Funding: The research program is funded by the National Natural Science Foundation of China (No. 61103092, 41101376, 61272207, 61373050) (http://www.nsfc.gov.cn/) and the China Postdoctoral Science Foundation (2011M500613 and 2012T50298) (http://jj.chinapostdoctor.org.cn). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* Email: [email protected]

Introduction

With the proliferation of biomedical research and publications, individual scientists can no longer keep track of relevant articles through reading alone. Instead, they must rely on text mining tools to explore the implicit knowledge and hypotheses presented in biomedical texts and to extract the data and concepts relevant to their work [1,2]. In this regard, a number of new challenges have emerged, especially in relation to full text analysis [3], complex relation extraction [4,5], and information fusion [6]. To date, research approaches to biomedical text mining have been based exclusively on article abstracts (Iliopoulos et al., 2001 [7]; Yu and Lee, 2006 [8]; Zhu et al., 2009 [9]), culminating most recently in 2011, when Boyack et al. managed to cluster about two million MEDLINE abstracts [10]. While such clustering can provide hints about primary results and conclusions, the full texts of articles are where the more detailed methodologies, experimental results, critical discussions and interpretations are found. Rodriguez-Esteban pointed out that full article texts are the only source for certain crucial information, such as experimental measurements [11]. Recent research initiatives in bioinformatics, such as TREC Genomics (http://ir.ohsu.edu/genomics/) and BioCreAtIvE (http://www.biocreative.org/), also recommend migrating text analysis from abstracts to full texts [12]. Intuitively, consulting only abstracts is inadequate for judicious analysis of biomedical articles, and can even lead to inappropriate clinical decisions [13–15]. In the Genomics2005 data collected in [9], the mean number of documents in the 100 datasets was 690.7 and the average number of unique words was 2214.4. By contrast, in our experiments using a BioMed dataset containing 600 texts, the unique word count was 54367, approximately 24 times greater than the count for Genomics2005. With the addition of these terms, it is expected that biomedical text mining will provide more relevant information and more complete analysis. Unfortunately, access to the full text and citations of biomedical papers remains limited, and the complexity of full text mining, which must contend with large amounts of information noise, is much greater than that of abstract text mining. While the number of non-redundant words (terms) in a biomedical abstract is typically less than 100, that of a full text is often much greater than 1000. Since most mining frameworks employ the vector space model (VSM) [16], which treats a document as a bag of words and uses plain language terms as features, the dimensionality of a full text corpus can be several times greater than that of abstract data.

Clustering is one of the most popular techniques for data analysis in many disciplines [17]. The basic strength of text clustering is its capacity to automatically organize texts into meaningful groups. As such, it has been applied in a number of ways, including cluster-based retrieval [18], key sentence extraction [19], concept discovery in molecular biology [7], and so on.


Traditional clustering also forms an important component of unsupervised learning algorithms, which, along with various supervised techniques, play an important role in biomedical research, including studies of cancer [20], Bayesian analysis of HIV drug resistance [21], linear regression frameworks for motif finding [22], etc. In supervised text mining algorithms, a training process is first performed on a large set of texts with known, predefined topic labels. This generally results in better topic detection [23] than unsupervised approaches, for which there is no prior training step. Supervised and unsupervised learning algorithms can, of course, work together to improve learning processes or handle more complex problems, such as phosphorylation site prediction [24]. However, labeling a training set is in general very costly and frequently unavailable in practical scenarios. In recent years, semi-supervised clustering approaches have caught the attention of the machine learning community. These approaches make use of a smaller and more easily obtained set of labeled samples to guide the clustering strategy. They have been applied in many application domains, including text clustering [25], gene expression analysis [26], and image processing [27]. However, to the best of our knowledge, semi-supervised learning has not yet been applied to full text mining of real biomedical publications. In this paper, the Semi-supervised Affinity Propagation (SSAP) method of text clustering is proposed. The method is then applied to a real biomedical full-text corpus, the BioMed Central open access full-text corpus, and its clustering performance is compared to that of two classical clustering algorithms. Finally, a directed relationship network and a cluster distribution matrix are constructed from the SSAP clustering results and used to reveal publication interests among the top 10 journals. The source code and datasets used in this paper are available in the Supporting Information.

Materials and Methods

Datasets

The dataset used in the experiments was the BioMed Central open-access corpus (BioMed), which can be downloaded at http://www.biomedcentral.com/about/datamining. From the BioMed corpus, 110,369 articles published between Jan. 2012 and Jun. 2012 were downloaded. Each BioMed file is a well-formatted XML document and contains various tags, such as <title>, <bdy>, <issn>, and so on. Because many articles contain only XML tags and abstracts, files smaller than 4 KB were removed in the pre-processing phase. The remaining files were then divided into different 'topics' according to their journal names, and the title, abstract and plain text were extracted from each. Finally, the top 10 topics, with paper counts ranging from 1966 to 5022, were selected as the test corpus. Detailed information on these 10 topics is listed in Table 1. In our experiments, to evaluate the performance of the different approaches across different dataset sizes, we randomly selected sub-datasets at 5 scales (400, 500, 600, 700 and 800 documents) for comparison, and for each scale the random selection was repeated 5 times and the results averaged to avoid accidental outcomes. Before the clustering process began, all journal-name-related information was eliminated from the target texts. After clustering, this information was used to evaluate the clustering performance of each algorithm. Stop words were removed according to the list at http://norm.al/2009/04/14/list-of-english-stop-words/. Each document is then represented by a set of two-tuples, and the whole dataset has the following form:

$$D = \{d_1, d_2, \ldots, d_N\} = \left\{ \{\langle f_{11}, n_{11}\rangle, \langle f_{12}, n_{12}\rangle, \ldots, \langle f_{1M_1}, n_{1M_1}\rangle\}, \ldots, \{\langle f_{N1}, n_{N1}\rangle, \langle f_{N2}, n_{N2}\rangle, \ldots, \langle f_{NM_N}, n_{NM_N}\rangle\} \right\}$$

where N is the number of documents in the dataset, M_j is the number of term features (unique words or phrases) in the jth document, d_i = {<f_i1, n_i1>, <f_i2, n_i2>, ..., <f_iMi, n_iMi>} denotes the ith document, and f_ik and n_ik represent the kth term in the ith document and its normalized frequency, respectively. The normalized frequency is computed by

$$n_{ik} = \frac{Count_{ik}}{M_i}$$

where Count_ik is the count of the kth term in the ith document.

The Semi-Supervised Affinity Propagation Method

Based on the state-of-the-art unsupervised Affinity Propagation clustering algorithm [19], the SSAP method is proposed as a means of addressing the complexity problems posed by full text clustering of biomedical publications. In conventional vector space model (VSM) based methods, each document is represented in the feature space constructed from all the unique words or phrases of all the documents, so the similarities between document pairs can be computed with any similarity measure in that common space. However, this space is high-dimensional and the document representations are severely sparse. To avoid large-dimensional, sparse matrix computations, in the proposed method each document is represented in a tiny sub-space of the VSM when its similarity to another document is computed. This sub-space is spanned by only the features of the two documents involved, which is much smaller than the original vector space. If the dataset includes thousands of words or phrases, the sub-space restriction reduces computational complexity significantly. The classical similarity measure in text clustering, the Cosine Coefficient, is used. To achieve even better performance, a semi-supervision strategy that makes use of known information is introduced. The detailed steps of the SSAP method are as follows:

a. Initialization of dataset: Initialize the dataset D = {d_1, d_2, ..., d_N} as a set of N (N > 0) elements, where each element consists of a sequence of two-tuples;

b. Seed Construction: Add a small number of initially labeled objects, represented as two-tuple sets, to the dataset D to obtain a new dataset D′ containing N′ elements (N′ ≥ N);

c. Similarity Computation: Compute the similarities among objects in D′ using the Cosine Coefficient similarity metric:

$$S(i,j) = \frac{|d_i \cap d_j|}{|d_i|^{1/2}\, |d_j|^{1/2}}$$

where S(i,j) is the similarity between the ith and jth document elements in D′, and |d| denotes the number of unique terms in document d.

d. Self-Similarity Computation: Compute the self-similarity s(l,l) of each object in D′ using:

$$s(l,l) = (1+g)\min_{\substack{1 \le i,j \le N' \\ i \ne j}} \{s(i,j)\} - g\max_{\substack{1 \le i,j \le N' \\ i \ne j}} \{s(i,j)\}, \qquad g = \begin{cases} Q, & 1 \le l \le N \\ -1, & N < l \le N' \end{cases}$$

where i ∈ D′, j ∈ D′, and i ≠ j.
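To make steps c and d concrete, here is a minimal Python sketch of the sub-space Cosine Coefficient and the seed-aware self-similarities. It is illustrative only and not the authors' implementation; the set-based document representation, the function names, and the placement of labeled seeds after the first N objects are assumptions.

```python
import math

def cosine_coefficient(doc_i, doc_j):
    """Step c: S(i, j) = |d_i intersect d_j| / (|d_i|^(1/2) * |d_j|^(1/2)).

    Each document is passed as the set of its unique terms, so only the terms
    of the two documents involved form the sub-space; the full vocabulary is
    never materialized.
    """
    shared = len(doc_i & doc_j)
    return shared / (math.sqrt(len(doc_i)) * math.sqrt(len(doc_j)))

def self_similarities(similarity, n_unlabeled, q):
    """Step d: s(l, l) = (1 + g) * min s(i, j) - g * max s(i, j).

    Objects with index >= n_unlabeled are taken to be the labeled seeds and
    get g = -1, which makes s(l, l) equal to the maximum pairwise similarity,
    so seeds are strongly favored as exemplars; all other objects get g = Q.
    """
    n = len(similarity)
    off_diag = [similarity[i][j] for i in range(n) for j in range(n) if i != j]
    s_min, s_max = min(off_diag), max(off_diag)
    prefs = []
    for l in range(n):
        g = q if l < n_unlabeled else -1.0
        prefs.append((1 + g) * s_min - g * s_max)
    return prefs
```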


Table 1. Top 10 topics in the BioMed corpus.

Serial Number | Name                                          | Abb Name            | Document Number
1             | BMC Bioinformatics                            | BMC Bioinformatics  | 5022
2             | BMC Genomics                                  | BMC Genomics        | 4121
3             | BMC Public Health                             | BMC Public Health   | 3758
4             | BMC Cancer                                    | BMC Cancer          | 3025
5             | Journal of Cardiovascular Magnetic Resonance  | J Cardiovas Magn R  | 2538
6             | Retrovirology                                 | Retrovirology       | 2478
7             | BMC Neuroscience                              | BMC Neuroscience    | 2454
8             | BMC Evolutionary Biology                      | BMC Evo Biol        | 1973
9             | Malaria Journal                               | Malaria J           | 1968
10            | Journal of Medical Case Reports               | J Med Case Rep      | 1966

doi:10.1371/journal.pone.0108847.t001

e. Initialization of availability matrix: a(i,j) = 0 (i, j = 1, 2, ..., N′).

f. Message Matrix Computation: Compute the message matrices using:

$$r(i,j) = s(i,j) - \max_{j' \ne j}\{a(i,j') + s(i,j')\}$$

$$a(i,j) = \begin{cases} \min\Big\{0,\; r(j,j) + \sum_{i' \notin \{i,j\}} \max\{0, r(i',j)\}\Big\}, & i \ne j \\[4pt] \sum_{i' \ne j} \max\{0, r(i',j)\}, & i = j \end{cases}$$

g. Exemplar Selection: Add the two message matrices and search for the exemplar of each object i, defined as the maximizer of r(i,j) + a(i,j).

h. Updating Messages, using:

$$R_{i+1} = (1-\lambda)R_i + \lambda R_{i-1}, \qquad A_{i+1} = (1-\lambda)A_i + \lambda A_{i-1}$$

i. Iteration: Repeat steps f to h for a fixed number of iterations, or until the exemplar selection remains constant for some number of iterations.

j. Merging: Merge small clusters with the same labels into larger clusters to produce the clustering output.
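The message-passing loop in steps e through i can be written compactly with NumPy. The sketch below is an illustrative re-implementation of the standard Affinity Propagation updates with the damping rule of step h; the function name, default parameters and the convergence check are assumptions, not the authors' code. The semi-supervision enters only through the seed-dependent self-similarities already written on the diagonal of S in step d.

```python
import numpy as np

def affinity_propagation_messages(S, damping=0.9, max_iter=200, convergence_iter=15):
    """Illustrative message-passing loop for steps e-i.

    S is the N' x N' similarity matrix with the self-similarities s(l, l)
    already placed on its diagonal (step d).  Returns the exemplar index
    chosen for each object.
    """
    n = S.shape[0]
    R = np.zeros((n, n))               # responsibilities r(i, j)
    A = np.zeros((n, n))               # step e: availabilities a(i, j) = 0
    last = None
    stable = 0

    for _ in range(max_iter):
        # step f: r(i, j) = s(i, j) - max_{j' != j} {a(i, j') + s(i, j')}
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = np.max(AS, axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second

        # step f: availabilities built from the positive responsibilities
        Rp = np.maximum(R_new, 0)
        np.fill_diagonal(Rp, R_new.diagonal())   # keep r(j, j) unclipped
        colsum = Rp.sum(axis=0)
        A_new = np.minimum(0, colsum[None, :] - Rp)
        np.fill_diagonal(A_new, colsum - Rp.diagonal())

        # step h: damped update, new messages mixed with the previous ones
        R = (1 - damping) * R_new + damping * R
        A = (1 - damping) * A_new + damping * A

        # steps g and i: exemplars maximize r(i, j) + a(i, j); stop when stable
        exemplars = np.argmax(R + A, axis=1)
        if last is not None and np.array_equal(exemplars, last):
            stable += 1
            if stable >= convergence_iter:
                break
        else:
            stable = 0
        last = exemplars

    return exemplars
```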

Figure 1. F-measure comparison. K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation. doi:10.1371/journal.pone.0108847.g001


Figure 2. Entropy comparison. K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation. doi:10.1371/journal.pone.0108847.g002

Comparison Methods

SSAP is built on the Affinity Propagation (AP) method, which was proposed by Frey and Dueck in Science in 2007 [19] and has proven to be a convincing clustering algorithm in a range of applications. To investigate the semi-supervised learning performance of SSAP, AP is used as a comparison method in the experiments. K-means was first proposed by MacQueen [28] and is recognized as one of the top 10 algorithms in data mining [29]. In particular, k-means has proven to be a simple but very reliable method for document clustering [30,31]. Therefore, it is also used as a baseline for comparison.

Evaluation

To evaluate the performance of the three clustering methods (k-means, Affinity Propagation and SSAP), their respective values for F-measure, entropy, and CPU execution time are compared.

Figure 3. CPU execution time comparison. K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation. doi:10.1371/journal.pone.0108847.g003


Table 2. The mean values over all experiments.

Method  | Mean F-measure | Mean Entropy | Mean CPU execution time (min)
SSAP    | 0.674          | 0.429        | 498.9
AP      | 0.650          | 0.429        | 539.6
k-means | 0.384          | 0.721        | 2866.8

doi:10.1371/journal.pone.0108847.t002

The generated clusters are examined against the set of journal-based topic categories in BioMed. To check effectiveness across different collection sizes, all three algorithms are applied to the sub-datasets at the 5 scales from 400 to 800 documents, and for each scale the sub-datasets are randomly selected 5 times and the results averaged. All topics in the datasets had discrete uniform distributions, and all experiments were run on a PC equipped with an Intel(R) Xeon(R) X3430 CPU @ 2.40 GHz and 2 GB of RAM, with no parallel computing processes.
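As an illustration of this sampling protocol (not the authors' code; the per-topic uniform split, the fixed seed and the function name are assumptions), the sub-dataset construction could look as follows:

```python
import random

def sample_subdatasets(corpus_by_topic, scales=(400, 500, 600, 700, 800),
                       repeats=5, seed=0):
    """Draw `repeats` random sub-datasets for every scale, with the topics
    uniformly represented, so that each algorithm is scored on the same
    samples and the scores can be averaged."""
    rng = random.Random(seed)
    per_topic = {s: s // len(corpus_by_topic) for s in scales}
    runs = {}
    for scale in scales:
        runs[scale] = [
            [(doc, topic)
             for topic, docs in corpus_by_topic.items()
             for doc in rng.sample(docs, per_topic[scale])]
            for _ in range(repeats)
        ]
    return runs
```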

For information retrieval, entity recognition, and information extraction in particular, F-measure is the most commonly used evaluation measurement. The global F-measure for the entire clustering result is defined as:

$$F = \sum_{h} \frac{N_h}{N}\, \max_{g} \left\{ \frac{2 \cdot \frac{N_{hg}}{N_g} \cdot \frac{N_{hg}}{N_h}}{\frac{N_{hg}}{N_g} + \frac{N_{hg}}{N_h}} \right\} = \sum_{h} \frac{N_h}{N}\, \max_{g} \left\{ \frac{2\, N_{hg}}{N_g + N_h} \right\}$$

where N is the total number of documents in the dataset, N_hg is the number of objects of class h in cluster g, N_g is the number of objects in cluster g, and N_h is the number of objects in class h. In contrast to the F-measure, entropy provides a measure of the homogeneity, or purity, of a cluster. The total entropy for a set of clusters is calculated as:

$$E = \sum_{g=1}^{G} \sum_{h=1}^{H} \left( -\frac{N_g}{N} \cdot \frac{N_{hg}}{N_g} \log \frac{N_{hg}}{N_g} \right)$$
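For concreteness, both scores can be computed from a class-by-cluster contingency table, as in the following Python sketch (illustrative only; the function name and the logarithm base are assumptions):

```python
import math
from collections import Counter

def f_measure_and_entropy(labels_true, labels_pred):
    """Global F-measure and total entropy for a clustering result.

    labels_true[i] is the journal-based class h of document i and
    labels_pred[i] the cluster g it was assigned to; N_hg counts documents
    of class h falling into cluster g.
    """
    N = len(labels_true)
    N_hg = Counter(zip(labels_true, labels_pred))   # (h, g) -> N_hg
    N_h = Counter(labels_true)                      # class sizes
    N_g = Counter(labels_pred)                      # cluster sizes

    # F = sum_h (N_h / N) * max_g 2 P R / (P + R), P = N_hg/N_g, R = N_hg/N_h
    F = 0.0
    for h in N_h:
        best = 0.0
        for g in N_g:
            n = N_hg.get((h, g), 0)
            if n:
                P, R = n / N_g[g], n / N_h[h]
                best = max(best, 2 * P * R / (P + R))
        F += N_h[h] / N * best

    # E = sum_g sum_h -(N_g / N) * (N_hg / N_g) * log(N_hg / N_g)
    E = 0.0
    for g in N_g:
        for h in N_h:
            n = N_hg.get((h, g), 0)
            if n:
                p = n / N_g[g]
                E -= (N_g[g] / N) * p * math.log(p)   # natural log assumed
    return F, E
```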

Results

Figures 1, 2, and 3 compare the F-measure, entropy and CPU execution time measurements, respectively, for the three algorithms. From Figure 1 and the summary results in Table 2, it can be seen that the mean F-measure of SSAP is 0.674, an improvement of 0.290 (75.4%) over k-means. Similarly, AP achieves a mean F-measure of 0.650, an improvement of 0.266 (69.1%) over k-means. Figure 2 and Table 2 show a different trend for the entropy values of the three methods, with k-means showing the highest entropy and both SSAP and AP achieving around 40.0% lower entropy than k-means. From Figure 3, it is clear that the CPU execution times of AP and SSAP are much lower than that of k-means for equivalent performance results. Moreover, the gap widens exponentially as the dataset size increases. For example, k-means took 2.4 times longer to execute than SSAP on the 400-document dataset, a factor that increased to 8.2 on the 800-document dataset. This result agrees with the conclusions of other studies, such as Frey and Dueck's papers in Science [19,32]. Furthermore, even after 10000 runs of 400-document clustering, k-means could only achieve an F-measure of 0.384 and an entropy of 0.721, well below the performance of SSAP. To obtain the best result, k-means would need to examine all possible solutions, the equivalent of about C_400^10
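If the truncated expression C_400^10 is read as the binomial coefficient (the number of ways to pick 10 cluster centers from 400 documents), its magnitude can be checked directly; this reading and the snippet below are illustrative assumptions rather than the paper's own statement.

```python
import math

# "400 choose 10": ways to pick 10 candidate cluster centers from 400 documents,
# assuming C_400^10 denotes the binomial coefficient.
print(f"{math.comb(400, 10):.2e}")   # roughly 2.6e+19
```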
