Using GO-WAR for mining cross-ontology weighted association rules.

Computer Methods and Programs in Biomedicine 120 (2015) 113–122

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Giuseppe Agapito, Mario Cannataro, Pietro Hiram Guzzi ∗, Marianna Milano

Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Italy

Article info

Article history:
Received 8 September 2014
Received in revised form 16 March 2015
Accepted 23 March 2015

Keywords:
Association rule learning
Gene Ontology
Annotation quality
Data mining

Abstract

The Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products. The process of association is referred to as annotation. The relevance and the specificity of both GO terms and annotations are evaluated by a measure defined as information content (IC). The analysis of annotated data is thus an important challenge for bioinformatics. Different approaches to this analysis exist. Among them, the use of association rules (AR) may provide useful knowledge, and it has been used in some applications, e.g. improving the quality of annotations. Nevertheless, classical association rule algorithms take into account neither the source nor the importance of annotations, leading to the generation of candidate rules with low IC. This paper presents GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules. GO-WAR can extract association rules with a high level of IC without loss of support and confidence from a dataset of annotated data. A case study using GO-WAR on publicly available GO annotation datasets is used to demonstrate that our method outperforms current state-of-the-art approaches. © 2015 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

∗ Corresponding author. Tel.: +39 3316718314. E-mail address: [email protected] (P.H. Guzzi).

http://dx.doi.org/10.1016/j.cmpb.2015.03.007 0169-2607/© 2015 Elsevier Ireland Ltd. All rights reserved.

The production of experimental data in molecular biology has been accompanied by the accumulation of functional information about biological entities. Terms describing such knowledge are usually structured by using formal instruments such as controlled vocabularies and ontologies [1]. The Gene Ontology (GO) project [2] has developed a conceptual framework based on ontologies for organizing terms (namely GO terms) describing biological concepts. It consists of three ontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), describing different aspects of biological molecules. Each GO term may be associated with many biological concepts (e.g. proteins or genes) by a process known as annotation. The whole corpus of annotations is stored in publicly available databases, such as the Gene Ontology Annotation (GOA) database [3]. In this way, records representing the associations between biological concepts, e.g. proteins, and GO terms may be easily represented as {Pj, T1, ..., Tn}, e.g. {P06727, GO:0002227, GO:0006810, GO:0006869} or {Apolipoprotein A-IV, innate immune response in mucosa, transport, lipid transport}. The whole set of annotated data represents a valuable resource for the existing approaches of analysis. Among these, the use of association rules (AR) [4–6] is less popular than other techniques, such as statistical methods or semantic similarities [7]. Existing



Fig. 1 – Average number of annotations per protein in UNIPROT Database and some selected species. Each bar represents a different species. The height of the bar represents the average number of annotations.

approaches span from the use of AR to improve annotation consistency, as presented in [8], to the use of AR to analyze microarray data [9–14] (see [15] for a detailed review). As we pointed out in a previous work [5], the use of AR presents two main issues, due to the Number and the Nature of Annotations [16]. The number of annotations for each protein or gene is highly variable within the same GO taxonomy and across different species, as we depict in Fig. 1. The variability has two main causes: (i) the presence of different methods of annotation; and (ii) the use of different data sources.

Regarding the Nature of Annotations, it should be noted that the association between a biological concept and its related GO term can be performed with 14 different methods. These methods are in general grouped into two broad categories: experimentally verified (or manual) and Inferred from Electronic Annotation (IEA). IEA annotations are usually derived using computational methods that analyze literature. Each annotation is labeled with an evidence code (EC) to keep track of the method used to annotate a protein with GO terms. Manual annotations are, in general, more precise and specific than IEA ones (see [1]). Unfortunately, their number is lower, and the ratio of IEA to non-IEA annotations is variable (Fig. 2). Genes and proteins are often annotated with many generic GO terms (this is particularly evident for novel or not well-studied genes), a problem also referred to as the Shallow Annotation Problem. The role of these general annotations is to suggest an area in which the proteins or genes operate. This phenomenon especially affects IEA annotations derived by using computational methods. Consequently, the application of classical AR methods to the analysis of annotated data may lead to the extraction of rules with low specificity, with generic terms or with inconsistent annotations.

An inconsistent annotation is defined following the true path rule: a rule contains an inconsistent annotation when it contains both a term t and its ancestors [8,5]. For these reasons, Faria et al. [8] proposed the manual filtering of ancestors and of terms with low specificity, using a definition of specificity in terms of descendants and ancestors. Nevertheless, measuring the specificity of a single term from topological information alone may lead to incorrect results, as demonstrated by Alterovitz et al. [17]. Consequently, we here propose a more stringent definition of specificity based on the information content (IC) of a term [18]. Methods for calculating IC fall into two classes: intrinsic approaches, which estimate the IC of concepts by only considering structural information extracted from the ontology, and extrinsic approaches, which measure the IC starting from annotated corpora. The main result of the use of IC is that it is possible to associate each GO term with a measure of IC. The set of annotations and weights is referred to as the IC-weighted annotation set, represented as follows: {P06727, GO:0002227 (16.770), GO:0006810 (5.180), GO:0006869 (10.730)}, where the numbers represent the IC of the terms. For this reason, we may adapt some significant results of AR extraction that can deal with weighted attributes [19].

We developed GO-WAR, i.e. Gene Ontology-based Weighted Association Rules Mining, a novel data-mining approach able to extract weighted association rules starting from an annotated dataset of genes or gene products. The proposed approach is based on the following steps: (i) initially we calculate the information content for each GO term; (ii) then, we extract weighted association rules by using a modified FP-Tree-like algorithm able to deal with the size of typical biological datasets. We use publicly available GO annotation data to demonstrate our method. Results confirm that our method outperforms state-of-the-art methods.

We also provide a website containing the software tool, supplementary materials, and results: https://sites.google.com/site/weightedrules/.
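As an illustration of the IC-weighted annotation set described above, the following minimal Python sketch parses one record into a protein identifier and a term-to-IC mapping. The textual layout is taken from the example record in the text; the function name `parse_weighted_record` and the regular expression are assumptions made for this sketch and are not part of the GO-WAR tool.

```python
import re
from typing import Dict, Tuple

# Assumed pattern: a GO identifier followed by its IC value in parentheses.
TERM_RE = re.compile(r"(GO:\d{7})\s*\(([0-9.]+)\)")


def parse_weighted_record(record: str) -> Tuple[str, Dict[str, float]]:
    """Split one IC-weighted record into (entity id, {GO term: IC})."""
    body = record.strip().strip("{}")
    # The first comma-separated field is the annotated entity (e.g. a protein).
    entity = body.split(",", 1)[0].strip()
    weights = {term: float(ic) for term, ic in TERM_RE.findall(body)}
    return entity, weights


# Record copied from the example in the text.
record = "{P06727, GO:0002227 (16.770), GO:0006810 (5.180), GO:0006869 (10.730)}"
entity, weights = parse_weighted_record(record)
```

Each parsed record corresponds to one weighted transaction in the sense used later in the paper.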



Fig. 2 – Ratio of electronically inferred annotations with respect to manual ones. This picture points out that the ratio changes depending on the species taken into account. For each bar, green represents the fraction of non-IEA annotations while blue represents the fraction of IEA annotations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The rest of the paper is structured as follows: Section 2 discusses the main related work, Section 3 discusses the GO-WAR methodology and implementation, and Section 4 presents the results of applying GO-WAR to a biological dataset. Finally, Section 5 concludes the paper and outlines future work.

2. Background

This section presents the main concepts related to Gene Ontology and association rule learning.

2.1. Gene Ontology

Gene Ontology [2] (GO) is one of the main resources of biological information since it provides a specific definition of protein functions. GO is a structured and controlled vocabulary of terms, called GO terms. GO is subdivided into three non-overlapping ontologies: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). The structure of GO is a Directed Acyclic Graph (DAG), where the terms are the nodes and the relations among terms are the edges. This structure allows for more flexibility than a hierarchy, since each term can have multiple relationships to broader parent terms and more specific child terms [20]. Genes or proteins are connected with GO terms by a procedure known as the annotation process. Each annotation in GO has a source and a database entry attributed to it. The source can be a literature reference, a database reference or computational evidence. Each biological molecule is associated with the set of related terms [20]. Fourteen different annotation processes exist, each identified by an evidence code, the principal attribute of an annotation. The evidence code describes the basis for the annotation. A main distinction among evidence codes is between Inferred from Electronic Annotation (IEA) ones, i.e. annotations that are determined without user supervision, and non-IEA ones or manual annotations, i.e. annotations that are supervised by experts.
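The DAG structure, and the true path rule recalled in the Introduction (a rule is inconsistent when it contains both a term and one of its ancestors), can be sketched in a few lines of Python. The term names below are hypothetical placeholders rather than real GO identifiers, and the helper names `ancestors` and `is_inconsistent` are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy DAG: each term maps to the set of its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},  # two parents: this is what makes it a DAG, not a tree
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all proper ancestors of `term` by walking parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_inconsistent(terms: Iterable[str], parents: Dict[str, Set[str]]) -> bool:
    """True when the item set contains a term together with one of its ancestors."""
    items = set(terms)
    return any(ancestors(t, parents) & items for t in items)
```

For example, `is_inconsistent(["D", "A"], PARENTS)` holds because A is a parent of D, whereas the sibling pair A, B is consistent.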

2.2. Information content calculation

There are many different formulations of information content (IC), which fall into two classes: intrinsic and extrinsic methods. The calculation of extrinsic IC involves annotation data for a considered corpus, while the computation of intrinsic IC is based on structural information extracted from the GO DAG. Intrinsic IC thus relies only on the topology of the GO structure; it does not depend on annotated corpora, which avoids data circularity problems [21]. Intrinsic IC can be estimated using different topological characteristics such as ancestors, number of children, or depth (see Harispe et al. [18] for a complete review). In this work we used the IC of terms as proposed by Sanchez et al. [22]. This measure exploits only the number of leaves of a term a, leaves(a), and the set of ancestors of a including a itself, subsumers(a); it also introduces both the root node and its number of leaves, max_leaves, in the IC assessment. Leaves are more informative than concepts with many leaves; thus the leaves are suited to describe and to distinguish any concept.

\[
IC_{\text{Sanchez et al.}}(a) = -\log\left(\frac{\dfrac{|leaves(a)|}{|subsumers(a)|} + 1}{max\_leaves + 1}\right) \tag{1}
\]

In order to achieve a normalized measure, Sanchez et al. have adapted the above equation in the following:

\[
IC_{\text{Sanchez Adapted et al.}}(a) = -\log\left(\frac{|leaves(a)| + 1}{max\_leaves + 1}\right) \tag{2}
\]
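To make Eq. (1) concrete, the sketch below evaluates it on a small hypothetical DAG. The toy graph, the term names and the helper functions are assumptions made for this illustration only; GO-WAR itself relies on the Semantic Measures Library for this computation, and a leaf is taken here to have no leaves below it, so that |leaves(a)| = 0.

```python
import math
from typing import Dict, List, Set

# Hypothetical toy DAG: each term maps to its direct children.
CHILDREN: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
    "C": [],
    "D": [],
}


def descendants(term: str) -> Set[str]:
    """All proper descendants of `term`."""
    found: Set[str] = set()
    stack = list(CHILDREN[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(CHILDREN[node])
    return found


def leaves(term: str) -> Set[str]:
    """Leaf terms below `term` (empty when `term` is itself a leaf)."""
    return {d for d in descendants(term) if not CHILDREN[d]}


def subsumers(term: str) -> Set[str]:
    """Ancestors of `term`, including `term` itself."""
    return {term} | {t for t in CHILDREN if term in descendants(t)}


MAX_LEAVES = len(leaves("root"))


def ic_sanchez(term: str) -> float:
    """Intrinsic IC following Eq. (1)."""
    ratio = len(leaves(term)) / len(subsumers(term))
    return -math.log((ratio + 1) / (MAX_LEAVES + 1))
```

On this toy graph the root receives an IC of zero and the leaves receive the maximum value, log(MAX_LEAVES + 1), which matches the intuition that leaves are the most informative concepts.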

2.3. Comparison with respect to state of the art approaches

In this section, we summarize the main differences between our methodology and the current state of the art, focusing on the approaches of Faria et al. [8], Manda et al. [23], and Benites et al. [12].



Table 1 – Comparison of methodologies. The table summarizes the main differences among approaches considering: the learning algorithm, the presence of an initial data filtering step, the calculation of support and confidence, the structure of extracted rules, and the presence of a supporting software tool.

Method | Goal | Ontology | Algorithm | Data filtering | Supp&Conf | RulesType | Tool
Faria et al. | Annotation consistency | MF | A-Priori | Manual ancestor filtering | Support and confidence | A→C | NO
Benites et al. | Rare associations | MF | A-Priori | Hierarchical filtering | Interestingness by differences | A→C | NO
Manda et al. | Multi-Ontology multi-level co-annotation | MF or Multi ontologies | A-priori like | No ancestor filtering | MOSupport, MOConfidence | A→C | NO
GO-WAR | General rules | Multi ontologies | FP-Growth | Ancestor filtering or IC driven | Weighted support and confidence | General | YES

We recall that GO-WAR is slightly different from the method described in [8], since those authors are interested in capturing implicit relationships between aspects of a single function (e.g. ATPase activity → ATP binding). Conversely, we explore the whole search space, with the goal of highlighting unknown relationships among molecular functions. Thus, we may investigate a broader perspective. In addition, Faria et al. do not consider weighted support but the classical support and confidence. They are aware of the specificity problem; thus they manually filter GO terms with low specificity and redundant annotations before applying association rules. Conversely, the GO-WAR approach is more flexible and avoids this manual intervention.

GO-WAR is also different from the Multi-ontology data mining at All Levels (MOAL) algorithm discussed in [23]. MOAL uses the structure and relationships of GO to mine cross-ontology relationships between sub-ontologies of GO. Rules are mined using a modified version of the Apriori algorithm [24] with non-weighted items. The algorithm learns association rules using the standard measures of minimum support and confidence, and it adds to each rule a p-value threshold for the Chi-square test. Generalized transactions are obtained by providing ancestors and descendants together and removing duplicate concepts, which generates a very large number of rules. In the last step, rules are pruned and ranked using Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence). Mining in GO-WAR is not limited to discovering relations across sub-ontologies of GO, as for MOAL; it is developed to be as general as possible. In GO-WAR, rule generation and pruning are obtained in a single step, exploiting an ad hoc data structure as defined in the following.

Benites et al. [25] propose a data mining algorithm focused on mining rare associations from pairwise associations among multiple categories, useful to describe relations that are not obvious. In their methodology the authors introduced a new measure, named Interestingness by Differences, used instead of confidence. The Interestingness by Differences measure has been developed to highlight the rules with a high difference between their actual and expected interestingness, making it possible to detect pairwise rare rules, which might be rejected by using the confidence measure. Rare rules can also be mined by GO-WAR, using small values of weighted confidence. Thus, it is possible to distinguish infrequent but informative rules (with high values of weighted support, stated as rare rules by Benites et al.) from rules with a high number of occurrences that are not very informative (lower values of weighted support). In addition, GO-WAR can mine more general and composite rules than the pairwise rules mined by the methodology developed by Benites et al.

The synthetic comparison between GO-WAR and the other methodologies described above is summarized in Table 1. Considering these approaches, we point out the benefits of GO-WAR. First of all, it should be noted that GO-WAR is currently the only approach that provides a supporting tool that may help researchers in the analysis. Moreover, GO-WAR is designed for a more general purpose, while the other methodologies have been developed to handle a narrower problem (e.g. evaluation of annotation consistency, mining of rare associations, or analysis of multi-ontology associations). GO-WAR is able to extract rules from all kinds of ontology, whereas the methodologies of Faria et al. and Benites et al. can handle only the MF ontology, and Manda et al. add rule extraction from multiple ontologies. The mining of rules in GO-WAR is done by using a faster learning algorithm. Finally, rules mined by GO-WAR are more structured and informative than rules produced by the other methodologies. In this way, GO-WAR makes it possible to pinpoint unknown relationships among several GO terms.

3. The GO-WAR algorithm and implementation

Here we discuss the weighted association rule algorithm and the related software architecture.

3.1. GO-WAR algorithm

The rationale of the GO-WAR algorithm is to take into account the relevance of items, i.e. GO terms, balancing, thus, relevance and frequency. Our approach is similar to the methodology proposed in [19]. Consequently, we here formulate the problem of the extraction of weighted association rules considering as weight the IC of terms as formulated by Sanchez et al. The formulation may be extended to all the ICs without loss of generality.


Definition 3.1 (Weighted item). A Weighted Item (wi) is a pair (x, w), where x ∈ I, i.e. a GO term belonging to the set of items, and w ∈ R is the associated real-valued weight. For example, the GO term GO:0003714, related to transcription co-repressor activity, has an IC value of 11.876, written as GO:0003714 (11.876).

Definition 3.2 (Weighted transaction). A weighted transaction WT is a set of weighted items. For instance, the line P41226, GO:0005524 (10.07), GO:0005829 (10.07) represents the protein P41226, its annotations GO:0005524 and GO:0005829, and their weights. A set of WT is hereafter referred to as a Weighted Transaction Database WTB.

Definition 3.3 (Weighted support). The Weighted Support (WS) is the product of the support of an item, calculated using the classical formulation, and its weight.

Definition 3.4 (Weighted minimum support). The weighted minimum support (wminSupp) of a weighted item is defined as:

\[
wminSupp = \left(\sum_{i=1}^{n} \frac{WS(x_i)}{n}\right) \cdot p \tag{3}
\]

where n is the number of transactions, and p is a user-defined threshold.

The algorithm takes as input an unweighted transaction database T, and it computes the weight for each item, producing a weighted transaction database WTB. It then generates candidate itemsets by applying a two-step strategy based on a modified FP-Growth (frequent pattern) [26] algorithm. In the first step, the algorithm counts the weighted occurrence of items (i.e. the product of the frequency of occurrences and their weight) in the dataset, and it stores all the results in a table. Then it builds an FP-Tree structure by adding instances of items that have a weighted support greater than wminSupp. At the end of the computation, all the itemsets with the desired support and coverage have been found.

GO-WAR iteratively analyzes the FP-Tree to mine significant rules [27] using a recursive methodology. We defined an inverted DFS (inverted Depth First Search) scan method to examine the FP-Tree. Inverted DFS starts exploring the tree from the leaf nodes (bottom) and goes up to the root node. The advantage of using inverted DFS with respect to traditional DFS is the possibility to prune (remove) the postfix part of a frequent pattern. All frequent patterns of a given item are mined by following the links connecting all occurrences of the current item in the FP-Tree, producing a new tree called β-Tree, used to mine rules. The postfix part of a frequent pattern is defined with respect to a given item or itemset I. For example, taking into account the frequent pattern FP=(a:5, b:4, x:3, t:1, z:1), the postfix part of the item x is Post(x)={t:1, z:1}, while the prefix part is Pre(x)={a:5, b:4}. While following these links, the support related to each path (frequent pattern) is computed. Each path is a set of ancestors of a given item called Conditional Pattern Base (CPB), CPB = {Pre(I) ∪ I ∪ Post(I), where Pre(I)=∅ or Post(I)=∅}.
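The prefix/postfix split in the example above can be reproduced with a short Python sketch. A frequent pattern is modelled here as an ordered list of (item, count) pairs; this representation and the name `split_pattern` are assumptions made for illustration and do not reflect the FP-Tree data structure actually used by GO-WAR.

```python
from typing import List, Tuple

# A frequent pattern as an ordered list of (item, count) pairs.
Pattern = List[Tuple[str, int]]


def split_pattern(pattern: Pattern, item: str) -> Tuple[Pattern, Pattern]:
    """Return (Pre(item), Post(item)) for `item` inside `pattern`."""
    names = [name for name, _ in pattern]
    idx = names.index(item)  # raises ValueError when the item is absent
    return pattern[:idx], pattern[idx + 1:]


# Frequent pattern taken from the example in the text.
fp: Pattern = [("a", 5), ("b", 4), ("x", 3), ("t", 1), ("z", 1)]
pre, post = split_pattern(fp, "x")

# Dropping the postfix, as the inverted DFS scan allows, keeps only
# the prefix followed by the item itself.
pruned = pre + [("x", 3)]
```

Scanning from the leaves upward means the postfix of the current item has already been visited, which is why it can be discarded at this point.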


Starting from the leaves, the prefix part of a path can be used to mine rules; in particular, each path is a new tree called β-Tree, to which the methodology explained previously can be applied to mine the meaningful rules. From the new β-Tree we prune all nodes (items) for which the condition wS(node) < wminSupp holds. The process continues until all items of the current β-Tree have been analyzed, or until the root prefix set related to the current item has been reached or is empty.

Algorithm 3.5. Gene Ontology-based Weighted Association Rules Mining (GO-WAR)
Require: A weighted Transaction Database WTB, a weighted minimum support wminSupp.
Ensure: A set of weighted association rules Rules.
  for all wi ∈ WTB {Calculation of weighted support}
    ws(wi) ← computesupport
  end for
  frequentItemsList ← compute(wS, wminSupp) {Creation of FP-Tree}
  Rules ← FP-Tree {Creation of Rules}
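The first two steps of Algorithm 3.5 (weighted support and selection of frequent items) can be sketched as follows. This is a minimal illustration under stated assumptions, not the GO-WAR implementation: the classical support is taken as the fraction of transactions containing an item, the sum in Eq. (3) is read as running over the distinct items with n the number of transactions, and the item names and weights are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def weighted_supports(transactions: List[Set[str]],
                      weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted support per item: classical support times item weight."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: (count / n) * weights[item] for item, count in counts.items()}


def wmin_supp(ws: Dict[str, float], n: int, p: float) -> float:
    """Weighted minimum support following Eq. (3), under the reading above."""
    return sum(ws.values()) / n * p


# Hypothetical data: t_a is frequent but generic (low weight),
# t_c is rare but specific (high weight).
transactions = [{"t_a", "t_b"}, {"t_a", "t_c"}, {"t_a", "t_b"}, {"t_b"}]
weights = {"t_a": 2.0, "t_b": 4.0, "t_c": 10.0}

ws = weighted_supports(transactions, weights)
threshold = wmin_supp(ws, n=len(transactions), p=1.0)

# Items kept for FP-Tree construction, sorted by descending weighted support.
frequent = sorted((i for i, v in ws.items() if v > threshold),
                  key=lambda i: -ws[i])
```

In this toy run the most frequent item, t_a, is pruned because of its low weight, while the rare but informative t_c is retained, which is the balance between relevance and frequency that the weighting is meant to achieve.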

3.2. GO-WAR performance analysis

The space complexity of the GO-WAR algorithm is related to the building of the FP-Tree. The size of the FP-Tree is linearly related to the size of the input dataset. It grows less than the database because the preprocessing step removes all the items having ws(x) ≤ wminSupp.

3.2.1. Time complexity analysis

The time complexity of the algorithm is mainly related to the number of scans of the input database (creation of the FP-Tree) and to the extraction of rules (creation of rules). The calculation of weighted support requires linear time, proportional to the size of the input dataset. In the same scan it is also possible to locate the frequent items, thus building the FrequentItemsList. The average time complexity of this composite step is O(n), and FP-Tree creation is done in linear time proportional to the number of frequent items identified. The time complexity needed to extract rules is comparable to that of scanning an n-ary tree. In general, for an n-ary tree with height h, the upper bound for the number of leaves is n^h. In our solution, sorting the elements in descending order and using an inverted DFS scan strategy allows us to obtain a time complexity equal to O(n^2). Finally, the resulting time complexity is 3n + n^2, where 3n is related to the computation of weighted support, FrequentItemsList and FP-Tree building. The complexity can be rewritten without loss of generality as n + n^2; thus, for large values of n the complexity turns out to be O(n^2).

3.3. GO-WAR implementation

GO-WAR has a layered architecture, depicted in Fig. 3, which is composed of five main modules. The MINERCORE receives the user request and acts as a controller of the other modules. Each submodule is controlled by using a master–slave approach that is internally realized through Java Threads in order to achieve efficient computation. The Rule Miner module is responsible for the calculation of frequent itemsets and the extraction of final rules. The IC Calculator module provides the calculation of IC for each GO term. It relies on the Semantic Measures Library and Toolkit [18] (http://www.semantic-measures-library.org/) and on a local copy of Gene Ontology. The GO Term Translation module provides a complete description for each GO term by invoking the Gene Ontology Annotation Database through Web Services. The GUI module, based on Java Swing (http://docs.oracle.com/javase/tutorial/uiswing/), provides the user with transparent access to all the implemented functionalities, as depicted in Fig. 4.

Fig. 3 – Architecture of GO-WAR.

GO-WAR has been fully implemented in the Java programming language, and it is available for download at https://sites.google.com/site/weightedrules/. Users may extract rules from the input dataset in an easy way, as depicted in Fig. 4. The user has to load the input dataset and to choose the values of the minimum weighted support and of the weighted confidence, in order to control the number of association rules extracted. GO-WAR is developed to handle data coming from the three sub-ontologies, namely MF, CC and BP. The ability to handle multiple ontologies makes GO-WAR suitable for analyzing data coming from a single ontology or from multiple ontologies. In this way, it is possible to highlight unknown relationships among terms belonging to different ontologies or among terms belonging to the same ontology.

4. Results

Here we present the comparison of GO-WAR with respect to the main software tools and state-of-the-art approaches, starting with the comparison with respect to Faria's methodology. We downloaded the same Gene Ontology and the same annotations used in that paper, considering only the GO terms belonging to the MF ontology used to annotate the following organisms: arabidopsis, chicken, cow, human, mouse, rat, and zebrafish. Data were downloaded from the GOA database available online at http://www.ebi.ac.uk/GOA/downloads. The size of the dataset is about 13.78 MB on disk, and the total number of GO terms used to annotate the organisms is 338,441. After collecting the data, we ran GO-WAR. We tested GO-WAR using different values of weighted support and confidence. Then we selected the parameters that guaranteed the best results: weighted support equal to 23% and weighted confidence greater than 80%. We obtained 7975 rules from 692,006 frequent itemsets, requiring 2582 ms to perform the task.

The set of rules we mined can be subdivided into two groups: (i) binary rules (i.e. rules having only two items); (ii) n-ary rules (i.e. rules having more than two items). In the second group, we focus on a particular subset that we refer to as extended rules, i.e. rules that contain the two items appearing in one of Faria's binary rules. Faria et al. ranked the obtained rules by support and used the first 100 rules to discuss the ability of their method. We used these rules for comparison. The comparison with respect to Faria's top rules provided the following results: (i) GO-WAR found 60 binary rules identical to Faria's top rules; (ii) GO-WAR was able to find 36 extended rules not present in Faria's top rules; (iii) we mined 100 binary rules not present in Faria's results. The website of the project stores all these results in a supporting file.

4.1. Analysis of binary rules

We here present a manual evaluation of rules missed by Faria et al. Table 2 presents some top rules found only by GO-WAR



Fig. 4 – GUI of GO-WAR. Initially the user has to upload datasets into GO-WAR by using the GO-WAR GUI. Then he/she has to select the desired threshold. GO-WAR automatically weights all the GO terms and runs the rule learning algorithm. At the end of the computation, GO-WAR shows the extracted rules in a simple textual interface. For each rule the user may obtain the translation of GO terms into their textual definition and the visualization of the GO graph to which they belong.

and missed by Faria et al., together with their biological meaning, ranked by their weighted support. Let us consider rule (1): (GO:0016463, GO:0008551), i.e. zinc-exporting ATPase activity and cadmium-exporting ATPase activity, missed by Faria et al. The manual inspection of the two GO terms in the GOA database shows that the second one is the most recurrent term of the first one, considering IEA annotations. Despite this consideration, the relation between cadmium and zinc minerals in living cells has been described in literature [28,29], and it constitutes an important biological mechanism for some small species. The significant co-occurrence in a large fraction of the UNIPROT database, as revealed by GO-WAR, may suggest an inconsistent annotation that GO-WAR was able to find.

Consider now rule (2) (GO:0016463, GO:0005388), i.e. zinc-exporting ATPase activity and calcium-transporting ATPase activity. The analysis of literature showed that the relation between zinc transport and calcium transport is still unclear, although there is a growing body of evidence supporting a relation between them. Consequently, this rule may point to an inconsistent annotation. Rule (3) also has many references in literature [30] for some species, and the co-occurrence of both terms is evident only considering IEA annotations, as reported in the GOA database. The significant co-occurrence in a large fraction of the UNIPROT database, as revealed, may suggest an inconsistent annotation that GO-WAR was able to find.

The analysis of rule (3) (GO:0008551, GO:0005388), i.e. cadmium-exporting ATPase activity and calcium-transporting ATPase activity, showed that the relation between both functions is valid. Similarly to rule (2), the significant co-occurrence in a large fraction of the UNIPROT database, as revealed, may suggest an inconsistent annotation that GO-WAR was able to find.

Rule 4 (GO:0086039, GO:0005388), i.e. calcium-transporting ATPase activity involved in regulation of cardiac muscle cell membrane potential and calcium-transporting ATPase activity, shows a possible relation between these functions that is also reported in literature [31]. The relevance of this rule is supported by the fact that in the current release of Gene Ontology these terms are linked by an is_a relationship, while in the release used for the experiments they were not linked.

Rule 5 (GO:0008520, GO:0070890) shows a relation between L-ascorbate:sodium symporter activity and sodium-dependent L-ascorbate transmembrane transporter activity. The analysis of literature demonstrates that the two definitions correspond to the same process [32,33], thus pointing to a possible redundant annotation.



Table 2 – Rules found by GO-WAR and missed by Faria ranked by weighted support (IDs are inserted for a better discussion in the following.). ID

Term 1

Term 2

Weighted support

1

GO:0016463

GO:0008551

5066

2

GO:0016463

GO:0005388

5060

3

GO:0008551

GO:0005388

5061

4

GO:0086039

GO:0005388

5060

5

GO:0008520

GO:0070890

5054

6

GO:0046933

GO:0046961

5048

7

GO:0015410

GO:0005388

5049

8

GO:0008554

GO:0015432

5048

9

GO:0005391

GO:0005524

5008

ID | Function (first term) | Function (second term)
1 | Zinc-exporting ATPase activity | Cadmium-exporting ATPase activity
2 | Zinc-exporting ATPase activity | Calcium-transporting ATPase activity
3 | Cadmium-exporting ATPase activity | Calcium-transporting ATPase activity
4 | Calcium-transporting ATPase activity involved in regulation of cardiac muscle cell membrane potential | Calcium-transporting ATPase activity
5 | L-Ascorbate:sodium symporter activity | Sodium-dependent L-ascorbate transmembrane transporter activity
6 | Proton-transporting ATP synthase activity, rotational mechanism | Proton-transporting ATPase activity, rotational mechanism
7 | Manganese-transporting ATPase activity | Calcium-transporting ATPase activity
8 | Sodium-exporting ATPase activity, phosphorylative mechanism | Bile acid-exporting ATPase activity
9 | Sodium:potassium-exchanging ATPase activity | ATP binding

Rule 6 (GO:0046933, GO:0046961) relates proton-transporting ATP synthase activity, rotational mechanism, and proton-transporting ATPase activity, rotational mechanism, two closely related processes [34]. In particular, in the GO tree the two terms differ only by one direct ancestor (hydrogen ion exporting ATPase activity), which is linked only to proton-transporting ATP synthase activity, rotational mechanism. The analysis may thus suggest that the two terms could be reorganized. Rule 7 (GO:0015410, GO:0005388) links manganese-transporting ATPase activity and calcium-transporting ATPase activity. The two terms are related, since both are involved in the cell adenosine triphosphatase activity, as stated in [35]; this may suggest a restructuring of GO. Rule 8 (GO:0008554, GO:0015432) links sodium-exporting ATPase activity, phosphorylative mechanism, and bile acid-exporting ATPase activity, suggesting a possible relation between these terms, as evidenced in some recent works [36]. According to the statistics of the Gene Ontology database (http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0015432#term=stats), GO:0015432 is the term that most frequently co-occurs with GO:0008554, considering both IEA and non-IEA annotations. This may suggest a possible common function or a restructuring. Finally, rule 9 (GO:0005391, GO:0005524) suggests a relation between sodium:potassium-exchanging ATPase activity and ATP binding. This is expected, since ATP binding is related to the mechanism of exchange of sodium and potassium in living cells [37]. This may suggest the introduction of novel relations in Gene Ontology.
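The co-occurrence statistic mentioned for rule 8 can be illustrated with a minimal sketch. It assumes that annotations are available as a mapping from a protein identifier to its set of GO terms; the function name `co_occurring_terms`, the protein identifiers P1 to P4, and the toy annotation sets are hypothetical and are not taken from QuickGO.

```python
from collections import Counter
from typing import Dict, Set


def co_occurring_terms(annotations: Dict[str, Set[str]], term: str) -> Counter:
    """Count how often every other GO term is annotated together with `term`."""
    counts: Counter = Counter()
    for terms in annotations.values():
        if term in terms:
            # Count each co-annotated term once per protein.
            counts.update(t for t in terms if t != term)
    return counts


# Hypothetical toy annotations (protein -> GO terms), for illustration only.
annotations = {
    "P1": {"GO:0008554", "GO:0015432", "GO:0005524"},
    "P2": {"GO:0008554", "GO:0015432"},
    "P3": {"GO:0005524"},
    "P4": {"GO:0008554"},
}

counts = co_occurring_terms(annotations, "GO:0008554")
top_term, top_count = counts.most_common(1)[0]
print(top_term, top_count)
```

On real data the same count would be taken over all annotated gene products, optionally split by evidence code (IEA versus non-IEA).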

4.2.

Analysis of extended rules

We define extended rules as rules that overlap a rule found by Faria et al. but contain additional terms that Faria et al. missed. For instance, consider the rule (GO:0003924, GO:0005525) found by Faria et al. and the rule (GO:0003924, GO:0042803, GO:0005525) found by GO-WAR. The website of the project lists all 37 extended rules; Table 3 presents only the top ten, sorted by weighted support. From a biological point of view, these rules may suggest to GO curators a restructuring of the ontologies or the introduction of novel classes representing less atomic functions.
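The notion of an extended rule can be sketched in a few lines. This is only an illustration under a simplifying assumption: each rule is treated as an unordered set of GO terms, ignoring the antecedent/consequent direction, and a rule counts as extended when it is a strict superset of a baseline rule. The function name `find_extended_rules` is hypothetical and is not part of GO-WAR.

```python
from typing import FrozenSet, List, Tuple

Rule = FrozenSet[str]


def find_extended_rules(
    baseline: List[Rule], candidates: List[Rule]
) -> List[Tuple[Rule, Rule, Rule]]:
    """Return (baseline rule, extended rule, extra terms) triples.

    A candidate is reported when it contains every term of a baseline
    rule plus at least one additional term (strict superset).
    """
    extended = []
    for candidate in candidates:
        for base in baseline:
            if base < candidate:  # strict subset test on frozensets
                extended.append((base, candidate, candidate - base))
    return extended


# The example pair from the text, plus one rule that extends nothing.
faria_rules = [frozenset({"GO:0003924", "GO:0005525"})]
go_war_rules = [
    frozenset({"GO:0003924", "GO:0042803", "GO:0005525"}),
    frozenset({"GO:0005391", "GO:0005524"}),
]

extended = find_extended_rules(faria_rules, go_war_rules)
for base, rule, extra in extended:
    print(sorted(base), "->", sorted(rule), "extra:", sorted(extra))
```

Keeping the extra terms alongside each match makes it easy to list which terms the baseline rule lacks.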

4.3.

Performance comparison

This section presents the performance evaluation of GO-WAR. The experiments were run on a MacBook Pro with a 2.3 GHz i7 processor, 16 GB of RAM, and a 512 GB SSD (solid-state drive). The website of the project stores more detailed examples of rules. The performance and results of GO-WAR were compared with existing association rule approaches implemented in the Arules package of Bioconductor [24], Knime [38], and Weka [39]. Unfortunately, Weka is not able to extract weighted rules, so a fair comparison is not possible. We nevertheless tried to mine non-weighted rules by manually removing the information content, using the Apriori algorithm implemented in Weka [39]. Using different support values and confidence we […] Similarly, Knime is not able to deal with weighted items. Consequently, we removed the weights and tried to extract rules. Since Knime is based on the Weka implementation of the FP-Growth algorithm, Knime was not able to mine meaningful rules. We should point out that generating input files in a format compatible with Knime and Weka required considerable effort. This step is time-consuming and makes the analysis error-prone, especially for non-expert users. The results obtained by GO-WAR were not compared with those of the Arules package of Bioconductor [24], since Arules did not find significant rules. The lack of significant rules with Arules is due to its use of the classical implementation of the Apriori algorithm.

Table 3 – Top ten extended rules, ranked by weighted support.

| ID | Faria rule | GO-WAR rule | Rules missed by Faria | Interpretation |
|---|---|---|---|---|
| 1 | GO:0003924 → GO:0005525 | GO:0003924, GO:0042803 → GO:0005525 | GO:0042803 → GO:0005525 | GTPase activity, protein homodimerization activity → GTP binding |
| 2 | GO:0004017 → GO:0005524 | GO:0004017, GO:0005525, GO:0004550 → GO:0005524 | GO:0005525 → GO:0005524; GO:0004550 → GO:0005524 | Adenylate kinase activity, GTP binding, nucleoside diphosphate kinase activity → ATP binding |
| 3 | GO:0030272 → GO:0005524 | GO:0030272, GO:0003676 → GO:0005524 | GO:0003676 → GO:0005524 | 5-Formyltetrahydrofolate cyclo-ligase activity, nucleic acid binding → ATP binding |
| 4 | GO:0004814 → GO:0005524 | GO:0004814, GO:0044822 → GO:0005524 | GO:0044822 → GO:0005524 | Arginine-tRNA ligase activity, poly(A) RNA binding → ATP binding |
| 5 | GO:0004124 → GO:0030170 | GO:0004124, GO:0016740 → GO:0030170 | GO:0016740 → GO:0030170 | Cysteine synthase activity, transferase activity → pyridoxal phosphate binding |
| 6 | GO:0004008 → GO:0005524 | GO:0004008, GO:0008270 → GO:0005524 | GO:0008270 → GO:0005524 | Copper-exporting ATPase activity, zinc ion binding → ATP binding |
| 7 | GO:0004821 → GO:0005524 | GO:0004821, GO:0044822 → GO:0005524 | GO:0044822 → GO:0005524 | Histidine-tRNA ligase activity, poly(A) RNA binding → ATP binding |
| 8 | GO:0004813 → GO:0005524 | GO:0004813, GO:0008270, GO:0000049 → GO:0005524 | GO:0008270 → GO:0005524; GO:0000049 → GO:0005524 | Alanine-tRNA ligase activity, zinc ion binding, tRNA binding → ATP binding |
| 9 | GO:0004824 → GO:0005524 | GO:0004824, GO:0003676 → GO:0005524 | GO:0003676 → GO:0005524 | Lysine-tRNA ligase activity, nucleic acid binding → ATP binding |
| 10 | GO:0008060 → GO:0008270 | GO:0008060, GO:0017137 → GO:0008270 | GO:0017137 → GO:0008270 | ARF GTPase activator activity, Rab GTPase binding → zinc ion binding |

5.

Conclusion

Classical AR algorithms are not able to deal with the different sources of GO annotations. Consequently, when used on annotated data, they produce candidate rules with low IC. We presented GO-WAR, which can extract association rules with a high level of IC without losing support and confidence during the rule discovery phase and without post-processing strategies for pruning uninteresting rules. We used publicly available GO annotation data to demonstrate our method. Future work will test GO-WAR on larger datasets for improving annotation consistency.

references

[1] P. Guzzi, M. Mina, C. Guerra, M. Cannataro, Semantic similarity analysis of protein data: assessment with biological features and issues, Brief. Bioinform. 13 (5) (2012) 569–585. http://bib.oxfordjournals.org/content/early/2011/12/02/bib.bbr066.short
[2] M.A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, et al., The gene ontology (GO) database and informatics resource, Nucleic Acids Res. 32 (Database issue) (2004) 258–261.
[3] E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R. Lopez, R. Apweiler, The gene ontology annotation (GOA) database: sharing knowledge in UNIPROT with gene ontology, Nucleic Acids Res. 32 (Suppl. 1) (2004) D262–D266, http://dx.doi.org/10.1093/nar/gkh021.
[4] J. Hipp, U. Güntzer, G. Nakhaeizadeh, Algorithms for association rule mining – a general survey and comparison, ACM SIGKDD Explor. Newslett. 2 (1) (2000) 58–64.
[5] P.H. Guzzi, M. Milano, M. Cannataro, Mining association rules from gene ontology and protein networks: promises and challenges, Procedia Comput. Sci. 29 (2014) 1970–1980.
[6] M.J. Zaki, S. Parthasarathy, M. Ogihara, W. Li, et al., New algorithms for fast discovery of association rules, in: KDD, vol. 97, 1997, pp. 283–286.
[7] M. Cannataro, P.H. Guzzi, A. Sarica, Data mining and life sciences applications on the grid, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 3 (3) (2013) 216–238, http://dx.doi.org/10.1002/widm.1090.
[8] D. Faria, A. Schlicker, C. Pesquita, H. Bastos, A.E.N. Ferreira, M. Albrecht, A.O. Falco, Mining go annotations for improving annotation consistency, PLoS ONE 7 (7) (2012) e40519.
[9] P. Carmona-Saez, M. Chagoyen, A. Rodriguez, O. Trelles, J.M. Carazo, A. Pascual-Montano, Integrated analysis of gene expression by association rules discovery, BMC Bioinform. 7 (1) (2006) 54.
[10] I. Ponzoni, M.J. Nueda, S. Tarazona, S. Götz, D. Montaner, J.S. Dussaut, J. Dopazo, A. Conesa, Pathway network inference from gene expression data, BMC Syst. Biol. 8 (2) (2014) 1–17.
[11] C. Tew, C. Giraud-Carrier, K. Tanner, S. Burton, Behavior-based clustering and analysis of interestingness measures for association rule mining, Data Min. Knowl. Discov. 28 (4) (2014) 1004–1045.
[12] F. Benites, S. Simon, E. Sapozhnikova, Mining rare associations between biological ontologies, PLOS ONE 9 (1) (2014) e84475.
[13] C.D. Nguyen, K.J. Gardiner, K.J. Cios, Protein annotation from protein interaction networks and gene ontology, J. Biomed. Inf. 44 (5) (2011) 824–829.
[14] P. Manda, F. McCarthy, S.M. Bridges, Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships, J. Biomed. Inf. 46 (5) (2013) 849–856.
[15] S. Naulaerts, P. Meysman, W. Bittremieux, T.N. Vu, W. Vanden Berghe, B. Goethals, K. Laukens, A primer to frequent itemset mining for bioinformatics, Brief. Bioinform. (2013), http://dx.doi.org/10.1093/bib/bbt074.
[16] C. Huttenhower, M.A. Hibbs, C.L. Myers, A.A. Caudy, D.C. Hess, O.G. Troyanskaya, The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction, Bioinformatics 25 (18) (2009) 2404–2410.
[17] G. Alterovitz, M. Xiang, D.P. Hill, J. Lomax, J. Liu, M. Cherkassky, J. Dreyfuss, C. Mungall, M.A. Harris, M.E. Dolan, et al., Ontological engineering, Nat. Biotechnol. 8 (2) (2010) 128–130.
[18] S. Harispe, D. Sánchez, S. Ranwez, S. Janaqi, J. Montmain, A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain, J. Biomed. Inf. (2013).
[19] W. Wang, J. Yang, P.S. Yu, Efficient mining of weighted association rules (war), in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2000, pp. 270–274.
[20] L. du Plessis, N. Skunca, C. Dessimoz, The what, where, how and why of gene ontology – a primer for bioinformaticians, Brief. Bioinform. 12 (6) (2011) 723–735, http://dx.doi.org/10.1093/bib/bbr002.
[21] M. Milano, G. Agapito, P.H. Guzzi, M. Cannataro, Biases in information content measurement of gene ontology terms, in: H.J. Zheng, W. Dubitzky, X. Hu, J. Hao, D.P. Berrar, K. Cho, Y. Wang, D. Gilbert (Eds.), 2014 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2014, Belfast, United Kingdom, November 2–5, 2014, IEEE, 2014, pp. 9–16, http://dx.doi.org/10.1109/BIBM.2014.6999375.
[22] D. Sánchez, M. Batet, D. Isern, Ontology-based information content computation, Knowl.-Based Syst. 24 (2011) 297–303, http://dx.doi.org/10.1016/j.knosys.2010.10.001.
[23] P. Manda, F. McCarthy, S.M. Bridges, Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships, J. Biomed. Inform. 46 (5) (2013) 849–856, http://www.sciencedirect.com/science/article/pii/S1532046413000877
[24] M. Hahsler, B. Grün, K. Hornik, arules: mining association rules and frequent itemsets, 2006, URL http://cran.r-project.org/, R package version, SIGKDD Explor. 2 (2007) 1–4, doi:10.1.1.109.4868.
[25] F. Benites, S. Simon, E. Sapozhnikova, Mining rare associations between biological ontologies, PLOS ONE 9 (1) (2014) e84475, http://dx.doi.org/10.1371/journal.pone.0084475.
[26] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: W. Chen, J. Naughton, P.A. Bernstein (Eds.), 2000 ACM SIGMOD Intl. Conference on Management of Data, ACM Press, 2000, pp. 1–12, http://citeseer.ist.psu.edu/han99mining.html
[27] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, SIGMOD Rec. 29 (2) (2000) 1–12, http://dx.doi.org/10.1145/335191.335372.
[28] A. Craig, L. Hare, A. Tessier, Experimental evidence for cadmium uptake via calcium channels in the aquatic insect Chironomus staegeri, Aquat. Toxicol. 44 (4) (1999) 255–262.
[29] D.B. Buchwalter, S.N. Luoma, Differences in dissolved cadmium and zinc uptake among stream insects: mechanistic explanations, Environ. Sci. Technol. 39 (2) (2005) 498–504.
[30] E. Chávez, R. Briones, B. Michel, C. Bravo, D. Jay, Evidence for the involvement of dithiol groups in mitochondrial calcium transport: studies with cadmium, Arch. Biochem. Biophys. 242 (2) (1985) 493–497.
[31] A.E. Shamoo, I.S. Ambudkar, Regulation of calcium transport in cardiac cells, Can. J. Physiol. Pharmacol. 62 (1) (1984) 9–22.
[32] J.-K. Chung, Sodium iodide symporter: its role in nuclear medicine, J. Nucl. Med. 43 (9) (2002) 1188–1200, http://jnm.snmjournals.org/content/43/9/1188.abstract
[33] H. Tsukaguchi, T. Tokui, B. Mackenzie, U.V. Berger, X.-Z. Chen, Y. Wang, R.F. Brubaker, M.A. Hediger, A family of mammalian Na+-dependent L-ascorbic acid transporters, Nature 399 (6731) (1999) 70–75.
[34] M. Yoshida, E. Muneyuki, T. Hisabori, ATP synthase – a marvellous rotary engine of the cell, Nat. Rev. Mol. Cell Biol. 2 (9) (2001) 669–677.
[35] J.C. Lai, J.F. Guest, T.K. Leung, L. Lim, A.N. Davison, The effects of cadmium, manganese and aluminium on sodium–potassium-activated and magnesium-activated adenosine triphosphatase activity and choline uptake in rat brain synaptosomes, Biochem. Pharmacol. 29 (2) (1980) 141–146.
[36] W. Zhao, K. Shahzad, M. Jiang, D.E. Graugnard, S.L. Rodriguez-Zas, J. Luo, J.J. Loor, W.L. Hurley, Bioinformatics and gene network analyses of the swine mammary gland transcriptome during late gestation, Bioinform. Biol. Insights 7 (2013) 193.
[37] A.M. Bertorello, A. Aperia, S.I. Walaas, A.C. Nairn, P. Greengard, Phosphorylation of the catalytic subunit of Na+, K(+)-ATPase inhibits the activity of the enzyme, Proc. Natl. Acad. Sci. U.S.A. 88 (24) (1991) 11359–11362.
[38] M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, B. Wiswedel, KNIME: the Konstanz information miner, in: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data Analysis, Machine Learning and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 319–326, http://dx.doi.org/10.1007/978-3-540-78246-9_38 (Chapter 38).
[39] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA data mining software: an update, Special Interest Group Knowl. Discov. Data Min. Explor. Newslett. 11 (1) (2009) 10–18, http://dx.doi.org/10.1145/1656274.1656278.
