
Extracting Representative Information to Enhance Flexible Data Queries

Jin Zhang, Guoqing Chen, and Xiaohui Tang

Abstract— Extracting representative information is of great interest in data queries and web applications nowadays, where approximate match between attribute values/records is an important issue in the extraction process. This paper proposes an approach to extracting representative tuples from data classes under an extended possibility-based data model, and introduces a measure (namely, relation compactness) based upon information entropy to reflect the degree to which a relation is compact in light of information redundancy. Theoretical analysis and data experiments show that the approach has desirable properties: 1) the set of representative tuples has high degrees of compactness (less redundancy) and coverage (rich content); 2) it provides a way to obtain data query outcomes of different sizes in a flexible manner according to user preference; and 3) the approach is also meaningful and applicable to web search applications.

Index Terms— Flexible data queries, information equivalence, relation compactness, representativeness, web search.

I. INTRODUCTION

THE commonly used database model nowadays is the relational database (RDB) model initiated by Codd [1], which was generally designed under the assumption that the data/information is precise and queries are crisp. However, decision makers often face a growing need for representing and processing imprecise and uncertain information. This is due to the fact that imprecision and uncertainty are inherent in human reasoning and decision making processes. Furthermore, imprecision and uncertainty represent a kind of partial knowledge about the real world and are therefore considered desirable and useful in data modeling and queries. Particularly in a web environment, approximate match is often preferred in generating search outcomes when exact match is unable to produce satisfactory results from targeted databases. Since the 1980s, a number of extended RDB models and queries have been proposed to deal with imprecision and uncertainty in attribute values and domain elements [2]–[6]. For instance, with the data stored in classical RDBs, Kacprzyk and Zadrozny [7] proposed a flexible query system that could execute such queries as “find (all) records such that most of the (important)

Manuscript received January 15, 2011; revised March 13, 2012; accepted March 17, 2012. Date of publication April 25, 2012; date of current version May 10, 2012. This work was supported in part by the National Natural Science Foundation of China under Grant 70890080/71110107027, Tsinghua University Initiative Scientific Research Program under Grant 20101081741, and the Research Center for Contemporary Management, Tsinghua University. The authors are with the School of Economics and Management, Tsinghua University, Beijing 100084, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2193415

clauses are satisfied (to a degree in [0, 1])” for crisp data stored in the Microsoft Access DBMS. The main idea was to define the linguistic terms and transform them into equivalent crisp queries. Chen and Jong [8] developed a method to transform query conditions described by imprecise values such as “about 25” into classical SQL queries. In [9], Bosc and Pivert extended SQL queries to allow the expression of imprecise conditions, where query equivalence and the basic query operators were discussed. Furthermore, with possibility distributions as attribute values [6], they transformed the possibilistic data into a weighted RDB, in which a nesting mechanism was introduced to support value-based queries, and the projection-selection-join operations were studied in detail [10]. In [11] and [12], we extended the eight basic algebraic operators in light of database design, along with equivalent transformation rules for these operators. These efforts aimed at helping database users store and utilize data in a more natural way, and are deemed important extensions to the classical RDB and queries.

Notably, when querying databases with data of high volume, which is common nowadays, the size of the outcomes can easily be very large, even massive in a web database search context. Consider a classical database R (Customer #, Customer Name, Age, Location) as an example. Two types of queries may be formulated against R (e.g., SQL-like).

Query 1: Select Customer Name From R Where Age = 25 AND Location = B.

Query 2: Select Customer Name From R Where Age is about 25 AND Location is near B.

Here Query 1 is a classical query and Query 2 is a query with approximate conditions. Furthermore, when R allows imprecise attribute values to appear for Age, such as young, R becomes an extended RDB. In this case, Query 1 does not work whereas Query 2 may still pertain. It is worth mentioning that Query 2 requires an approximate match between query conditions (stated in the Where clause, which is fuzzy) and values of corresponding attributes (e.g., Age, Location, which can be imprecise or precise). Another example is a web search, say via Google. Just click and see how many pages of outcomes will appear when keying in “Population of Beijing” for an exact match and Population of Beijing (notably without quotation marks) for an approximate match. Efforts have been made to deal with queries and web search with imprecise information and approximate match measures, so as to enrich the representation semantics and strengthen the power of search engines [5], [13]–[16]. Generally speaking, approximate match provides flexible queries, which is considered meaningful and


useful in many cases, whereas the size of its outcomes is usually larger than that of the exact match. Thus, the size of query outcomes becomes an issue of concern, especially for approximate match in the context of flexible queries. One notable approach was to provide the users with the top-k results according to a ranking based on a certain evaluation function F between query conditions and all data records. Methods of formulating, calculating, and explaining F, including sampling learning, have been used to deal with the effectiveness and efficiency issues [17], [18]. Sometimes, the top-k results are sufficient for flexible/soft queries with approximate match. However, it is desirable and meaningful if a query/search could also guarantee that the selected results are sufficiently representative as far as all the query results are concerned. The focus of this paper is then to present an approach aimed at possessing two appealing features: one is information compactness with little redundancy; the other is flexible size of query results. In other words, the approach will provide the users with a compact set of query outcomes, in the sense that the compact set: 1) is smaller than and information-equivalent to the set of all query outcomes in light of redundancy and 2) at the same time has its size flexibly specified by users upon their need and preference. In this paper, we will take a database query perspective, namely, flexible queries in possibilistic RDBs. In the first place, RDBs are widely used forms of data repositories for data processing, information retrieval, and web search in either front-end or back-end applications. In the second place, flexible queries and possibilistic databases not only reflect richer semantics in terms of approximate conditions and imprecise information, but also contain crisp queries and classical databases as special cases. Concretely, the compact set is composed of tuples (records), each of which is a representative one in a class whose tuples are close to each other. With possibilistic data, there are mainly two issues to address: 1) how to evaluate tuple closeness and how to extract the representative tuples and 2) how to evaluate the set of representative tuples in terms of compactness. Thus, this paper, aimed at addressing these two issues, introduces: 1) an approach to extracting the representative tuples from large-scale data with a high coverage degree and a low redundancy degree; 2) a measure (namely, relation compactness) to evaluate the compactness of a given relation in light of data redundancy, which is based on information entropy. Note that the extraction algorithm proposed in the approach is a novel effort, in that a center-based extraction method is developed based on the contextual centrality and structural centrality defined in Section III. As will be illustrated in the following sections in more detail, the proposed approach possesses some desirable properties: 1) the extracting operation generates a unique set of representative tuples, which is compact with little redundancy; 2) the size of the representative tuple set can be flexibly specified upon users' preference; and 3) the approach is meaningful and applicable to web search applications. It is noteworthy that


this approach may also be used for classical databases where data/information is precise. Moreover, recent years have witnessed a rapidly increasing need for such properties/features in real-life applications, especially in the information retrieval and e-commerce fields. One example is the need for generating a small set of outcomes that reflect the diversity and richness of the original data. For instance, online reviews/comments are considered important to online shoppers for their buying decisions. However, since in many cases the size of the reviews/comments is very large, obtaining the representative ones that are diverse and cover rich opinions of the reviews is deemed helpful to the potential customers. Another example is the need to generate a small set of outcomes in light of little redundancy. For instance, in mobile service applications (such as mobile phones and pad-type devices), which are popular nowadays, due to the limitation of screen size and navigability, it is neither tolerable nor practical for the users to browse many pages on mobile devices, let alone have lots of redundant information appear. Therefore, compactness and representativeness are desirable. Finally, it is worth mentioning that in information retrieval and queries, recall and precision [19] are two general measures used as evaluation criteria for many traditional approaches and applications including search engines. However, query/search results are usually very large. In this regard, the proposed approach could be used to further generate a smaller set that is compact and representative. Additionally, as can be seen in later sections, the proposed approach performs well in terms of compactness (less redundancy) and coverage (more content), which are, to a certain extent, in a similar spirit to precision and recall, though with different settings and technical treatments. Our earlier attempt was presented in a conference paper [50], where a preliminary notion of relation compactness was introduced; the approach proposed in this paper is a substantial extension of that attempt, with more detailed formulation of the notions, in-depth investigation of the related properties, and extensive experiments using large data. The rest of this paper is organized as follows. Section II provides an overview of some of the notions of possibility distributions and the possibilistic database model. In Sections III and IV, the method of tuple extraction and the notion of relation compactness are discussed in detail. Data experiments are conducted in Section V and some concluding remarks are presented in Section VI.

II. POSSIBILITY DISTRIBUTIONS AND EXTENDED RDB MODEL

A. Possibility Distribution

Possibility is deemed to be a type of uncertainty, which could be semantically linked to fuzziness in concept. Fuzzy logic or fuzzy set theory was incepted by Zadeh [6]. It aims at quantifying and reasoning with imprecision and uncertainty that is common in the real world. A fuzzy set is a generalization of an ordinary set, which allows partial membership rather than merely full membership or non-membership. Concretely, let U be the universe of discourse; a fuzzy set F on U is characterized by a membership function μF : U → [0, 1],


which associates each element u of U with a number μF(u) representing the grade of membership of u in F. μF(u) = 0 means non-membership, μF(u) = 1 means full membership, and μF(u) in (0, 1) means partial membership. Furthermore, Zadeh introduced the notion of possibility distributions, which acts as a fuzzy restriction on the values that may be assigned to a variable [6]. Given a fuzzy set F and a variable X on U, the possibility of X = u, denoted by πX(u), is defined to be equal to μF(u). The possibility distribution of X on U with respect to F is denoted by

πX = {πX(u)/u | u ∈ U, πX(u) = μF(u) ∈ [0, 1]}.   (1)

A possibility distribution πX is called normalized if there exists an element u0 in U such that πX(u0) = 1. Importantly, possibility distributions provide a graded semantics to natural language statements such as “the customer is young” and “the customer has high loyalty,” etc., which are often used in our daily communications and reflect imprecision in data values. Let X and Y be two ordinary sets. A fuzzy relation φ from X to Y (or on X × Y) is defined as a mapping from X × Y to [0, 1], i.e., ∀(u, v) ∈ X × Y, φ(u, v) ∈ [0, 1]. Suppose φ1 and φ2 are two fuzzy relations on X × Y; φ1 ⊆ φ2 means: ∀(u, v) ∈ X × Y, φ1(u, v) ≤ φ2(u, v). Two specific fuzzy relations are of particular interest, namely, a closeness relation and a similarity relation. A closeness relation c is a mapping from X × X to [0, 1] such that ∀u, v ∈ X, c(u, u) = 1 (reflexive) and c(u, v) = c(v, u) (symmetric). A similarity relation s is a mapping from X × X to [0, 1] such that ∀u, v, w ∈ X, s(u, u) = 1 (reflexive), s(u, v) = s(v, u) (symmetric), and s(u, w) ≥ sup_{v∈X} min(s(u, v), s(v, w)) (sup-min transitive). Apparently, a similarity relation is a special case of a closeness relation. Moreover, both the closeness relation and the similarity relation are generalizations of the identity relation (τ) on X × X, where all elements of X are mutually distinct, i.e., τ is a mapping from X × X to {0, 1} such that ∀u, v ∈ X, τ(u, u) = 1 and τ(u, v) = 0 if u and v are not identical.

B. Extended RDB Model

In this paper, the extended possibility-based model is considered to be the underlying database model. It facilitates handling both imprecision in attribute values and closeness in domain elements in terms of possibility distributions (including linguistic terms) and closeness relations, respectively [4], [20]. In light of data representation, this model is deemed to be a general setting compared with two other known models. One is the so-called possibility-based model [4], where attribute values can be imprecisely represented by possibility distributions. The other is the so-called similarity-based model [3], where domain elements can be represented by similarity relations (reflexive, symmetric, and sup-min transitive). Moreover, it is worth mentioning that the classical database model is a special case of the extended possibility-based model, when all attribute values are precise and all domain elements are mutually distinct. In this case, a precise value (e.g., 25 for Age) can be expressed as a possibility distribution of a singleton with a degree of 1 (e.g., {1/25}), and the closeness relation (e.g., cAge)

TABLE I
CLOSENESS RELATION cClass

cClass     Classic   Silver   Gold   Diamond
Classic    1.0       0.5      0.0    0.0
Silver               1.0      0.75   0.25
Gold                          1.0    0.75
Diamond                              1.0

degenerates to the identity relation (e.g., τAge). Specifically, in the extended possibility-based model, a relation R is a subset of Π(D1) × Π(D2) × ··· × Π(Dg), where Π(Di) = {πAi | πAi is a possibility distribution of attribute Ai on domain Di}, and a closeness relation ci is associated with domain Di, where ci is reflexive and symmetric (1 ≤ i ≤ g). In addition, a g-tuple t of R is of the form t(πA1, πA2, ..., πAg), where πAi is the value of the Ai-component of t (1 ≤ i ≤ g). Two example tuples recording two customers' Name, Age, and Class can be (Tony, {0.7/28, 1.0/33, 0.8/36}, diamond) and (Echo, young, silver). The closeness relation cClass on the domain of Class (i.e., DClass = {classic, silver, gold, diamond}) can be predefined or computed by the managers as shown in Table I.

Given two g-tuples tp(πA1, πA2, ..., πAg) and tq(π'A1, π'A2, ..., π'Ag), where πAi and π'Ai are normalized (1 ≤ i ≤ g), an example measure of the closeness of two values on the same domain is [21]

Ec(πAi, π'Ai) = sup_{x,y∈Di, ci(x,y)≥αi} min(πAi(x), π'Ai(y)).   (2)

Here ci is a closeness relation on domain Di, and αi ∈ [0, 1] is a threshold specified by experts or the database managers according to ci. Thus, the tuple closeness of two tuples tp and tq can be considered as

Fc(tp, tq) = min(Ec(πA1, π'A1), ..., Ec(πAg, π'Ag)).   (3)
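To make (2) and (3) concrete, the following is a minimal Python sketch of the value-closeness measure Ec and the tuple-closeness measure Fc. The data layout (possibility distributions as value-to-degree dictionaries, closeness relations as pair-to-degree dictionaries) and all names are illustrative assumptions, not the authors' implementation.

```python
def value_closeness(pi1, pi2, c, alpha):
    """Ec of (2): sup over pairs (x, y) with c(x, y) >= alpha of min(pi1(x), pi2(y))."""
    best = 0.0
    for x, px in pi1.items():
        for y, py in pi2.items():
            # Reflexive, symmetric lookup; unknown pairs default to 0 (not close).
            close = c.get((x, y), c.get((y, x), 1.0 if x == y else 0.0))
            if close >= alpha:
                best = max(best, min(px, py))
    return best


def tuple_closeness(t1, t2, relations, alphas):
    """Fc of (3): the minimum of the attribute-wise closeness degrees."""
    return min(value_closeness(p1, p2, c, a)
               for p1, p2, c, a in zip(t1, t2, relations, alphas))


# Tuples over (Age, Class), in the spirit of the Tony/Echo example; "young" is
# modeled here as an assumed possibility distribution over ages.
tony = ({28: 0.7, 33: 1.0, 36: 0.8}, {"diamond": 1.0})
echo = ({20: 0.8, 25: 1.0, 30: 0.6}, {"silver": 1.0})
c_age = {(28, 30): 0.8, (25, 28): 0.7}          # assumed closeness on ages
c_class = {("silver", "diamond"): 0.25}          # from Table I
print(tuple_closeness(tony, echo, [c_age, c_class], [0.6, 0.2]))
```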

Note that the idea of comparing two representations πA and π'A was first proposed by Prade and Testemale [4]. Since then, a number of ways to measure such data closeness have been proposed [22], [23], including some comparative studies of the different proposals [20].

III. TUPLE EXTRACTION

A. Evaluation of Tuple Closeness

Let X, Y, and Z be ordinary nonempty sets, φ be a fuzzy relation on X × Y, and η be a fuzzy relation on Y × Z; then the sup-min composition of φ and η is a fuzzy relation on X × Z defined as follows (denoted by δ = φ × η) [6]:

δ(u, w) = φ × η(u, w) = sup_{v∈Y} min(φ(u, v), η(v, w)).   (4)

If X, Y, and Z are finite sets, then δ is a max–min composition of φ and η. If φ is a fuzzy relation on X × X, then φ^2 = φ × φ and, generally, φ^(n+1) = φ × φ^n for n ≥ 1. For the extended possibility-based database, suppose we have n tuples; then the pair-wise tuple closeness can be represented in an n × n closeness matrix M = (e_ij)_{n×n}, where


TABLE II
CUSTOMER'S AGE AND SALARY

Tuples   Age                               Salary ($)
t1       0.9/20, 0.6/30, 0.6/40, 0.1/50    0.9/600, 0.8/700, 0.2/800, 0.1/900
t2       0.9/20, 0.7/30, 0.6/40, 0.1/50    0.9/600, 0.8/700, 0.2/800, 0.2/900
t3       0.6/20, 0.7/30, 0.7/40, 0.3/50    0.8/600, 0.8/700, 0.2/800, 0.2/900
t4       0.1/20, 0.2/30, 0.2/40, 0.7/50    0.1/600, 0.2/700, 0.7/800, 0.6/900
t5       0.1/20, 0.2/30, 0.3/40, 0.9/50    0.1/600, 0.3/700, 0.9/800, 0.9/900
t6       0.6/20, 0.6/30, 0.7/40, 0.3/50    0.7/600, 0.6/700, 0.2/800, 0.2/900
t7       0.1/20, 0.1/30, 0.3/40, 0.9/50    0.1/600, 0.1/700, 0.6/800, 0.7/900
t8       0.6/20, 0.6/30, 0.7/40, 0.1/50    0.6/600, 0.7/700, 0.2/800, 0.1/900

the element e_ij ∈ [0, 1] is the closeness degree between tuples t_i and t_j and could be derived through (3). Here, the computation of e_ij for M using (3) considers two types of uncertainty in the database concerned. One is the imprecision of attribute values represented by possibility distributions such as π_ik. The other is the closeness of domain elements represented by c_k and α_k. When neither of the two types is present in the database, the equality measure Ec degenerates to “=,” with M becoming a binary matrix of 0 and 1 values. Moreover, the notion of M could be extended into the web-search environment [24]–[26], where e_ij reflects the closeness degree of two search records resulting from a retrieval of two records on the web. In particular, there are computationally effective ways to compare two text records, e.g., via keywords by cosine similarity [27], so as to assess the extent of the approximate match in [0, 1]. Further, for 0 ≤ λ ≤ 1, the λ-cut matrix of the closeness matrix M can be derived, denoted as Mλ = (e^λ_ij)_{n×n}, where if e_ij ≥ λ, then e^λ_ij = 1, and otherwise e^λ_ij = 0. It is known that if M is a closeness relation, then its transitive closure M+ is a similarity relation (reflexive, symmetric, and max–min transitive). According to the definition of transitive closure, M+ could be computed by a series of max–min compositions of M with itself using (4), i.e., M^2 = M × M, M^3 = M × M^2, ..., until there is an integer p such that M^p = M^(p+1). M+ converges within n − 1 compositions [28]

M+ = M^p = M^(p+1),   p ≤ n − 1.   (5)
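A small sketch (illustrative, not the authors' code) of the max–min composition in (4) and the transitive closure in (5), obtained by repeated composition until the matrix stops changing.

```python
def max_min_compose(a, b):
    """(4): (a x b)[i][j] = max_k min(a[i][k], b[k][j]) for square matrices (lists of lists)."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(m):
    """(5): compute M, M^2, M^3, ... until M^p = M^(p+1); converges within n - 1 steps."""
    current = m
    while True:
        nxt = max_min_compose(current, m)
        if nxt == current:
            return current
        current = nxt
```

For a reflexive and symmetric closeness matrix M, the returned matrix is the similarity relation M+ discussed above.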

An important theorem states that M + is a similarity relation if and only if every λ-cut Mλ+ is an equivalence relation (reflexive, symmetric, and transitive) [29]. This implies that if M + is a similarity relation, then the equivalence relation Mλ+ can group the domain concerned into some equivalence classes. Thus, given a relation R = {t1 , t2 , . . ., tn } and the threshold λ, we have the following steps to group R. 1) Calculate the closeness relation M with (3).

Fig. 1. Closeness matrix and transitive closure of Example 1.

Fig. 2. λ-cut matrixes of Example 1.

2) Calculate the transitive closure M+ according to (4) and (5) (for algorithmic details, please refer to [30] and [31]).
3) Obtain the λ-cut M+λ according to the given threshold λ. If M+λ(i, j) = 1, then ti and tj are grouped into the same equivalence class; otherwise they fall into two different classes.

Definition 1: Given a relation R = {t1, t2, ..., tn} with equivalence classes C1, C2, ..., Cm according to M+λ, two tuples ti and tj are called λ-close if they are in the same equivalence class, and not λ-close if they are in different equivalence classes.

It is clear that classical tuple identity is a special case of λ-closeness: in the classical situation, λ equals 1 and λ-close means that two tuples are crisply identical.

Example 1: Suppose we query customers' records about age and salary from a database. The results are shown in Table II. According to (3)–(5), the closeness matrix M and its transitive closure M+ are computed in Fig. 1. Fig. 2 shows the corresponding λ-cut matrixes (λ = 0.7). With λ = 0.7, two equivalence classes are obtained: C1 = {t1, t2, t3, t6, t8} and C2 = {t4, t5, t7}, indicating that the tuples within C1 or C2 are λ-close to each other according to Definition 1. In addition, if λ decreases, the number of equivalence classes decreases. Conversely, for a larger value of λ, C1 or C2 will be divided into several equivalence classes and the number of classes increases.

According to the notion of λ-close in Definition 1, the following appealing properties stated in Proposition 1 hold.

Proposition 1: Given a relation R = {t1, t2, ..., tn} and the threshold λ, then: 1) ti and tj are λ-close if and only if there exists a finite sequence ts1, ts2, ..., tsw in R such that M(i, s1) ≥ λ, M(s1, s2) ≥ λ, ..., M(sw, j) ≥ λ and w ≤ n − 1; 2) ti and tj are not λ-close if and only if there exists no finite sequence ts1, ts2, ..., tsw in R such that M(i, s1) ≥ λ, M(s1, s2) ≥ λ, ..., M(sw, j) ≥ λ and w ≤ n − 1.


Fig. 3. λ-close structure network of Example 1.

Proof: According to Definition 1, if ti and tj are λ-close, then M+(i, j) ≥ λ. M+ is a series of max–min compositions of M, so M+ = M × M^(p−1) and M+(i, j) = max_k(min(M(i, k), M^(p−1)(k, j))). Since M+(i, j) ≥ λ, there exists k such that M(i, k) ≥ λ and M^(p−1)(k, j) ≥ λ; mark this k as s1. Further, M^(p−1) = M × M^(p−2) and M^(p−1)(s1, j) = max_k(min(M(s1, k), M^(p−2)(k, j))). Since M^(p−1)(s1, j) ≥ λ, there exists k such that M(s1, k) ≥ λ and M^(p−2)(k, j) ≥ λ; mark this k as s2. The rest can be shown in the same manner, so there exists a sequence ts1, ts2, ..., tsw in R such that M(i, s1) ≥ λ, M(s1, s2) ≥ λ, ..., M(sw, j) ≥ λ. Since p ≤ n − 1, the length of the sequence is finite and w ≤ n − 1. Conversely, if there exists a finite sequence ts1, ts2, ..., tsw in R such that M(i, s1) ≥ λ, M(s1, s2) ≥ λ, ..., M(sw, j) ≥ λ, then in step 1 we can deduce that M^2(s_(w−1), j) ≥ λ because M(s_(w−1), s_w) ≥ λ and M(s_w, j) ≥ λ. In step 2, since M(s_(w−2), s_(w−1)) ≥ λ and M^2(s_(w−1), j) ≥ λ, we have M^3(s_(w−2), j) ≥ λ. The remaining steps can be shown in the same manner, yielding M^(w+1)(i, j) ≥ λ. Since w ≤ n − 1 and the powers of M are non-decreasing up to the closure, M+(i, j) ≥ M^(w+1)(i, j) ≥ λ, so ti and tj are λ-close. Thus, Proposition 1(1) holds. (2) This is the contrapositive of Proposition 1(1); thus, Proposition 1(2) also holds.

According to Proposition 1, each pair of λ-close tuples (ti and tj) within the same equivalence class is connected by a sequence (ts1, ts2, ..., tsw). Since for any pair of tuples the value in M+ is no less than the corresponding value in M [30], we have M+(i, s1) ≥ λ, M+(s1, s2) ≥ λ, ..., M+(sw, j) ≥ λ. The tuples ts1, ts2, ..., tsw in the sequence therefore also belong to the same equivalence class that contains ti and tj. For a given equivalence class, if M(i, j) ≥ λ, we draw a line between ti and tj. Thus, we can construct an undirected network to represent the structure of the equivalence class. If ti and tj are λ-close, ti and tj are connected in the network. In addition, if ti and tj are not connected in the network, ti and tj must not be λ-close. It can be inferred that any pair of tuples within the same equivalence class is connected in the network. In this paper, this kind of network is called the λ-close structure network. Fig. 3 shows the λ-close structure network of Example 1.

As a matter of fact, the λ-close equivalence classes and the closeness matrix M give us a contextual view of the relation R about which is λ-close to which, which means that any two tuples in the same equivalence class are λ-close to each other. Furthermore, the λ-close structure network is considered to provide us with a structural view of R in terms of how any two tuples in the same equivalence class are linked to be λ-close. For example, we know that t1 and t8 are λ-close through t2 and t3 in Fig. 3. If t2 or t3 did not exist, t1 and t8 would not be λ-close. Moreover, we can tell which tuple is at the structural center of the network and contributes most to constructing the λ-close equivalence class. That is important for gaining a better understanding of the equivalence classes.
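The grouping steps and the λ-close structure network can be rendered roughly as follows (an illustrative sketch under assumed data layouts; M+ can be obtained, e.g., with a routine like the closure sketch after (5)).

```python
def lambda_cut(matrix, lam):
    """Lambda-cut of a fuzzy matrix: 1 where the degree is >= lambda, 0 otherwise."""
    return [[1 if v >= lam else 0 for v in row] for row in matrix]


def equivalence_classes(m_plus, lam):
    """Step 3: group tuple indices by the lambda-cut of the transitive closure M+."""
    cut = lambda_cut(m_plus, lam)
    classes, assigned = [], set()
    for i in range(len(cut)):
        if i in assigned:
            continue
        cls = [j for j in range(len(cut)) if cut[i][j] == 1]
        classes.append(cls)
        assigned.update(cls)
    return classes


def structure_network(m, lam, cls):
    """Edges of the lambda-close structure network: pairs inside a class with M(i, j) >= lambda."""
    return [(i, j) for a, i in enumerate(cls) for j in cls[a + 1:] if m[i][j] >= lam]
```

Because M+λ is an equivalence relation, each row of its λ-cut directly lists the members of that tuple's class, so the loop above recovers C1, ..., Cm.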

B. Extraction of Representative Tuples

When it is known which tuples in a relation with imprecision are λ-close and how they are linked in being λ-close, the next problem of concern is how to extract the representative tuples. It is considered ideal to have fewer tuples if they carry the same amount of original information. In the crisp case, if there are identical tuples, we may simply keep one of them and delete the rest, which will not cause any information loss. However, in imprecise databases and queries, the problem is much more complex. Usually, due to the existence of tuple closeness, obtaining a smaller set of query outcomes becomes an effort of extracting representative tuples in light of “information equivalence.” Based on the closeness evaluation in Section III-A, here we consider a center-based method for extracting the representative tuples. Conceptually, all the tuples in the same equivalence class are regarded as expressing approximately the same information, and therefore one “central” tuple could be extracted to represent the class. As illustrated in Section III-A, given the relation R and m equivalence classes C1, C2, ..., Cm, there are two dimensions of information about R, namely, the contextual dimension and the structural dimension. We take both into consideration in the process of extraction in terms of contextual centrality and structural centrality.

1) Contextual Centrality: This describes the degree to which a given tuple is at or near the center of an equivalence class in the contextual dimension. In this paper, the contextual centrality is formulated by the average closeness measured using the objects' attributes, which are generally context-informative. Given the relation R, equivalence classes C1, C2, ..., Cm, and ni (the number of tuples in Ci), the contextual centrality of tuple tk in Ci is defined as the average closeness of tk to Ci, i.e., O_k^i ∈ [0, 1]

O_k^i = (1/ni) Σ_{j=1}^{ni} Fc(tk, tj).   (6)

Thus, if a tuple tp has the largest contextual centrality (O_p^i = max_k O_k^i), tp is considered to be the center of the contextual dimension. Compared to any other tuple in Ci, tp can represent most of the context information in the equivalence class.

2) Structural Centrality: This describes the degree to which a tuple is at or near the center of an equivalence class in the structural dimension. If a tuple has the largest structural centrality, it is the center of the equivalence class in the structural dimension. It contributes more to constructing the λ-close equivalence class than any other tuple, which means that most pairs of tuples become λ-close through this tuple, and without it many tuples would not be included in the equivalence class. This tuple can represent most connections


Fig. 4. Contextual and structural centers of C1 in Example 1.

of all the tuples, i.e., it can represent most of the structural information in the equivalence class. As illustrated in Section III-A, the λ-close structure network abstracted from R can represent the structural information of R, so structural centrality information could be obtained from the λ-close structure network. We describe the structural centrality of the λ-close structure network in the spirit of betweenness centrality [32], [33], which is widely used in graph theory and network analysis for finding the centrality of a complex network in biology, sociology, etc. [34], [35]. For a tuple tk in an equivalence class Ci, the structural centrality S_k^i is defined as

S_k^i = Σ_{tp ≠ tq ≠ tk ∈ Ci} σ_pq(k)/σ_pq   (7)

where σ_pq denotes the number of shortest paths from tp to tq in the λ-close structure network, and σ_pq(k) denotes the number of shortest paths from tp to tq that tk lies on. The larger S_k^i is, the more paths tk lies on. In other words, many sequences between tuples would be cut off without tk, and this would result in a great loss to the equivalence class. Therefore, the tuple with large structural centrality is deemed reasonable for constructing the equivalence class. If a tuple tp has the largest structural centrality (S_p^i = max_k S_k^i), tp is considered to be the center of the structural dimension. In addition, the structural centrality is normalized to lie between zero and one by dividing it by its upper limit (n − 1)(n − 2)/2.

The centers of the contextual and structural dimensions are usually different, especially for equivalence classes of large scale. For example, Fig. 4 shows the contextual center and structural center in the equivalence class C1 of Example 1. We use the contextual centrality and structural centrality together to describe whether a tuple is in the center of an equivalence class. In the spirit of the Fβ measure [19], [48], [49], which is widely used to combine precision and recall (i.e., Fβ = 1/(α/Recall + (1 − α)/Precision)), a combined centrality, namely Fcentrality_k^i, could be defined as follows:

Fcentrality_k^i = 1/(α/O_k^i + (1 − α)/S_k^i)   (8)

where α ∈ [0, 1] reflects users' preference. If α = 0, Fcentrality_k^i degenerates into S_k^i, meaning that users only consider the structural centrality. If α = 1, Fcentrality_k^i degenerates into O_k^i, meaning that users only consider the contextual centrality. If α = 0.5, users treat the contextual centrality and the structural centrality equally. To avoid the problem of indefinite values, if O_k^i = 0 or S_k^i = 0, we may use a real number close to 0 (e.g., 10^−10) to replace 0 when implementing (8). Note that

TABLE III
CENTRALITY VALUES OF EXAMPLE 1 (α = 0.5)

Tuples   O_k^i   S_k^i   Fcentrality_k^i
t1       0.740   0.000   0.000
t2       0.760   0.500   0.603
t3       0.740   0.833   0.784
t4       0.767   0.000   0.000
t5       0.800   1.000   0.889
t6       0.700   0.000   0.000
t7       0.767   0.000   0.000
t8       0.700   0.000   0.000

the value of Fcentrality_k^i gets sufficiently large if both the contextual centrality and the structural centrality are large enough [49], which is regarded as intuitively appealing. Every tuple's centrality can be computed directly through Fcentrality_k^i. Thus, the tuple with the largest Fcentrality is extracted to represent the equivalence class. Let us take the tuples in Example 1 as an example to illustrate the extraction process. Table III shows the values of O_k^i, S_k^i, and Fcentrality_k^i for the tuples in Example 1. In C1, the Fcentrality of t3 is the largest, so t3 is extracted to represent C1. In C2, the Fcentrality of t5 is the largest, so t5 is extracted to represent C2. Therefore, the extraction result is {t3, t5}. In addition, if there are several tuples with the same largest Fcentrality in an equivalence class, we choose one of them according to expert knowledge or user request to represent the λ-close equivalence class. The pseudo-code of the extraction algorithm is shown in Table IV.

As far as the complexity is concerned, the traditional algorithm to complete the closeness matrix calculation (Step 2) and the grouping operation (Step 3) is of O(n^2) time complexity and O(n^2) space complexity [30]. Notably, a more efficient algorithm [36] has been developed to avoid unnecessary computation in the closeness matrix generation and grouping operations, so that the overall time complexity could be reduced to O(n) in the best case and the overall space complexity is reduced to O(n). As for the complexity of the extraction process (Steps 4–16), the calculation of the contextual centrality costs O(n^2) in time and O(n) in space. For structural centrality, so far the most effective algorithm is of O(np) in time and O(n + p) in space [33], where p is the number of edges in the corresponding λ-close structure network. Thus, the overall time complexity of the extraction algorithm is O(n^2 + np) and the space complexity is O(n + p). In addition, with a newly incremental tuple, since the grouping in [36] is of O(1) in time in the best case and O(n log n) in the worst case, the algorithm costs O(n log(n) + np) time and O(n) space.
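As a concrete reading of (6)–(8), the following sketch scores the tuples of one equivalence class: contextual centrality as the average tuple closeness, structural centrality as normalized betweenness in the λ-close structure network (computed here by brute-force shortest-path counting for clarity, not the O(np) algorithm of [33]), and their combination Fcentrality. All names and data layouts are illustrative assumptions; the eps guard mirrors the paper's replacement of zero values.

```python
from collections import deque


def bfs_counts(adj, s):
    """Distances and numbers of shortest paths from s in an unweighted, undirected graph."""
    dist, sigma, queue = {s: 0}, {s: 1}, deque([s])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w], sigma[w] = dist[v] + 1, 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def class_centralities(cls, closeness, adj, alpha=0.5, eps=1e-10):
    """Fcentrality (8) for each tuple index in cls, from O (6) and normalized S (7)."""
    n = len(cls)
    bfs = {s: bfs_counts(adj, s) for s in cls}
    scores = {}
    for k in cls:
        o = sum(closeness[k][j] for j in cls) / n                     # eq. (6)
        dist_k, sigma_k = bfs[k]
        raw = 0.0
        for p in cls:                                                 # eq. (7)
            dist_p, sigma_p = bfs[p]
            for q in cls:
                if len({p, q, k}) < 3 or q not in dist_p or k not in dist_p or q not in dist_k:
                    continue
                if dist_p[k] + dist_k[q] == dist_p[q]:                # k lies on a shortest p-q path
                    raw += sigma_p[k] * sigma_k[q] / sigma_p[q]
        s = (raw / 2.0) / ((n - 1) * (n - 2) / 2.0) if n > 2 else 0.0
        scores[k] = 1.0 / (alpha / max(o, eps) + (1 - alpha) / max(s, eps))   # eq. (8)
    return scores


# The representative of the class is then max(scores, key=scores.get).
```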

C. More Properties for Tuple Extraction

In flexible data queries, whether the extracted result is unique and whether the representative tuples are free of redundancy are important concerns.


TABLE IV
EXTRACTION ALGORITHM

Algorithm:
Input: Query/Search Results T = {t1, t2, …, tn}; Specified Threshold λ
Output: Extraction Results E = {ti1, ti2, …, tim}
Begin:
1.  E = Φ
    // Compute the closeness matrix M of the query/search results.
2.  M = Compute_Closeness_Matrix(T)
    // Group the query/search results into λ-close equivalence classes.
3.  {C1, C2, …, Cm} = Group_Equivalence_Classes(M, λ)
    // Compute the centrality value of every tuple within each equivalence class.
4.  for Ci in {C1, C2, …, Cm} do
5.      Rep = 0
6.      Max_centrality = 0
7.      for tk in Ci
8.          Oik = Contextual_Centrality(tk, Ci)
9.          Sik = Structural_Centrality(tk, Ci)
10.         Fcentralityik = Compute_Centrality(Oik, Sik)
11.         if (Fcentralityik > Max_centrality) then
12.             Rep = tk; Max_centrality = Fcentralityik
13.         end if
14.     end for
15.     E = E + {Rep}
16. end for
17. Output_Extraction_Results(E)

1) Resultant Relation Containing the Representative Tuples is Unique: For the uniqueness of the tuple extraction treatment, according to Section III-A, we have a unique equivalence grouping result for any given relation R. According to the extraction method described in Section III-B, for each equivalence class we keep only one tuple, the one with the largest value of Fcentrality, to represent the λ-close equivalence class. Thus, the process yields only one unique result.

2) Representative Tuples, Each from a Class, are Mutually Distinct at the Specified Degree: According to Definition 1, the representative tuples, each from a class, are not λ-close, which means that for any representative tuples ti and tj there exists no finite sequence ts1, ts2, ..., tsw such that M(i, s1) ≥ λ, M(s1, s2) ≥ λ, ..., M(sw, j) ≥ λ and w ≤ n − 1, according to Proposition 1. As a result, M(i, j) < λ (the case w = 0). Thus, the representative tuples are mutually distinct at the degree of λ.

3) Resultant Relation is Information Equivalent to the Collection of Original Data Classes: Two representations are information equivalent if the transformation from one to the other entails no loss of information, i.e., if each can be constructed from the other [37]. In other words, the information in one is also inferable from the other, and vice versa [38], [39]. Here, “information equivalence” in our discussion could be described as follows. Suppose there is a relation R with imprecision, and a flexible query result against R is relation R1. The grouped relation is R2 and the resultant relation containing the representative tuples is R3. Compared with the original relation R1 of the query result, R2 and R3 contain the classification information. We can say that, in terms of λ-close, R2 and R3 are information equivalent.

This is based on the discussion in Section III-A. If we have R2, then R3 can be derived (i.e., via extracting). If we have R3, suppose R3 = {tu, tw, tx, ...}; since the tuples of R3 are pairwise not λ-close, every tuple of R3 can be considered as a class, labeled Cu, Cw, Cx, .... For every tuple ti of R, if ti satisfies the flexible query conditions, it can be compared with the tuples in the labeled classes Cu, Cw, Cx, ...; if there is a tuple t in one labeled class, e.g., Cu, such that ti and t are λ-close, then ti is added to Cu. Finally, we obtain relation R2, as the tuples in R3 and the equivalence classes in R2 are bijective.

IV. RELATION COMPACTNESS

In addition to the closeness of any two tuples, it is also of interest to investigate the compactness of a given relation R. Here, the compactness of a relation refers to its degree of non-redundancy. For a given relation, the more redundant tuples it contains, the lower the degree of compactness. Here, a new measure, relation compactness, is proposed to evaluate how compact a given relation R is. This measure possesses some good properties. Let us first consider it in classical relations, and then extend it to relations in possibilistic databases.

A. Relation Compactness in Classical Relations

In information theory [40], consider a single discrete information source; it may produce different kinds of symbol sets A = {a1, a2, ..., an}. For each possible symbol set A there will be a set of probabilities pi of producing the various possible symbols ai (Σ pi = 1), where these symbols are assumed successive and independent. Thus, there is an entropy Hi for each ai. The entropy of this given piece of information is defined as the average of these Hi weighted in accordance with the probability of occurrence of the symbols in question

H = H(p1, p2, ..., pn) = −Σ_{i=1}^{n} pi Hi = −Σ_{i=1}^{n} pi log pi   (9)

where the default log base is 2. Similarly to [41], if there is a relation S, a set of classical tuples, which can be divided into m classes C1, C2, ..., Cm, the probability of a random tuple belonging to class Ci is ni/n, where ni is the number of tuples in Ci and n is the number of tuples in S; ni/n is also called the probability of class Ci in S. The expected information for classifying this given relation S is

H(n1, n2, ..., nm) = −Σ_{i=1}^{m} pi Hi = −Σ_{i=1}^{m} (ni/n) log(ni/n).   (10)
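A small sketch of the expected classification information in (10) (illustrative; log base 2 as stated above):

```python
from math import log2


def class_entropy(sizes):
    """H(n1, ..., nm) of (10): entropy of the class proportions n_i / n, in bits."""
    n = sum(sizes)
    return -sum((ni / n) * log2(ni / n) for ni in sizes if ni > 0)


# A relation of 8 tuples split into classes of sizes (5, 1, 1, 1) carries about
# 1.549 bits of classification information.
print(class_entropy([5, 1, 1, 1]))
```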

Given a classical relation R = {t1, t2, ..., tn}, a distinct tuple can be considered as a class with only one element, and any tuples that are identical to each other can be considered to be in the same class. We define relation compactness as follows to describe the degree to which a given relation is compact.


Definition 2: Let R = {t1, t2, ..., tn} be a classical relation with n tuples, let R be divided into m classes C1, C2, ..., Cm according to tuple identity (i.e., all tuples in Ci are identical to each other), and let ni be the number of tuples in Ci. Then the relation compactness of R is defined as

RCclassical(n1, n2, ..., nm) = H(n1, n2, ..., nm)/log n = −[Σ_{i=1}^{m} (ni/n) log(ni/n)]/log n.   (11)

For a relation R, its degree of compactness (RCclassical) reflects the extent to which the tuples of R are not redundant, measured by the “amount of information” in R. The higher RCclassical is, the less redundant the tuples in R are, meaning that R contains more information. We have the following intuitively appealing properties stated in Proposition 2.

Proposition 2: Given a classical relation R = {t1, t2, ..., tn}, if it is divided into m classes C1, C2, ..., Cm according to tuple closeness/identity, then: 1) all the tuples in R are mutually distinct, i.e., m = n, n1 = n2 = ... = nm = 1, if and only if RCclassical(n1, n2, ..., nm) = 1; 2) all the tuples in R are identical, i.e., m = 1, n1 = n, if and only if RCclassical(n1) = 0; 3) 0 ≤ RCclassical(n1, n2, ..., nm) ≤ 1; 4) if m is fixed and n1 = n2 = ··· = nm, then RCclassical(n1, n2, ..., nm) decreases whenever n increases; 5) if n is fixed, then RCclassical decreases when any two of the original classes are merged into one class.

Proof: It has been proved that 0 ≤ H(p1, p2, ..., pn) ≤ log n, where H(p1, p2, ..., pn) = log n if and only if pi = 1/n, i = 1, 2, ..., n, and H(p1, p2, ..., pn) = 0 if and only if all the pi but one are 0 [40].

1) RCclassical(n1, n2, ..., nm) = H(n1, n2, ..., nm)/log n = 1 if and only if H(n1, n2, ..., nm) = log n; as H(n1, n2, ..., nm) = log n if and only if m = n and ni = 1, i = 1, 2, ..., m, we have RCclassical(n1, n2, ..., nm) = 1 if and only if m = n, n1 = n2 = ··· = nm = 1.

2) RCclassical(n1, n2, ..., nm) = H(n1, n2, ..., nm)/log n = 0 if and only if H(n1, n2, ..., nm) = 0; as H(n1, n2, ..., nm) = 0 if and only if m = 1 and n1 = n, we have RCclassical(n1, n2, ..., nm) = 0 if and only if m = 1, n1 = n.

3) As 0 ≤ H(p1, p2, ..., pn) ≤ log n, we have 0 ≤ RCclassical(n1, n2, ..., nm) = H(n1, n2, ..., nm)/log n ≤ 1.

For the sake of convenience, (11) can be rewritten as

RCclassical(n1, n2, ..., nm) = −[Σ_{i=1}^{m} (ni/n) log(ni/n)]/log n = [log Π_{i=1}^{m} (n/ni)^(ni/n)]/log n.   (12)

4) As m is fixed and n1 = n2 = ··· = nm, we have n1 = n2 = ··· = nm = n/m, so

RCclassical(n1, n2, ..., nm) = [log Π_{i=1}^{m} (n/ni)^(ni/n)]/log n = log m/log n.   (13)

When n (n > 1) increases, log n increases and RCclassical(n1, n2, ..., nm) decreases.

5) With n fixed, suppose classes Cp and Cq are merged into class Cl, 1 ≤ p, q ≤ m; then nl = np + nq. RCclassical decreases upon merging if and only if the following chain of equivalent inequalities holds:

[log(Π_{i=1,i≠p,q}^{m} (n/ni)^(ni/n) · (n/np)^(np/n) · (n/nq)^(nq/n))]/log n ≥ [log(Π_{i=1,i≠p,q}^{m} (n/ni)^(ni/n) · (n/nl)^(nl/n))]/log n
⇔ (n/np)^(np/n) · (n/nq)^(nq/n) ≥ (n/nl)^(nl/n)
⇔ (np/n)^(np/n) · (nq/n)^(nq/n) ≤ ((np + nq)/n)^((np+nq)/n)
⇔ 1 ≤ ((np + nq)/np)^(np/n) · ((np + nq)/nq)^(nq/n)   (14)

and the last inequality always holds, since (np + nq)/np ≥ 1 and (np + nq)/nq ≥ 1. Hence RCclassical decreases when any two of the original classes are merged into one class.

Example 2: 1) Suppose there is a classical relation R = {t1, t2, t3, t4}; then the relation compactness of R can be computed depending on how the tuples in R are identical or distinct: a) if all the tuples in R are distinct, then m = 4, n1 = n2 = n3 = n4 = 1, and RCclassical(1, 1, 1, 1) = −(1/4 log(1/4) + 1/4 log(1/4) + 1/4 log(1/4) + 1/4 log(1/4))/log 4 = 1; b) if t1 = t2, then m = 3, n1 = 2, n2 = n3 = 1, and RCclassical(2, 1, 1) = −(2/4 log(2/4) + 1/4 log(1/4) + 1/4 log(1/4))/log 4 = 0.75; c) if t1 = t2 and t3 = t4, then m = 2, n1 = n2 = 2, and RCclassical(2, 2) = −(2/4 log(2/4) + 2/4 log(2/4))/log 4 = 0.5; and d) if t1 = t2 = t3, then m = 2, n1 = 3, n2 = 1, and RCclassical(3, 1) = −(3/4 log(3/4) + 1/4 log(1/4))/log 4 = 0.41. 2) Suppose there are two classical relations R1 = {a1, a2, ..., a8} and R2 = {b1, b2, ..., b8}, each being divided into four classes, namely (2, 2, 2, 2) and (5, 1, 1, 1), respectively. Then RCclassical(2, 2, 2, 2) = 0.67 and RCclassical(5, 1, 1, 1) = 0.52, revealing that R1 is more compact than R2. In other words, once we pick one tuple from a relation, the chance that a second picked tuple is redundant to the first is higher in R2 than in R1.

B. Relation Compactness for Relations With Imprecision

Let us consider the possibilistic extension. Suppose there are n tuples with possibility distributions involved in R = {t1, t2, ..., tn}, where these tuples can be not only identical or distinct but also close to each other. Thus, to describe the degree to which a given relation with imprecision is compact, the concept of relation compactness in the classical context could be extended to cope with the closeness of
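Definition 2 and Example 2 can be checked with a few lines (a sketch, not the authors' code); the printed values match the ones worked out in Example 2.

```python
from math import log2


def rc_classical(sizes):
    """Relation compactness of (11): H(n1, ..., nm) / log n for class sizes n1, ..., nm."""
    n = sum(sizes)
    if n <= 1:
        return 0.0
    h = -sum((ni / n) * log2(ni / n) for ni in sizes if ni > 0)
    return h / log2(n)


print(round(rc_classical([1, 1, 1, 1]), 2))   # 1.0   all tuples distinct
print(round(rc_classical([2, 1, 1]), 2))      # 0.75
print(round(rc_classical([2, 2]), 2))         # 0.5
print(round(rc_classical([3, 1]), 2))         # 0.41
print(round(rc_classical([2, 2, 2, 2]), 2))   # 0.67
print(round(rc_classical([5, 1, 1, 1]), 2))   # 0.52
```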


these tuples. There is a basic difference in computing the relation compactness for a relation with distinct tuples and for a relation with close tuples. In the classical context, all the tuples in Ci are identical, and ni is the number of identical tuples; ni/n in RCclassical is the probability of a random tuple belonging to the class Ci. In the possibilistic context, a tuple tk may not totally belong to class Ci, but belong to class Ci to a certain degree, e.g., O_k^i ∈ [0, 1], which is the average closeness of tk to Ci described in (6). Thus, we will use Σ_k O_k^i instead of ni in the extended relation compactness.

Definition 3: Given a relation with imprecision R = {t1, t2, ..., tn}, where R can be divided into m classes C1, C2, ..., Cm according to λ-closeness (i.e., every tuple in Ci is λ-close to each other), let u_i = Σ_k O_k^i (the Σcount operation over the tuples tk of Ci) be the “effective number” of tuples in class Ci, and let N = Σ_{i=1}^{m} u_i. The relation compactness of R is

RC(u1, u2, ..., um) = −[Σ_{i=1}^{m} (ui/N) log(ui/N)]/log N.   (15)

RC could be used to investigate the compactness of a given relation R in classical or extended RDBs. The higher the RC value, the fewer redundant tuples the relation R contains. In addition, the following proposition holds for RC.

Proposition 3: Given a relation R = {t1, t2, ..., tn} and λ, where it is divided into m classes C1, C2, ..., Cm according to λ-closeness, and ui is the same as in Definition 3, then: 1) R is not λ-close, i.e., m = n, u1 = u2 = ··· = um = 1, if and only if RC(u1, u2, ..., um) = 1; 2) if N is fixed, then RC decreases when any two of the original classes (e.g., Cp, Cq) are merged into one class Cl according to λ-closeness (for example, if λ decreases, then two classes Cp and Cq may be merged into one class Cl); 3) 0 ≤ RC(u1, u2, ..., um) ≤ 1; 4) RC(u1, u2, ..., um) = RCclassical(n1, n2, ..., nm) in the case of a classical relation.

Proof: 1) Referring to Proposition 2(1), RC(u1, u2, ..., um) = 1 if and only if m = n, u1 = u2 = ··· = um = 1.

2) If λ decreases, then two tuples tk, tj that originally belong to classes Cp, Cq, respectively, may belong to the same class; thus, Cp, Cq can be merged into one class Cl. According to the definition of ui, ui (i ≠ p, q) remains unchanged after merging. Since N is fixed, ul = up + uq. Then the following holds:

1 ≤ ((up + uq)/up)^(up/N) · ((up + uq)/uq)^(uq/N)
⇒ (N/up)^(up/N) · (N/uq)^(uq/N) ≥ (N/ul)^(ul/N).   (16)

TABLE V
PAIRED T-TEST AND FRIEDMAN TEST ON UCI DATASETS (RC DEGREE)

Assumptions              t Value    Significance   χ2 Value   Significance
Represent > Sequential   97.557     ***            30.000     ***
Represent > Heuristic    109.472    ***            30.000     ***

Notes: *: p < 0.05; **: p < 0.01; ***: p < 0.001.

TABLE VI
PAIRED T-TEST AND FRIEDMAN TEST ON UCI DATASETS (COVERAGE DEGREE)

Assumptions              t Value    Significance   χ2 Value   Significance
Represent > Sequential   9.863      ***            25.000     ***
Represent > Heuristic    3.235      **             5.556      *

Notes: *: p < 0.05; **: p < 0.01; ***: p < 0.001.

Before the operation of merging, RC is

RCbefore = [log(Π_{i=1,i≠p,q}^{m} (N/ui)^(ui/N) · (N/up)^(up/N) · (N/uq)^(uq/N))]/log N.   (17)

After the merging operation, RC is

RCafter = [log(Π_{i=1,i≠p,q}^{m} (N/ui)^(ui/N) · (N/ul)^(ul/N))]/log N.   (18)

According to (16)–(18), RCbefore ≥ RCafter. Thus RC decreases when any two of the original classes are merged into one class.

3) Referring to Proposition 2(3), 0 ≤ RC(u1, u2, ..., um) ≤ 1.

4) As the classical relation is a special case of a relation with imprecision, if all the tuples in R are classical tuples, then every tuple within a class Ci is equal to each other, Fc(ti, tj) = 1, and the tuples in different classes are not equal to each other, Fc(ti, tj) = 0. Therefore, for the classical relation, ui = ni and N = n in RC, so

RC(u1, u2, ..., um) = −[Σ_{i=1}^{m} (ui/N) log(ui/N)]/log N = −[Σ_{i=1}^{m} (ni/n) log(ni/n)]/log n = RCclassical(n1, n2, ..., nm).   (19)

According to Proposition 3(4), Proposition 2 also holds for RC in the case of a classical relation.

V. EMPIRICAL DATA EXPERIMENTS

In order to further examine the effectiveness and efficiency of the extraction approach proposed in this paper, this section reports empirical data experiments with benchmark datasets. The data experiment environment was a Windows XP system on a PC with an Intel E8400 CPU and 1 GB RAM, and all the programs were implemented with the basic routines in VC 8.0.


Fig. 5. Original query articles and compact sets of seven query conditions.

A. Effectiveness Experiments As discussed in previous sections, the extraction approach proposed in this paper could provide a smaller scale of query/search results with high relation compactness (low redundancy) in light of information equivalence. Here, for experiments, a digital library database was constructed for major research subjects in literature. The digital library database contains 4602 regular articles in 2000–2006 that are selected from 16 fine IS (information systems) journals according to ACM/AIS sources [42]. These articles were stored in a RDB (i.e., Microsoft Access) with keywords from titles, abstracts, index terms, etc. For illustrative purposes, we tested seven commonly used query conditions in IS fields (i.e., query optimization, supply chain, cluster, e-commerce, KDD/data mining, information system, internet/web) and compared the scales of the original query articles and compact sets in terms of their sizes (λ = 0.5), which is shown in Fig. 5. From Fig. 5, we could find that the scales of the original query results are much larger than those of the corresponding compact sets, each composed of representative articles. For example, when retrieving KDD/data mining literature, the original query results are 241 articles, while the compact set contains only 27 articles. Next, we also conducted some experiments on well-known UCI benchmark datasets [43] to further demonstrate the effectiveness of our approach with less redundancy than other approaches in terms of RC. Hereby, we only consider a small number of results (without loss of generality for the approaches concerned), which is deemed sensible and practical, as research also showed that users usually only paid attention to the first 10 search results on the first page [44]. Another two approaches to extract representative data objects were under consideration for comparison: the sequential extraction approach and the heuristic extraction approach [45]. The sequential extraction approach extracts the first k results sequentially which is commonly used in today’s databases and web search engines (top-k).

Fig. 6. RC values of three extraction approaches.

The heuristic extraction approach was proposed in [45]; it was designed with a heuristic means to seek information with a high coverage degree and a low redundancy degree from massive datasets. For the sake of convenience, we denote our approach, the sequential extraction approach, and the heuristic extraction approach by Represent, Sequential, and Heuristic, respectively, in the following discussions. Fig. 6 shows much higher RC values for our approach, Represent. The average RC values of Represent, Sequential, and Heuristic were 0.5404, 0.0991, and 0.1231, respectively. The results in Fig. 6 reveal that the redundancy level of Represent was much lower than that of Sequential and Heuristic. Furthermore, a paired t-test and a Friedman test were conducted, demonstrating the statistical significance of the RC differences between our approach and the others (Table V). In addition to compactness (RC), we also tested coverage to see the extent to which the approaches could cover the information of the source datasets. The coverage measure used in this paper is from the work in [45] and reflects the percentage of classes that are covered by the representative set. Let C(R) be the number of distinct classes of the representative set R and |C| be the number of total classes in the source datasets; then the coverage degree is defined as

Coverage = C(R)/|C|.   (20)
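The coverage measure in (20) is straightforward to compute (an illustrative sketch; the class labels of the benchmark records are assumed to be known):

```python
def coverage(representative_labels, source_labels):
    """Coverage of (20): fraction of the source classes hit by the representative set."""
    return len(set(representative_labels)) / len(set(source_labels))


# Two of three source classes covered -> coverage of about 0.67.
print(coverage(["class-a", "class-c"], ["class-a", "class-b", "class-c"]))
```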

Using the UCI datasets [43], the coverage degrees of these three approaches are shown in Fig. 7, showing that our approach’s coverage degree was higher than the other two approaches. The average coverage values of Represent, Sequential, and Heuristic were 0.7953, 0.4156, and 0.6416, respectively. This further verified the effectiveness of our proposed approach, where the “central” tuple was extracted to represent the class, which covered each diversified class and as a whole led to less redundancy and more content. Furthermore, paired t-test and Friedman test were conducted, revealing the statistical significance in coverage differences between the proposed approach and others (Table VI). In addition to the above experiments on databases, the proposed approach was also applied to web search data.


TABLE VII
PAIRED T-TEST AND FRIEDMAN TEST ON TREC8 (MRR DEGREE)

Assumptions          t Value   Significance   χ2 Value   Significance
Google > Represent   1.848     No             3.103      No

Notes: *: p < 0.05; **: p < 0.01; ***: p < 0.001; No: not significant.

Fig. 7. Coverage values of three extraction approaches.

Fig. 8. RC values of the extracted 10 pages and top 10 pages.

We used the query benchmark data of the KDD Cup 2005 task [46], which is commonly used in performance evaluation of information retrieval. Usual search engines often provide the top-k records/pages to the users according to the query. However, it is easily found that the top-k records/pages are often very similar and lack diversity of information. In this regard, it is considered desirable to provide compact search outcomes with high diversity and low redundancy. The data contain 111 different queries. We searched Google with all of these queries and extracted representative pages with our approach. In the process of extraction, the closeness between each pair of web pages was computed with the cosine similarity using the TF/IDF model [27]. Similarly to the database experiments on the UCI benchmark, we compared the extracted 10 pages with the top 10 pages on the first result page of Google. From Fig. 8, we can find that the relation compactness of the extracted 10 pages was much higher than that of the top 10 pages of Google for the 111 queries, which means that the diversity of the extracted pages was better and the redundancy was lower. Therefore, the representative pages could give the users a view of the whole content of the search results. As discussed in the previous sections, the number of tuples/pages in the outcome could be controlled by the setting

Fig. 9. Scale of search results according to different λ.

of λ. Fig. 9 shows the average scale of the results of all the 111 queries with changes in λ, which could be set or fine-tuned upon preference. The test results are consistent with the theoretical analysis. Specifically, λ reflects the closeness degree that the domain experts or database managers want to impose in the operation. The equivalence classes break up into smaller ones when λ increases. In other words, when λ increases, there are more classes and, thus, after the extraction, more tuples/pages are kept. In particular, with λ = 1, the treatment turns into the classical situation, in which no tuples are eliminated. Thus, if the user wants to get representative tuples from the source datasets at a finer level, he could increase the value of λ. In contrast, if the user wants to get representative tuples at a coarser level, he could decrease λ. The value of λ in the proposed approach thus provides a way for the users to implement flexible data queries. Moreover, since each representative tuple is extracted from one equivalence class, we could rank the extraction results according to the sizes of the equivalence classes: the larger the size, the closer the corresponding representative tuple is to the top. Further, we checked the extraction results with the measure of mean reciprocal rank (MRR), which is used in the evaluation of TREC8, TREC9, and TREC10 [47]

Further, we evaluated the extraction results with the mean reciprocal rank (MRR), a measure used in the evaluation of TREC8, TREC9, and TREC10 [47]:

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \tag{21} \]

where |Q| is the number of test queries and rank_i is the position of the first relevant search result for the ith test query. MRR = 0 means that no relevant result was found, and MRR = 1 means that the first representative was always the relevant result. We compared our approach with Google.
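As an illustration of (21), the following sketch computes MRR from the 1-based positions of the first relevant result per query; encoding “no relevant result found” as 0 is an assumption made here for convenience.

```python
# Sketch of (21): mean reciprocal rank over a set of test queries.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: list with the 1-based position of the first
    relevant result per query, or 0 if none was found within the answer set."""
    total = sum(1.0 / r for r in first_relevant_ranks if r > 0)
    return total / len(first_relevant_ranks)

# Hypothetical positions for five test queries.
print(mean_reciprocal_rank([1, 2, 1, 0, 5]))  # -> (1 + 0.5 + 1 + 0 + 0.2) / 5 = 0.54
```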

The test queries used were derived from TREC8 [47], which contains 200 test queries with various difficulty levels and query lengths (ranging from 3 to 33 words). The 200 test queries were evenly divided into 40 test groups, each containing five different queries. Following the evaluation method of TREC8-10, the size of the answer set for each query was five, which means that for Google the first relevant result was sought among its top five results, and for the proposed approach the size of the compact set was set to five. Three voluntary participants (graduate students in Information Management at Tsinghua University) were recruited to mark the first relevant result for each query, and for each query the position of the first relevant result was taken as the arithmetic mean of the positions marked by the three participants. The MRR values of the 40 groups were then computed for Google and for our approach, respectively (Fig. 10). Furthermore, statistical comparisons using the paired t-test and the Friedman test were conducted, with the results shown in Table VII, revealing that our approach and Google do not differ significantly in MRR, i.e., in retrieval accuracy. Moreover, the experimental results also show that, on average, our approach could retrieve the correct page within the compact sets for queries of different lengths and difficulty levels, including queries with little detail (e.g., short keywords).

Fig. 10. MRR values of Google and Represent.

B. Efficiency Experiments

Fig. 11. Running times of different scales.

Fig. 12. Memory used of different scales.

To further examine the efficiency of the proposed approach on large datasets, this subsection reports efficiency experiments on web-page collections of different scales, generated by using the 2005 KDD Cup queries as seeds, with λ = 0.9 by default. For time complexity, we extracted representative web pages with the proposed approach from source collections of 10^4 to 10^5 pages. Fig. 11 shows the running times at different scales, revealing a low-order polynomial growth consistent with the theoretical analysis in Section III-B; the extraction approach therefore has the scalability to handle web search applications with large datasets. To further illustrate this scalability, we also examined the space complexity over the same range of 10^4 to 10^5 source pages. Fig. 12 depicts the memory consumption of the proposed approach, which grows in a nearly linear pattern. To speed up the computation and avoid unnecessary memory consumption, we implemented the incremental method developed by the authors in [36] for generating the equivalence classes, which uses n iterations for n tuples. In the kth iteration, the first k − 1 tuples have already been assigned their equivalence class labels, and the algorithm only needs to determine the class label of the kth tuple and which labels among the first k − 1 tuples must be updated. Hence, only the closeness values between the kth tuple and the first k − 1 tuples need to be stored; there is no need to hold the whole closeness matrix, and only memory of size n is allocated in any iteration, which can be released for reuse in the next iteration. The subsequent extraction based on the equivalence classes also requires nearly linear space, as shown in Section III-B, so the overall algorithm is efficient in space. Fig. 12 thus indicates that, as the source data grows, for example in web search applications, the memory consumption of the extraction approach does not expand rapidly and stays within the capacity of ordinary PCs.
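A simplified sketch of this incremental idea is shown below; `closeness` and the single-link merging rule at threshold λ are stand-ins for the closeness measure and class-formation procedure of [36], not a verbatim reproduction of that algorithm.

```python
# Sketch of the incremental class assignment described above (after [36]):
# only the closeness values between the new item and the items already seen
# are held in memory, rather than the full n-by-n closeness matrix.
def incremental_classes(items, closeness, lam):
    """items: list of tuples/pages; closeness(a, b) -> [0, 1]; lam: threshold λ."""
    labels = []            # labels[i] = class label of items[i]
    next_label = 0
    for k, item in enumerate(items):
        # Closeness of the k-th item to the first k items only (released each round).
        row = [closeness(item, items[j]) for j in range(k)]
        linked = {labels[j] for j, c in enumerate(row) if c >= lam}
        if not linked:
            labels.append(next_label)          # start a new equivalence class
            next_label += 1
        else:
            target = min(linked)
            labels.append(target)
            # Merge any existing classes that the new item links together.
            labels = [target if l in linked else l for l in labels]
    return labels
```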


VI. CONCLUSION

This paper has concentrated on the problem of extracting representative tuples to enhance flexible data queries under the extended possibility-based database model, where attribute values can be possibility distributions and domain elements can be close to each other. A measure, namely relation compactness, which expresses the degree to which a given relation is compact, has been defined and studied in detail. On this basis, an approach to evaluating and extracting representative tuples has been proposed. It has been proven that the resultant set of query outcomes with representative tuples is unique, smaller in size, and information-equivalent to the respective (original) classes. Moreover, the approach enables query/search users to obtain outcomes of various sizes according to their needs and preferences in light of compactness (with high coverage, low redundancy, and good retrieval accuracy). It is worth mentioning that this approach is meant generally for flexible data queries in imprecise databases, and is also applicable to crisp queries as well as to web search applications. Ongoing research centers on a real Web 2.0/Enterprise 2.0 application for extracting representative blog articles at a large mobile service provider.

REFERENCES

[1] E. F. Codd, “A relational model for large shared data banks,” Commun. ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970.
[2] J. F. Baldwin and S. Q. Zhou, “A fuzzy relational inference language,” Fuzzy Sets Syst., vol. 14, no. 2, pp. 155–174, Nov. 1984.
[3] B. P. Buckles and F. E. Petry, “A fuzzy representation of data for relational databases,” Fuzzy Sets Syst., vol. 7, no. 3, pp. 213–226, May 1982.
[4] H. Prade and C. Testemale, “Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries,” Inf. Sci., vol. 34, no. 2, pp. 115–143, Nov. 1984.
[5] V. Owei, “An intelligent approach to handling imperfect information in concept-based natural languages queries,” ACM Trans. Inf. Syst., vol. 20, no. 3, pp. 291–328, Jul. 2002.
[6] L. A. Zadeh, “Fuzzy sets as a basis for a theory of possibility,” Fuzzy Sets Syst., vol. 1, no. 1, pp. 3–28, Jan. 1978.
[7] J. Kacprzyk and S. Zadrozny, “Computing with words in intelligent database querying: Standalone and internet-based applications,” Inf. Sci., vol. 134, nos. 1–4, pp. 71–109, May 2001.
[8] S. M. Chen and W. T. Jong, “Fuzzy query translation for relational database systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 27, no. 4, pp. 714–721, Aug. 1997.
[9] P. Bosc and O. Pivert, “SQLf: A relational database language for fuzzy querying,” IEEE Trans. Fuzzy Syst., vol. 3, no. 1, pp. 1–17, Feb. 1995.
[10] P. Bosc and O. Pivert, “About projection-selection-join queries addressed to possibilistic relational databases,” IEEE Trans. Fuzzy Syst., vol. 13, no. 1, pp. 124–139, Feb. 2005.
[11] G. Q. Chen, E. E. Kerre, and J. Vandenbulcke, “Normalization based on fuzzy functional dependency in a fuzzy relational data model,” Inf. Syst., vol. 21, no. 3, pp. 299–310, May 1996.
[12] X. H. Tang and G. Q. Chen, “Equivalence and transformation of extended algebraic operators in fuzzy relational databases,” Fuzzy Sets Syst., vol. 157, no. 12, pp. 1581–1596, Jun. 2006.
[13] P. Buche, C. Dervin, O. Haemmerle, and R. Thomopoulos, “Fuzzy querying of incomplete, imprecise, and heterogeneously structured data in the relational model using ontologies and rules,” IEEE Trans. Fuzzy Syst., vol. 13, no. 3, pp. 373–383, Jun. 2005.
[14] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White, “Evaluating implicit measures to improve web search,” ACM Trans. Inf. Syst., vol. 23, no. 2, pp. 147–168, Apr. 2005.
[15] M. R. Azimi-Sadjadi, J. Salazar, S. Srinivasan, and S. Sheedvash, “An adaptable connectionist text-retrieval system with relevance feedback,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1597–1613, Nov. 2007.

[16] T. W. S. Chow and M. K. M. Rahman, “Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1385–1402, Sep. 2009.
[17] C. Li, K. C. Chang, I. F. Ilyas, and S. Song, “RankSQL: Query algebra and optimization for relational top-k queries,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, Baltimore, MD, 2005, pp. 131–142.
[18] H. Yu, S. Hwang, and K. C. Chang, “Enabling soft queries for data retrieval,” Inf. Syst., vol. 32, no. 4, pp. 560–574, Jun. 2007.
[19] D. E. Kraft and A. Bookstein, “Evaluation of information retrieval system: A decision theory approach,” J. Amer. Soc. Inf. Sci., vol. 29, no. 1, pp. 31–40, Jan. 1978.
[20] G. Q. Chen, J. Vandenbulcke, and E. E. Kerre, “A general treatment of data redundancy in a fuzzy relational data model,” J. Amer. Soc. Inf. Sci., vol. 43, no. 4, pp. 304–311, May 1992.
[21] G. Q. Chen, Fuzzy Logic in Data Modeling: Semantics, Constraints, and Database Design. Boston, MA: Kluwer, 1998.
[22] J. C. Cubero and M. A. Vila, “A new definition of fuzzy functional dependency in fuzzy relational databases,” Int. J. Intell. Syst., vol. 9, no. 5, pp. 441–448, 1994.
[23] K. V. S. V. N. Raju and A. K. Majumdar, “Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems,” ACM Trans. Database Syst., vol. 13, no. 2, pp. 129–166, Jun. 1988.
[24] Y. Z. Cao, M. S. Ying, and G. Q. Chen, “Retraction and generalized extension of computing with words,” IEEE Trans. Fuzzy Syst., vol. 15, no. 6, pp. 1238–1250, Dec. 2007.
[25] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” Int. J. Very Large Data Bases, vol. 10, no. 4, pp. 334–350, Dec. 2001.
[26] G. D. Guo, A. K. Jain, W. Y. Ma, and H. J. Zhang, “Learning similarity measure for natural image retrieval with relevance feedback,” IEEE Trans. Neural Netw., vol. 13, no. 4, pp. 811–820, Jul. 2002.
[27] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing. Upper Saddle River, NJ: Prentice-Hall, 1971.
[28] S. Tamura, S. Higuchi, and K. Tanaka, “Pattern classification based on fuzzy relations,” IEEE Trans. Syst., Man Cybern., vol. 1, no. 1, pp. 61–66, Jan. 1971.
[29] S. Shenoi, A. Melton, and L. T. Fan, “An equivalence classes model of fuzzy relational databases,” Fuzzy Sets Syst., vol. 38, no. 2, pp. 153–170, Nov. 1990.
[30] H. S. Lee, “An optimal algorithm for computing the max-min transitive closure of a fuzzy similarity matrix,” Fuzzy Sets Syst., vol. 123, no. 1, pp. 129–136, Oct. 2001.
[31] H. L. Larsen and R. R. Yager, “Efficient computation of transitive closures,” Fuzzy Sets Syst., vol. 38, no. 1, pp. 81–90, Oct. 1990.
[32] L. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, no. 1, pp. 35–41, Mar. 1977.
[33] U. Brandes, “A faster algorithm for betweenness centrality,” J. Math. Sociol., vol. 25, no. 2, pp. 163–177, 2001.
[34] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, pp. 41–42, May 2001.
[35] T. Coffman, S. Greenblatt, and S. Marcus, “Graph-based technologies for intelligence analysis,” Commun. ACM, vol. 47, no. 3, pp. 45–47, 2004.
[36] J. Zhang, Q. Wei, and G. Q. Chen, “An incremental approach to efficiently retrieving representative information for mobile search on web,” in Proc. Int. Conf. Mobile Business, Athens, Greece, 2010, pp. 402–409.
[37] H. A. Simon, “On the forms of mental representation,” in Minnesota Studies in the Philosophy of Science: Perception and Cognition: Issues in the Foundations of Psychology, C. W. Savage, Ed. Minneapolis, MN: Univ. Minnesota Press, 1978.
[38] J. H. Larkin and H. A. Simon, “Why a diagram is (sometimes) worth ten thousand words,” Cognit. Sci., vol. 11, no. 1, pp. 65–100, Jan.–Mar. 1987.
[39] K. Siau, “Informational and computational equivalence in comparing information modeling methods,” J. Database Manage., vol. 15, no. 1, pp. 73–86, 2004.
[40] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, vol. 1. Urbana, IL: Univ. Illinois Press, 1949, p. 117.
[41] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2001.
[42] C. Saunders. Management Information Systems Journal Rankings [Online]. Available: http://ais.affiniscape.com/displaycommon.cfm?an=1&subarticlenbr=432
[43] C. J. Merz and P. Murphy. (1996). UCI Repository of Machine Learning Databases [Online]. Available: http://www.cs.uci.edu/~mlearn/MLRepository.html


[44] L. A. Granka, T. Joachims, and G. Gay, “Eye-tracking analysis of user behavior in WWW search,” in Proc. 27th Annu. Int. ACM SIGIR, Sheffield, U.K., 2004, pp. 478–479.
[45] F. Pan, W. Wang, A. K. H. Tung, and J. Yang, “Finding representative set from massive data,” in Proc. 5th IEEE Int. Conf. Data Mining, Houston, TX, Nov. 2005, pp. 338–345.
[46] Y. Li, Z. Zheng, and H. Dai, “KDD CUP-2005 report: Facing a great challenge,” SIGKDD Explorat., vol. 7, no. 2, pp. 91–99, 2005.
[47] E. Voorhees and D. Tice, “The TREC-8 question answering track evaluation,” in Proc. Text Retrieval Conf., Gaithersburg, MD, 2000, pp. 1–23.
[48] C. J. Van Rijsbergen, Information Retrieval, 2nd ed. Oxford, U.K.: Butterworths, 1979.
[49] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[50] X. H. Tang, G. Q. Chen, and Q. Wei, “Introducing relation compactness for generating a flexible size of search results in fuzzy queries,” in Proc. IFSA/EUSFLAT, Lisbon, Portugal, 2009, pp. 1462–1467.

Jin Zhang is currently pursuing the Ph.D. degree in management science and engineering with the School of Economics and Management, Tsinghua University, Beijing, China. His current research interests include data mining and business intelligence, web search, and soft computing.


Guoqing Chen received the Ph.D. degree in managerial informatics from the Catholic University of Leuven, Leuven, Belgium, in 1992. He is currently a Professor of information systems with the School of Economics and Management, Tsinghua University, Beijing, China. His current research interests include data mining and business intelligence, e-business, and fuzzy logic.

Xiaohui Tang received the Ph.D. degree in management science and engineering from the School of Economics and Management, Tsinghua University, Beijing, China, in 2007. His current research interests include fuzzy logic applications, e-business, and financial analysis.
