
Hierarchical Theme and Topic Modeling

Jen-Tzung Chien, Senior Member, IEEE

Abstract— Considering the hierarchical data groupings in a text corpus, e.g., words, sentences, and documents, we conduct structural learning and infer the latent themes and topics for sentences and words, respectively, from a collection of documents. The relation between themes and topics under different data groupings is explored through an unsupervised procedure without limiting the number of clusters. A tree stick-breaking process is presented to draw theme proportions for different sentences. We build a hierarchical theme and topic model, which flexibly represents heterogeneous documents using Bayesian nonparametrics. Thematic sentences and topical words are extracted. In the experiments, the proposed method is shown to be effective in building a semantic tree structure for sentences and the corresponding words. The superiority of using the tree model to select expressive sentences for document summarization is also illustrated.

Index Terms— Bayesian nonparametrics (BNPs), document summarization, structural learning, topic model.

I. INTRODUCTION

Unsupervised learning has a broad goal of extracting features and discovering structure within the given data. Unsupervised learning via the probabilistic topic model [1] has been successfully developed for document categorization [2], image analysis [3], text segmentation [4], speech recognition [5], information retrieval [6], document summarization [7], [8], and many other applications. Using a topic model, latent semantic topics are learned from a bag of words to capture the salient aspects embedded in a data collection. In this paper, we propose a new topic model to represent a bag of sentences as well as the corresponding words. The concept of a topic is well understood in the community. Here, we use another related concept, the theme. Themes are latent variables that occur at a different level of grouped data, e.g., sentences, and so the concepts of themes and topics are different. We model the themes and topics separately and estimate them jointly. The hierarchical theme and topic model is constructed accordingly. Fig. 1 shows the diagram of hierarchical generation from documents and sentences to words given the themes and topics, which are drawn from their proportions. We explore a semantic tree structure of sentence-level latent variables from a bag of sentences, while the word-level latent variables are learned

Manuscript received March 15, 2014; revised December 21, 2014 and March 11, 2015; accepted March 16, 2015. Date of publication March 30, 2015; date of current version February 15, 2016. This work was supported by the Ministry of Science and Technology, Taiwan, under Contract MOST 103-2221-E-009-078-MY3. The author is with the Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2015.2414658

Fig. 1. Conceptual illustration for hierarchical generation of documents (yellow rectangle), sentences (green diamond), and words (blue circle) using theme and topic proportions.

from a bag of grouped words allocated in individual tree nodes. We build a two-level topic model through a compound process. The process of generating words is conditioned on the theme assigned to the sentence. This paper aims to go beyond the word level and upgrade the topic model by discovering the hierarchical relations between the latent variables at the word and sentence levels. The benefit of this model is to establish a hierarchical latent variable model, which is feasible for characterizing heterogeneous documents with multiple levels of abstraction in different data groupings. This model is general and could be applied to document summarization and many other information systems.

A. Related Work for Topic Model

The topic model based on latent Dirichlet allocation (LDA) [2] is constructed as a finite-dimensional mixture representation, which assumes that: 1) the number of topics is fixed and 2) different topics are independent. The hierarchical Dirichlet process (HDP) [9] and the nested Chinese restaurant process (nCRP) [10], [11] were proposed to relax these assumptions. The HDP–LDA model in [9] is a nonparametric extension of LDA, where the document representation is allowed to grow structurally as more documents are observed. The number of topics is unknown a priori. Each word token within a document is drawn from a mixture model, where the hidden topics are shared across documents. The Dirichlet process (DP) is realized to find flexible data partitions and provide the nonparametric prior over the number of topics for each document. The base measure for the child DPs is itself drawn from a parent DP. The atoms are shared in a hierarchical way. The model selection problem is tackled through Bayesian nonparametric (BNP) learning. In the literature, the sparse topic model was constructed by decoupling the sparsity and smoothness for LDA [12] and HDP [13]. The spike and slab prior over


Dirichlet distributions was applied using a Bernoulli variable for each word to indicate whether or not the word appears in the topic. In [14], the Indian buffet process compound DP was developed to build a focused topic model. In [15], a hierarchical Dirichlet prior with Polya conditionals was used as the asymmetric Dirichlet prior over the document-topic distributions. The improvement of using this prior structure over the symmetric Dirichlet prior in LDA was substantial even when the number of topics was not estimated. On the other hand, the nCRP conducts BNP inference of topic hierarchies and learns deeply branching trees from a data collection. Using this hierarchical topic model [10], each document is modeled by a path of topics, given a random tree in which hierarchically correlated topics, from the global topic to the individual topics, are extracted from the root node to the leaf nodes, respectively. In [16], the topic correlation was strengthened through a doubly correlated nonparametric topic model, where the annotations were incorporated into the discovery of semantic structure. In [17], the nested DP prior was proposed by replacing the random atoms with random probability measures drawn from a DP in a nested setting. In general, HDP and nCRP are implemented according to the stick-breaking process (SBP) and the Chinese restaurant process (CRP). The approximate inference using Markov chain Monte Carlo (MCMC) [9]–[11] and variational Bayesian (VB) [18]–[20] inference was developed.

B. Topic Model Beyond Word Level

In previous methods, unsupervised learning over text data finds the topic information from a bag of words. The mixed membership modeling is implemented for the representation of words from multiple documents. The word-level mixture model is built. However, unsupervised learning beyond the word level is required in many information systems. For example, the topic models based on a bag of bigrams [21] and a bag of n-gram histories [5] were estimated for language modeling. In a document summarization system, the text representation is evaluated to select representative sentences from multiple documents. Exploring sentence clustering and ranking [8] is essential to find the sentence-level themes and measure the relevance between sentences and documents [22]. In [23], general sentences and specific sentences were identified for document summarization. In [7] and [24], a sentence-based topic model based on LDA was proposed to learn word-document and word-sentence associations. Furthermore, an information retrieval system retrieves condensed information from user queries. Finding the underlying themes from documents is beneficial to organize the ranked documents and extract the relevant information. In addition to the word-level topic model, it is desirable to build a hierarchical topic model at the sentence level or even at the document level. In [25] and [26], the latent Dirichlet coclustering model and the segmented topic model were proposed to build topic models across different levels.

C. Main Idea of This Work

In this paper, we construct a hierarchical latent variable model for structural representation of text documents.

The thematic sentences and the topical words are learned from hierarchical data groupings. Each path in the tree model covers from the general theme at the root node to the individual themes at the leaf nodes. The themes in different tree nodes contain coherent information but in varying degrees of sharing for sentence representation. We basically build a tree model for sentences according to the nCRP. The theme hierarchy is explored. The brother nodes expand the diversity of themes from different sentences within and across documents. This model not only groups the sentences into nodes but also distinguishes their concepts through different layers. The words of the sentences clustered in a tree node are seen as the grouped data. The grouped words in different tree nodes are driven by an HDP. The nCRP compound HDP is developed to build a hierarchical theme and topic model. To reflect the heterogeneous documents in a real-world data collection, a tree stick-breaking process (TSBP) is addressed to draw a subtree of theme proportions. We conduct structural learning and group the sentences into a diversity of themes. The number of themes and the dependence between themes are learned from data. The words of the sentences within a node are represented by a topic model, which is drawn by a DP. All the topics from different nodes are shared under a global DP. The sentence-level themes and the word-level topics are estimated. This approach is evaluated on the tasks of document modeling and summarization.

This paper is organized as follows. Section II surveys BNP learning based on the DP, HDP, and nCRP. Section III presents the nCRP compound HDP for representation of sentences and words from multiple documents. A TSBP is implemented to draw theme proportions for subtree branches. Section IV formulates the posterior probabilities that are applied for drawing the subtree, themes, and topics. In Section V, the experiments on text modeling and document summarization are reported. Finally, the conclusion is drawn in Section VI. The convention of notations used in this paper is provided in Table I.

II. BAYESIAN NONPARAMETRIC LEARNING

A. Dirichlet Process

The DP is essential for BNP learning. Basically, a DP for a random probability measure G is expressed by G ∼ DP(α_0, G_0). For any finite measurable partition {A_1, ..., A_r}, a finite-dimensional Dirichlet distribution

(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))    (1)

or G ∼ Dir(α_0 G_0) is produced from G with two parameters, a concentration parameter α_0 > 0 and a base measure G_0, which is seen as the mean of draws from the DP. The probability measure of the DP is established by

G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \quad \text{where} \quad \sum_{k=1}^{\infty} \beta_k = 1    (2)

where δ_{φ_k} is a unit mass at the point φ_k and the infinite sequences of weights {β_k} and points {φ_k} are drawn from α_0 and G_0, respectively. The solution to the weights β = {β_k}_{k=1}^∞ can be obtained by the SBP, which provides a distribution


TABLE I
CONVENTION OF NOTATIONS

Fig. 2. Graphical representation for (a) DP mixture model and (b) HDP mixture model. Yellow: variables in document level. Blue: variables in word level.

over infinite partitions of the unit interval, also called the GEM (named for Griffiths, Engen, and McCloskey) distribution β ∼ GEM(α_0) [27]. Let θ_1, θ_2, ... denote the parameters drawn from G for individual words w_1, w_2, ... with a multinomial distribution function, θ_i | G ∼ G and w_i | θ_i ∼ p(w_i | θ_i) = Mult(θ_i) for each i. According to the metaphor of the CRP [28], the conditional distribution of the current parameter θ_i given the previous parameters θ_1, ..., θ_{i−1} is obtained by

\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha_0} \delta_{\theta_l} + \frac{\alpha_0}{i-1+\alpha_0} G_0 = \sum_{k=1}^{K} \frac{n_k}{i-1+\alpha_0} \delta_{\phi_k} + \frac{\alpha_0}{i-1+\alpha_0} G_0.    (3)

Here, φ_1, ..., φ_K denote the distinct values taken by the previous parameters θ_1, ..., θ_{i−1}, and n_k is the number of customers, or parameters θ_{i'}, that have sat at table or chosen value φ_k. If the base distribution is continuous, then each φ_k corresponds to a different table. However, if the base distribution is discrete and finite, φ_k corresponds to a distinct value but not to a table, because draws from a discrete distribution can be repeated. Considering the continuous base measure, the ith customer sits at table φ_k with probability proportional to the number of customers n_k who are already seated there, and sits at a new table with probability proportional to α_0. Fig. 2(a) displays the graphical representation of a DP mixture model where each word w_i is generated by

\phi_k \mid G_0 \sim G_0, \text{ for each } k
\beta \mid \alpha_0 \sim \mathrm{GEM}(\alpha_0)
z_i = k \mid \beta \sim \beta, \text{ for each } i
w_i \mid z_i, \{\phi_k\}_{k=1}^{\infty} \sim \mathrm{Mult}(\theta_i = \phi_{z_i}), \text{ for each } i    (4)

where the mixture component or latent topic z_i = k is drawn from the topic proportions β, which are determined by the strength parameter α_0. Unsupervised learning of an infinite mixture representation is thereby implemented.
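To make the constructions in (2)–(4) concrete, the following Python sketch draws a truncated GEM(α_0) stick and simulates the sequential table assignments of the CRP predictive rule (3). The truncation level, random seeds, and function names are illustrative assumptions and are not part of the paper.

```python
import numpy as np

def gem_stick_breaking(alpha0, truncation=50, rng=None):
    """Truncated draw of the GEM(alpha0) weights in (2) by stick breaking."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.beta(1.0, alpha0, size=truncation)                  # beta'_k ~ Beta(1, alpha0)
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    return beta / beta.sum()                                    # renormalize the truncated stick

def crp_seating(n_items, alpha0, rng=None):
    """Sequential table assignments following the predictive rule (3)."""
    rng = np.random.default_rng() if rng is None else rng
    tables, counts = [0], [1]                                   # first customer opens a table
    for i in range(1, n_items):
        probs = np.array(counts + [alpha0], dtype=float) / (i + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                                    # new table, prob. proportional to alpha0
        else:
            counts[k] += 1                                      # existing table, prob. proportional to n_k
        tables.append(k)
    return tables

beta = gem_stick_breaking(alpha0=1.0)
z = crp_seating(n_items=100, alpha0=1.0)
```

Both routines describe the same DP prior: the stick-breaking view gives the weights explicitly, while the CRP view integrates them out and seats customers sequentially.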

B. Hierarchical Dirichlet Process

The HDP [9] deals with the representation of documents or grouped data, where each group is associated with a mixture model. Data in different groups share a global mixture model. Each document or data grouping w_d is associated with a draw from a DP given the probability measure G_d ∼ DP(α_0, G_0), which determines how much a member from a shared set of mixture components contributes to that data grouping. The base measure G_0 is itself drawn from a global DP by G_0 ∼ DP(γ, H) with strength parameter γ and base measure H, which ensures that there is a single set of discrete components shared across the data. Each DP G_d governs the generation of the words w_d = {w_di} or their multinomial parameters {θ_di} for a document. The global measure G_0 and the individual measures G_d in the HDP can be expressed by mixture models with the shared atoms {φ_k}_{k=1}^∞ but different weights β = {β_k}_{k=1}^∞ and π_d = {π_dk}_{k=1}^∞ given by

G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \quad G_d = \sum_{k=1}^{\infty} \pi_{dk} \delta_{\phi_k}, \text{ for each } d, \quad \text{where} \quad \sum_{k=1}^{\infty} \beta_k = \sum_{k=1}^{\infty} \pi_{dk} = 1.    (5)

The atom φk is drawn from base measure H and the topic proportions β are drawn by SBP via β|γ ∼ GEM(γ ).


Note that the topic proportions π_d are sampled from an independent DP G_d given β because the G_d's are independent given G_0. We have π_d ∼ DP(α_0, β). Fig. 2(b) shows the HDP mixture model that generates a word w_di in grouped data w_d by

\phi_k \mid H \sim H, \text{ for each } k
\beta \mid \gamma \sim \mathrm{GEM}(\gamma)
\pi_d \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta), \text{ for each } d
z_{di} = k \mid \pi_d \sim \pi_d, \text{ for each } d \text{ and } i
w_{di} \mid z_{di}, \{\phi_k\}_{k=1}^{\infty} \sim \mathrm{Mult}(\theta_{di} = \phi_{z_{di}}), \text{ for each } d \text{ and } i.    (6)

To implement this HDP, we can apply the stick-breaking construction to connect the relation between β and π_d. We first conduct the stick-breaking construction by finding β = {β_k} through a process of drawing beta variables

\beta'_k \sim \mathrm{Beta}(1, \gamma), \quad \beta_k = \beta'_k \prod_{j=1}^{k-1} (1 - \beta'_j).    (7)

Then, the stick-breaking construction for the probability measure π_d of grouped data w_d is performed to bridge the relation

\pi'_{dk} \sim \mathrm{Beta}\Big(\alpha_0 \beta_k, \; \alpha_0 \sum_{l=k+1}^{\infty} \beta_l\Big), \quad \pi_{dk} = \pi'_{dk} \prod_{j=1}^{k-1} (1 - \pi'_{dj})    (8)

where the beta variable π'_{dk} at the kth draw is determined by the base measures of the two segments {β_k, {β_l}_{l=k+1}^∞}, which are scaled by the parameter α_0. Basically, the HDP does not involve learning topic hierarchies. Only a single level of data grouping, i.e., the document level, is modeled.
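A minimal truncated sketch of the two-level stick-breaking construction in (7) and (8) is given below. The finite truncation, the clamping of the last beta parameter, and the function names are assumptions made only for illustration; the sketch simply shows how each group-level π_d concentrates around the shared global weights β.

```python
import numpy as np

def hdp_group_weights(beta, alpha0, rng=None):
    """Draw group-level proportions pi_d ~ DP(alpha0, beta) via the construction in (8)."""
    rng = np.random.default_rng() if rng is None else rng
    # residual mass of the global stick after position k: alpha0 * sum_{l>k} beta_l
    tail = alpha0 * (np.cumsum(beta[::-1])[::-1] - beta)
    pi_prime = rng.beta(alpha0 * beta, np.maximum(tail, 1e-12))   # pi'_dk ~ Beta(alpha0*beta_k, alpha0*sum_{l>k} beta_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime[:-1])))
    pi = pi_prime * remaining                                     # pi_dk = pi'_dk * prod_{j<k}(1 - pi'_dj)
    return pi / pi.sum()

# Global weights beta ~ GEM(gamma) (truncated), shared across documents;
# each document d then receives its own pi_d centered on beta.
gamma, alpha0 = 1.0, 2.0
rng = np.random.default_rng(0)
b = rng.beta(1.0, gamma, size=30)
beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
beta = beta / beta.sum()
pi_d = hdp_group_weights(beta, alpha0, rng)
```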

C. Nested Chinese Restaurant Process

In many practical applications, it is desirable to represent text data based on different levels of aspects. The nCRP [10], [11] was proposed to infer the topic hierarchies and learn a deeply branching tree from a data collection. The resulting hierarchical LDA represents a text document w_d using a path c_d of topics from a random tree consisting of the hierarchically correlated topics {φ_k}_{k=1}^∞, from the global topic in the root node to the individual topics in the leaf nodes, as shown in Fig. 3. Choosing a path of topics for a document is equivalent to selecting or visiting a chain of restaurants c_d by a customer w_d through c_d ∼ nCRP(α_0), which is controlled by a scaling parameter α_0. Each word w_di in document w_d is assigned a tree node φ_k or topic label z_di = k with probability or topic proportion π_dk. The generative process for a set of text documents w = {w_d | d = 1, ..., D} based on the nCRP is constructed as follows.

1) For each node k in the infinite tree:
   a) Draw a topic with parameter φ_k | H ∼ H.
2) For each document w_d = {w_di | i = 1, ..., N_d}:
   a) Draw a tree path by c_d ∼ nCRP(α_0).
   b) Draw topic proportions over the layers of the tree path c_d by the SBP π_d | γ ∼ GEM(γ).
   c) For each word w_di:
      i) Choose a layer or a topic by z_di = k | π_d ∼ π_d.
      ii) Choose a word based on topic z_di = k by w_di | z_di, c_d, {φ_k}_{k=1}^∞ ∼ Mult(θ_di = φ_{c_d(z_di)}).

Fig. 3. Infinitely branching tree structure for representation of words and documents based on nCRP. Thick arrows: tree path c_d drawn from nine words of a document w_d. Here, we simply use notation d for w_d. Yellow rectangle: document. Blue circles: words. Each word w_di is assigned a topic parameter φ_k at a tree node along c_d with proportion or probability π_dk.

A Gibbs sampling algorithm is developed to sample the posterior tree path and word topic {c_d, z_di} for the different words in the D documents w = {w_d} = {w_di} according to the posterior probabilities of c_d and z_di given w and the current values of all other latent variables, i.e., p(c_d | c_{−d}, w, z, α_0, H) and p(z_di | w, z_{−(di)}, c_d, γ, H). In these posterior probabilities, the notations c_{−d} and z_{−(di)} denote the latent variables c and z for all documents and words other than document d and word i, respectively. The sampling procedure is performed iteratively from the current state to a new state for the D documents with N_d words in each document w_d. In this procedure, the topic parameter φ_k in each node k of the infinite tree model is sampled from the prior measure φ_k | H ∼ H. The topic proportions π_d of document w_d are drawn by π_d | γ ∼ GEM(γ) according to an SBP given a scaling parameter γ. Given the samples of tree path c_d and topic z_di, the word w_di is distributed using the multinomial parameter

\theta_{di} = \phi_{c_d(z_{di})}    (9)

of topic z_di = k drawn from the topic proportions π_d corresponding to the tree path c_d.
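As a rough illustration of how paths are drawn in the nCRP prior, the sketch below forward-samples one root-to-leaf path per document from a tree of CRPs. The fixed depth, the dictionary bookkeeping, and the function names are assumptions for illustration and do not correspond to the posterior inference used for hierarchical LDA.

```python
import numpy as np

def ncrp_paths(n_docs, depth, alpha0, rng=None):
    """Draw one root-to-leaf path per document from a nested CRP of a given depth."""
    rng = np.random.default_rng() if rng is None else rng
    children = {}                                    # node (tuple) -> customer counts per child
    paths = []
    for _ in range(n_docs):
        node, path = (), []
        for _ in range(depth):
            counts = children.setdefault(node, [])
            probs = np.array(counts + [alpha0], dtype=float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(counts):
                counts.append(0)                     # open a new restaurant (child node)
            counts[k] += 1
            node = node + (k,)
            path.append(node)
        paths.append(path)
    return paths

paths = ncrp_paths(n_docs=5, depth=3, alpha0=0.5, rng=np.random.default_rng(2))
```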

III. HIERARCHICAL THEME AND TOPIC MODEL

Although the topic hierarchies are explored in the topic model based on the nCRP, only a single level of data grouping, i.e., the document level, is considered in the generative process. The extension of text representation to different levels of data groupings is required to improve text modeling. In addition, a single tree path c_d in the nCRP may not sufficiently reflect the topic variations and theme ambiguities in heterogeneous documents. A flexible topic selection is required to compensate for model uncertainty. By conducting multiple-level unsupervised learning and flexible topic selection, we are able to upgrade system performance for document modeling.


where snCRP(·) is defined by considering the individual sentences {w_dj} under a subtree t_d rather than the individual words {w_di} with a single tree path c_d, as addressed for nCRP(·) in Section II-C. Notably, we consider a subtree t_d = {t_dj} for thematic representation of the sentences from a heterogeneous document w_d. The proportions β_d of the theme parameters ψ = {ψ_l} over a subtree t_d are drawn according to a TSBP, which is simplified from the tree-structured SBP in [29]. The resulting distribution is called the treeGEM, which is expressed by

\beta_d \mid \gamma_s \sim \mathrm{treeGEM}(\gamma_s)    (11)

In this paper, a hierarchical theme and topic model is proposed to conduct a kind of topical coclustering [25], [26] over sentence level and word level, so that one can cluster sentences while clustering words. The nCRP compound HDP is presented to implement the proposed model where the text modeling in word level, sentence level and document level is jointly performed. By referring to [29], a simplified tree-structured SBP is presented to draw a subtree td , which accommodates the theme and topic variations in document wd . A. nCRP Compound HDP We propose an unsupervised structural learning approach to discover latent variables in different levels of data groupings. For the application of document representation, we aim to explore the latent themes from a set of sentences {wd j } and the latent topics from the corresponding set of words {wd j i }. A text corpus w = {wd } = {wd j } = {wd j i } is represented using different levels of aspects. The proposed model is constructed by considering the structure of a document where each document consists of a bag of sentences and each sentence consists of a bag of words. Different from the topic model using HDP [9] and the hierarchical topic model using nCRP [10], [11], we develop a tree model to represent a bag of sentences and then represent the corresponding words allocated in individual tree nodes. A two-stage procedure is proposed for document modeling as shown in Fig. 4. In the first stage, each sentence wd j of a document wd is drawn from a mixture of theme model where the themes are shared for all sentences from a document collection w. The theme model of a document wd is composed of the themes under its corresponding subtree td for different sentences {wd j }, which are drawn by a sentence-based nCRP (snCRP) with a control parameter α0 td = {td j } ∼ snCRP(α0 )

(10)

(11)

where γ_s is a strength parameter. The distribution treeGEM(·) is derived through a TSBP, which shall be described in Section III-D. With a tree structure of themes, the unsupervised grouping of sentences into different layers is obtained. In the second stage, each word w_dji in the set of sentences {w_dj} allocated in tree node l is drawn by an individual mixture-of-topics model based on a DP. All topics from different nodes of a tree model are shared under a global topic model with parameters {φ_k}, which are sampled by φ_k | H ∼ H. By treating the words in a tree node as the grouped data, the HDP is implemented for topical representation of the whole document collection w, which is composed of the grouped words in different tree nodes. We assume that the words of the sentences in tree node l are conditionally independent. These words are drawn from a topic model using topic parameters {φ_k}_{k=1}^∞. The sentences in a document given theme l are conditionally independent. These sentences are drawn from a theme model using theme parameters {ψ_l}_{l=1}^∞. The document-dependent theme proportions β_d = {β_dl} and the theme-dependent topic proportions π_l = {π_lk} are produced by a TSBP, β_d | γ_s ∼ treeGEM(γ_s), and a standard SBP, π_l ∼ GEM(γ_w), where γ_s and γ_w denote the sentence-level and word-level strength parameters, respectively. Given the theme proportions β_d and topic proportions π_l, the probability measure of each sentence w_dj is drawn from a document-dependent theme mixture model

G_{s,d} = \sum_{l=1}^{\infty} \beta_{dl} \delta_{\psi_l}    (12)

while the probability measure of each word w_dji in a tree node is drawn from a theme-dependent topic mixture model

G_{w,l} = \sum_{k=1}^{\infty} \pi_{lk} \delta_{\phi_k}.    (13)

Since the theme for sentences is represented by a mixture model of topics for words in (13), we bridge the relation between the probability measures of themes {ψ_l} and topics {φ_k} via

\psi_l \sim \sum_k \pi_{lk} \phi_k.    (14)
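The relation (14) simply states that a theme-word distribution is a convex combination of the shared topic-word distributions. A tiny sketch with assumed shapes and random example values is given below; the symbols and dimensions are illustrative, not taken from the paper's experiments.

```python
import numpy as np

# Theme-topic relation (14): each theme distribution over words is a mixture of
# the shared topic-word distributions, weighted by pi_l.
rng = np.random.default_rng(0)
K, V = 4, 12                                     # assumed numbers of topics and vocabulary words
phi = rng.dirichlet(0.1 * np.ones(V), size=K)    # phi_k ~ Dir(eta), shared topics (K x V)
pi_l = rng.dirichlet(np.ones(K))                 # theme-dependent topic proportions
psi_l = pi_l @ phi                               # psi_l = sum_k pi_lk * phi_k
assert np.isclose(psi_l.sum(), 1.0)
```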

We implement the so-called nCRP compound HDP in a two-stage procedure and establish the hierarchical theme and topic model for document representation. The generative process for a set of documents in different levels of groupings


is accordingly implemented. Having the sampled tree node y_dj connected to the tree branches t_d and the associated topic z_di = k, the word w_dji in sentence w_dj is drawn using the multinomial parameter

\theta_{dji} = \phi_{t_d(y_{dj}, z_{di})}.    (15)

The topic is determined by the topic proportions π_l of theme y_dj = l, while the theme is determined by the theme proportions β_d of a document w_d. The generative process of this compound process is described as follows.

1) For each node or theme l in the infinite tree:
   a) For each topic k in a tree node:
      i) Draw a topic with parameter φ_k | H ∼ H.
   b) Draw topic proportions by the SBP π_l | γ_w ∼ GEM(γ_w).
   c) The theme model is constructed by ψ_l ∼ Σ_k π_lk φ_k.
2) For each document w_d = {w_dj}:
   a) Draw a subtree t_d = {t_dj} ∼ snCRP(α_0).
   b) Draw theme proportions over the subtree t_d in different layers by the tree SBP β_d | γ_s ∼ treeGEM(γ_s).
   c) For each sentence w_dj = {w_dji}:
      i) Choose a theme y_dj = l | β_d ∼ β_d.
      ii) For each word w_dji, or simply w_di:
         A) Choose a topic by z_di = k | π_l ∼ π_l.
         B) Choose a word based on topic z_di = k by w_dji | z_di, y_dj, t_d, {φ_k}_{k=1}^∞ ∼ Mult(θ_dji = φ_{t_d(y_dj, z_di)}).

B. nCRP for Thematic Sentences

We implement the snCRP and build an infinitely branching tree model, which discovers latent themes from the different sentences of a text corpus. The root node of this tree contains the general theme, while the nodes in the leaf layer convey the specific themes. Hierarchical clustering of sentences is realized. The snCRP is developed to represent the sentences of a document w_d = {w_dj} based on the themes, which come from a subtree t_d = {t_dj}. The ambiguity and uncertainty of themes in a heterogeneous document can thus be compensated. This is different from the conventional word-based nCRP [10], [11], where only the topics along a single tree path c_d are selected to represent the different words in a document w_d = {w_di}. The word-based nCRP using the GEM distribution for topic proportions, π_d ∼ GEM(γ), is now extended to the snCRP using the treeGEM distribution for theme proportions, β_d ∼ treeGEM(γ_s), by considering a subtree t_d. The metaphor of the snCRP is described as shown in Fig. 4. There are an infinite number of Chinese restaurants in a city. Each restaurant has an infinite number of tables. The tourist w_dj of a tour group w_d = {w_dj} visits the first restaurant in the root node ψ_1, where each of its tables has a card showing the next restaurant, which is arranged in the second layer consisting of {ψ_2, ψ_4, ...}. Such visits repeat infinitely. Each restaurant is associated with a tree layer. The restaurants in a city are organized into an infinitely branched and infinitely deep tree structure. The tourists or sentences w_dj from different tour groups or documents w_d that are closely related shall visit

Fig. 5. Graphical representation for hierarchical theme and topic model. Yellow: variables in document level. Green: variables in sentence level. Blue: variables in word level.

the same restaurant and sit at the same table. The hierarchical grouping of sentences is, therefore, obtained by the nonparametric tree model based on this snCRP. Each tree node stands for a theme. Each tourist or sentence w_dj is modeled by the multinomial parameter ψ_l of a theme l from a subtree or a chain of restaurants t_d. The thematic sentences allocated in tree nodes with high probabilities can be selected for document summarization.

C. HDP for Topical Words

After obtaining the hierarchical grouping of sentences, we treat the words corresponding to a tree node l as the grouped data and conduct the HDP using the grouped data in different tree nodes. The grouped words from different documents are recognized under the same theme ψ_l. Such a tree-based HDP is different from the document-based HDP [9], which treats documents as the grouped data for text representation. An organized topic model is constructed to draw words for individual themes. The theme-dependent topical words are learned from the tree-based HDP. According to the combination of the snCRP and the tree-based HDP, the theme parameter ψ_l is represented by a mixture model of the topic parameters {φ_k}_{k=1}^∞, where the theme-dependent topic proportions {π_lk}_{k=1}^∞ are used as the mixture weights, as given in (14). The tree-based HDP is developed to infer the topic proportions π_l based on a standard SBP. The tree-based multinomial parameter is inferred to determine the word distribution through

\phi_k \mid H \sim H, \text{ for each } k
\pi_l \mid \gamma_w \sim \mathrm{GEM}(\gamma_w), \quad \psi_l \sim \sum_k \pi_{lk} \phi_k, \text{ for each } l
t_d \sim \mathrm{snCRP}(\alpha_0), \quad \beta_d \mid \gamma_s \sim \mathrm{treeGEM}(\gamma_s), \text{ for each } d
y_{dj} = l \mid \beta_d \sim \beta_d, \text{ for each } d \text{ and } j
z_{di} = k \mid \pi_l \sim \pi_l, \text{ for each } d \text{ and } i
w_{dji} \mid z_{di}, y_{dj}, t_d, \{\phi_k\}_{k=1}^{\infty} \sim \mathrm{Mult}(\phi_{t_d(y_{dj}, z_{di})}), \text{ for each } d, j \text{ and } i.    (16)

A compound process of the snCRP and HDP is fulfilled to build the hierarchical theme and topic model as shown in the graphical representation of Fig. 5. Each word w_dji is drawn using the topic parameter θ_dji = φ_{t_d(y_dj, z_di)}, which is controlled by the word-level topic z_di. This topic is allocated in the theme or tree node y_dj of the subtree t_d, which is selected from sentence w_dj. The topical words in the different tree nodes with high probability are selected. Such a process could be extended to represent multiple-level data


groupings including words, paragraphs, documents, streams, and corpora.

D. Tree Stick-Breaking Process

In the implementation, a TSBP is incorporated to draw a subtree [30] and determine the theme proportions for representation of the sentences in a heterogeneous document w_d = {w_dj}. The traditional GEM distribution is not suitable to characterize a tree structure with dependencies between brother nodes and between parent and child nodes. The snCRP is combined with a TSBP, which is developed to draw theme proportions β_d from a document w_d based on the treeGEM distribution subject to the constraint of multinomial parameters Σ_{l=1}^∞ β_dl = 1 over all nodes in subtree t_d. A variety of aspects from different sentences is revealed through β_d. As shown in Fig. 6(a) and (c), we draw the theme proportions for an ancestor node and its child nodes {l_a, {l_a0, l_a1, l_a2, ...}} that are connected between two layers by arrows. The theme proportion β_{l_a0} in the child node denotes the initial segment, which is succeeded from the ancestor node l_a. The SBP is performed for the coming child nodes {l_a1, l_a2, ...}. Given the treeGEM parameter γ_s, we sample a beta variable

\beta'_{l_a c} \sim \mathrm{Beta}(1, \gamma_s) = \frac{\Gamma(1+\gamma_s)}{\Gamma(1)\Gamma(\gamma_s)} \big(1 - \beta'_{l_a c}\big)^{\gamma_s - 1}    (17)

for a child node l_ac of the parent node l_a. The probability or theme proportion of generating a child node l_ac is calculated by

\beta_{l_a c} = \beta_{l_a} \beta'_{l_a c} \prod_{j=1}^{c-1} \big(1 - \beta'_{l_a j}\big).    (18)

In this calculation, the theme proportion β_{l_ac} of a child node is determined from an initial proportion of the ancestor node β_{l_a}, which is continuously chopped according to the draws of beta variables from the existing child nodes {β'_{l_aj} | j = 1, ..., c−1} to the new child node β'_{l_ac}. Each child node can be further treated as an ancestor node for a future SBP to generate the grandchild nodes. Equation (18) is recursively applied to find the theme proportions β_{l_a} and β_{l_ac} for the different nodes in subtree t_d. The theme proportions of a parent node and its child nodes satisfy the condition

\beta_{l_a} = \beta_{l_a 0} + \beta_{l_a 1} + \beta_{l_a 2} + \cdots    (19)

Fig. 6. Illustration for (a) conducting TSBP and (b) finding hierarchical theme proportions for an infinitely branching tree structure, which meets the constraint of unit length in the estimated theme proportions or in the treeGEM distribution. (c) The SBP of a parent node l_a to its child nodes {l_a0, l_a1, l_a2, ...}; dashed arrows and circles denote future SBP. The theme proportion of the first child node of a parent node l_a is denoted by β_{l_a0}. After stick-breaking for a parent node l_a and its child nodes, we have the replacement β_{l_a} ← β_{l_a0}. This example illustrates how the three-layer proportions {β_1, β_11, β_12, β_111, β_121} or {β_10, β_110, β_120, β_111, β_121} in Fig. 4 are drawn to share a unit-length stick.

Tree stick-breaking is run for each set consisting of a parent node l_a and its child nodes. After stick-breaking of a parent node l_a and its child nodes, the theme proportion of the parent node is replaced by β_{l_a} ← β_{l_a0}. Fig. 6(b) shows how the three-layer theme proportions {β_1, β_11, β_12, β_111, β_121} in Fig. 4 are inferred through the TSBP. We accordingly infer the theme proportions β_d for a document w_d and meet the constraint that the summation over the theme proportions of all parent and child nodes has unit length. The inference β_d ∼ treeGEM(γ_s) is thus completed. Therefore, a stick of unit length is partitioned at random locations. The random theme proportions β_d are used to calculate the posterior probability for drawing theme y_dj = l for a sentence w_dj. In this paper, we adopt a single beta parameter γ_s for the TSBP toward depth as well as branch. This solution is a simplified realization of the TSBP in [29], where two separate beta parameters {γ_sd, γ_sb} are adopted to conduct stick-breaking for depth and branch, β_d ∼ treeGEM(γ_sd, γ_sb). The infinite version of the Dirichlet tree distribution used in [31] could be adopted as the conjugate prior over the tree multinomials β_d.
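The following sketch forward-samples a finite-depth, finite-branching approximation of the treeGEM allocation in (17)–(19), including the replacement β_{l_a} ← β_{l_a0}. The truncation limits, the handling of the leftover tail, and the function names are assumptions for illustration; they are not the Gibbs-based inference used in this paper.

```python
import numpy as np

def tree_stick_breaking(parent_mass, depth, gamma_s, max_depth=3, max_children=3, rng=None):
    """Split parent_mass between a node's own share (beta_{la0}) and its child
    subtrees, following (17)-(19); returns {node_path: proportion}."""
    rng = np.random.default_rng() if rng is None else rng
    proportions = {}
    if depth == max_depth:                       # leaves keep all remaining mass
        return {(): parent_mass}
    keep = rng.beta(1.0, gamma_s)                # share the parent keeps (beta_la <- beta_{la0})
    proportions[()] = parent_mass * keep
    remaining = parent_mass * (1.0 - keep)
    for c in range(1, max_children + 1):         # finite truncation of the child draws
        frac = rng.beta(1.0, gamma_s)            # beta'_{lac} ~ Beta(1, gamma_s) as in (17)
        child_mass = remaining * frac            # recursive form of (18)
        remaining -= child_mass
        child = tree_stick_breaking(child_mass, depth + 1, gamma_s,
                                    max_depth, max_children, rng)
        proportions.update({(c,) + path: p for path, p in child.items()})
    proportions[()] += remaining                 # fold the truncated tail back into the node
    return proportions

beta_d = tree_stick_breaking(1.0, depth=1, gamma_s=1.85, rng=np.random.default_rng(1))
assert abs(sum(beta_d.values()) - 1.0) < 1e-9    # proportions over the subtree sum to one, cf. (19)
```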

IV. BAYESIAN INFERENCE

The approximate Bayesian inference using Gibbs sampling is developed to infer the posterior parameters or latent variables of the nCRP compound HDP. The latent variables consist of the subtree branch t_dj, the sentence-level theme y_dj, and the word-level topic z_di for each word w_di and sentence w_dj in a text corpus w. Each latent variable is iteratively sampled from the posterior probability of this variable given the observed data w and all the other variables. The sampling of latent variables is performed in an MCMC procedure. In this procedure, we sample a subtree t_d = {t_dj} for a document w_d via the snCRP. Each sentence w_dj is grouped into a tree node or a document-dependent theme y_dj = l under the subtree t_d. Each word w_di of a sentence is then assigned the theme-dependent topic z_di = k, which is sampled according to a tree-based HDP. In what follows, we address the calculation of the posterior probabilities for drawing the tree branch t_dj, the theme label y_dj, and the topic label z_di.


A. Sampling for Document-Dependent Subtree Branches

A document w_d is seen as a bag of sentences for sampling a subtree t_d. We iteratively sample a tree branch, or choose a table, t_dj for sentence w_dj using the posterior probability

p(t_{dj} = t \mid t_{d(-j)}, w, y_d, \alpha_0, \eta) \propto p(t_{dj} = t \mid t_{d(-j)}, \alpha_0) \times p(w_{dj} \mid w_{d(-j)}, t_d, y_d, \eta)    (20)

where y_d = {y_dj, y_d(−j)} denotes the set of theme labels for sentence w_dj and all the remaining sentences w_d(−j) of document w_d, and t_d(−j) denotes the branches for document w_d excluding those of sentence w_dj. In (20), the snCRP parameter α_0 and the Dirichlet prior parameter η provide control over the size of the inferred tree. The first term p(t_dj = t | t_d(−j), α_0) provides the prior on choosing a branch t for a sentence w_dj according to the snCRP with the following metaphor. On a culinary vacation, the tourist or sentence w_dj enters a restaurant and chooses a table or branch t_dj = t using the CRP probability

p(t_{dj} = t \mid t_{d(-j)}, \alpha_0) = \begin{cases} \dfrac{m_{d,t}}{m_d - 1 + \alpha_0}, & \text{if an occupied table } t \text{ is chosen} \\[4pt] \dfrac{\alpha_0}{m_d - 1 + \alpha_0}, & \text{if a new table is chosen} \end{cases}    (21)

where m_{d,t} denotes the number of tourists or sentences in document w_d that are seated at table t and m_d denotes the total number of sentences in w_d. On the next evening, this tourist goes to the restaurant in the next layer, which is identified by the card on the table selected on the current evening. Equation (21) is applied again to determine whether or not a new branch at this layer is drawn. The prior of a subtree t_d = {t_dj} is accordingly determined in a nested fashion over different layers from the different sentences in a document w_d = {w_dj}. The second term p(w_dj | w_d(−j), t_d, y_d, η) is calculated by considering the marginal likelihood over the multinomial parameters θ_{t,l} = {θ_{t,l,v}}_{v=1}^V of a node y_dj = l connected to tree branch t_dj = t for a total of V words in the dictionary, where θ_{t,l,v} = p(w_di = v | t_dj = t, y_dj = l). The parameters θ_{t,l} in a tuple (t, l) are assumed to be Dirichlet distributed with a shared prior parameter η

p(\theta_{t,l} \mid \eta) = \frac{\Gamma(V\eta)}{\prod_{v=1}^{V} \Gamma(\eta)} \prod_{v=1}^{V} \theta_{t,l,v}^{\eta - 1}.    (22)

We can derive the likelihood function of a sentence w_dj given a tree branch t and the other sentences w_d(−j) of w_d as

p(w_{dj} \mid w_{d(-j)}, t_d, y_d, \eta) = \frac{p(w_d \mid t_{dj} = t, y_d, \eta)}{p(w_{d(-j)} \mid t_{dj} = t, y_d, \eta)} = \prod_{l=1}^{L_d} \left[ \frac{\Gamma\big(\sum_{v=1}^{V} n_{d(-j),t,l,v} + V\eta\big)}{\Gamma\big(\sum_{v=1}^{V} n_{d,t,l,v} + V\eta\big)} \, \frac{\prod_{v=1}^{V} \Gamma(n_{d,t,l,v} + \eta)}{\prod_{v=1}^{V} \Gamma(n_{d(-j),t,l,v} + \eta)} \right]    (23)

where L_d denotes the maximum number of nodes in subtree t_d, n_{d,t,l,v} denotes the number of words w_di = v in document w_d that are allocated in tuple (t, l), and n_{d(−j),t,l,v} denotes the number of words w_di = v of w_d, except sentence w_dj, that are allocated in tuple (t, l). Equation (23) is obtained since the marginal likelihood is derived by

p(w_d \mid t, y_d, \eta) = \int p(w_d \mid t, y_d, \theta_{t,l}) \, p(\theta_{t,l} \mid \eta) \, d\theta_{t,l} = \prod_{l=1}^{L_d} \int \frac{\Gamma(V\eta)}{(\Gamma(\eta))^V} \left( \prod_{v=1}^{V} \theta_{t,l,v}^{n_{d,t,l,v} + \eta - 1} \right) d\theta_{t,l} \propto \prod_{l=1}^{L_d} \frac{\prod_{v=1}^{V} \Gamma(n_{d,t,l,v} + \eta)}{\Gamma\big(\sum_{v=1}^{V} n_{d,t,l,v} + V\eta\big)}    (24)

which is found by arranging an integral over a posterior Dirichlet distribution of θ_{t,l} with parameters {n_{d,t,l,v} + η}_{v=1}^V.
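For intuition, the sketch below scores candidate branches for one sentence by combining the CRP prior (21) with the per-node Dirichlet-multinomial ratio of (23), evaluated in log space for a single node of the branch (the full term multiplies such ratios over the nodes along the branch). The toy counts and the function names are assumptions made for illustration.

```python
import numpy as np
from scipy.special import gammaln

def crp_branch_prior(table_counts, alpha0):
    """CRP prior (21) over existing tables t plus a new table for one sentence."""
    m_d = sum(table_counts) + 1                      # include the sentence being seated
    return np.array(table_counts + [alpha0], dtype=float) / (m_d - 1 + alpha0)

def log_marginal_ratio(node_counts, sentence_counts, eta):
    """Log of the Dirichlet-multinomial ratio (23) for one node, with V-dim count vectors:
    node_counts = n_{d(-j),t,l,.} and sentence_counts = word counts of the candidate sentence."""
    V = len(node_counts)
    with_sent = node_counts + sentence_counts
    return (gammaln(node_counts.sum() + V * eta) - gammaln(with_sent.sum() + V * eta)
            + gammaln(with_sent + eta).sum() - gammaln(node_counts + eta).sum())

# toy usage: score two occupied branches plus a new one for a 5-word vocabulary
eta = 0.05
sentence = np.array([2., 0., 1., 0., 0.])
branches = [np.array([5., 1., 3., 0., 0.]), np.array([0., 4., 0., 2., 2.]), np.zeros(5)]
log_post = np.log(crp_branch_prior([4, 3], alpha0=0.5)) + \
           np.array([log_marginal_ratio(b, sentence, eta) for b in branches])
post = np.exp(log_post - log_post.max()); post /= post.sum()
```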

Intuitively, a model with a larger α_0 will tend toward a tree with more branches. A smaller η encourages fewer words in each tree node. The joint effect of a large α_0 and a small η in the posterior probability results in the circumstance that more themes are required to explain the data. Using the snCRP, we use sentence w_dj to find the corresponding branch t_dj. The subtree branches t_d = {t_dj} are drawn to accommodate the thematic variations in a heterogeneous document w_d.

B. Sampling for Document-Dependent Themes

After finding the subtree branches t_d = {t_dj} for the sentences in a document w_d = {w_dj} via the snCRP, we sample the document-dependent theme label y_dj = l for each sentence w_dj according to the posterior probability of the latent variable y_dj given document w_d and the current values of all the other latent variables

p(y_{dj} = l \mid w_d, y_{d(-j)}, t_d, \gamma_s, \eta) \propto p(y_{dj} = l \mid y_{d(-j)}, t_d, \gamma_s) \, p(w_{dj} \mid w_{d(-j)}, y_d, t_d, \eta).    (25)

The first term represents the distribution of the theme proportion of y_dj = l, or l_ac, in a subtree t_d given the themes of the other sentences y_d(−j). This distribution is calculated as an expectation over the treeGEM distribution, which is determined by the TSBP. Considering the draw of the theme proportion for a child node l_ac given those for the parent node l_a and the preceding child nodes {l_a1, ..., l_a(c−1)} in (18), we derive the first term based on a subtree structure t_d

p(y_{dj} = l_{ac} \mid y_{d(-j)}, \gamma_s) = E[\beta_{l_a} \mid y_{d(-j)}, \gamma_s] \times E[\beta'_{l_{ac}} \mid y_{d(-j)}, \gamma_s] \prod_{j=1}^{c-1} E\big[1 - \beta'_{l_{aj}} \,\big|\, y_{d(-j)}, \gamma_s\big]
= p(y_{dj} = l_a \mid y_{d(-j)}, \gamma_s) \times \frac{1 + m_{d(-j),l_{ac}}}{1 + \gamma_s + \sum_{u=c}^{L_d} m_{d(-j),l_{au}}} \prod_{j=1}^{c-1} \frac{\gamma_s + \sum_{u=j+1}^{L_d} m_{d(-j),l_{au}}}{1 + \gamma_s + \sum_{u=j}^{L_d} m_{d(-j),l_{au}}}    (26)

which is a product of the expectations of the theme proportion of the parent node β_{l_a}, the beta variable for the proportion of the current child node β'_{l_ac}, and the beta variables for


the remaining proportions of the preceding child nodes {β'_{l_a1}, ..., β'_{l_a(c−1)}}. In (26), m_{d(−j),l_ac} denotes the number of sentences in w_d(−j) that are allocated in node l_ac. The treeGEM parameter γ_s reflects a kind of pseudocount of the observations for the proportion. The expectation over the beta variable, E[β' | y_d(−j), γ_s], is derived by

E[\beta' \mid y_{d(-j)}, \gamma_s] = \int \beta' \, p(\beta' \mid y_{d(-j)}, \gamma_s) \, d\beta' = \int \beta' \, p(\beta' \mid \gamma_s) \, p(y_{d(-j)} \mid \beta') \, d\beta' \propto \int \beta' (1 - \beta')^{\gamma_s - 1} (\beta')^{m_{d(-j),l_{ac}}} (1 - \beta')^{\sum_{n=c+1}^{L_d} m_{d(-j),l_{an}}} d\beta' = \frac{1 + m_{d(-j),l_{ac}}}{1 + m_{d(-j),l_{ac}} + \gamma_s + \sum_{n=c+1}^{L_d} m_{d(-j),l_{an}}}    (27)

which is obtained as the mean of the posterior beta variable p(β' | y_d(−j), γ_s) with parameters 1 + m_{d(−j),l_ac} and γ_s + Σ_{n=c+1}^{L_d} m_{d(−j),l_an}. On the other hand, the second term of (25) for sampling a theme is the same as the second term of (20) for sampling a subtree. This term, p(w_dj | w_d(−j), y_d, t_d, η), has been derived in (23). There are twofold relations between the posterior probabilities of the nCRP and the snCRP. First, we treat the sentence for the snCRP as if it were the word for the nCRP. Second, the calculation of (26) is done recursively for each set consisting of a parent node and its child nodes. Using the nCRP, this calculation is performed by treating the tree nodes in different layers at a flat level without considering the subtree structure. In general, the snCRP is completed by first choosing a subtree t_d for the sentences in a document w_d and then assigning each sentence to one of the nodes in the chosen subtree according to the theme proportions β_d drawn from the treeGEM. In the generation of the theme assignment y_dj = l, the node l assigned to a sentence w_dj might be connected to a branch t_dj' drawn by another sentence w_dj' from the same document w_d. The variety of themes in a heterogeneous document w_d is reflected by the subtree t_d, which is drawn via the snCRP. A whole infinite tree is accordingly established and shared by all documents in a corpus w.

C. Sampling for Theme-Dependent Topics

Finally, we implement the tree-based HDP by sampling the theme-dependent topic z_di = k for each word w_dji, or simply w_di, under a theme y_dj = l based on the posterior probability

p(z_{di} = k \mid w, z_{-(di)}, y_{dj} = l, \gamma_w, \eta) \propto p(z_{di} = k \mid z_{-(di)}, y_{dj} = l, \gamma_w) \times p(w_{di} \mid w_{-(di)}, z, y_{dj} = l, \eta)    (28)

where z = {z_di, z_−(di)} denotes the topic labels of word w_di and the remaining words w_−(di) of a corpus w. In (28), the first term indicates the distribution of the topic proportion of z_di = k, which is calculated as an expectation over π_l ∼ GEM(γ_w). This calculation is done via a word-level SBP with a control parameter γ_w. By drawing the beta variable and calculating the topic proportion similar to (17) and (18), we derive the first term in the form

p(z_{di} = k \mid z_{-(di)}, y_{dj} = l, \gamma_w) = \frac{1 + n_{-(di),l,k}}{1 + \gamma_w + \sum_{u=k}^{K_l} n_{-(di),l,u}} \prod_{j=1}^{k-1} \frac{\gamma_w + \sum_{u=j+1}^{K_l} n_{-(di),l,u}}{1 + \gamma_w + \sum_{u=j}^{K_l} n_{-(di),l,u}}    (29)

where K_l denotes the maximum number of topics in a theme l and n_{−(di),l,k} denotes the number of words in w_−(di) that are allocated to topic k of theme l. The words of the sentences in a node with theme l are treated as the grouped data to carry out the tree-based HDP. The terms in (29) are obtained as the expectations over the beta variable for the topic proportion of the current break and the remaining proportions of the preceding breaks. The second term of (28) calculates the probability of generating the word w_di = v given w_−(di) and the current topic variables z in theme l as expressed by

p(w_{di} = v \mid w_{-(di)}, z, y_{dj} = l, \eta) = \frac{n_{-(di),l,v} + \eta}{\sum_{v=1}^{V} n_{-(di),l,v} + V\eta}    (30)

where n_{−(di),l,v} denotes the number of words w_di = v in w_−(di) that are allocated to theme l. Given the current state of the sampler, we iteratively sample each latent variable conditioned on the whole observations and the rest of the variables. Given a text corpus w, we sequentially sample the subtree branch t_dj = t and the theme y_dj = l for each individual sentence w_dj of document w_d via the snCRP. After having a sentence-based tree structure, we sequentially sample the theme-dependent topic z_di = k for each individual word w_di in a tree node based on the tree-based HDP. These samples are iteratively employed to update the corresponding posterior probabilities p(t_dj = t | t_d(−j), w, y_d, α_0, η), p(y_dj = l | w_d, y_d(−j), t_d, γ_s, η), and p(z_di = k | w, z_−(di), y_dj = l, γ_w, η) in the Gibbs sampling procedure. The true posteriors are approximated by running a sufficient number of sampling iterations. The hierarchical theme and topic model is established by fulfilling the nCRP compound HDP.
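The two factors of the topic update (28) can be evaluated from count statistics as in the sketch below: the stick-breaking expectation of (29) and the smoothed word probability of (30). The toy counts and the function names are illustrative assumptions.

```python
import numpy as np

def gem_expectation(counts, gamma_w):
    """Expected topic proportions under pi_l ~ GEM(gamma_w) given the assignment
    counts n_{-(di),l,k}, i.e., the prior term (29) evaluated for every topic k."""
    counts = np.asarray(counts, dtype=float)
    tail = np.cumsum(counts[::-1])[::-1]                       # sum_{u >= k} n_{-(di),l,u}
    cur = (1.0 + counts) / (1.0 + gamma_w + tail)              # E[pi'_{lk} | counts]
    rest = (gamma_w + tail - counts) / (1.0 + gamma_w + tail)  # E[1 - pi'_{lj} | counts]
    return cur * np.concatenate(([1.0], np.cumprod(rest[:-1])))

def word_likelihood(theme_word_counts, v, eta):
    """Smoothed probability of word v under theme l, cf. (30)."""
    theme_word_counts = np.asarray(theme_word_counts, dtype=float)
    return (theme_word_counts[v] + eta) / (theme_word_counts.sum() + len(theme_word_counts) * eta)

# The Gibbs update (28) multiplies these two factors and normalizes over the topics k.
prior_29 = gem_expectation([10, 4, 1], gamma_w=1.85)
lik_30 = word_likelihood([7, 3, 2, 3, 0], v=0, eta=0.05)
```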

V. EXPERIMENTS

A. Experimental Setup

The experiments were performed using four public-domain corpora: 1) Wall Street Journal (WSJ) 1987–1992; 2) Associated Press (AP) 1988–1990; 3) Neural Information Processing Systems (NIPS) (http://arbylon.net/resources.html and https://archive.ics.uci.edu/ml/datasets/Bag+of+Words); and 4) Document Understanding Conference (DUC) 2007 (http://duc.nist.gov). The WSJ, AP, and NIPS corpora contained news documents and conference papers that were applied for the evaluation of document representation. The perplexity of the test documents w_test = {w_d | d = 1, ..., D} was calculated by

\mathrm{Perplexity}(w_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{D} \log p(w_d)}{\sum_{d=1}^{D} N_d}\right).    (31)
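A small sketch of how the perplexity in (31), and the PMI measure (32) introduced below, can be computed from held-out statistics is given here. Estimating the word and pair probabilities from document co-occurrence and skipping zero-count pairs are assumptions of this sketch rather than the exact protocol of [32], and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def perplexity(log_probs, doc_lengths):
    """Corpus perplexity as in (31): log_probs[d] = log p(w_d), doc_lengths[d] = N_d."""
    return float(np.exp(-np.sum(log_probs) / np.sum(doc_lengths)))

def pmi_score(top_words, docs):
    """Average PMI over the 45 pairs of a topic's top-ten words, cf. (32).
    Probabilities are estimated from document co-occurrence; zero-count pairs are skipped."""
    n_docs = len(docs)
    occurs = {w: sum(w in d for d in docs) for w in top_words}
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        joint = sum((wi in d) and (wj in d) for d in docs)
        if joint:
            score += np.log(joint * n_docs / (occurs[wi] * occurs[wj]))
    return score / 45.0

# toy usage with hypothetical documents represented as sets of word types
docs = [{"market", "stock", "trade"}, {"stock", "bank", "rate"}, {"game", "team"}]
print(pmi_score(["market", "stock", "trade", "bank", "rate",
                 "game", "team", "price", "index", "fund"], docs))
```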


TABLE II
EVOLUTION FROM THE PARAMETRIC TO THE NONPARAMETRIC, FROM THE NONHIERARCHICAL TO THE HIERARCHICAL, AND FROM THE WORD-BASED TO THE SENTENCE-BASED CLUSTERING MODELS

We wish to achieve a better model with lower perplexity on the test documents. To investigate the topic coherence of the estimated topic models, we further calculated the pointwise mutual information (PMI) [32]

\mathrm{PMI}(w_{\text{test}}) = \frac{1}{45} \sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}    (32)

which is averaged over all pairs of words in the list of top-ten words, i.e., i, j ∈ {1, ..., 10}, in the estimated topics. In WSJ, we chose 1085 documents with 18 323 sentences and 203 731 words for model training and the other 120 documents as test data for evaluation. This data set had 4999 unique words. In AP, we used 1211 documents with 19 109 sentences and 198 420 words for model training and the other 130 documents for testing. This data set had 5183 unique words. In the NIPS training set, we collected 2500 papers in total, including 310 553 sentences and 3 026 153 words, with a vocabulary of 10 328 words. The other 250 papers with 290 345 words were sampled to form the test set. The scale of the NIPS corpus is much larger than that of WSJ, AP, and DUC. In the DUC corpus, there were 1680 documents consisting of 22 961 sentences and 18 696 unique words. This corpus provided a reference summary for each document, which was manually written for the evaluation of document summarization. The automatic summary for DUC was limited to at most 250 words. The NIST evaluation tool, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) at http://berouge.com, was adopted. ROUGE-1 was used to measure the matched unigrams between the reference summary and the automatic summary in terms of recall, precision, and F-measure. In these four data sets, we held out 10% of the training data as validation data for the selection of the hyperparameters α_0, η, γ_s, and γ_w and trained the models on the remaining 90% of the data. Stop words were removed. The hyperparameters were selected individually for the different data sets based on perplexity. The same selection was employed to choose the number of topics K in LDA and the numbers of themes L and topics K in sentence-based LDA (sLDA). Fivefold cross validation was performed. We evaluated the computation time in seconds using the different methods on a personal computer equipped with a six-core Intel(R) Core(TM) i7-4930K CPU at 3.4 GHz and 64 GB of RAM. For the comparative study, we implemented the parametric topic models including LDA [2] and sLDA [7], and the nonparametric topic models including HDP [9], sentence-based HDP (sHDP), nCRP [11], and the proposed snCRP compound HDP (simply denoted by snCRP hereafter). These methods can also be grouped into the categories of nonhierarchical models (LDA, sLDA, HDP, and sHDP) and hierarchical models (nCRP and snCRP) as shown in Table II. The sentence-based models based on sLDA, sHDP, and snCRP were implemented with the additional information of sentence boundaries. In [7], sLDA was implemented by representing each word w_dji in document w_d based on the theme-dependent topic z_di = k, where the theme y_dj = l was learned from a bag of sentences in the text corpus w = {w_dj}. The words were drawn from the theme-dependent topic model. Using sLDA, the numbers of

themes and topics were fixed and the theme hierarchy was not considered. In addition, the sHDP is an extension of sLDA obtained by conducting BNP inference, but the sHDP is a simplification of the snCRP without involving the theme hierarchy. The sHDP is proposed in this paper and implemented for comparison. Basically, the GEM distribution and the treeGEM distribution are applied to characterize the distributions of theme proportions for sHDP and snCRP, respectively. We also examine the effect of topic and theme hierarchies in document modeling based on nCRP and snCRP by comparing the performance of the standard SBP, the TSBP1 in Section III-D, and the TSBP in [29] (TSBP2). For each document w_d, the SBP selects a single tree path c_d [11], while TSBP1 and TSBP2 find the subtree branches t_d. This is the first work in which TSBP1 and TSBP2 are developed to explore the thematic hierarchy from heterogeneous sentences and documents. In the evaluation of document summarization, we carried out the vector-space model (VSM) in [33] and the sentence-based models using sLDA, sHDP, and snCRP, which performed sentence clustering in different ways. The sLDA conducted parametric and nonhierarchical clustering, while the sHDP executed nonparametric and nonhierarchical clustering. Only snCRP performed nonparametric and hierarchical clustering. The Kullback–Leibler (KL) divergence between the document model and the sentence model was calculated. The thematic sentences with the smallest KL divergence were selected. Considering the tree structure from snCRP, we investigated four methods for sentence selection. The snCRP-root and snCRP-leaf selected the thematic sentences allocated in the root node and the leaf nodes, respectively. The snCRP-path selected representative sentences of a document only from the nodes along the most frequently visited path. The snCRP–maximal marginal relevance (MMR) selected sentences from all possible branches t_d by applying the MMR [22]. For simplicity, we constrained the tree growth in nCRP and snCRP to three layers in our experiments. The Dirichlet prior parameter η was set separately for the three layers in nCRP and snCRP. We decreased the value of η (η1, η2, η3) from the root layer to the leaf layer to reflect the condition that the number of words allocated in the bottom layer is reduced. In the implementation of TSBP2, the beta prior parameter for depth, γsd (γsd1, γsd2, γsd3), was also decreased with depth to reflect the decreasing number of sentences in the bottom layer. The system parameters were selected individually from the validation data of the different data sets. For the DUC data set, the hyperparameters of snCRP–TSBP1 and snCRP–TSBP2 were selected as α0 = 0.5, η1 = 0.05, η2 = 0.025, η3 = 0.0125,


Fig. 7. Perplexity versus Gibbs sampling iteration of using snCRP for three data sets: WSJ, AP, and DUC.

γs = 1.85, γsd1 = 2.5, γsd2 = 1.25, γsd3 = 1.125, γsb = 1.1, and γw = 1.85. We ran the Gibbs sampling procedure for 280 iterations with 100 samples per iteration. The burn-in samples from the first 30 iterations were discarded. Fig. 7 shows the perplexity of the training documents versus the Gibbs sampling iterations using snCRP–TSBP1. The WSJ, AP, and DUC data sets are investigated. We find that the number of tree nodes increases and the corresponding perplexity decreases over the sampling iterations. The perplexity converges after ∼50 iterations. This behavior is consistent across the three data sets.


Fig. 8. Comparison of perplexity of using nCRP and snCRP, where SBP, TSBP1, and TSBP2 are investigated. WSJ and AP are used. Error bars: standard deviation across test documents in fivefold cross validation. Model complexity is compared. The blue number after nCRP denotes total number of the estimated topics while the blue numbers after snCRP denote total numbers of the estimated themes and topics.

B. Evaluation of Tree Models and Stick-Breaking Processes for Document Modeling In this set of experiments, we evaluate the performance of different tree models and SBPs for document representation using WSJ and AP data sets. The hierarchical models based on nCRP and snCRP and the draws of topic proportions or theme proportions using SBP, TSBP1, and TSBP2 are investigated. Different from the word-based hierarchical topic model using nCRP, the proposed snCRP builds the sentence-based tree model where each node represents a theme from a set of sentences and the words in these sentences are generated by HDP. In addition to the baseline nCRP–SBP [11] with single tree path, we compare the performance of nCRP and snCRP combined with TSBP1 and TSBP2, which select the subtree branches to deal with the topical and thematic variations in heterogeneous documents. Fig. 8 shows the perplexities of test documents and the estimated model complexities. The error bars show the standard deviation across test documents in fivefold cross validation. The model complexity is measured in terms of total number of estimated topics in tree model of nCRP and total numbers of estimated themes and topics in tree model of snCRP. We find that snCRP obtains lower perplexity than nCRP. Selection of subtree branches using TSBP1 and TSBP2 outperforms that of single path using SBP in both data sets. The results of TSBP1 and TSBP2 are comparable. Such performance is obtained using nCRP as well as snCRP. In addition, the estimated model complexity of using TSBP is larger than that of SBP. This is reasonable because TSBP

Fig. 9. Number of themes versus length of document using snCRP with SBP, TSBP1, and TSBP2. Six documents in different lengths are selected from WSJ. Document length is measured by total number of words in the document.

adopts more latent variables to allow larger variations in topics or themes. TSBP2 has more freedom to choose additional themes and topics but obtains very limited improvement compared with TSBP1. The model complexity of the hierarchical models nCRP and snCRP is larger than that of the nonhierarchical models LDA, sLDA, HDP, and sHDP. It is interesting that snCRP uses a smaller number of topics than nCRP. However, the total number of latent variables when using snCRP (L + K) is larger than that when using nCRP (K). This implies that additional modeling at the sentence level can reduce the number of topics required in word-level modeling for document representation. Fig. 9 shows the number of themes estimated from six documents with different lengths or numbers of words (Nd = 200, 301, 405, 504, and 688). We find that the number of themes used for representation of a document increases with the length of the document when applying TSBP1 and TSBP2. The SBP selects a single path, so only three themes are selected for each document. TSBP2 chooses more themes than TSBP1. This is evidence that the TSBPs perform better than the SBP for document modeling.


Fig. 10. Comparison of perplexity of using LDA, sLDA, HDP, sHDP, nCRP, and snCRP. Subtree branch selection in nCRP and snCRP based on TSBP1 is considered. Error bars: standard deviation. Model complexity is compared. The blue number after LDA, HDP, and nCRP denotes total number of the estimated topics, while the blue numbers after sLDA, sHDP, and snCRP denote total number of the estimated themes and topics.

Fig. 11. Comparison of PMI score of using LDA, sLDA, HDP, sHDP, nCRP, and snCRP. Error bars: standard deviation.

C. Evaluation of Different Methods in Terms of Perplexity, Topic Coherence, and Computation Time

This paper presents the evolution from the parametric and nonhierarchical topic model based on LDA to the nonparametric and hierarchical theme and topic model based on snCRP. The additional modeling over subgrouped data is introduced to conduct unsupervised structural learning of themes and topics from a set of documents. Fig. 10 compares the perplexity of the test documents using LDA, sLDA, HDP, sHDP, nCRP, and snCRP. The WSJ, AP, and NIPS data sets are used. The numbers of the estimated themes and topics are shown to evaluate the effect of model complexity in the different methods. We find that perplexity is consistently reduced from word-based clustering to sentence-based clustering across the different methods and data sets. This is because the two levels of data modeling in the sentence-based clustering methods provide an organized way to represent a set of documents. The relation between words and sentences through latent topics and themes is beneficial for document modeling regardless of the model style, model shape, and inference procedure. We consistently see that sLDA, sHDP, and snCRP estimate a smaller number of topics than LDA, HDP, and nCRP, respectively. Nevertheless, the total number of themes and topics in sentence-based clustering is still larger than the number of topics in word-based clustering. In this comparison, HDP and sHDP perform better than LDA and sLDA, respectively. The lowest perplexity is obtained using snCRP. Compared with the news articles in WSJ and AP, the large collection of scientific documents in NIPS does not produce as many topics and themes when using the BNP methods. Based on the estimated LDA, sLDA, HDP, sHDP, nCRP, and snCRP, Fig. 11 further evaluates the topic coherence by comparing the corresponding PMI scores, where the WSJ, AP, and NIPS data sets are investigated. PMI is known as an objective measure, which considerably reflects the human-judged topic coherence [32]. To conduct a consistent comparison, the PMI scores in sLDA, sHDP, and snCRP are calculated from the pairs of frequent words in the estimated

Fig. 12. Comparison of computation time (in seconds) of using sLDA, sHDP, and snCRP under different amounts of training documents.

From the PMI results, we again see the improvement of two-level clustering over one-level clustering and of nonparametric hierarchical modeling over parametric nonhierarchical modeling. In addition, the training time of the different methods is evaluated on the NIPS corpus, as shown in Fig. 12. The nCRP and snCRP based on TSBP1 are examined. To investigate how computation time scales with the amount of training data, we sampled training documents and formed training sets with 800, 1600, and 2500 documents. The model sizes of LDA and sLDA are adjusted according to the amount of training data. LDA and sLDA are estimated using VB inference, while HDP, sHDP, nCRP, and snCRP conduct inference based on Gibbs sampling. Computation time is measured up to convergence of the model parameters, with ten iterations run for VB and 50 iterations run for Gibbs sampling. The computation overhead of sLDA, sHDP, and snCRP over LDA, HDP, and nCRP is limited. Nevertheless, the computation cost of the nonparametric methods (HDP, sHDP, nCRP, and snCRP) is much higher than that of the parametric methods (LDA and sLDA). The highest cost is measured for the sentence-based tree model, i.e., snCRP. The computation time is roughly proportional to the amount of training data.
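A simple way to reproduce this kind of scalability check is to time a fixed number of training sweeps per corpus size, as in the hedged sketch below. The train_iteration callback, the dummy sweep, and the corpus sizes are placeholders standing in for whichever inference routine (VB or Gibbs) is being profiled; they are not part of the paper's implementation.

```python
import time

def profile_training(train_iteration, corpora, n_iters):
    """Measure wall-clock training time for each corpus size.

    train_iteration(corpus) should run one full VB or Gibbs sweep; n_iters
    mimics a fixed iteration budget (e.g., 10 for VB, 50 for Gibbs sampling)."""
    results = {}
    for name, corpus in corpora.items():
        start = time.perf_counter()
        for _ in range(n_iters):
            train_iteration(corpus)
        results[name] = time.perf_counter() - start
    return results

# Hypothetical usage: a trivial sweep standing in for a real sampler.
dummy_sweep = lambda corpus: sum(len(doc) for doc in corpus)
corpora = {"800 docs": [["w"] * 100] * 800, "1600 docs": [["w"] * 100] * 1600}
print(profile_training(dummy_sweep, corpora, n_iters=50))
```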


Fig. 13. Tree model of DUC showing the topical words in each theme or tree node.

Fig. 14. Comparison of recall, precision, and F-measure under ROUGE-1 evaluation using snCRP based on different sentence selection methods. Error bars: standard deviation.

D. Evaluation for Document Summarization

The proposed snCRP conducts sentence-based clustering or, equivalently, establishes a tree model that holds thematic sentences in its nodes and thereby supports document summarization. The other sentence-based clustering methods, including sLDA and sHDP, are implemented for comparison. Fig. 13 displays an example of a three-layer tree structure estimated with snCRP–TSBP1 and treeGEM from the DUC data set. For the words of all sentences allocated to the tree nodes, we apply HDP to find the topic proportions of each node based on the GEM distribution. In this figure, five topical words are displayed for the tree nodes in different layers, which are shaded with different colors. The root node (yellow) contains general words, while the leaf nodes (white) consist of specific words. Semantic relationships between tree nodes in different layers are clearly visible along the five selected tree paths, which are related to animal, television, disease, criminal, and country topics, respectively. This illustrates the performance of unsupervised structural learning. In the implementation of snCRP, sentences are selected from the tree model using four selection methods. Fig. 14 compares the document summarization performance of snCRP–TSBP1 in terms of recall, precision, and F-measure under ROUGE-1. Selecting sentences from the root node (snCRP-root) and from the leaf nodes (snCRP-leaf) performs comparably. MMR-based selection over all paths (snCRP–MMR) and KL-divergence-based selection from the most frequently visited path (snCRP-path) perform better than snCRP-root and snCRP-leaf. The snCRP-path obtains the highest F-measure in this comparison; the sentences along the most frequently visited path carry the most representative information for summarization. We therefore fix the snCRP-path setting when comparing with other methods. Fig. 15 shows the recall, precision, and F-measure of document summarization using VSM, sLDA, sHDP, snCRP–SBP, snCRP–TSBP1, and snCRP–TSBP2 under ROUGE-1 evaluation. The numbers of estimated themes and topics are included for comparison.
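For readers unfamiliar with MMR-based selection, the sketch below implements the generic maximal marginal relevance criterion over candidate sentences represented as bag-of-words vectors. The cosine similarity, the trade-off parameter lam, and the toy sentences are illustrative assumptions and not the paper's exact configuration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(sentences, n_select, lam=0.7):
    """Greedy MMR: balance relevance to the whole document set against
    redundancy with already-selected sentences."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vec = sum(vecs, Counter())                  # crude document centroid
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < n_select:
        def mmr_score(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], doc_vec) - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in selected]

print(mmr_select(["The dog chased the cat.",
                  "A cat was chased by the dog.",
                  "Stock prices rose sharply today."], n_select=2))
```

The same greedy loop can be restricted to the sentences on a single tree path, which is essentially what distinguishes the all-path and path-constrained selection settings compared in Fig. 14.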

Fig. 15. Comparison of recall, precision, and F-measure under ROUGE-1 evaluation using VSM, sLDA, sHDP, and snCRP with SBP, TSBP1, and TSBP2. Error bars: standard deviation. Model complexity is compared. The blue numbers denote the total numbers of estimated themes and topics.
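The ROUGE-1 scores reported in Figs. 14 and 15 can be reproduced in spirit with the unigram-overlap computation sketched below; the whitespace tokenization and the absence of stemming or stopword handling are simplifying assumptions.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between a candidate summary and a reference.

    recall    = overlap / reference length
    precision = overlap / candidate length
    F-measure = harmonic mean of recall and precision."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())       # min counts per shared unigram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f

print(rouge_1("the dog chased the cat", "a dog chased a small cat"))
```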

Again, snCRP–TSBPs estimate more themes and topics than snCRP–SBP. The hierarchical model based on snCRP has a larger model size than the nonhierarchical models based on sLDA and sHDP. In terms of F-measure, the theme and topic models using sLDA and sHDP are significantly better than the VSM baseline. The nonparametric model based on sHDP obtains a higher F-measure than the parametric model based on sLDA. Nevertheless, the hierarchical theme and topic model using snCRP is superior to that using sHDP. The contributions of snCRP come from its flexible model complexity and its theme structure, which are beneficial for sentence clustering, document modeling, and document summarization. As in the evaluation for document modeling, snCRP–TSBPs perform better than snCRP–SBP for document summarization, and snCRP–TSBP1 and snCRP–TSBP2 outperform the other methods in terms of F-measure.

VI. CONCLUSION

This paper addressed a new hierarchical and nonparametric model for document representation and summarization. A hierarchical theme model was constructed according to a sentence-level nCRP, while the topic model was established through a word-level HDP.


The nCRP compound HDP was proposed to build a tightly coupled theme and topic model, which can also be seen as a theme-dependent topic mixture model. A self-organized document representation using themes at the sentence level and topics at the word level was developed. We presented the TSBP to draw subtree branches for possible thematic variations in heterogeneous documents. A hierarchical mixture model of themes was constructed according to the snCRP, and hierarchical clustering of sentences was implemented. The thematic sentences were allocated to the tree nodes that were frequently visited. Experimental results on document modeling and summarization showed the merit of snCRP in terms of perplexity, topic coherence, and F-measure. The proposed snCRP is a general model for unsupervised structural learning and can be extended to characterize the latent structure in different levels of data groupings found in various specialized technical data.

REFERENCES

[1] D. M. Blei, “Probabilistic topic models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, no. 5, pp. 993–1022, Jan. 2003.
[3] D. M. Blei, L. Carin, and D. Dunson, “Probabilistic topic models,” IEEE Signal Process. Mag., vol. 27, no. 6, pp. 55–65, Nov. 2010.
[4] J.-T. Chien and C.-H. Chueh, “Topic-based hierarchical segmentation,” IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 55–66, Jan. 2012.
[5] J.-T. Chien and C.-H. Chueh, “Dirichlet class language models for speech recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 3, pp. 482–495, Mar. 2011.
[6] J.-T. Chien and M.-S. Wu, “Adaptive Bayesian latent semantic analysis,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 1, pp. 198–207, Jan. 2008.
[7] Y.-L. Chang and J.-T. Chien, “Latent Dirichlet learning for document summarization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Taipei, Taiwan, Apr. 2009, pp. 1689–1692.
[8] J.-T. Chien and Y.-L. Chang, “Hierarchical theme and topic model for summarization,” in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., Southampton, U.K., Sep. 2013, pp. 1–6.
[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, Dec. 2006.
[10] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchical topic models and the nested Chinese restaurant process,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: MIT Press, Dec. 2004, pp. 17–24.
[11] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies,” J. ACM, vol. 57, no. 2, Jan. 2010, Art. ID 7.
[12] J.-T. Chien and Y.-L. Chang, “Bayesian sparse topic model,” J. Signal Process. Syst., vol. 74, no. 3, pp. 375–389, Mar. 2014.
[13] C. Wang and D. M. Blei, “Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2009, pp. 1982–1989.
[14] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei, “The IBP compound Dirichlet process and its application to focused topic modeling,” in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel, Jun. 2010, pp. 1151–1158.
[15] H. M. Wallach, D. M. Mimno, and A. McCallum, “Rethinking LDA: Why priors matter,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2009, pp. 1973–1981.
[16] D. I. Kim and E. B. Sudderth, “The doubly correlated nonparametric topic model,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2011, pp. 1980–1988.
[17] A. Rodríguez, D. B. Dunson, and A. E. Gelfand, “The nested Dirichlet process,” J. Amer. Statist. Assoc., vol. 103, no. 483, pp. 1131–1154, Sep. 2008.
[18] J. Paisley, L. Carin, and D. M. Blei, “Variational inference for stick-breaking beta process priors,” in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, Jun. 2011, pp. 889–896.

[19] Y. W. Teh, D. Newman, and M. Welling, “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: MIT Press, Dec. 2007, pp. 1353–1360.
[20] C. Wang and D. M. Blei, “Variational inference for the nested Chinese restaurant process,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2009, pp. 1990–1998.
[21] H. M. Wallach, “Topic modeling: Beyond bag-of-words,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 977–984.
[22] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, “Multi-document summarization by sentence extraction,” in Proc. ANLP/NAACL Workshop Autom. Summarization, vol. 4, Seattle, WA, USA, Apr. 2000, pp. 40–48.
[23] A. Haghighi and L. Vanderwende, “Exploring content models for multi-document summarization,” in Proc. Human Lang. Technol., Annu. Conf. North Amer. Chapter ACL, Boulder, CO, USA, May 2009, pp. 362–370.
[24] D. Wang, S. Zhu, T. Li, and Y. Gong, “Multi-document summarization using sentence-based topic models,” in Proc. ACL-IJCNLP, Singapore, Aug. 2009, pp. 297–300.
[25] M. M. Shafiei and E. E. Milios, “Latent Dirichlet co-clustering,” in Proc. 6th Int. Conf. Data Mining, Hong Kong, Dec. 2006, pp. 542–551.
[26] L. Du, W. Buntine, and H. Jin, “A segmented topic model based on the two-parameter Poisson–Dirichlet process,” Mach. Learn., vol. 81, no. 1, pp. 5–19, Oct. 2010.
[27] J. Pitman, “Poisson–Dirichlet and GEM invariant distributions for split-and-merge transformations of an interval partition,” Combinat., Probab., Comput., vol. 11, no. 5, pp. 501–514, Sep. 2002.
[28] D. J. Aldous, “Exchangeability and related topics,” in École d’Été de Probabilités de Saint-Flour XIII. Berlin, Germany: Springer-Verlag, 1985, pp. 1–198.
[29] R. P. Adams, Z. Ghahramani, and M. I. Jordan, “Tree-structured stick breaking for hierarchical data,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2010, pp. 19–27.
[30] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, “Nested hierarchical Dirichlet processes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 256–270, Feb. 2015.
[31] D. Andrzejewski, X. Zhu, and M. Craven, “Incorporating domain knowledge into topic modeling via Dirichlet Forest priors,” in Proc. 26th Annu. Int. Conf. Mach. Learn., Montreal, QC, Canada, Jun. 2009, pp. 25–32.
[32] D. Newman, E. V. Bonilla, and W. Buntine, “Improving topic coherence with regularized topic models,” in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2011, pp. 496–504.
[33] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” in Proc. 24th Annu. Int. ACM SIGIR, New Orleans, LA, USA, Sep. 2001, pp. 19–25.

Jen-Tzung Chien (S’97–A’98–M’99–SM’04) received the Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1997.
He was a Visiting Researcher with Panasonic Technologies Inc., Santa Barbara, CA, USA; the Tokyo Institute of Technology, Tokyo, Japan; the Georgia Institute of Technology, Atlanta, GA, USA; Microsoft Research Asia, Beijing, China; and the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He was with National Cheng Kung University, Tainan, Taiwan, from 1997 to 2012. He has been with the Department of Electrical and Computer Engineering and the Department of Computer Science, National Chiao Tung University, Hsinchu, since 2012, where he is currently a Distinguished Professor. His current research interests include machine learning, information retrieval, speech recognition, blind source separation, and face recognition.
Prof. Chien served as an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS from 2008 to 2011, a Guest Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING in 2012, and a Tutorial Speaker at Interspeech in 2013 and at the International Conference on Acoustics, Speech, and Signal Processing in 2012 and 2015. He received the Distinguished Research Award from the Ministry of Science and Technology, Taipei, Taiwan, and the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011. He also serves as an Elected Member of the IEEE Machine Learning for Signal Processing Technical Committee.
