
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 11, NOVEMBER 2012

Unsupervised Learning of Categorical Data With Competing Models

Roman Ilin, Member, IEEE

Abstract—This paper considers the unsupervised learning of high-dimensional binary feature vectors representing categorical information. A cognitively inspired framework, referred to as modeling fields theory (MFT), is utilized as the basic methodology. A new MFT-based algorithm, referred to as accelerated maximum a posteriori (MAP), is proposed. Accelerated MAP allows simultaneous learning and selection of the number of models. The key feature of accelerated MAP is a steady increase of the regularization penalty, resulting in competition among models. The differences between this approach and other mixture learning and model selection methodologies are described. The operation of this algorithm and its parameter selection are discussed. Numerical experiments aimed at finding performance limits are conducted. The performance with real-world data is tested by applying the algorithm to a text categorization problem and to clustering Congressional voting data.

Index Terms—Bernoulli mixture, dynamic logic, maximum a posteriori (MAP), model selection, modeling fields theory, regularization, text categorization, vague-to-crisp process.

I. INTRODUCTION

THE basic methodology utilized in this paper is referred to as modeling fields theory (MFT) [1]. This is a cognitively inspired mathematical framework providing a generic way of finding an optimal match between a set of models with uncertain parameters, representing a priori knowledge, and sensor input data. The match is obtained by maximizing the similarity between the models and the data. A key feature of MFT is the vague-to-crisp process [2], which begins by assigning almost equal association weights between the models and the data. Such a process results in efficient computation and helps avoid local maxima [3]. Empirical evidence confirming the existence of such processes in the brain was found during neuroimaging experiments [4]. MFT describes a hierarchical system of perception and cognition [5]. In this hierarchy, each layer implements the same basic computation (described in Section III) to find the best match between its inputs coming from the layer below and the models it contains. The first layer's input comes from the sensors sampling the environment. The models of the first layer correspond to physical objects with continuous

Manuscript received November 28, 2011; revised August 3, 2012; accepted August 3, 2012. Date of publication September 10, 2012; date of current version October 15, 2012. This work was supported in part by the U.S. Air Force Office of Scientific Research, Department of the Air Force, under Grant 11RY06COR. The author is with the Air Force Research Laboratory, Wright Patterson Air Force Base, OH 45433 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2213266

inputs. The higher layers contain abstract models. In general, abstract concepts are defined as aggregations of usually less-abstract entities and the relationships between those entities. The presence of an entity or a relationship can be modeled as a binary random variable, and a concept can be modeled as a vector of such variables. In the case of the linguistic data considered in this paper, more abstract concepts are described by the joint probability of certain words being present in the text. Accordingly, the data are encoded as binary feature vectors. Such encoding is a generic way of representing categorical data, utilized in a vast variety of applications. This necessity to model and learn categorical data within the higher layers of MFT provides the motivation for this paper. MFT provides a way to assign each data point to the model that is most similar to it. These assignments divide (or cluster) the data set into groups. Categorical clustering has attracted researchers' attention lately [6], due to the large number of problems with data containing categorical attributes, such as market basket analysis and many other data mining tasks. Interesting examples of clustering approaches, including categorical clustering, can be found in [7]–[10]. One of the challenges in clustering is determining the number of clusters [11]. Clustering is often posed as an optimization problem, and usually selecting a larger number of clusters results in a better value of the optimization criterion. This phenomenon is referred to as overfitting. In such cases, additional considerations are necessary to select an appropriate structure. If, as in this paper, mixture probability density-based methods are used for clustering, this problem falls under the general category of model selection, which is a known difficult problem [12]. This paper introduces a novel algorithm, called accelerated maximum a posteriori (MAP), which allows simultaneous learning and model selection with categorical data.
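The binary encoding of categorical data described above can be illustrated with a small sketch (the vocabulary and records here are hypothetical, not from the paper):

```python
# Hypothetical example of encoding categorical data as binary feature
# vectors: each vocabulary item maps to one dimension that is 1 when
# the item is present in a record and 0 otherwise.
vocabulary = ["tax", "vote", "budget", "health"]  # illustrative feature set

def encode(record, vocabulary):
    """Return the binary feature vector for a set of present items."""
    return [1 if item in record else 0 for item in vocabulary]

x = encode({"vote", "budget"}, vocabulary)  # -> [0, 1, 1, 0]
```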
The rest of this paper introduces the algorithm, compares it to related techniques, and provides insights into its operation and a heuristic for its parameter tuning. A performance comparison with other clustering methods and two applications to real-world data are given.

II. RELATED TECHNIQUES AND CURRENT CONTRIBUTION

MFT provides a framework for simultaneous learning and data association by matching the data to multiple models with uncertain parameters. From a statistical point of view, a set of models forms a mixture [13]. In the case when the number of mixture components is known, the expectation maximization (EM) algorithm [14], [15] is a widely used technique for mixture parameter estimation. Usual criticisms of EM include sensitivity to initial conditions and slow convergence.


Sampling techniques based on MCMC provide an alternative approach to mixture parameter estimation. Gibbs sampling [16], [17] is a widely used technique for sampling from high-dimensional spaces using an iterative approach. When, as in this paper, the number of components is unknown, several statistical approaches to model selection exist. One family of approaches consists of defining a model selection criterion, computed for each model after parameter estimation is completed. The Akaike information criterion is a popular methodology of this kind [18]. The Bayesian information criterion is another popular tool for model selection [19]. Other criteria include the integrated completed likelihood introduced in [20] and the Bayesian entropy criterion proposed in [21]. The Bayes factor is a general method for model comparison [22]. It requires computing the marginal data likelihood, which is not trivial for mixture models. The value of the marginal likelihood can be approximated using a Monte Carlo estimate [23]; however, such an estimate is computationally expensive for high-dimensional models. The above-mentioned approaches perform parameter estimation and model selection separately. Other approaches attempt to solve both problems simultaneously. One of the popular Bayesian techniques is an extension of the MCMC algorithm to include the number of components in the set of unknown parameters. Such an extension results in a parameter space with changing dimensions, and a special technique referred to as reversible jump (RJ) is necessary to allow the MCMC sampler to smoothly transition between the parameter spaces [24], [25]. An alternative approach consists of modifying the estimation process by introducing a penalty term. The penalty function is added to the main objective function and is defined in a way that penalizes overly complex models. This approach is called regularization [26], [27].
A properly defined penalty function leads to turning off some of the model parameters, thus reducing the model complexity. Similar to RJ-MCMC, this approach combines model selection and parameter estimation in one computational procedure. The penalty function is often equivalent to the introduction of an a priori parameter distribution, thus connecting the regularization approach with the Bayesian approach. Recently, model selection penalty terms based on information theory have been introduced. In [28], the minimum message length criterion is used. In [29], an entropy term is added to the objective function to encourage simpler models. Both approaches implement an EM iterative process, which results in the annihilation of some of the models due to the model selection term. The annihilation of mixture components is caused by driving their mixing proportions to zero. Bayesian Ying-Yang harmony learning has been proposed as a methodology for simultaneous learning and model selection. This approach has been applied to categorical data using Poisson mixture models [30]. Mathematically, maximization of the harmony function is equivalent to minimization of the Kullback–Leibler divergence between two joint data and parameter densities. The presence of the a priori parameter distribution contributes to model selection by forcing the mixing proportions of some of the components to zero.


In [30], Bayesian Ying-Yang was successfully applied to clustering high-dimensional vectors representing textures, for which the Poisson distribution is an appropriate model. In this paper, we limit ourselves to binary data and the Bernoulli distribution as appropriate for textual data. The approach advocated in this paper is related to regularization and simultaneously incorporates the key ideas of MFT. It is similar to other approaches in that some of the models are annihilated during the iterative estimation process. There are several novelties introduced in this paper.
1) Component annihilation is caused by driving the model parameters toward values that make the model improbable and unsupported by the data.
2) The amount of penalty is not decreased, as in other regularization approaches, but is increased with time. Although counterintuitive, such an approach results in good performance.
3) The introduction of a competition mechanism is based on the idea of gradual and balanced adjustment of the model parameters and the penalty term.
Outside of mixture models, there are various clustering techniques capable of handling categorical data. An overview of clustering methods is given in [6]. In addition to the above-mentioned statistical model selection approaches, the determination of the number of clusters is also done using the gap statistic [31]. Spectral clustering is a popular clustering approach requiring only standard linear algebra manipulations [32]. The determination of the number of clusters in this case can be done using the eigengap heuristic. In this paper, the proposed algorithm demonstrates superior performance in comparison to a spectral clustering algorithm with the eigengap heuristic and to the jump method, which is related to the gap statistic.

III. MFT

MFT considers the general problem of finding the best match between a set of models M = {M_1, . . ., M_H} that depend on a set of parameters S = {S_1, . . ., S_H} and a set of data inputs X = {X_1, . . ., X_N} [1]. One of the challenges of this task is the inherent exponential computational complexity of data-to-model association [33]. MFT overcomes this complexity using a vague-to-crisp process of simultaneous data association and parameter estimation. This process is expressed mathematically as the maximization of the following total similarity:

L(X, M(S)) = ∏_{n=1}^{N} Σ_{h=1}^{H} r_h l(x_n, M_h).  (1)
Here, the similarity between a data element x_n and a model M_h is measured by a function l(x_n, M_h), and the total similarity is given as a product over all the data and a summation over all the models. The quantities r_h are the relative weights of the models. The mathematical form of the models reflects the prior knowledge, and the model parameter values are estimated based on the evidence given by the data. The maximization of (1) is achieved by the iterative evaluation (2) of the model parameters and special quantities referred to as
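As a sketch, the total similarity (1) can be computed directly for small N and H (the function names are illustrative; real implementations work with logarithms to avoid numerical underflow):

```python
def total_similarity(X, models, r, l):
    """Total similarity L(X, M(S)) of (1): a product over data elements
    of the weighted sum of per-model similarities l(x_n, M_h)."""
    total = 1.0
    for x_n in X:
        total *= sum(r_h * l(x_n, M_h) for r_h, M_h in zip(r, models))
    return total
```

Because (1) multiplies N weighted sums, each typically much smaller than one, practical implementations maximize its logarithm instead.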


association weights. In (2), α is a gradient ascent constant; see [1] for its derivation. The association weights f_hn are computed based on the current values of the parameters S_h, and the parameters are recomputed in the estimation step based on the current association weights.

Association:
f_hn = r_h l(n, h) / Σ_{h'=1}^{H} r_{h'} l(n, h')

Estimation:
S_h^{t+1} = S_h^{t} + α Σ_{n=1}^{N} f_hn ∂ log l(n, h)/∂S_h,
r_h = (1/N) Σ_{n=1}^{N} f_hn.  (2)

Each step increases the value of the objective function, and it can be shown that the procedure converges to a (possibly local) maximum of L(X, M(S)). There are two important additions to the iterative procedure (2) that make it unique and efficient.
1) Initialization: The model parameters must be initialized in such a way that the initial association weights are almost the same for all models and all data elements. This is the vague initialization postulated by MFT.
2) Vagueness Control: A special mechanism may be introduced into the models to allow for a controlled decrease in the vagueness of data association.
This paper introduces a third mechanism to the MFT framework, related to model selection.
3) Competition Among Models: A special mechanism is introduced for inducing competition among models, resulting in correct model selection.
The models are often defined in probabilistic terms as probability density functions depending on the parameters S_h. In this case, the similarity between the data element x_n and the model M_h is given by the likelihood of observing the data given the model:

l(n, h) ≡ p(x_n | S_h).  (3)

The optimization criterion (1) becomes the total data likelihood. Various applications of MFT call for different types of models. In previous work related to the detection and tracking of objects in a sequence of radar detections or optical images, the models describe the expected object characteristics, such as the trajectory of their motion and the shape and brightness of their appearance [34], [35]. The extension of MFT to learning abstract concepts, such as situations, requires the introduction of different models capable of handling categorical features [36], [37].

IV. BERNOULLI MIXTURES

Multivariate Bernoulli mixtures are a natural way of modeling the probability distribution of binary feature vectors. The number of individual observations (data points) in the data set is denoted by N. An individual observation is denoted by x_n, n = 1, . . ., N. Each observation is a D-dimensional binary vector with elements x_nd ∈ {0, 1}, d = 1, . . ., D. Each element of the binary vector, also referred to as a feature, is modeled by the Bernoulli distribution, which is a discrete probability distribution taking the value 1 with probability p_hd, where h is the model index. The elements of the binary vector are assumed to be statistically independent. Under this assumption, the probability of observing vector x_n conditional on model h is given as follows:

p(x_n | h) = ∏_{d=1}^{D} p_hd^{x_nd} (1 − p_hd)^{1−x_nd}.  (4)

The set of all model parameters is denoted by Θ. In our case, this set consists of the probabilities of each feature and the mixing coefficients: Θ = {r_h, p_hd}, h = 1, . . ., H, d = 1, . . ., D. The objective function (1) takes the following form:

L(X|Θ) = ∏_{n=1}^{N} Σ_{h=1}^{H} r_h p(x_n | h).  (5)
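A minimal sketch of the Bernoulli conditional probability (4) and the mixture likelihood (5), assuming D is small enough that underflow is not a concern:

```python
import numpy as np

def bernoulli_prob(x, p):
    """p(x_n | h) of (4): product over features of
    p_hd^x_nd * (1 - p_hd)^(1 - x_nd)."""
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))

def mixture_likelihood(X, r, P):
    """L(X | Theta) of (5). X: (N, D) binary array; r: (H,) mixing
    weights; P: (H, D) Bernoulli parameters."""
    like = 1.0
    for x in X:
        like *= sum(r[h] * bernoulli_prob(x, P[h]) for h in range(len(r)))
    return like
```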

Substitution of the Bernoulli probability model into the MFT algorithm (2) results in the following iterative process (the formula for r_h from (2) is not repeated):

f_hn = r_h p(x_n | p_h) / Σ_{h'=1}^{H} r_{h'} p(x_n | p_{h'})
p_h^{t+1} = p_h^{t} + α Σ_{n=1}^{N} f_hn ∂ log p(x_n | p_h)/∂p_h.  (6)
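One association/estimation pass of (6) might be sketched as follows (a sketch, not the author's code; the vectorization and clipping bounds are assumptions):

```python
import numpy as np

def iterate_once(X, r, P, alpha=0.01):
    """One iteration of (6). X: (N, D) binary data; r: (H,) mixing
    weights; P: (H, D) Bernoulli parameters."""
    # Association: f_hn proportional to r_h p(x_n | p_h), computed in logs.
    log_px = X @ np.log(P.T) + (1 - X) @ np.log(1 - P.T)   # (N, H)
    f = r * np.exp(log_px)
    f /= f.sum(axis=1, keepdims=True)
    # Estimation: gradient ascent on log p(x_n | p_h), using
    # d log p / d p_hd = (x_nd - p_hd) / (p_hd (1 - p_hd)),
    # plus the r_h update carried over from (2).
    grad = np.einsum('nh,nhd->hd', f, (X[:, None, :] - P) / (P * (1 - P)))
    P_new = np.clip(P + alpha * grad, 1e-6, 1 - 1e-6)
    r_new = f.mean(axis=0)
    return f, r_new, P_new
```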

This algorithm by itself does not solve the model selection problem, since the number of models H has to be given to the algorithm. The following sections introduce the improvement necessary to find the optimal value of H.

V. REGULARIZATION AND MAP

Regularization is a technique that involves the introduction of additional information about the solution [38]. This additional information reflects a priori knowledge about the model. From a statistical point of view, the introduction of a priori distributions on the model parameters is equivalent to adding an additional term to the objective function that depends on the model parameters. This term can be viewed as a regularization penalty. The priors change the ML estimation into MAP. If the goal is to find the simplest possible model that can explain the data, the complexity of the model has to be factored into the penalty term. The penalty term is usually formulated in a way that encourages setting some of the model parameters to zero, which reduces the number of free parameters and thus effectively provides for model selection. The penalty term is denoted by Q(Θ). The new objective function is the sum of the likelihood (5) and the penalty term:

L_r(X|Θ) = L(X|Θ) + Q(Θ).  (7)

The optimization procedure (2) applies to the new objective function. The easiest way to see this is to repeat the derivation of the MFT procedure given in [1, Ch. 4] with the new objective function. Intuitively, the addition of a term that depends on the parameters only, and not on the data, does not affect the association step of the procedure. The penalty term only appears inside the derivatives in the estimation step. In this paper, the penalty term is formulated to impose a penalty on the magnitudes of the individual parameters:

L_r(X|Θ) = Σ_{n=1}^{N} log Σ_{h=1}^{H} r_h p(x_n | h) + Σ_{h=1}^{H} Σ_{d=1}^{D} log g(p_hd).  (8)


Here, the function g(p_hd) needs to have large values for p_hd close to 0 and small values for p_hd close to 1. Such a penalty creates a preference for models with all parameters set to zero. Such models are highly improbable; as a result, they become unsupported by the data and are consequently annihilated. The penalty function Q introduces a tradeoff between the fit of the data to the model and the complexity of the model. A large penalty results in oversimplification of the model, whereas a small penalty results in overfitting. Since the parameters are limited to the (0, 1) interval, a natural candidate for the a priori probability density function (PDF) is the beta distribution. The PDF of the beta distribution is given by the following formula:

g_b(p_hd) = p_hd^{a−1} (1 − p_hd)^{b−1} / B(a, b).  (9)
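For a = 1, the log of the beta density (9) reduces to log b + (b − 1) log(1 − p_hd); a small sketch:

```python
import math

def log_beta_pdf(p, a, b):
    """log g_b(p) from (9): log of the Beta(a, b) density at p,
    using B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_B
```

With a = 1 and b > 1 this is a decreasing function of p, so the penalty added in (8) favors parameters near zero.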

Here, a and b are the distribution parameters, and B is the beta function. The shape of the beta distribution strongly depends on the values of a and b. Since the goal of this PDF is to penalize large values of parameters, it is natural to choose a = 1 and b > 1 to ensure that the PDF is a decreasing function. The parameter b controls the shape of the a priori distribution. Large values of b make the PDF concentrated close to zero.

VI. ACCELERATED MAP

As mentioned, the basic algorithm (2) does not change except for the computation of the derivatives with respect to the model parameters, which need to include the term coming from the a priori PDF. The derivative with respect to the model parameter p_hd is obtained from (8) and (9):

∂L_b(X|Θ)/∂p_hd = Σ_{n=1}^{N} f_hn (x_nd − p_hd)/(p_hd(1 − p_hd)) − (b − 1)/(1 − p_hd).  (10)

The first term in (10) is the same as it would be in the ML estimation. The second term comes from the a priori PDF. The parameter b of the beta distribution is critical. It is known that the mean value of the beta PDF is given by a/(a + b). In our case a = 1, and the mean value is inversely proportional to b. Therefore, as b increases, the a priori probability density of the model parameters becomes concentrated closer and closer to zero. This has the effect of turning the models off. The computational scheme proposed in this paper consists of two modifications to the basic MFT algorithm (2) and is referred to as accelerated MAP.
1) Introduction of a variable parameter b. This parameter starts with small values and gradually increases. Such a strategy is referred to as acceleration in this paper.
2) Use of a special form of gradient ascent in the maximization step. The gradient ascent employed here takes into account only the sign of the derivative (10). Each parameter is updated by a constant increment α in the direction of the derivative. Such an update scheme provides better control over the rate of change of the parameter values. In fact, this gradual change of parameters implements the vague-to-crisp mechanism of MFT for this type of models. The parameters start with values


close to 0.5 (large variance), and at each iteration, they can change only by ±α. The intuition behind this approach comes from the realization that the introduction of the beta prior alone may not be sufficient for solving the model selection problem. If the value of b is too large, all the models will be turned off. If the value is too small, the penalty will not be enough to turn off all the unnecessary models. Finding the proper value for b is difficult, since it likely depends on the data. A smaller penalty may be sufficient for well-separated data; a larger penalty may be necessary for overlapping data. The introduction of a dynamically changing penalty provides an alternative approach. Although an unlimited increase of the parameter b ultimately results in turning off all of the models, it is plausible that, if the penalty parameter b is increased slowly enough, the algorithm's behavior should indicate when b is in the neighborhood of its optimal value. Indeed, it was observed in numeric simulations that the algorithm goes through a series of "stable phases," during which the number of models turned off does not change. Such stable phases can serve as the basis for model selection. The number of active models is taken to be the estimated number of components k̂. This procedure will be illustrated in the following sections, along with the discussion of a procedure for selecting the algorithm's parameters. The following expression for b is used in this paper:

b = b0 + t · ξ.  (11)
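Combining the linear schedule above with the penalized derivative (10) and a sign-only step gives one accelerated-MAP iteration; a sketch under the same assumptions as before (clipping bounds and parameter values are illustrative):

```python
import numpy as np

def accelerated_map_step(X, r, P, t, alpha=0.005, xi=0.1, b0=1.0):
    """One accelerated-MAP iteration (sketch): b = b0 + t*xi per (11),
    the derivative (10), and an update by a fixed increment alpha in the
    direction of the derivative's sign."""
    b = b0 + t * xi
    log_px = X @ np.log(P.T) + (1 - X) @ np.log(1 - P.T)
    f = r * np.exp(log_px)
    f /= f.sum(axis=1, keepdims=True)
    grad = np.einsum('nh,nhd->hd', f, (X[:, None, :] - P) / (P * (1 - P)))
    grad -= (b - 1.0) / (1.0 - P)          # penalty term of (10)
    P_new = np.clip(P + alpha * np.sign(grad), 1e-6, 1 - 1e-6)
    r_new = f.mean(axis=0)
    return r_new, P_new
```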

This means that the parameter b increases linearly with each iteration of the algorithm. Moreover, the initial value of b has been set to 1 (b0 = 1), since this initialization has a natural interpretation: at iteration 0 there is no penalty. Therefore, the only unknown parameter is the rate of increase ξ. The gradient ascent update is given by the following formula:

p_hd(t + 1) = p_hd(t) + α · sgn(∂L_b(X|Θ)/∂p_hd).  (12)

Here, sgn(·) is the sign function. The values of α and ξ are the procedure parameters. The following sections give examples and clarify how these parameters affect the performance of the algorithm.

VII. NUMERICAL EXAMPLES

To illustrate how the algorithm operates, it was applied to two data sets with the number of data points N = 500, the size of the binary vectors D = 200, and the number of true components K = 7. The first data set had good separation between the components, and the second one had overlapping components. More on how the overlap is evaluated is given in Section XII. The model weights r_h, evaluated at each iteration, are shown in Figs. 1 and 2. The algorithm started with H = 25 models. In the beginning of the process, all of the data are associated with each of the 25 models. The extra models become turned off after several hundred iterations due to the accelerated penalty term. Once the true number of components is reached, the mixing proportions remain stable for an extended period of time. The penalty keeps increasing, and at some point its influence on the solution becomes overwhelming. All of the models become turned off. This process is typical of the operation of accelerated MAP. When the separation between components is insufficient, the stable phase is never reached before all of the models are turned off, as shown in Fig. 2. In order to demonstrate the effect of acceleration, the parameter ξ was set to zero. This removes the effect of the penalty term and turns the algorithm into ML estimation. Fig. 3 demonstrates that all 25 models remain active.

Fig. 1. Mixture proportions for a dataset with seven well-separated components.

Fig. 2. Mixture proportions for a dataset with seven components that are not sufficiently separated.

Fig. 3. Mixing proportions for a dataset with seven components and the acceleration parameter ξ set to zero and the parameter b0 set to one, turning accelerated MAP into an ML estimate.

The examples demonstrate the operation of the algorithm and confirm that the right number of components is detected during its execution. Due to the constant increase of the parameter b, it is guaranteed that all the models will be terminated in finite time, providing a natural stopping criterion. In order to make the algorithm practically usable, it is necessary to provide a way to: 1) select its parameters; 2) detect stable stages; and 3) find the limitations on its ability to separate overlapping mixture components. The following sections shed more light on these issues.

VIII. CHOICE OF PARAMETERS α AND ξ

Equation (10) for the derivative is repeated here for convenience:

∂L_b(X|Θ)/∂p_hd = Σ_{n=1}^{N} f_hn (x_nd − p_hd)/(p_hd(1 − p_hd)) − (b − 1)/(1 − p_hd).  (13)

Consider this expression at a certain iteration t. For a given model h and a given feature index d, the indexes of the data samples assigned to the model can be divided into two subsets, A1 = {n : x_nd = 1} and A0 = {n : x_nd = 0}. The derivative is then expressed as follows:

∂L_b(X|Θ)/∂p_hd = (1/p_hd) Σ_{n∈A1} f_hn − (1/(1 − p_hd)) Σ_{n∈A0} f_hn − (b − 1)/(1 − p_hd).  (14)

Further, the quantity N_h ≡ Σ_{n=1,...,N} f_hn is introduced. Suppose that the data-to-model assignments are close to 1 or 0. The validity of this assumption will be discussed in Section IX. In this case, N_h is approximately equal to the number of data points assigned to model h. Out of these data points, some proportion will have the value 1 for feature d and the rest will have 0. Thus, the sizes of the subsets A1 and A0 can be expressed as Σ_{n∈A1} f_hn = p1 N_h and Σ_{n∈A0} f_hn = p0 N_h. Further, suppose p* is the true value of the parameter p_hd. Introduce the following quantities m and k:

m ≡ p_hd / p*  (15)
k ≡ p1 / p*.   (16)

The quantity m is the ratio of the current estimate p_hd to its true value. The quantity k is the ratio of the current proportion of 1s in the subset of data currently assigned to model h to the true value of the parameter. Note that both quantities should be close to 1 if the algorithm performs correctly. The derivative (14) now takes the following form:

∂L_b(X|Θ)/∂p_hd = k N_h / m − (1 − k p*) N_h / (1 − m p*) − (b − 1)/(1 − m p*).  (17)


Fig. 4. Parameter estimate in terms of m and k. The line through the origin serves as an attractor, and separates the regions with positive and negative derivatives. The arrow corresponds to the parameter estimate oscillating around the attractor.

Setting (17) to zero results in the following condition:

m = N_h k / (N_h + b − 1).  (18)
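The fixed point (18) can be checked numerically against (17); a small sketch with illustrative parameter values:

```python
def derivative_17(m, k, Nh, b, p_star):
    """The derivative (17) expressed in terms of m and k."""
    return (k * Nh / m
            - (1 - k * p_star) * Nh / (1 - m * p_star)
            - (b - 1) / (1 - m * p_star))

# The root predicted by (18): m = Nh k / (Nh + b - 1)
Nh, b, k, p_star = 50.0, 11.0, 1.0, 0.3
m_root = Nh * k / (Nh + b - 1)   # = 50/60
```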

Consider the derivative (17) as a function of the two parameters m and k. A point (k, m) located above the line given by (18) corresponds to a negative derivative, and a point below the line to a positive one. When the parameter estimate is correct, m = 1. When the data association is correct, k = 1. The parameter is updated as follows:

p_hd(t + 1) = p_hd(t) + α · sgn(∂L_b(X|Θ)/∂p_hd).  (19)

In terms of m, the update becomes

m(t + 1) = m(t) + (α/p*) · sgn(∂L_b(X|Θ)/∂p_hd).  (20)

Let us try to understand the dynamics of this iterative process. At iteration t, the parameter estimate and the data associations result in a pair of values for m and k, corresponding to the point m(t) in the m–k plane, as shown in Fig. 4. The plane is divided by the line m(k) into negative and positive half-planes. If the current point is located in the negative half-plane, it will move down at iteration t + 1. Otherwise, it will move up, as shown in Fig. 4. The line m(k) serves as an attractor for this iterative process, since the points are always forced to move toward it. The line, however, changes its location due to two factors: 1) the association weights affecting N_h and 2) the regularization parameter b increasing by ξ. Also notice that the line always passes through the origin. The increase of b due to ξ rotates the line clockwise. The increase of N_h rotates the line counterclockwise. If N_h decreases, the line rotates clockwise.


Fig. 5. Factors that influence how the zero derivative line moves. The increase in penalty causes the line to turn clockwise around the origin. The increase of the number of data points associated with the model causes the line to move counterclockwise.

As long as the estimate (k, m) is far from the line, the value of p_hd will monotonically decrease or increase with step size α. If the estimate (k, m) at iteration t comes within one vertical step of the line at t + 1, the derivative changes sign. At the next time step, the estimate will cross the line again, and this process will repeat, resulting in a stable state of the iterative process. Since the line can move up or down and destabilize the process, the estimate will start to move in the same direction until it comes close enough to jump over the line and return to stable oscillations around it. As already mentioned, the line moves due to two factors: 1) the increase in the regularization parameter b and 2) the change in the number of data points assigned to the model. The first factor is controlled by the parameter ξ. The second factor cannot be controlled directly; however, its behavior is influenced by ξ as well. Suppose that, between iterations t and t + 1, N_h increased by δ and b increased by ξ. The line moves, and, for a fixed k, the vertical distance between the line at t and t + 1 is given as follows:

Δh = N_h k / (N_h + b − 1) − (N_h + δ) k / (N_h + δ + b + ξ − 1) ≈ [N_h ξ − δ(b − 1)] / (N_h + b − 1)² · k.  (21)

Fig. 5 summarizes the factors that influence the movement of the line. Conceptually, two scenarios are possible. The "stable model" scenario occurs when the number of points assigned to the model increases or stays the same. The line moves down only due to ξ. The estimated parameters will gradually drift toward zero; however, the speed of this drift can be made arbitrarily slow by selecting small values of ξ. The "unstable model" scenario occurs when the model loses data points. In this case, the line moves at a faster pace due to δ.
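The first-order approximation in (21) can be checked against the exact line positions from (18); a sketch with illustrative values:

```python
def line_m(k, Nh, b):
    """The zero-derivative line (18): m = Nh k / (Nh + b - 1)."""
    return Nh * k / (Nh + b - 1)

def shift_exact(k, Nh, b, delta, xi):
    """Exact vertical shift of the line between iterations t and t + 1,
    when Nh grows by delta and b grows by xi."""
    return line_m(k, Nh, b) - line_m(k, Nh + delta, b + xi)

def shift_approx(k, Nh, b, delta, xi):
    """First-order approximation of the shift from (21)."""
    return (Nh * xi - delta * (b - 1)) / (Nh + b - 1) ** 2 * k
```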


Fig. 6. As long as the zero derivative line remains inside the vertical interval, the estimate continues to oscillate around the attractor.

Since the parameter estimates are allowed to change by a fixed quantity α, the line is likely to move down faster than the estimate (k, m) and will keep the estimate on its negative side, ultimately resulting in the (0, 0) estimate. The "unstable model" turns off. The choice of parameters α and ξ can be made by considering the "stable model" case. Let us consider the case where the number of data points assigned to the model is constant. The reasoning is illustrated in Fig. 6. Suppose at time t the point (k, m) corresponding to the current estimate is right above the line. At time t + 1, it will jump down by α/p*, and the line will move down by N_h ξ k/(N_h + b − 1)². At time t + 2, the point will jump back to its original position. This jumping will continue until the line reaches the bottom position of the point. When the line crosses that position, the oscillations will skip one cycle, and the point will move down to resume oscillations from the new position. Suppose that we would like to maintain stable oscillations for some number of iterations I_max. The relationship between the parameters α and ξ can be derived from the condition

N_h ξ k I_max / (N_h + b − 1)² < α / p*.  (22)

Note that the value of p* lies in the interval (0, 1), so we can use its maximum value of 1 to obtain a stricter condition. Similarly, k is a positive number that should have a value close to 1 when the estimate is close to its true value. Therefore, with k ≈ 1 and p* replaced by 1, the condition becomes ξ < α (N_h + b − 1)² / (N_h I_max).
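Rearranging (22) with k ≈ 1 and p* bounded by 1 yields an upper bound on ξ; a sketch (parameter values illustrative):

```python
def xi_upper_bound(alpha, Nh, b, I_max):
    """Stricter form of condition (22) with k = 1 and p* = 1:
    xi < alpha * (Nh + b - 1)**2 / (Nh * I_max)."""
    return alpha * (Nh + b - 1) ** 2 / (Nh * I_max)
```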
