

Divisive Gaussian Processes for Nonstationary Regression

Luis Muñoz-González, Miguel Lázaro-Gredilla, Member, IEEE, and Aníbal R. Figueiras-Vidal, Fellow, IEEE

Abstract— Standard Gaussian process regression (GPR) assumes constant noise power throughout the input space and stationarity when combined with the squared exponential covariance function. This can be unrealistic and too restrictive for many real-world problems. Nonstationarity can be achieved by specific covariance functions, though prior knowledge about this nonstationarity can be difficult to obtain. On the other hand, the homoscedastic assumption is needed to keep GPR inference tractable. In this paper, we present a divisive GPR model which performs nonstationary regression under heteroscedastic noise using the pointwise division of two nonparametric latent functions. As inference on the model is not analytically tractable, we propose a posterior approximation based on expectation propagation (EP), which allows for accurate inference at reduced cost. We also provide a Markov chain Monte Carlo implementation with elliptical slice sampling to assess the quality of the EP approximation. Experiments support the usefulness of the proposed approach.

Index Terms— Elliptical slice sampling, expectation propagation (EP), Gaussian process (GP), heteroscedastic regression, nonstationarity.

I. INTRODUCTION

Gaussian processes (GPs) [1] offer a powerful nonparametric framework for nonlinear regression. As is common in most regression approaches, GP regression regards observations as the sum of some unknown function of the inputs plus Gaussian noise. Unlike traditional neural networks [2] and support vector machines (SVMs) [3], GPs place a prior probability density over the unknown function. Proceeding in a purely Bayesian fashion, the prior over functions and the likelihood implied by the assumed observation model are used to infer a posterior distribution over the sought unknown function given the data. GPs offer a number of advantages with respect to traditional regression machines: they naturally and easily produce probabilistic predictions, that is, average and dispersion values, and they resist overfitting owing to their reduced number of hyperparameters, which can be tuned by a simple continuous optimization of the evidence. These advantages come at a price, namely the O(N^3) cost of performing inference with this model on N data samples.

Manuscript received December 5, 2012; accepted January 12, 2014. This work was supported by the Spanish Government under Project TIN 2011/24533 and Grant TEC 2008/02473. The authors are with the Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés 28911, Spain (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2014.2301951

This has spurred research over the last decade in the area of low-cost sparse GPs [4], [5], making approximate inference on GPs affordable for large-scale data sets.

A frequent assumption in the standard GP model is stationarity: the noise power must remain constant throughout the input space (homoscedasticity), and the covariance between two observations is typically modeled as dependent only on the distance between the corresponding inputs. Though this stationarity assumption does yield useful models, it is often not honored by real-world data sets.

Trying to overcome this limitation, nonparametric heteroscedastic models with input-dependent noise power have been developed. Different approximate inference techniques have been employed, since these models are not analytically tractable. Gibbs sampling is used in [6] to sample from the exact posterior on the mean and the log-noise latent functions. However, the method lacks a search procedure to maximize the evidence with respect to the hyperparameters. In [7], a point estimate of the log-noise is obtained with an iterative algorithm. Nevertheless, the method is not guaranteed to converge and may oscillate. This issue is addressed in [8] by choosing the log-noise so as to maximize a penalized likelihood, with a penalty equivalent up to a constant to the prior on the log-noise. A nonstandard variational approximation and an expectation propagation (EP) [9] posterior approximation are proposed in [10] and [11], respectively. However, the likelihood of the proposed heteroscedastic model (as also in [6]–[8]) is not log-concave, so the posterior is not log-concave either, and a unimodal posterior is not ensured. In contrast, the GP model proposed in [12], based on natural parameters, ensures posterior log-concavity, although inference is again not analytically tractable. A more flexible approach is presented in [13], where observations are modeled as the product of realizations of two independent GPs, and EP is used for approximate inference.

Pursuing nonstationarity, in this paper we introduce the divisive GP (DGP) model, where two GPs are combined to achieve amplitude nonstationarity. The proposed model uses a jointly log-concave likelihood that results in an exactly log-concave posterior when combined with a GP prior. Although retrieving this exact posterior is analytically infeasible, it is much easier to obtain a high-quality Gaussian approximation to a log-concave distribution than to other types of distributions. Critically, when EP is used to compute approximate posteriors, log-concavity plays a crucial role in the convergence of the algorithm [14], having been conjectured (though not proved) to be a sufficient condition.




The remainder of this paper is organized as follows. In Section II, we review basic GP regression theory. In Section III, we describe our DGP model. Markov chain Monte Carlo (MCMC) and EP inference on the DGP model are presented in Sections IV and V, respectively. Experimental results on synthetic and real data sets are shown in Section VI. Finally, in Section VII we present the main conclusions of this paper and some avenues for further research.

II. GAUSSIAN PROCESSES FOR REGRESSION

In a regression task, we consider N input–output pairs D = {x_n ∈ R^D, y_n ∈ R}_{n=1}^N, modeled as some unknown latent function of the inputs plus independent noise

y(x_n) = f(x_n) + ε_n.   (1)

GPR models the underlying noiseless latent function f : R^D → R by placing on it a GP prior with zero mean function and an arbitrary covariance function k(x, x'; θ). Thus, the values of k(·, ·; θ) depend only on the inputs x, x' and on the set of hyperparameters θ. For the noise term ε, a zero-mean Gaussian distribution with variance σ² is assumed. Taking f = [f_1, ..., f_N]^T as the evaluation of f at the inputs X = {x_n}_{n=1}^N, we have a multivariate Gaussian prior p(f|X) = N(f|0, K_N), where [K_N]_{ij} = k(x_i, x_j; θ). The likelihood is also Gaussian, p(y|f) = N(y|f, σ² I_N), where I_N is the N-dimensional identity matrix and y = [y_1, ..., y_N]^T. Integrating out the latent function values, the marginal likelihood is

p(y|X, θ) = N(y | 0, K_N + σ² I_N).   (2)

To select the hyperparameters θ and σ², a (possibly local) maximum of the logarithm of the marginal likelihood in (2), also known as the log-evidence, is sought. The predictive distribution for a new sample x_* is Gaussian, p(y_*) = N(y_* | μ_*, σ_*²), with

μ_* = k_{*N} (K_N + σ² I)^{-1} y   (3)
σ_*² = k_{**} − k_{*N} (K_N + σ² I)^{-1} k_{*N}^T + σ²   (4)

where [k_{*N}]_j = k(x_*, x_j; θ), k_{**} = k(x_*, x_*; θ), μ_* is the predicted value, and σ_*² an estimate of its variance. The predictive distribution depends on the covariance matrix and on the noise power σ². Therefore, the described GPR model is homoscedastic, since it assumes stationary noise.

III. DGP MODEL

To overcome the limitations of standard GPR in nonstationary scenarios, we propose the DGP model. It can be viewed as a possibly noisy stationary latent function f modulated by another stationary latent function g that describes amplitude nonstationarities affecting both the latent function f and the noise associated with it. Thus, the observations under the DGP model are described as

y(x_n) = (f(x_n) + ε_n) / g^+(x_n)   (5)

where f(x) is the noisy stationary latent function, g^+(x) is the modulating function defined as the positive part of some noise-free latent function g(x), that is, g^+(x) = max(g(x), 0), and ε_n is a Gaussian noise term whose division by g^+(x) yields an input-dependent noise that can be modeled as N(0, c/(g^+(x))²), where c is a constant noise power scale factor. The likelihood of the proposed model for f(x) and g(x) given an observation y(x) can be written as

p(y_n | f_n, g_n) = N(y_n | f_n / g_n^+, c / (g_n^+)²)   (6)

where f_n = f(x_n), g_n^+ = g^+(x_n), and y_n = y(x_n). Note that the DGP model includes the standard GPR model as a particular case when the latent function g is constant. Following a Bayesian treatment, we place GP priors on f and g, so that

f(x) ~ GP(0, k_f(x, x') + σ_f² δ_{xx'})
g(x) ~ GP(μ_0, k_g(x, x'))   (7)

where k_f(x, x') and k_g(x, x') are any valid covariance functions, σ_f² is a homoscedastic noise hyperparameter multiplied by the Kronecker delta, and c is a constant that, together with the mean μ_0, modulates the mean power of the heteroscedastic noise. Thus, the model is fully specified and depends only on the covariance function hyperparameters θ_f and θ_g, the noise hyperparameter σ_f², the prior mean μ_0, and the noise term c.

The noise in the DGP model is therefore described by two terms. First, the hyperparameter σ_f² represents a stationary noise term associated with the latent function f. Second, c/(g^+(x))² models noise nonstationarities. With these considerations, it can be observed that the mean noise power can be specified through both σ_f² and c/μ_0². Note that this is equivalent to a model that neglects σ_f², assuming that all the mean noise power is in the term c/μ_0². However, the additional redundancy of the proposed DGP model can favor hyperparameter learning.

The proposed model has some relevant differences with respect to the GP product model (GPPM) presented in [13]. In GPPM, a noise-free stationary latent function f is multiplied by another stationary (modulating) latent function g that describes slow-variation amplitude nonstationarities. Thus, the likelihood of GPPM is p(y_n | f_n, g_n) = N(y_n | f_n e^{g_n}, σ²), where σ² is a hyperparameter that describes homoscedastic noise. This contrasts with the DGP likelihood given in (6), where the function g modulates a noisy function, so the model covers heteroscedastic noise cases and can consequently be applied to nonstationary heteroscedastic problems. GPPM also has an important drawback for approximate inference, as its likelihood is not log-concave, which leads to a multimodal posterior. This caused convergence problems for the EP approximation proposed in [13]. In contrast, the DGP likelihood (6) is log-concave in both f and g. Thus, the posterior is log-concave and unimodal, which suits the EP Gaussian posterior approximation and favors convergence, although convergence is still not guaranteed. Another remarkable limitation of GPPM is that a maximum likelihood level II (ML-II) hyperparameter search cannot be performed; instead, a grid search is proposed to set the hyperparameters.



As the authors point out, the gradients they used for their ML-II implementation are approximate. This may be due to the loss of accuracy in the double Gauss-Hermite approximation needed to calculate the moments of the site functions in the EP algorithm. In the case of the DGP model, the moments of the site functions can be obtained analytically, and an ML-II search over the hyperparameters is possible.

In the following, we will use vectorized forms to refer to the observations and the latent functions evaluated at the training points, so that y ≡ {y_n}_{n=1}^N, f ≡ {f_n}_{n=1}^N, and g ≡ {g_n}_{n=1}^N.
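To make the generative assumptions in (5)–(7) concrete, the following minimal NumPy sketch draws one synthetic data set from the DGP model. The covariance and noise hyperparameters are the values later used in the synthetic experiment of Section VI-A; the input locations, the small jitter added to the prior covariances, and the clipping of g^+ away from zero are our own choices to keep the sampling numerically stable, not part of the model.

```python
import numpy as np

def se_kernel(x, xp, amp2, ls):
    """Squared exponential covariance k(x, x') = amp2 * exp(-(x - x')^2 / (2 ls^2))."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return amp2 * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
N = 150
x = np.sort(rng.uniform(-5.0, 5.0, N))          # illustrative input locations

# Hyperparameters from the synthetic experiment of Section VI-A
amp2_f, ls_f, sigma2_f = 9.0, 0.7, 1e-6          # sigma2_f acts here as a tiny jitter on f (assumption)
amp2_g, ls_g, mu0, c = 5.0, 1.1, 3.0, 4.0

K_f = se_kernel(x, x, amp2_f, ls_f) + sigma2_f * np.eye(N)
K_g = se_kernel(x, x, amp2_g, ls_g) + 1e-8 * np.eye(N)

f = rng.multivariate_normal(np.zeros(N), K_f)    # f ~ GP(0, k_f + sigma_f^2 delta), Eq. (7)
g = rng.multivariate_normal(mu0 * np.ones(N), K_g)
g_plus = np.maximum(g, 1e-6)                     # positive part g^+, clipped away from zero
eps = rng.normal(0.0, np.sqrt(c), N)             # numerator noise, so that y | f, g follows Eq. (6)
y = (f + eps) / g_plus                           # observations, Eq. (5)
```

With μ_0 = 3 and a moderate amplitude for g, the draws of g are positive over most of the input range, so the clipping rarely activates; the resulting data show the amplitude and noise modulation that the DGP model is designed to capture.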

IV. INFERENCE

Given the DGP model in (5) and the likelihood and GP priors defined in (6) and (7), respectively, we can write the posterior over the latent functions f and g at the training points as

p(f, g | y) = p(y | f, g) p(f) p(g) / p(y)   (8)

where p(y) is the marginal likelihood or evidence, which can be expressed as

p(y) = ∫∫ p(y | f, g) p(f) p(g) df dg.   (9)

For the sake of simplicity, we omit the conditioning on the training inputs X and on the set of hyperparameters θ = [θ_f, θ_g, μ_0, c]^T. The likelihood of the DGP model given in (6) is not Gaussian in both latent functions (although it is Gaussian in f), which makes analytical computation of the evidence intractable. To overcome this limitation, we propose to apply the EP algorithm [9] to approximate the posterior distribution over f and g. We also provide an MCMC implementation with elliptical slice sampling [15] to draw samples from the exact posterior, which allows us to assess the quality of the EP approximation.

MCMC techniques allow sampling from posterior distributions, especially in cases where an analytically tractable solution is not available. The main drawback of MCMC methods is their high computational burden, whose running time scales as O(N³), which implies a high cost even for moderate-size data sets. We use elliptical slice sampling (ESS) to draw samples from the posterior in (8). This method is especially suitable for sampling from posteriors with tightly correlated Gaussian priors, as in this case. To sample from the DGP posterior, ESS makes joint updates of f and g. The algorithm has no step-size parameter, unlike the Metropolis-Hastings method introduced in [16]; instead, an elliptical parametrization is used to update the state of the sampler, combined with the slice sampling method presented in [17], which has an adaptive step size.
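The ESS update itself fits in a few lines. The sketch below is a generic single-step implementation of the algorithm in [15] for a latent vector with a zero-mean GP prior; it is an illustration, not the exact code used in our experiments. For the DGP posterior, f and the mean-centered g can be stacked into one vector with a block-diagonal prior Cholesky factor, and log_lik evaluates the log of the likelihood (6) summed over the training points (adding the prior mean of g back before the evaluation).

```python
import numpy as np

def ess_step(current, prior_chol, log_lik, rng):
    """One elliptical slice sampling update for a zero-mean Gaussian prior.

    current    : current latent vector (shape (N,))
    prior_chol : Cholesky factor L of the prior covariance (K = L @ L.T)
    log_lik    : function mapping a latent vector to its log-likelihood
    rng        : numpy random Generator
    """
    nu = prior_chol @ rng.standard_normal(current.shape[0])   # auxiliary draw from the prior
    log_u = log_lik(current) + np.log(rng.uniform())          # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)                     # initial proposal angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta         # initial bracket

    while True:
        proposal = current * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_u:
            return proposal                                   # accepted point on the ellipse
        # shrink the bracket towards theta = 0 and retry
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```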



V. EP-DGP INFERENCE

EP [9] is a method that approximates a posterior p with a simpler distribution q, typically from the exponential family, which allows for analytical tractability. A sensible approach to approximate the posterior is to minimize the Kullback-Leibler (KL) divergence between the true and the approximate posterior, KL[p(f, g|y) || q(f, g|y)], where the KL divergence is a measure of dissimilarity between two probability densities, calculated as

KL[p(x) || q(x)] = ∫ p(x) log(p(x)/q(x)) dx.   (10)

However, direct minimization of this KL divergence is not tractable, and other minimization procedures are used in the context of GPs, such as sequentially updating individual approximations to the likelihood of each sample [1]. For the DGP model, we use an unnormalized bivariate Gaussian distribution to approximate the local likelihood given in (6)

p(y_n | f_n, g_n) ≈ t_n(φ_n | Z̃_n, μ̃_n, Σ̃_n) ≡ Z̃_n N(f_n | μ̃_{f_n}, σ̃²_{f_n}) N(g_n | μ̃_{g_n}, σ̃²_{g_n})   (11)

where φ_n = [f_n, g_n]^T, μ̃_n = [μ̃_{f_n}, μ̃_{g_n}]^T, and Σ̃_n is a diagonal covariance matrix with σ̃²_{f_n} and σ̃²_{g_n} on its main diagonal. The local approximations t_n are known as site functions. The likelihood p(y|f, g) is approximated by the product of the N site functions, which can also be expressed as the product of two multivariate Gaussian distributions, N(f | μ̃_f, Σ̃_f) and N(g | μ̃_g, Σ̃_g), where μ̃_f and μ̃_g are N-length vectors with μ̃_{f_n} and μ̃_{g_n} at the nth position, respectively, and Σ̃_f and Σ̃_g are N×N diagonal matrices with σ̃²_{f_n} and σ̃²_{g_n} at the nth position of the diagonal, respectively. The approximate posterior q(f, g|y) is then given by the product of the approximate likelihood and the priors N(f | 0, K_f) and N(g | μ_0 1, K_g), divided by the approximate evidence q(y), where K_f and K_g are the positive definite covariance matrices of the GP priors on f and g defined in (7). This approximate posterior can be expressed as

q(f, g | y) = N(f | μ_f, Σ_f) N(g | μ_g, Σ_g)
Σ_f = (K_f^{-1} + Σ̃_f^{-1})^{-1},   μ_f = Σ_f Σ̃_f^{-1} μ̃_f
Σ_g = (K_g^{-1} + Σ̃_g^{-1})^{-1},   μ_g = Σ_g (Σ̃_g^{-1} μ̃_g + K_g^{-1} μ_0 1).   (12)

The approximate evidence q(y) given by EP can be written as

q(y) = ∫∫ N(f | 0, K_f) N(g | μ_0 1, K_g) ∏_{n=1}^N t_n df dg.   (13)

To find the optimal parameters of the approximate posterior, it is known that the minimum of KL[p(f, g|y) || q(f, g|y)] is obtained when the first- and second-order moments of both distributions match [1]. However, the calculation of the moments of p(f, g|y) is a difficult task. Instead, EP estimates the moments of the Gaussian posterior approximation by sequentially updating the site functions. The procedure is carried out by defining a cavity distribution [18], the integral of the product of the GP priors and all the site functions except the site t_n whose moments are going to be estimated

q_{\n}(φ_n) ∝ ∫ p(f) p(g) ∏_{j≠n} t_j(φ_j) dφ_j.   (14)

The resulting cavity distribution in our case is a bivariate Gaussian that factorizes as

q_{\n}(φ_n) = q_{f\n}(f_n) q_{g\n}(g_n) = N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n})   (15)

with

σ²_{f\n} = (σ_{f_n}^{-2} − σ̃_{f_n}^{-2})^{-1}
μ_{f\n} = σ²_{f\n} (σ_{f_n}^{-2} μ_{f_n} − σ̃_{f_n}^{-2} μ̃_{f_n})
σ²_{g\n} = (σ_{g_n}^{-2} − σ̃_{g_n}^{-2})^{-1}
μ_{g\n} = σ²_{g\n} (σ_{g_n}^{-2} μ_{g_n} − σ̃_{g_n}^{-2} μ̃_{g_n})   (16)

where σ²_{f_n} = [Σ_f]_{nn}, σ²_{g_n} = [Σ_g]_{nn}, and μ_{f_n} and μ_{g_n} are the nth elements of μ_f and μ_g, respectively. EP then updates the moments of t_n iteratively by minimizing the KL divergence between the product of the cavity distribution and the true (non-Gaussian) local likelihood (6) and the product of the cavity distribution and the local site function t_n, that is, KL(q_{\n}(φ_n) p(y_n | f_n, g_n) || q_{\n}(φ_n) t_n(φ_n)). First- and second-order moment matching is needed to minimize this KL divergence. However, since t_n is an unnormalized Gaussian and Z̃_n is necessary to compute the approximate evidence (13), we also need to include zeroth-order moment matching of both distributions.

A. Moments Calculation

With these considerations, the next step involves the calculation of the zeroth-, first-, and second-order moments of the left distribution in the KL divergence. The details of the calculations are described in Appendix A. The zeroth-order moment can be calculated as

Ẑ_n = ∫ q_{\n}(φ_n) p(y_n | f_n, g_n) df_n dg_n = Z̆ μ̆_{gt} Φ(μ̆_g / σ̆_g)   (17)

where

Z̆ = N(μ_{f\n} | μ_{g\n} y_n, c + σ²_{f\n} + σ²_{g\n} y_n²)   (18)

μ̆_{gt} is the mean of the Gaussian N(g_n | μ̆_g, σ̆_g²) truncated to the positive values of g_n, with

σ̆_g² = (σ_{g\n}^{-2} + y_n² / (c + σ²_{f\n}))^{-1}
μ̆_g = σ̆_g² (μ_{g\n} / σ²_{g\n} + μ_{f\n} y_n / (c + σ²_{f\n}))   (19)

and Φ(x) is the standard normal cumulative distribution function.

The first-order moment μ̂_n = [μ̂_{f_n}, μ̂_{g_n}]^T can be calculated separately for each component. The expression for μ̂_{g_n} can be written as

μ̂_{g_n} = (1/Ẑ_n) ∫ g_n q_{\n}(φ_n) p(y_n | f_n, g_n) df_n dg_n = (Z̆ / Ẑ_n) m_{2t} Φ(μ̆_g / σ̆_g)   (20)

where m_{2t} is the (noncentered) second-order moment of the Gaussian truncated to the positive values of g_n with the mean and variance given in (19). The expression for μ̂_{f_n} is given by

μ̂_{f_n} = (1/Ẑ_n) ∫ f_n q_{\n}(φ_n) p(y_n | f_n, g_n) df_n dg_n = σ̆_f² (μ_{f\n} / σ²_{f\n} + y_n μ̂_{g_n} / c)   (21)

with σ̆_f² = (c^{-1} + σ_{f\n}^{-2})^{-1}.

To calculate the covariance matrix Σ̂_n, we only consider the terms of the main diagonal, σ̂²_{f_n} and σ̂²_{g_n}, neglecting the cross-covariance terms outside the diagonal. To obtain σ̂²_{f_n} and σ̂²_{g_n}, we first calculate the noncentered second-order moments m̂_{2f_n} and m̂_{2g_n}. For m̂_{2g_n} we have

m̂_{2g_n} = (1/Ẑ_n) ∫ g_n² q_{\n}(φ_n) p(y_n | f_n, g_n) df_n dg_n = (Z̆ / Ẑ_n) m_{3t} Φ(μ̆_g / σ̆_g)   (22)

where m_{3t} is the noncentered third-order moment of the Gaussian distribution with the parameters given in (19) truncated to the positive values of g_n. The second-order moment of f_n can be expressed as

m̂_{2f_n} = (1/Ẑ_n) ∫ f_n² q_{\n}(φ_n) p(y_n | f_n, g_n) df_n dg_n
         = σ̆_f² + m̂_{2g_n} σ̆_f⁴ y_n² / c² + (σ̆_f² μ_{f\n})² / σ⁴_{f\n} + 2 σ̆_f⁴ y_n μ_{f\n} μ̂_{g_n} / (c σ²_{f\n}).   (23)

Then, the expressions for σ̂²_{f_n} and σ̂²_{g_n} are given by

σ̂²_{f_n} = m̂_{2f_n} − μ̂²_{f_n};   σ̂²_{g_n} = m̂_{2g_n} − μ̂²_{g_n}.   (24)

Finally, with the moments calculated, we reestimate the parameters of the local likelihood approximation

σ̃²_{f_n} = (σ̂_{f_n}^{-2} − σ_{f\n}^{-2})^{-1}
μ̃_{f_n} = σ̃²_{f_n} (σ̂_{f_n}^{-2} μ̂_{f_n} − σ_{f\n}^{-2} μ_{f\n})
σ̃²_{g_n} = (σ̂_{g_n}^{-2} − σ_{g\n}^{-2})^{-1}
μ̃_{g_n} = σ̃²_{g_n} (σ̂_{g_n}^{-2} μ̂_{g_n} − σ_{g\n}^{-2} μ_{g\n})
Z̃_n = Ẑ_n √(2π(σ²_{f\n} + σ̃²_{f_n})) exp((μ_{f\n} − μ̃_{f_n})² / (2(σ²_{f\n} + σ̃²_{f_n})))
      × √(2π(σ²_{g\n} + σ̃²_{g_n})) exp((μ_{g\n} − μ̃_{g_n})² / (2(σ²_{g\n} + σ̃²_{g_n}))).   (25)
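To make the site-update cycle easier to follow, here is a schematic NumPy sketch of one EP sweep for the DGP model. The helper tilted_moments is a hypothetical callable assumed to return (Ẑ_n, μ̂_{f_n}, σ̂²_{f_n}, μ̂_{g_n}, σ̂²_{g_n}) from (17)–(24); recomputing the full posterior (12) before every site (instead of rank-one updates), the absence of damping, and the lack of safeguards against negative site variances are simplifications of our own, not the implementation used in the experiments.

```python
import numpy as np

def ep_sweep(K_f, K_g, mu0, y, site_f, site_g, tilted_moments):
    """One schematic EP pass over all sites for the DGP model.

    site_f, site_g : dicts with arrays 'mu' and 'var' holding the site parameters
    tilted_moments : callable implementing Eqs. (17)-(24) for a single site
    """
    N = y.shape[0]
    for n in range(N):
        # Posterior from the current site parameters, Eq. (12)
        Sigma_f = np.linalg.inv(np.linalg.inv(K_f) + np.diag(1.0 / site_f['var']))
        Sigma_g = np.linalg.inv(np.linalg.inv(K_g) + np.diag(1.0 / site_g['var']))
        mu_f = Sigma_f @ (site_f['mu'] / site_f['var'])
        mu_g = Sigma_g @ (site_g['mu'] / site_g['var'] + np.linalg.solve(K_g, mu0 * np.ones(N)))

        # Cavity parameters, Eq. (16)
        v_fc = 1.0 / (1.0 / Sigma_f[n, n] - 1.0 / site_f['var'][n])
        m_fc = v_fc * (mu_f[n] / Sigma_f[n, n] - site_f['mu'][n] / site_f['var'][n])
        v_gc = 1.0 / (1.0 / Sigma_g[n, n] - 1.0 / site_g['var'][n])
        m_gc = v_gc * (mu_g[n] / Sigma_g[n, n] - site_g['mu'][n] / site_g['var'][n])

        # Moments of the tilted distribution, Eqs. (17)-(24)
        _, mf_hat, vf_hat, mg_hat, vg_hat = tilted_moments(y[n], m_fc, v_fc, m_gc, v_gc)

        # Site updates, Eq. (25)
        site_f['var'][n] = 1.0 / (1.0 / vf_hat - 1.0 / v_fc)
        site_f['mu'][n] = site_f['var'][n] * (mf_hat / vf_hat - m_fc / v_fc)
        site_g['var'][n] = 1.0 / (1.0 / vg_hat - 1.0 / v_gc)
        site_g['mu'][n] = site_g['var'][n] * (mg_hat / vg_hat - m_gc / v_gc)
    return site_f, site_g
```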


B. Marginal Likelihood

The maximum likelihood II (ML-II) scheme is a common and computationally attractive procedure to find maximum likelihood estimates of the hyperparameters by maximizing the logarithm of the marginal likelihood. In the case of EP, the evidence is approximated using the site functions instead of the true likelihood, as shown in (13). The expression for the log-evidence, needed to optimize the hyperparameters, is given by

log q(y) = −(1/2) log|K_g + Σ̃_g| − (1/2) log|K_f + Σ̃_f|
           − (1/2) (μ_0 1 − μ̃_g)^T (K_g + Σ̃_g)^{-1} (μ_0 1 − μ̃_g)
           − (1/2) y^T (K_f + Σ̃_f)^{-1} y + Σ_{n=1}^N log Ẑ_n
           + (1/2) Σ_{n=1}^N log(σ²_{g\n} + σ̃²_{g_n}) + Σ_{n=1}^N (μ_{g\n} − μ̃_{g_n})² / (2(σ²_{g\n} + σ̃²_{g_n}))
           + (1/2) Σ_{n=1}^N log(σ²_{f\n} + σ̃²_{f_n}) + Σ_{n=1}^N (μ_{f\n} − μ̃_{f_n})² / (2(σ²_{f\n} + σ̃²_{f_n})).   (26)

To calculate the derivatives of (26) with respect to the hyperparameters θ we follow a treatment similar to [1] for the case of EP for GP classification. The details of these calculations are described in Appendix B.
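In practice, the ML-II search can be handed to any gradient-based optimizer once (26) and the gradients of Appendix B are available. The sketch below assumes a user-supplied function neg_log_evidence_and_grad (a hypothetical name) that runs EP to convergence for a given hyperparameter vector and returns −log q(y) together with its gradient; optimizing the logarithms of the positive hyperparameters keeps the search unconstrained.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(theta0, neg_log_evidence_and_grad):
    """ML-II search: maximize the EP log-evidence (26) w.r.t. the hyperparameters.

    theta0 : initial hyperparameter vector (logs of lengthscales, amplitudes, noise, plus mu0)
    neg_log_evidence_and_grad : callable returning (-log q(y), gradient) for a given theta
    """
    result = minimize(neg_log_evidence_and_grad, theta0, jac=True, method='L-BFGS-B')
    return result.x, -result.fun   # optimized hyperparameters and the achieved log-evidence
```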

C. Predictive Distribution

The predictive distribution for a new test output y_* given an input x_* can be expressed as p(y_* | x_*) (omitting the conditioning on the training data). As it is not possible to calculate the exact expression, since we do not have the exact posterior, an approximation q(y_* | x_*) of this predictive distribution is calculated using the approximate posterior q(f, g | y). The first step involves the calculation of the predictive distributions for f_* and g_*. The approximate predictive distribution for f_* has the same standard expression as the one for g_*. Thus, the approximate predictive distribution q(g_*) can be written as

q(g_*) = ∫ p(g_* | x_*, g) q(f, g | y) df dg = N(g_* | μ_{g*}, σ²_{g*})   (27)

where

μ_{g*} = k_{*g}^T (K_g + Σ̃_g)^{-1} μ̃_g
σ²_{g*} = k_{g**} − k_{*g}^T (K_g + Σ̃_g)^{-1} k_{*g}   (28)

with [k_{*g}]_j = k_g(x_*, x_j) and k_{g**} = k_g(x_*, x_*). Then, the approximate predictive distribution for y_* can be calculated analytically as

q(y_*) = ∫ p(y_* | f_*, g_*) q(f_*) q(g_*) df_* dg_* = Z_*(y_*) μ̆_{g*t} Φ(μ̆_{g*} / σ̆_{g*})   (29)

where

Z_*(y_*) = N(μ_{f*} | μ_{g*} y_*, c + σ²_{f*} + σ²_{g*} y_*²)   (30)

and μ̆_{g*t} is the mean of the Gaussian N(g_* | μ̆_{g*}, σ̆²_{g*}) truncated to the positive values of g_*, with

σ̆²_{g*} = (σ_{g*}^{-2} + y_*² / (c + σ²_{f*}))^{-1}
μ̆_{g*} = σ̆²_{g*} (μ_{g*} / σ²_{g*} + y_* μ_{f*} / (c + σ²_{f*})).   (31)

The details of the calculation of this predictive distribution are included in Appendix C.

There is no analytical solution for the mean of q(y_*); in general, the mean may not even exist, as in the case of Cauchy distributions. To sidestep this problem, we use the median as an estimator for the targets, which minimizes the mean absolute error. To do so, we use the expression of the cumulative distribution function of y_*, F_{y*}(α), and calculate the value α that makes the cumulative distribution equal to 0.5. For the given q(y_*), the solution follows a treatment similar to that of the cumulative distribution of the ratio of two correlated normal random variables described in [19], considering that we do not assume correlation between f and g and that we only take the positive part of g. With these considerations, the cumulative distribution function is given by

F_{y*}(α) = L((μ_{f*} − μ_{g*} α)/a(α), −μ_{g*}/σ_{g*}; σ_{g*} α / a(α)) / Φ(μ_{g*}/σ_{g*})   (32)

with

a(α) = sqrt(σ²_{g*} α² + c + σ²_{f*})   (33)

where L(h, k; γ) is the standard bivariate normal integral

L(h, k; γ) = (1 / (2π √(1 − γ²))) ∫_h^∞ ∫_k^∞ exp(−(x² − 2γxy + y²) / (2(1 − γ²))) dx dy.   (34)

If μ_{g*}/σ_{g*} → ∞, the cumulative distribution F_{y*}(α) can be approximated by

F_{y*}(α) → Φ((μ_{g*} α − μ_{f*}) / a(α)).   (35)

To obtain the predictive median m_{y*} we need the inverse cumulative distribution, that is, m_{y*} = F_{y*}^{-1}(1/2). Although the calculation of F_{y*}^{-1}(α) is not analytically tractable, we can approximate the solution using numerical root-finding methods, such as the bisection or secant algorithms. Note that it is also possible to perform quantile estimation following the same treatment as for the median.
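As a minimal sketch of the median computation, the snippet below inverts the predictive CDF by bisection under the large-μ_{g*}/σ_{g*} approximation (35); using the full expression (32) instead only requires swapping in a routine for the bivariate normal integral (e.g., scipy.stats.multivariate_normal.cdf). The bracket-expansion constants and iteration caps are arbitrary choices of ours.

```python
import numpy as np
from scipy.stats import norm

def predictive_median(mu_f, mu_g, sigma2_f, sigma2_g, c, tol=1e-8):
    """Median of q(y*) via bisection on the approximate CDF (35).

    mu_f, sigma2_f : predictive mean and variance of f_*
    mu_g, sigma2_g : predictive mean and variance of g_*
    """
    def cdf(alpha):
        a = np.sqrt(sigma2_g * alpha ** 2 + c + sigma2_f)   # Eq. (33)
        return norm.cdf((mu_g * alpha - mu_f) / a)          # Eq. (35)

    # Bracket the median: expand until the CDF crosses 0.5 on both sides
    lo, hi = -1.0, 1.0
    for _ in range(100):
        if cdf(lo) <= 0.5:
            break
        lo *= 2.0
    for _ in range(100):
        if cdf(hi) >= 0.5:
            break
        hi *= 2.0

    # Bisection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The same routine evaluated at any other target probability gives the quantile estimates used for the predictive intervals in Fig. 1.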



VI. EXPERIMENTS

A. Synthetic Experiment

To assess the quality of the EP approximation compared with the gold standard MCMC, we have created a synthetic unidimensional data set with 150 samples according to the DGP model, selecting the GP prior covariance functions for f and g to be squared exponentials,

k_SE(x, x') = σ_0² exp(−‖x − x'‖² / (2ℓ²)),

with parameters ℓ_f = 0.7 and σ²_{0f} = 9 for the GP prior on f, and ℓ_g = 1.1 and σ²_{0g} = 5 for g. The prior mean of g is set to μ_0 = 3 and c = 4. To make inference with EP-DGP and MCMC-DGP, we set the hyperparameters to the known true values, avoiding the effect of hyperparameter learning, so that we can evaluate the quality of the EP-DGP predictions with respect to the asymptotically unbiased MCMC estimates. To compare with a standard GP we cannot use the true DGP hyperparameters, since the model is different; instead, we search for a proper set of hyperparameters using an ML-II implementation. To illustrate the high computational cost of MCMC-DGP with respect to the EP-DGP implementation,¹ with these 150 samples MCMC-DGP takes 90 s whereas EP-DGP takes only 2 s.²

Fig. 1. Comparison of posterior distributions over y(x) for EP-DGP (dashed line), MCMC-DGP (continuous line), and standard GP (dash-dotted line). Median estimation is provided for EP-DGP and MCMC-DGP along with the 90% of mass probability bounded by quantiles 0.05 and 0.95. The estimated mean and twice the standard deviation are given for the standard GP prediction.

Fig. 1 shows the results of this synthetic experiment. It can be appreciated that the EP-DGP posterior on y(x) is very close to the exact solution provided by MCMC-DGP. Though not shown in Fig. 1, the posteriors over the latent functions f and g are also very similar for the two compared methods, although differences can appear far from the observed data. The mean prediction of the standard GP is similar to that of the other methods. However, its uncertainty prediction is clearly worse than those of the DGP algorithms, evidencing the limitations of the standard GP when dealing with heteroscedastic noise.

¹The MATLAB implementation of EP-DGP and MCMC-DGP used in the experiments can be found at http://www.tsc.uc3m.es/%7Elmunoz/DGP.zip
²All the experiments have been conducted on a 4 GB computer with an Intel Core i5 processor at 3.33 GHz using a MATLAB implementation.

B. Regression Performance

In this section, we present experimental results to evaluate the performance of the DGP methods on several synthetic and real data sets. We compare the results with the standard GP, the state-of-the-art variational heteroscedastic GPR (VHGPR) method [10], the maximum-a-posteriori heteroscedastic GP (MAPHGP) described in [8], standard support vector regression (SVR) [3], and the heteroscedastic neural network model (H-NN) presented in [20]. As performance measures we use the normalized mean squared error (NMSE)

NMSE = Σ_{j=1}^{n_*} (y_{*j} − ŷ_{*j})² / Σ_{j=1}^{n_*} (y_{*j} − ȳ)²,   (36)

the normalized mean absolute error (NMAE)

NMAE = Σ_{j=1}^{n_*} |y_{*j} − ŷ_{*j}| / Σ_{j=1}^{n_*} |y_{*j} − ȳ|,   (37)

and the negative log-predictive density (NLPD)

NLPD = −(1/n_*) Σ_{j=1}^{n_*} log p(y_{*j} | D)   (38)

where n_* is the number of test samples, y_{*j} is the jth test observation, ŷ_{*j} is the mean of the posterior for that observation, and ȳ is the mean of the training observations. Note that the NLPD can take positive and negative values and that smaller values indicate better performance.

For VHGPR, we use an SE covariance function for the mean latent function and an SE covariance function plus noise for the log-noise latent function. To initialize the hyperparameters, we previously train a standard GP following the procedure described in [10]. The same covariance functions are applied in the case of MAPHGP, initializing the corresponding hyperparameters as detailed in [8]. For the DGP methods, we use an SE covariance function for g(x) and an SE plus noise covariance function for f(x). Assuming that f(x) is a noisy function allows a tradeoff to be established between the heteroscedastic noise component modeled by c/(g^+(x))² and the homoscedastic noise in f(x).

To initialize the EP-DGP hyperparameters, we first train a standard GP with an SE plus noise covariance function. Then, the lengthscale hyperparameters of f and g are set to the lengthscale found by the standard GP, that is, ℓ_f^EPDGP = ℓ^GP and ℓ_g^EPDGP = ℓ^GP. For the amplitude and noise hyperparameters, we first set c = 4 as a constant: as described in Section III, there is a degree of redundancy in the DGP model, so we do not need to learn c. With this consideration, we set μ_0 = 2/√(σ_n^GP), so that c/μ_0² = σ_n^GP. For the power hyperparameter of g's covariance function we take σ_{0g}^EPDGP = μ_0/√10. Thus, we start from a quasi-homoscedastic initialization, so that nonstationarities have to be learned during the ML-II search. Finally, the power hyperparameter of f's covariance function is set to σ_{0f}^EPDGP = μ_0 σ_0^GP / c, and the homoscedastic noise power hyperparameter to σ_{nf}^EPDGP = σ_{0f}^EPDGP / 2.

We have not implemented any search procedure for the MCMC-DGP hyperparameters.
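For reference, the three figures of merit in (36)–(38) can be computed directly from the predictive means and densities; the snippet below is a straightforward NumPy transcription. The Gaussian NLPD branch shown is only appropriate for methods whose predictive density is Gaussian with mean mu and variance var (such as the standard GP, or the SVR with the noise estimate described below); for the DGP methods the predictive density (29) is evaluated instead.

```python
import numpy as np

def nmse(y_true, y_pred, y_train_mean):
    """Normalized mean squared error, Eq. (36)."""
    return np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_train_mean) ** 2)

def nmae(y_true, y_pred, y_train_mean):
    """Normalized mean absolute error, Eq. (37)."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_train_mean))

def nlpd_gaussian(y_true, mu, var):
    """Negative log-predictive density, Eq. (38), for a Gaussian predictive density."""
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y_true - mu) ** 2 / var)
```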



TABLE I. EXPERIMENTAL TEST RESULTS ON UNIDIMENSIONAL DATA SETS. AVERAGE NMSE, NMAE, AND NLPD PLUS/MINUS ONE STANDARD DEVIATION. STATISTICALLY SIGNIFICANT DIFFERENCES ARE MARKED AS • FOR STANDARD GP, ◦ FOR VHGPR,  FOR MCMC-DGP,  FOR EP-DGP,  FOR SVR, AND † FOR H-NN. SIGNIFICANCE IS MEASURED ACCORDING TO A KRUSKAL-WALLIS TEST AT THE 5% LEVEL

To evaluate the quality of the EP-DGP approximation, we set the MCMC-DGP hyperparameters equal to those obtained from the trained EP-DGP.

For the SVR, we use a radial basis function (RBF) kernel. The kernel width σ and the margin parameter C are set to the values that minimize the averaged test NMSE over all the splits created for each data set. Notice that this induces a clear advantage for the SVR designs. For a fairer comparison, a cross validation for each split in each data set could be performed, for example; however, due to the high number of splits and the number of data sets used in the experiments, the computational burden of this methodology would be very high. The values of C are explored in the range [2 × 10⁻⁵, 2 × 10¹⁰], with values of the form 2 × 10^p and −5 ≤ p ≤ 10. For σ we also explore values of that form, in the range [2 × 10⁻⁵, 2 × 10⁵]. To calculate the NLPD for the SVR predictions, we assume a Gaussian predictive distribution whose noise power is set to the mean squared error of the training predictions.

For the H-NN we use two multilayer perceptrons (MLPs) to estimate the mean of the targets and the log-noise, respectively, using the same number of hidden neurons (M) for both MLPs, as proposed in [20]. We train the two MLPs using the backpropagation algorithm with the same methodology applied in [20]. In the experiments, we have selected the value of M that minimizes the averaged test NLPD over all the splits created for each data set (with 20 repetitions for each split). As in the case of the SVR, this induces an experimental advantage for the H-NN designs with respect to the GP methods.

1) Unidimensional Data Sets: We first consider five unidimensional data sets that have been widely used in the heteroscedastic regression literature; see, for example, [7], [8], [10], and [11]. Three of the data sets, referred to in the experiments as Nix [20], Gol [6], and Wah [21], are synthetic problems with heteroscedastic noise. The Caw data set [22] represents the sign function with additive stationary Gaussian noise; although the problem is homoscedastic, the steep change at x = 0 can be better modeled with a locally higher noise power. Finally, we include the real nonstationary data set Mot [23].

For the synthetic problems, we have generated 300 independent runs with 100 samples drawn from the distributions described previously, that is, different samples for each run. Then, a random split is performed, using 80% of the samples for training and 20% for test. For data set Mot, we have created 300 runs by randomly splitting its 133 samples into 80% for training and 20% for testing.

Table I shows the average performance in terms of NMSE, NMAE, and NLPD for the standard GP, VHGPR, EP-DGP, MCMC-DGP, SVR, and H-NN. Statistically significant differences, measured with a Kruskal-Wallis test at the 5% significance level, are also shown in the table. The performance in terms of NMSE and NMAE is similar for all methods except the H-NN, which performs worse than the rest of the algorithms on all data sets. However, we can appreciate a better performance of the SVR



compared with the GP methods on the Wah and Mot data sets. In Nix, the standard GP clearly performs worse than the other algorithms (except H-NN). It can also be observed that, for this problem, VHGPR and SVR perform better than the DGP methods, although the difference is not statistically significant. For data set Caw, VHGPR and H-NN perform noticeably worse than the rest of the methods in terms of NMSE. It can also be observed that EP-DGP and MCMC-DGP achieve significantly better results than VHGPR and SVR in terms of NMAE. For data set Mot, there is a certain advantage for the SVR and the standard GPR in the NMSE measure. However, in terms of NMAE, the DGP methods perform better than VHGPR and the standard GPR, though the SVR slightly outperforms EP-DGP and MCMC-DGP.

Comparing the performance in terms of NLPD (we recall that negative values are possible and that smaller values indicate better performance), there is a very significant difference between the proposed methods and the standard GPR and SVR on all data sets, even on Caw (where the problem is not really heteroscedastic, as described before). This shows the limitations of the standard GPR in estimating the predictive density when the noise is nonstationary, and it also means that the SVR cannot directly give good estimates of the noise power. The DGP methods outperform VHGPR on three of the five data sets, with statistically significant differences in Wah, for both MCMC-DGP and EP-DGP, and in Caw for EP-DGP. However, in problem Gol, the improvements of the DGP methods are not statistically significant. In contrast, VHGPR performs better than both DGP algorithms for data sets Nix and Mot. Although the H-NN method outperforms GPR and SVR in terms of NLPD, its performance with respect to the other heteroscedastic GP methods is clearly worse.

It is also important to note that the results of EP-DGP and MCMC-DGP are very similar for most of the data sets. Although there are some differences in terms of NLPD for Nix, Caw, and Mot, these are not statistically significant, and the results in terms of NMSE and NMAE are very similar. These results validate the quality of the EP posterior approximation compared with the exact posterior obtained with the MCMC implementation, which uses the same set of hyperparameters as EP-DGP.

With the code implementation of MAPHGP kindly provided by the authors of [8], we were not able to obtain good experimental results for the unidimensional data sets used before, since MAPHGP did not converge or performed very poorly for some of the splits in the five unidimensional data sets. Under these conditions, we have performed a similar experiment but selecting, for each data set, 300 splits where MAPHGP converged and had a competitive performance. We then compared with EP-DGP on those selected splits, giving an experimental advantage to MAPHGP with respect to EP-DGP. The results of this experiment are shown in Table II.

TABLE II. EXPERIMENTAL TEST RESULTS ON UNIDIMENSIONAL DATA SETS FOR MAPHGP COMPARED WITH EP-DGP. AVERAGE NMSE, NMAE, AND NLPD PLUS/MINUS ONE STANDARD DEVIATION. STATISTICALLY SIGNIFICANT DIFFERENCES ARE MARKED AS • FOR MAPHGP AND FOR EP-DGP. SIGNIFICANCE IS MEASURED ACCORDING TO A KRUSKAL-WALLIS TEST AT THE 5% LEVEL

We can appreciate that the performance in terms of NMSE and NMAE is similar for both methods, although EP-DGP outperforms MAPHGP on Caw with statistical significance. However, in terms of NLPD, EP-DGP performs better than MAPHGP on four of the five data sets (on three of them with a statistically significant difference), whereas MAPHGP only outperforms EP-DGP on Mot.

To analyze the degradation of the performance and the effect of hyperparameter learning in EP-DGP when reducing the number of training samples, we have conducted an experiment with data set Wah. With 200 samples, we have generated random splits varying the percentage of training samples from 20% to 90% in steps of 10%, with 10 independent runs at each step. Fig. 2 shows the results of this experiment, comparing the performance of the standard GP, VHGPR, and EP-DGP in terms of NMSE and NLPD. It can be appreciated that the degradation of the performance in terms of NMSE is quite similar for the three methods, although EP-DGP and VHGPR have more hyperparameters to learn than the standard GP, because they have two latent functions with their corresponding covariance functions for the two GP priors. In terms of NLPD, there is a noticeable difference between the standard GP and both heteroscedastic methods, although the degradation of the standard GP seems smoother than that of the other algorithms. It can also be observed that EP-DGP slightly outperforms VHGPR in all cases, corroborating the results shown in Table I for 80 training samples.

We have performed an additional experiment with data set Wah to measure the training time of the standard GP, VHGPR, EP-DGP, and MCMC-DGP. First, we learn the set of hyperparameters for all the methods using 300 training samples. Then, we measure the training time (excluding hyperparameter learning), varying the number of training samples from 50 to 300 in steps of 50 samples. For a given number of training samples, we generate 20 random training sets. The average training time with the corresponding error bars for the different sizes of the training set is shown in Fig. 3.



TABLE III. CHARACTERISTICS OF THE MULTIDIMENSIONAL DATA SETS USED FOR THE EXPERIMENTS

Fig. 2. Experiment on Wahba problem with 200 samples varying the percentage of training data. (a) Average NMSE as a function of the percentage of training data for EP-DGP (continuous line and wide error bars), standard GP (dashed-dotted line and medium-size error bars) and VHGPR (dashed line and narrow error bars) along with the 2σ error bars. (b) Average NLPD as a function of the percentage of training data for EP-DGP (continuous line and wide error bars), standard GP (dashed-dotted line and medium-size error bars), and VHGPR (dashed line and narrow error bars) along with the 2σ error bars.

Fig. 3. Results of the training time experiment using Wahba problem with different training data sizes. Error-bars are shown to represent the training time for the standard GP (continuous line and wider error bars), VHGPR (dashed line), EP-DGP (dashed-dotted line), and MCMC-DGP (dotted line).

A very high computational cost of MCMC-DGP can be observed compared with the other methods. Although EP-DGP is slower than VHGPR and the standard GP, its computational burden is affordable. It can also be noticed that the slope of the logarithm of the training time as a function of the logarithm of the training set size is similar for all methods.

This supports that the complexity of all these methods scales in the same way, namely O(N³).

2) Multidimensional Data Sets: The second part of the experiments shows the performance of the DGP methods on multivariate data sets. We have selected the following seven real problems, whose stationary or nonstationary nature is a priori unknown: prostate cancer (Can) [24], ozone (Ozo) [25], diabetes (Dia) [26], Boston housing (Hou) [27], concrete compressive strength (Con) [28], abalone (Aba) [29], and Parkinson's telemonitoring (Par) [30]. Table III shows the number of features of these seven multivariate data sets along with the number of training and test samples. We have made 300 random splits with 80% training samples and 20% test samples for Can and Ozo. For data sets Dia, Hou, and Con, we have also made 300 random splits, with 50% training and 50% test samples. We have selected a larger percentage of training samples for Can and Ozo because those data sets are considerably smaller and hyperparameter learning problems may arise for all the GP-based methods used in the experiments. For the larger data sets, Aba and Par, we made only one split.

The results of the experiments for the data sets Can, Ozo, Dia, Hou, and Con are presented in Table IV. For MCMC-DGP, we only show results for Can and Ozo, as the computational burden for the rest of the data sets is very high. We have also performed a Kruskal-Wallis test to assess statistical significance at the 5% level. For these multidimensional data sets, we were not able to compare with MAPHGP using the code implementation provided in [8]. It can be appreciated that the results of H-NN are clearly worse than those of the rest of the methods used in the experiments on all data sets. This shows that the H-NN model proposed in [20] is not suitable for multidimensional data sets. In terms of NMSE, it can be observed that the SVR is worse than the rest of the GP-based methods with statistical significance on four of the five data sets. The DGP methods outperform VHGPR on three data sets and the standard GP on two with statistical significance. It is important to note that on none of the data sets do the DGP algorithms perform worse than the other GP-based methods. In a similar way, in terms of NMAE, the DGP algorithms perform significantly better than the standard GP and VHGPR on Ozo, Hou, and Con. For Can and Dia, the results are almost equivalent for the three methods. The SVR, again, is worse than the rest of the compared methods on four of the five data sets.



TABLE IV. EXPERIMENTAL TEST RESULTS ON MULTIDIMENSIONAL DATA SETS. AVERAGE NMSE, NMAE, AND NLPD PLUS/MINUS ONE STANDARD DEVIATION. STATISTICALLY SIGNIFICANT DIFFERENCES ARE MARKED AS • FOR STANDARD GP, ◦ FOR VHGPR,  FOR MCMC-DGP,  FOR EP-DGP,  FOR SVR, AND † FOR H-NN. SIGNIFICANCE IS MEASURED ACCORDING TO A KRUSKAL-WALLIS TEST AT THE 5% LEVEL

In terms of NLPD, it can be observed that the DGP methods perform better than VHGPR, the SVR, and the standard GP with statistical significance on all data sets, except for the SVR on Can, where, although EP-DGP and MCMC-DGP obtain better results, the differences are not statistically significant. In contrast, VHGPR only improves on the standard GP results on Ozo, Hou, and Con. It is also noticeable that the GP-based methods outperform the SVR on four of the five data sets with statistical significance. On Can, though the SVR performs slightly better than the standard GP and VHGPR, the DGP methods obtain better results than the SVR. It can also be appreciated that the EP-DGP and MCMC-DGP results for Can and Ozo are very similar with respect to NMSE, NMAE, and NLPD. As in the case of the experiments with unidimensional data sets, these results support the quality of the EP approximation compared with the exact posterior given by MCMC-DGP.

From the results presented in Table IV, we can conclude that, although the heteroscedastic nature of the selected problems is a priori unknown, VHGPR and the DGP methods always perform at least as well as the standard GP, even for problems that seem to be more homoscedastic, such as data sets Can and Dia. These results support the advantages of the evaluated heteroscedastic methods and show the limitations of the standard GPR when the problem is not stationary.

TABLE V. EXPERIMENTAL TEST RESULTS ON MULTIDIMENSIONAL DATA SETS ABA AND PAR. NMSE, NMAE, AND NLPD ARE PROVIDED FOR THE STANDARD GP, VHGPR, EP-DGP, AND SVR

The experimental results for the data sets Aba and Par are shown in Table V. For these data sets, we have used a squared exponential covariance function with automatic relevance determination (ARD), defined as

k_ARD(x, x') = σ_0² exp(−Σ_{i=1}^D (x^(i) − x'^(i))² / (2ℓ_i²)),

where x^(i) is the ith component of the input sample x. This covariance function is more suitable for data sets with a significant number of training samples compared with the number of features. It can be observed that EP-DGP outperforms the rest of the compared methods in terms of NLPD. Although VHGPR and the standard GP achieve a better NMSE than EP-DGP, the NMAE is quite similar for the three methods. The SVR does not present competitive advantages in any case.



All the models discussed in this paper fall within the same complexity category: they all require O(N³) computation time (which arises from the inversion of N × N matrices or, alternatively, from the numerically more stable Cholesky decomposition of such matrices) and O(N²) space (which is required to store such matrices). However, their speeds differ by a constant factor that is hidden by the O(·) notation. The difference in this constant factor is what makes MCMC unsuitable for medium-sized data sets, while the alternative approximations remain suitable for data sets with a few thousand data points. An empirical, quantitative measurement of this speedup factor for the different algorithms is represented in Fig. 3.
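As an illustration of the Cholesky-based alternative mentioned above, computations of the form (K + Σ̃)⁻¹v never need an explicit matrix inverse. The sketch below (with generic variable names of our choosing) shows the shared O(N³) operation; only its constant factor differs between the compared GP-based methods.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_psd(K, s_tilde_diag, v):
    """Compute (K + diag(s_tilde_diag))^{-1} v via a Cholesky factorization.

    K            : N x N covariance matrix
    s_tilde_diag : diagonal of the (positive) site covariance matrix
    v            : right-hand side vector
    """
    A = K + np.diag(s_tilde_diag)
    c, low = cho_factor(A, lower=True)   # O(N^3), numerically stable
    return cho_solve((c, low), v)        # O(N^2) per right-hand side
```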

VII. CONCLUSION

We have introduced a DGP model to carry out nonstationary regression, including heteroscedastic noise cases. The log-concavity of the likelihood of the proposed model leads to a unimodal posterior that favors the EP Gaussian approximation, in contrast to other nonstationary and heteroscedastic models. The MCMC implementation provides an accurate exact posterior for the DGP, although its computational burden is very high, whereas the EP posterior approximation reduces the computational cost while offering a similar performance for regression tasks. The experimental results with unidimensional and multidimensional data sets show the improvements of the proposed DGP methods compared with the standard GPR, VHGPR, and SVR.

Currently, we are investigating the use of alternative techniques to EP and MCMC, such as the Laplace approximation and variational Bayes, to reduce the computational cost of this DGP model. Other research lines are to analyze the effects of dependence between the latent functions and how to model scale factor nonstationarity.

APPENDIX A
DETAILS OF EP-DGP MOMENT CALCULATIONS

To simplify the calculations of the EP-DGP moments, we can also express the likelihood given in (6) as

p(y_n | f_n, g_n) = g_n^+ N(f_n | y_n g_n^+, c).   (39)

Then, the zeroth-order moment in (17) can be calculated as

Ẑ_n = ∫∫ N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n}) g_n^+ N(f_n | y_n g_n^+, c) df_n dg_n
    = ∫_0^∞ (g_n / |y_n|) N(g_n | μ_{g\n}, σ²_{g\n}) N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²) dg_n
    = Z̆ ∫_0^∞ g_n N(g_n | μ̆_g, σ̆_g²) dg_n = Z̆ μ̆_{gt} Φ(μ̆_g / σ̆_g).   (40)

The mean of the Gaussian N(g_n | μ̆_g, σ̆_g²) truncated to the positive values of g_n can be expressed as

μ̆_{gt} = μ̆_g + σ̆_g λ(−μ̆_g / σ̆_g)   (41)

where

λ(x) = φ(x) / Φ(−x);   φ(x) = (1/√(2π)) e^{−x²/2}.   (42)

The calculation of the first-order moment of g_n can be written as

μ̂_{g_n} = (1/Ẑ_n) ∫∫ g_n N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n}) g_n^+ N(f_n | y_n g_n^+, c) df_n dg_n
        = (1/Ẑ_n) ∫_0^∞ (g_n² / |y_n|) N(g_n | μ_{g\n}, σ²_{g\n}) N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²) dg_n
        = (Z̆/Ẑ_n) ∫_0^∞ g_n² N(g_n | μ̆_g, σ̆_g²) dg_n = (Z̆/Ẑ_n) m_{2t} Φ(μ̆_g / σ̆_g)   (43)

where the noncentered second-order moment of the Gaussian N(g_n | μ̆_g, σ̆_g²) truncated to the positive values of g_n is calculated as

m_{2t} = μ̆_g² + 2 μ̆_g σ̆_g λ(−μ̆_g/σ̆_g) + σ̆_g² (1 − (μ̆_g/σ̆_g) λ(−μ̆_g/σ̆_g)).   (44)

For the first-order moment of f_n we have

μ̂_{f_n} = (1/Ẑ_n) ∫∫ f_n N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n}) g_n^+ N(f_n | y_n g_n^+, c) df_n dg_n
        = (1/Ẑ_n) ∫_0^∞ (g_n / |y_n|) N(g_n | μ_{g\n}, σ²_{g\n}) μ̆_f N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²) dg_n   (45)

where

μ̆_f = σ̆_f² (μ_{f\n}/σ²_{f\n} + |y_n| g_n^+ / c);   σ̆_f² = (c^{-1} + σ_{f\n}^{-2})^{-1}.   (46)

Note that μ̆_f depends on g_n. Then, the solution of the integral in (45) involves the calculation of the first- and second-order moments of the truncated Gaussian N(g_n | μ̆_g, σ̆_g²). Making some simplifications, we obtain the solution given in (21).

The noncentered second-order moment of g_n can be calculated as

m̂_{2g_n} = (1/Ẑ_n) ∫∫ g_n² N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n}) g_n^+ N(f_n | y_n g_n^+, c) df_n dg_n
         = (1/Ẑ_n) ∫_0^∞ (g_n³ / |y_n|) N(g_n | μ_{g\n}, σ²_{g\n}) N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²) dg_n
         = (Z̆/Ẑ_n) ∫_0^∞ g_n³ N(g_n | μ̆_g, σ̆_g²) dg_n = (Z̆/Ẑ_n) m_{3t} Φ(μ̆_g / σ̆_g)   (47)

where the noncentered third-order moment of the Gaussian N(g_n | μ̆_g, σ̆_g²) truncated to the positive values of g_n is expressed as

m_{3t} = μ̆_g³ + 3 μ̆_g² σ̆_g λ(−μ̆_g/σ̆_g) + 3 μ̆_g σ̆_g² (1 − (μ̆_g/σ̆_g) λ(−μ̆_g/σ̆_g)) + σ̆_g³ λ(−μ̆_g/σ̆_g) (2 + (μ̆_g/σ̆_g)²).   (48)

Finally, the expression for the second-order moment of f_n is given by

m̂_{2f_n} = (1/Ẑ_n) ∫∫ f_n² N(f_n | μ_{f\n}, σ²_{f\n}) N(g_n | μ_{g\n}, σ²_{g\n}) g_n^+ N(f_n | y_n g_n^+, c) df_n dg_n
         = (1/Ẑ_n) ∫_0^∞ (g_n / |y_n|) N(g_n | μ_{g\n}, σ²_{g\n}) (μ̆_f² + σ̆_f²) N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²) dg_n
         = (1/Ẑ_n) ∫_0^∞ N(g_n | μ_{g\n}, σ²_{g\n}) N(g_n | μ_{f\n}/y_n, (c + σ²_{f\n})/y_n²)
             × [ g_n σ̆_f² / |y_n| + g_n³ |y_n| σ̆_f⁴ / c² + g_n μ²_{f\n} σ̆_f⁴ / (|y_n| σ⁴_{f\n}) + 2 g_n² σ̆_f⁴ μ_{f\n} / (c σ²_{f\n}) ] dg_n.   (49)

As in the case of the calculation of μ̂_{f_n}, the term μ̆_f depends on g_n, so some calculations with the moments of the Gaussian truncated to the positive values are needed to solve the integral. After some simplifications, we get the solution shown in (23).

APPENDIX B
DERIVATIVES OF THE LOG-EVIDENCE WITH RESPECT TO THE HYPERPARAMETERS

In this appendix, we detail the calculation of the partial derivatives of the log-evidence with respect to the hyperparameters θ_f, θ_g, and μ_0. These derivatives are necessary to search for the values of the hyperparameters that maximize the log-evidence using ML-II. The expressions obtained are similar to those described in [1] for EP GP classification. The derivatives with respect to θ_f and θ_g include explicit terms, corresponding to the parts of the log-evidence that depend on K_f and K_g, respectively, and implicit terms in the rest, due to the EP estimation of the site functions. Fortunately, the implicit derivatives are exactly zero, as proven in [31], so we only need to calculate the explicit terms. The partial derivative with respect to each of the hyperparameters in θ_f is given by

∂ log q(y)/∂θ_{fj} = (1/2) y^T (K_f + Σ̃_f)^{-1} (∂K_f/∂θ_{fj}) (K_f + Σ̃_f)^{-1} y − (1/2) tr((K_f + Σ̃_f)^{-1} ∂K_f/∂θ_{fj})
                  = (1/2) tr((α α^T − (K_f + Σ̃_f)^{-1}) ∂K_f/∂θ_{fj})   (50)

where α = (K_f + Σ̃_f)^{-1} y. The calculation of these derivatives requires O(N³) operations to invert the symmetric matrix (K_f + Σ̃_f) and O(N²) operations to compute each partial derivative with respect to θ_f [1]. For the set of hyperparameters θ_g, we have a similar expression

∂ log q(y)/∂θ_{gj} = ∂/∂θ_{gj} [ −(1/2) log|K_g + Σ̃_g| − (1/2) (μ_0 1 − μ̃_g)^T (K_g + Σ̃_g)^{-1} (μ_0 1 − μ̃_g) ]
                  = (1/2) (μ_0 1 − μ̃_g)^T (K_g + Σ̃_g)^{-1} (∂K_g/∂θ_{gj}) (K_g + Σ̃_g)^{-1} (μ_0 1 − μ̃_g) − (1/2) tr((K_g + Σ̃_g)^{-1} ∂K_g/∂θ_{gj})
                  = (1/2) tr((β β^T − (K_g + Σ̃_g)^{-1}) ∂K_g/∂θ_{gj})   (51)

where β = (K_g + Σ̃_g)^{-1} (μ_0 1 − μ̃_g). The computational complexity of these calculations is also O(N³) to invert (K_g + Σ̃_g) and O(N²) for each partial derivative. Finally, the derivative with respect to the hyperparameter μ_0 can be written as

∂ log q(y)/∂μ_0 = ∂/∂μ_0 [ −(1/2) (μ_0 1 − μ̃_g)^T (K_g + Σ̃_g)^{-1} (μ_0 1 − μ̃_g) ] = −(μ_0 1 − μ̃_g)^T (K_g + Σ̃_g)^{-1} 1_N = −β^T 1_N.   (52)

This derivative does not imply any additional computational cost, as β is also needed for the derivatives with respect to θ_g.

APPENDIX C
DGP PREDICTIVE DISTRIBUTION CALCULATION

The predictive distribution for a new test sample y_*, given in (29), can be calculated as follows:

q(y_*) = ∫∫ N(f_* | μ_{f*}, σ²_{f*}) N(g_* | μ_{g*}, σ²_{g*}) g_*^+ N(f_* | y_* g_*^+, c) df_* dg_*
       = ∫_0^∞ (g_* / |y_*|) N(g_* | μ_{g*}, σ²_{g*}) N(g_* | μ_{f*}/y_*, (c + σ²_{f*})/y_*²) dg_*
       = Z_*(y_*) ∫_0^∞ g_* N(g_* | μ̆_{g*}, σ̆²_{g*}) dg_*
       = Z_*(y_*) μ̆_{g*t} Φ(μ̆_{g*} / σ̆_{g*})   (53)

where μ̆_{g*t} is the mean of the Gaussian N(g_* | μ̆_{g*}, σ̆²_{g*}) truncated to the positive values of g_*, given by

μ̆_{g*t} = μ̆_{g*} + σ̆_{g*} λ(−μ̆_{g*} / σ̆_{g*}).   (54)
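The truncated-Gaussian quantities used throughout Appendix A (the mean μ̆_{gt} and the noncentered moments m_{2t} and m_{3t}) reduce to a few lines of code. The helper below is an independent transcription of (41), (44), and (48) using standard results for a Gaussian truncated to the positive half-line; variable names are our own, and a Monte Carlo check (drawing from N(mu, sigma²) and keeping only the positive samples) can be used to verify it.

```python
import numpy as np
from scipy.stats import norm

def truncated_moments(mu, sigma):
    """First three noncentered moments of N(mu, sigma^2) truncated to (0, inf)."""
    lam = norm.pdf(mu / sigma) / norm.cdf(mu / sigma)   # equals lambda(-mu/sigma) with lambda as in Eq. (42)
    m1 = mu + sigma * lam                                                     # Eq. (41)
    m2 = mu ** 2 + 2.0 * mu * sigma * lam + sigma ** 2 * (1.0 - (mu / sigma) * lam)   # Eq. (44)
    m3 = (mu ** 3 + 3.0 * mu ** 2 * sigma * lam
          + 3.0 * mu * sigma ** 2 * (1.0 - (mu / sigma) * lam)
          + sigma ** 3 * lam * (2.0 + (mu / sigma) ** 2))                     # Eq. (48)
    return m1, m2, m3
```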



REFERENCES

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[2] S. Haykin, Neural Networks: A Comprehensive Foundation. New York, NY, USA: Macmillan, 1994.
[3] V. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1999.
[4] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," J. Mach. Learn. Res., vol. 6, no. 12, pp. 1939–1959, 2005.
[5] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, "Sparse spectrum Gaussian process regression," J. Mach. Learn. Res., vol. 11, pp. 1865–1881, Jun. 2010.
[6] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop, "Regression with input-dependent noise: A Gaussian process treatment," in Advances in Neural Information Processing Systems, vol. 10, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds. Cambridge, MA, USA: MIT Press, 1998, pp. 493–499.
[7] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, "Most likely heteroscedastic Gaussian process regression," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 393–400.
[8] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine, "Kernel conditional quantile estimation via reduction revisited," in Proc. 9th IEEE Int. Conf. Data Mining, Miami, FL, USA, Dec. 2009, pp. 938–943.
[9] T. P. Minka, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, 2001.
[10] M. Lázaro-Gredilla and M. K. Titsias, "Variational heteroscedastic Gaussian process regression," in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011, pp. 841–848.
[11] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, "Heteroscedastic Gaussian process regression using expectation propagation," in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., Sep. 2011, pp. 1–6.
[12] Q. V. Le, A. J. Smola, and S. Canu, "Heteroscedastic Gaussian process regression," in Proc. 22nd Int. Conf. Mach. Learn., Bonn, Germany, 2005, pp. 489–496.
[13] R. P. Adams and O. Stegle, "Gaussian process product models for nonparametric nonstationarity," in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, 2008, pp. 1–8.
[14] M. W. Seeger, "Bayesian inference and optimal design for the sparse linear model," J. Mach. Learn. Res., vol. 9, pp. 759–813, Apr. 2008.
[15] I. Murray, R. P. Adams, and D. J. C. MacKay, "Elliptical slice sampling," in Proc. Int. Conf. Artif. Intell. Statist., Sardinia, Italy, 2010, pp. 541–548.
[16] R. M. Neal, "Regression and classification using Gaussian process priors," Bayesian Statist., vol. 6, pp. 475–501, Aug. 1999.
[17] R. M. Neal, "Slice sampling," Ann. Statist., vol. 31, no. 3, pp. 705–741, 2003.
[18] M. Opper and O. Winther, "Gaussian processes for classification: Mean-field algorithms," Neural Comput., vol. 12, no. 11, pp. 2655–2684, 2000.
[19] D. V. Hinkley, "On the ratio of two correlated normal random variables," Biometrika, vol. 56, no. 3, pp. 635–639, 1969.
[20] D. A. Nix and A. S. Weigend, "Learning local error bars for nonlinear regression," in Advances in Neural Information Processing Systems, vol. 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds. Cambridge, MA, USA: MIT Press, 1995, pp. 489–496.
[21] M. Yuan and G. Wahba, "Doubly penalized likelihood estimator in heteroscedastic regression," Statist. Probab. Lett., vol. 69, no. 1, pp. 11–20, 2004.
[22] G. Cawley, N. Talbot, and O. Chapelle, "Estimating predictive variances with kernel ridge regression," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (Lecture Notes in Computer Science), vol. 3944, J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Eds. Berlin, Germany: Springer-Verlag, 2006, pp. 56–77.
[23] B. W. Silverman, "Some aspects of the spline smoothing approach to non-parametric regression curve fitting," J. R. Statist. Soc. Ser. B (Methodol.), vol. 47, no. 1, pp. 1–52, 1985.
[24] T. A. Stamey, J. N. Kabalin, J. E. McNeal, I. M. Johnstone, F. Freiha, E. A. Redwine, et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II: Radical prostatectomy treated patients," J. Urology, vol. 141, no. 5, pp. 1076–1083, 1989.
[25] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer-Verlag, 2009.
[26] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004.
[27] D. Harrison and D. L. Rubinfeld, "Hedonic housing prices and the demand for clean air," J. Environ. Econ. Manag., vol. 5, no. 1, pp. 81–102, 1978.
[28] I. C. Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," J. Cement Concrete Res., vol. 28, no. 12, pp. 1797–1808, 1998.
[29] K. Bache and M. Lichman, UCI Machine Learning Repository. Berkeley, CA, USA: Univ. California, 2013.
[30] A. Tsanas, M. A. Little, P. E. McSharry, and L. O. Ramig, "Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests," IEEE Trans. Biomed. Eng., vol. 57, no. 4, pp. 884–893, Apr. 2010.
[31] M. Seeger, "Expectation propagation for exponential families," Dept. Electr. Eng. Comput. Sci., Univ. Berkeley, CA, USA, Tech. Rep. EPFL-REPORT-161464, 2005.

Luis Muñoz-González received the Telecommunication Engineering (Hons.) degree from Universidad de Valladolid, Valladolid, Spain, in 2007, and the M.Sc. (Hons.) degree in multimedia and communications from Universidad Carlos III de Madrid, Leganés, Spain, in 2011, where he is currently pursuing the Ph.D. degree. His current research interests include Bayesian models, Gaussian processes, and approximate inference.

Miguel Lázaro-Gredilla (M’11) received the Telecommunication Engineering (Hons.) degree from the University of Cantabria, Santander, Spain, and the Ph.D. (Hons.) degree from Universidad Carlos III de Madrid, Leganés, Spain, in 2004 and 2010, respectively. He has been a Visiting Researcher with the University of Cambridge, Cambridge, U.K.; the University of Manchester, Manchester, U.K.; and the University of Cantabria. He is currently a Visiting Lecturer with Universidad Carlos III de Madrid. His current research interests include Gaussian processes, Bayesian models, and approximate inference.

Aníbal R. Figueiras-Vidal (M’76–SM’84–F’12) received the Telecommunication Engineering (Hons.) degree from Universidad Politécnica de Madrid, Madrid, Spain, in 1973, the Doctoral (Hons.) degree from Universidad Politécnica de Barcelona, Barcelona, Spain, in 1976, and the Honoris Causa Doctor degree from Universidad de Vigo, Vigo, Spain, in 1999, and Universidad San Pablo, Arequipa, Peru, in 2011. He is a Professor of signal theory and communications with Universidad Carlos III, Madrid. He has authored or co-authored more than 300 journal and conference papers, and over 100 research projects and contracts. His current research interests include digital signal processing, digital communications, and machine learning. Dr. Figueiras-Vidal is a member of the Royal Academy of Engineering of Spain. He was the President from 2007 to 2011.
