nips nips2012 nips2012-37 knowledge-graph by maker-knowledge-mining

37 nips-2012-Affine Independent Variational Inference


Source: pdf

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. [sent-6, score-0.286]

2 In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. [sent-7, score-0.434]

3 This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. [sent-9, score-0.569]

4 1 Introduction Whilst Bayesian methods have played a significant role in machine learning and related areas (see [1] for an introduction), improving the class of distributions for which inference is either tractable or can be well approximated remains an ongoing challenge. [sent-10, score-0.165]

5 Our contribution is to extend the class of approximating distributions beyond classical forms to approximations that can possess skewness or other non-Gaussian characteristics, while maintaining computational efficiency. [sent-12, score-0.292]

6 We consider approximating the normalisation constant Z of a probability density function p(w), where p(w) = (1/Z) ∏_{n=1}^N f_n(w) with Z = ∫ ∏_{n=1}^N f_n(w) dw. [sent-13, score-0.47]

7 To address this we may find an approximating density q(w) to the target p(w) by minimising the Kullback-Leibler (KL) divergence KL(q(w)|p(w)) = ⟨log q(w)⟩_{q(w)} − ⟨log p(w)⟩_{q(w)} = −H[q(w)] − ⟨log p(w)⟩_{q(w)}. [sent-16, score-0.318]

8 The non-negativity of the KL divergence provides the lower bound log Z ≥ H[q(w)] + ∑_{n=1}^N ⟨log f_n(w)⟩ := B (1.3). [sent-18, score-0.309]

9 Finding the best parameters θ of the approximating density q(w|θ) is then equivalent to maximising the lower bound B on log Z. [sent-20, score-0.273]

10 We therefore specialise to models of the form p(w) ∝ N(w|µ, Σ) ∏_{n=1}^N f_n(wᵀx_n) (1.4). [sent-47, score-0.18]

11 where {x_n}_{n=1}^N is a collection of fixed D-dimensional real vectors and f_n : R → R⁺; N(w|µ, Σ) denotes a multivariate Gaussian in w with mean µ and covariance Σ. [sent-48, score-0.23]
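
As a concrete illustration of how (1.3) and (1.4) fit together, the sketch below estimates the bound B by Monte Carlo for a Gaussian approximating density q(w) = N(w|m, CᵀC). This is only a hedged illustration: the paper evaluates the required expectations deterministically (and, for AI densities, via the FFT machinery described below), and the function name and the `log_f` interface are assumptions introduced here, not part of the paper.

```python
import numpy as np

# Hedged sketch: Monte Carlo estimate of the KL lower bound
#   B = H[q(w)] + <log N(w|mu, Sigma)>_q + sum_n <log f_n(w^T x_n)>_q
# for a Gaussian q(w) = N(w | m, C^T C).  `log_f` maps an array of projections
# y_n = w^T x_n to log f_n(y_n); this interface is illustrative only.
def kl_bound_mc(m, C, mu, Sigma, X, log_f, n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    D = len(m)
    W = m + rng.standard_normal((n_samples, D)) @ C      # samples w ~ q
    # entropy of the Gaussian approximation
    H_q = 0.5 * D * (1.0 + np.log(2.0 * np.pi)) + np.log(np.abs(np.linalg.det(C)))
    # expected log Gaussian term
    diff = W - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    log_prior = -0.5 * (quad + D * np.log(2.0 * np.pi) + logdet)
    # non-conjugate site terms log f_n(w^T x_n), summed over n
    log_sites = log_f(W @ X.T).sum(axis=1)
    return H_q + np.mean(log_prior + log_sites)
```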

12 It is therefore important to extend the class of tractable approximating distributions beyond standard forms such as the multivariate Gaussian [20, 12, 2, 13]. [sent-52, score-0.238]

13 2 Affine independent densities. We first consider independently distributed latent variables v ∼ q_v(v|θ) = ∏_{d=1}^D q_{v_d}(v_d|θ_d) with ‘base’ distributions q_{v_d}. [sent-55, score-0.374]

14 The distribution on w is then q_w(w|A, b, θ) = ∫ δ(w − Av − b) q_v(v|θ) dv = (1/|det(A)|) ∏_d q_{v_d}([A⁻¹(w − b)]_d | θ_d). [sent-57, score-0.521]
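
A small hedged sketch of this construction follows, assuming scipy frozen distributions are used as the univariate bases; the function names are invented here for illustration and are not from the paper.

```python
import numpy as np
from scipy import stats

# Hedged sketch of an affine independent (AI) density w = A v + b with independent
# univariate base variables v_d.  `base_rvs` is a list of scipy frozen distributions,
# e.g. [stats.laplace()] * D; this interface is an assumption for illustration.
def ai_sample(A, b, base_rvs, size, seed=0):
    rng = np.random.default_rng(seed)
    V = np.column_stack([rv.rvs(size=size, random_state=rng) for rv in base_rvs])
    return V @ A.T + b                         # rows are samples of w = A v + b

def ai_logpdf(W, A, b, base_rvs):
    V = np.linalg.solve(A, (W - b).T).T        # v = A^{-1}(w - b) for each row of W
    _, logdet = np.linalg.slogdet(A)           # log|det(A)|
    return sum(rv.logpdf(V[:, d]) for d, rv in enumerate(base_rvs)) - logdet
```

With standard Gaussian bases, for example, w = Av + b is exactly N(b, AAᵀ); skew-normal, Laplace or generalised-normal bases give the skewed or heavy-tailed AI densities discussed here.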

15 Typically we assume the base distributions are homogeneous, q_{v_d} ≡ q_v. [sent-62, score-0.628]

16 By using, for example, Student’s t, Laplace, logistic, generalised-normal or skew-normal base distributions, equation (2.1) … [sent-64, score-0.17]

17 This class of multivariate distributions has the important property that, unlike the … For p(w) in this model class and Gaussian q(w) = N(w|m, CᵀC), B is tighter than ‘local’ bounds [14, 11, 22, 18, 16]. [sent-66, score-0.2]

18 The skew-normal q(w) recently discussed in [21] possesses skew in one direction of parameter space only and is a special case of the AI skew-normal densities used in section 4. [sent-68, score-0.227]

19 See figures 1, 2 and 3 for examples of two dimensional distributions q_w(w|A, b, θ) with skew-normal and generalised-normal base distributions used to approximate toy machine learning problems. [sent-73, score-0.538]

20 Provided we choose a base distribution class that includes the Gaussian as a special case (for example generalised-normal, skew-normal and asymptotically Student’s t) we are guaranteed to perform at least as well as classical multivariate Gaussian KL approximate inference. [sent-74, score-0.305]

21 Full derivations, including formulae for skew-normal and generalised-normal base distributions, are given in the supplementary material. [sent-80, score-0.17]

22 2.1 Evaluating the KL bound. The KL bound can be readily decomposed as B = log|det(A)| + ∑_{d=1}^D H[q(v_d|θ_d)] + ⟨log N(w|µ, Σ)⟩ + ∑_{n=1}^N ⟨log f_n(wᵀx_n)⟩. [sent-82, score-0.499]
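
The entropy part of this decomposition follows directly from the affine change of variables, H[q_w] = log|det(A)| + ∑_d H[q(v_d|θ_d)]; a hedged one-liner under the same scipy-frozen-distribution assumption as in the sketch above:

```python
import numpy as np
from scipy import stats

# Hedged sketch: entropy of the AI density w = A v + b decomposes as
# log|det(A)| plus the sum of the univariate base entropies.
def ai_entropy(A, base_rvs):
    _, logdet = np.linalg.slogdet(A)
    return logdet + sum(rv.entropy() for rv in base_rvs)

# example with illustrative values:
# ai_entropy(np.array([[2.0, 0.3], [0.0, 1.5]]), [stats.laplace(), stats.norm()])
```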

23 For many standard base distributions the entropy H[q_{v_d}(v_d|θ_d)] is available in closed form. [sent-84, score-0.282]

24 When the entropy of a univariate base distribution is not analytically available, we assume it can be cheaply evaluated numerically. [sent-85, score-0.323]

25 The energy contribution to the KL bound is the sum of the expectation of the log Gaussian term (which requires only first and second order moments) and the nonlinear ‘site projections’. [sent-86, score-0.172]

26 2.1.1 Marginal site densities. Defining y := wᵀx, the expectation of the site projection for any function g and fixed vector x is equivalent to a one-dimensional expectation, ⟨g(wᵀx)⟩_{q_w(w)} = ⟨g(y)⟩_{q_y(y)} with q_y(y) = ∫ δ(y − xᵀw) q_w(w) dw = ∫ δ(y − αᵀv − β) q_v(v) dv. [sent-90, score-1.016]

27 We can rewrite this D-dimensional integral as a one dimensional integral using the integral transform δ(x) = ∫ e^{2πitx} dt: q_y(y) = ∫∫ e^{2πit(y−αᵀv−β)} ∏_{d=1}^D q_{v_d}(v_d) dv dt = ∫ e^{2πit(y−β)} ∏_{d=1}^D q̃_{u_d}(t) dt (2.4). [sent-92, score-0.738]

28 where f̃(t) denotes the Fourier transform of f(x) and q_{u_d}(u_d|θ_d) is the density of the random variable u_d := α_d v_d, so that q_{u_d}(u_d|θ_d) = (1/|α_d|) q_{v_d}(u_d/α_d | θ_d). [sent-93, score-1.159]

29 Unfortunately, most distributions do not have Fourier transforms that admit compact analytic forms for the product ∏_{d=1}^D q̃_{u_d}(t). [sent-96, score-0.427]

30 With the exception of the Gaussian (the only stable distribution with finite variance), Lévy and Cauchy distributions, stable distributions do not have analytic forms for their density functions and are analytically expressible only in the Fourier domain. [sent-98, score-0.293]

31 Nevertheless, when qv (v) is stable distributed, marginal quantities of w such as y can be computed analytically in the Fourier domain [3]. [sent-99, score-0.263]

32 Figure 2 (panels (a)–(c)): Two dimensional Bayesian logistic regression with Gaussian prior N(w|0, 10I) and likelihood f_n(w) = σ(τ_l c_n wᵀx_n), τ_l = 5. [sent-100, score-0.434]

33 In general, therefore, we need to resort to numerical methods to compute qy (y) and expectations with respect to it. [sent-108, score-0.215]

34 To achieve this we discretise the base distributions and, by choosing a sufficiently fine discretisation, limit the maximal error that can be incurred. [sent-109, score-0.311]

35 First we define the set of discrete approximations to {q_{u_d}(u_d|θ_d)}_{d=1}^D for u_d := α_d v_d. [sent-111, score-0.458]

36 These ‘lattice’ approximations are a weighted sum of K delta functions, q_{u_d}(u_d|θ_d) ≈ q̂_{u_d}(u_d) := ∑_{k=1}^K π_{dk} δ(u_d − l_k), where π_{dk} = ∫_{l_k − ∆/2}^{l_k + ∆/2} q(u_d|θ_d) du_d (2.5). [sent-112, score-1.007]

37 The lattice points {l_k}_{k=1}^K are spaced uniformly over the domain [l_1, l_K] with ∆ := l_{k+1} − l_k. [sent-114, score-0.489]

38 The weighting for each delta spike is the mass assigned to the distribution q_{u_d}(u_d|θ_d) over the interval [l_k − ∆/2, l_k + ∆/2]. [sent-115, score-0.497]

39 Given the lattice approximations to the densities {q_{u_d}(u_d|θ_d)}_{d=1}^D, the fast Fourier transform (FFT) can be used to evaluate the convolution of the lattice distributions. [sent-116, score-0.61]

40 Doing so we obtain the lattice approximation to the marginal y = wᵀx such that (see supplementary section 2.2): [sent-117, score-0.245]

41 q_y(y) ≈ q̂_y(y) = ∑_{k=1}^K δ(y − l_k − β) ρ_k, where ρ = ifft[∏_{d=1}^D fft[π_d]]. [sent-118, score-0.669]

42 The only approximation used in finding the marginal density is then the discretisation of the base distributions, with the remaining FFT calculations being exact. [sent-121, score-0.336]
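
The following hedged sketch mirrors this lattice-and-FFT computation numerically. It uses scipy.signal.fftconvolve to convolve the lattice distributions pairwise rather than forming the single product of FFTs written above, and the lattice construction (a shared spacing, a crude common scale, symmetric grids around zero) and all names are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

# Hedged sketch of the lattice approximation to the marginal q_y(y) for
# y = w^T x = alpha^T v + beta: each u_d = alpha_d * v_d is discretised onto a
# lattice of K points, and the lattice weights are convolved with FFT-based
# convolutions.  `base_pdfs` are the univariate base densities q_{v_d}, assumed
# here to have roughly unit scale so the grid half-width below is adequate.
def lattice_marginal(alpha, beta, base_pdfs, K=2048, n_std=6.0):
    alpha = np.asarray(alpha, dtype=float)
    half = n_std * np.max(np.abs(alpha))            # crude half-width for every u_d
    l = np.linspace(-half, half, K)                 # shared lattice, spacing delta
    delta = l[1] - l[0]
    rho = None
    for a_d, pdf in zip(alpha, base_pdfs):
        pi_d = pdf(l / a_d) / np.abs(a_d) * delta   # cell mass of q_{u_d}
        rho = pi_d if rho is None else fftconvolve(rho, pi_d)
    D = len(alpha)
    y = -D * half + delta * np.arange(rho.size) + beta   # grid origin adds up under convolution
    return y, np.maximum(rho, 0.0) / delta               # lattice points and density values
```

Expectations ⟨g(y)⟩ of the site projections can then be approximated by a weighted sum of g over the returned lattice.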

43 2.1.2 Efficient site derivative computation. Whilst we have shown that the expectation of the site projections can be accurately computed using the FFT, how to cheaply evaluate the derivative of this term is less clear. [sent-125, score-0.355]

44 The complication can be seen by inspecting the partial derivative of ⟨g(wᵀx)⟩ with respect to A_{mn}: ∂⟨g(wᵀx)⟩_{q(w)}/∂A_{mn} = x_n ∫ q_v(v) g′(xᵀAv + bᵀx) v_m dv. [sent-126, score-0.213]

45 To see this we can write ∂⟨g(wᵀx)⟩/∂A_{mn} = x_n ∫ g′(y) d_m(y) dy, where d_m(y) := ∫ v_m q_v(v) δ(y − αᵀv − β) dv. [sent-130, score-0.263]

46 Figure 3: Two dimensional robust linear regression with Gaussian prior N(w|0, I), Laplace likelihood f_n(w) = (1/2τ_l) e^{−|y_n − wᵀx_n|/τ_l} with τ_l = 0. [sent-161, score-0.282]

47 Here d_m(y) is a univariate weighting function with Fourier transform d̃_m(t) = e^{−2πitβ} ẽ_m(t) ∏_{d≠m} q̃_{u_d}(t), where ẽ_m(t) := ∫ u_m q_{u_m}(u_m) e^{−2πitu_m} du_m. [sent-169, score-0.479]

48 Thus the complexity of computing the site derivative is equivalent to the complexity of the site expectation of section 2.1.1. [sent-171, score-0.311]

49 Even for non-smooth functions g the site gradient has the additional property that it is smooth, provided the base distributions are smooth. [sent-174, score-0.383]

50 This means that gradient optimisation for AI based KL approximate inference can be applied, even when the target density is non-smooth. [sent-176, score-0.251]

51 In contrast, other deterministic approximate inference routines are not universally applicable to non-smooth target densities – for instance the Laplace approximation and [10] both require the target to be smooth. [sent-177, score-0.329]

52 2.2 Optimising the KL bound. Given fixed base distributions, we can optimise the KL bound with respect to the parameters A = LU and b. [sent-179, score-0.434]

53 We can also efficiently evaluate the gradients of the KL bound with respect to the parameters θ that define the base distribution. [sent-184, score-0.268]

54 These parameters θ can control higher order moments of the approximating density q(w) such as skewness and kurtosis. [sent-185, score-0.202]

55 We choose the skew-normal distribution to approximate Bayesian logistic regression posteriors. [sent-189, score-0.179]

56 For heavy-tailed posteriors that arise for example in robust or sparse Bayesian linear regression models, one choice is to use the generalised-normal as base density, which includes the Laplace and Gaussian distributions as special cases. [sent-190, score-0.361]

57 However, in situations for which it is not clear how to select q_v(v), several different distributions can be assessed, and the one that achieves the greatest lower bound B is preferred. [sent-192, score-0.334]

58 The accuracy of the approximation in section 2.1.1 depends on the number of lattice points used to evaluate the convolved density function q_y(y). [sent-197, score-0.459]

59 For the results presented we implemented a simple strategy for choosing the lattice points [l_1, …, l_K]. [sent-198, score-0.181]

60 From Chebyshev’s inequality, taking six standard deviation end points guarantees that we capture at least 97% of the mass of qy (y). [sent-203, score-0.215]

61 In practice this proportion is often much higher since q_y(y) is often close to Gaussian for D ≫ 1. [sent-204, score-0.215]
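
A hedged sketch of that endpoint rule, assuming the means and variances of the base variables are available; names are illustrative:

```python
import numpy as np

# Hedged sketch: six-standard-deviation endpoints for the lattice of y = alpha^T v + beta.
# By Chebyshev's inequality, +/- 6 standard deviations capture at least 1 - 1/36 ~ 97%
# of the mass of q_y(y), whatever the base distributions.
def lattice_endpoints(alpha, beta, v_means, v_vars, n_std=6.0):
    alpha = np.asarray(alpha, dtype=float)
    mean_y = float(alpha @ np.asarray(v_means)) + beta
    std_y = float(np.sqrt(alpha**2 @ np.asarray(v_vars)))
    return mean_y - n_std * std_y, mean_y + n_std * std_y
```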

62 We fix the number of lattice points used during optimisation to suit our computational budget. [sent-205, score-0.24]

63 To compute the final bound value we apply the simple strategy of doubling the number of lattice points until the evaluated bound changes by less than 10⁻³ [6]. [sent-206, score-0.377]
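
A hedged sketch of that stopping rule; `eval_bound` stands for any routine that evaluates the KL bound with a given number of lattice points and is a placeholder introduced here:

```python
# Hedged sketch of the reported strategy: double the number of lattice points
# until the evaluated bound changes by less than 1e-3.
def refine_bound(eval_bound, K0=1024, tol=1e-3, max_doublings=10):
    K, prev = K0, eval_bound(K0)
    for _ in range(max_doublings):
        K *= 2
        cur = eval_bound(K)
        if abs(cur - prev) < tol:
            return cur, K
        prev = cur
    return prev, K
```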

64 Fully characterising the overall accuracy of the approximation as a function of the number of lattice points is complex, see [24, 26] for a related discussion. [sent-207, score-0.214]

65 When the condition number is large, many lattice points are needed to accurately discretise the set of distributions {q_{u_d}(u_d|θ_d)}_{d=1}^D, which increases the time and memory requirements. [sent-209, score-0.322]

66 One possible route to circumventing these issues is to use base densities that have analytic Fourier transforms (such as a mixture of Gaussians). [sent-210, score-0.351]

67 In such cases the discrete Fourier transform of q_y(y) can be directly evaluated by computing the product of the Fourier transforms of each of {q_{u_d}(u_d|θ_d)}_{d=1}^D. [sent-211, score-0.322]

68 The computational bottleneck for AI inference, assuming N > D, arises from computing the expectation and partial derivatives of the N site projections. [sent-213, score-0.172]

69 It was shown in [7] that exact Gaussian KL approximate inference has bound and gradient computations which scale O(N D²). [sent-216, score-0.198]

70 Similarly, local variational bounding methods (see below) scale O(N D²) when implemented exactly. [sent-217, score-0.171]

71 3 Related methods. Another commonly applied technique to obtain a lower bound for densities of the form of equation (1.4) [sent-218, score-0.236]

72 is the ‘local’ variational bounding procedure [14, 11, 22, 18]. [sent-219, score-0.171]

73 Local bounding methods approximate the normalisation constant by bounding each non-conjugate term in the integrand of equation (1.4). [sent-220, score-0.228]

74 In [7] we showed that the Gaussian KL bound dominates the local bound in such models. [sent-222, score-0.247]

75 Other approaches increase the flexibility of the approximating distribution by expressing q_w(w) as a mixture. [sent-224, score-0.208]

76 Another recently proposed method to approximate integrals using mixtures is split mean field which iterates between constructing soft partitions of the integral domain and bounding those partitioned integrals [5]. [sent-227, score-0.234]

77 For symmetric densities {q_{u_d}(u_d|θ_d)} we arranged that their mode coincides with the central lattice point. [sent-232, score-0.319]

78 Figure 4 (panels (a) and (b), x-axis N_trn): Gaussian KL and AI KL approximate inference comparison for a Bayesian logistic regression model with different training data set sizes N_trn. [sent-244, score-0.237]

79 w ∈ R¹⁰; Gaussian prior N(w|0, 5I); logistic sigmoid likelihood f_n = σ(τ_l c_n wᵀx_n) with τ_l = 5; covariates x_n sampled from the standard normal, w_true sampled from the prior and class labels c_n = ±1 sampled from the likelihood. [sent-245, score-0.463]

80 (b) Mean and standard error averaged test set log probability (ATLP) differences obtained with the Gaussian and AI approximate posteriors for different training dataset sizes Ntrn . [sent-250, score-0.173]
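
A hedged sketch of generating the synthetic data described in this caption; the function name and sampling interface are illustrative, not from the paper:

```python
import numpy as np

# Hedged sketch of the synthetic Bayesian logistic regression data described above:
# w_true ~ N(0, 5I) in R^10, covariates x_n ~ N(0, I), labels c_n = +/-1 drawn from
# the sigmoid likelihood sigma(tau_l * c_n * w^T x_n) with tau_l = 5.
def make_synthetic_logistic(n, d=10, tau_l=5.0, prior_var=5.0, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(0.0, np.sqrt(prior_var), size=d)
    X = rng.standard_normal((n, d))
    p_pos = 1.0 / (1.0 + np.exp(-tau_l * (X @ w_true)))   # P(c_n = +1 | x_n, w_true)
    c = np.where(rng.random(n) < p_pos, 1, -1)
    return X, c, w_true
```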

81 4.1 Toy problems. We compare the performance of Gaussian KL and AI KL approximate inference methods in three different two dimensional generalised linear models against the true posteriors and marginal likelihood values obtained numerically. [sent-253, score-0.311]

82 Figure 1 presents results for a linear regression model with a sparse Laplace prior; the AI base density is chosen to be generalised-normal. [sent-255, score-0.294]

83 Figure 2 demonstrates approximating a Bayesian logistic regression posterior, with the AI base distribution skew-normal. [sent-256, score-0.378]

84 Figure 3 corresponds to a Bayesian linear regression model with the noise robust Laplace likelihood density and Gaussian prior; again the AI approximation uses the generalised-normal as the base distribution. [sent-257, score-0.356]

85 The AI KL procedure achieves a consistently higher bound than the Gaussian case, with the AI bound nearly saturating at the true value of log Z in two of the models. [sent-258, score-0.237]

86 4.2 Bayesian logistic regression. We compare Gaussian KL and AI KL approximate inference for a synthetic Bayesian logistic regression model. [sent-261, score-0.374]

87 The AI density has a skew-normal base distribution with θ_d parameterising the skewness of v_d. [sent-262, score-0.498]

88 We optimised the AI KL bound jointly with respect to L, U, b and θ, with convergence taking on average 8 seconds with D = N = 10, compared to 0. [sent-263, score-0.165]

89 We note that split mean field approximate inference was reported to take approximately 100 seconds for a similar logistic regression model achieving comparable results [20]. [sent-272, score-0.274]

90 In (b) we plot the mean and standard error differences for the averaged test set log probabilities (ATLP) calculated using the Gaussian and AI approximate posteriors obtained in each model. [sent-273, score-0.173]

91 The bound differences can be seen to be strongly correlated with test set log probability differences, confirming that tighter bound values correspond to improved predictive performance. [sent-276, score-0.271]

92 The Laplace distribution is also used as a noise robust likelihood f_n(w) = p(y_n|w, k_n) = e^{−|y_n − wᵀk_n|/τ_l}/(2τ_l), where k_n is the nth vector of the kernel matrix. [sent-280, score-0.272]
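
For concreteness, a hedged sketch of evaluating the log of these Laplace sites; the argument names (kernel matrix Kmat, targets y) are assumptions for illustration:

```python
import numpy as np

# Hedged sketch: log of the noise-robust Laplace sites described above,
#   log f_n(w) = -|y_n - w^T k_n| / tau_l - log(2 * tau_l).
def log_laplace_sites(w, Kmat, y, tau_l):
    r = y - Kmat @ w                  # residuals y_n - w^T k_n
    return -np.abs(r) / tau_l - np.log(2.0 * tau_l)
```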

93 AI KL inference is performed with a generalised-normal base distribution. [sent-328, score-0.228]

94 The parameters θ_d control the kurtosis of the base distributions q(v_d|θ_d); for simplicity we fix θ_d = 1. [sent-329, score-0.244]

95 Whilst the improvements for these particular datasets are modest, the AI bound dominates the Gaussian bound in all three datasets, with predictive log probabilities also showing consistent improvement. [sent-333, score-0.288]

96 Whilst we have only presented experimental results for AI distributions with simple analytically expressible base distributions, we note the method is applicable for any base distribution provided {q_{v_d}(v_d)}_{d=1}^D are smooth. [sent-334, score-0.568]

97 Since we optimise the KL divergence over a larger class of approximating densities than the multivariate Gaussian, the lower bound to the normalisation constant is also improved. [sent-338, score-0.546]

98 AI densities provide a general and computationally efficient means for inference in non-Gaussian densities whose application could be useful for a range of probabilistic models. [sent-342, score-0.196]

99 However, our current understanding of the best approach to discretise the base densities is limited and further study of this is required, particularly for application in very large systems with D ≫ 1. [sent-343, score-0.293]

100 It would also be useful to investigate using base densities that directly allow for efficient computation of the marginals q_y(y) in the Fourier domain. [sent-344, score-0.523]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qud', 0.31), ('kl', 0.29), ('qvd', 0.222), ('ud', 0.215), ('qy', 0.215), ('ai', 0.208), ('vd', 0.197), ('lattice', 0.181), ('ntrn', 0.177), ('base', 0.17), ('qv', 0.162), ('fourier', 0.158), ('lk', 0.154), ('site', 0.139), ('densities', 0.138), ('qw', 0.137), ('fn', 0.129), ('bg', 0.124), ('variational', 0.117), ('whilst', 0.115), ('bai', 0.115), ('gaussian', 0.098), ('bound', 0.098), ('wt', 0.094), ('laplace', 0.092), ('atlp', 0.089), ('skew', 0.089), ('fft', 0.085), ('normalisation', 0.078), ('logistic', 0.076), ('distributions', 0.074), ('approximating', 0.071), ('skewness', 0.068), ('optimise', 0.068), ('discretise', 0.067), ('ntst', 0.067), ('transform', 0.064), ('density', 0.063), ('regression', 0.061), ('multivariate', 0.06), ('optimisation', 0.059), ('av', 0.059), ('wd', 0.059), ('bayesian', 0.059), ('amn', 0.059), ('inference', 0.058), ('posteriors', 0.056), ('bounding', 0.054), ('generalised', 0.054), ('integrals', 0.051), ('dominates', 0.051), ('xn', 0.051), ('dm', 0.05), ('expressible', 0.048), ('cn', 0.047), ('tailed', 0.046), ('approximations', 0.046), ('af', 0.045), ('cheaply', 0.044), ('lpai', 0.044), ('lpg', 0.044), ('slump', 0.044), ('posterior', 0.044), ('dv', 0.043), ('transforms', 0.043), ('approximate', 0.042), ('log', 0.041), ('dimensional', 0.041), ('boston', 0.041), ('fd', 0.041), ('univariate', 0.039), ('discretisation', 0.039), ('kn', 0.038), ('stable', 0.038), ('entropy', 0.038), ('seconds', 0.037), ('challis', 0.036), ('darmstadt', 0.036), ('integral', 0.036), ('det', 0.035), ('differences', 0.034), ('delta', 0.033), ('class', 0.033), ('expectation', 0.033), ('approximation', 0.033), ('datapoints', 0.032), ('minimising', 0.032), ('analytically', 0.032), ('marginal', 0.031), ('barber', 0.031), ('bouchard', 0.031), ('heavy', 0.031), ('lu', 0.031), ('yn', 0.03), ('jointly', 0.03), ('um', 0.03), ('dk', 0.029), ('likelihood', 0.029), ('target', 0.029), ('invertible', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

2 0.11484013 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy

Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance. 1

3 0.10682034 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield impressively better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets. 1

4 0.10111197 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

5 0.09674345 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

Author: Zhihua Zhang, Bojun Tu

Abstract: In this paper we study sparsity-inducing nonconvex penalty functions using Lévy processes. We define such a penalty as the Laplace exponent of a subordinator. Accordingly, we propose a novel approach for the construction of sparsity-inducing nonconvex penalties. Particularly, we show that the nonconvex logarithmic (LOG) and exponential (EXP) penalty functions are the Laplace exponents of Gamma and compound Poisson subordinators, respectively. Additionally, we explore the concave conjugate of nonconvex penalties. We find that the LOG and EXP penalties are the concave conjugates of negative Kullback-Leibler (KL) distance functions. Furthermore, the relationship between these two penalties is due to asymmetricity of the KL distance. 1

6 0.094187543 128 nips-2012-Fast Resampling Weighted v-Statistics

7 0.093871452 23 nips-2012-A lattice filter model of the visual pathway

8 0.084616229 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

9 0.081809394 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

10 0.080635689 38 nips-2012-Algorithms for Learning Markov Field Policies

11 0.079119399 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

12 0.074192844 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

13 0.073125459 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

14 0.072981864 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

15 0.069375731 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

16 0.068052404 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

17 0.065629594 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

18 0.065528035 294 nips-2012-Repulsive Mixtures

19 0.064087436 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

20 0.062375609 206 nips-2012-Majorization for CRFs and Latent Likelihoods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.029), (2, 0.027), (3, 0.002), (4, -0.1), (5, 0.02), (6, 0.005), (7, 0.032), (8, 0.008), (9, -0.1), (10, -0.022), (11, -0.014), (12, 0.013), (13, -0.029), (14, -0.014), (15, -0.078), (16, 0.021), (17, -0.05), (18, -0.002), (19, 0.011), (20, 0.024), (21, -0.02), (22, 0.023), (23, 0.048), (24, 0.014), (25, 0.037), (26, -0.009), (27, 0.029), (28, -0.115), (29, 0.04), (30, -0.038), (31, -0.032), (32, 0.013), (33, -0.093), (34, -0.027), (35, -0.059), (36, 0.025), (37, -0.072), (38, -0.142), (39, 0.11), (40, 0.076), (41, 0.044), (42, 0.096), (43, -0.057), (44, 0.007), (45, 0.02), (46, -0.044), (47, -0.102), (48, 0.026), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93369907 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

2 0.69449347 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy

Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance. 1

3 0.67030251 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

4 0.59059298 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

5 0.57684112 206 nips-2012-Majorization for CRFs and Latent Likelihoods

Author: Tony Jebara, Anna Choromanska

Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods.

6 0.55208302 128 nips-2012-Fast Resampling Weighted v-Statistics

7 0.53542322 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

8 0.52800423 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

9 0.52576447 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

10 0.51122797 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

11 0.51092041 359 nips-2012-Variational Inference for Crowdsourcing

12 0.50696462 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

13 0.50565451 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

14 0.50328529 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

15 0.50069135 234 nips-2012-Multiresolution analysis on the symmetric group

16 0.49728802 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

17 0.49673066 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

18 0.49341941 211 nips-2012-Meta-Gaussian Information Bottleneck

19 0.4898771 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization

20 0.48492321 294 nips-2012-Repulsive Mixtures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (2, 0.256), (17, 0.011), (21, 0.029), (38, 0.129), (39, 0.029), (42, 0.034), (54, 0.019), (55, 0.04), (74, 0.034), (76, 0.162), (80, 0.1), (92, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82687151 196 nips-2012-Learning with Partially Absorbing Random Walks

Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang

Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.

same-paper 2 0.80176061 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

3 0.79497981 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a stateof-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. 1 Train labels Train images Test image Appearance model Joint Model Shape model Parsing Figure 1: Overview. Using annotated images separate models of shape and appearance are trained. 
Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5. 1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd }, d = 1...n of such images X, each consisting of P pixels {xi }, i = 1...P , we wish to infer a segmentation S for the image. S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli ), P l = 0...L, sli 2 {0, 1} and l sli = 1. Note that the background is also treated as a ‘part’ (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1 , h2 }), and defines a P Boltzmann distribution over segmentations p(S) = h1 ,h2 exp{ E(S, h1 , h2 |✓s )}/Z(✓s ) where X X X X X 1 2 E(S, h1 , h2 |✓s ) = bli sli + wlij sli h1 + c 1 h1 + wjk h1 h2 + c2 h2 , (1) j j j j k k k i,l j i,j,l j,k k where j and k range over the first and second layer hidden variables, and ✓s = {W 1 , W 2 , b, c1 , c2 } are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2 . 
These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. 2 h2 1 1 h S S (a) Mean h S (b) MRF h2 h2 h1 S S (c) DBM b (d) SBM (e) 2D SBM Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1]. k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 ⇡ l=0 l=1 l=2 Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color. Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means {µw } and covariances {⌃w } of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when ‘coloring in’ the part. The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable al ). Given al , the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means {µw } and covariances {⌃w } and mixing proportions { lkw }. The prior on A = {al } specifies the probability ⇡lk of appearance class k being chosen for part l. Therefore appearance parameters ✓a = {⇡lk , lkw } (see Fig. 3) and: a p(xi |A, si , ✓ ) = p(A|✓a ) = Y l Y l a sli p(xi |al , ✓ ) p(al |✓a ) = = Y Y X YY l l k w lkw N (xi |µw , ⌃w ) !alk !sli (⇡lk )alk . , (2) (3) k Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model’s active parameters ✓ include shape parameters ✓s and appearance parameters ✓a , so that p(X, A, S, H|✓) = Y 1 p(A|✓a )p(S, H|✓s ) p(xi |A, si , ✓a ) , Z( ) i (4) where the parameter adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of ✓ that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S|Xtest , ✓). Note that Z( ) is constant throughout the execution of the algorithms. We set via trial and error in our experiments. 3 n H ✓a si al H xi L+1 ✓s S X A P Figure 4: A model of shape and appearance. Left: The joint model. Pixels xi are modeled via appearance variables al . The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables si assign each pixel to a layer. Right: Schematic for an image X. 2 Inference and learning Inference: We approximate p(A, S, H|X, ✓) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S|X, ✓) can then be obtained by considering only the samples for S (see Algorithm 1). 
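As a concrete illustration of the shape term in the joint model above, the following NumPy sketch evaluates the MSBM energy of a candidate segmentation and hidden-state configuration (cf. Eq. 1), under the standard DBM sign convention p(S, h1, h2) proportional to exp(-E). It is a simplified, assumption-laden rendering, not the authors' implementation: the patch-based receptive fields and first-layer weight sharing described above are ignored, and the array shapes are assumptions chosen for readability.

import numpy as np

def msbm_energy(S, h1, h2, W1, W2, b, c1, c2):
    # S  : (P, L+1) one-hot segmentation, S[i, l] = s_li
    # h1 : (J,) and h2 : (K,) binary hidden layers
    # W1 : (L+1, P, J) label/pixel-to-h1 weights (dense here; the SBM's
    #      receptive-field and weight-sharing constraints are omitted)
    # W2 : (J, K) h1-to-h2 weights; b : (L+1, P), c1 : (J,), c2 : (K,) biases
    e = -np.einsum('li,il->', b, S)              # - sum_{i,l} b_li s_li
    e -= np.einsum('lij,il,j->', W1, S, h1)      # - sum_{i,j,l} w1_lij s_li h1_j
    e -= c1 @ h1                                 # - sum_j c1_j h1_j
    e -= h1 @ W2 @ h2                            # - sum_{j,k} w2_jk h1_j h2_k
    e -= c2 @ h2                                 # - sum_k c2_k h2_k
    return e

# -msbm_energy(...) is the unnormalised log-probability of (S, h1, h2) under
# the shape model; marginalising the hidden layers and adding the appearance
# terms of Eq. 4 gives the full joint over X, A, S and H.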
In order to sample p(A|S, H, X, θ) we consider the conditional distribution of appearance class k being chosen for part l, which is given by:

p(a_{lk} = 1 | S, X, θ) = \frac{\pi_{lk} \prod_i \left( \sum_w \lambda_{lkw} \, \mathcal{N}(x_i | \mu_w, \Sigma_w) \right)^{s_{li}}}{\sum_{r=1}^{K} \pi_{lr} \prod_i \left( \sum_w \lambda_{lrw} \, \mathcal{N}(x_i | \mu_w, \Sigma_w) \right)^{s_{li}}}.  (5)

Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are:

p(h^1_j = 1 | s, h^2, θ) = \sigma\left( \sum_{i,l} w^1_{lij} s_{li} + \sum_k w^2_{jk} h^2_k + c^1_j \right),  (6)

p(h^2_k = 1 | h^1, θ) = \sigma\left( \sum_j w^2_{jk} h^1_j + c^2_k \right),  (7)

where σ(y) = 1/(1 + exp(−y)) is the sigmoid function. To sample from p(H|S, X, θ) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h^1 and h^2. Finally, we draw samples for the pixels in p(S|A, H, X, θ) independently:

p(s_{li} = 1 | A, H, X, θ) = \frac{\exp\left( \sum_j w^1_{lij} h^1_j + b_{li} \right) p(x_i | A, s_{li} = 1, θ)^{\beta}}{\sum_{m=1}^{L} \exp\left( \sum_j w^1_{mij} h^1_j + b_{mi} \right) p(x_i | A, s_{mi} = 1, θ)^{\beta}}.  (8)

Seeding: Since the latent space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S^(1) to a different value. The 'best' inference is retained and the others are discarded. The computation of the likelihood p(X|θ) of image X is intractable, so we approximate the quality of each inference using a scoring function:

Score(X|θ) = \frac{1}{T} \sum_t p(X, A^{(t)}, S^{(t)}, H^{(t)} | θ),  (9)

where {A^{(t)}, S^{(t)}, H^{(t)}}, t = 1...T, are the samples obtained from the posterior p(A, S, H|X, θ). If the samples were drawn from the prior p(A, S, H|θ) the scoring function would be an unbiased estimator of p(X|θ), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue).

Algorithm 1: MCMC inference algorithm.
1: procedure INFER(X, θ)
2:   Initialize S^(1), H^(1)
3:   for t = 2 : chain_length do
4:     A^(t) ∼ p(A | S^(t−1), H^(t−1), X, θ)
5:     S^(t) ∼ p(S | A^(t), H^(t−1), X, θ)
6:     H^(t) ∼ p(H | S^(t), θ)
7:   return {S^(t)}, t = burn-in : chain_length

Learning: Learning of the model involves maximizing the log likelihood log p(D|θ^a, θ^s) of the training dataset D with respect to the model parameters θ^a and θ^s. Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as 'persistent CD'; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM, which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to fine-tune the parameters in an EM-like scheme as described in [13].
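To make Algorithm 1 and the conditionals in Eqs. 5-8 concrete, the sketch below shows one possible block-Gibbs sweep over (A, H, S). It is a simplified illustration under several assumptions, not the authors' code: the weights are dense (no weight sharing), the per-class appearance likelihoods are assumed to be precomputed into an array lik[l, k, i] = sum_w lambda_lkw N(x_i | mu_w, Sigma_w), the parameter dictionary keys are invented for readability, and beta is folded in as in Eq. 8.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def gibbs_sweep(S, h2, params, rng, beta=0.8, h_iters=5):
    # S : (P, L+1) one-hot segmentation; h2 : (K,) current second-layer state.
    # params: W1 (L+1,P,J), W2 (J,K), b (L+1,P), c1 (J,), c2 (K,),
    #         pi (L+1, Kc) appearance-class priors,
    #         lik (L+1, Kc, P) precomputed appearance likelihoods per pixel.
    W1, W2, b = params['W1'], params['W2'], params['b']
    c1, c2 = params['c1'], params['c2']
    pi, lik = params['pi'], params['lik']
    P, L1 = S.shape

    # Eq. 5: sample an appearance class a_l for each part l given S.
    A = np.zeros(pi.shape, dtype=int)
    for l in range(L1):
        log_post = np.log(pi[l]) + (np.log(lik[l]) * S[:, l]).sum(axis=1)
        p = np.exp(log_post - log_post.max()); p /= p.sum()
        A[l, rng.choice(len(p), p=p)] = 1

    # Eqs. 6-7: alternate the two hidden layers given S, keep the final values.
    for _ in range(h_iters):
        h1 = (rng.random(c1.shape) <
              sigmoid(np.einsum('lij,il->j', W1, S) + W2 @ h2 + c1)).astype(float)
        h2 = (rng.random(c2.shape) < sigmoid(h1 @ W2 + c2)).astype(float)

    # Eq. 8: sample every pixel label independently given A and H.
    logits = np.einsum('lij,j->il', W1, h1) + b.T            # shape term, (P, L+1)
    app = np.stack([lik[l, A[l].argmax()] for l in range(L1)], axis=1)
    logits += beta * np.log(app)                             # appearance term
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    S_new = np.zeros_like(S)
    for i in range(P):
        S_new[i, rng.choice(L1, p=probs[i])] = 1
    return A, h1, h2, S_new

Running this sweep for the desired chain length and keeping the S samples after burn-in mirrors the structure of Algorithm 1; seeding corresponds to calling it from several different initial segmentations and retaining the chain with the highest score under Eq. 9.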
3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models. 4 Experiments Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 ⇥ 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs. 5 (c) Completion (a) Sampling (b) Diffs ! ! ! Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink). 
To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object’s details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, e.g. the length of the person’s trousers on one leg affects the model’s predictions for the other, demonstrating the model’s knowledge about long-range dependencies. An interactive M ATLAB GUI for sampling from this MSBM has been included in the supplementary material. The Penn-Fudan dataset (at 200 ⇥ 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components1 . We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time. During inference, the shape and appearance models (which are defined on images of different sizes), were combined at 200 ⇥ 100 pixels via M ATLAB’s imresize function, and we set = 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized M ATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy obtained, had the top scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrianspecific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c). 
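The superpixel post-processing used above ("pixels within a superpixel are all assigned the most common label within it") is a plain majority vote; a minimal sketch follows, assuming a precomputed integer superpixel map rather than the gPb-OWT-UCM output used in the paper.

import numpy as np

def superpixel_majority_vote(labels, superpixels):
    # labels      : (H, W) integer part labels produced by the model
    # superpixels : (H, W) integer superpixel ids from any over-segmentation
    smoothed = labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        values, counts = np.unique(labels[mask], return_counts=True)
        smoothed[mask] = values[counts.argmax()]   # most common label wins
    return smoothed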
1 We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact. 6 Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]). FG BG Upper Body Lower Body Head Average Bo and Fowlkes [21] 73.3% 81.1% 73.6% 71.6% 51.8% 69.5% MSBM MSBM + SP 70.7% 71.6% 72.8% 73.8% 68.6% 69.9% 66.7% 68.5% 53.0% 54.1% 65.3% 66.6% Top seed Top seed + SP 59.0% 61.6% 61.8% 67.3% 56.8% 60.8% 49.8% 54.1% 45.5% 43.5% 53.5% 56.4% Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label. BG Body Wheel Window Bumper License Light Average ISM [22] 93.2% 72.2% 63.6% 80.5% 73.8% 56.2% 34.8% 86.8% MSBM 94.6% 72.7% 36.8% 74.4% 64.9% 17.9% 19.9% 86.0% Top seed 92.2% 68.4% 28.3% 63.8% 45.4% 11.2% 15.1% 81.8% ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semiprofile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 ⇥ 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components and we set = 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images). Qualitative and quantitative results that show the performance of model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the ‘license’ and ‘light’ labels to mainly be due to ISM’s use of interest-points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease. 5 Conclusions and future work In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours, with a discriminative, bottom-up segmentation algorithm could be highly effective. 
We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model.

Acknowledgments. Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886).

Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. Legend: background, hair, face, upper clothes, lower clothes, shoes, legs, arms.

Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. Legend: background, body, wheel, window, bumper, license plate, headlight.

References
[1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012.
[2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303-338, 2010.
[3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67-92, 1973.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982.
[5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115-147, 1987.
[6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302-315, 2006.
[7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37-44, 2006.
[8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212-219, 2004.
[9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756-763, 2005.
[10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044-2051, 2009.
[11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Parts-based Object Understanding. In BMVC, pages 18.1-18.12, 2011.
[12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448-455, 2009.
[14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064-1071, 2008.
[15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309-314, 2004.
[16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004.
[17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. In IEEE CVPR, pages 1-7, 2007.
[18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380-393, 2010.
[19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. In ICANN, 2011.
[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593-650, 2011.
[21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011.
[22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot's Attention. IJRR, 28(8):976-998, 2009.
[23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157-173, 2008.
[24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4-27, 2010.
[25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009.

4 0.7605027 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

5 0.75502527 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has advantages over existing relative-error CUR algorithms: it possesses a tighter theoretical bound and lower time complexity, and it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.
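For intuition only: a CUR decomposition approximates A by a product C U R, where C and R are actual columns and rows of A. The sketch below uses plain uniform sampling with U = pinv(C) A pinv(R) to illustrate that structure; it is not the randomized algorithm, sampling distribution, or error bound proposed in this paper (relative-error guarantees generally require more careful, e.g. leverage-score-based, sampling).

import numpy as np

def naive_cur(A, c, r, rng):
    # Pick c columns and r rows uniformly at random (illustration only).
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # small linking matrix
    return C, U, R

# Example: a rank-20 matrix is recovered almost exactly from 30 columns/rows.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 80))
C, U, R = naive_cur(A, 30, 30, rng)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))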

6 0.67774582 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

7 0.67689997 188 nips-2012-Learning from Distributions via Support Measure Machines

8 0.67677838 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

9 0.67676336 197 nips-2012-Learning with Recursive Perceptual Representations

10 0.67661434 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

11 0.67655092 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.67586458 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

13 0.67577708 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

14 0.67485511 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

15 0.67455447 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

16 0.67430562 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

17 0.67388922 280 nips-2012-Proper losses for learning from partial labels

18 0.67370248 279 nips-2012-Projection Retrieval for Classification

19 0.67334586 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

20 0.67321485 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points