nips nips2005 nips2005-201 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The Variational Bayesian framework has been widely used to approximate the Bayesian learning. [sent-8, score-0.024]
2 In various applications, it has provided computational tractability and good generalization performance. [sent-9, score-0.038]
3 In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. [sent-10, score-0.619]
4 The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. [sent-11, score-0.344]
5 It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. [sent-12, score-0.21]
6 1 Introduction The Variational Bayesian (VB) framework has been widely used as an approximation of the Bayesian learning for models involving hidden (latent) variables such as mixture models[2][4]. [sent-13, score-0.309]
7 This framework provides computationally tractable posterior distributions with only modest computational costs in contrast to Markov chain Monte Carlo (MCMC) methods. [sent-14, score-0.138]
8 In many applications, it has shown better generalization performance than the maximum likelihood estimation. [sent-15, score-0.03]
9 In spite of its tractability and its wide range of applications, little has been done to investigate the theoretical properties of the Variational Bayesian learning itself. [sent-16, score-0.096]
10 For example, questions such as how accurately it approximates the true Bayesian learning remained unanswered until quite recently. [sent-17, score-0.062]
11 To address these issues, the stochastic complexity in the Variational Bayesian learning of Gaussian mixture models was clarified and the accuracy of the Variational Bayesian learning was discussed [10]. [sent-18, score-0.614]
12 In this paper, we focus on the Variational Bayesian learning of more general mixture models, namely the mixtures of exponential families, which include mixtures of distributions such as the Gaussian, binomial and gamma distributions. [sent-20, score-0.481]
13 Mixture models are known to be non-regular statistical models due to the non-identifiability of parameters caused by their hidden variables[7]. [sent-21, score-0.133]
14 In some recent studies, the Bayesian stochastic complexities of non-regular models have been clarified and it has been proven that they become smaller than those of regular models[12][13]. [sent-22, score-0.286]
15 This indicates an advantage of the Bayesian learning when it is applied to non-regular models. [sent-23, score-0.054]
16 As our main results, the asymptotic upper and lower bounds are obtained for the stochastic complexity or the free energy in the Variational Bayesian learning of the mixture of exponential families. [sent-24, score-0.812]
17 The stochastic complexity is an important quantity for model selection, and giving its asymptotic form also contributes to the following two issues. [sent-25, score-0.436]
18 One is the accuracy of the Variational Bayesian learning as an approximation method since the stochastic complexity shows the distance from the variational posterior distribution to the true Bayesian posterior distribution in the sense of Kullback information. [sent-26, score-1.201]
19 Indeed, we give the asymptotic form of the stochastic complexity as F(n) ≈ λ log n, where n is the sample size; by comparing the coefficient λ with that of the true Bayesian learning, we discuss the accuracy of the VB approach. [sent-27, score-0.652]
20 Another is the influence of the hyperparameter on the learning process. [sent-28, score-0.09]
21 Since the Variational Bayesian algorithm is a procedure of minimizing the functional that finally gives the stochastic complexity, the derived bounds indicate how the hyperparameters influence the process of the learning. [sent-29, score-0.326]
22 Our results have an implication for how to determine the hyperparameter values before the learning process. [sent-30, score-0.09]
23 We consider the case in which the true distribution is contained in the learner model. [sent-31, score-0.124]
24 Analyzing the stochastic complexity in this case is most valuable for comparing the Variational Bayesian learning with the true Bayesian learning. [sent-32, score-0.437]
25 This is because the advantage of the Bayesian learning is most apparent in this case [12]. [sent-33, score-0.054]
26 Furthermore, this analysis is necessary and essential for addressing the model selection problem and hypothesis testing. [sent-34, score-0.049]
27 In Section 2, we introduce the mixture of exponential family model. [sent-36, score-0.284]
28 In Section 4, the Variational Bayesian framework is described and the variational posterior distribution for the mixture of exponential family model is derived. [sent-38, score-0.895]
29 2 Mixture of Exponential Family Denote by c(x|b) a density function of the input x ∈ R^N given an M-dimensional parameter vector b = (b^{(1)}, b^{(2)}, ..., b^{(M)})^T ∈ B, where B is a subset of R^M. [sent-41, score-0.035]
30 The general mixture model p(x|θ) with a parameter vector θ is defined by p(x|θ) = ∑_{k=1}^{K} a_k c(x|b_k), where the integer K is the number of components and {a_k | a_k ≥ 0, ∑_{k=1}^{K} a_k = 1} is the set of mixing proportions. [sent-42, score-0.746]
31 Suppose that the functions f_1, ..., f_M and a constant function are linearly independent, which means the effective number of parameters in a single component distribution c(x|b) is M. [sent-45, score-0.093]
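To make the definitions in this section concrete, here is a minimal sketch (not from the paper): it evaluates a mixture whose component has the exponential-family form c(x|b) = exp{b · f(x) + f_0(x) − g(b)} that appears later in the theorem statement; the particular choice of a unit-variance Gaussian component (f(x) = x) is an assumption made only for the example.

import numpy as np

# Exponential-family component c(x|b) = exp{ b*f(x) + f0(x) - g(b) }.
# The unit-variance Gaussian instance below is an assumed example, not the paper's choice:
#   f(x) = x,  f0(x) = -x^2/2 - log(sqrt(2*pi)),  g(b) = b^2/2.
def f(x):  return x
def f0(x): return -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
def g(b):  return 0.5 * b ** 2

def component_density(x, b):
    return np.exp(b * f(x) + f0(x) - g(b))

def mixture_density(x, a, b):
    # p(x|theta) = sum_{k=1}^K a_k c(x|b_k), with a_k >= 0 and sum_k a_k = 1
    return sum(a_k * component_density(x, b_k) for a_k, b_k in zip(a, b))

a = [0.3, 0.7]     # mixing proportions (K = 2)
b = [-2.0, 1.0]    # component parameters
print(mixture_density(0.5, a, b))

Any other member of the family mentioned in the text (binomial, gamma, von Mises, ...) can be plugged in by replacing f, f0 and g accordingly.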
32 The mixture model can be rewritten as follows by using a hidden variable y = (y_1, ..., y_K) ∈ {(1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1)}: p(x, y|θ) = ∏_{k=1}^{K} { a_k c(x|b_k) }^{y_k}. [sent-47, score-0.518]
33 If and only if the datum x is generated from the kth component, y_k = 1. [sent-48, score-0.132]
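The hidden-variable form above corresponds to a two-stage sampling procedure. The sketch below (continuing the assumed Gaussian example; an illustration, not the paper's algorithm) draws a one-hot y from the mixing proportions, draws x from the selected component, evaluates the complete-data likelihood p(x, y|θ), and computes the posterior probability of each component given x, which is the kind of quantity that reappears later as ȳ_i^k.

import numpy as np

rng = np.random.default_rng(0)

def component_density(x, b):
    # unit-variance Gaussian standing in for c(x|b); an assumption for the example
    return np.exp(-0.5 * (x - b) ** 2) / np.sqrt(2.0 * np.pi)

a = np.array([0.3, 0.7])     # mixing proportions a_k
b = np.array([-2.0, 1.0])    # component parameters b_k
K = len(a)

# Two-stage sampling: choose component k with probability a_k, then draw x from it.
k = rng.choice(K, p=a)
y = np.eye(K, dtype=int)[k]  # y_k = 1 iff x is generated from the kth component
x = rng.normal(loc=b[k], scale=1.0)

# Complete-data likelihood p(x, y|theta) = prod_k { a_k c(x|b_k) }^{y_k}
p_xy = np.prod((a * component_density(x, b)) ** y)

# Posterior probability of each component given x (summing over y recovers p(x|theta))
resp = a * component_density(x, b)
p_x = resp.sum()
resp = resp / p_x
print(p_xy, p_x, resp)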
34 3 The Bayesian Learning Suppose n training samples X^n = {x_1, ..., x_n} are independently and identically taken from the true distribution p_0(x). [sent-49, score-0.096]
35 In the Bayesian learning of a model p(x|θ) whose parameter is θ, first, the prior distribution ϕ(θ) on the parameter θ is set. [sent-50, score-0.196]
36 The Bayesian predictive distribution p(x|X^n) is given by averaging the model over the posterior distribution as follows: p(x|X^n) = ∫ p(x|θ) p(θ|X^n) dθ. (7) [sent-52, score-0.204]
37 The stochastic complexity F(X^n) is defined by F(X^n) = − log Z(X^n), (8) where Z(X^n) = ∫ ∏_{i=1}^{n} p(x_i|θ) ϕ(θ) dθ is the marginal likelihood; F(X^n) is also called the free energy and is important in most data modelling problems. [sent-53, score-0.505]
38 Practically, it is used as a criterion by which the model is selected and the hyperparameters in the prior are optimized[1][9]. [sent-54, score-0.115]
39 Define the average stochastic complexity F(n) by F(n) = E_{X^n}[F(X^n)], (9) where E_{X^n}[·] denotes the expectation value over all sets of training samples. [sent-55, score-0.348]
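For intuition, definitions (8) and (9) can be evaluated numerically on a toy problem. The sketch below is an illustration only: the two-component 1-D Gaussian mixture, the Dirichlet/Gaussian priors, and the plain Monte Carlo estimate of the marginal likelihood Z(X^n) are all assumptions made for the example, and such a naive estimator becomes very noisy as n grows.

import numpy as np

rng = np.random.default_rng(0)

def mixture_density(x, a, b):
    # p(x|theta) = sum_k a_k c(x|b_k) with unit-variance Gaussian components
    comps = np.exp(-0.5 * (x[:, None] - b[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    return comps @ a

# Toy data set X^n from an assumed true two-component mixture
n = 50
from_first = rng.random(n) < 0.3
X = np.where(from_first, rng.normal(-2.0, 1.0, n), rng.normal(1.0, 1.0, n))

# Assumed prior phi(theta): Dirichlet(phi0, ..., phi0) on a, N(0, 1/xi0) on each b_k
K, phi0, xi0 = 2, 1.0, 0.1
S = 20000                                   # number of prior samples
a_s = rng.dirichlet([phi0] * K, size=S)
b_s = rng.normal(0.0, 1.0 / np.sqrt(xi0), size=(S, K))

# Z(X^n) = E_prior[ prod_i p(x_i|theta) ], estimated by plain Monte Carlo
log_lik = np.array([np.log(mixture_density(X, a_s[s], b_s[s])).sum() for s in range(S)])
log_Z = np.logaddexp.reduce(log_lik) - np.log(S)

print("F(X^n) = -log Z(X^n) ~", -log_Z)

Averaging −log Z(X^n) over many independently generated data sets would approximate F(n) in (9); the asymptotic results quoted next describe how this average grows with n.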
40 Recently, it was proved that F(n) has the following asymptotic form [12]: F(n) = λ log n − (m − 1) log log n + O(1), (10) where λ is a rational number and m is a natural number, both determined by the singularities of the set of true parameters. [sent-56, score-0.524]
41 In regular statistical models, 2λ is equal to the number of parameters and m = 1, whereas in non-regular models such as mixture models, 2λ is not larger than the number of parameters and m ≥ 1. [sent-57, score-0.266]
42 However, in the Bayesian learning, one computes the stochastic complexity or the predictive distribution by integrating over the posterior distribution, which typically cannot be performed analytically. [sent-59, score-0.469]
43 The VB approach restricts the trial distribution to the factorized form q(Y^n, θ) = Q(Y^n) r(θ|X^n); here C_r and C_Q denote the normalization constants of the variational posteriors r(θ|X^n) and Q(Y^n). [sent-67, score-0.114]
44 We define the stochastic complexity in the VB learning, F̄(X^n), by the minimum value of the functional F[q], that is, F̄(X^n) = min_{r,Q} F[q], which shows the accuracy of the VB approach as an approximation of the Bayesian learning. [sent-68, score-0.5]
45 F̄(X^n) is also used for model selection since it gives an upper bound of the true Bayesian stochastic complexity F(X^n). [sent-69, score-0.514]
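The two statements above can be seen from a standard decomposition (a sketch stated here for orientation; the paper's equations (11)–(15) are not reproduced in this extract, so the factorized trial distribution q(Y^n, θ) = Q(Y^n) r(θ|X^n) is taken as given). For any such q,

    F[q] = F(X^n) + K( q(Y^n, θ) || p(Y^n, θ | X^n) ) ≥ F(X^n),

so minimizing F[q] over the factorized family gives F̄(X^n) ≥ F(X^n), and the gap F̄(X^n) − F(X^n) is exactly the Kullback information from the optimal variational posterior to the true Bayesian posterior, which is the sense in which F̄(X^n) measures the accuracy of the approximation.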
46 2 Variational Posterior for MEF Model In this subsection, we derive the variational posterior r(θ|X n ) for the MEF model based on (14) and then define the variational parameter for this model. [sent-71, score-1.005]
47 Using the complete data {X^n, Y^n} = {(x_1, y_1), ..., (x_n, y_n)}, we put ȳ_i^k = ⟨y_i^k⟩_{Q(Y^n)}, n̄_k = ∑_{i=1}^{n} ȳ_i^k, and ν̄_k = (1/n̄_k) ∑_{i=1}^{n} ȳ_i^k f(x_i), where y_i^k = 1 if and only if the ith datum x_i is from the kth component. [sent-72, score-0.462]
48 The variable n̄_k is the expected number of data points that are estimated to be from the kth component. [sent-73, score-0.184]
49 Let ā_k = ⟨a_k⟩_{r(a)} = (n̄_k + φ_0)/(n + Kφ_0) (18) and b̄_k = ⟨b_k⟩_{r(b_k)} = (1/γ_k) ∂ log C(γ_k, µ_k)/∂µ_k (19), and define the variational parameter θ̄ by θ̄ = ⟨θ⟩_{r(θ)} = {ā_k, b̄_k}_{k=1}^{K}. [sent-75, score-1.551]
50 It is then noted that the variational posterior r(θ) and C_Q in (15) are parameterized by the variational parameter θ̄. [sent-76, score-0.983]
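A minimal sketch of how n̄_k and the update (18) for ā_k could be computed in one iteration follows; it is an illustration, not the paper's algorithm: the plug-in responsibilities (using θ̄ directly instead of the exact VB expectations under r(θ)) and the Gaussian component are simplifying assumptions, and the component update (19) is omitted because it depends on the conjugate prior C(γ_k, µ_k).

import numpy as np

def responsibilities(X, a_bar, b_bar):
    # Plug-in responsibilities ybar_i^k proportional to a_bar_k * c(x_i|b_bar_k).
    # (Simplified E-like step; the exact VB step uses expectations of log a_k and
    #  log c(x|b_k) under r(theta).)
    comps = np.exp(-0.5 * (X[:, None] - b_bar[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    w = a_bar * comps
    return w / w.sum(axis=1, keepdims=True)

def update_mixing(X, ybar, phi0):
    n, K = ybar.shape
    n_bar = ybar.sum(axis=0)                           # expected counts n̄_k
    a_bar = (n_bar + phi0) / (n + K * phi0)            # equation (18)
    nu_bar = (ybar * X[:, None]).sum(axis=0) / n_bar   # ν̄_k with f(x) = x
    return n_bar, a_bar, nu_bar

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 15), rng.normal(1.0, 1.0, 35)])
a_bar = np.array([0.5, 0.5])
b_bar = np.array([-1.0, 0.5])
phi0 = 0.5

ybar = responsibilities(X, a_bar, b_bar)
n_bar, a_bar_new, nu_bar = update_mixing(X, ybar, phi0)
print(n_bar, a_bar_new)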
51 We define the variational estimator θ̄_vb by the variational parameter θ̄ that attains the minimum value of the stochastic complexity F̄(X^n). [sent-78, score-1.212]
52 Then, putting (15) into (12), we obtain F̄(X^n) = min_{θ̄} { K(r(θ|θ̄) || ϕ(θ)) − (log C_Q(θ̄) + S(X^n)) } = K(r(θ|θ̄_vb) || ϕ(θ)) − (log C_Q(θ̄_vb) + S(X^n)), (20) where S(X^n) = −∑_{i=1}^{n} log p_0(x_i). (21) [sent-79, score-1.604]
53 Therefore, our aim is to evaluate the minimum value of (20) as a function of the variational parameter θ̄. [sent-80, score-0.474]
54 Hereafter, we omit the condition X n of the variational posteriors, and abbreviate them to q(Y n , θ), Q(Y n ) and r(θ). [sent-82, score-0.417]
55 5 Main Result The average stochastic complexity F̄(n) in the VB learning is defined by F̄(n) = E_{X^n}[F̄(X^n)]. [sent-83, score-0.351]
56 (i) The true distribution p_0(x) is an MEF model p(x|θ_0) which has K_0 components and the parameter θ_0 = {a*_k, b*_k}_{k=1}^{K_0}: p(x|θ_0) = ∑_{k=1}^{K_0} a*_k exp{b*_k · f(x) + f_0(x) − g(b*_k)}, where b*_k ∈ R^M and b*_k ≠ b*_j for k ≠ j. [sent-85, score-0.182]
57 Suppose also that the model p(x|θ) has K components, p(x|θ) = ∑_{k=1}^{K} a_k exp{b_k · f(x) + f_0(x) − g(b_k)}, and that K ≥ K_0 holds. [sent-86, score-0.296]
58 (ii) The prior distribution of the parameters is ϕ(θ) = ϕ(a)ϕ(b) given by (2) and (3) with ϕ(b) bounded. [sent-87, score-0.098]
59 (iii) Regarding the distribution c(x|b) of each component, the Fisher information matrix I(b) = ∂²g(b)/∂b∂b satisfies 0 < |I(b)| < +∞ for arbitrary b ∈ B (see footnote 4). [sent-88, score-0.034]
60 (24) This theorem shows the asymptotic form of the average stochastic complexity in the Variational Bayesian learning. [sent-93, score-0.436]
61 The coefficients λ̄ and λ̲ of the leading terms are determined by K and K_0, the numbers of components of the learner and of the true distribution, by the number of parameters M of each component, and by the hyperparameter φ_0 of the conjugate prior given by (2). [sent-94, score-0.298]
62 In this theorem, nH_n(θ̄_vb) = −∑_{i=1}^{n} log p(x_i|θ̄_vb) − S(X^n), and −∑_{i=1}^{n} log p(x_i|θ̄_vb) is a training error which is computable during the learning. [sent-95, score-0.702]
63 If the term E_{X^n}[nH_n(θ̄_vb)] is a bounded function of n, then it immediately follows from this theorem that λ̲ log n + O(1) ≤ F̄_0(n) ≤ λ̄ log n + O(1), where O(1) is a bounded function of n. [sent-96, score-0.83]
64 (Footnote 4: ∂²g(b)/∂b∂b denotes the matrix whose ijth entry is ∂²g(b)/(∂b^{(i)}∂b^{(j)}), and |·| denotes the determinant of a matrix.) [sent-97, score-0.057]
65 In certain cases, such as binomial mixtures and mixtures of von-Mises distributions, it is actually a bounded function of n. [sent-98, score-0.192]
66 In the case of Gaussian mixtures, if B = R^N, it is conjectured that the minus likelihood ratio min_θ nH_n(θ), a lower bound of nH_n(θ̄_vb), is at most of the order of log log n [5]. [sent-99, score-0.818]
67 Since the dimension of the parameter θ is MK + K − 1, the average stochastic complexity of regular statistical models, which coincides with the Bayesian information criterion (BIC) [9], is given by λ_BIC log n where λ_BIC = (MK + K − 1)/2. [sent-100, score-0.5]
68 Theorem 2 claims that the coefficient λ̄ of log n is smaller than λ_BIC when φ_0 ≤ (M + 1)/2. [sent-101, score-0.134]
69 This implies that the advantage of non-regular models in the Bayesian learning still remains in the VB learning. [sent-102, score-0.09]
70 (26) k=1 Then log CQ (θ) in (20) is evaluated as follows. [sent-104, score-0.114]
71 nH_n(θ̄) + O_p(1) ≤ −(log C_Q(θ̄) + S(X^n)) ≤ nH̄_n(θ̄) + O_p(1), (27) where H̄_n(θ̄) = (1/n) ∑_{i=1}^{n} log [ p(x_i|θ_0) / ( ∑_{k=1}^{K} ā_k c(x_i|b̄_k) exp{ −C/(n̄_k + min{φ_0, ξ_0}) } ) ], and C is a constant. [sent-105, score-0.554]
72 Thus, from (20), evaluating the right-hand sides of (25) and (27) at specific points near the true parameter θ0 , we obtain the upper bound in (23). [sent-106, score-0.179]
73 The lower bound in (23) is obtained from (25) and (27) by Jensen’s inequality and the constraint ∑_{k=1}^{K} ā_k = 1. [sent-107, score-0.318]
74 6 Discussion and Conclusion In this paper, we showed upper and lower bounds of the stochastic complexity for the mixture of exponential family models in the VB learning. [sent-110, score-0.755]
75 Firstly, we compare the stochastic complexity shown in Theorem 2 with the one in the true Bayesian learning. [sent-111, score-0.383]
76 For mixture models with M parameters in each component, the following upper bound for the coefficient λ of F(n) in (10) is known [13]: λ ≤ (K + K_0 − 1)/2 for M = 1, and λ ≤ (K − K_0) + (MK_0 + K_0 − 1)/2 for M ≥ 2. (28) [sent-112, score-0.294]
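The comparison made in this paragraph can be spelled out with a small calculation. Only the two formulas quoted in the surrounding text are used: λ_BIC = (MK + K − 1)/2 for regular models and the upper bound (28) on the true Bayesian coefficient; the VB coefficient λ̄ of Theorem 2 is not computed here because its exact expression is not reproduced in this extract.

def lambda_bic(K, M):
    # Regular-model coefficient: half the parameter count MK + K - 1
    return (M * K + K - 1) / 2.0

def lambda_bayes_upper(K, K0, M):
    # Upper bound (28) on the coefficient lambda of the true Bayesian F(n)
    if M == 1:
        return (K + K0 - 1) / 2.0
    return (K - K0) + (M * K0 + K0 - 1) / 2.0

for K, K0, M in [(3, 2, 1), (5, 2, 1), (5, 2, 3)]:
    print("K=%d K0=%d M=%d  lambda_BIC=%.1f  lambda_Bayes<=%.1f"
          % (K, K0, M, lambda_bic(K, M), lambda_bayes_upper(K, K0, M)))

The printout illustrates the point made in the surrounding text: the more redundant components the learner has (K > K_0), the further the true Bayesian coefficient can fall below λ_BIC.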
77 Under the conditions on the prior distribution under which the above bound was derived, we can compare the stochastic complexities when φ_0 = 1. [sent-113, score-0.439]
78 (29) (Footnote 5: O_p(1) denotes a random variable bounded in probability.) [sent-115, score-0.057]
79 Since we obtain F̄(n) = λ̄ log n + O(1) under certain assumptions [11], let us compare λ̄ of the VB learning to λ in (28) of the true Bayesian learning. [sent-116, score-0.206]
80 When M = 1, that is, when each component has one parameter, λ̄ ≥ λ holds since K_0 ≤ K. [sent-117, score-0.035]
81 This means that the more redundant components the model has, the more the VB learning differs from the true Bayesian learning. [sent-118, score-0.172]
82 In this case, 2λ̄ is equal to the number of the parameters of the model. [sent-119, score-0.024]
83 Hence the BIC [9] corresponds to λ̄ log n when M = 1. [sent-120, score-0.114]
84 This implies that the variational posterior is close to the true Bayesian posterior when M ≥ 2. [sent-122, score-0.707]
85 A more precise discussion of the accuracy of the approximation is possible for models for which tighter bounds or exact values of the coefficient λ in (10) are given [10]. [sent-123, score-0.167]
86 Secondly, we point out that Theorem 2 shows how the hyperparameter φ_0 influences the process of the VB learning. [sent-124, score-0.06]
87 The coefficient λ̄ in (24) indicates that only when φ_0 ≤ (M + 1)/2 does the prior distribution (2) work to eliminate the redundant components of the model; otherwise it works to use all the components. [sent-125, score-0.154]
88 And lastly, let us give examples of how to use the theoretical bounds in (23). [sent-126, score-0.084]
89 One can examine experimentally whether the actual iterative algorithm converges to the optimal variational posterior instead of local minima by comparing the stochastic complexity with our theoretical result. [sent-127, score-0.904]
90 The theoretical bounds would also enable us to compare the accuracy of the VB learning with that of the Laplace approximation or the MCMC method. [sent-128, score-0.189]
91 As mentioned in Section 4, our result will be important for developing effective model selection methods using F̄(X^n) in future work. [sent-129, score-0.049]
92 Attias, "Inferring parameters and structure of latent variable models by variational Bayes," Proc. [sent-137, score-0.477]
93 Brown, "Fundamentals of statistical exponential families," IMS Lecture Notes-Monograph Series, 1986. [sent-141, score-0.079]
94 Beal, "Graphical models and variational methods," Advanced Mean Field Methods, MIT Press, 2000. [sent-145, score-0.453]
95 Hartigan, "A failure of likelihood asymptotics for normal mixtures," Proc. [sent-148, score-0.03]
96 Sato, "Online model selection based on the variational Bayes," Neural Computation, 13(7), pp. [sent-161, score-0.466]
97 Watanabe, "Lower bounds of stochastic complexities in variational Bayes learning of Gaussian mixture models," Proc. [sent-168, score-0.919]
98 Watanabe, "Stochastic complexity for mixture of exponential families in variational Bayes," Proc. [sent-173, score-0.841]
99 Watanabe, "Algebraic analysis for non-identifiable learning machines," Neural Computation, 13(4), pp. [sent-177, score-0.03]
100 Watanabe, "Singularities in mixture models and upper bounds of stochastic complexity," Neural Networks, 16, pp. [sent-181, score-0.468]
wordName wordTfidf (topN-words)
[('vb', 0.474), ('variational', 0.417), ('bk', 0.282), ('ak', 0.254), ('bayesian', 0.245), ('nhn', 0.19), ('cq', 0.189), ('stochastic', 0.186), ('mixture', 0.152), ('nk', 0.139), ('complexity', 0.135), ('mef', 0.118), ('bic', 0.118), ('posterior', 0.114), ('log', 0.114), ('coe', 0.114), ('kullback', 0.086), ('ex', 0.08), ('exponential', 0.079), ('op', 0.066), ('asymptotic', 0.066), ('mixtures', 0.063), ('true', 0.062), ('hyperparameter', 0.06), ('families', 0.058), ('bounds', 0.056), ('singularities', 0.054), ('watanabe', 0.054), ('hyperparameters', 0.053), ('yk', 0.053), ('family', 0.053), ('theorem', 0.049), ('posteriors', 0.048), ('clari', 0.047), ('exp', 0.047), ('accuracy', 0.045), ('kth', 0.045), ('bound', 0.044), ('bayes', 0.044), ('ective', 0.043), ('prior', 0.04), ('upper', 0.038), ('tractability', 0.038), ('hidden', 0.037), ('binomial', 0.036), ('models', 0.036), ('component', 0.035), ('parameter', 0.035), ('datum', 0.034), ('putting', 0.034), ('complexities', 0.034), ('fm', 0.034), ('cr', 0.034), ('distribution', 0.034), ('tokyo', 0.033), ('iii', 0.032), ('functional', 0.031), ('xi', 0.031), ('hn', 0.031), ('regular', 0.03), ('learning', 0.03), ('likelihood', 0.03), ('bounded', 0.03), ('approximation', 0.03), ('redundant', 0.029), ('components', 0.029), ('learner', 0.028), ('rm', 0.028), ('theoretical', 0.028), ('uence', 0.028), ('denotes', 0.027), ('energy', 0.027), ('selection', 0.027), ('mcmc', 0.027), ('parameters', 0.024), ('framework', 0.024), ('advantage', 0.024), ('culture', 0.024), ('fundamentals', 0.024), ('fellows', 0.024), ('honor', 0.024), ('valencia', 0.024), ('comparing', 0.024), ('free', 0.023), ('minimum', 0.022), ('nh', 0.022), ('conjectured', 0.022), ('ijth', 0.022), ('model', 0.022), ('yn', 0.021), ('min', 0.021), ('lower', 0.02), ('discuss', 0.02), ('conjugate', 0.02), ('called', 0.02), ('suppose', 0.02), ('spain', 0.02), ('sports', 0.02), ('claims', 0.02), ('factorizes', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
2 0.38121998 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
3 0.14156447 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
4 0.12024548 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
5 0.10949569 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.
6 0.10606159 30 nips-2005-Assessing Approximations for Gaussian Process Classification
7 0.096959502 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
8 0.095859841 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.093653657 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
10 0.072539195 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
11 0.070173755 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
12 0.070022367 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
13 0.069998704 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
14 0.065250225 58 nips-2005-Divergences, surrogate loss functions and experimental design
15 0.064885713 105 nips-2005-Large-Scale Multiclass Transduction
16 0.064832076 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
17 0.064426072 48 nips-2005-Context as Filtering
18 0.064390145 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
19 0.063798331 138 nips-2005-Non-Local Manifold Parzen Windows
20 0.062529102 95 nips-2005-Improved risk tail bounds for on-line algorithms
topicId topicWeight
[(0, 0.205), (1, 0.09), (2, -0.026), (3, 0.01), (4, 0.219), (5, -0.271), (6, -0.158), (7, -0.046), (8, -0.159), (9, -0.179), (10, 0.018), (11, -0.074), (12, 0.071), (13, 0.064), (14, -0.101), (15, -0.15), (16, -0.025), (17, -0.053), (18, 0.002), (19, 0.058), (20, 0.132), (21, 0.01), (22, -0.017), (23, -0.061), (24, 0.076), (25, -0.092), (26, -0.171), (27, 0.141), (28, -0.001), (29, 0.015), (30, 0.106), (31, 0.056), (32, -0.059), (33, 0.023), (34, -0.047), (35, -0.012), (36, 0.081), (37, -0.074), (38, -0.037), (39, -0.03), (40, -0.071), (41, 0.069), (42, -0.004), (43, 0.093), (44, -0.026), (45, 0.063), (46, 0.015), (47, -0.154), (48, 0.15), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.96336937 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
2 0.8684426 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
3 0.63168806 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
4 0.57645184 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
5 0.47518715 4 nips-2005-A Bayesian Spatial Scan Statistic
Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper
Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1
6 0.46795395 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
7 0.4034633 30 nips-2005-Assessing Approximations for Gaussian Process Classification
8 0.40240073 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
9 0.38573346 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
10 0.37418801 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
11 0.36568025 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
12 0.36074519 33 nips-2005-Bayesian Sets
13 0.35416332 35 nips-2005-Bayesian model learning in human visual perception
14 0.3197799 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
15 0.31232461 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
16 0.3018651 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
17 0.29981378 85 nips-2005-Generalization to Unseen Cases
18 0.29523826 133 nips-2005-Nested sampling for Potts models
19 0.29322278 105 nips-2005-Large-Scale Multiclass Transduction
20 0.29202703 62 nips-2005-Efficient Estimation of OOMs
topicId topicWeight
[(3, 0.097), (10, 0.032), (18, 0.013), (27, 0.014), (31, 0.052), (34, 0.064), (55, 0.06), (69, 0.03), (73, 0.051), (84, 0.246), (88, 0.096), (91, 0.136)]
simIndex simValue paperId paperTitle
same-paper 1 0.78652215 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
2 0.72059608 171 nips-2005-Searching for Character Models
Author: Jaety Edwards, David Forsyth
Abstract: We introduce a method to automatically improve character models for a handwritten script without the use of transcriptions and using a minimum of document specific training data. We show that we can use searches for the words in a dictionary to identify portions of the document whose transcriptions are unambiguous. Using templates extracted from those regions, we retrain our character prediction model to drastically improve our search retrieval performance for words in the document.
3 0.66455168 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search
Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras
Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on featurecontrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model’s behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.
4 0.62333524 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
Author: Keiji Miura, Masato Okada, Shun-ichi Amari
Abstract: We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. The model parameters consist of a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes with time, observed data are generated from the time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model, which is one of the unsolved problem in statistics and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We analytically obtained an optimal estimating function for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. 1
5 0.621135 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
6 0.61454397 112 nips-2005-Learning Minimum Volume Sets
7 0.60861832 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
8 0.60075313 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.59743255 74 nips-2005-Faster Rates in Regression via Active Learning
10 0.59408998 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
11 0.59267908 30 nips-2005-Assessing Approximations for Gaussian Process Classification
12 0.5908879 41 nips-2005-Coarse sample complexity bounds for active learning
13 0.58959359 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
14 0.58932626 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
15 0.58874619 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
16 0.58652699 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
17 0.5859313 54 nips-2005-Data-Driven Online to Batch Conversions
18 0.58444893 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
19 0.58435923 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
20 0.58086264 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks