jmlr jmlr2006 jmlr2006-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
Reference: text
sentIndex sentText sentNum sentScore
1 The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. [sent-10, score-0.661]
2 The properties and capabilities of variational Bayesian learning itself have not been clarified yet. [sent-11, score-0.627]
3 It is still unknown how good an approximation the variational Bayesian approach can achieve. [sent-12, score-0.627]
4 In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. [sent-13, score-1.641]
5 Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity [sent-15, score-0.939]
6 The variational Bayesian (VB) framework was proposed as another approximation for computations in models with hidden variables (Attias, 1999; Ghahramani and Beal, 2000). [sent-33, score-0.705]
7 The variational Bayesian framework has been applied to various real-world data modeling problems and has empirically proved to be both computationally tractable and able to generalize well. [sent-35, score-0.627]
8 The properties of variational Bayesian learning remain unclear from a theoretical standpoint. [sent-36, score-0.627]
9 Although the variational Bayesian framework is an approximation, questions like how accurately it approximates the true distribution have yet to be answered. [sent-37, score-0.666]
10 In this paper, we focus on variational Bayesian learning of Gaussian mixture models. [sent-38, score-0.736]
11 As the main contribution, we derive asymptotic upper and lower bounds on the variational stochastic complexity. [sent-39, score-0.908]
12 It is shown that the variational stochastic complexity is smaller than in regular statistical models, so the advantage of Bayesian learning still remains in variational Bayesian learning. [sent-40, score-1.487]
13 The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, is an important quantity for model selection. [sent-41, score-1.527]
14 One is the accuracy of variational Bayesian learning as an approximation method since the variational stochastic complexity shows the distance from the variational posterior distribution to the true Bayesian posterior distribution in terms of Kullback information. [sent-43, score-2.41]
15 Since the variational Bayesian algorithm minimizes the variational free energy, the derived bounds indicate how the hyperparameters influence the learning process. [sent-45, score-1.378]
16 Analyzing the variational stochastic complexity in this case is most valuable for comparing variational Bayesian learning with true Bayesian learning. [sent-48, score-1.477]
17 In Section 4, the variational Bayesian framework is outlined and the variational stochastic complexity is defined. [sent-54, score-1.457]
18 A Gaussian mixture model p(x|θ) of an M-dimensional input x ∈ R^M with a parameter vector θ is defined by p(x|θ) = ∑_{k=1}^{K} a_k g(x|µ_k, Σ_k), where the integer K is the number of components and {a_k | a_k ≥ 0, ∑_{k=1}^{K} a_k = 1} is the set of mixing proportions. [sent-61, score-0.547]
19 (1) of Gaussian mixture models in the variational Bayesian framework and show upper and lower bounds of the variational stochastic complexity in Theorem 3. [sent-66, score-1.678]
20 The Gaussian mixture model can be rewritten as follows using a hidden variable y = (y_1, · · · , y_K) ∈ {(1, 0, · · · , 0), (0, 1, · · · , 0), · · · , (0, 0, · · · , 1)}: p(x, y|θ) = ∏_{k=1}^{K} { a_k / (2πσ_k^2)^{M/2} · exp( −‖x − µ_k‖^2 / (2σ_k^2) ) }^{y_k}. [sent-67, score-0.452]
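As a concrete illustration of the two expressions above, the following minimal Python sketch (not from the paper; the toy parameter values and the isotropic covariances σ_k^2 I are placeholder assumptions) evaluates the mixture density p(x|θ) and the complete-data log-likelihood log p(x, y|θ) for a one-hot indicator y.

```python
import numpy as np

def gmm_density(x, a, mu, sigma2):
    """Mixture density p(x|theta) = sum_k a_k N(x | mu_k, sigma2_k * I)."""
    M = x.shape[-1]
    sq_dist = np.sum((x[None, :] - mu) ** 2, axis=-1)                   # shape (K,)
    comp = np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (M / 2.0)
    return float(np.dot(a, comp))

def complete_loglik(x, y, a, mu, sigma2):
    """log p(x, y|theta) when the one-hot indicator y selects a single component."""
    M = x.shape[-1]
    k = int(np.argmax(y))
    return (np.log(a[k]) - 0.5 * M * np.log(2.0 * np.pi * sigma2[k])
            - np.sum((x - mu[k]) ** 2) / (2.0 * sigma2[k]))

# toy check: K = 2 components in M = 1 dimension
a = np.array([0.5, 0.5]); mu = np.array([[0.0], [3.0]]); sigma2 = np.array([1.0, 1.0])
x = np.array([0.2])
print(gmm_density(x, a, mu, sigma2), complete_loglik(x, np.array([1, 0]), a, mu, sigma2))
```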
21 Let S(X^n) = − ∑_{i=1}^{n} log p_0(x_i), and define the normalized Bayesian stochastic complexity F_0(X^n) by F_0(X^n) = − log Z_0(X^n) = F(X^n) − S(X^n). [sent-85, score-0.471]
22 As another approximation, the variational Bayesian framework was proposed (Attias, 1999; Beal, 2003; Ghahramani and Beal, 2000). [sent-111, score-0.627]
23 Variational Bayesian Learning In this section, we outline the variational Bayesian framework and define the variational stochastic complexity. [sent-113, score-1.42]
24 The variational Bayesian framework starts by upper bounding the Bayesian stochastic complexity. [sent-117, score-0.821]
25 For an arbitrary conditional distribution q(Y^n, θ|X^n) on the hidden variables and the parameters, the Bayesian stochastic complexity can be upper bounded by applying Jensen's inequality: F(X^n) ≤ ∑_{Y^n} ∫ q(Y^n, θ|X^n) log [ q(Y^n, θ|X^n) / p(X^n, Y^n, θ) ] dθ ≡ F[q]. [sent-118, score-0.405]
26 The functional F[q] is called the variational free energy. [sent-121, score-0.666]
27 The variational Bayesian approach makes an approximation to ensure a computationally tractable posterior. [sent-122, score-0.627]
28 The distribution q(Y n , θ|X n ) that minimizes the functional F[q] is termed the optimal variational posterior and generally differs from the true Bayesian posterior. [sent-124, score-0.822]
29 Minimization of the functional F[q] with respect to the distributions Q(Y n |X n ) and r(θ|X n ) can be performed by using variational methods. [sent-125, score-0.676]
30 (8) Then the variational posteriors r(θ|X^n) and Q(Y^n|X^n) satisfy r(θ|X^n) = (1/C_r) ϕ(θ) exp⟨log p(X^n, Y^n|θ)⟩_{Q(Y^n|X^n)} and Q(Y^n|X^n) = (1/C_Q) exp⟨log p(X^n, Y^n|θ)⟩_{r(θ|X^n)}, where ⟨·⟩_p denotes the expectation over p and C_r and C_Q are the normalization constants. [sent-129, score-0.923]
31 2 Stochastic Complexity of Variational Bayes. We define the variational stochastic complexity F̄(X^n) as the minimum value of the functional F[q] attained by the above optimal variational posteriors, that is, F̄(X^n) = min_{r,Q} F[q]. [sent-140, score-1.479]
32 The variational stochastic complexity F̄(X^n) gives an estimate (upper bound) for the true Bayesian stochastic complexity F(X^n), which is the minus log evidence. [sent-141, score-1.161]
33 Therefore, F̄(X^n) is used for model selection in variational Bayesian learning (Beal, 2003). [sent-142, score-0.662]
34 Moreover, the difference between F̄(X^n) and the Bayesian stochastic complexity F(X^n) is the Kullback information from the optimal variational posterior to the true posterior. [sent-143, score-0.984]
35 Hence, comparison between F̄(X^n) and F(X^n) shows the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. [sent-145, score-0.647]
36 We define the normalized variational stochastic complexity F̄_0(X^n) by F̄_0(X^n) = F̄(X^n) − S(X^n). [sent-146, score-0.882]
37 The variational posteriors r(θ|X n ) and Q(Y n |X n ) that satisfy eq. [sent-150, score-0.684]
38 (10) are parameterized by the variational parameter θ̄ defined by θ̄ = ⟨θ⟩_{r(θ|X^n)}, if the model p(x, y|θ) is included in the exponential family (Beal, 2003; Ghahramani and Beal, 2000). [sent-152, score-0.662]
39 Therefore, henceforth we denote r(θ|X^n) and C_Q as r(θ|θ̄) and C_Q(θ̄) when they are regarded as functions of the variational parameter θ̄. [sent-155, score-0.646]
40 We define the variational estimator θ_vb of θ̄ by the variational parameter θ̄ that attains the minimum value of the normalized variational stochastic complexity F̄_0(X^n). [sent-156, score-2.155]
41 (13) In variational Bayesian learning, the variational parameter θ̄ is updated iteratively to find the optimal solution θ_vb. [sent-158, score-1.254]
42 Main Results. In this section, we describe two conditions and give the upper and lower bounds of the normalized variational stochastic complexity in Theorem 3. [sent-162, score-0.963]
43 They are the conjugate prior distributions and are often used in variational Bayesian learning of Gaussian mixture models. [sent-168, score-0.793]
44 Then the normalized variational stochastic complexity F̄_0(X^n) defined by eq. [sent-172, score-0.882]
45 Then the average of the normalized variational stochastic complexity F̄_0(X^n) satisfies λ log n + E_{X^n}[nH_n(θ_vb)] + C_1 ≤ E_{X^n}[F̄_0(X^n)] ≤ λ log n + C_2. [sent-176, score-1.098]
46 This means the normalized variational stochastic complexity F̄_0(X^n) becomes smaller than the BIC and implies that the advantage of non-regular models in Bayesian learning still remains in variational Bayesian learning. [sent-184, score-1.54]
47 Theorem 3 also shows how the hyperparameters affect the learning process and implies that φ0 is the only hyperparameter on which the leading term of the normalized variational stochastic complexity F̄_0(X^n) depends. [sent-185, score-1.089]
48 First of all, we derive the variational posteriors r(θ|X^n) and Q(Y^n|X^n) and the variational parameter θ̄ for the Gaussian mixture model given by eq. [sent-193, score-1.532]
49 Note that the variables n_k and ν_k satisfy the constraints ∑_{k=1}^{K} n_k = n and ∑_{k=1}^{K} n_k ν_k = ∑_{i=1}^{n} x_i. [sent-198, score-0.484]
50 (10), the variational posterior Q(Y^n|X^n) is given by Q(Y^n|X^n) = (1/C_Q) ∏_{i=1}^{n} exp[ ∑_{k=1}^{K} y_k^i { Ψ(n_k + φ0) − Ψ(n + Kφ0) − ‖x_i − µ̄_k‖^2/2 − (M/2)(log 2π + 1/(n_k + ξ0)) } ], where Ψ(x) = Γ′(x)/Γ(x) is the digamma (psi) function and we used ⟨log a_k⟩_{r(a|X^n)} = Ψ(n_k + φ0) − Ψ(n + Kφ0). [sent-204, score-1.342]
51 The variational parameter θ̄ is given by θ̄ = ⟨θ⟩_{r(θ|X^n)} = {ā_k, µ̄_k}_{k=1}^{K}. [sent-205, score-0.627]
52 It is noted that r(θ|X^n) and Q(Y^n|X^n) are parameterized by θ̄, since n_k can be replaced by using ā_k = (n_k + φ0)/(n + Kφ0). [sent-206, score-0.525]
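The formulas above suggest the following coordinate-ascent sketch (a minimal illustration, not the authors' implementation). Beyond what is quoted here, it assumes unit-variance components, a symmetric Dirichlet(φ0) prior on the mixing proportions, and an N(ν0, ξ0^{-1} I) prior on each component mean; the E-step follows the exponent of Q(Y^n|X^n) above, with Ψ computed by scipy's digamma.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm_unit_var(X, K, phi0=1.0, xi0=1.0, nu0=None, n_iter=200, seed=0):
    """Coordinate-ascent sketch of VB learning for a unit-variance Gaussian mixture."""
    n, M = X.shape
    nu0 = np.zeros(M) if nu0 is None else np.asarray(nu0)
    rng = np.random.default_rng(seed)
    R = rng.dirichlet(np.ones(K), size=n)                      # responsibilities <y_k^i>
    for _ in range(n_iter):
        nk = R.sum(axis=0)                                     # expected counts n_k
        nuk = (R.T @ X) / np.maximum(nk, 1e-12)[:, None]       # component means nu_k
        mu_bar = (nk[:, None] * nuk + xi0 * nu0) / (nk + xi0)[:, None]  # posterior means of mu_k
        # E-step: exponent of Q(Y^n|X^n); the -(M/2) log 2*pi term cancels on normalization
        log_rho = (digamma(nk + phi0) - digamma(n + K * phi0)
                   - 0.5 * ((X[:, None, :] - mu_bar[None, :, :]) ** 2).sum(axis=-1)
                   - 0.5 * M / (nk + xi0))
        log_rho -= log_rho.max(axis=1, keepdims=True)          # numerical stability
        R = np.exp(log_rho)
        R /= R.sum(axis=1, keepdims=True)
    a_bar = (nk + phi0) / (n + K * phi0)                       # variational parameter a_k
    return a_bar, mu_bar, R

# toy usage: n = 500 points from a K0 = 2 component mixture, fitted with K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (250, 1)), rng.normal(3.0, 1.0, (250, 1))])
print(vb_gmm_unit_var(X, K=3)[0])
```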
53 (21) 0 ≤ log Γ(x) − {(x − 1/2) log x − x + (1/2) log 2π} ≤ 1/(12x). The inequalities (20) ensure that substituting log x for Ψ(x) contributes only additive constant terms to the normalized variational stochastic complexity. [sent-211, score-1.293]
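A quick numerical spot-check of inequality (21), and of the claim that replacing Ψ(x) by log x only shifts the result by bounded terms, can be done with scipy (the 1/x bound on |Ψ(x) − log x| used below is a standard fact assumed here, not quoted from the paper):

```python
import numpy as np
from scipy.special import gammaln, digamma

x = np.linspace(0.5, 50.0, 2000)
stirling = (x - 0.5) * np.log(x) - x + 0.5 * np.log(2.0 * np.pi)
gap = gammaln(x) - stirling
print(gap.min() >= 0.0, np.all(gap <= 1.0 / (12.0 * x)))   # both True: inequality (21)
print(np.all(np.abs(digamma(x) - np.log(x)) <= 1.0 / x))   # Psi(x) = log x + O(1/x)
```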
54 Lemma 5: K(r(θ|θ̄)||ϕ(θ)) − {G(ā) + (ξ0/2) ∑_{k=1}^{K} ‖µ̄_k − ν0‖^2} ≤ C holds, where C is a constant and the function G(ā) of ā = {ā_k}_{k=1}^{K} is defined by G(ā) = ((MK + K − 1)/2) log n + {M/2 − (φ0 − 1/2)} ∑_{k=1}^{K} log ā_k. [sent-214, score-0.405]
55 (24), it is noted that the function values of T_n(θ̄) at specific points of the variational parameter θ̄ give the upper bounds of the normalized variational stochastic complexity F̄_0(X^n). [sent-223, score-1.596]
56 (I): ā_k = a*_k (1 ≤ k ≤ K0 − 1), ā_k = a*_{K0}/(K − K0 + 1) (K0 ≤ k ≤ K), µ̄_k = µ*_k (1 ≤ k ≤ K0 − 1), µ̄_k = µ*_{K0} (K0 ≤ k ≤ K); then nH_n(θ̄) < (K − K0 + 1)/min_k{a*_k} holds and T_n(θ̄) < ((MK + K − 1)/2) log n + C′ + O(1/n), where C′ is a constant. [sent-225, score-0.486]
57 If φ0 > (M+1)/2, (25) then G(ā) ≥ ((MK + K − 1)/2) log n − ((M+1)/2 − φ0) K log K, (26) since Jensen's inequality yields that ∑_{k=1}^{K} log ā_k ≤ K log((1/K) ∑_{k=1}^{K} ā_k) = K log(1/K). [sent-233, score-0.702]
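The Jensen step can also be checked numerically; a minimal sketch verifying ∑_k log a_k ≤ K log(1/K) on random points of the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
a = rng.dirichlet(np.ones(K), size=100000)          # random mixing proportions on the simplex
print(np.all(np.log(a).sum(axis=1) <= K * np.log(1.0 / K) + 1e-9))   # True
```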
58 k=1 k=1 M+1 If φ0 ≤ 2 , then G(a) ≥ {(K − 1)φ0 + M M+1 1 } log n + ( − φ0 )(K − 1) log φ0 + O( ), 2 2 n (27) φ0 since ak ≥ n+Kφ0 holds for every k and the constraint ∑K ak = 1 ensures that | log ak | is bounded k=1 by a constant independent of n for at least one index k. [sent-234, score-0.891]
59 Experiments. In order to examine the quality of the theoretical bounds given in Theorem 3, we conducted experiments of variational Bayesian learning for Gaussian mixture models using M = 1 and M = 10 dimensional synthetic data. [sent-239, score-0.802]
60 We applied the variational Bayesian algorithm to each model using the data set generated from the true distribution with K0 = 2 components. [sent-241, score-0.701]
61 (13), the initial value of the variational parameter θ̄ was set around the true parameter, that is, around a_1 = a_2 = 1/2, a_k = 0 (k ≥ 3), µ_1 = µ*_1, µ_2 = µ*_2, and µ_k = 0 (k ≥ 3). [sent-247, score-0.836]
62 For each data set, the normalized variational stochastic complexity (the inside of the braces in eq. [sent-249, score-0.882]
63 (13)) was calculated when the variational Bayesian algorithm converged. [sent-250, score-0.627]
64 Denoting the results for the respective data sets by F̄_0(X^1000) and F̄_0(X^100), we calculated λ_VB = (F̄_0(X^1000) − F̄_0(X^100))/log 10 (28) to estimate the coefficient of the leading term of the normalized variational stochastic complexity F̄_0(X^n). [sent-251, score-1.011]
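The slope estimate of eq. (28) is a one-line computation once the two normalized variational stochastic complexities are available; in the sketch below the two F̄_0 values are hypothetical placeholders standing in for converged VB runs at n = 100 and n = 1000:

```python
import numpy as np

# Hypothetical converged values of \bar{F}_0(X^n) at two sample sizes
# (placeholders, not results from the paper).
F0_bar = {100: 7.4, 1000: 13.1}

lambda_vb = (F0_bar[1000] - F0_bar[100]) / np.log(10.0)   # slope of the log n term, eq. (28)
print(lambda_vb)
```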
65 (19) are also plotted for the comparison of variational Bayesian learning with true Bayesian learning in the next section. [sent-256, score-0.647]
66 The variational Bayesian algorithm gave values of λ_VB that coincide with the coefficient λ. [sent-257, score-0.627]
67 We also calculated the generalization error defined by K(p(x|θ_0) || ⟨p(x|θ)⟩_{r(θ|X^n)}), where ⟨p(x|θ)⟩_{r(θ|X^n)} is the predictive distribution in variational Bayesian learning. [sent-260, score-0.68]
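The generalization error can be estimated by Monte Carlo. The sketch below draws samples from the true density and, for simplicity, scores a plug-in mixture with hypothetical fitted parameters in place of the full predictive average ⟨p(x|θ)⟩_{r(θ|X^n)}:

```python
import numpy as np

def log_unit_var_gmm(x, a, mu):
    """Log-density of a one-dimensional unit-variance Gaussian mixture."""
    comp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    return np.log(comp @ a)

a0, mu0 = np.array([0.5, 0.5]), np.array([0.0, 3.0])        # true K0 = 2 mixture
a_hat = np.array([0.48, 0.26, 0.26])                        # hypothetical fitted K = 3 mixture
mu_hat = np.array([0.05, 2.9, 3.1])

rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=200000)
x = mu0[z] + rng.standard_normal(200000)                    # samples from p(x|theta_0)

# Monte Carlo estimate of K(p(x|theta_0) || plug-in predictive)
print(np.mean(log_unit_var_gmm(x, a0, mu0) - log_unit_var_gmm(x, a_hat, mu_hat)))
```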
68 (7), λ_VB and λ_G should have shown similar behavior if the same relation held between the average normalized variational stochastic complexity and the average generalization error as in Bayesian learning. [sent-268, score-0.916]
69 These results imply that in variational Bayesian learning, unlike in Bayesian learning, the coefficient of the average generalization error differs from that of the average variational stochastic complexity E_{X^n}[F̄(X^n)]. [sent-269, score-1.491]
70 Discussion. In this paper, we showed upper and lower bounds of the variational stochastic complexity of Gaussian mixture models. [sent-271, score-1.02]
71 (17) can be improved to give F̄_0(X^n) ≥ λ log n + nH_n(θ_vb) + C_1, (30) if the consistency of the variational estimator θ_vb is proven. [sent-276, score-0.754]
72 However, little has been known so far about the behavior of the variational estimator. [sent-299, score-0.627]
73 2 Comparison to Bayesian Learning. We compare the normalized variational stochastic complexity shown in Theorem 3 with the one in true Bayesian learning, assuming eq. [sent-312, score-0.902]
74 For the Gaussian mixture model in particular, the following upper bound on the coefficient of the average normalized Bayesian stochastic complexity E_{X^n}[F_0(X^n)] described as eq. [sent-315, score-0.443]
75 (33) Let us compare this λ of variational Bayesian learning to λ in eq. [sent-321, score-0.627]
76 This implies that the more redundant components the model has, the more variational Bayesian learning differs from true Bayesian learning. [sent-324, score-0.724]
77 This implies the variational posterior is close to the true Bayesian posterior. [sent-326, score-0.781]
78 It is noted that λ of variational Bayesian learning eq. [sent-331, score-0.651]
79 3 Stochastic Complexity and Generalization. We have discussed how much the variational posterior differs from the true Bayesian posterior by comparing the stochastic complexities. [sent-336, score-1.081]
80 In variational Bayesian learning, there is no apparent relationship between the average variational stochastic complexity and the average generalization error, unlike in Bayesian learning, where their leading terms are given by the same coefficient λ as in eq. [sent-337, score-1.512]
81 Hence, assessing the generalization performance of the Gaussian mixture model in variational Bayesian learning is an important issue to be addressed. [sent-341, score-0.805]
82 It would be important to clarify how this term affects the generalization performance in variational Bayesian learning. [sent-343, score-0.677]
83 From Theorem 3, only the hyperparameter φ0 affects the leading term of the normalized variational stochastic complexity F̄_0(X^n), and the other hyperparameters ξ0 and ν0 affect only the lower order terms. [sent-346, score-1.05]
84 We also point out that Theorem 3 shows how the hyperparameter φ0 influences variational Bayesian learning. [sent-353, score-0.684]
85 (17) with experimental results, one can investigate the properties of the actual iterative algorithm in variational Bayesian learning. [sent-361, score-0.627]
86 Although the actual iterative algorithm gives the variational posterior that satisfies eq. [sent-362, score-0.761]
87 One can examine experimentally whether the algorithm converges to the optimal variational posterior that minimizes the functional, rather than to a local minimum, by comparing the experimental results with the theoretical bounds. [sent-368, score-0.798]
88 Moreover, the theoretical bounds would enable us to compare the accuracy of variational Bayesian learning with that of the Laplace approximation or the MCMC method. [sent-369, score-0.662]
89 However, in order to make such comparisons more precise, one will need not only the leading term but also the lower order terms of the asymptotic form of the variational stochastic complexity. [sent-370, score-0.866]
90 The Gaussian mixture model is included in general exponential family models with hidden variables (Sato, 2001) and furthermore, in general graphical models to which the variational Bayesian framework can be applied (Attias, 1999). [sent-372, score-0.88]
91 Analyzing the variational stochastic complexities in the more general cases would be an important undertaking. [sent-373, score-0.83]
92 Furthermore, as mentioned in Section 4, the variational stochastic complexity F̄(X^n) is used as a criterion for model selection in variational Bayesian learning. [sent-374, score-1.492]
93 Conclusion. In this paper, we derived upper and lower bounds of the variational stochastic complexity of Gaussian mixture models. [sent-380, score-1.02]
94 Using the derived bounds, we discussed the influence of the hyperparameters and the accuracy of variational Bayesian learning as an approximation of true Bayesian learning. [sent-381, score-0.719]
95 These bounds can be used for evaluation and optimization of learning algorithms based on the variational Bayesian approximation. [sent-382, score-0.662]
96 Proof of Lemma 2. Proof: From the restriction of the variational Bayesian approximation eq. [sent-390, score-0.627]
97 (10), if the variational posterior Q(Y^n|X^n) is optimized, then ⟨log [Q(Y^n|X^n)/p(X^n, Y^n|θ)]⟩_{r(θ|X^n)Q(Y^n|X^n)} = − log C_Q holds. [sent-394, score-0.869]
98 Proof of Lemma 5. Proof: Calculating the Kullback information between the posterior and the prior, we obtain K(r(a|ā)||ϕ(a)) = ∑_{k=1}^{K} h(n_k) − nΨ(n + Kφ0) + log Γ(n + Kφ0) + log(Γ(φ0)^K / Γ(Kφ0)), (34) where we use the notation h(x) = xΨ(x + φ0) − log Γ(x + φ0). [sent-397, score-0.458]
99 Inferring parameters and structure of latent variable models by variational bayes. [sent-421, score-0.658]
100 Lower bounds of stochastic complexities in variational bayes learning of gaussian mixture models. [sent-477, score-1.037]
wordName wordTfidf (topN-words)
[('variational', 0.627), ('watanabe', 0.376), ('bayesian', 0.299), ('vb', 0.218), ('ak', 0.189), ('stochastic', 0.166), ('nk', 0.156), ('nhn', 0.138), ('posterior', 0.134), ('logcq', 0.113), ('omplexities', 0.113), ('tochastic', 0.113), ('mixture', 0.109), ('log', 0.108), ('ex', 0.099), ('ixtures', 0.096), ('bic', 0.096), ('cq', 0.081), ('beal', 0.074), ('yk', 0.072), ('hyperparameters', 0.072), ('aussian', 0.065), ('posteriors', 0.057), ('hyperparameter', 0.057), ('kullback', 0.053), ('normalized', 0.052), ('mk', 0.05), ('sato', 0.05), ('yamazaki', 0.05), ('hidden', 0.047), ('gaussian', 0.044), ('nh', 0.043), ('exp', 0.04), ('likelihood', 0.04), ('kazuho', 0.038), ('complexity', 0.037), ('complexities', 0.037), ('bounds', 0.035), ('model', 0.035), ('tn', 0.034), ('asymptotic', 0.034), ('generalization', 0.034), ('datum', 0.032), ('models', 0.031), ('laplace', 0.03), ('prior', 0.03), ('regular', 0.03), ('upper', 0.028), ('coef', 0.028), ('distributions', 0.027), ('lemma', 0.026), ('attias', 0.026), ('levin', 0.025), ('mailbox', 0.025), ('nagatsuta', 0.025), ('sumio', 0.025), ('yokohama', 0.025), ('components', 0.025), ('yn', 0.024), ('ns', 0.024), ('hn', 0.024), ('mcmc', 0.024), ('noted', 0.024), ('schwarz', 0.023), ('normal', 0.022), ('theorem', 0.022), ('functional', 0.022), ('uence', 0.022), ('clari', 0.022), ('coefficient', 0.021), ('singularities', 0.021), ('psi', 0.021), ('leading', 0.021), ('lemmas', 0.021), ('energy', 0.021), ('ghahramani', 0.02), ('averages', 0.02), ('rm', 0.02), ('analyzing', 0.02), ('true', 0.02), ('titech', 0.019), ('henceforth', 0.019), ('distribution', 0.019), ('bayes', 0.019), ('algebraic', 0.019), ('estimator', 0.019), ('kth', 0.018), ('lower', 0.018), ('diverge', 0.017), ('hartigan', 0.017), ('bars', 0.017), ('free', 0.017), ('redundant', 0.017), ('ratio', 0.016), ('clarify', 0.016), ('inequalities', 0.016), ('bound', 0.016), ('xi', 0.016), ('experimentally', 0.015), ('evidence', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.13440615 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
Author: Tomáš Šingliar, Miloš Hauskrecht
Abstract: We develop a new component analysis framework, the Noisy-Or Component Analyzer (NOCA), that targets high-dimensional binary data. NOCA is a probabilistic latent variable model that assumes the expression of observed high-dimensional binary data is driven by a small number of hidden binary sources combined via noisy-or units. The component analysis procedure is equivalent to learning of NOCA parameters. Since the classical EM formulation of the NOCA learning problem is intractable, we develop its variational approximation. We test the NOCA framework on two problems: (1) a synthetic image-decomposition problem and (2) a co-citation data analysis problem for thousands of CiteSeer documents. We demonstrate good performance of the new model on both problems. In addition, we contrast the model to two mixture-based latent-factor models: the probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Differing assumptions underlying these models cause them to discover different types of structure in co-citation data, thus illustrating the benefit of NOCA in building our understanding of highdimensional data sets. Keywords: component analysis, vector quantization, variational learning, link analysis
3 0.13251249 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability
4 0.085369647 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge c 2006 Radu Stefan Niculescu, Tom Mitchell and Bharat Rao. N ICULESCU , M ITCHELL AND R AO
5 0.080469511 49 jmlr-2006-Learning Parts-Based Representations of Data
Author: David A. Ross, Richard S. Zemel
Abstract: Many perceptual models and theories hinge on treating objects as a collection of constituent parts. When applying these approaches to data, a fundamental problem arises: how can we determine what are the parts? We attack this problem using learning, proposing a form of generative latent factor model, in which each data dimension is allowed to select a different factor or part as its explanation. This approach permits a range of variations that posit different models for the appearance of a part. Here we provide the details for two such models: a discrete and a continuous one. Further, we show that this latent factor model can be extended hierarchically to account for correlations between the appearances of different parts. This permits modeling of data consisting of multiple categories, and learning these categories simultaneously with the parts when they are unobserved. Experiments demonstrate the ability to learn parts-based representations, and categories, of facial images and user-preference data. Keywords: parts, unsupervised learning, latent factor models, collaborative filtering, hierarchical learning
6 0.065499634 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
7 0.063894048 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
8 0.051101644 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
9 0.050367597 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
10 0.038646962 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
11 0.03857528 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
12 0.030778987 88 jmlr-2006-Streamwise Feature Selection
13 0.029797319 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
14 0.029228764 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
15 0.028939607 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
16 0.028888842 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
17 0.028101869 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
18 0.026447829 45 jmlr-2006-Learning Coordinate Covariances via Gradients
19 0.026228258 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
20 0.025469217 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
topicId topicWeight
[(0, 0.165), (1, -0.008), (2, -0.138), (3, -0.027), (4, -0.257), (5, -0.128), (6, -0.13), (7, 0.027), (8, 0.047), (9, -0.163), (10, 0.13), (11, 0.121), (12, 0.047), (13, -0.029), (14, -0.036), (15, 0.175), (16, -0.24), (17, -0.162), (18, -0.163), (19, -0.022), (20, -0.04), (21, -0.052), (22, 0.159), (23, 0.057), (24, 0.091), (25, -0.031), (26, -0.026), (27, -0.136), (28, -0.058), (29, 0.112), (30, -0.052), (31, 0.01), (32, 0.013), (33, 0.068), (34, -0.045), (35, -0.127), (36, 0.135), (37, -0.006), (38, -0.059), (39, 0.041), (40, -0.047), (41, 0.036), (42, 0.143), (43, -0.077), (44, 0.055), (45, 0.162), (46, -0.015), (47, 0.028), (48, -0.12), (49, 0.173)]
simIndex simValue paperId paperTitle
same-paper 1 0.97823507 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.57826608 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
Author: Tomáš Šingliar, Miloš Hauskrecht
Abstract: We develop a new component analysis framework, the Noisy-Or Component Analyzer (NOCA), that targets high-dimensional binary data. NOCA is a probabilistic latent variable model that assumes the expression of observed high-dimensional binary data is driven by a small number of hidden binary sources combined via noisy-or units. The component analysis procedure is equivalent to learning of NOCA parameters. Since the classical EM formulation of the NOCA learning problem is intractable, we develop its variational approximation. We test the NOCA framework on two problems: (1) a synthetic image-decomposition problem and (2) a co-citation data analysis problem for thousands of CiteSeer documents. We demonstrate good performance of the new model on both problems. In addition, we contrast the model to two mixture-based latent-factor models: the probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Differing assumptions underlying these models cause them to discover different types of structure in co-citation data, thus illustrating the benefit of NOCA in building our understanding of highdimensional data sets. Keywords: component analysis, vector quantization, variational learning, link analysis
3 0.52155745 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge c 2006 Radu Stefan Niculescu, Tom Mitchell and Bharat Rao. N ICULESCU , M ITCHELL AND R AO
5 0.41672421 49 jmlr-2006-Learning Parts-Based Representations of Data
Author: David A. Ross, Richard S. Zemel
Abstract: Many perceptual models and theories hinge on treating objects as a collection of constituent parts. When applying these approaches to data, a fundamental problem arises: how can we determine what are the parts? We attack this problem using learning, proposing a form of generative latent factor model, in which each data dimension is allowed to select a different factor or part as its explanation. This approach permits a range of variations that posit different models for the appearance of a part. Here we provide the details for two such models: a discrete and a continuous one. Further, we show that this latent factor model can be extended hierarchically to account for correlations between the appearances of different parts. This permits modeling of data consisting of multiple categories, and learning these categories simultaneously with the parts when they are unobserved. Experiments demonstrate the ability to learn parts-based representations, and categories, of facial images and user-preference data. Keywords: parts, unsupervised learning, latent factor models, collaborative filtering, hierarchical learning
6 0.33795431 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
8 0.28440845 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
9 0.28295013 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
10 0.2321575 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
11 0.21078841 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
12 0.20271179 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
13 0.18613146 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
14 0.18243703 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
15 0.17974699 88 jmlr-2006-Streamwise Feature Selection
16 0.17946279 53 jmlr-2006-Learning a Hidden Hypergraph
17 0.17283133 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
18 0.17181158 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
19 0.16652633 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
20 0.13719666 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
topicId topicWeight
[(8, 0.062), (11, 0.015), (35, 0.01), (36, 0.055), (45, 0.027), (50, 0.106), (62, 0.332), (63, 0.053), (68, 0.02), (76, 0.018), (78, 0.032), (79, 0.015), (81, 0.043), (84, 0.016), (90, 0.016), (91, 0.02), (96, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.76309544 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.41530871 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.40533316 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
4 0.39973199 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
5 0.39522183 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis
6 0.39153659 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
7 0.39068836 66 jmlr-2006-On Model Selection Consistency of Lasso
8 0.3891435 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
9 0.38522318 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
10 0.38345635 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
11 0.38193962 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
12 0.38156962 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
13 0.38094693 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
14 0.38022962 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
15 0.37983677 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
16 0.37951094 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
17 0.37704775 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
18 0.37142137 48 jmlr-2006-Learning Minimum Volume Sets
20 0.36701441 53 jmlr-2006-Learning a Hidden Hypergraph