jmlr jmlr2006 jmlr2006-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
Reference: text
sentIndex sentText sentNum sentScore
1 The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. [sent-10, score-0.661]
2 The properties and capabilities of variational Bayesian learning itself have not been clarified yet. [sent-11, score-0.627]
3 It is still unknown how good an approximation the variational Bayesian approach can achieve. [sent-12, score-0.627]
4 In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. [sent-13, score-1.641]
5 Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity [sent-15, score-0.939]
6 The variational Bayesian (VB) framework was proposed as another approximation for computations in models with hidden variables (Attias, 1999; Ghahramani and Beal, 2000). [sent-33, score-0.705]
7 The variational Bayesian framework has been applied to various real-world data modeling problems and has empirically proved to be both computationally tractable and able to generalize well. [sent-35, score-0.627]
8 The properties of variational Bayesian learning remain unclear from a theoretical standpoint. [sent-36, score-0.627]
9 Although the variational Bayesian framework is an approximation, questions like how accurately it approximates the true distribution have yet to be answered. [sent-37, score-0.666]
10 In this paper, we focus on variational Bayesian learning of Gaussian mixture models. [sent-38, score-0.736]
11 As the main contribution, we derive asymptotic upper and lower bounds on the variational stochastic complexity. [sent-39, score-0.908]
12 It is shown that the variational stochastic complexity is smaller than in regular statistical models, so the advantage of Bayesian learning still remains in variational Bayesian learning. [sent-40, score-1.487]
13 The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, is an important quantity for model selection. [sent-41, score-1.527]
14 One is the accuracy of variational Bayesian learning as an approximation method since the variational stochastic complexity shows the distance from the variational posterior distribution to the true Bayesian posterior distribution in terms of Kullback information. [sent-43, score-2.41]
15 Since the variational Bayesian algorithm minimizes the variational free energy, the derived bounds indicate how the hyperparameters influence the learning process. [sent-45, score-1.378]
16 Analyzing the variational stochastic complexity in this case is most valuable for comparing variational Bayesian learning with true Bayesian learning. [sent-48, score-1.477]
17 In Section 4, the variational Bayesian framework is outlined and the variational stochastic complexity is defined. [sent-54, score-1.457]
18 A Gaussian mixture model p(x|θ) of an M-dimensional input x ∈ R^M with a parameter vector θ is defined by p(x|θ) = ∑_{k=1}^{K} a_k g(x|µ_k, Σ_k), where the integer K is the number of components and {a_k | a_k ≥ 0, ∑_{k=1}^{K} a_k = 1} is the set of mixing proportions. [sent-61, score-0.547]
19 (1) of Gaussian mixture models in the variational Bayesian framework and show upper and lower bounds of the variational stochastic complexity in Theorem 3. [sent-66, score-1.678]
20 The Gaussian mixture model can be rewritten as follows using a hidden variable y = (y_1, · · · , y_K) ∈ {(1, 0, · · · , 0), (0, 1, · · · , 0), · · · , (0, 0, · · · , 1)}: p(x, y|θ) = ∏_{k=1}^{K} { a_k / (2πσ_k^2)^{M/2} · exp( −‖x − µ_k‖^2 / (2σ_k^2) ) }^{y_k}. [sent-67, score-0.452]
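As a concrete illustration of the two expressions above, the following minimal Python sketch (not from the paper; the toy parameter values and the isotropic covariances σ_k^2 I are placeholder assumptions) evaluates the mixture density p(x|θ) and the complete-data log-likelihood log p(x, y|θ) for a one-hot indicator y.

```python
import numpy as np

def gmm_density(x, a, mu, sigma2):
    """Mixture density p(x|theta) = sum_k a_k N(x | mu_k, sigma2_k * I)."""
    M = x.shape[-1]
    sq_dist = np.sum((x[None, :] - mu) ** 2, axis=-1)                   # shape (K,)
    comp = np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (M / 2.0)
    return float(np.dot(a, comp))

def complete_loglik(x, y, a, mu, sigma2):
    """log p(x, y|theta) when the one-hot indicator y selects a single component."""
    M = x.shape[-1]
    k = int(np.argmax(y))
    return (np.log(a[k]) - 0.5 * M * np.log(2.0 * np.pi * sigma2[k])
            - np.sum((x - mu[k]) ** 2) / (2.0 * sigma2[k]))

# toy check: K = 2 components in M = 1 dimension
a = np.array([0.5, 0.5]); mu = np.array([[0.0], [3.0]]); sigma2 = np.array([1.0, 1.0])
x = np.array([0.2])
print(gmm_density(x, a, mu, sigma2), complete_loglik(x, np.array([1, 0]), a, mu, sigma2))
```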
21 Let S(X^n) = − ∑_{i=1}^{n} log p_0(x_i), and define the normalized Bayesian stochastic complexity F_0(X^n) by F_0(X^n) = − log Z_0(X^n) = F(X^n) − S(X^n). [sent-85, score-0.471]
22 As another approximation, the variational Bayesian framework was proposed (Attias, 1999; Beal, 2003; Ghahramani and Beal, 2000). [sent-111, score-0.627]
23 Variational Bayesian Learning In this section, we outline the variational Bayesian framework and define the variational stochastic complexity. [sent-113, score-1.42]
24 The variational Bayesian framework starts by upper bounding the Bayesian stochastic complexity. [sent-117, score-0.821]
25 For an arbitrary conditional distribution q(Y^n, θ|X^n) on the hidden variables and the parameters, the Bayesian stochastic complexity can be upper bounded by applying Jensen's inequality: F(X^n) ≤ ∑_{Y^n} ∫ q(Y^n, θ|X^n) log [ q(Y^n, θ|X^n) / p(X^n, Y^n, θ) ] dθ ≡ F[q]. [sent-118, score-0.405]
26 The functional F[q] is called the variational free energy. [sent-121, score-0.666]
27 The variational Bayesian approach makes an approximation to ensure a computationally tractable posterior. [sent-122, score-0.627]
28 The distribution q(Y n , θ|X n ) that minimizes the functional F[q] is termed the optimal variational posterior and generally differs from the true Bayesian posterior. [sent-124, score-0.822]
29 Minimization of the functional F[q] with respect to the distributions Q(Y n |X n ) and r(θ|X n ) can be performed by using variational methods. [sent-125, score-0.676]
30 (8) Then the variational posteriors r(θ|X^n) and Q(Y^n|X^n) satisfy r(θ|X^n) = (1/C_r) ϕ(θ) exp⟨log p(X^n, Y^n|θ)⟩_{Q(Y^n|X^n)} and Q(Y^n|X^n) = (1/C_Q) exp⟨log p(X^n, Y^n|θ)⟩_{r(θ|X^n)}, where ⟨·⟩_p denotes the expectation over p and C_r and C_Q are the normalization constants. [sent-129, score-0.923]
31 2 Stochastic Complexity of Variational Bayes. We define the variational stochastic complexity F̄(X^n) as the minimum value of the functional F[q] attained by the above optimal variational posteriors, that is, F̄(X^n) = min_{r,Q} F[q]. [sent-140, score-1.479]
32 The variational stochastic complexity F̄(X^n) gives an estimate (upper bound) for the true Bayesian stochastic complexity F(X^n), which is the minus log evidence. [sent-141, score-1.161]
33 Therefore, F̄(X^n) is used for model selection in variational Bayesian learning (Beal, 2003). [sent-142, score-0.662]
34 Moreover, the difference between F̄(X^n) and the Bayesian stochastic complexity F(X^n) is the Kullback information from the optimal variational posterior to the true posterior. [sent-143, score-0.984]
35 Hence, comparison between F̄(X^n) and F(X^n) shows the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. [sent-145, score-0.647]
36 We define the normalized variational stochastic complexity F̄_0(X^n) by F̄_0(X^n) = F̄(X^n) − S(X^n). [sent-146, score-0.882]
37 The variational posteriors r(θ|X n ) and Q(Y n |X n ) that satisfy eq. [sent-150, score-0.684]
38 (10) are parameterized by the variational parameter θ̄ defined by θ̄ = ⟨θ⟩_{r(θ|X^n)}, if the model p(x, y|θ) is included in the exponential family (Beal, 2003; Ghahramani and Beal, 2000). [sent-152, score-0.662]
39 Therefore, henceforth we denote r(θ|X^n) and C_Q as r(θ|θ̄) and C_Q(θ̄) when they are regarded as functions of the variational parameter θ̄. [sent-155, score-0.646]
40 We define the variational estimator θ_vb of θ̄ by the variational parameter θ̄ that attains the minimum value of the normalized variational stochastic complexity F̄_0(X^n). [sent-156, score-2.155]
41 (13) In variational Bayesian learning, the variational parameter θ̄ is updated iteratively to find the optimal solution θ_vb. [sent-158, score-1.254]
42 Main Results. In this section, we describe two conditions and give the upper and lower bounds of the normalized variational stochastic complexity in Theorem 3. [sent-162, score-0.963]
43 They are the conjugate prior distributions and are often used in variational Bayesian learning of Gaussian mixture models. [sent-168, score-0.793]
44 Then the normalized variational stochastic complexity F̄_0(X^n) defined by eq. [sent-172, score-0.882]
45 Then the average of the normalized variational stochastic complexity F̄_0(X^n) satisfies λ log n + E_{X^n}[nH_n(θ_vb)] + C_1 ≤ E_{X^n}[F̄_0(X^n)] ≤ λ log n + C_2. [sent-176, score-1.098]
46 This means the normalized variational stochastic complexity F̄_0(X^n) becomes smaller than the BIC and implies that the advantage of non-regular models in Bayesian learning still remains in variational Bayesian learning. [sent-184, score-1.54]
47 Theorem 3 also shows how the hyperparameters affect the learning process and implies that φ0 is the only hyperparameter on which the leading term of the normalized variational stochastic complexity F̄_0(X^n) depends. [sent-185, score-1.089]
48 First of all, we derive the variational posteriors r(θ|X^n) and Q(Y^n|X^n) and the variational parameter θ̄ for the Gaussian mixture model given by eq. [sent-193, score-1.532]
49 Note that the variables n_k and ν_k satisfy the constraints ∑_{k=1}^{K} n_k = n and ∑_{k=1}^{K} n_k ν_k = ∑_{i=1}^{n} x_i. [sent-198, score-0.484]
50 (10), the variational posterior Q(Y^n|X^n) is given by Q(Y^n|X^n) = (1/C_Q) ∏_{i=1}^{n} exp[ ∑_{k=1}^{K} y_k^i { Ψ(n_k + φ0) − Ψ(n + Kφ0) − ‖x_i − µ̄_k‖^2/2 − (M/2)(log 2π + 1/(n_k + ξ0)) } ], where Ψ(x) = Γ′(x)/Γ(x) is the digamma (psi) function and we used ⟨log a_k⟩_{r(a|X^n)} = Ψ(n_k + φ0) − Ψ(n + Kφ0). [sent-204, score-1.342]
51 The variational parameter θ̄ is given by θ̄ = ⟨θ⟩_{r(θ|X^n)} = {ā_k, µ̄_k}_{k=1}^{K}. [sent-205, score-0.627]
52 It is noted that r(θ|X^n) and Q(Y^n|X^n) are parameterized by θ̄, since n_k can be replaced by using ā_k = (n_k + φ0)/(n + Kφ0). [sent-206, score-0.525]
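The formulas above suggest the following coordinate-ascent sketch (a minimal illustration, not the authors' implementation). Beyond what is quoted here, it assumes unit-variance components, a symmetric Dirichlet(φ0) prior on the mixing proportions, and an N(ν0, ξ0^{-1} I) prior on each component mean; the E-step follows the exponent of Q(Y^n|X^n) above, with Ψ computed by scipy's digamma.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm_unit_var(X, K, phi0=1.0, xi0=1.0, nu0=None, n_iter=200, seed=0):
    """Coordinate-ascent sketch of VB learning for a unit-variance Gaussian mixture."""
    n, M = X.shape
    nu0 = np.zeros(M) if nu0 is None else np.asarray(nu0)
    rng = np.random.default_rng(seed)
    R = rng.dirichlet(np.ones(K), size=n)                      # responsibilities <y_k^i>
    for _ in range(n_iter):
        nk = R.sum(axis=0)                                     # expected counts n_k
        nuk = (R.T @ X) / np.maximum(nk, 1e-12)[:, None]       # component means nu_k
        mu_bar = (nk[:, None] * nuk + xi0 * nu0) / (nk + xi0)[:, None]  # posterior means of mu_k
        # E-step: exponent of Q(Y^n|X^n); the -(M/2) log 2*pi term cancels on normalization
        log_rho = (digamma(nk + phi0) - digamma(n + K * phi0)
                   - 0.5 * ((X[:, None, :] - mu_bar[None, :, :]) ** 2).sum(axis=-1)
                   - 0.5 * M / (nk + xi0))
        log_rho -= log_rho.max(axis=1, keepdims=True)          # numerical stability
        R = np.exp(log_rho)
        R /= R.sum(axis=1, keepdims=True)
    a_bar = (nk + phi0) / (n + K * phi0)                       # variational parameter a_k
    return a_bar, mu_bar, R

# toy usage: n = 500 points from a K0 = 2 component mixture, fitted with K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (250, 1)), rng.normal(3.0, 1.0, (250, 1))])
print(vb_gmm_unit_var(X, K=3)[0])
```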
53 (21) 0 ≤ log Γ(x) − {(x − 1/2) log x − x + (1/2) log 2π} ≤ 1/(12x). The inequalities (20) ensure that substituting log x for Ψ(x) contributes only additive constant terms to the normalized variational stochastic complexity. [sent-211, score-1.293]
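A quick numerical spot-check of inequality (21), and of the claim that replacing Ψ(x) by log x only shifts the result by bounded terms, can be done with scipy (the 1/x bound on |Ψ(x) − log x| used below is a standard fact assumed here, not quoted from the paper):

```python
import numpy as np
from scipy.special import gammaln, digamma

x = np.linspace(0.5, 50.0, 2000)
stirling = (x - 0.5) * np.log(x) - x + 0.5 * np.log(2.0 * np.pi)
gap = gammaln(x) - stirling
print(gap.min() >= 0.0, np.all(gap <= 1.0 / (12.0 * x)))   # both True: inequality (21)
print(np.all(np.abs(digamma(x) - np.log(x)) <= 1.0 / x))   # Psi(x) = log x + O(1/x)
```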
54 Lemma 5: K(r(θ|θ̄)||ϕ(θ)) − {G(ā) + (ξ0/2) ∑_{k=1}^{K} ‖µ̄_k − ν0‖^2} ≤ C holds, where C is a constant and the function G(ā) of ā = {ā_k}_{k=1}^{K} is defined by G(ā) = ((MK + K − 1)/2) log n + {M/2 − (φ0 − 1/2)} ∑_{k=1}^{K} log ā_k. [sent-214, score-0.405]
55 (24), it is noted that the function values of T_n(θ̄) at specific points of the variational parameter θ̄ give the upper bounds of the normalized variational stochastic complexity F̄_0(X^n). [sent-223, score-1.596]
56 (I): ā_k = a*_k (1 ≤ k ≤ K0 − 1), ā_k = a*_{K0}/(K − K0 + 1) (K0 ≤ k ≤ K), µ̄_k = µ*_k (1 ≤ k ≤ K0 − 1), µ̄_k = µ*_{K0} (K0 ≤ k ≤ K); then nH_n(θ̄) < (K − K0 + 1)/min_k{a*_k} holds and T_n(θ̄) < ((MK + K − 1)/2) log n + C′ + O(1/n), where C′ is a constant. [sent-225, score-0.486]
57 If φ0 > (M+1)/2, (25) then G(ā) ≥ ((MK + K − 1)/2) log n − ((M+1)/2 − φ0) K log K, (26) since Jensen's inequality yields that ∑_{k=1}^{K} log ā_k ≤ K log((1/K) ∑_{k=1}^{K} ā_k) = K log(1/K). [sent-233, score-0.702]
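The Jensen step can also be checked numerically; a minimal sketch verifying ∑_k log a_k ≤ K log(1/K) on random points of the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
a = rng.dirichlet(np.ones(K), size=100000)          # random mixing proportions on the simplex
print(np.all(np.log(a).sum(axis=1) <= K * np.log(1.0 / K) + 1e-9))   # True
```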
58 k=1 k=1 M+1 If φ0 ≤ 2 , then G(a) ≥ {(K − 1)φ0 + M M+1 1 } log n + ( − φ0 )(K − 1) log φ0 + O( ), 2 2 n (27) φ0 since ak ≥ n+Kφ0 holds for every k and the constraint ∑K ak = 1 ensures that | log ak | is bounded k=1 by a constant independent of n for at least one index k. [sent-234, score-0.891]
59 Experiments. In order to examine the quality of the theoretical bounds given in Theorem 3, we conducted experiments of variational Bayesian learning for Gaussian mixture models using M = 1 and M = 10 dimensional synthetic data. [sent-239, score-0.802]
60 We applied the variational Bayesian algorithm to each model using the data set generated from the true distribution with K0 = 2 components. [sent-241, score-0.701]
61 (13), the initial value of the variational parameter θ̄ was set around the true parameter, that is, around a_1 = a_2 = 1/2, a_k = 0 (k ≥ 3), µ_1 = µ*_1, µ_2 = µ*_2, and µ_k = 0 (k ≥ 3). [sent-247, score-0.836]
62 For each data set, the normalized variational stochastic complexity (the inside of the braces in eq. [sent-249, score-0.882]
63 (13)) was calculated when the variational Bayesian algorithm converged. [sent-250, score-0.627]
64 Denoting the results for the respective data sets by F̄_0(X^1000) and F̄_0(X^100), we calculated λ_VB = (F̄_0(X^1000) − F̄_0(X^100))/log 10 (28) to estimate the coefficient of the leading term of the normalized variational stochastic complexity F̄_0(X^n). [sent-251, score-1.011]
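The slope estimate of eq. (28) is a one-line computation once the two normalized variational stochastic complexities are available; in the sketch below the two F̄_0 values are hypothetical placeholders standing in for converged VB runs at n = 100 and n = 1000:

```python
import numpy as np

# Hypothetical converged values of \bar{F}_0(X^n) at two sample sizes
# (placeholders, not results from the paper).
F0_bar = {100: 7.4, 1000: 13.1}

lambda_vb = (F0_bar[1000] - F0_bar[100]) / np.log(10.0)   # slope of the log n term, eq. (28)
print(lambda_vb)
```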
65 (19) are also plotted for the comparison of variational Bayesian learning with true Bayesian learning in the next section. [sent-256, score-0.647]
66 The variational Bayesian algorithm gave values of λ_VB that coincide with the coefficient λ. [sent-257, score-0.627]
67 We also calculated the generalization error defined by K(p(x|θ_0) || ⟨p(x|θ)⟩_{r(θ|X^n)}), where ⟨p(x|θ)⟩_{r(θ|X^n)} is the predictive distribution in variational Bayesian learning. [sent-260, score-0.68]
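The generalization error can be estimated by Monte Carlo. The sketch below draws samples from the true density and, for simplicity, scores a plug-in mixture with hypothetical fitted parameters in place of the full predictive average ⟨p(x|θ)⟩_{r(θ|X^n)}:

```python
import numpy as np

def log_unit_var_gmm(x, a, mu):
    """Log-density of a one-dimensional unit-variance Gaussian mixture."""
    comp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    return np.log(comp @ a)

a0, mu0 = np.array([0.5, 0.5]), np.array([0.0, 3.0])        # true K0 = 2 mixture
a_hat = np.array([0.48, 0.26, 0.26])                        # hypothetical fitted K = 3 mixture
mu_hat = np.array([0.05, 2.9, 3.1])

rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=200000)
x = mu0[z] + rng.standard_normal(200000)                    # samples from p(x|theta_0)

# Monte Carlo estimate of K(p(x|theta_0) || plug-in predictive)
print(np.mean(log_unit_var_gmm(x, a0, mu0) - log_unit_var_gmm(x, a_hat, mu_hat)))
```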
68 (7), λ_VB and λ_G should have shown similar behavior if the same relation held between the average normalized variational stochastic complexity and the average generalization error as in Bayesian learning. [sent-268, score-0.916]
69 These results imply that in variational Bayesian learning, unlike in Bayesian learning, the coefficient of the average generalization error differs from that of the average variational stochastic complexity E_{X^n}[F̄(X^n)]. [sent-269, score-1.491]
70 Discussion. In this paper, we showed upper and lower bounds of the variational stochastic complexity of Gaussian mixture models. [sent-271, score-1.02]
71 (17) can be improved to give F̄_0(X^n) ≥ λ log n + nH_n(θ_vb) + C_1, (30) if the consistency of the variational estimator θ_vb is proven. [sent-276, score-0.754]
72 However, little has been known so far about the behavior of the variational estimator. [sent-299, score-0.627]
73 2 Comparison to Bayesian Learning. We compare the normalized variational stochastic complexity shown in Theorem 3 with the one in true Bayesian learning, assuming eq. [sent-312, score-0.902]
74 For the Gaussian mixture model in particular, the following upper bound on the coefficient of the average normalized Bayesian stochastic complexity E_{X^n}[F_0(X^n)] described as eq. [sent-315, score-0.443]
75 (33) Let us compare this λ of variational Bayesian learning to λ in eq. [sent-321, score-0.627]
76 This implies that the more redundant components the model has, the more variational Bayesian learning differs from true Bayesian learning. [sent-324, score-0.724]
77 This implies the variational posterior is close to the true Bayesian posterior. [sent-326, score-0.781]
78 It is noted that λ of variational Bayesian learning eq. [sent-331, score-0.651]
79 3 Stochastic Complexity and Generalization. We have discussed how much the variational posterior differs from the true Bayesian posterior by comparing the stochastic complexities. [sent-336, score-1.081]
80 In variational Bayesian learning, there is no apparent relationship between the average variational stochastic complexity and the average generalization error, unlike in Bayesian learning, where their leading terms are given by the same coefficient λ as in eq. [sent-337, score-1.512]
81 Hence, assessing the generalization performance of the Gaussian mixture model in variational Bayesian learning is an important issue to be addressed. [sent-341, score-0.805]
82 It would be important to clarify how this term affects the generalization performance in variational Bayesian learning. [sent-343, score-0.677]
83 From Theorem 3, only the hyperparameter φ0 affects the leading term of the normalized variational stochastic complexity F̄_0(X^n), and the other hyperparameters ξ0 and ν0 affect only the lower order terms. [sent-346, score-1.05]
84 We also point out that Theorem 3 shows how the hyperparameter φ0 influences variational Bayesian learning. [sent-353, score-0.684]
85 (17) with experimental results, one can investigate the properties of the actual iterative algorithm in variational Bayesian learning. [sent-361, score-0.627]
86 Although the actual iterative algorithm gives the variational posterior that satisfies eq. [sent-362, score-0.761]
87 One can examine experimentally whether the algorithm converges to the optimal variational posterior that minimizes the functional, rather than to a local minimum, by comparing the experimental results with the theoretical bounds. [sent-368, score-0.798]
88 Moreover, the theoretical bounds would enable us to compare the accuracy of variational Bayesian learning with that of the Laplace approximation or the MCMC method. [sent-369, score-0.662]
89 However, in order to make such comparisons more precise, one will need not only the leading term but also the lower order terms of the asymptotic form of the variational stochastic complexity. [sent-370, score-0.866]
90 The Gaussian mixture model is included in general exponential family models with hidden variables (Sato, 2001) and furthermore, in general graphical models to which the variational Bayesian framework can be applied (Attias, 1999). [sent-372, score-0.88]
91 Analyzing the variational stochastic complexities in the more general cases would be an important undertaking. [sent-373, score-0.83]
92 Furthermore, as mentioned in Section 4, the variational stochastic complexity F̄(X^n) is used as a criterion for model selection in variational Bayesian learning. [sent-374, score-1.492]
93 Conclusion. In this paper, we derived upper and lower bounds of the variational stochastic complexity of Gaussian mixture models. [sent-380, score-1.02]
94 Using the derived bounds, we discussed the influence of the hyperparameters and the accuracy of variational Bayesian learning as an approximation of true Bayesian learning. [sent-381, score-0.719]
95 These bounds can be used for evaluation and optimization of learning algorithms based on the variational Bayesian approximation. [sent-382, score-0.662]
96 Proof of Lemma 2. Proof: From the restriction of the variational Bayesian approximation eq. [sent-390, score-0.627]
97 (10), if the variational posterior Q(Y^n|X^n) is optimized, then ⟨log [Q(Y^n|X^n)/p(X^n, Y^n|θ)]⟩_{r(θ|X^n)Q(Y^n|X^n)} = − log C_Q holds. [sent-394, score-0.869]
98 Proof of Lemma 5. Proof: Calculating the Kullback information between the posterior and the prior, we obtain K(r(a|ā)||ϕ(a)) = ∑_{k=1}^{K} h(n_k) − nΨ(n + Kφ0) + log Γ(n + Kφ0) + log(Γ(φ0)^K / Γ(Kφ0)), (34) where we use the notation h(x) = xΨ(x + φ0) − log Γ(x + φ0). [sent-397, score-0.458]
99 Inferring parameters and structure of latent variable models by variational bayes. [sent-421, score-0.658]
100 Lower bounds of stochastic complexities in variational bayes learning of gaussian mixture models. [sent-477, score-1.037]
wordName wordTfidf (topN-words)
[('variational', 0.627), ('watanabe', 0.376), ('bayesian', 0.299), ('vb', 0.218), ('ak', 0.189), ('stochastic', 0.166), ('nk', 0.156), ('nhn', 0.138), ('posterior', 0.134), ('logcq', 0.113), ('omplexities', 0.113), ('tochastic', 0.113), ('mixture', 0.109), ('log', 0.108), ('ex', 0.099), ('ixtures', 0.096), ('bic', 0.096), ('cq', 0.081), ('beal', 0.074), ('yk', 0.072), ('hyperparameters', 0.072), ('aussian', 0.065), ('posteriors', 0.057), ('hyperparameter', 0.057), ('kullback', 0.053), ('normalized', 0.052), ('mk', 0.05), ('sato', 0.05), ('yamazaki', 0.05), ('hidden', 0.047), ('gaussian', 0.044), ('nh', 0.043), ('exp', 0.04), ('likelihood', 0.04), ('kazuho', 0.038), ('complexity', 0.037), ('complexities', 0.037), ('bounds', 0.035), ('model', 0.035), ('tn', 0.034), ('asymptotic', 0.034), ('generalization', 0.034), ('datum', 0.032), ('models', 0.031), ('laplace', 0.03), ('prior', 0.03), ('regular', 0.03), ('upper', 0.028), ('coef', 0.028), ('distributions', 0.027), ('lemma', 0.026), ('attias', 0.026), ('levin', 0.025), ('mailbox', 0.025), ('nagatsuta', 0.025), ('sumio', 0.025), ('yokohama', 0.025), ('components', 0.025), ('yn', 0.024), ('ns', 0.024), ('hn', 0.024), ('mcmc', 0.024), ('noted', 0.024), ('schwarz', 0.023), ('normal', 0.022), ('theorem', 0.022), ('functional', 0.022), ('uence', 0.022), ('clari', 0.022), ('coefficient', 0.021), ('singularities', 0.021), ('psi', 0.021), ('leading', 0.021), ('lemmas', 0.021), ('energy', 0.021), ('ghahramani', 0.02), ('averages', 0.02), ('rm', 0.02), ('analyzing', 0.02), ('true', 0.02), ('titech', 0.019), ('henceforth', 0.019), ('distribution', 0.019), ('bayes', 0.019), ('algebraic', 0.019), ('estimator', 0.019), ('kth', 0.018), ('lower', 0.018), ('diverge', 0.017), ('hartigan', 0.017), ('bars', 0.017), ('free', 0.017), ('redundant', 0.017), ('ratio', 0.016), ('clarify', 0.016), ('inequalities', 0.016), ('bound', 0.016), ('xi', 0.016), ('experimentally', 0.015), ('evidence', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.13440615 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
Author: Tomáš Šingliar, Miloš Hauskrecht
Abstract: We develop a new component analysis framework, the Noisy-Or Component Analyzer (NOCA), that targets high-dimensional binary data. NOCA is a probabilistic latent variable model that assumes the expression of observed high-dimensional binary data is driven by a small number of hidden binary sources combined via noisy-or units. The component analysis procedure is equivalent to learning of NOCA parameters. Since the classical EM formulation of the NOCA learning problem is intractable, we develop its variational approximation. We test the NOCA framework on two problems: (1) a synthetic image-decomposition problem and (2) a co-citation data analysis problem for thousands of CiteSeer documents. We demonstrate good performance of the new model on both problems. In addition, we contrast the model to two mixture-based latent-factor models: the probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Differing assumptions underlying these models cause them to discover different types of structure in co-citation data, thus illustrating the benefit of NOCA in building our understanding of highdimensional data sets. Keywords: component analysis, vector quantization, variational learning, link analysis
3 0.13251249 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability
4 0.085369647 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge c 2006 Radu Stefan Niculescu, Tom Mitchell and Bharat Rao. N ICULESCU , M ITCHELL AND R AO
5 0.080469511 49 jmlr-2006-Learning Parts-Based Representations of Data
Author: David A. Ross, Richard S. Zemel
Abstract: Many perceptual models and theories hinge on treating objects as a collection of constituent parts. When applying these approaches to data, a fundamental problem arises: how can we determine what are the parts? We attack this problem using learning, proposing a form of generative latent factor model, in which each data dimension is allowed to select a different factor or part as its explanation. This approach permits a range of variations that posit different models for the appearance of a part. Here we provide the details for two such models: a discrete and a continuous one. Further, we show that this latent factor model can be extended hierarchically to account for correlations between the appearances of different parts. This permits modeling of data consisting of multiple categories, and learning these categories simultaneously with the parts when they are unobserved. Experiments demonstrate the ability to learn parts-based representations, and categories, of facial images and user-preference data. Keywords: parts, unsupervised learning, latent factor models, collaborative filtering, hierarchical learning
6 0.065499634 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
7 0.063894048 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
8 0.051101644 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
9 0.050367597 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
10 0.038646962 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
11 0.03857528 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
12 0.030778987 88 jmlr-2006-Streamwise Feature Selection
13 0.029797319 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
14 0.029228764 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
15 0.028939607 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
16 0.028888842 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
17 0.028101869 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
18 0.026447829 45 jmlr-2006-Learning Coordinate Covariances via Gradients
19 0.026228258 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
20 0.025469217 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
topicId topicWeight
[(0, 0.165), (1, -0.008), (2, -0.138), (3, -0.027), (4, -0.257), (5, -0.128), (6, -0.13), (7, 0.027), (8, 0.047), (9, -0.163), (10, 0.13), (11, 0.121), (12, 0.047), (13, -0.029), (14, -0.036), (15, 0.175), (16, -0.24), (17, -0.162), (18, -0.163), (19, -0.022), (20, -0.04), (21, -0.052), (22, 0.159), (23, 0.057), (24, 0.091), (25, -0.031), (26, -0.026), (27, -0.136), (28, -0.058), (29, 0.112), (30, -0.052), (31, 0.01), (32, 0.013), (33, 0.068), (34, -0.045), (35, -0.127), (36, 0.135), (37, -0.006), (38, -0.059), (39, 0.041), (40, -0.047), (41, 0.036), (42, 0.143), (43, -0.077), (44, 0.055), (45, 0.162), (46, -0.015), (47, 0.028), (48, -0.12), (49, 0.173)]
simIndex simValue paperId paperTitle
same-paper 1 0.97823507 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.57826608 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
Author: Tomáš Šingliar, Miloš Hauskrecht
Abstract: We develop a new component analysis framework, the Noisy-Or Component Analyzer (NOCA), that targets high-dimensional binary data. NOCA is a probabilistic latent variable model that assumes the expression of observed high-dimensional binary data is driven by a small number of hidden binary sources combined via noisy-or units. The component analysis procedure is equivalent to learning of NOCA parameters. Since the classical EM formulation of the NOCA learning problem is intractable, we develop its variational approximation. We test the NOCA framework on two problems: (1) a synthetic image-decomposition problem and (2) a co-citation data analysis problem for thousands of CiteSeer documents. We demonstrate good performance of the new model on both problems. In addition, we contrast the model to two mixture-based latent-factor models: the probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Differing assumptions underlying these models cause them to discover different types of structure in co-citation data, thus illustrating the benefit of NOCA in building our understanding of highdimensional data sets. Keywords: component analysis, vector quantization, variational learning, link analysis
3 0.52155745 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge c 2006 Radu Stefan Niculescu, Tom Mitchell and Bharat Rao. N ICULESCU , M ITCHELL AND R AO
5 0.41672421 49 jmlr-2006-Learning Parts-Based Representations of Data
Author: David A. Ross, Richard S. Zemel
Abstract: Many perceptual models and theories hinge on treating objects as a collection of constituent parts. When applying these approaches to data, a fundamental problem arises: how can we determine what are the parts? We attack this problem using learning, proposing a form of generative latent factor model, in which each data dimension is allowed to select a different factor or part as its explanation. This approach permits a range of variations that posit different models for the appearance of a part. Here we provide the details for two such models: a discrete and a continuous one. Further, we show that this latent factor model can be extended hierarchically to account for correlations between the appearances of different parts. This permits modeling of data consisting of multiple categories, and learning these categories simultaneously with the parts when they are unobserved. Experiments demonstrate the ability to learn parts-based representations, and categories, of facial images and user-preference data. Keywords: parts, unsupervised learning, latent factor models, collaborative filtering, hierarchical learning
6 0.33795431 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
8 0.28440845 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
9 0.28295013 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
10 0.2321575 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
11 0.21078841 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
12 0.20271179 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
13 0.18613146 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
14 0.18243703 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
15 0.17974699 88 jmlr-2006-Streamwise Feature Selection
16 0.17946279 53 jmlr-2006-Learning a Hidden Hypergraph
17 0.17283133 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
18 0.17181158 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
19 0.16652633 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
20 0.13719666 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
topicId topicWeight
[(8, 0.062), (11, 0.015), (35, 0.01), (36, 0.055), (45, 0.027), (50, 0.106), (62, 0.332), (63, 0.053), (68, 0.02), (76, 0.018), (78, 0.032), (79, 0.015), (81, 0.043), (84, 0.016), (90, 0.016), (91, 0.02), (96, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.76309544 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. Nevertheless, the properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
2 0.41530871 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.40533316 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
4 0.39973199 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
5 0.39522183 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis
6 0.39153659 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
7 0.39068836 66 jmlr-2006-On Model Selection Consistency of Lasso
8 0.3891435 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
9 0.38522318 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
10 0.38345635 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
11 0.38193962 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
12 0.38156962 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
13 0.38094693 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
14 0.38022962 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
15 0.37983677 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
16 0.37951094 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
17 0.37704775 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
18 0.37142137 48 jmlr-2006-Learning Minimum Volume Sets
20 0.36701441 53 jmlr-2006-Learning a Hidden Hypergraph