jmlr jmlr2013 jmlr2013-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka
Abstract: The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation, when the rigorous calculation of the Bayes posterior is intractable. The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis. Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation. In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments in multi-variate analysis. Keywords: variational Bayes, matrix factorization, empirical Bayes, model-induced regularization, probabilistic PCA
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. [sent-16, score-0.131]
2 More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. [sent-17, score-0.358]
3 We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. [sent-18, score-0.131]
4 In this paper, we show that, despite the non-convexity of the optimization problem, the global optimal solution of VBMF can be analytically computed. [sent-53, score-0.131]
5 More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. [sent-54, score-0.358]
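To make the reweighted-SVD form concrete, here is a minimal sketch (our own naming, not code from the paper) of how the estimator would be assembled once the per-component VB weights are known; computing the weights themselves requires the quartic-equation step described below.

```python
import numpy as np

def reweighted_svd(V, gamma_vb):
    """Assemble the VBMF estimator as a reweighted SVD of the observed matrix V.

    V        : (L, M) observed matrix.
    gamma_vb : length-H array of VB weights (zero for truncated components),
               assumed to have been obtained from the quartic equation.
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    H = len(gamma_vb)
    gamma_vb = np.asarray(gamma_vb, dtype=float)
    # Keep the observed singular vectors, but replace the leading H singular
    # values by the VB weights.
    return (U[:, :H] * gamma_vb) @ Vt[:H, :]
```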
6 Again, the optimization problem of empirical VBMF is non-convex, but we show that the global optimal solution of empirical VBMF can still be analytically computed. [sent-57, score-0.131]
7 Those bounds are tight enough to give the exact analytic solution only when the observed matrix is square. [sent-72, score-0.208]
8 In this paper, we conduct a more precise analysis, which results in a quartic equation with its coefficients depending on the observed singular value. [sent-73, score-0.251]
9 Satisfying this quartic equation is a necessary condition for the weight, and further consideration specifies which of the four solutions is the VBMF solution. [sent-74, score-0.231]
10 In summary, we derive the exact global analytic solution for general rectangular cases under the standard matrix-wise independence assumption. [sent-75, score-0.246]
11 We first introduce the framework of Bayesian matrix factorization and the variational Bayesian approximation in Section 2. [sent-77, score-0.154]
12 Then, we analyze the VB free energy, and derive the global analytic solution in Section 3. [sent-78, score-0.274]
13 Without loss of generality, we assume that the diagonal entries of the product C_A C_B are arranged in non-increasing order, that is, c_{a_h} c_{b_h} ≥ c_{a_{h'}} c_{b_{h'}} for any pair h < h'. [sent-111, score-0.555]
14 Therefore, minimizing the free energy (5) amounts to finding the distribution closest to the Bayes posterior in the sense of the KL distance. [sent-132, score-0.155]
15 Using the variational method, we can show that the VB posterior minimizing the free energy (5) under the constraint (6) can be written as r_VB(A, B) = ∏_{m=1}^{M} N_H(a_m; ā_m, Σ_A) ∏_{l=1}^{L} N_H(b_l; b̄_l, Σ_B), (7) where the parameters satisfy Ā = (ā_1, . [sent-137, score-0.243]
16 In this scenario, C_A and C_B are updated in each iteration by the following formulas: c_{a_h}² = ‖ā_h‖²/M + (Σ_A)_{hh}, (12) c_{b_h}² = ‖b̄_h‖²/L + (Σ_B)_{hh}. [sent-151, score-0.263]
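As a rough illustration, the iteration's hyperparameter step (Equation (12) and its counterpart for c_{b_h}²) could be written as below; this is a sketch with our own naming, assuming the posterior means are stored row-wise and the covariances Σ_A, Σ_B are shared across rows as in Equation (7).

```python
import numpy as np

def icm_hyperparameter_update(A_mean, B_mean, Sigma_A, Sigma_B):
    """One hyperparameter update of the standard iterative (ICM-style) algorithm.

    A_mean : (M, H) posterior mean of A;  B_mean : (L, H) posterior mean of B.
    Sigma_A, Sigma_B : (H, H) posterior covariances shared across rows.
    Returns the updated prior variances c_a^2 and c_b^2 (length-H arrays).
    """
    M = A_mean.shape[0]
    L = B_mean.shape[0]
    ca2 = (A_mean ** 2).sum(axis=0) / M + np.diag(Sigma_A)
    cb2 = (B_mean ** 2).sum(axis=0) / L + np.diag(Sigma_B)
    return ca2, cb2
```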
17 In Section 3, we theoretically show that the column-wise independence assumption actually has no effect, before deriving the exact global analytic solution. [sent-166, score-0.172]
18 After that, starting from a proposition given in Nakajima and Sugiyama (2011), we derive the global analytic solution for VBMF (Section 3. [sent-170, score-0.221]
19 Finally, we derive the global analytic solution for the empirical VBMF (Section 3. [sent-172, score-0.221]
20 , c_{a_h} c_{b_h} > c_{a_{h'}} c_{b_{h'}} for any pair h < h'), any solution minimizing the free energy (20) has diagonal Σ_A and Σ_B. [sent-181, score-0.752]
21 When C_A C_B is degenerate, any solution has an equivalent solution with diagonal Σ_A and Σ_B. [sent-182, score-0.188]
22 Obviously, any VBMF solution (minimizer of the free energy (20)) with diagonal covariances is a SimpleVBMF solution (minimizer of the free energy (20) under the constraint that the covariances are diagonal). [sent-185, score-0.482]
23 Actually, any VBMF solution can be obtained by rotating its equivalent SimpleVBMF solution (VBMF solution with diagonal covariances) (see Appendix A. [sent-188, score-0.262]
24 In practice, it is, however, sufficient to focus on the SimpleVBMF solutions, since equivalent solutions share the same free energy F^VB and the same mean prediction BA^⊤. [sent-190, score-0.145]
25 Since we have shown the equivalence between VBMF and SimpleVBMF, we can use the results obtained in Nakajima and Sugiyama (2011), where SimpleVBMF was analyzed, for pursuing the global analytic solution for (non-simple) VBMF. [sent-192, score-0.221]
26 This is a non-convex optimization problem, but we show that the global optimal solution can still be analytically obtained. [sent-202, score-0.131]
27 Then, the global SimpleVB solution (under the column-wise independence (15)) can be expressed as Û^SVB ≡ ⟨BA^⊤⟩_{r_SVB(A,B)} = ∑_{h=1}^{H} γ̂_h^SVB ω_{b_h} ω_{a_h}^⊤. [sent-204, score-0.132]
28 Let γ̲_h = √( (L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²) + √( ((L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²))² − LMσ⁴ ) ). [sent-205, score-0.55]
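Assuming the reconstruction of the threshold above is correct, it could be evaluated numerically as in the following sketch (function name ours); components with γ_h at or below this value are truncated to zero.

```python
import numpy as np

def gamma_lower_threshold(L, M, sigma2, ca2, cb2):
    """Truncation threshold (the lower bound on gamma_h for a non-zero solution).

    sigma2 : noise variance sigma^2;  ca2, cb2 : prior variances c_{a_h}^2, c_{b_h}^2.
    """
    t = (L + M) * sigma2 / 2.0 + sigma2 ** 2 / (2.0 * ca2 * cb2)
    return np.sqrt(t + np.sqrt(t ** 2 - L * M * sigma2 ** 2))
```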
29 When γ_h > γ̲_h, γ̂_h^SVB is given as a positive real solution of γ̂_h² + q_1(γ̂_h)·γ̂_h + q_0 = 0, (22) where q_1(γ̂_h) = ( −(M − L)²(γ_h − γ̂_h) + (L + M)√( (M − L)²(γ_h − γ̂_h)² + 4LMσ⁴/(c_{a_h}² c_{b_h}²) ) ) / (2LM) and q_0 = σ⁴/(c_{a_h}² c_{b_h}²) − (1 − Lσ²/γ_h²)(1 − Mσ²/γ_h²)γ_h². [sent-207, score-0.317]
30 With some algebra, we can convert Equation (22) to a quartic equation, which has four solutions in general. [sent-216, score-0.163]
31 The coefficients of the quartic equation (23) are analytic, so γ̂_h^second can also be obtained analytically, for example, by Ferrari's method (Hazewinkel, 2002). [sent-219, score-0.209]
32 Therefore, the global VB solution can be analytically computed. [sent-220, score-0.131]
33 As discussed in Nakajima and Sugiyama (2011), the ratio c_{a_h}/c_{b_h} is arbitrary in empirical VB. [sent-237, score-0.24]
34 Accordingly, we fix the ratio to c_{a_h}/c_{b_h} = 1 without loss of generality. [sent-238, score-0.24]
35 In practice, one may solve the quartic equation numerically, for example, by the 'roots' function in MATLAB. [sent-240, score-0.209]
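In Python, the same step could look like the sketch below, using numpy.roots in place of MATLAB's roots; the quartic coefficients ξ_3, ξ_2, ξ_1, ξ_0 are assumed to be supplied from Theorem 3, and the second largest real root is returned as prescribed in the text.

```python
import numpy as np

def solve_quartic_second_largest(xi3, xi2, xi1, xi0):
    """Solve f(g) = g^4 + xi3*g^3 + xi2*g^2 + xi1*g + xi0 = 0 and return the
    second largest real root (the candidate VB weight).  The coefficients are
    assumed to come from Theorem 3; they are not derived here."""
    roots = np.roots([1.0, xi3, xi2, xi1, xi0])
    real_roots = np.sort(roots[np.abs(roots.imag) < 1e-10].real)
    if len(real_roots) < 2:
        return 0.0  # degenerate case: treat the component as truncated
    return real_roots[-2]
```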
36 In our latest work on the performance analysis of VBMF, we have derived a simpler-form solution, which does not require solving a quartic equation (Nakajima et al. [sent-242, score-0.209]
37 Nakajima and Sugiyama (2011) obtained a closed-form solution of the optimal hyperparameter value c_{a_h} c_{b_h} for SimpleVBMF. [sent-244, score-0.589]
38 Therefore, we can easily obtain the global analytic solution for empirical VBMF. [sent-245, score-0.221]
39 Here, γ̲_h = (√L + √M)σ, (24) c̆_{a_h}² c̆_{b_h}² = (1/(2LM)) ( γ_h² − (L + M)σ² + √( (γ_h² − (L + M)σ²)² − 4LMσ⁴ ) ), (25) ∆̆_h = M log( γ_h γ̆_h^VB/(Mσ²) + 1 ) + L log( γ_h γ̆_h^VB/(Lσ²) + 1 ) + (1/σ²)( −2γ_h γ̆_h^VB + LM c̆_{a_h}² c̆_{b_h}² ), (26) and γ̆_h^VB is the VB solution for c_{a_h} c_{b_h} = c̆_{a_h} c̆_{b_h}. [sent-247, score-1.104]
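Under the reconstruction above, the empirical VB decision for each component amounts to a threshold test on γ_h plus a sign test on the free-energy difference; a hedged sketch (our naming; ∆̆_h is assumed to be precomputed from Equation (26)) is:

```python
import numpy as np

def empirical_vb_keeps_component(gamma_h, L, M, sigma2, delta_h):
    """Return True if empirical VBMF keeps the h-th component: gamma_h must
    exceed the threshold (sqrt(L) + sqrt(M)) * sigma of Equation (24) and the
    free-energy difference delta_h of Equation (26) must be non-positive."""
    threshold = (np.sqrt(L) + np.sqrt(M)) * np.sqrt(sigma2)
    return bool(gamma_h > threshold and delta_h <= 0.0)
```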
40 However, the second factor depends on the rescaled noise variance σ², and should therefore be considered when σ² is estimated based on the free energy minimization principle. [sent-290, score-0.14]
41 1 Experiment on Artificial Data We compare the standard ICM algorithm and the analytic solution in the empirical VB scenario with unknown noise variance, that is, the hyperparameters (C_A, C_B) and the noise variance σ² are also estimated from observation. [sent-330, score-0.188]
42 The analytic solution consists of applying Theorem 5 combined with a naive 1-dimensional search for the estimation of the noise variance σ². [sent-342, score-0.188]
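Such a naive 1-dimensional search might be organized as in the sketch below; free_energy_fn is a placeholder (not a function from the paper) that is assumed to return the analytically evaluated minimum VB free energy for a candidate σ².

```python
import numpy as np

def estimate_noise_variance(V, sigma2_grid, free_energy_fn):
    """Pick the sigma^2 on a grid that minimizes the (analytically computed)
    VB free energy.  free_energy_fn(V, sigma2) is an assumed callable."""
    energies = [free_energy_fn(V, s2) for s2 in sigma2_grid]
    return sigma2_grid[int(np.argmin(energies))]
```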
43 The analytic solution is plotted by the dashed line (labeled as ‘Analytic’). [sent-343, score-0.188]
44 We see that the analytic solution estimates the true rank H = H* = 20 immediately (∼ 0. [sent-344, score-0.217]
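In such an experiment the estimated rank is simply the number of components whose empirical VB weight survives the thresholding; a trivial sketch (our naming):

```python
import numpy as np

def estimated_rank(gamma_evb, tol=0.0):
    """Count the components kept by empirical VBMF (weights strictly above tol)."""
    return int(np.sum(np.asarray(gamma_evb) > tol))
```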
45 We also conducted experiments on other benchmark data sets, and found that the ICM algorithm generally converges slowly, and sometimes suffers from the local minima problem, while our analytic-form gives the global solution immediately. [sent-374, score-0.136]
46 Overall, the proposed global analytic solution is shown to be a useful alternative to the standard ICM algorithm. [sent-379, score-0.221]
47 This solution can also be obtained by factorizing the quartic equation (23) as follows: lim_{c_{a_h} c_{b_h} → ∞} f(γ̂_h) = ( γ̂_h + √(M/L)(1 − Lσ²/γ_h²)γ_h ) ( γ̂_h − (1 − Mσ²/γ_h²)γ_h ) ( γ̂_h + (1 − Mσ²/γ_h²)γ_h ) ( γ̂_h − √(M/L)(1 − Lσ²/γ_h²)γ_h ) = 0. [sent-404, score-0.814]
48 Since Theorem 3 states that its second largest solution gives the VB estimator for γ_h > lim_{c_{a_h} c_{b_h} → ∞} γ̲_h = √M σ, we have the following corollary: Corollary 1 The global VB solution with the almost flat prior (i. [sent-427, score-0.456]
49 , c_{a_h} c_{b_h} → ∞) is given by lim_{c_{a_h} c_{b_h} → ∞} γ̂_h^VB = γ̂_h^PJS = max(0, 1 − Mσ²/γ_h²) γ_h if γ_h > 0, and 0 otherwise. [sent-429, score-1.046]
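Corollary 1 amounts to a positive-part James-Stein (PJS) type shrinkage of each observed singular value; a direct transcription of the corollary (our naming) is:

```python
def pjs_shrinkage(gamma_h, M, sigma2):
    """Flat-prior limit of the VB solution: max(0, 1 - M*sigma2/gamma_h^2) * gamma_h."""
    if gamma_h <= 0.0:
        return 0.0
    return max(0.0, 1.0 - M * sigma2 / gamma_h ** 2) * gamma_h
```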
50 This is actually the upper bound of the VB solution for arbitrary c_{a_h} c_{b_h} > 0. [sent-431, score-0.589]
51 Since ξ_3 = ξ_1 = 0 (see Theorem 3) in this case, the quartic equation (23) can be solved as a quadratic equation with respect to γ̂_h² (Nakajima and Sugiyama, 2011). [sent-443, score-0.277]
52 We can also find the solution by factorizing the quartic equation (23) for γ_h > √M σ as follows: f^square(γ̂_h) = ( γ̂_h + γ̂_h^PJS + σ²/(c_{a_h} c_{b_h}) )( γ̂_h + γ̂_h^PJS − σ²/(c_{a_h} c_{b_h}) )( γ̂_h − γ̂_h^PJS + σ²/(c_{a_h} c_{b_h}) )( γ̂_h − γ̂_h^PJS − σ²/(c_{a_h} c_{b_h}) ) = 0. [sent-444, score-2.343]
53 Using Theorem 3, we have the following corollary: Corollary 2 When L = M, the global VB solution is given by γ̂_h^{VB-square} = max(0, γ̂_h^PJS − σ²/(c_{a_h} c_{b_h})). (37) [sent-445, score-0.622]
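For the square case, Corollary 2 can be transcribed directly; the sketch below (our naming, assuming L = M) applies the model-induced PJS shrinkage and then the prior-induced trace-norm shrinkage.

```python
def vb_solution_square(gamma_h, M, sigma2, ca, cb):
    """Closed-form VB solution for L = M: PJS shrinkage followed by subtraction
    of sigma^2 / (c_a * c_b), clipped at zero."""
    if gamma_h <= 0.0:
        return 0.0
    gamma_pjs = max(0.0, 1.0 - M * sigma2 / gamma_h ** 2) * gamma_h
    return max(0.0, gamma_pjs - sigma2 / (ca * cb))
```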
54 Equation (37) shows that, in this case, MIR and prior-induced regularization (PIR) can be completely decomposed—the estimator is equipped with the model-induced PJS shrinkage (γ̂_h^PJS) and the prior-induced trace-norm shrinkage (−σ²/(c_{a_h} c_{b_h})). [sent-446, score-0.275]
55 γ2 h By using Corollary 2 and Corollary 3, respectively, we can easily compute the VB and the empirical VB solutions in this case without a quartic solver. [sent-449, score-0.163]
56 , an MF model for L = M = H = 1) with σ² = 1 and c_a = c_b = 100 (almost flat priors), when the observed values are V = 0 (left), V = 1 (middle), and V = 2 (right), respectively. [sent-582, score-0.213]
57 , an MF model for L = M = H = 1) with σ² = 1 and c_a = c_b = 100 (almost flat priors). [sent-589, score-0.213]
58 3 Extensions In this paper, we derived the global analytic solution of VBMF, by fully making use of the assumptions that the likelihood and priors are both spherical Gaussian, and that the observed matrix has no missing entry. [sent-604, score-0.261]
59 To obtain the VB solution of robust PCA, we have proposed a novel algorithm where the analytic VBMF solution is applied to partial problems (Nakajima et al. [sent-616, score-0.262]
60 2 Tensor Factorization We have shown that the VB solution under matrix-wise independence essentially agrees with the SimpleVB solution under column-wise independence. [sent-622, score-0.173]
61 In our preliminary study, we found that the analytic VB solution for tensor factorization is not attainable, at least not in the same way as for MF. [sent-624, score-0.272]
62 However, we have found that the optimal solution has diagonal covariances for the core tensor in Tucker decomposition (Nakajima, 2012), which would allow us to greatly simplify inference algorithms and reduce necessary memory storage and computational costs. [sent-625, score-0.153]
63 With correlated priors, the posterior is no longer uncorrelated and thus it is not straightforward in general to obtain the global solution from the results obtained in this paper. [sent-629, score-0.139]
64 Accordingly, the following procedure gives the global solution analytically: the analytic solution given the diagonal (C_A', C_B') is first computed, and the above transformation is then applied. [sent-632, score-0.335]
65 However, one may use our analytic solution as a subroutine, for example, in the soft-thresholding step of Soft-Impute (Mazumder et al. [sent-639, score-0.188]
66 In this paper, we focused on the matrix factorization (MF) problem with no missing entry, and showed that this weakness could be overcome by analytically computing the global optimal solution. [sent-645, score-0.146]
67 We discussed the possibility that our analytic solution can be used as a building block of novel algorithms for more general problems. [sent-649, score-0.188]
68 In the following, we prove that, in Case 1, ΣA and ΣB are diagonal for any solution (A, B, ΣA , ΣB ), and that, in other cases, any solution has its equivalent solution with diagonal ΣA and ΣB . [sent-662, score-0.302]
69 1 Proof for Case 1 Here, we consider the case when c_{a_h} c_{b_h} > c_{a_{h'}} c_{b_{h'}} for any pair h < h'. [sent-674, score-0.515]
70 Then, the free energy (20) can be written as a function of Ω: 1 1/2 1/2 −1 −1 F VB (Ω) = tr CA CB ΩCA B∗⊤ B∗ + LΣ∗ CA Ω⊤ + const. [sent-678, score-0.157]
71 ΣB = CB Ω′CB Then, the free energy as a function of Ω′ is given by 1 1/2 1/2 −1 −1 F VB (Ω′ ) = tr CA CB Ω′CB A∗⊤ A∗ + MΣ∗ CB Ω′⊤ + const. [sent-686, score-0.157]
72 In the following, we will prove that any solution with diagonal Σ_A has diagonal Σ_B. [sent-695, score-0.154]
73 Then, the free energy can be written as a function of Ω: 1 −1/2 −1/2 F VB (Ω) = tr ΓΩΓ−1/2CA A∗⊤ A∗ + MΣ∗ CA Γ−1/2 Ω⊤ A 2 1/2 1/2 +c−1 Γ−1 ΩΓ1/2CA B∗⊤ B∗ + LΣ∗ CA Γ1/2 Ω⊤ . [sent-703, score-0.157]
74 Thus, we have proved that any solution has its equivalent solution with diagonal covariances, which completes the proof for Case 2. [sent-709, score-0.188]
75 3 Proof for Case 3 Finally, we consider the case when c_{a_h} c_{b_h} = c_{a_{h'}} c_{b_{h'}} for (not all but) some pairs h ≠ h'. [sent-711, score-0.515]
76 First, in the same way as for Case 1, we can prove that Σ_A and Σ_B are block diagonal, where the blocks correspond to the groups sharing the same c_{a_h} c_{b_h}. [sent-712, score-0.555]
77 Next, we can apply the proof for Case 2 to each block, and show that any solution has its equivalent solution with diagonal ΣA and ΣB . [sent-713, score-0.188]
78 Then, the global VB solution (under the matrix-wise independence (6)) can be expressed as Û^VB ≡ ⟨BA^⊤⟩_{r_VB(A,B)} = ∑_{h=1}^{H} γ̂_h^VB ω_{b_h} ω_{a_h}^⊤. [sent-722, score-0.132]
79 Let γ̲_h = √( (L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²) + √( ((L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²))² − LMσ⁴ ) ). [sent-723, score-0.55]
80 When γ_h > γ̲_h, γ̂_h^VB is given as a positive real solution of γ̂_h² + q_1(γ̂_h)·γ̂_h + q_0 = 0, (51) where q_1(γ̂_h) = ( −(M − L)²(γ_h − γ̂_h) + (L + M)√( (M − L)²(γ_h − γ̂_h)² + 4LMσ⁴/(c_{a_h}² c_{b_h}²) ) ) / (2LM) and q_0 = σ⁴/(c_{a_h}² c_{b_h}²) − (1 − Lσ²/γ_h²)(1 − Mσ²/γ_h²)γ_h². [sent-725, score-0.317]
81 Any positive solution of Equation (51) lying in the range (54) satisfies the quartic equation (23), and lies in the following range: 0 < γ̂_h < ξ_0^{1/4}. (61) [sent-736, score-0.283]
82 Conversely, any positive solution of the quartic equation (23) lying in the range (61) satisfies Equation (51), and lies in the range (54). [sent-737, score-0.283]
83 Lemma 8 and Lemma 9 imply that finding the VB solution is achieved by finding a positive solution of the quartic equation (23) lying in the range (61). [sent-738, score-0.357]
84 Investigating the shape of the quartic function f (γh ), defined in Equation (23), we have the following lemma (the proof is given in Appendix D. [sent-739, score-0.167]
85 The quartic equation (23) has two positive real solutions. [sent-741, score-0.209]
86 When γ_h ≤ γ̲_h, the means and the variances of the VB posterior for the h-th component are given by (ā_h, b̄_h, (Σ_A)_{h,h}, (Σ_B)_{h,h}) = (0, 0, (σ_{a_h}²)^null, (σ_{b_h}²)^null). [sent-745, score-0.144]
87 When γ_h > γ̲_h, the means and the variances of the VB posterior for the h-th component are given by (ā_h, b̄_h, (Σ_A)_{h,h}, (Σ_B)_{h,h}) = (±√(γ̂_h δ̂_h) ω_{a_h}, ±√(γ̂_h δ̂_h^{-1}) ω_{b_h}, σ̂_{a_h}², σ̂_{b_h}²). [sent-747, score-0.144]
88 Otherwise, c_{a_h} c_{b_h} → 0 is the only local minimum. [sent-755, score-0.515]
89 It was also shown in Nakajima and Sugiyama (2011) that the (scaled) free energy difference between the two local minima is given by ∆̆_h (the positive local minimum with c_{a_h} c_{b_h} = c̆_{a_h} c̆_{b_h} gives lower free energy than the null local minimum with c_{a_h} c_{b_h} → 0 if and only if ∆̆_h ≤ 0). [sent-756, score-1.843]
90 Thus, we have the following lemma: Lemma 13 The hyperparameter c_{a_h} c_{b_h} that globally minimizes the VB free energy (20) is given by c_{a_h} c_{b_h} = c̆_{a_h} c̆_{b_h} if γ_h > γ̲_h and ∆̆_h ≤ 0. [sent-757, score-1.668]
91 By using Equation (60), we have η_h² − σ⁴/(c_{a_h}² c_{b_h}²) = (1 − Lσ²/γ_h²)(1 − Mσ²/γ_h²)γ_h² − σ⁴/(c_{a_h}² c_{b_h}²) = γ_h^{-2}(γ_h² − γ̲_h²)(γ_h² − γ́_h²), (67) where γ́_h = √( (L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²) − √( ((L + M)σ²/2 + σ⁴/(2 c_{a_h}² c_{b_h}²))² − LMσ⁴ ) ). [sent-791, score-1.065]
92 This means that the solution also satisfies its equivalent equation (51). [sent-811, score-0.142]
93 3 Proof of Lemma 10 We will investigate the shape of the quartic function (23), f(γ̂_h) := γ̂_h⁴ + ξ_3 γ̂_h³ + ξ_2 γ̂_h² + ξ_1 γ̂_h + ξ_0. (23) [sent-814, score-0.141]
94 Since the coefficient of the quartic term is positive (equal to one), f(γ̂_h) goes to infinity as γ̂_h → −∞ or γ̂_h → ∞. [sent-815, score-0.141]
95 The shape of the quartic function f (γh ) is shown in Figure 9. [sent-825, score-0.141]
96 Note that the points at which f (γh ) crosses the horizontal axis are the solutions of the quartic equation (23). [sent-826, score-0.231]
97 Figure 9: The shape of a quartic function f(γ̂_h) := γ̂_h⁴ + ξ_3 γ̂_h³ + ξ_2 γ̂_h² + ξ_1 γ̂_h + ξ_0, where ξ_2 < 0, ξ_0 (= f(0)) > 0, and f(ξ_0^{1/4}) < 0. (The plot marks the inflection points and the root γ̂_h^second.) [sent-828, score-0.141]
98 Local minima, symmetry-breaking, and pruning in variational free energy minimization. [sent-998, score-0.204]
99 Global solution of fully-observed variational Bayesian matrix factorization is column-wise independent. [sent-1041, score-0.228]
100 Fast variational Bayesian inference for non-conjugate matrix factorization models. [sent-1104, score-0.154]
wordName wordTfidf (topN-words)
[('icm', 0.498), ('vb', 0.342), ('nakajima', 0.306), ('cbh', 0.275), ('vbmf', 0.252), ('cah', 0.24), ('quartic', 0.141), ('sugiyama', 0.136), ('cb', 0.136), ('babacan', 0.135), ('ah', 0.131), ('inimlss', 0.123), ('iniml', 0.117), ('iniran', 0.117), ('analytic', 0.114), ('bh', 0.112), ('olution', 0.105), ('omioka', 0.105), ('lobal', 0.09), ('nalytic', 0.09), ('ugiyama', 0.09), ('mf', 0.084), ('lm', 0.081), ('ca', 0.077), ('simplevbmf', 0.076), ('solution', 0.074), ('atrix', 0.074), ('energy', 0.07), ('factorization', 0.069), ('equation', 0.068), ('variational', 0.065), ('ba', 0.056), ('free', 0.053), ('cacb', 0.053), ('pjs', 0.053), ('rl', 0.05), ('bayesian', 0.047), ('fro', 0.047), ('svb', 0.047), ('rrr', 0.045), ('singular', 0.042), ('ih', 0.042), ('simplevb', 0.041), ('ilin', 0.04), ('diagonal', 0.04), ('asvb', 0.035), ('bsvb', 0.035), ('tomioka', 0.035), ('tr', 0.034), ('global', 0.033), ('posterior', 0.032), ('osterior', 0.03), ('minima', 0.029), ('evb', 0.029), ('rvb', 0.029), ('spontaneous', 0.029), ('raiko', 0.029), ('rank', 0.029), ('pca', 0.028), ('sec', 0.027), ('lemma', 0.026), ('bishop', 0.026), ('independence', 0.025), ('covariances', 0.024), ('analytically', 0.024), ('mir', 0.023), ('null', 0.023), ('bayes', 0.023), ('bl', 0.023), ('nl', 0.022), ('solutions', 0.022), ('tokyo', 0.021), ('matrix', 0.02), ('reinsel', 0.02), ('modes', 0.02), ('teh', 0.02), ('priors', 0.02), ('iteration', 0.02), ('yy', 0.019), ('breaking', 0.019), ('svd', 0.018), ('sh', 0.018), ('rh', 0.018), ('xy', 0.018), ('derin', 0.018), ('evbmf', 0.018), ('konstan', 0.018), ('rsvb', 0.018), ('ryota', 0.018), ('shinichi', 0.018), ('velu', 0.018), ('worsley', 0.018), ('ml', 0.017), ('rescaled', 0.017), ('hotelling', 0.017), ('pruning', 0.016), ('lim', 0.016), ('il', 0.015), ('tensor', 0.015), ('marshall', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
Author: Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka
Abstract: The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation, when the rigorous calculation of the Bayes posterior is intractable. The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis. Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation. In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments in multi-variate analysis. Keywords: variational Bayes, matrix factorization, empirical Bayes, model-induced regularization, probabilistic PCA
2 0.11452314 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
Author: Edward Challis, David Barber
Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
3 0.077534661 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari
Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation
4 0.067586891 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
Author: Pierre Neuvial
Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the BenjaminiHochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m−k/(2k+1) , where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided tests statistics for Gaussian and Laplace location models, and for the Student model. Keywords: multiple testing, false discovery rate, Benjamini Hochberg’s procedure, power, criticality, plug-in procedures, adaptive control, test statistics distribution, convergence rates, kernel estimators
5 0.057653204 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
6 0.054804336 121 jmlr-2013-Variational Inference in Nonconjugate Models
7 0.046883773 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.038281973 15 jmlr-2013-Bayesian Canonical Correlation Analysis
9 0.036852434 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
10 0.036022536 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
11 0.033933051 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
12 0.033684902 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
13 0.032272305 90 jmlr-2013-Quasi-Newton Method: A New Direction
14 0.03131872 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
15 0.03018246 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
16 0.029489541 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
17 0.025761919 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
18 0.025414577 120 jmlr-2013-Variational Algorithms for Marginal MAP
19 0.024759982 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.022619728 59 jmlr-2013-Large-scale SVD and Manifold Learning
topicId topicWeight
[(0, -0.138), (1, -0.142), (2, 0.058), (3, -0.021), (4, -0.013), (5, -0.037), (6, 0.003), (7, -0.048), (8, 0.011), (9, -0.004), (10, 0.063), (11, 0.033), (12, 0.027), (13, -0.031), (14, -0.048), (15, 0.045), (16, -0.047), (17, -0.053), (18, -0.072), (19, 0.029), (20, 0.111), (21, -0.049), (22, 0.118), (23, -0.051), (24, -0.142), (25, -0.293), (26, 0.083), (27, -0.227), (28, 0.203), (29, -0.21), (30, 0.105), (31, 0.154), (32, -0.059), (33, -0.219), (34, -0.062), (35, 0.075), (36, 0.129), (37, 0.009), (38, 0.127), (39, 0.103), (40, 0.024), (41, 0.051), (42, -0.02), (43, 0.052), (44, -0.033), (45, 0.037), (46, 0.038), (47, 0.016), (48, -0.012), (49, -0.12)]
simIndex simValue paperId paperTitle
same-paper 1 0.93677974 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
Author: Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka
Abstract: The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation, when the rigorous calculation of the Bayes posterior is intractable. The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis. Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation. In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments in multi-variate analysis. Keywords: variational Bayes, matrix factorization, empirical Bayes, model-induced regularization, probabilistic PCA
2 0.66118371 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
Author: Pierre Neuvial
Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the BenjaminiHochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m−k/(2k+1) , where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided tests statistics for Gaussian and Laplace location models, and for the Student model. Keywords: multiple testing, false discovery rate, Benjamini Hochberg’s procedure, power, criticality, plug-in procedures, adaptive control, test statistics distribution, convergence rates, kernel estimators
3 0.44642407 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
Author: Edward Challis, David Barber
Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
4 0.30099291 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
Author: Michael Chertkov, Adam B. Yedidia
Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows
5 0.26558134 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
6 0.25287735 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
7 0.2357447 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.22957817 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
9 0.22537905 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
10 0.22182724 114 jmlr-2013-The Rate of Convergence of AdaBoost
11 0.22170392 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
12 0.18069203 90 jmlr-2013-Quasi-Newton Method: A New Direction
13 0.17475128 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
14 0.16885003 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.16151625 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
16 0.16092882 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
17 0.15414526 121 jmlr-2013-Variational Inference in Nonconjugate Models
18 0.1521982 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
19 0.14496963 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
20 0.1391544 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
topicId topicWeight
[(0, 0.019), (5, 0.121), (6, 0.039), (10, 0.062), (20, 0.01), (23, 0.014), (53, 0.011), (61, 0.48), (68, 0.021), (70, 0.023), (75, 0.04), (85, 0.019), (87, 0.016), (93, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.68647611 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
Author: Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka
Abstract: The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation, when the rigorous calculation of the Bayes posterior is intractable. The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis. Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation. In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments in multi-variate analysis. Keywords: variational Bayes, matrix factorization, empirical Bayes, model-induced regularization, probabilistic PCA
2 0.59658545 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
Author: Matthew J. Urry, Peter Sollich
Abstract: We consider learning on graphs, guided by kernels that encode similarity between vertices. Our focus is on random walk kernels, the analogues of squared exponential kernels in Euclidean spaces. We show that on large, locally treelike graphs these have some counter-intuitive properties, specifically in the limit of large kernel lengthscales. We consider using these kernels as covariance functions of Gaussian processes. In this situation one typically scales the prior globally to normalise the average of the prior variance across vertices. We demonstrate that, in contrast to the Euclidean case, this generically leads to significant variation in the prior variance across vertices, which is undesirable from a probabilistic modelling point of view. We suggest the random walk kernel should be normalised locally, so that each vertex has the same prior variance, and analyse the consequences of this by studying learning curves for Gaussian process regression. Numerical calculations as well as novel theoretical predictions for the learning curves using belief propagation show that one obtains distinctly different probabilistic models depending on the choice of normalisation. Our method for predicting the learning curves using belief propagation is significantly more accurate than previous approximations and should become exact in the limit of large random graphs. Keywords: Gaussian process, generalisation error, learning curve, cavity method, belief propagation, graph, random walk kernel
3 0.29767597 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
5 0.29213935 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
7 0.2911644 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
8 0.29076436 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
9 0.28985485 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
10 0.28946263 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.28922978 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
12 0.28916132 76 jmlr-2013-Nonparametric Sparsity and Regularization
13 0.28891245 73 jmlr-2013-Multicategory Large-Margin Unified Machines
14 0.2886419 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
15 0.28680092 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
16 0.28607592 15 jmlr-2013-Bayesian Canonical Correlation Analysis
17 0.28593585 68 jmlr-2013-Machine Learning with Operational Costs
18 0.28505388 41 jmlr-2013-Experiment Selection for Causal Discovery
19 0.28503612 59 jmlr-2013-Large-scale SVD and Manifold Learning
20 0.28495958 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference