jmlr jmlr2011 jmlr2011-94 knowledge-graph by maker-knowledge-mining

94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization


Source: pdf

Author: Shinichi Nakajima, Masashi Sugiyama

Abstract: Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach. Thus, no trace-norm shrinkage remains when priors are non-informative. On the other hand, we show a counter-intuitive fact that the PJS shrinkage factor is kept activated even with flat priors. This is shown to be induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios where hyperparameters are also learned based on the VB free energy. Throughout the paper, we assume no missing entry in the observed matrix, and therefore collaborative filtering is out of scope. Keywords: matrix factorization, variational Bayes, empirical Bayes, positive-part James-Stein shrinkage, non-identifiable model, model-induced regularization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 2 Full-Bayesian Matrix Factorization (FBMF) and Its Empirical Variant (EFBMF) We use Gaussian priors on the parameters A and B: φ(U) = φ_A(A) φ_B(B), where φ_A(A) ∝ exp(−∑_{h=1}^H ‖a_h‖²/(2c_{a_h}²)) = exp(−tr(A C_A^{−1} A^⊤)/2), (2) and φ_B(B) ∝ exp(−∑_{h=1}^H ‖b_h‖²/(2c_{b_h}²)) = exp(−tr(B C_B^{−1} B^⊤)/2). (3) [sent-102, score-1.003]

2 Here, a_h and b_h are the h-th column vectors of A and B, respectively, that is, A = (a_1, . . . , a_H) and B = (b_1, . . . , b_H). [sent-103, score-0.986]

3 Without loss of generality, we assume that the product c_{a_h} c_{b_h} is non-increasing with respect to h. [sent-111, score-0.753]

4 The hyperparameters c_{a_h} and c_{b_h} may be determined so that the Bayes free energy F(V) is minimized. [sent-123, score-0.801]

5 In the MAP framework, one determines the parameters A and B so that the Bayes posterior p(A, B|V) is maximized (equivalently, the negative log posterior is minimized). [sent-132, score-0.816]

6 In our model (1), (2), and (3), the negative log of the Bayes posterior is expressed as −log p(A, B|V) = (LM/2) log σ² + (1/2) ∑_{h=1}^H ( M log c_{a_h}² + L log c_{b_h}² + ‖a_h‖²/c_{a_h}² + ‖b_h‖²/c_{b_h}² ) + (1/(2σ²)) ‖V − ∑_{h=1}^H b_h a_h^⊤‖²_Fro + Const. (17) [sent-168, score-2.225]

7 Differentiating Equation (17) with respect to A and B and setting the derivatives to zero, we have the following conditions: a_h = (‖b_h‖² + σ²/c_{a_h}²)^{−1} (V − ∑_{h′≠h} b_{h′} a_{h′}^⊤)^⊤ b_h, (18) and b_h = (‖a_h‖² + σ²/c_{b_h}²)^{−1} (V − ∑_{h′≠h} b_{h′} a_{h′}^⊤) a_h. (19) [sent-169, score-4.128]
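
Conditions (18) and (19) suggest a simple coordinate-wise alternating scheme. Below is a minimal NumPy sketch of such an iteration; the function name, random initialization, and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np

def map_alternating(V, sigma2, c_a, c_b, H, n_iter=100, seed=0):
    """Alternate the stationary conditions (18)-(19) component-wise.
    V is L x M, A is M x H (columns a_h), B is L x H (columns b_h);
    c_a[h] and c_b[h] are the prior scales c_{a_h} and c_{b_h}."""
    L, M = V.shape
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((M, H))
    B = 0.1 * rng.standard_normal((L, H))
    for _ in range(n_iter):
        for h in range(H):
            # residual with the h-th rank-one component left out
            R = V - B @ A.T + np.outer(B[:, h], A[:, h])
            A[:, h] = (R.T @ B[:, h]) / (B[:, h] @ B[:, h] + sigma2 / c_a[h] ** 2)  # (18)
            B[:, h] = (R @ A[:, h]) / (A[:, h] @ A[:, h] + sigma2 / c_b[h] ** 2)    # (19)
    return A, B
```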

8 The MAP estimator U^MAP is given by U^MAP = ∑_{h=1}^H γ_h^MAP ω_{b_h} ω_{a_h}^⊤, where γ_h^MAP = max(0, γ_h − σ²/(c_{a_h} c_{b_h})). (21) [sent-182, score-0.78]

9 The theorem implies that the MAP solution cuts off the singular values less than σ²/(c_{a_h} c_{b_h}); otherwise it reduces the singular values by σ²/(c_{a_h} c_{b_h}) (see Figure 2). [sent-183, score-0.797]
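
Equation (21) can be applied directly to the singular value decomposition of the observation. The following NumPy sketch assumes, purely for brevity, that all components share one product c_ab = c_{a_h} c_{b_h}; the function name is mine.

```python
import numpy as np

def map_estimate(V, sigma2, c_ab):
    """MAP estimate of U: shrink each singular value of V by sigma2 / c_ab
    and cut it off at zero, as in Equation (21)."""
    W, gamma, Zt = np.linalg.svd(V, full_matrices=False)
    gamma_map = np.maximum(0.0, gamma - sigma2 / c_ab)
    return (W * gamma_map) @ Zt  # components with gamma_h <= sigma2 / c_ab vanish
```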

10 More precisely, Equations (21) and (22) show that the product c_{a_h} c_{b_h} → ∞ is sufficient for MAP to be reduced to ML, which is weaker than requiring both c_{a_h}, c_{b_h} → ∞. [sent-198, score-1.506]

11 μ_{a_h}, μ_{b_h}, Σ_{a_h}, and Σ_{b_h} satisfy μ_{a_h} = (1/σ²) Σ_{a_h} (V − ∑_{h′≠h} μ_{b_{h′}} μ_{a_{h′}}^⊤)^⊤ μ_{b_h}, (23) μ_{b_h} = (1/σ²) Σ_{b_h} (V − ∑_{h′≠h} μ_{b_{h′}} μ_{a_{h′}}^⊤) μ_{a_h}, (24) Σ_{a_h} = ((‖μ_{b_h}‖² + tr(Σ_{b_h}))/σ² + c_{a_h}^{−2})^{−1} I_M, (25) and Σ_{b_h} = ((‖μ_{a_h}‖² + tr(Σ_{a_h}))/σ² + c_{b_h}^{−2})^{−1} I_L. (26) [sent-209, score-5.499]
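
These four conditions can be iterated to a fixed point. The sketch below assumes the isotropic covariance forms Σ_{a_h} = s_{a_h} I_M and Σ_{b_h} = s_{b_h} I_L that (25) and (26) imply; the initialization, sweep order, and iteration count are illustrative, not the authors' reference implementation.

```python
import numpy as np

def vb_iterate(V, sigma2, c_a, c_b, H, n_iter=200, seed=0):
    """Fixed-point iteration of Equations (23)-(26); returns the VB estimate
    of U = B A^T built from the posterior means. V is L x M."""
    L, M = V.shape
    rng = np.random.default_rng(seed)
    mu_a = 0.1 * rng.standard_normal((M, H))   # columns mu_{a_h}
    mu_b = 0.1 * rng.standard_normal((L, H))   # columns mu_{b_h}
    s_a = np.ones(H)                           # Sigma_{a_h} = s_a[h] * I_M
    s_b = np.ones(H)                           # Sigma_{b_h} = s_b[h] * I_L
    for _ in range(n_iter):
        for h in range(H):
            # residual with component h left out
            R = V - mu_b @ mu_a.T + np.outer(mu_b[:, h], mu_a[:, h])
            s_a[h] = 1.0 / ((mu_b[:, h] @ mu_b[:, h] + L * s_b[h]) / sigma2 + c_a[h] ** -2)  # (25)
            mu_a[:, h] = (s_a[h] / sigma2) * (R.T @ mu_b[:, h])                              # (23)
            s_b[h] = 1.0 / ((mu_a[:, h] @ mu_a[:, h] + M * s_a[h]) / sigma2 + c_b[h] ** -2)  # (26)
            mu_b[:, h] = (s_b[h] / sigma2) * (R @ mu_a[:, h])                                # (24)
    return mu_b @ mu_a.T
```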

12 When γ_h > √(Mσ²), γ_h^VB (= ‖μ_{a_h}‖ ‖μ_{b_h}‖) is bounded as max(0, (1 − Mσ²/γ_h²) γ_h − √(M/L) σ²/(c_{a_h} c_{b_h})) ≤ γ_h^VB < (1 − Mσ²/γ_h²) γ_h. (28) [sent-218, score-0.753]

13 Theorem 2 states that, in the limit of c_{a_h} c_{b_h} → ∞, the lower bound agrees with the upper bound and we have lim_{c_{a_h} c_{b_h} → ∞} γ_h^VB = max(0, 1 − Mσ²/γ_h²) γ_h if γ_h > 0, and 0 otherwise. (29) [sent-221, score-0.753]
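
In this flat-prior limit the estimator therefore keeps only the per-component PJS-type factor. A small sketch of Equation (29), assuming strictly positive singular values:

```python
import numpy as np

def pjs_limit_shrinkage(gamma, M, sigma2):
    """Flat-prior limit of the VB shrinkage, Equation (29): gamma_h survives
    only if gamma_h^2 > M * sigma2 and is then multiplied by
    (1 - M * sigma2 / gamma_h^2)."""
    gamma = np.asarray(gamma, dtype=float)
    return np.maximum(0.0, 1.0 - M * sigma2 / gamma ** 2) * gamma
```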

14 Let us compare the behavior of the VB solution (29) with that of the MAP solution (21) when c_{a_h} c_{b_h} → ∞. [sent-227, score-0.779]

15 In contrast, VB offers PJS-type regularization even when c_{a_h} c_{b_h} → ∞. [sent-229, score-0.753]

16 The value of γ_h below which the VB estimator γ_h^VB is null is monotone decreasing with respect to c_{a_h} c_{b_h}, and is lower-bounded by its limit √(Mσ²) as c_{a_h} c_{b_h} → ∞. [sent-239, score-0.753]

17 Since this threshold exceeds σ²/(c_{a_h} c_{b_h}), the corresponding threshold of γ_h^MAP in Equation (21), VB has a stronger shrinkage effect than MAP in terms of the vanishing condition of singular values. [sent-241, score-0.806]

18 We can derive another upper bound of γ_h^VB, which depends on the hyperparameters c_{a_h} and c_{b_h} (its proof is also included in Appendix C): Theorem 4 When γ_h > √(Mσ²), γ_h^VB is upper-bounded as γ_h^VB ≤ √((1 − Lσ²/γ_h²)(1 − Mσ²/γ_h²)) γ_h − σ²/(c_{a_h} c_{b_h}). (31) [sent-242, score-0.776]

19 When L = M and γ_h > √(Mσ²), the lower bound in Equation (28) and the upper bound in Equation (31) agree with each other. [sent-243, score-0.753]

20 Thus, we have an analytic-form expression of γ_h^VB as follows: γ_h^VB = max(0, (1 − Mσ²/γ_h²) γ_h − σ²/(c_{a_h} c_{b_h})) if γ_h > 0, and 0 otherwise. (32) [sent-244, score-0.753]
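
For the square case L = M this gives a fully explicit estimator combining both shrinkage types. A short sketch of Equation (32), assuming positive singular values and, for brevity, a common product c_ab = c_{a_h} c_{b_h}:

```python
import numpy as np

def vb_square_case(gamma, M, sigma2, c_ab):
    """Analytic VB shrinkage for L = M, Equation (32): PJS factor
    (1 - M*sigma2/gamma_h^2) applied to gamma_h, minus the trace-norm
    shift sigma2 / c_ab, cut off at zero."""
    gamma = np.asarray(gamma, dtype=float)
    return np.maximum(0.0, (1.0 - M * sigma2 / gamma ** 2) * gamma - sigma2 / c_ab)
```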

21 (Fragments of Equations (35) and (36), which involve γ_h^VB, σ²/(c_{a_h} c_{b_h}), and the term 4σ²M.) [sent-246, score-0.753]

22 EMAPMF. In the EMAPMF framework, the hyperparameters c_{a_h} and c_{b_h} are determined so that the Bayes posterior p(A, B|V) is maximized (equivalently, the negative log posterior is minimized). [sent-247, score-0.816]

23 c_{a_h}² = ‖a_h‖²/M, (37) and c_{b_h}² = ‖b_h‖²/L. (38) Alternating Equations (18), (19), (37), and (38), one may learn the parameters A, B and the hyperparameters c_{a_h}, c_{b_h} at the same time. [sent-250, score-0.768]

24 (2007), EMAPMF does not work properly since its objective (17) is unbounded from below at a_h, b_h = 0 and c_{a_h}, c_{b_h} → 0. [sent-252, score-0.768]

25 Differentiating Equation (39) with respect to c_{a_h}² and c_{b_h}² and setting the derivatives to zero, we obtain the following optimality conditions: c_{a_h}² = (‖μ_{a_h}‖² + tr(Σ_{a_h}))/M, (40) and c_{b_h}² = (‖μ_{b_h}‖² + tr(Σ_{b_h}))/L. (41) [sent-256, score-0.986]

26 Here, we observe the invariance of Equation (39) with respect to the transform (μ_{a_h}, μ_{b_h}, Σ_{a_h}, Σ_{b_h}, c_{a_h}², c_{b_h}²) → (s_h^{1/2} μ_{a_h}, s_h^{−1/2} μ_{b_h}, s_h Σ_{a_h}, s_h^{−1} Σ_{b_h}, s_h c_{a_h}², s_h^{−1} c_{b_h}²) (42) for any {s_h ∈ R; s_h > 0, h = 1, . . . , H}. [sent-257, score-1.05]

27 Then, Equations (40) and (41) yield the expressions (44) and (45) for c_{a_h}² and c_{b_h}², written in terms of ‖μ_{a_h}‖² + tr(Σ_{a_h}), ‖μ_{b_h}‖² + tr(Σ_{b_h}), and LM. [sent-262, score-4.315]

28 One may learn the parameters A, B and the hyperparameters c_{a_h}, c_{b_h} by applying Equations (44) and (45) after every iteration of Equations (23)–(26) (this gives a local minimum of Equation (39) at convergence). [sent-263, score-0.768]
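
The empirical-Bayes step itself is simple once a VB sweep has produced the posterior moments. The sketch below re-estimates the prior scales from the raw optimality conditions (40) and (41); the gauge-fixed forms (44) and (45), which additionally resolve the scale ambiguity (42), are omitted here for brevity, and the function name is mine.

```python
import numpy as np

def evb_update_scales(mu_a, mu_b, s_a, s_b, L, M):
    """Re-estimate the prior scales after a VB sweep, using
    c_{a_h}^2 = (||mu_{a_h}||^2 + tr(Sigma_{a_h})) / M   (40)
    c_{b_h}^2 = (||mu_{b_h}||^2 + tr(Sigma_{b_h})) / L   (41)
    with isotropic covariances Sigma_{a_h} = s_a[h] I_M, Sigma_{b_h} = s_b[h] I_L."""
    c2_a = (np.sum(mu_a ** 2, axis=0) + M * s_a) / M
    c2_b = (np.sum(mu_b ** 2, axis=0) + L * s_b) / L
    return np.sqrt(c2_a), np.sqrt(c2_b)
```

In an EVB loop one would call this after every sweep of the updates (23)–(26) and feed the returned scales back into the next sweep.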

29 Thus, even when the hyperparameters c_{a_h} and c_{b_h} are learned from data by EVB, the same upper bound as in the fixed-hyperparameter VB case holds. [sent-279, score-0.768]

30 Furthermore, from Equation (32), we can confirm that the upper bound (49) is equivalent to the VB solution when c_{a_h} c_{b_h} = γ_h/M. [sent-288, score-0.766]

31 If γ_h ≥ 2√(M)σ and ϕ(γ_h) ≤ 0, then the EVB estimator of c_{a_h} c_{b_h} is given by c_{a_h}^EVB c_{b_h}^EVB = γ_h ρ_+. [sent-293, score-0.78]

32 The EVB posterior is obtained by Corollary 1 with (c_{a_h}², c_{b_h}²) = (c_{a_h}^EVB c_{b_h}^EVB, c_{a_h}^EVB c_{b_h}^EVB). [sent-295, score-1.005]

33 Furthermore, when γ_h ≥ √(7M) σ, it holds that ϕ(γ_h) < 0. [sent-296, score-0.972]

34 Note that γ^EVB is theoretically bounded as 2σ² = γ_lower^EVB ≤ γ^EVB ≤ γ_upper^EVB = √7 σ² ≈ 2.65σ². [sent-570, score-0.986]

35 EMAPMF always results in the trivial solution, A, B = 0 and c_{a_h}, c_{b_h} → 0. [sent-690, score-0.753]

36 The model (1) and the priors (2) and (3) are invariant under the following parameter transformation: (a_h, b_h, c_{a_h}, c_{b_h}) → (s_h^{1/2} a_h, s_h^{−1/2} b_h, s_h^{1/2} c_{a_h}, s_h^{−1/2} c_{b_h}) for any {s_h ∈ R; s_h > 0, h = 1, . . . , H}. [sent-702, score-1.062]

37 Let us double Equation (17) and neglect some constant terms which are irrelevant to its minimization with respect to {a_h, b_h}_{h=1}^H: L^MAP({a_h, b_h}_{h=1}^H) = ∑_{h=1}^H (‖a_h‖²/c_{a_h}² + ‖b_h‖²/c_{b_h}²) + (1/σ²) ‖V − ∑_{h=1}^H b_h a_h^⊤‖²_Fro. [sent-789, score-2.608]

38 There exists at least one minimizer that can be written as a_h = a_h ω_{a_h}, (63) b_h = b_h ω_{b_h}, (64) where the a_h and b_h on the right-hand side are scalars such that γ_h = a_h b_h ≥ 0. [sent-805, score-3.394]

39 , H} such that c_{a_h} c_{b_h} = c_{a_{h′}} c_{b_{h′}} if and only if h and h′ belong to the same group (i. [sent-814, score-0.753]

40 , H, L_h^MAP(a_h, b_h) = a_h²/c_{a_h}² + b_h²/c_{b_h}² + (1/σ²)(γ_h − a_h b_h)². [sent-829, score-0.753]

41 This can be written as L_h^MAP(a_h, b_h) = (a_h/c_{a_h} − b_h/c_{b_h})² + (1/σ²)(a_h b_h − (γ_h − σ²/(c_{a_h} c_{b_h})))² + (2γ_h c_{a_h} c_{b_h} − σ²)/(c_{a_h} c_{b_h})². (65) [sent-830, score-0.753]

42 The third term is constant with respect to a_h and b_h. [sent-831, score-0.986]

43 The first nonnegative term vanishes by setting the ratio a_h/b_h to c_{a_h}/c_{b_h}, that is, a_h/c_{a_h} = b_h/c_{b_h} (or b_h = 0). (66) [sent-832, score-2.725]

44 Minimizing the second term in Equation (65), which is quadratic with respect to the product a_h b_h (≥ 0), we can easily obtain Equation (21), which completes the proof. [sent-833, score-0.994]

45 = ∑_{h=1}^H ( −log|Σ_{a_h}| + (‖μ_{a_h}‖² + tr(Σ_{a_h}))/c_{a_h}² − log|Σ_{b_h}| + (‖μ_{b_h}‖² + tr(Σ_{b_h}))/c_{b_h}² ) + (1/σ²) ‖V − ∑_{h=1}^H μ_{b_h} μ_{a_h}^⊤‖²_Fro + (1/σ²) ∑_{h=1}^H ( ‖μ_{a_h}‖² tr(Σ_{b_h}) + tr(Σ_{a_h}) ‖μ_{b_h}‖² + tr(Σ_{a_h}) tr(Σ_{b_h}) ). [sent-838, score-3.547]

46 Given fixed {(Σ_{a_h}, Σ_{b_h})}, the objective function (67) is of the same form as Equation (60) if we replace {(c_{a_h}², c_{b_h}²)} in Equation (60) with {(c′_{a_h}², c′_{b_h}²)} defined by c′_{a_h}² = (1/c_{a_h}² + tr(Σ_{b_h})/σ²)^{−1}, (70) and c′_{b_h}² = (1/c_{b_h}² + tr(Σ_{a_h})/σ²)^{−1}. (71) [sent-853, score-2.733]

47 where L_h^VB(μ_{a_h}, μ_{b_h}, σ_{a_h}², σ_{b_h}²) = −M log σ_{a_h}² − L log σ_{b_h}² + (μ_{a_h}² + Mσ_{a_h}²)/c_{a_h}² + (μ_{b_h}² + Lσ_{b_h}²)/c_{b_h}² − (2γ_h μ_{a_h} μ_{b_h})/σ² + (1/σ²)(μ_{a_h}² + Mσ_{a_h}²)(μ_{b_h}² + Lσ_{b_h}²). [sent-866, score-1.377]

48 (83) σ2 + cah cbh (M − L) cah cbh + σ2 + cah cbh (M − L) cah cbh 2 Next, we investigate the positive stationary points, assuming that µah = 0, µbh = 0. [sent-877, score-3.081]

49 The following lemma also holds (its proof is given in Appendix G. [sent-891, score-0.783]

50 We neglect such solutions, because they almost surely do not exist; Equations (70), (71), (35), and (36) together imply that any pair {(h, h′); h ≠ h′} such that max(γ_h^VB, γ_{h′}^VB) > 0 and c′_{a_h} c′_{b_h} = c′_{a_{h′}} c′_{b_{h′}} can exist only when c_{a_h} c_{b_h} = c_{a_{h′}} c_{b_{h′}} and γ_h = γ_{h′} (i. [sent-913, score-0.753]

51 a a b b H = ∑ log h=1 c2 c2M µah 2 + tr(Σah ) µbh 2 + tr(Σbh ) ah b + + log h + |Σah | c2h |Σbh | c2h a b H 1 + 2 V − ∑ µ bh µ ⊤ ah σ h=1 H + 1 ∑ σ2 h=1 µ ah 2 2 Fro tr(Σbh ) + tr(Σah ) µbh 2 + tr(Σah )tr(Σbh ) . [sent-919, score-2.144]

52 We divide the domain (105) into two regions (see Figure 12): ˚ R = (µah , µbh , σ2h , σ2h , c2h , c2h ) ∈ R2 × R2 × R2 ; cah cbh ≤ κ , a a ++ ++ b b ˘ R = (µah , µbh , σ2h , σ2h , c2h , c2h ) ∈ R2 × R2 × R2 ; cah cbh > κ . [sent-951, score-1.506]

53 a a b b ˚ (µah ,µbh ,σ2 ,σ2 ,c2 ,c2 )∈R ah bh ah bh (109) ˘ and the infimum over R , EVB ˘ Lh = ˘ (µah ,µbh ,σ2 ,σ2 ,c2 ,c2 )∈R ah bh ah bh ˚ Rigorously speaking, no minimizer over R exists. [sent-953, score-3.963]

54 To make discussion simple, we approximate ˚ R by its subregion with an arbitrary accuracy; for any ε (0 < ε < κ), we define an ε-margin subregion ˚ of R : ˚ ˚ Rε = (µah , µbh , σ2h , σ2h , c2h , c2h ) ∈ R ; cah cbh ≥ ε . [sent-954, score-0.753]

55 We evaluate the difference between the objectives: EVB ˚ EVB ∆h (˘ ah , µbh , σ2h , σ2h , c2h , c2h ) = Lh (˘ ah , µbh , σ2h , σ2h , c2h , c2h ) − L h . [sent-989, score-1.138]

56 ): Lemma 25 ∆h (˘ ah , µbh , σ2h , σ2h , c2h , c2h ) is upper-bounded as µ ˘ ˘ a ˘ b ˘a ˘b ∆h (˘ ah , µbh , σ2h , σ2h , c2h , c2h ) < Mψ(α, β), µ ˘ ˘ a ˘ b ˘a ˘b (127) where ψ(α, β) = log β + α log β − (1 − α) + (1 − α) + α 2 1− √ (α+ α+1) β − β, (128) L , (129) M γ2 (130) β = h2. [sent-993, score-1.158]

57 In this case, Equation (122) leads to cah cbh = ˘ ˘ γh ρ+ . [sent-1006, score-0.753]

58 Corollary 1 provides the exact values for the positive stationary points (μ̆_{a_h}, μ̆_{b_h}, σ̆_{a_h}², σ̆_{b_h}²), given (c_{a_h}², c_{b_h}²) = (c̆_{a_h} c̆_{b_h}, c̆_{a_h} c̆_{b_h}). [sent-1008, score-1.391]

59 This is minimized when d_h = γ_h c_{a_h} c_{b_h}, and the minimum coincides with the right-hand side of Equation (61), which completes the proof. [sent-1043, score-0.774]

60 Lemma 7 guarantees that a_h = √(c_{a_h} γ_h / c_{b_h}) ω_{a_h} and b_h = √(c_{b_h} γ_h / c_{a_h}) ω_{b_h} give a solution for the problem (135) for any (so far unknown) set of {γ_h}, which completes the proof. [sent-1060, score-2.513]

61 Further, (a ) (bh ) }) = ∞, (a ) (bh ) }) = ∞, lim L VB ({ah , bh , τm h , τl (a ) τm h →0 lim L VB ({ah , bh , τm h , τl (a ) τm h →∞ for any (h, m), because (x − log x) ≥ 1 for any x > 0, limx→+0 (x − log x) = ∞, and limx→∞ (x − (b ) log x) = ∞. [sent-1077, score-0.884]

62 (142) µbh µah = γh − η2 + σ2 (M − L) + h Subtraction of Equation (142) from Equation (141) gives 2σ2 (M − L)µah µbh + 2σ4 Lµb Mµah − 2 h 2 µ cah bh cbh µah = 2σ2 (M − L)γh , which is equivalent to Equation (88). [sent-1106, score-1.17]

63 7 Proof of Lemma 13 Squaring both sides of Equation (86) (which are positive) and substituting Equation (87) into it, we have γ2 + h + σ2 cah cbh cah cbh δh + cah cb δh γh h σ4 σ2 L − 1− 2 c2h c2h γh a b 1− σ2 M 2 γh γ2 h = 0. [sent-1121, score-1.958]

64 (146) Multiplying both sides of Equation (88) by δh (> 0) and solving it with respect to δh , we obtain δh = (M − L)(γh − γh ) + LM (M − L)2 (γh − γh )2 + 4σ c2 c2 4 ah bh 2σ2 Mc−2 ah (147) as a positive solution. [sent-1122, score-1.567]

65 γ2 cah cbh h The left-hand side can be factorized as κ2 − LMσ4 γ−2 γ2 − κ + h h κ2 − LMσ4 γ2 − κ − h > 0, (148) where κ= σ4 (L + M)σ2 + 2 2 . [sent-1140, score-0.762]

66 The partial derivatives of Equation (73) are given by VB µa 1 ∂Lh = 2h + 2 ∂µah cah −γh µbh + (µ2h + Lσ2h )µah b b VB 1 ∂Lh µb = 2h + 2 ∂µbh cbh −γh µah + (µ2h + Mσ2h )µbh a a σ2 σ2 , . [sent-1145, score-0.753]

67 Let us lower-bound Equation (94) by ignoring the positive h term 4σ4 LM/(c2h c2h ): a b LM −(M − L)2 (γh − γh ) + (L + M) (M − L)2 (γh − γh )2 + 4σ c2 c2 4 q1 (γh ) = > ah bh 2LM (M − L)2 (γh − γh )2 −(M − L)2 (γh − γh ) + (L + M) 2LM = 1− L (γh − γh ). [sent-1165, score-0.998]

68 Let us upper-bound Equation (94) by using the relation h x2 + y2 ≤ x2 + y2 + 2xy ≤ x + y for x, y ≥ 0: LM −(M − L)2 (γh − γh ) + (L + M) (M − L)2 (γh − γh )2 + 4σ c2 c2 4 q1 (γh ) = ≤ ah bh 2LM −(M − L)2 (γh − γh ) + (L + M) 2LM √ LM L (γh − γh ) + = 1− M 2LMcah cbh σ2 (L + M) L . [sent-1171, score-1.357]

69 (γh − γh ) + √ = 1− M LMcah cbh 2σ2 (L + M) 2630 √ LM c h bh 2 (M − L)(γh − γh ) + 2σa c T HEORETICAL A NALYSIS OF BAYESIAN M ATRIX FACTORIZATION We also upper-bound Equation (95) by adding a non-negative term (M − L)σ2 Lcah cbh √ σ2 LM 1 + cah cbh γh . [sent-1172, score-1.912]

70 Then we can obtain a lower-bound of γh : γh ≥ γlo , h where γlo is the larger solution of the following equation: h 2 L(γlo ) + (M − L)γh + h + σ2 (L + M) M/L cah cbh γlo h σ4 M(M − L) M/L σ2 L M 2 σ4 −M 1− 2 + γh cah cbh Lc2h c2h γh a b 1− σ2 M 2 γh = 0. [sent-1173, score-1.519]

71 γ2 h This can be factorized as γlo − 1 − h σ2 M/L σ2 M γh + cah cbh γ2 h Lγlo + M 1 − h σ2 M M/L σ2 L γh + cah cbh γ2 h Thus, the larger solution of this equation, γlo = 1 − h σ2 M/L σ2 M , γh − cah cbh γ2 h gives the lower-bound in Equation (28). [sent-1174, score-2.281]

72 The coefficient of the second term of Equation (146), σ2 cah cbh cah cbh δh + cah cb δh , h is minimized when δh = cah . [sent-1175, score-2.327]

73 cbh Then we can obtain another upper-bound of γh : ′up γh ≤ γh , ′up where γh is the larger solution of the following equation: ′up (γh )2 + 2σ2 cah cbh ′up γh + σ2 L σ4 − 1− 2 c2h c2h γh a b 2631 1− σ2 M 2 γh = 0. [sent-1176, score-1.137]

74 NAKAJIMA AND S UGIYAMA This can be factorized as ′up γh − 1− σ2 L γ2 h 1− σ2 M σ2 γh + cah cbh γ2 h ′up × γh + 1− σ2 L γ2 h 1− σ2 M σ2 γh + cah cbh γ2 h = 0. [sent-1178, score-1.515]

75 Thus, the larger solution of this equation, ′up γh = 1− σ2 L γ2 h 1− σ2 M σ2 , γh − cah cbh γ2 h gives the upper-bound in Equation (31). [sent-1179, score-0.766]

76 σ2 (154) where λa,k (cah cbh ) = 1 2M(cah cbh )k − + λb,k (cah cbh ) = 1 2L(cah cbh )k σ2 − cah cbh (M − L) cah cbh σ2 − cah cbh (M − L) cah cbh 2 + 4Mσ2 , σ2 + cah cbh (M − L) cah cbh − + σ2 + cah cbh (M − L) cah cbh 2 + 4Lσ2 . [sent-1185, score-7.508]

77 Note that λa,k > 0, λb,k > 0 for any k, and that Equation (154) depends on c2h and c2h only through a b their product cah cbh . [sent-1186, score-0.753]

78 2M 2L T HEORETICAL A NALYSIS OF BAYESIAN M ATRIX FACTORIZATION Since they are increasing with respect to x, λa,1 and λb,1 are decreasing with respect to cah cbh . [sent-1189, score-0.753]

79 Further, λa,1 and λb,1 are upper-bounded as λa,1 (cah cbh ) < λb,1 (cah cbh ) < lim λa,1 (cah cbh ) = lim λ′ (x) = 1, a,1 lim λb,1 (cah cbh ) = lim λ′ (x) = 1. [sent-1190, score-1.524]

80 (156) cah cbh →+0 Similarly, using the same decreasing mapping, we have λ′ (x) · λ′ (x) = a,0 b,0 σ2 (x + (L + M)) − 2LM (x + (L + M))2 − 4LM . [sent-1192, score-0.753]

81 Since this is decreasing with respect to x and lower-bounded by zero, λa,0 λb,0 is increasing with respect to cah cbh and lower-bounded as λa,0 (cah cbh ) · λb,0 (cah cbh ) > lim cah cbh →+0 λa,0 (cah cbh ) · λb,0 (cah cbh ) = lim λ′ (x) · λ′ (x) = 0. [sent-1193, score-3.01]

82 a,0 b,0 x→∞ Therefore, the third term in Equation (154) is increasing with respect to cah cbh , and lower-bounded as LMλa,0 λb,0 LMλa,0 λb,0 > lim = 0. [sent-1194, score-0.763]

83 Now we have found that Equation (154) is increasing with respect to c_{a_h} c_{b_h}, because it consists of increasing terms. [sent-1195, score-1.506]

84 Equations (114) and (115) minimize c_{a_h} c_{b_h} over R̊ε when Equation (43) is adopted. [sent-1196, score-0.753]

85 σ4 h Using this bound, we have 1− σ2 L γ2 h 1− σ2 M σ2 γh − cah cbh ´ ´ γ2 h γ2 − (L + M)σ2 + h > LMσ4 − γ2 h γ2 − (L + M)σ2 h > 0. [sent-1232, score-0.753]

86 Substituting Equations (76), (77), (119), and (120) into (164), we have 1 EVB H = 2  1 σ2 a  γ −2µ h µ  h ah bh  σ2  µa   − c4h a  γh −2µah µbh σ2 1 σ2 b h 0 h 0 µb − c4h bh 2636 µa − c4h ah 0 M 2c4 a h 0  0  µb  − c4h  bh  . [sent-1244, score-2.389]

87 σ2 (166) Similarly from Equation (75), we obtain µ2h b = σ2 h b By using Equations (78), (84), (159), (165), and (166), we obtain 1 1 EVB H = 4 4 2 cah cbh Since γ2 γh γh h − 2 4 c4 cah bh 2σ M δ−2 Lδ2 + 4 cah c4h b + LM γh γh − γ2 h σ4 . [sent-1248, score-1.934]

88 (167) √ M δ−2 Lδ2 2 LM + 4 ≥ 2 2 cah c4h cah cbh b for any δ2 > 0, Equation (167) is upper-bounded by √ γ2 1 γh γh LM LM 1 EVB h ≤ 4 4 − 2 2 2 + 4 γh γh − γ2 H h 4 c4 2 σ cah cbh cah bh σ cah cbh √ √ γh LM LM 1 1 = 4 4 − 2 + 2 2 c2 2 c2 σ σ cah cbh cah bh cah bh √ LM γh − 2 γh . [sent-1249, score-5.791]

89 Therefore, Equation (168) is written as √ √ 1 LM LM 1 ´ EVB ≤C H γh − 2 γh , + 2 2 c2 2 σ σ cah ´bh ´ 2637 (168) NAKAJIMA AND S UGIYAMA with a positive factor γh C= 4 4 cah cbh ´ ´ √ LM 1 − 2 2 c2 σ cah ´bh ´ . [sent-1251, score-1.529]

90 cah cbh ´ ´ 1− Mσ2 γ2 h γh At the last inequality, we neglected the negative last term in the curly braces. [sent-1253, score-0.753]

91 Using Equation (163), we have 1 ´ EVB < −C′ ( f (γh ) − g(γh )), H 2 (169) where γ2C h , 2σ2 cah cbh ´ ´ √ √ ( M − L)2 σ2 f (γh ) = 1 − γ2 h C′ = g(γh ) = 2 1− Lσ2 γ2 h × 1− 1− + (L + M)σ2 1− γ2 h 2 − 4LMσ4 , γ4 h Mσ2 γ2 h (L + M)σ2 γ2 h 2638 + 1− (L + M)σ2 γ2 h 2 − 4LMσ4 . [sent-1254, score-0.753]

92 2 Lσ L cah cbh ˘ ˘ σ σ2 x2 − y2 > x − y for x > y > 0, Equation (122) yields c2h c2h ˘a ˘b √ γ2 − (L + M + LM)σ2 h . [sent-1267, score-0.753]

93 a b Since the lower-bound in Equation (28) is nondecreasing with respect to cah cbh , substituting Equation (173) into Equation (28) yields   σ2 M γh − γh ≥ max 0, 1 − 2  γh It holds that − σ2 M >− γh σ2 M . [sent-1282, score-0.766]

94 Let us consider a partially minimized objective function: L̃_h^EVB(c_{a_h} c_{b_h}) = min_{(μ_{a_h}, μ_{b_h}, σ_{a_h}², σ_{b_h}²)} L_h^EVB(μ_{a_h}, μ_{b_h}, σ_{a_h}², σ_{b_h}², c_{a_h} c_{b_h}, c_{a_h} c_{b_h}). [sent-1289, score-1.885]

95 Figure 13 depicts the partially minimized objective function (175) when L = M = H = 1, σ² = 1, and V = 1. [sent-1291, score-0.761]

96 In this case, the objective function (175) has a stationary point at c_{a_h} c_{b_h} = 1. [sent-1305, score-0.826]

97 In this case, there exists a large positive stationary point (which is a local minimum) at c_{a_h} c_{b_h} ≈ 1.37. [sent-1310, score-0.83]

98 as well as a small positive stationary point (which is a local maximum) at c_{a_h} c_{b_h} ≈ 0. [sent-1311, score-0.83]

99 The convergence L_h(c_{a_h} c_{b_h}) → L + M (= 2) as c_{a_h} c_{b_h} → 0 is observed (see Equation (176)). [sent-1329, score-1.124]

100 As Lemma 28 states, a large positive stationary point at c_{a_h} c_{b_h} ≈ 2. [sent-1334, score-0.83]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ah', 0.569), ('bh', 0.417), ('cah', 0.382), ('cbh', 0.371), ('evb', 0.334), ('vb', 0.18), ('lm', 0.098), ('nakajima', 0.076), ('lh', 0.071), ('heoretical', 0.067), ('mf', 0.06), ('cb', 0.057), ('stationary', 0.057), ('ugiyama', 0.053), ('vbmf', 0.052), ('pjs', 0.046), ('atrix', 0.044), ('nalysis', 0.037), ('map', 0.036), ('factorization', 0.035), ('tr', 0.034), ('fb', 0.034), ('shrinkage', 0.032), ('equation', 0.031), ('mapmf', 0.029), ('estimator', 0.027), ('ml', 0.025), ('equations', 0.025), ('cevb', 0.025), ('null', 0.024), ('ca', 0.024), ('emapmf', 0.023), ('lemma', 0.022), ('jeffreys', 0.022), ('rb', 0.022), ('singular', 0.021), ('evbmf', 0.021), ('fro', 0.021), ('osterior', 0.021), ('bayes', 0.02), ('energy', 0.02), ('watanabe', 0.02), ('posterior', 0.019), ('ba', 0.019), ('minimizer', 0.019), ('ra', 0.018), ('fvb', 0.018), ('priors', 0.017), ('raiko', 0.017), ('sh', 0.016), ('hyperparameters', 0.015), ('aca', 0.015), ('bcb', 0.015), ('efbmf', 0.015), ('fbmf', 0.015), ('jef', 0.015), ('bayesian', 0.014), ('minimizers', 0.014), ('substituting', 0.013), ('solution', 0.013), ('dh', 0.013), ('free', 0.013), ('baldi', 0.013), ('efb', 0.013), ('emap', 0.013), ('positive', 0.012), ('posteriors', 0.012), ('sp', 0.01), ('js', 0.01), ('collaborative', 0.01), ('log', 0.01), ('solutions', 0.01), ('lim', 0.01), ('estimators', 0.01), ('srebro', 0.01), ('contour', 0.01), ('stein', 0.01), ('factorized', 0.009), ('lo', 0.009), ('tensor', 0.009), ('hyperbolic', 0.009), ('objective', 0.008), ('proof', 0.008), ('rl', 0.008), ('modes', 0.008), ('activated', 0.008), ('amap', 0.008), ('bmap', 0.008), ('morris', 0.008), ('point', 0.008), ('completes', 0.008), ('appendix', 0.007), ('ab', 0.007), ('mum', 0.007), ('ltering', 0.007), ('hessian', 0.007), ('variational', 0.007), ('im', 0.007), ('svd', 0.006), ('elucidate', 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization

Author: Shinichi Nakajima, Masashi Sugiyama

Abstract: Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach. Thus, no trace-norm shrinkage remains when priors are non-informative. On the other hand, we show a counter-intuitive fact that the PJS shrinkage factor is kept activated even with flat priors. This is shown to be induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios where hyperparameters are also learned based on the VB free energy. Throughout the paper, we assume no missing entry in the observed matrix, and therefore collaborative filtering is out of scope. Keywords: matrix factorization, variational Bayes, empirical Bayes, positive-part James-Stein shrinkage, non-identifiable model, model-induced regularization

2 0.051169764 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

Author: Julian J. McAuley, TibĂŠrio S. Caetano

Abstract: Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm or loopy belief-propagation. The exact solution to this problem is well-known to be exponential in the size of the maximal cliques of the triangulated model, while approximate inference is typically exponential in the size of the model’s factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist only of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications that deal with grid-, tree-, and ring-structured models. In such cases, we are able to decrease the exponent of complexity for message-passing by 0.5 for both exact and approximate inference. We demonstrate that message-passing operations in such models are equivalent to some variant of matrix multiplication in the tropical semiring, for which we offer an O(N 2.5 ) expected-case solution. Keywords: graphical models, belief-propagation, tropical matrix multiplication

3 0.044963695 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known √ smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates

4 0.032056771 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

5 0.027026227 61 jmlr-2011-Logistic Stick-Breaking Process

Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson

Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation

6 0.021063088 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.01811029 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

8 0.016077746 41 jmlr-2011-Improved Moves for Truncated Convex Models

9 0.015621962 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

10 0.01476714 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

11 0.014523215 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

12 0.014274917 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

13 0.013299636 92 jmlr-2011-The Stationary Subspace Analysis Toolbox

14 0.012451424 91 jmlr-2011-The Sample Complexity of Dictionary Learning

15 0.011930295 6 jmlr-2011-A Simpler Approach to Matrix Completion

16 0.011284634 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

17 0.011175361 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

18 0.010717414 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

19 0.01026577 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

20 0.010243078 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.062), (1, -0.006), (2, -0.035), (3, 0.027), (4, 0.015), (5, -0.0), (6, -0.002), (7, -0.001), (8, 0.014), (9, -0.013), (10, -0.03), (11, 0.009), (12, 0.051), (13, 0.022), (14, 0.089), (15, -0.071), (16, 0.048), (17, 0.037), (18, -0.078), (19, -0.016), (20, 0.049), (21, -0.104), (22, -0.007), (23, 0.118), (24, 0.183), (25, 0.111), (26, 0.192), (27, 0.094), (28, 0.247), (29, 0.149), (30, 0.083), (31, -0.011), (32, 0.153), (33, -0.043), (34, 0.017), (35, 0.361), (36, 0.001), (37, 0.281), (38, 0.123), (39, 0.052), (40, -0.081), (41, 0.213), (42, -0.128), (43, 0.145), (44, 0.041), (45, -0.001), (46, -0.167), (47, -0.014), (48, -0.271), (49, -0.237)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98307103 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization

Author: Shinichi Nakajima, Masashi Sugiyama

Abstract: Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach. Thus, no trace-norm shrinkage remains when priors are non-informative. On the other hand, we show a counter-intuitive fact that the PJS shrinkage factor is kept activated even with flat priors. This is shown to be induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios where hyperparameters are also learned based on the VB free energy. Throughout the paper, we assume no missing entry in the observed matrix, and therefore collaborative filtering is out of scope. Keywords: matrix factorization, variational Bayes, empirical Bayes, positive-part James-Stein shrinkage, non-identifiable model, model-induced regularization

2 0.32867128 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

Author: Julian J. McAuley, TibĂŠrio S. Caetano

Abstract: Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm or loopy belief-propagation. The exact solution to this problem is well-known to be exponential in the size of the maximal cliques of the triangulated model, while approximate inference is typically exponential in the size of the model’s factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist only of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications that deal with grid-, tree-, and ring-structured models. In such cases, we are able to decrease the exponent of complexity for message-passing by 0.5 for both exact and approximate inference. We demonstrate that message-passing operations in such models are equivalent to some variant of matrix multiplication in the tropical semiring, for which we offer an O(N 2.5 ) expected-case solution. Keywords: graphical models, belief-propagation, tropical matrix multiplication

3 0.19371431 61 jmlr-2011-Logistic Stick-Breaking Process

Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson

Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation

4 0.18166612 92 jmlr-2011-The Stationary Subspace Analysis Toolbox

Author: Jan Saputra Müller, Paul von Bünau, Frank C. Meinecke, Franz J. Király, Klaus-Robert Müller

Abstract: The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components. The SSA Toolbox is a platform-independent efficient stand-alone implementation of the SSA algorithm with a graphical user interface written in Java, that can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported and exported from comma separated values (CSV) and Matlab’s .mat files. Keywords: non-stationarities, blind source separation, dimensionality reduction, unsupervised learning

5 0.16123761 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

Author: Stefano Melacci, Mikhail Belkin

Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization

6 0.14949225 104 jmlr-2011-X-Armed Bandits

7 0.11803591 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.11431471 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

9 0.099372096 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

10 0.093827568 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

11 0.09096235 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

12 0.087624684 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

13 0.083845124 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

14 0.072163172 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

15 0.06458652 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

16 0.062928006 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

17 0.061572742 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

18 0.060387932 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

19 0.059730731 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

20 0.058227457 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.034), (9, 0.022), (10, 0.018), (24, 0.019), (31, 0.041), (32, 0.035), (41, 0.016), (70, 0.014), (73, 0.037), (78, 0.08), (83, 0.506), (90, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72604668 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization

Author: Shinichi Nakajima, Masashi Sugiyama

Abstract: Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach. Thus, no trace-norm shrinkage remains when priors are non-informative. On the other hand, we show a counter-intuitive fact that the PJS shrinkage factor is kept activated even with flat priors. This is shown to be induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios where hyperparameters are also learned based on the VB free energy. Throughout the paper, we assume no missing entry in the observed matrix, and therefore collaborative filtering is out of scope. Keywords: matrix factorization, variational Bayes, empirical Bayes, positive-part James-Stein shrinkage, non-identifiable model, model-induced regularization

2 0.29297307 28 jmlr-2011-Double Updating Online Learning

Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin

Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification

3 0.20385915 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

4 0.20258425 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

5 0.19942179 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez

Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA

6 0.19901609 35 jmlr-2011-Forest Density Estimation

7 0.19862685 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

8 0.19815269 22 jmlr-2011-Differentially Private Empirical Risk Minimization

9 0.19703527 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

10 0.19548269 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

11 0.19518454 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

12 0.1950547 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.19494997 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

14 0.19484921 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

15 0.19479886 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

16 0.19411522 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

17 0.19347574 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

18 0.19344886 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

19 0.19213946 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

20 0.19178128 104 jmlr-2011-X-Armed Bandits