jmlr jmlr2005 jmlr2005-55 knowledge-graph by maker-knowledge-mining

55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection


Source: pdf

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). [sent-14, score-0.613]

2 We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements. [sent-17, score-0.348]

3 In this paper, we introduce the matrix exponentiated gradient update which works as follows: First, the matrix logarithm of the current parameter matrix is computed. [sent-37, score-0.385]

4 Our update preserves symmetry and positive definiteness because the matrix exponential maps any symmetric matrix to a symmetric positive definite matrix. [sent-40, score-0.38]

5 For example, the gradient descent update is derived from the squared Euclidean distance, and the exponentiated gradient update from the Kullback-Leibler divergence (relative entropy). [sent-45, score-0.358]

6 In this work we use the von Neumann divergence (also called quantum relative entropy) for measuring the discrepancy between two positive definite matrices (Nielsen and Chuang, 2000). [sent-46, score-0.307]

7 We derive a new matrix exponentiated gradient update from this divergence (which is a Bregman divergence for symmetric positive definite matrices). [sent-47, score-0.477]

8 Finally we prove relative loss bounds using the von Neumann divergence as a measure of progress. [sent-48, score-0.235]

9 For any such matrix A ∈ Rd×d , exp A and log A denote the matrix exponential and logarithm, respectively. [sent-60, score-0.339]

10 In the case of symmetric matrices, the matrix exponential operation can be computed using the eigenvalue decomposition A = V Λ V^⊤, where V is an orthonormal matrix with the eigenvectors of A as columns and Λ the diagonal matrix of eigenvalues. [sent-65, score-0.307]

11 The matrix logarithm log A is defined as the inverse function of exp A, which does not always exist for arbitrary A. [sent-67, score-0.272]

12 However, when A is symmetric and strictly positive definite, log A is computed as log A := V (log Λ) V^⊤, where (log Λ)i,i = log Λi,i. [sent-68, score-0.366]
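
A minimal numpy sketch of these two operations (not from the paper; the helper names are mine): both exp A and log A are computed through the eigenvalue decomposition A = V Λ V^⊤, exactly as described above.

```python
import numpy as np

def expm_sym(A):
    """Matrix exponential of a symmetric matrix via A = V diag(lam) V^T."""
    lam, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(lam)) @ V.T

def logm_spd(A):
    """Matrix logarithm of a symmetric, strictly positive definite matrix."""
    lam, V = np.linalg.eigh(A)
    if np.any(lam <= 0):
        raise ValueError("log A requires A to be strictly positive definite")
    return V @ np.diag(np.log(lam)) @ V.T
```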

13 Throughout the paper log a and exp a denote the natural logarithm and exponential of scalar “a”. [sent-69, score-0.205]

14 If W is symmetric and X an arbitrary matrix, then tr(W X) = tr(W (X + X^⊤)/2) + tr(W (X − X^⊤)/2) = tr(W sym(X)). [sent-92, score-0.724]

15 Proof Assume A is eigen-decomposed as A = V Λ V^⊤, where Λ is the diagonal matrix of eigenvalues and V is an orthogonal matrix with the eigenvectors of A as columns. [sent-101, score-0.209]

16 2 For any positive semi-definite symmetric matrix A ∈ R^{d×d} and any two symmetric matrices B, C ∈ R^{d×d}, B ⪯ C implies tr(AB) ≤ tr(AC). [sent-108, score-0.314]

17 Our main choice of F is F(W ) = tr(W log W − W ), which is called von Neumann entropy or quantum entropy. [sent-124, score-0.227]

18 The Bregman divergence corresponding to this choice of F is the von Neumann divergence or quantum relative entropy (e. [sent-127, score-0.299]

19 , Nielsen and Chuang, 2000): ∆F(W̃, W) = tr(W̃ log W̃ − W̃ log W − W̃ + W). [sent-129, score-0.188]

20 Symmetric positive definite matrices of trace one are related to density matrices commonly used in Statistical Physics. [sent-131, score-0.225]

21 For normalized parameters the divergence simplifies to ∆F(W̃, W) = tr(W̃ log W̃ − W̃ log W). [sent-132, score-0.259]
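
A small numpy sketch of this (normalized) divergence, assuming both arguments are strictly positive definite with trace one; the function name is mine, not the paper's.

```python
import numpy as np

def von_neumann_divergence(W_tilde, W):
    """Delta_F(W~, W) = tr(W~ log W~ - W~ log W) for trace-one, strictly positive definite inputs."""
    def logm(A):
        lam, V = np.linalg.eigh(A)
        return V @ np.diag(np.log(lam)) @ V.T
    return float(np.trace(W_tilde @ (logm(W_tilde) - logm(W))))
```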

22 If W = ∑i λi vi vi^⊤ is our notation for the eigenvalue decomposition, then the von Neumann entropy becomes F(W) = ∑i λi log λi. [sent-133, score-0.211]

23 , ṽi = vi), then the divergence becomes the usual relative entropy between the eigenvalues: ∆F(W̃, W) = ∑i λ̃i log(λ̃i/λi). [sent-139, score-0.284]

24 3 Rotation Invariance One can visualize a symmetric positive definite matrix W = ∑i λi vi vi^⊤ = V Λ V^⊤ as an ellipse, where the eigenvectors vi are the axes of the ellipse and the square-roots of the eigenvalues (i. [sent-141, score-0.293]

25 That is, for any orthonormal matrix U, the von Neumann divergence has the property that ∆F(W̃, W) = ∆F(U^⊤ W̃ U, U^⊤ W U). [sent-145, score-0.213]

26 There is a second important divergence between symmetric positive definite matrices that is invariant under the simultaneous rotation of both eigen systems (2. [sent-154, score-0.275]

27 It is a Bregman divergence based on the strictly convex function F(W ) = − log det(W ) (e. [sent-156, score-0.196]

28 Also since f(W) = ∇W F(W) = −(W^{-1})^⊤ = −W^{-1}, the Bregman divergence becomes: ∆F(W̃, W) = log(det(W)/det(W̃)) + tr(W^{-1} W̃) − d = ∑i log(λi/λ̃i) + tr(W^{-1} W̃) − d, where d is the dimension of the parameter matrices. [sent-160, score-0.259]
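
For comparison, here is an equally small sketch of the LogDet divergence just described (function name mine), again assuming strictly positive definite inputs; slogdet is used to avoid overflow in the determinants.

```python
import numpy as np

def logdet_divergence(W_tilde, W):
    """Delta_F(W~, W) = log(det W / det W~) + tr(W^{-1} W~) - d for F(W) = -log det(W)."""
    d = W.shape[0]
    _, logdet_W = np.linalg.slogdet(W)
    _, logdet_Wt = np.linalg.slogdet(W_tilde)
    return float(logdet_W - logdet_Wt + np.trace(np.linalg.solve(W, W_tilde)) - d)
```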

29 Notice that in this case, F(W) is essentially minus the log of the volume of the ellipse W, and the LogDet divergence is the relative entropy between two multidimensional Gaussians with fixed mean and covariance matrices W̃ and W, respectively (see Singer and Warmuth, 1999). [sent-162, score-0.294]

30 Note that for this divergence ∆F(W̃, W) = ∆F(U W̃, U W) for any orthonormal matrix U and parameter matrices in the domain of F. [sent-165, score-0.217]

31 F(W ) can be extended to symmetric positive semi-definite matrices by using the convention 0 log 0 = 0. [sent-167, score-0.257]

32 On-line Learning In this section we present a natural extension of the exponentiated gradient (EG) update (Kivinen and Warmuth, 1997) to an update for symmetric positive definite matrices. [sent-171, score-0.328]

33 It then produces a prediction ŷt for the instance Xt based on the algorithm’s current parameter matrix Wt and receives a label yt. [sent-177, score-0.371]

34 (The prediction ŷt and the label yt lie in some labeling domain Y.) [sent-178, score-0.304]

35 Finally the algorithm incurs a real-valued loss L(ŷt, yt) and updates its parameter matrix to Wt+1. [sent-179, score-0.319]

36 The on-line algorithm we analyze for this case predicts with ŷt = tr(Wt Xt) and is based on the loss Lt(Wt) = L(ŷt, yt) = (ŷt − yt)². [sent-182, score-0.503]

37 In the case of the von Neumann divergence, the functions f (W ) = log W and f −1 (Q) = exp Q clearly preserve symmetry. [sent-197, score-0.304]

38 We call this update (3.4) the unnormalized matrix exponentiated gradient update. [sent-206, score-0.419]

39 Note that f (W ) = log W maps symmetric positive definite matrices to arbitrary symmetric matrices, and after adding a scaled symmetrized gradient, the function f −1 (Q) = exp Q maps the symmetric exponent back to a symmetric positive definite matrix. [sent-207, score-0.62]

40 When the parameters are constrained to trace one, then we arrive at the Matrix Exponentiated Gradient (MEG) update, which generalizes the exponentiated gradient (EG) update of Kivinen and Warmuth (1997) to non-diagonal matrices: Wt+1 = (1/Zt) exp(log Wt − η sym(∇W Lt(Wt))), (3.5) [sent-208, score-0.38]

41 where Zt = tr(exp(log Wt − η sym(∇W Lt(Wt)))) is the normalizing constant (see Appendix B for details). [sent-209, score-0.32]
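
A direct transcription of the MEG update (3.5) for the quadratic loss, where sym(∇W Lt(Wt)) = 2(tr(Wt Xt) − yt) sym(Xt) as in Algorithm 1 below; this is an illustrative sketch under my own naming, not the authors' code.

```python
import numpy as np

def sym(X):
    return 0.5 * (X + X.T)

def expm_sym(A):
    lam, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(lam)) @ V.T

def logm_spd(W):
    lam, V = np.linalg.eigh(W)
    return V @ np.diag(np.log(lam)) @ V.T

def meg_step(W, X, y, eta):
    """One MEG update (3.5) for the quadratic loss Lt(W) = (tr(W X) - y)^2."""
    y_hat = np.trace(W @ X)
    Q = logm_spd(W) - eta * 2.0 * (y_hat - y) * sym(X)  # log Wt - eta * sym(grad Lt(Wt))
    W_new = expm_sym(Q)
    return W_new / np.trace(W_new)                      # divide by Zt to restore trace one
```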

42 However, f −1 does not map an arbitrary symmetric matrix back to a symmetric positive definite matrix. [sent-213, score-0.235]

43 However we can “unwrap” this update to the following: Wt+1 = (1/Z̃t) exp(ct I + log W1 − η ∑_{s=1}^t sym(∇W Ls(Ws))), where the constant Z̃t normalizes the trace of Wt+1 to one. [sent-231, score-0.374]

44 We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(n³) per iteration): Vt Λt Vt^⊤ = ct I + log W1 − η ∑_{s=1}^t sym(∇W Ls(Ws)), where the constant ct is chosen so that the maximum eigenvalue of the above is zero. [sent-234, score-0.245]

45 Algorithm 1 Pseudo-code of the matrix exponentiated gradient (MEG) algorithm for the quadratic loss: choose W1 and η; initialize G0 = log W1; for t = 1, 2, . . . [sent-237, score-0.285]

46 do: obtain instance matrix Xt; predict ŷt = tr(Wt Xt); obtain label yt and determine the loss Lt = (ŷt − yt)²; update Gt = Gt−1 − 2η(ŷt − yt) sym(Xt); compute the spectral decomposition Gt = Vt Λt Vt^⊤; update Wt+1 = Vt exp(Λt − ct I) Vt^⊤ / tr(exp(Λt − ct I)), where ct = max_s (Λt)_{s,s}; end for. [sent-240, score-0.848]
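
Read literally, Algorithm 1 amounts to the following numpy loop (an illustrative sketch with my naming, not the authors' implementation); keeping the running sum Gt avoids recomputing log Wt, and subtracting ct keeps the exponentials numerically stable.

```python
import numpy as np

def sym(X):
    return 0.5 * (X + X.T)

def meg_quadratic(examples, d, eta):
    """Sketch of Algorithm 1: MEG with quadratic loss, starting from W1 = I/d."""
    W = np.eye(d) / d
    G = np.log(1.0 / d) * np.eye(d)           # G0 = log W1 for W1 = I/d
    losses = []
    for X, y in examples:                     # examples: iterable of (Xt, yt) pairs
        y_hat = np.trace(W @ X)               # predict with tr(Wt Xt)
        losses.append((y_hat - y) ** 2)
        G = G - 2.0 * eta * (y_hat - y) * sym(X)
        lam, V = np.linalg.eigh(G)            # spectral decomposition Gt = Vt Λt Vt^T
        lam = lam - lam.max()                 # ct = largest eigenvalue, for numerical stability
        expG = V @ np.diag(np.exp(lam)) @ V.T
        W = expG / np.trace(expG)             # Wt+1, renormalized to trace one
    return W, losses
```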

47 3.3 Relative Loss Bounds For the sake of simplicity we now restrict ourselves to the case when the algorithm predicts with ŷt = tr(Wt Xt) and the loss function is quadratic: Lt(Wt) = L(ŷt, yt) := (ŷt − yt)². [sent-241, score-0.503]

48 Let (X1, y1), . . . , (XT, yT) denote a sequence of examples, where the instance matrices Xt ∈ R^{d×d} and the labels yt ∈ R. [sent-246, score-0.231]

49 The total loss of the on-line algorithm on the entire sequence S is LMEG(S) = ∑_{t=1}^T (tr(Wt Xt) − yt)². [sent-247, score-0.199]

50 Such a comparator parameter is any symmetric positive semi-definite matrix U with trace one, and its total loss is defined as LU(S) = ∑_{t=1}^T (tr(U Xt) − yt)². [sent-249, score-0.443]

51 These two lemmas generalize similar lemmas previously proven for the exponentiated gradient update (Lemmas 5. [sent-254, score-0.223]

52 It has the same structure as the corresponding previous lemma proven for the exponentiated gradient algorithm, but now we apply the various matrix inequalities given at the end of Section 2. [sent-269, score-0.23]

53 2 Let S be any sequence of examples with square real matrices as instances and real labels, and let r be an upper bound on the range of eigenvalues of the symmetric part of each instance matrix of S. [sent-275, score-0.283]

54 Let the initial parameter W1 and comparison parameter U be arbitrary symmetric positive definite matrices of trace one. [sent-276, score-0.23]

55 In particular, if W1 = (1/d) I, then ∆F(U, W1) = ∑i λi log λi + log d ≤ log d. [sent-288, score-0.282]

56 4.1 Preliminaries In this section, we address the following Bregman projection problem of finding a positive semidefinite symmetric matrix W ∈ R^{d×d} of trace one satisfying a set of linear constraints: W* = argmin_W ∆F(W, W1) subject to tr(W Cj) ≤ 0, (4.1) [sent-297, score-0.28]

57 for j = 1, . . . , n, where the symmetric positive definite matrix W1 of trace one is the initial parameter matrix and C1, . [sent-303, score-0.285]

58 Before presenting the algorithm, let us describe the dual problem of minimizing the von Neumann divergence subject to linear constraints (4. [sent-324, score-0.223]

59 The dual variables are the Lagrange multipliers α ∈ R^n (α ≥ 0) associated with this optimization problem: α* = argmax_{α≥0} −log tr(exp(log W1 − ∑_{j=1}^n αj sym(Cj))). [sent-326, score-0.46]

60 1) becomes a Bregman projection subject to a single equality constraint tr(W Xt ) = yt . [sent-333, score-0.206]

61 where Z(α*) = tr(exp(log W1 − ∑_{j=1}^n α*_j sym(Cj))) and α* is the optimal dual solution. [sent-349, score-0.366]

62 At the t-th step, choose an unsatisfied constraint jt , i. [sent-353, score-0.216]

63 Appendix D): αt* = argmin_{α≥0} tr(exp(log Wt − α sym(C_jt))). (4.4) [sent-361, score-0.544]

64 Using the solution of the dual problem, Wt is updated as Wt+1 = (1/Zt(αt*)) exp(log Wt − αt* sym(C_jt)), (4.5) [sent-363, score-0.232]

65 where the normalization factor is Zt(αt*) = tr(exp(log Wt − αt* sym(C_jt))). [sent-364, score-0.506]
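
One way to realize the step (4.4) and (4.5) is a one-dimensional line search over α; the sketch below uses scipy's bounded scalar minimizer as a stand-in for the exact minimization, so the optimizer choice and the interval bound alpha_max are assumptions of mine rather than anything the paper prescribes.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sym(X):
    return 0.5 * (X + X.T)

def expm_sym(A):
    lam, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(lam)) @ V.T

def logm_spd(W):
    lam, V = np.linalg.eigh(W)
    return V @ np.diag(np.log(lam)) @ V.T

def definiteboost_step(W, C_j, alpha_max=50.0):
    """One Bregman-projection step: pick alpha >= 0 (approximately) minimizing Zt(alpha)."""
    logW = logm_spd(W)
    S = sym(C_j)
    Z = lambda a: np.trace(expm_sym(logW - a * S))           # Zt(alpha)
    alpha = minimize_scalar(Z, bounds=(0.0, alpha_max), method="bounded").x
    M = expm_sym(logW - alpha * S)
    return M / np.trace(M)                                   # Wt+1, normalized by Zt(alpha)
```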

66 However, one can use the following approximate choice of αt: αt = (1/(λt^max − λt^min)) log((1 + rt/λt^max)/(1 + rt/λt^min)), (4.6) [sent-375, score-0.196]

67 when the eigenvalues of sym(C_jt) lie in the interval [λt^min, λt^max] and rt = tr(Wt C_jt). [sent-376, score-0.629]
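
The closed-form approximation (4.6) transcribed as it appears above (the sign conventions for λt^min, which is typically negative, are the paper's); as a sanity check, λt^min = −1 and λt^max = 1 recover the AdaBoost step size (1/2) log((1 + rt)/(1 − rt)) mentioned later in the text.

```python
import numpy as np

def approx_alpha(r_t, lam_min, lam_max):
    """Approximate step size (4.6), with the eigenvalues of sym(C_jt) in [lam_min, lam_max]."""
    return np.log((1.0 + r_t / lam_max) / (1.0 + r_t / lam_min)) / (lam_max - lam_min)

# AdaBoost special case: approx_alpha(r, -1.0, 1.0) == 0.5 * log((1 + r) / (1 - r)).
```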

68 5), where probability distributions are replaced by symmetric positive definite matrices of trace one. [sent-381, score-0.23]

69 The negative exponentiated dual objective is bounded from above by tr(exp(log W1 − ∑_{t=1}^T αt sym(C_jt))) ≤ ∏_{t=1}^T ρ(rt), (4.7) where αt = (1/(λt^max − λt^min)) log((1 + rt/λt^max)/(1 + rt/λt^min)) and [sent-394, score-1.259]

70 rt = tr(Wt C_jt), ρ(rt) = (1 − rt/λt^max)^{λt^max/(λt^max − λt^min)} (1 − rt/λt^min)^{−λt^min/(λt^max − λt^min)}. [sent-395, score-0.39]

71 The algorithm selects in each iteration a constraint jt that is violated by at least ε (i. [sent-409, score-0.24]

72 rt = tr(Wt C jt ) ≥ ε), or stops if no such constraint exists. [sent-411, score-0.339]

73 Denote by h_primal(W) and h_dual(α) the primal and dual objective functions in (4.8) and (4.9). [sent-414, score-0.205]

74 h_dual(α) = −log tr(exp(log W1 − ∑_{j=1}^n αj sym(Cj))) (4.9) [sent-418, score-0.671]

75 The primal objective is upper-bounded by log d, since ∆F(W, W1) = ∑i λi log λi + log d ≤ log d. [sent-419, score-0.44]

76 exp(−h_dual(α̃)) = tr(exp(log W1 − ∑_{t=1}^T αt sym(C_jt))) ≤ ((λ² − ε²)/λ²)^{T/2}, where α̃ is the cumulative coefficient vector for the constraints, i. [sent-424, score-0.711]

77 Finally, we obtain T ε²/(2λ²) ≤ h_dual(α̃) ≤ h_dual(α*) = h_primal(W*) ≤ log d, and the upper bound T ≤ 2λ² log d/ε². [sent-433, score-0.335]

78 solution W. rt = tr(Wt C_jt) ≥ ε, or stops if no such constraint exists. [sent-443, score-0.339]

79 7) is rewritten as −log((1/d) ∑_{i=1}^d exp(−yi ∑_{j=1}^n αj hj(xi))), which is equivalent to the exponential loss function used in AdaBoost. [sent-471, score-0.252]

80 6) is described as αt = (1/2) log((1 + rt)/(1 − rt)), where rt is the weighted training error of the t-th hypothesis, i. [sent-476, score-0.196]

81 The distance example at time t is described as {at , bt , yt }, which indicates that the squared Euclidean distance between objects at and bt is yt . [sent-524, score-0.431]

82 Then, the total loss incurred by this sequence is ∑_{t=1}^T (||x^t_{at} − x^t_{bt}||² − yt)² = ∑_{t=1}^T (tr(Wt Xt) − yt)², where Xt is a symmetric matrix whose (at, at) and (bt, bt) elements are 0. [sent-528, score-0.527]
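
To make the prediction tr(Wt Xt) concrete: Xt only needs nonzero entries at the four positions indexed by at and bt. The exact scaling of those entries is cut off in the extracted sentence above (it ends at "0."), so the values in this sketch are illustrative placeholders rather than the paper's.

```python
import numpy as np

def distance_instance(a, b, d, diag=0.5, off=-0.5):
    """Instance matrix Xt for a distance example {at, bt, yt}; diag/off are placeholder values.

    With diag=0.5 and off=-0.5, tr(W @ X) = 0.5 * (W[a, a] + W[b, b] - 2 * W[a, b]),
    i.e. half the squared distance between objects a and b under the embedding induced by W.
    """
    X = np.zeros((d, d))
    X[a, a] = diag
    X[b, b] = diag
    X[a, b] = off
    X[b, a] = off
    return X
```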

83 The total loss of the kernel matrix sequence obtained by the matrix exponential update is shown in Figure 2 (left). [sent-571, score-0.241]

84 Summary and Discussion We motivated and analyzed a new update for symmetric positive definite matrices using the von Neumann divergence. [sent-611, score-0.298]

85 We showed that the standard bounds for on-line learning and boosting generalize to the case when the parameters are symmetric positive definite matrices of trace one instead of a probability vector. [sent-612, score-0.294]

86 All algorithms of this paper (and their analyses) immediately generalize to the case when symmetric is replaced by Hermitian and symmetric positive definite by positive Hermitian (i. [sent-626, score-0.189]

87 Note that density matrices (as used in Statistical Physics) are positive Hermitian matrices of trace one. [sent-632, score-0.225]

88 Consider the following optimization problem, where Xt is an arbitrary matrix in R^{d×d}, Wt an arbitrary symmetric matrix in R^{d×d} and yt ∈ R: Wt+1 = argmin_W ∆F(W, Wt) + η Lt(W) s. [sent-679, score-0.408]

89 This adds a term δ(tr(W) − 1) to the Lagrangian and the update now has the form Wt+1 = exp(log Wt − η∇W Lt(Wt+1) − (Γ − Γ^⊤) − δI). [sent-693, score-0.265]

90 Choosing Γ = −η∇W Lt (Wt+1 )/2 and δ = − log (tr(exp(log Wt − η sym(∇W Lt (Wt+1 ))))) enforces the symmetry and trace constraints and after approximating the gradient we arrive at the explicit MEG update (3. [sent-694, score-0.313]

91 3), we have tr (exp(log Wt + δt sym(Xt ))) ≤ tr (Wt exp(δt sym(Xt ))) . [sent-704, score-0.64]

92 2 by pre-multiplying the inequality by Wt and taking a trace of both sides: tr(Wt exp(δt sym(Xt))) ≤ exp(r0 δt) (1 − ((tr(Wt Xt) − r0)/r)(1 − exp(r δt))). [sent-715, score-0.41]

93 Solving ∂g/∂z = 0, we have z = yt − δt/(2b) = yt + η(tr(Xt Wt) − yt)/b. [sent-722, score-0.456]

94 Similarly, β = log tr (exp(log W1 − α sym(C))) enforces the trace constraint. [sent-747, score-0.481]

95 1 Recall the definition of the normalization factor Zt (α) = tr (exp(log Wt − α sym(C jt ))) of DefiniteBoost. [sent-753, score-0.506]

96 exp(−α sym(C jt )) Since Wt is positive definite and both sides of the above inequality are symmetric, we can apply Lemma 2. [sent-764, score-0.227]

97 2 by multiplying this inequality by Wt and taking a trace of both sides: tr(Wt exp(−α sym(C_jt))) ≤ exp(−α λt^max) tr(Wt A) + exp(α λt^min) tr(Wt (I − A)). [sent-765, score-0.276]

98 By expanding A and using the shorthand rt = tr(Wt C_jt), we obtain Zt(α) ≤ exp(−α λt^max) (λt^min + rt)/(λt^max + λt^min) + exp(α λt^min) (λt^max − rt)/(λt^max + λt^min). [sent-766, score-0.492]

99 With this choice, the inequality becomes Zt(αt) ≤ (1 − rt/λt^max)^{λt^max/(λt^max + λt^min)} (1 + rt/λt^min)^{λt^min/(λt^max + λt^min)}. [sent-769, score-0.227]

100 Taking the trace of both sides and rearranging terms, we get tr(exp(log W1 − ∑_{t=1}^T αt sym(C_jt))) = ∏_{t=1}^T Zt(αt). [sent-773, score-0.189]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.486), ('sym', 0.473), ('tr', 0.32), ('lt', 0.195), ('jt', 0.186), ('yt', 0.152), ('bregman', 0.14), ('tmax', 0.138), ('tmin', 0.123), ('xt', 0.12), ('niteboost', 0.12), ('exp', 0.111), ('suda', 0.103), ('rt', 0.102), ('warmuth', 0.097), ('pdates', 0.095), ('radient', 0.095), ('xponentiated', 0.095), ('log', 0.094), ('atsch', 0.087), ('symmetric', 0.084), ('neumann', 0.082), ('exponentiated', 0.081), ('atrix', 0.079), ('eg', 0.079), ('matrices', 0.079), ('tsuda', 0.077), ('von', 0.075), ('meg', 0.072), ('divergence', 0.071), ('kivinen', 0.07), ('matrix', 0.067), ('trace', 0.067), ('zt', 0.066), ('update', 0.06), ('quantum', 0.058), ('eigenvalues', 0.053), ('adaboost', 0.052), ('hdual', 0.052), ('noble', 0.052), ('hermitian', 0.05), ('vt', 0.048), ('loss', 0.047), ('dual', 0.046), ('rd', 0.046), ('lmeg', 0.043), ('logdet', 0.043), ('unsatis', 0.043), ('primal', 0.043), ('gradient', 0.043), ('ct', 0.042), ('argmin', 0.038), ('tsch', 0.035), ('lu', 0.035), ('updates', 0.034), ('convex', 0.031), ('constraints', 0.031), ('constraint', 0.03), ('distance', 0.03), ('singer', 0.029), ('gt', 0.027), ('ei', 0.026), ('bacteria', 0.026), ('chuang', 0.026), ('comparator', 0.026), ('ellipse', 0.026), ('tsang', 0.026), ('boosting', 0.025), ('bt', 0.025), ('projection', 0.024), ('preserve', 0.024), ('relative', 0.024), ('littlestone', 0.024), ('violated', 0.024), ('nielsen', 0.023), ('ab', 0.023), ('projections', 0.023), ('inequality', 0.023), ('diagonal', 0.022), ('lmax', 0.022), ('eigen', 0.022), ('koji', 0.022), ('objective', 0.021), ('lemma', 0.021), ('stops', 0.021), ('generalize', 0.021), ('vi', 0.021), ('xing', 0.021), ('batch', 0.02), ('incurs', 0.019), ('rotation', 0.019), ('bounds', 0.018), ('schapire', 0.018), ('symmetry', 0.018), ('proven', 0.018), ('generalizes', 0.018), ('sides', 0.018), ('kwok', 0.017), ('objects', 0.017), ('atr', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

2 0.15322472 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

3 0.10042158 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

4 0.088018753 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

5 0.059519276 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

6 0.057562873 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

7 0.048646096 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

8 0.047910467 39 jmlr-2005-Information Bottleneck for Gaussian Variables

9 0.047708739 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

10 0.046948992 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

11 0.043245677 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

12 0.042734493 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

13 0.042082001 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

14 0.040442277 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

15 0.039440751 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

16 0.034305509 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

17 0.033393241 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

18 0.031500746 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

19 0.03104239 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

20 0.030927293 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.073), (2, 0.173), (3, -0.057), (4, -0.31), (5, -0.108), (6, -0.157), (7, -0.002), (8, 0.166), (9, 0.393), (10, 0.023), (11, -0.005), (12, -0.135), (13, -0.101), (14, -0.079), (15, 0.142), (16, -0.058), (17, 0.041), (18, -0.065), (19, -0.094), (20, 0.114), (21, -0.043), (22, -0.085), (23, -0.009), (24, -0.065), (25, -0.021), (26, -0.018), (27, 0.025), (28, 0.039), (29, 0.041), (30, 0.015), (31, -0.07), (32, -0.074), (33, -0.033), (34, 0.024), (35, -0.088), (36, 0.166), (37, 0.103), (38, 0.054), (39, 0.05), (40, -0.055), (41, -0.215), (42, -0.046), (43, -0.165), (44, 0.026), (45, 0.013), (46, 0.017), (47, 0.105), (48, -0.063), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97186565 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

2 0.64195853 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

3 0.36276433 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

4 0.30992353 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

5 0.21726839 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky

Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization

6 0.19936627 39 jmlr-2005-Information Bottleneck for Gaussian Variables

7 0.19534963 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

8 0.18022011 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

9 0.17983629 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

10 0.17636065 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

11 0.15720728 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

12 0.15597174 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

13 0.1549039 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

14 0.15339211 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

15 0.15294611 25 jmlr-2005-Denoising Source Separation

16 0.14329384 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

17 0.13745221 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

18 0.13453446 12 jmlr-2005-An MDP-Based Recommender System

19 0.13357632 3 jmlr-2005-A Classification Framework for Anomaly Detection

20 0.13283476 57 jmlr-2005-Multiclass Boosting for Weak Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.015), (17, 0.014), (19, 0.538), (36, 0.024), (37, 0.018), (42, 0.017), (43, 0.017), (47, 0.015), (52, 0.07), (59, 0.012), (70, 0.013), (88, 0.075), (90, 0.045), (94, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89281595 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

2 0.83707124 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun

Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

3 0.41865677 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

Author: Marianthi Markatou, Hong Tian, Shameek Biswas, George Hripcsak

Abstract: This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random T T variables Y = Card(S j S j ) and Y ∗ = Card(Sc Sc ), where S j , S j are two training sets, and Sc , j j j c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric Sj and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation. Keywords: cross-validation, generalization error, moment approximation, prediction, variance estimation

4 0.38934231 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

5 0.37454039 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

6 0.35184795 39 jmlr-2005-Information Bottleneck for Gaussian Variables

7 0.35035494 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

8 0.34294641 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

9 0.34013435 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

10 0.33582938 64 jmlr-2005-Semigroup Kernels on Measures

11 0.33212417 5 jmlr-2005-A Generalization Error for Q-Learning

12 0.32796934 71 jmlr-2005-Variational Message Passing

13 0.32540101 48 jmlr-2005-Learning the Kernel Function via Regularization

14 0.32333615 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

15 0.32256335 29 jmlr-2005-Efficient Margin Maximizing with Boosting

16 0.31166294 36 jmlr-2005-Gaussian Processes for Ordinal Regression

17 0.30772099 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

18 0.30228132 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

19 0.30153668 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

20 0.30076566 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods