jmlr jmlr2009 jmlr2009-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. [sent-8, score-1.104]
2 This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. [sent-11, score-0.502]
3 Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing [sent-12, score-0.635]
4 In each trial, the algorithm probabilistically chooses a permutation and then incurs a linear loss based on how appropriate the permutation was for that trial. [sent-15, score-0.755]
5 The regret is the total expected loss of the algorithm on the whole sequence of trials minus the total loss of the best permutation chosen in hindsight for the whole sequence, and the goal is to find algorithms that have provably small worst-case regret. [sent-16, score-0.782]
6 Since there are n! permutations on n elements, it is infeasible to simply treat each permutation as an expert and apply one of the expert algorithms that uses exponential weights. [sent-37, score-0.711]
7 We represent a permutation of n elements as an n × n permutation matrix Π where Πi, j = 1 if the permutation maps element i to position j and Πi, j = 0 otherwise. [sent-41, score-1.082]
8 As the algorithm randomly selects a permutation Π at the beginning of a trial, an adversary simultaneously selects an arbitrary loss matrix L ∈ [0, 1]n×n which specifies the loss of all permutations for the trial. [sent-42, score-0.874]
9 Each entry Li, j of the loss matrix gives the loss for mapping element i to j, and the loss of any whole permutation is the sum of the losses of the permutation’s mappings, that is, the loss of permutation Π is ∑i Li,Π(i) = ∑i, j Πi, j Li, j . [sent-43, score-1.291]
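To make this linear loss concrete, here is a minimal sketch (not from the paper; the loss matrix and permutation are invented) that evaluates the loss of a permutation both as ∑_i L_{i,Π(i)} and as the matrix inner product ∑_{i,j} Π_{i,j} L_{i,j}:

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
L = rng.uniform(0.0, 1.0, size=(n, n))   # adversary's loss matrix, entries in [0, 1]

perm = [2, 0, 3, 1]                       # permutation: element i is mapped to position perm[i]
Pi = np.zeros((n, n))
Pi[np.arange(n), perm] = 1.0              # permutation matrix with Pi[i, perm[i]] = 1

loss_as_sum = sum(L[i, perm[i]] for i in range(n))   # sum_i L_{i, Pi(i)}
loss_as_dot = float(np.sum(Pi * L))                  # sum_{i,j} Pi_{i,j} L_{i,j}
assert abs(loss_as_sum - loss_as_dot) < 1e-12
print(loss_as_sum)
```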
10 This expectation W is an n × n weight matrix which is doubly stochastic, that is, it has non-negative entries and the property that every row and column sums to 1. [sent-47, score-0.73]
11 The algorithm’s uncertainty about which permutation is the target is summarized by W ; each weight Wi, j is the probability that the algorithm predicts with a permutation mapping element i to position j. [sent-48, score-0.825]
12 It is worth emphasizing that the W matrix is only a summary of the distribution over permutations used by any algorithm (it doesn’t indicate which permutations have non-zero probability, for example). [sent-49, score-0.553]
13 However, this summary is sufficient to determine the algorithm’s expected loss when the losses of permutations have the assumed loss matrix form. [sent-50, score-0.599]
14 By Birkhoff's Theorem, every doubly stochastic matrix can be expressed as the convex combination of at most n² − 2n + 2 permutations (see, e.g., Bhatia, 1997). [sent-52, score-0.765]
15 In Appendix A we show that a greedy matching-based algorithm efficiently decomposes any doubly stochastic matrix into a convex combination of at most n2 − 2n + 2 permutations. [sent-55, score-0.605]
16 Our algorithm for learning permutations predicts with a random Π sampled from the convex combination of permutations created by decomposing weight matrix W . [sent-57, score-0.756]
17 After this update, the weight matrix no longer has the doubly stochastic property, and the weight matrix must be projected back into the space of doubly stochastic matrices (called “Sinkhorn balancing”, see Section 4) before the next prediction can be made. [sent-62, score-1.228]
18 In Theorem 4 we bound the expected loss of PermELearn over any sequence of trials by (n ln n + η L_best) / (1 − e^{−η}), (1) where n is the number of elements being permuted, η is the learning rate, and L_best is the loss of the best permutation on the entire sequence. [sent-63, score-0.825]
19 If an upper bound L_est ≥ L_best is known, then η can be tuned (as in Freund and Schapire, 1997) and the expected loss bound becomes L_best + √(2 L_est n ln n) + n ln n, (2) giving a bound of √(2 L_est n ln n) + n ln n on the worst-case expected regret of the tuned PermELearn algorithm. [sent-64, score-1.012]
20 We also prove a matching lower bound (Theorem 6) of Ω(√(L_best n ln n)) for the expected regret of any algorithm solving our permutation learning problem. [sent-65, score-0.693]
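As a rough numerical illustration of this tuning step (a sketch, not the paper's procedure: it simply minimizes bound (1) over η with L_est substituted for L_best, and the values of n and L_est are invented), the minimized form of (1) can be compared with the closed-form bound (2):

```python
import math

n, L_est = 10, 50.0                      # illustrative values only
def bound1(eta):                         # bound (1) with L_best replaced by L_est
    return (n * math.log(n) + eta * L_est) / (1.0 - math.exp(-eta))

etas = [i / 1000.0 for i in range(1, 5000)]
tuned = min(bound1(e) for e in etas)     # numerically tuned version of (1)
closed = L_est + math.sqrt(2 * L_est * n * math.log(n)) + n * math.log(n)   # bound (2)
print(tuned, closed)                     # the tuned value comes out at or below the closed form
```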
21 Each trial it adds random perturbations to the cumulative loss matrix and then predicts with the permutation having minimum perturbed loss. [sent-67, score-0.718]
22 However, the regret bound we can obtain for it in the permutation setting is about a factor of n worse than the bound for PermELearn and the lower bound. [sent-69, score-0.529]
23 If T is the sum of the loss matrices over the past trials and F is the n × n matrix with entries Fi, j = e−ηTi, j , then the weight of each permutation expert Π is proportional to the product ∏i Fi,Π(i) and the normalization constant is the permanent of the matrix F. [sent-72, score-1.153]
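For intuition about why this explicit-experts view is expensive, the sketch below (illustrative only: small n, made-up cumulative losses) enumerates all n! permutation experts and computes the normalization constant, which is exactly the permanent of F:

```python
import itertools
import numpy as np

n = 5
eta = 0.5
rng = np.random.default_rng(1)
T = rng.uniform(0.0, 3.0, size=(n, n))    # cumulative loss matrix (made up)
F = np.exp(-eta * T)                      # F_{i,j} = exp(-eta * T_{i,j})

# Weight of each permutation expert; the normalizer is the permanent of F.
weights = {}
for perm in itertools.permutations(range(n)):
    weights[perm] = float(np.prod([F[i, perm[i]] for i in range(n)]))
permanent = sum(weights.values())         # n! terms: infeasible for large n
probs = {p: w / permanent for p, w in weights.items()}
print(len(weights), permanent)
```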
24 Moreover, since the loss range of a permutation is [0, n], the standard loss bound for the algorithm that uses one expert per permutation must be scaled up by a factor of n, becoming L_best + √(2 L_est n ln(n!)) + n ln(n!). [sent-75, score-0.995]
25 This expected loss bound is similar to our expected loss bound for PermELearn in Equation (2), except that the n ln n terms are replaced by n² ln n. [sent-78, score-0.634]
26 PermELearn's weight updates belong to the Exponentiated Gradient family of updates (Kivinen and Warmuth, 1997) since the components Li, j of the loss matrix that appear in the exponential factor are the derivatives of our linear loss with respect to the weights Wi, j . [sent-82, score-0.541]
27 Our new algorithm PermELearn for learning permutations maintains a doubly stochastic matrix with n2 weights. [sent-85, score-0.74]
28 We first show that our update minimizes a tradeoff between the loss and a relative entropy between doubly stochastic matrices. [sent-89, score-0.59]
29 One part of the algorithm is to decompose the current doubly stochastic matrix into a small convex combination of permutations using a greedy algorithm. [sent-105, score-0.787]
30 The bound on the number of permutations needed to decompose the weight matrix is deferred to Appendix A. [sent-106, score-0.466]
31 In Section 7 we prove a lower bound on the regret when learning permutations that is within a small constant factor of our regret bound on the tuned PermELearn algorithm. [sent-111, score-0.565]
32 A permutation can be specified either as a bijective mapping of {1, . . . , n} onto itself or as a permutation matrix, which is an n × n binary matrix with exactly one "1" in each row and each column. [sent-126, score-0.637]
33 [Figure: the permutation (2, 4, 3, 1) represented as a matrix, and an arbitrary matrix permuting the rows.] Convex combinations of permutations create doubly stochastic or balanced matrices: nonnegative matrices whose n rows and n columns each sum to one. [sent-134, score-1.093]
34 Our algorithm maintains its uncertainty about which permutation is best as a doubly stochastic weight matrix W and needs to randomly select a permutation from some distribution whose expectation is W . [sent-135, score-1.252]
35 By Birkhoff's Theorem (see, e.g., Bhatia, 1997), for every doubly stochastic matrix W there is a decomposition into a convex combination of at most n² − 2n + 2 permutation matrices. [sent-138, score-0.872]
36 In the case of permutations we could have one expert per permutation and allocate a distribution over the n! permutations. [sent-147, score-0.603]
37 As discussed in the introduction, we assume that the losses in each trial can be specified by a loss matrix L ∈ [0, 1]n×n where the loss of each permutation Π has the linear form ∑i Li,Π(i) = Π • L. [sent-150, score-0.792]
38 This expected prediction W is an n × n doubly stochastic matrix and algorithms for learning permutations under the linear loss assumption can be viewed as implicitly maintaining such a doubly stochastic weight matrix. [sent-152, score-1.285]
39 Many ranking applications have an associated loss motif M and nature is constrained to choose (row) permutations of M as its loss matrix L. [sent-159, score-0.589]
40 In effect, at each trial nature chooses a “correct” permutation Π and uses the loss matrix L = ΠM. [sent-160, score-0.652]
41 If nature chooses the identity permutation then the loss matrix L is the motif M itself. [sent-162, score-0.589]
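A small sketch (toy motif values, invented permutation) of how nature could generate the loss matrix from a motif by row-permuting M with its chosen "correct" permutation Π, as described above:

```python
import numpy as np

n = 4
M = np.array([[abs(i - j) / (n - 1) for j in range(n)] for i in range(n)])  # toy motif in [0, 1]

perm = [1, 3, 0, 2]                        # nature's "correct" permutation Pi
Pi = np.zeros((n, n))
Pi[np.arange(n), perm] = 1.0

L = Pi @ M                                 # L = Pi M: row i of L is row perm[i] of M
assert np.allclose(L, M[perm, :])
print(L)                                   # with the identity permutation, L would equal M
```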
42 When M is known to the algorithm, it suffices to give the algorithm only the permutation Π at the end of the trial, rather than the loss matrix L itself. [sent-163, score-0.561]
43 However, our lower bound (see Section 7) shows that the n ln n factors in (2) are necessary in the general permutation setting. [sent-173, score-0.525]
44 It maintains an n × n doubly stochastic weight matrix W as its main data structure, where Wi, j is the probability that PermELearn predicts with a permutation mapping element i to position j. [sent-179, score-1.002]
45 Create a new doubly stochastic matrix W for use in the next trial based on the current weight matrix W and loss matrix L. [sent-184, score-1.036]
46 The algorithm greedily decomposes W into a convex combination of at most n2 − 2n + 2 permutations (see Theorem 7), and then randomly selects one of these permutations for the prediction. [sent-187, score-0.531]
47 We shall soon see that each matrix A is a constant times a doubly stochastic matrix, so the existence of a suitable permutation Π follows from Birkhoff’s Theorem. [sent-191, score-0.799]
48 Algorithm 1 PermELearn: Selecting a permutation. Require: a doubly stochastic n × n matrix W. A := W; q := 0; repeat q := q + 1; find a permutation Πq such that A_{i,Πq(i)} is positive for each i ∈ {1, . . . , n}. [sent-196, score-1.112]
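A self-contained sketch of this selection step: it greedily peels off permutation matrices from W using a bipartite matching over the strictly positive entries, then samples one permutation with probability equal to its coefficient. This is an illustrative reimplementation, not the authors' code; the matching routine, tolerances, and example matrix are my own choices.

```python
import numpy as np

def positive_perfect_matching(A, tol=1e-12):
    """Return perm with A[i, perm[i]] > tol for all i (Kuhn's augmenting-path matching)."""
    n = A.shape[0]
    match_col = [-1] * n                      # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if A[i, j] > tol and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            raise ValueError("no all-positive permutation found")
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm

def decompose(W, tol=1e-9):
    """Greedy Birkhoff-style decomposition of a doubly stochastic W into (alpha, perm) pairs."""
    A = W.copy()
    parts = []
    while A.max() > tol:
        perm = positive_perfect_matching(A, tol)
        alpha = min(A[i, perm[i]] for i in range(len(perm)))
        parts.append((alpha, perm))
        for i, j in enumerate(perm):
            A[i, j] -= alpha                  # subtract alpha times the permutation matrix
    return parts

def sample_permutation(W, rng):
    parts = decompose(W)
    alphas = np.array([a for a, _ in parts])
    k = rng.choice(len(parts), p=alphas / alphas.sum())
    return parts[k][1]

rng = np.random.default_rng(0)
W = np.full((3, 3), 1.0 / 3.0)                # a small doubly stochastic example
print(decompose(W))
print(sample_permutation(W, rng))
```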
49 Algorithm 2 PermELearn: Weight Matrix Update. Require: learning rate η, loss matrix L, and doubly stochastic weight matrix W. Create W′ where each W′_{i,j} = W_{i,j} e^{−η L_{i,j}} (3); create a doubly stochastic W̃ by re-balancing the rows and columns of W′ (Sinkhorn balancing) and update W to W̃. [sent-206, score-1.279]
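A sketch of this update step using the usual alternating row/column-normalization form of Sinkhorn balancing; the stopping tolerance, iteration cap, and toy loss matrix are my own choices, not the paper's.

```python
import numpy as np

def permelearn_update(W, L, eta, max_iters=10_000, tol=1e-10):
    """Multiplicative update (3) followed by approximate Sinkhorn balancing."""
    A = W * np.exp(-eta * L)                 # W'_{i,j} = W_{i,j} * exp(-eta * L_{i,j})
    for _ in range(max_iters):
        A = A / A.sum(axis=1, keepdims=True) # normalize rows
        A = A / A.sum(axis=0, keepdims=True) # normalize columns
        if np.abs(A.sum(axis=1) - 1.0).max() < tol:
            break                            # rows and columns both sum to ~1
    return A

n = 3
W = np.full((n, n), 1.0 / n)                 # uniform initial doubly stochastic matrix
L = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])              # toy loss matrix favoring the identity
print(np.round(permelearn_update(W, L, eta=1.0), 3))
```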
50 Since subtracting α times a permutation matrix reduces each row sum and each column sum by α, and the original matrix W was doubly stochastic, each matrix A will have rows and columns that sum to the same amount. [sent-207, score-0.69]
51 In other words, each matrix A created during Algorithm 1 is a constant times a doubly stochastic matrix, and thus (by Birkhoff’s Theorem) is a constant times a convex combination of permutations. [sent-208, score-0.559]
52 Therefore, Algorithm 1 decomposes the original doubly stochastic matrix into the convex combination of (at most) n2 − n + 1 permutation matrices. [sent-210, score-0.896]
53 Therefore we get the well known fact that Sinkhorn balancing a matrix A results in a doubly stochastic matrix RAC where R and C are diagonal matrices. [sent-228, score-0.834]
54 Each entry Ri,i is the positive multiplier applied to row i, and each entry C j, j is the positive multiplier of column j needed to convert A into a doubly stochastic matrix. [sent-229, score-0.583]
55 Since each row and column balancing step maps rational entries to rational entries, Sinkhorn balancing can produce irrational entries only in the limit (after infinitely many steps). [sent-231, score-0.534]
56 Because the row multipliers R_{i,i} and column multipliers C_{j,j} contribute the same overall factor to the product weight ∏_i A_{i,Π(i)} of every permutation Π, the matrix must balance to a doubly stochastic matrix with diagonal entries a and off-diagonal entries 1 − a such that the ratio of the product weight between the two permutations (1, 2) and (2, 1) is preserved. [sent-236, score-0.784]
57 If each permutation is explicitly represented as an expert, then the Hedge algorithm predicts permutation Π with probability proportional to the product weight ∏_i e^{−η ∑_t L^t_{i,Π(i)}}. [sent-239, score-0.686]
58 In the example above, Hedge puts probability 2/3 on permutation (1, 2) and probability 1/3 on permutation (2, 1), while PermELearn puts probability √2/(1+√2) ≈ 0.59 on permutation (1, 2) and probability 1/(1+√2) ≈ 0.41 on permutation (2, 1). [sent-242, score-0.626]
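The 2 × 2 example can be checked numerically. The sketch below assumes the unbalanced matrix in the example is [[1, 1], [1, 2]] (the matrix itself is not shown in the excerpt above); under that assumption, brute-force Hedge and Sinkhorn balancing reproduce the 2/3 vs. 1/3 and ≈0.59 vs. ≈0.41 probabilities quoted above.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0]])                    # assumed unbalanced matrix for the example

# Hedge over explicit permutation experts: product weights of (1, 2) and (2, 1).
w_id, w_swap = A[0, 0] * A[1, 1], A[0, 1] * A[1, 0]
print(w_id / (w_id + w_swap), w_swap / (w_id + w_swap))   # 2/3 and 1/3

# PermELearn: Sinkhorn-balance A, then read off the decomposition coefficients.
B = A.copy()
for _ in range(2000):
    B /= B.sum(axis=1, keepdims=True)
    B /= B.sum(axis=0, keepdims=True)
a = B[0, 0]                                   # coefficient of permutation (1, 2)
print(a, 1.0 - a)                             # ~0.586 and ~0.414, ratio sqrt(2)
print(np.sqrt(2) / (1 + np.sqrt(2)))          # closed form for comparison
```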
59 Sinkhorn showed that this procedure converges and that the RAC balancing of any matrix A into a doubly stochastic matrix is unique (up to canceling multiples of R and C) if it exists3 (Sinkhorn, 1964). [sent-246, score-0.802]
60 A number of authors consider balancing a matrix A so that the row and column sums are 1 ± ε. [sent-247, score-0.495]
61 The weight matrices we deal with have strictly positive entries, and thus can always be made doubly stochastic with an RAC balancing. [sent-251, score-0.531]
62 Finally, Fürer (2004) shows that if the row and column sums of A are 1 ± ε then every matrix entry changes by at most ±nε when A is balanced to a doubly stochastic matrix. [sent-258, score-0.787]
63 Dealing with Approximate Balancing: With slight modifications, Algorithm PermELearn can handle the situation where its weight matrix is imperfectly balanced (and thus not quite doubly stochastic). [sent-260, score-0.582]
64 As before, let W be the fully balanced doubly stochastic weight matrix, but we now assume that only an approximately balanced version W̃ of it is available to predict from. [sent-261, score-0.627]
65 This allows us to bound the expected loss when predicting with the convex combination C in terms of the expected loss using a decomposition of the perfectly balanced W: C • L ≤ (1/s) W̃ • L = ∑_{i,j} (W̃_{i,j}/s) L_{i,j} ≤ ∑_{i,j} (W_{i,j} + 3nε) L_{i,j} ≤ W • L + 3n³ε. [sent-275, score-0.474]
66 Therefore the extra loss incurred by using an ε-approximately balanced weight matrix at a particular trial is at most 3n³ε, as desired. [sent-276, score-0.515]
67 If in a sequence of T trials the matrices W are ε = 1/(3T n³) balanced (so that each row and column sum is 1 ± 1/(3T n³)) then Lemma 1 implies that the total additional expected loss for using approximate balancing is at most 1. [sent-277, score-0.739]
68 This balancing algorithm with ε = 1/(3T n³), together with the modified prediction algorithm, gives a method requiring O(T n⁶ log(T n)) total time over the T trials and having a bound of √(2 L_est n ln n) + n ln n + 1 on the worst-case regret. [sent-280, score-0.646]
69 Therefore the modified algorithm can either keep a cumulative loss matrix L^{≤t} = ∑_{i=1}^{t} L^{i} and create its next W by (approximately) balancing the matrix with entries e^{−η L^{≤t}_{i,j}}, or apply the multiplicative updates to the previous approximately balanced W. [sent-287, score-0.747]
70 This style of analysis first shows a per-trial progress bound using relative entropy to a comparator as a measure of progress, and then sums this invariant over the trials to bound the expected total loss of the algorithm. [sent-290, score-0.57]
71 Our analysis for this setting provides a good foundation for learning permutation matrices and lays the groundwork for the future study of other permutation loss functions. [sent-295, score-0.805]
72 The per-trial invariant used to analyze the exponentiated gradient family bounds the decrease in relative entropy from any (normalized) vector u to the algorithm’s weight vector by a linear combination of the algorithm’s loss and the loss of u on the trial. [sent-297, score-0.473]
73 After showing that balancing to a doubly stochastic matrix only increases the progress, we can sum the per-trial bound to obtain our main theorem. [sent-302, score-0.756]
74 A Dead End: In each trial, PermELearn multiplies each entry of its weight matrix by an exponential factor and then uses one additional factor per row and column to make the matrix doubly stochastic (Algorithm 2, described in Section 4). [sent-304, score-0.899]
75 Using Equation (4) and noting that all three matrices are doubly stochastic (so their entries sum to n), we see that ∆(U, W) − ∆(U, W̃) = −η U • L + ∑_i ln r_i + ∑_j ln c_j. [sent-313, score-0.846]
76 (7) subject to ∀i: ∑_j A_{i,j} = 1 and ∀j: ∑_i A_{i,j} = 1. The second problem shows that the doubly stochastic matrix W̃ is the projection of W′ onto the linear row and column sum constraints. [sent-319, score-0.677]
77 Lemma 3 For any η > 0, any doubly stochastic matrices U and W and any trial with loss matrix L ∈ [0, 1]n×n ∆(U,W ) − ∆(U,W ′ ) ≥ (1 − e−η )(W • L) − η(U • L), where W ′ is the unbalanced intermediate matrix (6) constructed by PermELearn from W . [sent-322, score-0.897]
78 When considering a sequence of trials, Lt is the loss matrix at trial t, W t−1 is PermELearn’s weight matrix W at the start of trial t (so W 0 is the initial weight matrix) and W t is the updated weight matrix W at the end of the trial. [sent-330, score-0.966]
79 Theorem 4 For any learning rate η > 0, any doubly stochastic matrices U and initial W^0, and any sequence of T trials with loss matrices L^t ∈ [0, 1]^{n×n} (for 1 ≤ t ≤ T), the expected loss of PermELearn is bounded by: ∑_{t=1}^{T} W^{t−1} • L^t ≤ ( ∆(U, W^0) − ∆(U, W^T) + η ∑_{t=1}^{T} U • L^t ) / (1 − e^{−η}). [sent-331, score-0.84]
80 When the entries of W^0 are all initialized to 1/n and U is a permutation, then ∆(U, W^0) = n ln n. [sent-335, score-0.5]
81 Since each doubly stochastic matrix U is a convex combination of permutation matrices, at least one minimizer of the total loss ∑_{t=1}^{T} U • L^t will be a permutation matrix. [sent-336, score-1.292]
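A quick numerical check of this initialization claim, assuming the matrix relative entropy is the entrywise un-normalized relative entropy (the matrix analogue of the vector formula given further below); the value of n is arbitrary:

```python
import math
import numpy as np

def rel_entropy(U, W):
    """Un-normalized relative entropy: sum_ij U*ln(U/W) + W - U, with 0*ln(0) taken as 0."""
    total = 0.0
    for u, w in zip(U.flat, W.flat):
        total += (u * math.log(u / w) if u > 0 else 0.0) + w - u
    return total

n = 6
W0 = np.full((n, n), 1.0 / n)                # all entries initialized to 1/n
U = np.eye(n)                                # any permutation matrix, e.g. the identity
print(rel_entropy(U, W0), n * math.log(n))   # both equal n ln n
```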
82 If L_best denotes the loss of such a permutation U*, then Theorem 4 implies that the total loss of the algorithm is bounded by ( ∆(U*, W^0) + η L_best ) / (1 − e^{−η}). [sent-337, score-0.549]
83 However, balancing techniques are only capable of approximately balancing the weight matrix in finite time, so implementations of PermELearn must handle approximately balanced matrices. [sent-343, score-0.689]
84 Lemma 1 shows that when this implementation of PermELearn uses an approximately balanced W t−1 where each row and column sum is in 1 ± εt , then the expected loss on trial t is at most W t−1 • Lt + 3n3 εt . [sent-346, score-0.495]
85 In each trial the algorithm picks expert i with probability wi and then gets a loss vector ℓ ∈ [0, 1]N . [sent-356, score-0.486]
86 The Hedge algorithm (Freund and Schapire, 1997) updates its weight vector to w̃_i = w_i e^{−η ℓ_i} / ∑_j w_j e^{−η ℓ_j}. [sent-359, score-0.48]
87 ∑ j w je j This update can be motivated by a tradeoff between the un-normalized relative entropy to the old weight vector and expected loss in the last trial (Kivinen and Warmuth, 1999): w := argmin (∆(w, w) + η w · ℓ) . [sent-360, score-0.5]
88 ∑i wi =1 w For vectors, the relative entropy is simply ∆(w, w) := ∑i wi ln wii + wi − wi . [sent-361, score-0.845]
89 As in the permutation case, we can split this update (and motivation) into two steps: setting each w′_i = w_i e^{−η ℓ_i} and then w̃_i = w′_i / ∑_j w′_j. [sent-362, score-0.518]
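A minimal sketch of this two-step form of the Hedge update for the vector (per-expert) setting; the prior weights and loss vector are made up for illustration:

```python
import numpy as np

def hedge_update(w, loss, eta):
    """Two-step Hedge update: unnormalized exponential step, then renormalization."""
    w_prime = w * np.exp(-eta * loss)        # w'_i = w_i * exp(-eta * l_i)
    return w_prime / w_prime.sum()           # w~_i = w'_i / sum_j w'_j

w = np.full(4, 0.25)                          # uniform prior over 4 experts
loss = np.array([0.2, 0.9, 0.5, 0.0])         # made-up loss vector in [0, 1]^N
print(hedge_update(w, loss, eta=1.0))
```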
90 Instead of projecting a nonnegative matrix onto all 2n constraints at once (as in optimization problem (7)), we could mimic the Sinkhorn balancing algorithm by first projecting onto the row constraints and then the column constraints and alternating until convergence. [sent-384, score-0.675]
91 The Generalized Pythagorean Theorem shows that projecting onto any convex constraint that is satisfied by the comparator class of doubly stochastic matrices brings the weight matrix closer to every doubly stochastic matrix. [sent-385, score-1.181]
92 When ∑i wi < 1, the loss w · L amounts to incurring 0 loss with probability 1 − ∑i wi , and predicting as expert i with probability wi . [sent-391, score-0.778]
93 There is a direct argument that shows that the same final doubly stochastic matrix is reached if we interleave the exponential updates with projections to any of the constraints as long as all 2n constraints hold at the end. [sent-400, score-0.647]
94 Note that MapELearn’s combined weight matrix is now a convex combination of mappings, that is, a “singly” stochastic matrix with the constraint that each row sums to one. [sent-451, score-0.608]
95 Our algorithm PermELearn for permutations may be seen as the above algorithm for mappings while enforcing the column sum constraints in addition to the row constraints used in MapELearn. [sent-461, score-0.528]
96 The enforcement of the additional column sum constraints results in a doubly stochastic matrix, an apparently necessary step to produce predictions that are permutations (and an expected prediction equal to the doubly stochastic weight matrix). [sent-465, score-1.168]
97 This matrix decomposes into √2/(1+√2) times the permutation (1, 2) plus 1/(1+√2) times the permutation (2, 1). [sent-479, score-0.769]
98 Thus the probability of predicting with permutation (1, 2) is √2 times the probability of permutation (2, 1) for the PermELearn algorithm. [sent-480, score-0.626]
99 One trades the divergence to the last weight vector against the loss in the last trial, and the other trades the divergence to the initial weight vector against the loss in all past trials (Azoury and Warmuth, 2001): w^t := argmin_{∑_i w_i = 1} ( ∆(w, w^{t−1}) + η w · ℓ^t ), and w^t := argmin_{∑_i w_i = 1} ( ∆(w, w^0) + η w · ℓ^{≤t} ). [sent-509, score-0.624]
100 Its Lagrangian is ∑_i ( w_i ln(w_i / w^0_i) + w^0_i − w_i + η w_i ℓ^{≤t}_i ) + β ( ∑_i w_i − 1 ), and since there is no duality gap: v^t := min_{∑_i w_i = 1} ( ∆(w, w^0) + η w · ℓ^{≤t} ) = max_β ( ∑_i w^0_i ( 1 − e^{−η ℓ^{≤t}_i − β} ) − β ). [sent-516, score-0.934]
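As a numerical sanity check of the dual expression above (a sketch: the loss vector and η are invented, and the primal minimizer is taken to have the exponential-update form of the weights stated earlier):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
N, eta = 5, 0.7
w0 = np.full(N, 1.0 / N)                     # initial weights
ell = rng.uniform(0.0, 2.0, size=N)          # cumulative loss vector l^{<=t} (made up)

def rel_entropy(w, w_ref):
    return float(np.sum(w * np.log(w / w_ref) + w_ref - w))

# Primal: the minimizer has the exponential-update form w_i ~ w0_i * exp(-eta * l_i).
w_star = w0 * np.exp(-eta * ell)
w_star /= w_star.sum()
primal = rel_entropy(w_star, w0) + eta * float(w_star @ ell)

# Dual: maximize sum_i w0_i * (1 - exp(-eta*l_i - beta)) - beta over beta.
def neg_dual(beta):
    return -(float(np.sum(w0 * (1.0 - np.exp(-eta * ell - beta)))) - beta)

dual = -minimize_scalar(neg_dual).fun
print(primal, dual)                          # equal up to numerical error
```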
wordName wordTfidf (topN-words)
[('permelearn', 0.46), ('permutation', 0.313), ('lbest', 0.303), ('doubly', 0.287), ('permutations', 0.206), ('warmuth', 0.198), ('balancing', 0.197), ('wi', 0.16), ('ln', 0.134), ('sinkhorn', 0.134), ('hedge', 0.133), ('matrix', 0.119), ('regret', 0.118), ('elmbold', 0.117), ('trial', 0.113), ('loss', 0.107), ('ermutations', 0.1), ('weight', 0.092), ('trials', 0.088), ('rac', 0.088), ('row', 0.086), ('balanced', 0.084), ('expert', 0.084), ('experts', 0.083), ('ai', 0.082), ('stochastic', 0.08), ('lt', 0.079), ('kivinen', 0.076), ('matrices', 0.072), ('allocation', 0.072), ('airline', 0.068), ('comparator', 0.068), ('ri', 0.062), ('birkhoff', 0.059), ('dest', 0.059), ('mapelearn', 0.059), ('column', 0.054), ('li', 0.053), ('normalization', 0.053), ('entries', 0.053), ('mappings', 0.052), ('motif', 0.05), ('bound', 0.049), ('motifs', 0.048), ('entropy', 0.047), ('updates', 0.046), ('update', 0.045), ('argmin', 0.045), ('convex', 0.044), ('bregman', 0.043), ('freund', 0.042), ('herbster', 0.042), ('kuzmin', 0.042), ('helmbold', 0.041), ('leader', 0.041), ('progress', 0.039), ('airplanes', 0.039), ('lest', 0.039), ('wti', 0.039), ('rows', 0.039), ('sums', 0.039), ('entry', 0.038), ('predicts', 0.038), ('exponentiated', 0.034), ('losses', 0.033), ('pythagorean', 0.033), ('invariant', 0.033), ('schapire', 0.033), ('diagonal', 0.032), ('constraints', 0.031), ('earning', 0.03), ('matching', 0.03), ('jc', 0.03), ('permanent', 0.029), ('projections', 0.029), ('combination', 0.029), ('factors', 0.029), ('perturbed', 0.028), ('lagrangian', 0.028), ('expected', 0.027), ('onto', 0.027), ('maintains', 0.026), ('la', 0.025), ('projecting', 0.025), ('tuned', 0.025), ('hadamard', 0.025), ('exponential', 0.024), ('position', 0.024), ('wt', 0.024), ('decomposes', 0.024), ('littlestone', 0.024), ('relative', 0.024), ('sum', 0.024), ('columns', 0.024), ('tracking', 0.023), ('mapping', 0.023), ('positions', 0.023), ('normalizes', 0.022), ('route', 0.022), ('algorithm', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
2 0.10922517 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario. Keywords: identity management, permutations, approximate inference, group theoretical methods, sensor networks
3 0.097629301 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
4 0.081003204 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
Author: Vladimir Vovk, Fedor Zhdanov
Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm
5 0.06252607 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
6 0.054157399 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
7 0.046300244 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
8 0.045694835 23 jmlr-2009-Discriminative Learning Under Covariate Shift
9 0.044120036 50 jmlr-2009-Learning When Concepts Abound
10 0.040357329 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
11 0.039676584 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
12 0.039544683 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
13 0.038856406 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
14 0.038812853 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
15 0.036247198 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
16 0.036185496 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
17 0.036049481 67 jmlr-2009-Online Learning with Sample Path Constraints
18 0.035339382 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
19 0.033767916 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
20 0.033116113 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
topicId topicWeight
[(0, 0.188), (1, -0.015), (2, 0.032), (3, -0.045), (4, 0.019), (5, 0.089), (6, -0.109), (7, 0.029), (8, -0.088), (9, -0.04), (10, 0.081), (11, -0.067), (12, -0.055), (13, 0.101), (14, -0.227), (15, 0.164), (16, -0.002), (17, -0.335), (18, 0.003), (19, -0.218), (20, 0.056), (21, 0.116), (22, -0.143), (23, -0.247), (24, -0.092), (25, -0.004), (26, -0.01), (27, -0.043), (28, -0.067), (29, -0.043), (30, -0.023), (31, 0.004), (32, 0.03), (33, -0.047), (34, 0.053), (35, -0.118), (36, -0.088), (37, 0.001), (38, -0.121), (39, 0.029), (40, -0.08), (41, -0.123), (42, 0.043), (43, 0.012), (44, -0.095), (45, -0.176), (46, -0.042), (47, -0.044), (48, -0.03), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.96630102 49 jmlr-2009-Learning Permutations with Exponential Weights
2 0.53834909 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
3 0.51544392 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
4 0.4177638 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
5 0.34493679 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
6 0.25989705 50 jmlr-2009-Learning When Concepts Abound
7 0.24658076 67 jmlr-2009-Online Learning with Sample Path Constraints
8 0.24463362 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
9 0.23079346 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
10 0.22153084 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
11 0.22069241 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
12 0.2045368 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
13 0.18162563 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
14 0.18134099 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
15 0.17437336 29 jmlr-2009-Estimating Labels from Label Proportions
16 0.16842321 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
17 0.1675908 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
18 0.16505083 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
19 0.16465086 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
20 0.16163389 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
topicId topicWeight
[(8, 0.015), (11, 0.012), (38, 0.019), (47, 0.015), (52, 0.031), (55, 0.027), (58, 0.016), (66, 0.695), (68, 0.014), (90, 0.035), (96, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99617672 49 jmlr-2009-Learning Permutations with Exponential Weights
2 0.99398506 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
Author: Jan Ramon, Siegfried Nijssen
Abstract: Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation schema, and introduce an algorithm that can be used to enumerate any class of graphs with polynomial delay, as long as the class of graphs can be described using a monotonic predicate operating on a dense augmentation schema. In practice this means that this is the first enumeration algorithm that can be applied theoretically efficiently in any frequent subgraph mining algorithm, and that this algorithm generalizes to situations beyond the standard frequent subgraph mining setting. Keywords: graph mining, enumeration, monotonic graph classes
3 0.9850288 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
4 0.97258633 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
5 0.97092599 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
Author: Jun Zhu, Eric P. Xing
Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural maxentropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the wellknown maximum margin Markov networks (M3 N) as a special case, and leads to a model similar to an L1 -regularized M3 N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M3 N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov network for structured input/output problems; and the basic principle can be generalized to learning arbi
6 0.839248 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
7 0.83243144 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
8 0.83180773 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
9 0.81924635 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
10 0.81852347 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
11 0.81204998 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
12 0.80292976 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
13 0.80001992 78 jmlr-2009-Refinement of Reproducing Kernels
14 0.79834265 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
15 0.79583842 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
16 0.79376698 46 jmlr-2009-Learning Halfspaces with Malicious Noise
17 0.78498083 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
18 0.782363 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
19 0.77796549 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
20 0.77639967 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods