jmlr jmlr2012 jmlr2012-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
Reference: text
sentIndex sentText sentNum sentScore
1 Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. [sent-12, score-0.51]
2 Keywords: online learning, confidence prediction, text categorization 1. [sent-13, score-0.218]
3 Introduction While online learning is among the oldest approaches to machine learning, starting with the perceptron algorithm (Rosenblatt, 1958), it is still one of the most popular and successful approaches for many practical tasks. [sent-14, score-0.271]
4 In online learning, algorithms operate in rounds, whereby the algorithm is shown a single example for which it must first make a prediction and then update its hypothesis once it has seen the correct label. [sent-15, score-0.27]
5 By operating one example at a time, online methods are fast, simple, make few assumptions about the data, and perform fairly well across many domains and tasks. [sent-17, score-0.172]
6 For those reasons, online methods are often favored for large data problems, and they are also a natural fit for systems that learn from interaction with a user or another system. [sent-18, score-0.172]
7 In addition to their nice empirical properties, online algorithms have been analyzed in the mistake bound model (Littlestone, 1989), which supports both theoretical and empirical comparisons of performance. [sent-20, score-0.2]
8 Cesa-Bianchi and Lugosi (2006) provide an in-depth analysis of online learning algorithms. [sent-21, score-0.172]
9 Extensions of online learning to structured problems (Collins, 2002; McDonald et al. [sent-24, score-0.172]
10 Popular online methods for those tasks include the perceptron (Rosenblatt, 1958), passive-aggressive (Crammer et al. [sent-31, score-0.306]
11 Feature representations of text for tasks from spam filtering to parsing need to capture the variety of words, word combinations, and word attributes in the text, yielding very high-dimensional feature vectors, even though most of the features are absent in most texts. [sent-35, score-0.236]
12 The foregoing motivation led us to propose confidence-weighted (CW) learning, a class of online learning methods that maintain a probabilistic measure of confidence in each weight. [sent-40, score-0.172]
13 The result is an algorithm with superior classification accuracy over state-of-the-art online and batch baselines, faster learning, and new classifier combination methods for parallel training. [sent-43, score-0.269]
14 An online update would increase the weight of both “liked” and “author.” [sent-78, score-0.331]
15 Since both are common words, over several examples the algorithm would converge to the correct values, a positive weight for “liked” and zero weight for “author.” [sent-79, score-0.184]
16 This example demonstrates how a lack of memory for previous examples—a property that allows online learning—can hurt learning. [sent-85, score-0.172]
17 Instead of equally updating every feature weight for the on-features of an example, the update favors changing low-confidence weights more aggressively than high-confidence ones. [sent-91, score-0.191]
18 Second, our notion of weight confidence is based on a probabilistic interpretation of passive-aggressive online learning, which differs from the more familiar Bayesian learning for linear classifiers. [sent-97, score-0.264]
19 On round i the algorithm receives an example xi ∈ Rd to which it applies its current prediction rule to produce a prediction ŷi ∈ {−1, +1} (for binary classification). [sent-104, score-0.29]
20 It then receives the true label yi ∈ {−1, +1} and suffers a loss ℓ(ŷi , yi), which in this work will be the zero-one loss: ℓ(ŷi , yi) = 1 if ŷi ≠ yi and ℓ(ŷi , yi) = 0 otherwise. [sent-105, score-0.432]
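To make the protocol concrete, the round structure above can be written as a short loop; the sketch below is an illustration only (not code from the paper), and the learner object with predict and update methods is a placeholder interface. The returned quantity is the cumulative zero-one loss used for online evaluation.

```python
def online_round_loop(learner, stream):
    """Run the standard online protocol and return the cumulative zero-one loss."""
    total_loss = 0
    for x_i, y_i in stream:               # round i: receive example x_i
        y_hat = learner.predict(x_i)      # first commit to a prediction
        total_loss += int(y_hat != y_i)   # zero-one loss on this round
        learner.update(x_i, y_i)          # then update the hypothesis using the true label
    return total_loss
```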
21 For online evaluations, error is reported as the total loss ℓ on the training data, while in batch evaluations error is reported on held-out data. [sent-107, score-0.303]
22 We denote the margin at round i by mi = yi (wi · xi ). [sent-114, score-0.52]
23 Online algorithms of that kind typically have updates of the form wi+1 = wi + αi yi xi , (1) for some non-negative coefficients αi . [sent-116, score-0.208]
24 The passive-aggressive (PA) algorithm updates the prediction function such that the example (xi , yi) will be classified correctly with a fixed margin (which can always be scaled to 1): wi+1 = arg min_w (1/2) ‖w − wi‖² s.t. yi (w · xi) ≥ 1 . [sent-123, score-0.23]
25 Solving this problem leads to an update of the form given by (1) with coefficient αi defined on each round as: αi = max {1 − yi (wi · xi), 0} / ‖xi‖² . (3) Like the perceptron, this is a mistake-driven update, whereby αi > 0 iff the learning condition was not met, i.e., yi (wi · xi) < 1. [sent-127, score-0.416]
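As a concrete rendering of the update form (1) with the coefficient (3), one passive-aggressive round can be sketched as below; this is a minimal NumPy illustration, not the authors' implementation, and it assumes a dense feature vector x with nonzero norm.

```python
import numpy as np

def pa_update(w, x, y):
    """One passive-aggressive round: enforce the margin y*(w.x) >= 1 with a minimal change to w."""
    margin = y * np.dot(w, x)
    alpha = max(1.0 - margin, 0.0) / np.dot(x, x)  # coefficient alpha_i of (3)
    return w + alpha * y * x                       # additive update of the form (1)
```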
26 Distributions over Classifiers Following the motivation of Section 2, we need a notion of confidence for the weight vector w maintained by an online learner for linear classifiers. [sent-136, score-0.264]
27 A Gibbs predictor samples from the distribution a single weight vector w, which is equivalent to drawing a margin value using (4), and takes its sign as the prediction. [sent-154, score-0.176]
28 CW is an online learning algorithm, so on round i the algorithm receives example xi for which it issues a prediction ŷi . [sent-171, score-0.431]
29 The algorithm predicts ŷi as sign (µi · xi), which is equivalent to averaging the predictions of many weight vectors sampled from the distribution. [sent-172, score-0.257]
30 In this case, a large prediction margin is formalized as ensuring that the probability of a correct prediction for training example i is no smaller than the confidence level η ∈ [0, 1]: Pr [yi (w · xi ) ≥ 0] ≥ η . [sent-176, score-0.273]
31 As noted above, under the distribution N (µ, Σ), the margin for (xi , yi) has a Gaussian distribution with mean mi = yi (µi · xi) (7) and variance σ² = vi = xi⊤ Σi xi . (8) [sent-185, score-0.884]
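The margin statistics (7)-(8) and the induced probability of a correct prediction can be computed as in the following sketch; it is an illustration only, assumes dense NumPy arrays, and uses SciPy's Gaussian CDF for the probability.

```python
import numpy as np
from scipy.stats import norm

def cw_margin_stats(mu, Sigma, x, y):
    """Mean and variance of the signed margin y*(w.x) when w ~ N(mu, Sigma); cf. (7)-(8)."""
    m = y * np.dot(mu, x)          # m_i = y_i (mu_i . x_i)
    v = float(x @ Sigma @ x)       # v_i = x^T Sigma_i x
    return m, v

def cw_predict(mu, x):
    """CW prediction uses only the mean weight vector: sign(mu . x)."""
    return 1 if np.dot(mu, x) >= 0 else -1

def prob_correct(mu, Sigma, x, y):
    """Pr[y (w.x) >= 0] under the Gaussian margin distribution of the probabilistic constraint."""
    m, v = cw_margin_stats(mu, Sigma, x, y)
    return float(norm.cdf(m / np.sqrt(v)))
```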
32 To conclude, the update rule solves the following optimization problem: (µi+1 , Σi+1) = arg min µ,Σ (1/2) log ( det Σi / det Σ ) + (1/2) Tr ( Σi^{−1} Σ ) + (1/2) (µi − µ)⊤ Σi^{−1} (µi − µ) (9) s.t. yi (µ · xi) ≥ φ ( xi⊤ Σ xi ) . (10) [sent-195, score-0.199]
33 Conceptually, this is a large-margin constraint, where the value of the margin requirement depends on the example xi via a quadratic form. [sent-198, score-0.177]
34 The resulting coefficient is αi = max {0, γi} , where γi = ( −(1 + 2φmi) + √( (1 + 2φmi)² − 8φ (mi − φvi) ) ) / ( 4φvi ) , mi = yi (µi · xi) (see (7)), and vi = xi⊤ Σi xi (see (8)). [sent-213, score-0.728]
35 The Lagrangian for this optimization is L = (1/2) log ( det Σi / det Σ ) + (1/2) Tr ( Σi^{−1} Σ ) + (1/2) (µi − µ)⊤ Σi^{−1} (µi − µ) + α ( −yi (µ · xi) + φ xi⊤ Σ xi ) . [sent-229, score-0.225]
36 Substituting (11) and (14) into the equality version of (10), we obtain yi (xi · (µi + α yi Σi xi)) = φ xi⊤ ( Σi − βi Σi xi xi⊤ Σi ) xi . (16) [sent-238, score-0.351]
37 Rearranging terms we get yi (xi · µi) + α xi⊤ Σi xi = φ xi⊤ Σi xi − φ vi² βi . (17) [sent-239, score-0.258]
38 Substituting (7), (8), and (15) into (17) we obtain mi + αvi = φvi − φvi² · 2αφ / ( 1 + 2αφvi ) . [sent-240, score-0.208]
39 Rearranging the terms we obtain 0 = mi + αvi + 2αφvi mi + 2α²φvi² − φvi = α² (2φvi²) + αvi (1 + 2φmi) + (mi − φvi) . (18) [sent-242, score-0.416]
40 The constraint (10) is satisfied before the update if mi − φvi ≥ 0. [sent-246, score-0.319]
41 If 1 + 2φmi ≤ 0, then mi ≤ φvi and from (18) we have that γi > 0. [sent-247, score-0.208]
42 If, instead, 1 + 2φmi ≥ 0, then, again by (18), we have γi > 0 ⇔ √( (1 + 2φmi)² − 8φ (mi − φvi) ) > (1 + 2φmi) ⇔ mi < φvi . [sent-248, score-0.208]
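Putting these pieces together, one full-covariance CW round under the linearized constraint can be sketched as follows; this is an illustrative NumPy rendering of the closed form above (α as the clipped larger root of the quadratic (18), and the Woodbury form of the covariance update), not the paper's code, and it assumes a dense covariance matrix rather than the diagonal used in the experiments.

```python
import numpy as np

def cw_var_update(mu, Sigma, x, y, phi):
    """One CW round under the linearized constraint y*(mu.x) >= phi * x'Sigma x; assumes v > 0, phi > 0."""
    m = y * np.dot(mu, x)                      # margin mean m_i, see (7)
    v = float(x @ Sigma @ x)                   # margin variance v_i, see (8)
    gamma = (-(1 + 2 * phi * m)
             + np.sqrt((1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v))) / (4 * phi * v)
    alpha = max(0.0, gamma)                    # larger root of (18), clipped at zero
    if alpha == 0.0:                           # constraint already satisfied: no change
        return mu, Sigma
    Sx = Sigma @ x
    mu_new = mu + alpha * y * Sx               # mean update
    beta = 2 * alpha * phi / (1 + 2 * alpha * phi * v)
    Sigma_new = Sigma - beta * np.outer(Sx, Sx)  # rank-one covariance shrink, cf. (14)
    return mu_new, Sigma_new
```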
43 Instead, as before we derive a closed-form solution which we summarize in the following lemma: Lemma 2 The optimal solution of this form is µi+1 = µi + α yi Σi xi and Σi+1 = Σi − β Σi xi xi⊤ Σi , where β = αφ / ( √(vi+) + vi αφ ) and vi+ = xi⊤ Σi+1 xi , [sent-264, score-0.541]
44 and the value of the parameter α (a Lagrange multiplier) is given by [sent-265, score-0.262]
45 α = max { 0, ( −mi φ′ + √( mi² φ⁴ / 4 + vi φ² φ′′ ) ) / ( vi φ′′ ) } , where mi = yi (µi · xi) (see (7)), vi = xi⊤ Σi xi (see (8)), and for simplicity we define φ′ = 1 + φ²/2 and φ′′ = 1 + φ² . [sent-266, score-0.99]
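A corresponding sketch of one round under this exact ("stdev") constraint, using Lemma 2 together with the closed form for √(vi+) derived later in this section; again an illustrative NumPy rendering assuming a dense covariance matrix, not the authors' implementation.

```python
import numpy as np

def cw_stdev_update(mu, Sigma, x, y, phi):
    """One CW round under the exact constraint y*(mu.x) >= phi * sqrt(x'Sigma x); assumes v > 0."""
    m = y * np.dot(mu, x)                      # m_i
    v = float(x @ Sigma @ x)                   # v_i
    phi1 = 1.0 + phi ** 2 / 2.0                # phi'
    phi2 = 1.0 + phi ** 2                      # phi''
    alpha = max(0.0, (-m * phi1 + np.sqrt(m ** 2 * phi ** 4 / 4.0 + v * phi ** 2 * phi2)) / (v * phi2))
    if alpha == 0.0:
        return mu, Sigma
    # sqrt(v_i^+) from the quadratic relating new and old margin variances, cf. (28)
    sqrt_v_plus = (-alpha * v * phi + np.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) / 2.0
    beta = alpha * phi / (sqrt_v_plus + v * alpha * phi)
    Sx = Sigma @ x
    return mu + alpha * y * Sx, Sigma - beta * np.outer(Sx, Sx)
```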
46 Derivation of Lemma 2. The Lagrangian for (19) is L = (1/2) log ( det ϒi² / det ϒ² ) + (1/2) Tr ( ϒi^{−2} ϒ² ) + (1/2) (µi − µ)⊤ ϒi^{−2} (µi − µ) + α ( −yi (µ · xi) + φ ‖ϒ xi‖ ) . [sent-270, score-0.225]
47 The derivative with respect to ϒ is ∂L/∂ϒ = −ϒ^{−1} + (1/2) ( ϒi^{−2} ϒ + ϒ ϒi^{−2} ) + αφ ( xi xi⊤ ϒ + ϒ xi xi⊤ ) / ( 2 √( xi⊤ ϒ² xi ) ) . (21) Defining the matrix C = ϒi^{−2} + αφ xi xi⊤ / √( xi⊤ ϒ² xi ) , (22) we get ∂L/∂ϒ = −ϒ^{−1} + (1/2) ϒ C + (1/2) C ϒ = 0 at the optimum. [sent-278, score-0.279]
48 Conveniently, the final form of the updates can be expressed in terms of the covariance matrix: µi+1 = µi + α yi Σi xi (23) and Σi+1^{−1} = Σi^{−1} + αφ xi xi⊤ / √( xi⊤ Σi+1 xi ) . (24) [sent-281, score-0.275]
49 As before we compute the inverse of (24) using the Woodbury identity (Petersen and Pedersen, 2008) to get Σi+1 = ( Σi^{−1} + αφ xi xi⊤ / √( xi⊤ Σi+1 xi ) )^{−1} = Σi − Σi xi ( √( xi⊤ Σi+1 xi ) / (αφ) + xi⊤ Σi xi )^{−1} xi⊤ Σi = Σi − Σi xi ( αφ / ( √( xi⊤ Σi+1 xi ) + αφ xi⊤ Σi xi ) ) xi⊤ Σi = Σi − βi Σi xi xi⊤ Σi , (25) [sent-288, score-0.558]
50 where we define vi+ = xi⊤ Σi+1 xi (26) and βi = αi φ / ( √(vi+) + vi αi φ ) . (27) [sent-289, score-0.355]
51 Multiplying (25) by xi⊤ (left) and xi (right) we get vi+ = vi − vi ( αφ / ( √(vi+) + vi αφ ) ) vi , which is equivalent to vi+ ( √(vi+) + vi αφ ) = vi √(vi+) + vi² αφ − vi² αφ = vi √(vi+) . [sent-290, score-1.927]
52 Dividing both sides by √(vi+), we obtain vi+ + √(vi+) vi αφ − vi = 0 , which can be solved for √(vi+) to obtain √(vi+) = ( −α vi φ + √( α² vi² φ² + 4 vi ) ) / 2 . (28) [sent-291, score-0.524]
53 Using the equality version of (19) and Equations (23), (25), (26), (28) we obtain mi + αvi = φ ( −α vi φ + √( α² vi² φ² + 4 vi ) ) / 2 , (29) which can be rearranged into the following quadratic equation in α: α² vi² (1 + φ²) + 2α mi vi (1 + φ²/2) + ( mi² − vi φ² ) = 0 . [sent-293, score-0.732]
54 The larger root is then γi = ( −mi vi φ′ + √( mi² vi² φ′² − vi² φ′′ ( mi² − vi φ² ) ) ) / ( vi² φ′′ ) . (30) [sent-296, score-0.524]
55 The constraint (19) is satisfied before the update if mi − φ√vi ≥ 0. [sent-297, score-0.581]
56 If mi ≤ 0, then mi ≤ φ√vi and from (30) we have that γi > 0. [sent-298, score-0.678]
57 If instead mi ≥ 0, then, again by (30), we have γi > 0 ⇔ mi vi φ′ < √( mi² vi² φ′² − vi² φ′′ ( mi² − vi φ² ) ) ⇔ mi < φ√vi . [sent-299, score-1.148]
58 We obtain the final form of αi by simplifying (30) together with the last comment and get αi = max { 0, ( −mi φ′ + √( mi² φ⁴ / 4 + vi φ² φ′′ ) ) / ( vi φ′′ ) } . [sent-302, score-0.262]
59 In the second approach, we take an alternative route and re-develop the update step assuming an explicit diagonal representation of the covariance matrix. [sent-313, score-0.176]
60 Approximate Diagonal Update. Both updates above (linearization or change-of-variables) share the same form when updating the covariance matrix ((14) or (25)): Σi+1 = Σi − βi Σi xi xi⊤ Σi . (32) [sent-315, score-0.182]
61 Our diagonalization step defines the final matrix to be a diagonal matrix whose non-zero elements equal the diagonal elements of (32). [sent-316, score-0.213]
62 Formally we get Σi+1 = diag ( Σi − βi Σi xi xi⊤ Σi ) = diag (Σi) − diag ( βi Σi xi xi⊤ Σi ) = Σi − βi diag ( Σi xi xi⊤ Σi ) , where the last equality follows since we assume that Σi is diagonal; here diag (A) denotes the diagonal matrix that agrees with A on its diagonal (p = p′ entries) and is zero elsewhere. [sent-317, score-0.537]
63 Concretely, we start from the update of the inverse-covariance, Σi+1^{−1} = Σi^{−1} + ηi xi xi⊤ , where ηi = 2αi φ for the linearization approach ((13)) and ηi = αi φ / √( xi⊤ Σi+1 xi ) for the change-of-variables approach ((24)). [sent-325, score-0.31]
64 We first diagonalize the inverse-covariance and get Σi+1^{−1} = diag ( Σi^{−1} + ηi xi xi⊤ ) = diag (Σi^{−1}) + diag ( ηi xi xi⊤ ) = Σi^{−1} + ηi diag ( xi xi⊤ ) . [sent-331, score-0.435]
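For the linearization variant this diagonal update reduces to per-coordinate arithmetic, as in the sketch below; it is an illustration only, stores the diagonal of Σ as a vector, and assumes α has already been computed (for example by the root-finding discussed next).

```python
import numpy as np

def diagonal_cw_var_step(mu, sigma_diag, x, y, phi, alpha):
    """Approximate-diagonal covariance update for the linearized version.

    sigma_diag holds the diagonal of Sigma_i; the inverse update
    Sigma^{-1} <- Sigma^{-1} + eta * diag(x x^T), with eta = 2*alpha*phi,
    keeps the matrix diagonal and costs O(d) per round.
    """
    eta = 2.0 * alpha * phi
    sigma_new = 1.0 / (1.0 / sigma_diag + eta * x ** 2)  # per-coordinate closed form
    mu_new = mu + alpha * y * sigma_diag * x             # mean update uses the current diagonal
    return mu_new, sigma_new
```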
65 Substituting (7) and (8) and rearranging the terms we get the constraint f(α) = 0 , where we defined f(α) = mi + αvi − ∑p φ Σi,(p) xi,(p)² / ( 1 + 2αφ Σi,(p) xi,(p)² ) . (33) [sent-352, score-0.252]
66 As before we employ the KKT conditions, which state that when α > 0 we have mi + αvi = φ √( xi⊤ ϒi+1² xi ) . [sent-362, score-0.394]
67 Substituting in the last equality we get √( xi⊤ ϒi+1² xi ) = ∑p φ xi,(p)² ϒi,(p)² / ( mi + αvi + αφ² xi,(p)² ϒi,(p)² ) . [sent-363, score-0.208]
68 We use again the KKT conditions and get that the optimal value αi+1 is the solution of g(α) = 0 for g(α) = mi + αvi − ∑p φ² xi,(p)² ϒi,(p)² / ( mi + αvi + αφ² xi,(p)² ϒi,(p)² ) . (34) [sent-364, score-0.416]
69 The function g(α) defined in (34) and the function f(α) defined in (33) are both of the form h(α) = mi + αvi − ∑r ar / ( b + cr α ) , where vi , ar , cr ≥ 0. [sent-365, score-0.47]
70 The only difference is that b = 1 > 0 in (33) and b = mi in (34). [sent-366, score-0.208]
71 The following lemma summarizes a few properties of both functions: Lemma 3 Assume that vi > 0 and let Li = max{0, −mi /vi }. [sent-368, score-0.262]
72 For each function there exists a value Ui such that its value at Ui is positive: f(Ui) > 0 , g(Ui) > 0. Proof For the first property we consider two cases, mi ≥ 0 and mi < 0. [sent-373, score-0.262]
73 Thus, f(0) = mi − ∑p φ Σi,(p) xi,(p)² = mi − φvi < 0, where the last inequality follows since we assume that the constraint of (10) does not hold. [sent-375, score-0.262]
74 Also, g(0) = mi − (1/mi) ∑p φ² xi,(p)² ϒi,(p)² = mi − φ² vi / mi < 0, since we assumed that the constraint of (19) does not hold. [sent-376, score-0.262]
75 The second property, that both functions are strictly increasing, follows immediately since vi > 0 and since for both functions the denominator of each term in the sum over p is an increasing function of α which is non-negative in the range α ≥ Li . [sent-380, score-0.262]
76 Note that ai = max {0, −2mi /vi } satisfies mi +(ai /2)vi ≥ 0. [sent-386, score-0.208]
77 Thus, bi = maxp 2d φ Σi,(p) xi,(p)² / vi satisfies bi vi / (2d) − φ Σi,(p) xi,(p)² / ( 1 + 2αφ Σi,(p) xi,(p)² ) ≥ 0 for p = 1, . . . , d. [sent-387, score-0.262]
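Given the properties established in Lemma 3 (negative at the left endpoint, positive at some Ui, strictly increasing), either constraint can be solved numerically by bisection; the generic routine below is an illustrative sketch of that step, with h standing for f or g and the bracket [Li, Ui] supplied by the lemma.

```python
def solve_alpha(h, lower, upper, tol=1e-10, max_iter=100):
    """Root of a strictly increasing function h on [lower, upper] via bisection.

    Here h would be f (Equation (33)) or g (Equation (34)), lower = L_i and upper = U_i.
    """
    lo, hi = lower, upper
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid        # root lies to the right
        else:
            hi = mid        # root lies to the left (or at mid)
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```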
78 We compare our methods against each other and against competitive online and batch learning algorithms. [sent-398, score-0.269]
79 7 Results We start by comparing the performance of the diagonalized CW algorithms: var (linearization) against stdev (change of variables), approximate against exact diagonalization, and for approximate updates, KL against L2 . [sent-479, score-0.241]
80 Starting with the KL methods for var and stdev, the stdev method does slightly better, a result shown in Crammer et al. [sent-483, score-0.173]
81 achieving the best or closest to best results for the var and stdev methods. [sent-492, score-0.173]
82 We next compare the results from our approximate-diagonalization CW methods to other popular online learning algorithms (Table 3). [sent-502, score-0.259]
83 As discussed above, online algorithms are attractive even for batch learning because of their simplicity and ability to operate on extremely large data sets. [sent-508, score-0.269]
84 8 Batch Learning While online algorithms are widely used, batch algorithms are still preferred for many tasks. [sent-630, score-0.269]
85 Classifier parameters (Gaussian prior for maxent and C for SVM) were tuned as for the online methods. [sent-634, score-0.228]
86 As expected, the batch methods tend to do better than the online methods (perceptron, PA, and SGD). [sent-636, score-0.269]
87 However, in 13 out of 17 tasks the CW methods outperform the batch methods. [Table: per-task error rates of the CW var and stdev methods (KL and L2) against the batch baselines on the Sentiment (Apparel, Books, DVD, Electronics, Kitchen, Music, Video), ECML spam (A, B, C), Reuters (Retail, Business, Insurance), 20 Newsgroups (Comp, Sci, Talk), Pascal Webspam, and USPS (0 vs 9, 1 vs 2, 3 vs 4, 5 vs 6, 7 vs 8) tasks.] [sent-637, score-0.283]
88 The much faster and simpler online algorithm performs better than the slower, more complex batch methods. [sent-820, score-0.269]
89 The speed advantage of online methods in the batch setting can be seen in Table 5, which shows the average training time in seconds for a single experiment (fold) for a representative selection of CW algorithms and some of the baselines. [sent-821, score-0.303]
90 The online times include the multiple iterations selected for each online learning experiment. [sent-822, score-0.344]
91 The differences between the online and batch algorithms are striking. [sent-823, score-0.269]
92 While CW performs better than the batch methods, it is also much faster, while being equivalent in speed to the other online methods. [sent-824, score-0.269]
93 We also evaluated the effects of commonly used techniques for online and batch learning, including averaging and TFIDF features; they did not improve results so details are omitted. [sent-827, score-0.269]
94 This means that Reuters features will occur several times during training while many sentiment features occur only once. [sent-1080, score-0.236]
95 Figure 5: The trained model’s accuracy on the training data for fifteen of the data sets, for the exact diagonal and L2 diagonal approximation stdev methods. [sent-1099, score-0.221]
96 Our update has a more general form, in which the input vector xi is linearly transformed using the covariance matrix, both rotating the input and assigning weight-specific learning rates. [sent-1141, score-0.298]
97 The algorithm maintains a distribution over weight vectors; online updates both improve the weight estimates and reduce the distribution’s variance. [sent-1192, score-0.399]
98 Our method improves over both online and batch methods and learns faster on over a dozen NLP data sets. [sent-1193, score-0.269]
99 Single-pass online learning: Performance, voting schemes and online feature selection. [sent-1226, score-0.344]
100 Large margin online learning algorithms for scalable structured classification. [sent-1349, score-0.256]
wordName wordTfidf (topN-words)
[('cw', 0.39), ('vi', 0.262), ('mi', 0.208), ('ereira', 0.203), ('redze', 0.203), ('ategorization', 0.192), ('onfidence', 0.192), ('rammer', 0.173), ('online', 0.172), ('eighted', 0.164), ('ext', 0.164), ('crammer', 0.151), ('koby', 0.142), ('stdev', 0.135), ('sentiment', 0.134), ('dredze', 0.121), ('reuters', 0.121), ('spam', 0.121), ('nlp', 0.113), ('sop', 0.101), ('perceptron', 0.099), ('batch', 0.097), ('inear', 0.095), ('xi', 0.093), ('weight', 0.092), ('lassification', 0.092), ('fernando', 0.087), ('diagonalization', 0.087), ('margin', 0.084), ('dence', 0.081), ('liked', 0.079), ('mcdonald', 0.078), ('kl', 0.074), ('yi', 0.072), ('apparel', 0.068), ('retail', 0.068), ('update', 0.067), ('det', 0.066), ('pa', 0.065), ('round', 0.063), ('diagonal', 0.063), ('classi', 0.061), ('webspam', 0.058), ('linearization', 0.057), ('dull', 0.056), ('maxent', 0.056), ('rare', 0.049), ('con', 0.049), ('psd', 0.049), ('dvd', 0.048), ('insurance', 0.048), ('sci', 0.048), ('text', 0.046), ('covariance', 0.046), ('ui', 0.045), ('constraint', 0.044), ('kitchen', 0.043), ('comp', 0.043), ('shivaswamy', 0.043), ('updates', 0.043), ('vs', 0.042), ('kkt', 0.041), ('li', 0.041), ('mark', 0.04), ('talk', 0.04), ('diag', 0.039), ('var', 0.038), ('collins', 0.037), ('ers', 0.037), ('electronics', 0.035), ('books', 0.035), ('diagonalized', 0.035), ('tasks', 0.035), ('training', 0.034), ('features', 0.034), ('carreras', 0.034), ('mallet', 0.034), ('blitzer', 0.033), ('tony', 0.033), ('business', 0.033), ('exact', 0.033), ('lagrange', 0.032), ('language', 0.032), ('rosenblatt', 0.032), ('emnlp', 0.032), ('fw', 0.032), ('weights', 0.032), ('music', 0.032), ('linguistics', 0.031), ('prediction', 0.031), ('processors', 0.03), ('gaussian', 0.03), ('carvalho', 0.029), ('kulesza', 0.029), ('mcnemar', 0.029), ('trained', 0.028), ('beats', 0.028), ('mistake', 0.028), ('acl', 0.028), ('ryan', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
2 0.10972196 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisychannel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale ArabicEnglish translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
3 0.084868848 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
4 0.080365963 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
Author: Nir Ailon
Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences
5 0.079238348 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
6 0.072077416 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
7 0.06040027 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
8 0.058301806 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.057021528 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
10 0.055491131 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
11 0.052972291 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
12 0.05184735 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
13 0.051772244 82 jmlr-2012-On the Necessity of Irrelevant Variables
14 0.051686957 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
15 0.050676938 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
16 0.047531184 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
17 0.047524143 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
18 0.046756946 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
19 0.046748891 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
20 0.04547051 90 jmlr-2012-Pattern for Python
topicId topicWeight
[(0, -0.23), (1, -0.044), (2, 0.086), (3, 0.015), (4, 0.037), (5, -0.003), (6, 0.153), (7, 0.046), (8, -0.017), (9, -0.038), (10, -0.022), (11, 0.114), (12, -0.105), (13, -0.122), (14, 0.095), (15, 0.008), (16, 0.145), (17, -0.062), (18, -0.14), (19, 0.124), (20, -0.124), (21, 0.123), (22, 0.024), (23, 0.019), (24, 0.278), (25, -0.047), (26, -0.112), (27, 0.03), (28, -0.025), (29, -0.023), (30, -0.081), (31, 0.088), (32, -0.119), (33, 0.093), (34, 0.141), (35, -0.17), (36, -0.049), (37, 0.08), (38, 0.036), (39, 0.014), (40, -0.09), (41, -0.012), (42, 0.065), (43, 0.097), (44, 0.007), (45, 0.064), (46, 0.01), (47, 0.062), (48, -0.076), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.93058556 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
2 0.81876963 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisychannel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale ArabicEnglish translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
3 0.49230194 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
Author: Nir Ailon
Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences
5 0.40480277 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.37239799 90 jmlr-2012-Pattern for Python
7 0.34496921 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
8 0.34149694 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
9 0.33374318 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
10 0.324871 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
11 0.31924099 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
12 0.3120496 97 jmlr-2012-Regularization Techniques for Learning with Matrices
13 0.29330334 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
14 0.28151721 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.27812085 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
16 0.2771115 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
17 0.273433 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
18 0.26692486 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
19 0.25286177 44 jmlr-2012-Feature Selection via Dependence Maximization
20 0.2512897 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
topicId topicWeight
[(2, 0.011), (7, 0.014), (21, 0.022), (26, 0.035), (27, 0.013), (29, 0.024), (49, 0.014), (56, 0.017), (57, 0.011), (69, 0.014), (75, 0.053), (77, 0.013), (92, 0.042), (96, 0.62)]
simIndex simValue paperId paperTitle
1 0.9927783 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
Author: Yang Wang, Duan Tran, Zicheng Liao, David Forsyth
Abstract: We consider the problem of parsing human poses and recognizing their actions in static images with part-based models. Most previous work in part-based models only considers rigid parts (e.g., torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate. In this paper, we introduce hierarchical poselets—a new representation for modeling the pose configuration of human bodies. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of human bodies (e.g., torso + left arm). In the extreme case, they can be the whole bodies. The hierarchical poselets are organized in a hierarchical way via a structured model. Human parsing can be achieved by inferring the optimal labeling of this hierarchical model. The pose information captured by this hierarchical model can also be used as a intermediate representation for other high-level tasks. We demonstrate it in action recognition from static images. Keywords: human parsing, action recognition, part-based models, hierarchical poselets, maxmargin structured learning
2 0.98922533 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
Author: Rahul Mazumder, Trevor Hastie
Abstract: We consider the sparse inverse covariance regularization problem or graphical lasso with regularization parameter λ. Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at λ is decomposed into connected components. We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph (at λ) is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem for the same λ. This characterizes a very interesting property of a path of graphical lasso solutions. Furthermore, this simple rule, when used as a wrapper around existing algorithms for the graphical lasso, leads to enormous performance gains. For a range of values of λ, our proposal splits a large graphical lasso problem into smaller tractable problems, making it possible to solve an otherwise infeasible large-scale problem. We illustrate the graceful scalability of our proposal via synthetic and real-life microarray examples. Keywords: sparse inverse covariance selection, sparsity, graphical lasso, Gaussian graphical models, graph connected components, concentration graph, large scale covariance estimation
same-paper 3 0.98190784 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
4 0.96298862 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
5 0.92247379 91 jmlr-2012-Plug-in Approach to Active Learning
Author: Stanislav Minsker
Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands
6 0.79350585 106 jmlr-2012-Sign Language Recognition using Sub-Units
7 0.77634013 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.77381617 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
9 0.75882262 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
10 0.74971724 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
11 0.74510008 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
12 0.7444182 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
13 0.74341035 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
14 0.74216169 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
15 0.73989546 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
16 0.73852706 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
17 0.73179364 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
18 0.73124659 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.72960466 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
20 0.72759956 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices