nips nips2008 nips2008-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 From Online to Batch Learning with Cutoff-Averaging. Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. [sent-1, score-1.131]
2 Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. [sent-2, score-1.58]
3 An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. [sent-3, score-0.523]
4 We expect a batch learning algorithm to generalize, in the sense that its output hypothesis should accurately predict the labels of previously unseen examples, which are sampled from the distribution. [sent-12, score-0.554]
5 An online learning algorithm receives a sequence of examples and processes these examples one-by-one. [sent-14, score-0.528]
6 The sequence of internal hypotheses constructed by the online algorithm from round to round plays a central role in this paper, and we refer to this sequence as the online hypothesis sequence. [sent-18, score-1.424]
7 As a result, we are sometimes tempted to use online learning algorithms as if they were batch learning algorithms. [sent-21, score-0.465]
8 A common way to do this is to present training examples one-by-one to the online algorithm, and use the last hypothesis constructed by the algorithm as the output hypothesis. [sent-22, score-0.824]
9 We call this technique the last-hypothesis online-to-batch conversion technique. [sent-23, score-0.425]
10 The appeal of this technique is that it maintains the computational efficiency of the original online algorithm. [sent-24, score-0.502]
11 However, this heuristic technique generally comes with no theoretical guarantees, and the online algorithm’s inherent disregard for out-of-sample performance makes it a risky practice. [sent-25, score-0.522]
12 1 In addition to the last-hypothesis heuristic, various principled techniques for converting online algorithms into batch algorithms have been proposed. [sent-26, score-0.502]
13 Each of these techniques essentially wraps the online learning algorithm with an additional layer of instructions that endow it with the ability to generalize. [sent-27, score-0.359]
14 One approach is to use the online algorithm to create the online hypothesis sequence, and then to choose a single good hypothesis from this sequence. [sent-28, score-1.356]
15 For instance, the longest survivor technique [8] (originally called the pocket algorithm) chooses the hypothesis that survives the longest number of consecutive online rounds before it is replaced. [sent-29, score-1.112]
16 The validation technique [12] uses a validation set to evaluate each online hypothesis and chooses the hypothesis with the best empirical performance. [sent-30, score-1.274]
17 Another common online-to-batch conversion approach, which we call the ensemble approach, uses the online algorithm to construct the online hypothesis sequence, and combines the hypotheses in the sequence by taking a majority [7] or by averaging [2, Sec. [sent-35, score-1.83]
18 When using linear hypotheses, averaging can be done on-the-fly, while the online algorithm is constructing the online hypothesis sequence. [sent-38, score-1.258]
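For linear hypotheses, this on-the-fly averaging amounts to maintaining a running mean of the weight vectors as the rounds proceed, so no hypothesis ever needs to be stored explicitly. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
# Running average of linear online hypotheses, updated on the fly.
# After folding in t hypotheses, `avg` equals their coordinate-wise mean.

def update_running_average(avg, w, t):
    """Fold the round-t hypothesis w (a list of floats) into the mean of
    the first t hypotheses; t is 1-based."""
    return [a + (wj - a) / t for a, wj in zip(avg, w)]

# Toy usage: average three 2-dimensional hypotheses one at a time.
hyps = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
avg = [0.0, 0.0]
for t, w in enumerate(hyps, start=1):
    avg = update_running_average(avg, w, t)
print(avg)  # coordinate-wise mean of the three vectors: [1.0, 1.0]
```

The incremental form `a + (w - a)/t` is the standard streaming-mean update, which is what makes the averaging conversion as memory-cheap as the online algorithm itself.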
19 Moreover, since we do not truly know the quality of each online hypothesis, building an ensemble allows us to hedge our bets, rather than committing to a single online hypothesis. [sent-41, score-0.749]
20 Sometimes the ensemble approach outperforms the single hypothesis approach, while other times we see the opposite behavior (see Sec. [sent-42, score-0.42]
21 Ideally, we would like a conversion technique that enjoys the best of both worlds: when a single good online hypothesis can be clearly identified, it should be chosen as the output hypothesis, but when a good hypothesis cannot be identified, we should play it safe and construct an ensemble. [sent-44, score-1.457]
22 A first step in this direction was taken in [10, 5], where the conversion technique selectively chooses which subset of online hypotheses to include in the ensemble. [sent-45, score-0.92]
23 For example, the suffix averaging conversion [5] sets the output hypothesis to be the average over a suffix of the online hypothesis sequence, where the suffix length is determined by minimizing a theoretical upper-bound on the generalization ability of the resulting hypothesis. [sent-46, score-1.541]
24 One extreme of this approach is to include the entire online hypothesis sequence in the ensemble. [sent-47, score-0.825]
25 Specifically, the suffix averaging technique only chooses the suffix length after the entire hypothesis sequence has been constructed. [sent-51, score-0.886]
26 This is in sharp contrast to the last-hypothesis heuristic, which uses no memory aside from the memory used by the online algorithm itself. [sent-53, score-0.448]
27 When the training set is massive, storing the entire online hypothesis sequence in memory is impossible. [sent-54, score-0.854]
28 In this paper, we present and analyze a new conversion technique called cutoff averaging. [sent-55, score-0.826]
29 Like suffix averaging, it attempts to enjoy the best of the single hypothesis approach and of the ensemble approach. [sent-56, score-0.42]
30 One extreme of our technique reduces to the simple averaging conversion technique, while the other extreme reduces to the longest-survivor conversion technique. [sent-57, score-1.0]
31 The advantage of our technique is that much of it can be performed on-the-fly, as the online algorithm processes the data. [sent-59, score-0.53]
32 The memory required by cutoff averaging scales with the square root of the number of training examples in the worst case, and is far less in the typical case. [sent-60, score-0.761]
33 In Sec. 3 we present the cutoff averaging technique and provide a statistical generalization analysis for it. [sent-65, score-0.827]
34 The risk of a hypothesis h, denoted by ℓ(h; D), is defined as the expected loss incurred by h over examples sampled from D. [sent-78, score-0.582]
35 The goal of a batch learning algorithm for the hypothesis class H and for the loss function ℓ is to find a hypothesis h⋆ ∈ H whose risk is as close as possible to inf h∈H ℓ(h; D). [sent-81, score-1.025]
36 In online learning, the labeled examples take the form of a sequence S = ((x_i, y_i))_{i=1}^{m}. [sent-82, score-0.505]
37 The online algorithm observes the examples in the sequence one-by-one, and incrementally constructs the sequence of online hypotheses (h_i)_{i=0}^{m}, where each h_i ∈ H. [sent-84, score-1.275]
38 Finally, the algorithm uses the new example (xt , yt ) to construct the next hypothesis ht . [sent-89, score-0.445]
39 The update rule used to construct ht is the main component of the online learning algorithm. [sent-90, score-0.37]
40 Since S is not necessarily generated by any distribution D, we cannot define the risk of an online hypothesis. [sent-92, score-0.442]
41 Instead, the performance of an online algorithm is measured using the game-theoretic notion of regret. [sent-93, score-0.359]
42 The regret of an online algorithm is defined as (1/m) Σ_{i=1}^{m} ℓ(h_{i−1}; (x_i, y_i)) − min_{ĥ∈H} (1/m) Σ_{i=1}^{m} ℓ(ĥ; (x_i, y_i)). (1) [sent-94, score-0.555]
43 In words, regret measures how much better the algorithm could have done by using the best fixed hypothesis in H on all m rounds. [sent-95, score-0.425]
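Given the recorded per-round losses, this regret can be computed directly; a sketch under the assumption that we compare against a finite set of candidate hypotheses rather than all of H (names are illustrative):

```python
def empirical_regret(online_losses, candidate_losses):
    """Regret as in Eq. (1): average online loss minus the average loss
    of the best fixed candidate hypothesis.

    online_losses: loss of h_{i-1} on example i, for i = 1..m.
    candidate_losses: for each fixed candidate, its per-example losses.
    """
    m = len(online_losses)
    online_avg = sum(online_losses) / m
    best_fixed = min(sum(c) / m for c in candidate_losses)
    return online_avg - best_fixed

# Toy example: the online algorithm errs twice out of four rounds,
# while the best fixed candidate errs once.
r = empirical_regret([1, 0, 1, 0], [[1, 1, 0, 1], [0, 0, 1, 0]])
print(r)  # 0.5 - 0.25 = 0.25
```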
44 The goal of an online learning algorithm is to minimize regret. [sent-96, score-0.359]
45 [Figure 1 pseudocode, left column (classic Perceptron): for i = 1, . . . , m: receive x_i, predict sign⟨w_{i−1}, x_i⟩, receive y_i ∈ {−1, +1}; if sign⟨w_{i−1}, x_i⟩ ≠ y_i then w_i ← w_{i−1} + y_i x_i. [sent-116, score-0.483]
46 Right column (margin-based Perceptron): if ℓ(w_{i−1}; (x_i, y_i)) > 0 then take the scaled step w′_{i−1} ← w_{i−1} + y_i x_i/(√m R) and renormalize, w_i ← w′_{i−1}/∥w′_{i−1}∥₂.] Figure 1: Two versions of the Perceptron algorithm. [sent-124, score-0.656]
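The classic (conservative) Perceptron update can be sketched as follows; it updates only on prediction mistakes, so each distinct hypothesis persists until the next mistake. Here the tie-break sign(0) = +1 is an implementation choice, not specified by the paper:

```python
def perceptron(examples, dim):
    """Run the classic Perceptron and return the online hypothesis
    sequence w_0, w_1, ..., w_m (one entry per round, plus the start)."""
    w = [0.0] * dim
    hyps = [list(w)]
    for x, y in examples:  # labels y in {-1, +1}
        margin = sum(wj * xj for wj, xj in zip(w, x))
        pred = 1 if margin >= 0 else -1
        if pred != y:  # conservative: update only on a mistake
            w = [wj + y * xj for wj, xj in zip(w, x)]
        hyps.append(list(w))
    return hyps

# Toy usage on a short stream; the returned list is exactly the
# "online hypothesis sequence" discussed in the text.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
hyps = perceptron(data, dim=2)
print(hyps[-1])
```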
47 The last hypothesis constructed by the margin-based Perceptron is typically better than any average. [sent-128, score-0.379]
48 Ideally, we would like a conversion technique that performs well in both cases. [sent-130, score-0.425]
49 From a theoretical standpoint, the purpose of an online-to-batch conversion technique is to turn an online learning algorithm with a regret bound into a batch learning algorithm with a risk bound. [sent-131, score-1.177]
50 Let (h_i)_{i=0}^{m} be the online hypothesis sequence generated by the margin-based Perceptron (see Fig. [sent-136, score-0.739]
51 3 Cutoff Averaging We now present the cutoff averaging conversion technique. [sent-141, score-0.89]
52 This technique can be applied to any conservative online learning algorithm that uses a convex hypothesis class H. [sent-142, score-0.913]
53 A conservative algorithm is one that modifies its online hypotheses only on rounds where a positive loss is suffered. [sent-143, score-0.692]
54 On rounds where no loss is suffered, the algorithm keeps its current hypothesis, and we say that the hypothesis survived the round. [sent-144, score-0.551]
55 The survival time of each distinct online hypothesis is the number of consecutive rounds it survives before the algorithm suffers a loss and replaces it with a new hypothesis. [sent-145, score-1.077]
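Given the per-round loss indicators of a conservative algorithm, survival times can be read off directly; a minimal sketch with illustrative names:

```python
def survival_times(mistakes):
    """mistakes[i] is True iff the algorithm suffered a loss on round i+1
    (and hence replaced its hypothesis).  Returns, for each distinct
    hypothesis in order, the number of rounds it survived."""
    times, run = [], 0
    for m in mistakes:
        if m:
            times.append(run)  # current hypothesis is replaced after `run` survivals
            run = 0
        else:
            run += 1
    times.append(run)  # the final hypothesis survives the remaining rounds
    return times

# Toy usage: mistakes on rounds 3 and 5 of an 8-round stream yield
# three distinct hypotheses with survival times 2, 1, and 3.
print(survival_times([False, False, True, False, True, False, False, False]))
```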
56 training set, and obtaining the online hypothesis sequence (h_i)_{i=0}^{m−1}. [sent-150, score-0.777]
57 Let k be an arbitrary non-negative integer, which we call the cutoff parameter. [sent-151, score-0.401]
58 The cutoff averaging technique defines the output hypothesis h⋆ as a weighted average over the hypotheses in Θ, where the weight of a hypothesis with survival time s is proportional to s − k. [sent-154, score-1.746]
59 Intuitively, each hypothesis must qualify for the ensemble, by suffering no loss for k consecutive rounds. [sent-155, score-0.444]
60 The cutoff parameter k sets the bar for acceptance into the ensemble. [sent-156, score-0.401]
61 Once a hypothesis is included in the ensemble, its weight is determined by the number of additional rounds it perseveres after qualifying. [sent-157, score-0.416]
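For linear hypotheses, the cutoff-averaged output can be sketched as follows; the hypothesis list and survival times are assumed given, and all names are illustrative:

```python
def cutoff_average(hyps, survivals, k):
    """Weighted average over distinct linear hypotheses: a hypothesis with
    survival time s gets weight max(s - k, 0), so only hypotheses that
    survive more than k consecutive rounds enter the ensemble."""
    weights = [max(s - k, 0) for s in survivals]
    total = sum(weights)
    if total == 0:
        raise ValueError("no hypothesis qualified; decrease k")
    dim = len(hyps[0])
    out = [0.0] * dim
    for w, h in zip(weights, hyps):
        for j in range(dim):
            out[j] += w * h[j] / total
    return out

# Toy usage: with k = 1 only the 2nd and 3rd hypotheses qualify.
hyps = [[1.0], [2.0], [4.0]]
res = cutoff_average(hyps, survivals=[1, 2, 3], k=1)
print(res)  # (1*2 + 2*4) / 3 = 10/3
```

With k = 0 the weight of each hypothesis equals its survival time, i.e. one unit per round it survived, which recovers the plain per-round average of the online sequence; as k grows, the weight concentrates on the longest survivors, matching the two extremes described in the text.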
62 We present a statistical analysis of the cutoff averaging technique. [sent-158, score-0.636]
63 Once this sequence is presented to the online algorithm, we obtain the online hypothesis sequence (Hi )m , which is a sequence of random functions. [sent-161, score-1.22]
64 ℓ(H_i; D) = E[ℓ(H_i; (X_{i+1}, Y_{i+1})) | (X_1, Y_1), . . . , (X_i, Y_i)]. (2) In words, the risk of the random function H_i equals the conditional expectation of the online loss suffered on round i + 1, conditioned on the random examples 1 through i. [sent-165, score-0.694]
65 This simple observation relates statistical risk with online loss, and is the key to converting regret bounds into risk bounds. [sent-166, score-0.682]
66 Now define the output hypothesis H⋆_k = (Σ_{i=0}^{m−1} B_i)^{−1} Σ_{i=0}^{m−1} B_i H_i. [sent-171, score-0.368]
67 Also note that setting k = 0 results in Bi = 1 for all i, and would reduce our conversion technique to the standard averaging conversion technique. [sent-174, score-0.914]
68 At the other extreme, as k increases, our technique approaches the longest survivor conversion technique. [sent-175, score-0.495]
69 ⋆ The following theorem bounds the risk of Hk using the online loss suffered on rounds where Bi = 1. [sent-176, score-0.694]
70 If the loss function is not convex, the theorem does not hold, but note that we can still bound the average risk of the hypotheses in the ensemble. [sent-180, score-0.388]
71 An online algorithm is given m ≥ 4 independent samples from D and constructs the online hypothesis sequence (H_i)_{i=0}^{m}. [sent-183, score-1.124]
72 We can now complete the definition of the cutoff averaging technique. [sent-211, score-0.636]
73 The cutoff averaging technique sets the output hypothesis H⋆ to be the hypothesis in {H⋆_0, . [sent-218, score-1.508]
74 If a small number of online hypotheses stand out with significantly long survival times, then our technique will favor a large k and a sparse ensemble. [sent-224, score-0.76]
75 On the other hand, if most of the online hypotheses have medium/short survival times, then our technique will favor small values of k and a dense ensemble. [sent-225, score-0.76]
76 If the online algorithm being converted has a regret bound, then the data dependent risk bound given by Thm. [sent-228, score-0.61]
77 2 is applied with k = 0, L̄ simply becomes the average loss suffered by the online algorithm over the entire training set, and Σ_{i=0}^{m−1} B_i = m. [sent-234, score-0.581]
78 In particular, we can choose ĥ to be the hypothesis with the smallest risk in H, namely ĥ = arg min_{h∈H} ℓ(h; D). [sent-237, score-0.53]
79 As mentioned in the introduction, our approach is similar to the suffix averaging conversion technique of [5], which also interpolates between an ensemble approach and a single hypothesis approach. [sent-241, score-1.08]
80 In contrast, cutoff averaging requires only O(√m) space. [sent-243, score-0.636]
81 Our technique cannot choose the optimal value of k before the entire dataset has been processed, but nevertheless, it does not need to store the entire hypothesis sequence. [sent-244, score-0.59]
82 Instead, it can group the online hypotheses based on their survival times, and stores only the average hypothesis in each group and the total loss in each group. [sent-245, score-0.988]
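This space-saving idea can be sketched as follows: hypotheses are grouped by survival time, and only a running (count, mean hypothesis) pair is kept per group, so at most one entry per distinct survival time is stored — at most O(√m) entries, since distinct survival times sum to at most m. The sketch omits the per-group loss bookkeeping used to pick k, and all names are illustrative:

```python
def add_to_groups(groups, hyp, survival):
    """groups maps survival time s -> (count, mean hypothesis).  Distinct
    survival times s_1 < s_2 < ... sum to at most m, so there can be at
    most O(sqrt(m)) keys."""
    count, mean = groups.get(survival, (0, [0.0] * len(hyp)))
    count += 1
    mean = [mj + (hj - mj) / count for mj, hj in zip(mean, hyp)]
    groups[survival] = (count, mean)

def cutoff_average_from_groups(groups, k):
    """Recover the cutoff-averaged hypothesis for a given k from the
    groups alone (assumes at least one group has survival time > k)."""
    total, out = 0.0, None
    for s, (count, mean) in groups.items():
        w = max(s - k, 0) * count
        if w == 0:
            continue
        total += w
        if out is None:
            out = [0.0] * len(mean)
        out = [oj + w * mj for oj, mj in zip(out, mean)]
    return [oj / total for oj in out]

# Toy usage: three 1-dimensional hypotheses with survival times 1, 2, 3.
groups = {}
for hyp, s in [([1.0], 1), ([2.0], 2), ([4.0], 3)]:
    add_to_groups(groups, hyp, s)
print(cutoff_average_from_groups(groups, k=1))  # (1*2 + 2*4) / 3 = 10/3
```

Because the groups are maintained on the fly, the optimal k can be evaluated against every stored group after the stream ends without ever holding the full hypothesis sequence in memory.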
83 By the time the entire dataset is processed, most of the work has already been done and calculating the optimal k and the output hypothesis is straightforward. [sent-246, score-0.411]
84 Since our conversion is mostly calculated on-the-fly, in parallel with the online rounds, we can easily construct intermediate output hypotheses, before the online algorithm has a chance to process the entire dataset. [sent-251, score-1.022]
85 Figure 2: Test error (zero-one loss) of last-hypothesis and cutoff averaging, each applied to the standard Perceptron, on ten binary classification problems from RCV1. [sent-315, score-0.447]
86 We applied the cutoff averaging technique to the classic Perceptron and to the margin-based Perceptron. [sent-329, score-0.872]
87 We paused training at regular intervals, computed the output hypothesis so far, and calculated its test loss. [sent-332, score-0.406]
88 Clearly, the test loss of the last hypothesis is very unstable, even after averaging over 10 repetitions. [sent-337, score-0.68]
89 If we decide to use the last hypothesis technique, our training set size could happen to be such that we end up with a bad output hypothesis. [sent-339, score-0.432]
90 On the other hand, the cutoff averaging hypothesis is accurate, stable and consistent. [sent-340, score-0.969]
91 The performance of the simple averaging conversion technique is not plotted in Fig. [sent-341, score-0.66]
92 2, but we note that it was only slightly worse than the performance of cutoff averaging. [sent-342, score-0.401]
93 When using the classic Perceptron, any form of averaging is beneficial, and our technique successfully identifies this. [sent-343, score-0.471]
94 Fig. 3 shows the test hinge loss of cutoff averaging, last-hypothesis, and simple averaging, when applied to the margin-based Perceptron. [sent-345, score-0.521]
95 Figure 3: Test hinge-loss of last-hypothesis, averaging, and cutoff averaging, each applied to the finite-horizon margin-based Perceptron, on ten binary classification problems from RCV1. [sent-406, score-0.447]
96 and the simple averaging conversion technique is significantly inferior for all training set sizes. [sent-408, score-0.698]
97 1% of the data), the cutoff averaging technique catches up to the last hypothesis and performs comparably well from then on. [sent-410, score-1.166]
98 Once the tail bounds become tight enough, our technique essentially identifies that there is no benefit in constructing a diverse ensemble, and assigns all of the weight to a short suffix of the online hypothesis sequence. [sent-413, score-0.92]
99 However, if we are after a generic solution that performs well in both cases, we need a conversion technique that automatically balances the tradeoff between these two extremes. [sent-416, score-0.477]
100 Suffix averaging [5] and cutoff averaging are two such techniques, with cutoff averaging having a significant computational advantage. [sent-417, score-1.507]
wordName wordTfidf (topN-words)
[('cutoff', 0.401), ('hypothesis', 0.333), ('online', 0.331), ('conversion', 0.254), ('perceptron', 0.242), ('averaging', 0.235), ('hi', 0.215), ('technique', 0.171), ('ccat', 0.165), ('ecat', 0.165), ('gcat', 0.165), ('mcat', 0.165), ('bi', 0.135), ('hypotheses', 0.135), ('batch', 0.134), ('risk', 0.111), ('survival', 0.103), ('ensemble', 0.087), ('loss', 0.086), ('rounds', 0.083), ('sequence', 0.075), ('wi', 0.072), ('yi', 0.066), ('classic', 0.065), ('regret', 0.064), ('li', 0.06), ('ln', 0.059), ('round', 0.058), ('tail', 0.057), ('xi', 0.057), ('bound', 0.056), ('articles', 0.055), ('suffered', 0.055), ('hk', 0.055), ('corpus', 0.048), ('yj', 0.047), ('article', 0.044), ('entire', 0.043), ('extreme', 0.043), ('pre', 0.043), ('ui', 0.042), ('ht', 0.039), ('training', 0.038), ('converting', 0.037), ('erceptron', 0.037), ('survives', 0.037), ('survivor', 0.037), ('lt', 0.036), ('output', 0.035), ('hinge', 0.034), ('memory', 0.034), ('longest', 0.033), ('examples', 0.033), ('minh', 0.032), ('conservative', 0.029), ('reuters', 0.029), ('separators', 0.029), ('chooses', 0.029), ('receive', 0.029), ('validation', 0.028), ('holds', 0.028), ('bounds', 0.028), ('receives', 0.028), ('sign', 0.028), ('algorithm', 0.028), ('label', 0.027), ('xj', 0.027), ('automatically', 0.026), ('constructs', 0.026), ('suffers', 0.026), ('dekel', 0.026), ('observes', 0.026), ('balances', 0.026), ('last', 0.026), ('ten', 0.026), ('suf', 0.026), ('ciency', 0.025), ('consecutive', 0.025), ('distinct', 0.025), ('tokens', 0.025), ('labels', 0.024), ('yt', 0.024), ('lemma', 0.024), ('xt', 0.023), ('deterministically', 0.023), ('namely', 0.021), ('keeps', 0.021), ('uses', 0.021), ('news', 0.021), ('uctuations', 0.021), ('preserves', 0.021), ('heuristic', 0.02), ('processed', 0.02), ('converted', 0.02), ('binary', 0.02), ('favor', 0.02), ('equals', 0.02), ('generalization', 0.02), ('typically', 0.02), ('incurred', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
2 0.18109429 65 nips-2008-Domain Adaptation with Multiple Sources
Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.
3 0.16669987 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
4 0.15380456 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
5 0.15014872 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
6 0.11524442 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
7 0.11511074 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
8 0.10938454 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
9 0.10931841 168 nips-2008-Online Metric Learning and Fast Similarity Search
10 0.10739115 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
11 0.092689157 178 nips-2008-Performance analysis for L\ 2 kernel classification
12 0.089312255 15 nips-2008-Adaptive Martingale Boosting
13 0.0791751 169 nips-2008-Online Models for Content Optimization
14 0.076314412 111 nips-2008-Kernel Change-point Analysis
15 0.074056037 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.07294438 171 nips-2008-Online Prediction on Large Diameter Graphs
17 0.067460425 239 nips-2008-Tighter Bounds for Structured Estimation
18 0.067104734 228 nips-2008-Support Vector Machines with a Reject Option
19 0.066383995 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
20 0.064753316 170 nips-2008-Online Optimization in X-Armed Bandits
topicId topicWeight
[(0, -0.2), (1, 0.011), (2, -0.174), (3, -0.002), (4, -0.162), (5, 0.072), (6, -0.021), (7, -0.015), (8, 0.044), (9, -0.025), (10, -0.007), (11, 0.224), (12, 0.025), (13, 0.037), (14, 0.105), (15, -0.08), (16, -0.018), (17, 0.016), (18, -0.092), (19, 0.072), (20, 0.018), (21, -0.066), (22, 0.041), (23, 0.123), (24, 0.013), (25, -0.015), (26, 0.062), (27, -0.091), (28, 0.113), (29, 0.051), (30, -0.025), (31, 0.06), (32, 0.11), (33, 0.023), (34, 0.007), (35, -0.071), (36, 0.068), (37, 0.044), (38, 0.004), (39, -0.076), (40, -0.115), (41, 0.055), (42, 0.086), (43, 0.113), (44, -0.036), (45, -0.085), (46, -0.044), (47, -0.032), (48, -0.017), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96862763 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
2 0.69580728 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
Author: Matthew Streeter, Daniel Golovin
Abstract: We present an algorithm for solving a broad class of online resource allocation problems. Our online algorithm can be applied in environments where abstract jobs arrive one at a time, and one can complete the jobs by investing time in a number of abstract activities, according to some schedule. We assume that the fraction of jobs completed by a schedule is a monotone, submodular function of a set of pairs (v, τ ), where τ is the time invested in activity v. Under this assumption, our online algorithm performs near-optimally according to two natural metrics: (i) the fraction of jobs completed within time T , for some fixed deadline T > 0, and (ii) the average time required to complete each job. We evaluate our algorithm experimentally by using it to learn, online, a schedule for allocating CPU time among solvers entered in the 2007 SAT solver competition. 1
3 0.60126448 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
4 0.60015017 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
5 0.59835184 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
6 0.58958501 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
7 0.55158949 178 nips-2008-Performance analysis for L\ 2 kernel classification
8 0.54314035 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
9 0.53218991 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
10 0.48346639 169 nips-2008-Online Models for Content Optimization
11 0.46753973 65 nips-2008-Domain Adaptation with Multiple Sources
12 0.46521357 196 nips-2008-Relative Margin Machines
13 0.44694802 211 nips-2008-Simple Local Models for Complex Dynamical Systems
14 0.44689488 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
15 0.4353883 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.42602396 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
17 0.41477358 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
18 0.41080213 185 nips-2008-Privacy-preserving logistic regression
19 0.37316293 111 nips-2008-Kernel Change-point Analysis
20 0.36315182 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
topicId topicWeight
[(6, 0.162), (7, 0.044), (12, 0.068), (28, 0.167), (38, 0.211), (57, 0.041), (59, 0.017), (63, 0.058), (71, 0.029), (77, 0.048), (83, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.81993037 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results.
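The conversions this abstract contrasts itself with include the simplest baseline: run a conservative online learner (one that updates only on mistakes, e.g. the Perceptron) and output the average of its online hypothesis sequence. A minimal sketch of that baseline (not the cutoff-averaging rule itself; the function name is illustrative):

```python
def perceptron_with_average(examples, dim):
    """Run a conservative Perceptron over (x, y) pairs with y in {-1, +1}
    and return both the final weight vector and the average of the online
    hypothesis sequence.  Plain averaging is the classic online-to-batch
    baseline that cutoff averaging generalizes; this sketch shows only
    the baseline.
    """
    w = [0.0] * dim
    running_sum = [0.0] * dim
    for x, y in examples:
        if y * sum(w_i * x_i for w_i, x_i in zip(w, x)) <= 0:
            # Conservative: the hypothesis changes only on a mistake.
            w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
        running_sum = [s + w_i for s, w_i in zip(running_sum, w)]
    n = len(examples)
    return w, [s / n for s in running_sum]
```

The averaged hypothesis is typically more stable than the final one, at the cost of weighting early, inaccurate hypotheses equally with later ones.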
2 0.81757432 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.
3 0.79612011 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM robot arm.
4 0.78711808 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods.
5 0.75386882 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.
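The gradient descent algorithm this abstract refers to uses the step-size schedule η_t = 1/(λt), which gives O(log T) regret on λ-strongly-convex losses. A minimal one-dimensional, unconstrained sketch of that schedule (function and argument names are illustrative):

```python
def ogd_strongly_convex(subgradients, lam):
    """Online gradient descent with eta_t = 1 / (lam * t), the step-size
    schedule that yields logarithmic regret for lam-strongly-convex
    losses.  `subgradients` is a sequence of callables mapping the
    current iterate w to a subgradient of the round-t loss at w.
    Illustrative unconstrained 1-D sketch.
    """
    w = 0.0
    iterates = [w]
    for t, grad in enumerate(subgradients, start=1):
        w -= grad(w) / (lam * t)
        iterates.append(w)
    return iterates
```

On the fixed loss f(w) = (w - 1)^2, which is 2-strongly convex, the first step already lands on the minimizer because η_1 = 1/λ exactly cancels the curvature.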
6 0.75377774 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
7 0.7368328 202 nips-2008-Robust Regression and Lasso
8 0.73594618 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
9 0.73506516 178 nips-2008-Performance analysis for L2 kernel classification
10 0.73354536 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.72907758 75 nips-2008-Estimating vector fields using sparse basis field expansions
12 0.72317797 85 nips-2008-Fast Rates for Regularized Objectives
13 0.72274709 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
14 0.72097087 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
15 0.72069281 62 nips-2008-Differentiable Sparse Coding
16 0.71785837 228 nips-2008-Support Vector Machines with a Reject Option
17 0.71152484 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
18 0.71077824 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
19 0.7102586 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
20 0.7088908 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias