nips nips2008 nips2008-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 From Online to Batch Learning with Cutoff-Averaging. Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. [sent-1, score-1.131]
2 Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. [sent-2, score-1.58]
3 An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. [sent-3, score-0.523]
4 We expect a batch learning algorithm to generalize, in the sense that its output hypothesis should accurately predict the labels of previously unseen examples, which are sampled from the distribution. [sent-12, score-0.554]
5 An online learning algorithm receives a sequence of examples and processes these examples one-by-one. [sent-14, score-0.528]
6 The sequence of internal hypotheses constructed by the online algorithm from round to round plays a central role in this paper, and we refer to this sequence as the online hypothesis sequence. [sent-18, score-1.424]
7 As a result, we are sometimes tempted to use online learning algorithms as if they were batch learning algorithms. [sent-21, score-0.465]
8 A common way to do this is to present training examples one-by-one to the online algorithm, and use the last hypothesis constructed by the algorithm as the output hypothesis. [sent-22, score-0.824]
9 We call this technique the last-hypothesis online-to-batch conversion technique. [sent-23, score-0.425]
10 The appeal of this technique is that it maintains the computational efficiency of the original online algorithm. [sent-24, score-0.502]
11 However, this heuristic technique generally comes with no theoretical guarantees, and the online algorithm’s inherent disregard for out-of-sample performance makes it a risky practice. [sent-25, score-0.522]
12 1 In addition to the last-hypothesis heuristic, various principled techniques for converting online algorithms into batch algorithms have been proposed. [sent-26, score-0.502]
13 Each of these techniques essentially wraps the online learning algorithm with an additional layer of instructions that endow it with the ability to generalize. [sent-27, score-0.359]
14 One approach is to use the online algorithm to create the online hypothesis sequence, and then to choose a single good hypothesis from this sequence. [sent-28, score-1.356]
15 For instance, the longest survivor technique [8] (originally called the pocket algorithm) chooses the hypothesis that survives the longest number of consecutive online rounds before it is replaced. [sent-29, score-1.112]
16 The validation technique [12] uses a validation set to evaluate each online hypothesis and chooses the hypothesis with the best empirical performance. [sent-30, score-1.274]
17 Another common online-to-batch conversion approach, which we call the ensemble approach, uses the online algorithm to construct the online hypothesis sequence, and combines the hypotheses in the sequence by taking a majority [7] or by averaging [2, Sec. [sent-35, score-1.83]
18 When using linear hypotheses, averaging can be done on-the-fly, while the online algorithm is constructing the online hypothesis sequence. [sent-38, score-1.258]
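For linear hypotheses, this on-the-fly averaging amounts to maintaining a running mean of the weight vectors as the rounds proceed, so no hypothesis ever needs to be stored explicitly. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
# Running average of linear online hypotheses, updated on the fly.
# After folding in t hypotheses, `avg` equals their coordinate-wise mean.

def update_running_average(avg, w, t):
    """Fold the round-t hypothesis w (a list of floats) into the mean of
    the first t hypotheses; t is 1-based."""
    return [a + (wj - a) / t for a, wj in zip(avg, w)]

# Toy usage: average three 2-dimensional hypotheses one at a time.
hyps = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
avg = [0.0, 0.0]
for t, w in enumerate(hyps, start=1):
    avg = update_running_average(avg, w, t)
print(avg)  # coordinate-wise mean of the three vectors: [1.0, 1.0]
```

The incremental form `a + (w - a)/t` is the standard streaming-mean update, which is what makes the averaging conversion as memory-cheap as the online algorithm itself.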
19 Moreover, since we do not truly know the quality of each online hypothesis, building an ensemble allows us to hedge our bets, rather than committing to a single online hypothesis. [sent-41, score-0.749]
20 Sometimes the ensemble approach outperforms the single hypothesis approach, while other times we see the opposite behavior (see Sec. [sent-42, score-0.42]
21 Ideally, we would like a conversion technique that enjoys the best of both worlds: when a single good online hypothesis can be clearly identified, it should be chosen as the output hypothesis, but when a good hypothesis cannot be identified, we should play it safe and construct an ensemble. [sent-44, score-1.457]
22 A first step in this direction was taken in [10, 5], where the conversion technique selectively chooses which subset of online hypotheses to include in the ensemble. [sent-45, score-0.92]
23 For example, the suffix averaging conversion [5] sets the output hypothesis to be the average over a suffix of the online hypothesis sequence, where the suffix length is determined by minimizing a theoretical upper-bound on the generalization ability of the resulting hypothesis. [sent-46, score-1.541]
24 One extreme of this approach is to include the entire online hypothesis sequence in the ensemble. [sent-47, score-0.825]
25 Specifically, the suffix averaging technique only chooses the suffix length after the entire hypothesis sequence has been constructed. [sent-51, score-0.886]
26 This is in sharp contrast to the last-hypothesis heuristic, which uses no memory aside from the memory used by the online algorithm itself. [sent-53, score-0.448]
27 When the training set is massive, storing the entire online hypothesis sequence in memory is impossible. [sent-54, score-0.854]
28 In this paper, we present and analyze a new conversion technique called cutoff averaging. [sent-55, score-0.826]
29 Like suffix averaging, it attempts to enjoy the best of the single hypothesis approach and of the ensemble approach. [sent-56, score-0.42]
30 One extreme of our technique reduces to the simple averaging conversion technique, while the other extreme reduces to the longest-survivor conversion technique. [sent-57, score-1.0]
31 The advantage of our technique is that much of it can be performed on-the-fly, as the online algorithm processes the data. [sent-59, score-0.53]
32 The memory required by cutoff averaging scales with the square root of the number of training examples in the worst case, and is far less in the typical case. [sent-60, score-0.761]
33 In Sec. 3 we present the cutoff averaging technique and provide a statistical generalization analysis for it. [sent-65, score-0.827]
34 The risk of a hypothesis h, denoted by ℓ(h; D), is defined as the expected loss incurred by h over examples sampled from D. [sent-78, score-0.582]
35 The goal of a batch learning algorithm for the hypothesis class H and for the loss function ℓ is to find a hypothesis h⋆ ∈ H whose risk is as close as possible to inf h∈H ℓ(h; D). [sent-81, score-1.025]
36 In online learning, the labeled examples take the form of a sequence S = ((x_i, y_i))_{i=1}^{m}. [sent-82, score-0.505]
37 The online algorithm observes the examples in the sequence one-by-one, and incrementally constructs the sequence of online hypotheses (h_i)_{i=0}^{m}, where each h_i ∈ H. [sent-84, score-1.275]
38 Finally, the algorithm uses the new example (xt , yt ) to construct the next hypothesis ht . [sent-89, score-0.445]
39 The update rule used to construct ht is the main component of the online learning algorithm. [sent-90, score-0.37]
40 Since S is not necessarily generated by any distribution D, we cannot define the risk of an online hypothesis. [sent-92, score-0.442]
41 Instead, the performance of an online algorithm is measured using the game-theoretic notion of regret. [sent-93, score-0.359]
42 The regret of an online algorithm is defined as (1/m) Σ_{i=1}^{m} ℓ(h_{i−1}; (x_i, y_i)) − min_{ĥ∈H} (1/m) Σ_{i=1}^{m} ℓ(ĥ; (x_i, y_i)). (1) [sent-94, score-0.555]
43 In words, regret measures how much better the algorithm could have done by using the best fixed hypothesis in H on all m rounds. [sent-95, score-0.425]
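Given the recorded per-round losses, this regret can be computed directly; a sketch under the assumption that we compare against a finite set of candidate hypotheses rather than all of H (names are illustrative):

```python
def empirical_regret(online_losses, candidate_losses):
    """Regret as in Eq. (1): average online loss minus the average loss
    of the best fixed candidate hypothesis.

    online_losses: loss of h_{i-1} on example i, for i = 1..m.
    candidate_losses: for each fixed candidate, its per-example losses.
    """
    m = len(online_losses)
    online_avg = sum(online_losses) / m
    best_fixed = min(sum(c) / m for c in candidate_losses)
    return online_avg - best_fixed

# Toy example: the online algorithm errs twice out of four rounds,
# while the best fixed candidate errs once.
r = empirical_regret([1, 0, 1, 0], [[1, 1, 0, 1], [0, 0, 1, 0]])
print(r)  # 0.5 - 0.25 = 0.25
```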
44 The goal of an online learning algorithm is to minimize regret. [sent-96, score-0.359]
45 [Figure 1 pseudocode, left column (classic Perceptron): for i = 1, . . . , m: receive x_i, predict sign⟨w_{i−1}, x_i⟩, receive y_i ∈ {−1, +1}; if sign⟨w_{i−1}, x_i⟩ ≠ y_i then w_i ← w_{i−1} + y_i x_i. [sent-116, score-0.483]
46 Right column (margin-based Perceptron): if ℓ(w_{i−1}; (x_i, y_i)) > 0 then take the scaled step w′_{i−1} ← w_{i−1} + y_i x_i/(√m R) and renormalize, w_i ← w′_{i−1}/∥w′_{i−1}∥₂.] Figure 1: Two versions of the Perceptron algorithm. [sent-124, score-0.656]
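The classic (conservative) Perceptron update can be sketched as follows; it updates only on prediction mistakes, so each distinct hypothesis persists until the next mistake. Here the tie-break sign(0) = +1 is an implementation choice, not specified by the paper:

```python
def perceptron(examples, dim):
    """Run the classic Perceptron and return the online hypothesis
    sequence w_0, w_1, ..., w_m (one entry per round, plus the start)."""
    w = [0.0] * dim
    hyps = [list(w)]
    for x, y in examples:  # labels y in {-1, +1}
        margin = sum(wj * xj for wj, xj in zip(w, x))
        pred = 1 if margin >= 0 else -1
        if pred != y:  # conservative: update only on a mistake
            w = [wj + y * xj for wj, xj in zip(w, x)]
        hyps.append(list(w))
    return hyps

# Toy usage on a short stream; the returned list is exactly the
# "online hypothesis sequence" discussed in the text.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
hyps = perceptron(data, dim=2)
print(hyps[-1])
```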
47 The last hypothesis constructed by the margin-based Perceptron is typically better than any average. [sent-128, score-0.379]
48 Ideally, we would like a conversion technique that performs well in both cases. [sent-130, score-0.425]
49 From a theoretical standpoint, the purpose of an online-to-batch conversion technique is to turn an online learning algorithm with a regret bound into a batch learning algorithm with a risk bound. [sent-131, score-1.177]
50 Let (h_i)_{i=0}^{m} be the online hypothesis sequence generated by the margin-based Perceptron (see Fig. [sent-136, score-0.739]
51 3 Cutoff Averaging We now present the cutoff averaging conversion technique. [sent-141, score-0.89]
52 This technique can be applied to any conservative online learning algorithm that uses a convex hypothesis class H. [sent-142, score-0.913]
53 A conservative algorithm is one that modifies its online hypotheses only on rounds where a positive loss is suffered. [sent-143, score-0.692]
54 On rounds where no loss is suffered, the algorithm keeps its current hypothesis, and we say that the hypothesis survived the round. [sent-144, score-0.551]
55 The survival time of each distinct online hypothesis is the number of consecutive rounds it survives before the algorithm suffers a loss and replaces it with a new hypothesis. [sent-145, score-1.077]
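Given the per-round loss indicators of a conservative algorithm, survival times can be read off directly; a minimal sketch with illustrative names:

```python
def survival_times(mistakes):
    """mistakes[i] is True iff the algorithm suffered a loss on round i+1
    (and hence replaced its hypothesis).  Returns, for each distinct
    hypothesis in order, the number of rounds it survived."""
    times, run = [], 0
    for m in mistakes:
        if m:
            times.append(run)  # current hypothesis is replaced after `run` survivals
            run = 0
        else:
            run += 1
    times.append(run)  # the final hypothesis survives the remaining rounds
    return times

# Toy usage: mistakes on rounds 3 and 5 of an 8-round stream yield
# three distinct hypotheses with survival times 2, 1, and 3.
print(survival_times([False, False, True, False, True, False, False, False]))
```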
56 training set, and obtaining the online hypothesis sequence (h_i)_{i=0}^{m−1}. [sent-150, score-0.777]
57 Let k be an arbitrary non-negative integer, which we call the cutoff parameter. [sent-151, score-0.401]
58 The cutoff averaging technique defines the output hypothesis h⋆ as a weighted average over the hypotheses in Θ, where the weight of a hypothesis with survival time s is proportional to s − k. [sent-154, score-1.746]
59 Intuitively, each hypothesis must qualify for the ensemble, by suffering no loss for k consecutive rounds. [sent-155, score-0.444]
60 The cutoff parameter k sets the bar for acceptance into the ensemble. [sent-156, score-0.401]
61 Once a hypothesis is included in the ensemble, its weight is determined by the number of additional rounds it perseveres after qualifying. [sent-157, score-0.416]
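For linear hypotheses, the cutoff-averaged output can be sketched as follows; the hypothesis list and survival times are assumed given, and all names are illustrative:

```python
def cutoff_average(hyps, survivals, k):
    """Weighted average over distinct linear hypotheses: a hypothesis with
    survival time s gets weight max(s - k, 0), so only hypotheses that
    survive more than k consecutive rounds enter the ensemble."""
    weights = [max(s - k, 0) for s in survivals]
    total = sum(weights)
    if total == 0:
        raise ValueError("no hypothesis qualified; decrease k")
    dim = len(hyps[0])
    out = [0.0] * dim
    for w, h in zip(weights, hyps):
        for j in range(dim):
            out[j] += w * h[j] / total
    return out

# Toy usage: with k = 1 only the 2nd and 3rd hypotheses qualify.
hyps = [[1.0], [2.0], [4.0]]
res = cutoff_average(hyps, survivals=[1, 2, 3], k=1)
print(res)  # (1*2 + 2*4) / 3 = 10/3
```

With k = 0 the weight of each hypothesis equals its survival time, i.e. one unit per round it survived, which recovers the plain per-round average of the online sequence; as k grows, the weight concentrates on the longest survivors, matching the two extremes described in the text.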
62 We present a statistical analysis of the cutoff averaging technique. [sent-158, score-0.636]
63 Once this sequence is presented to the online algorithm, we obtain the online hypothesis sequence (Hi )m , which is a sequence of random functions. [sent-161, score-1.22]
64 ℓ(H_i; D) = E[ℓ(H_i; (X_{i+1}, Y_{i+1})) | (X_1, Y_1), . . . , (X_i, Y_i)]. (2) In words, the risk of the random function H_i equals the conditional expectation of the online loss suffered on round i + 1, conditioned on the random examples 1 through i. [sent-165, score-0.694]
65 This simple observation relates statistical risk with online loss, and is the key to converting regret bounds into risk bounds. [sent-166, score-0.682]
66 Now define the output hypothesis H⋆_k = (Σ_{i=0}^{m−1} B_i)^{−1} Σ_{i=0}^{m−1} B_i H_i. [sent-171, score-0.368]
67 Also note that setting k = 0 results in Bi = 1 for all i, and would reduce our conversion technique to the standard averaging conversion technique. [sent-174, score-0.914]
68 At the other extreme, as k increases, our technique approaches the longest survivor conversion technique. [sent-175, score-0.495]
69 ⋆ The following theorem bounds the risk of Hk using the online loss suffered on rounds where Bi = 1. [sent-176, score-0.694]
70 If the loss function is not convex, the theorem does not hold, but note that we can still bound the average risk of the hypotheses in the ensemble. [sent-180, score-0.388]
71 An online algorithm is given m ≥ 4 independent samples from D and constructs the online hypothesis sequence (H_i)_{i=0}^{m}. [sent-183, score-1.124]
72 We can now complete the definition of the cutoff averaging technique. [sent-211, score-0.636]
73 The cutoff averaging technique sets the output hypothesis H⋆ to be the hypothesis in {H⋆_0, . [sent-218, score-1.508]
74 If a small number of online hypotheses stand out with significantly long survival times, then our technique will favor a large k and a sparse ensemble. [sent-224, score-0.76]
75 On the other hand, if most of the online hypotheses have medium/short survival times, then our technique will favor small values of k and a dense ensemble. [sent-225, score-0.76]
76 If the online algorithm being converted has a regret bound, then the data dependent risk bound given by Thm. [sent-228, score-0.61]
77 2 is applied with k = 0, L̄ simply becomes the average loss suffered by the online algorithm over the entire training set, and Σ_{i=0}^{m−1} B_i = m. [sent-234, score-0.581]
78 In particular, we can choose ĥ to be the hypothesis with the smallest risk in H, namely ĥ = arg min_{h∈H} ℓ(h; D). [sent-237, score-0.53]
79 As mentioned in the introduction, our approach is similar to the suffix averaging conversion technique of [5], which also interpolates between an ensemble approach and a single hypothesis approach. [sent-241, score-1.08]
80 In contrast, cutoff averaging requires only O(√m) space. [sent-243, score-0.636]
81 Our technique cannot choose the optimal value of k before the entire dataset has been processed, but nevertheless, it does not need to store the entire hypothesis sequence. [sent-244, score-0.59]
82 Instead, it can group the online hypotheses based on their survival times, and stores only the average hypothesis in each group and the total loss in each group. [sent-245, score-0.988]
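This space-saving idea can be sketched as follows: hypotheses are grouped by survival time, and only a running (count, mean hypothesis) pair is kept per group, so at most one entry per distinct survival time is stored — at most O(√m) entries, since distinct survival times sum to at most m. The sketch omits the per-group loss bookkeeping used to pick k, and all names are illustrative:

```python
def add_to_groups(groups, hyp, survival):
    """groups maps survival time s -> (count, mean hypothesis).  Distinct
    survival times s_1 < s_2 < ... sum to at most m, so there can be at
    most O(sqrt(m)) keys."""
    count, mean = groups.get(survival, (0, [0.0] * len(hyp)))
    count += 1
    mean = [mj + (hj - mj) / count for mj, hj in zip(mean, hyp)]
    groups[survival] = (count, mean)

def cutoff_average_from_groups(groups, k):
    """Recover the cutoff-averaged hypothesis for a given k from the
    groups alone (assumes at least one group has survival time > k)."""
    total, out = 0.0, None
    for s, (count, mean) in groups.items():
        w = max(s - k, 0) * count
        if w == 0:
            continue
        total += w
        if out is None:
            out = [0.0] * len(mean)
        out = [oj + w * mj for oj, mj in zip(out, mean)]
    return [oj / total for oj in out]

# Toy usage: three 1-dimensional hypotheses with survival times 1, 2, 3.
groups = {}
for hyp, s in [([1.0], 1), ([2.0], 2), ([4.0], 3)]:
    add_to_groups(groups, hyp, s)
print(cutoff_average_from_groups(groups, k=1))  # (1*2 + 2*4) / 3 = 10/3
```

Because the groups are maintained on the fly, the optimal k can be evaluated against every stored group after the stream ends without ever holding the full hypothesis sequence in memory.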
83 By the time the entire dataset is processed, most of the work has already been done and calculating the optimal k and the output hypothesis is straightforward. [sent-246, score-0.411]
84 Since our conversion is mostly calculated on-the-fly, in parallel with the online rounds, we can easily construct intermediate output hypotheses, before the online algorithm has a chance to process the entire dataset. [sent-251, score-1.022]
85 Figure 2: Test error (zero-one loss) of last-hypothesis and cutoff averaging, each applied to the standard Perceptron, on ten binary classification problems from RCV1. [sent-315, score-0.447]
86 We applied the cutoff averaging technique to the classic Perceptron and to the margin-based Perceptron. [sent-329, score-0.872]
87 We paused training at regular intervals, computed the output hypothesis so far, and calculated its test loss. [sent-332, score-0.406]
88 Clearly, the test loss of the last hypothesis is very unstable, even after averaging over 10 repetitions. [sent-337, score-0.68]
89 If we decide to use the last hypothesis technique, our training set size could happen to be such that we end up with a bad output hypothesis. [sent-339, score-0.432]
90 On the other hand, the cutoff averaging hypothesis is accurate, stable and consistent. [sent-340, score-0.969]
91 The performance of the simple averaging conversion technique is not plotted in Fig. [sent-341, score-0.66]
92 2, but we note that it was only slightly worse than the performance of cutoff averaging. [sent-342, score-0.401]
93 When using the classic Perceptron, any form of averaging is beneficial, and our technique successfully identifies this. [sent-343, score-0.471]
94 Fig. 3 shows the test hinge loss of cutoff averaging, last-hypothesis, and simple averaging, when applied to the margin-based Perceptron. [sent-345, score-0.521]
95 Figure 3: Test hinge-loss of last-hypothesis, averaging, and cutoff averaging, each applied to the finite-horizon margin-based Perceptron, on ten binary classification problems from RCV1. [sent-406, score-0.447]
96 and the simple averaging conversion technique is significantly inferior for all training set sizes. [sent-408, score-0.698]
97 1% of the data), the cutoff averaging technique catches up to the last hypothesis and performs comparably well from then on. [sent-410, score-1.166]
98 Once the tail bounds become tight enough, our technique essentially identifies that there is no benefit in constructing a diverse ensemble, and assigns all of the weight to a short suffix of the online hypothesis sequence. [sent-413, score-0.92]
99 However, if we are after a generic solution that performs well in both cases, we need a conversion technique that automatically balances the tradeoff between these two extremes. [sent-416, score-0.477]
100 Suffix averaging [5] and cutoff averaging are two such techniques, with cutoff averaging having a significant computational advantage. [sent-417, score-1.507]
wordName wordTfidf (topN-words)
[('cutoff', 0.401), ('hypothesis', 0.333), ('online', 0.331), ('conversion', 0.254), ('perceptron', 0.242), ('averaging', 0.235), ('hi', 0.215), ('technique', 0.171), ('ccat', 0.165), ('ecat', 0.165), ('gcat', 0.165), ('mcat', 0.165), ('bi', 0.135), ('hypotheses', 0.135), ('batch', 0.134), ('risk', 0.111), ('survival', 0.103), ('ensemble', 0.087), ('loss', 0.086), ('rounds', 0.083), ('sequence', 0.075), ('wi', 0.072), ('yi', 0.066), ('classic', 0.065), ('regret', 0.064), ('li', 0.06), ('ln', 0.059), ('round', 0.058), ('tail', 0.057), ('xi', 0.057), ('bound', 0.056), ('articles', 0.055), ('suffered', 0.055), ('hk', 0.055), ('corpus', 0.048), ('yj', 0.047), ('article', 0.044), ('entire', 0.043), ('extreme', 0.043), ('pre', 0.043), ('ui', 0.042), ('ht', 0.039), ('training', 0.038), ('converting', 0.037), ('erceptron', 0.037), ('survives', 0.037), ('survivor', 0.037), ('lt', 0.036), ('output', 0.035), ('hinge', 0.034), ('memory', 0.034), ('longest', 0.033), ('examples', 0.033), ('minh', 0.032), ('conservative', 0.029), ('reuters', 0.029), ('separators', 0.029), ('chooses', 0.029), ('receive', 0.029), ('validation', 0.028), ('holds', 0.028), ('bounds', 0.028), ('receives', 0.028), ('sign', 0.028), ('algorithm', 0.028), ('label', 0.027), ('xj', 0.027), ('automatically', 0.026), ('constructs', 0.026), ('suffers', 0.026), ('dekel', 0.026), ('observes', 0.026), ('balances', 0.026), ('last', 0.026), ('ten', 0.026), ('suf', 0.026), ('ciency', 0.025), ('consecutive', 0.025), ('distinct', 0.025), ('tokens', 0.025), ('labels', 0.024), ('yt', 0.024), ('lemma', 0.024), ('xt', 0.023), ('deterministically', 0.023), ('namely', 0.021), ('keeps', 0.021), ('uses', 0.021), ('news', 0.021), ('uctuations', 0.021), ('preserves', 0.021), ('heuristic', 0.02), ('processed', 0.02), ('converted', 0.02), ('binary', 0.02), ('favor', 0.02), ('equals', 0.02), ('generalization', 0.02), ('typically', 0.02), ('incurred', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
2 0.18109429 65 nips-2008-Domain Adaptation with Multiple Sources
Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.
3 0.16669987 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
4 0.15380456 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
5 0.15014872 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
6 0.11524442 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
7 0.11511074 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
8 0.10938454 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
9 0.10931841 168 nips-2008-Online Metric Learning and Fast Similarity Search
10 0.10739115 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
11 0.092689157 178 nips-2008-Performance analysis for L\ 2 kernel classification
12 0.089312255 15 nips-2008-Adaptive Martingale Boosting
13 0.0791751 169 nips-2008-Online Models for Content Optimization
14 0.076314412 111 nips-2008-Kernel Change-point Analysis
15 0.074056037 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.07294438 171 nips-2008-Online Prediction on Large Diameter Graphs
17 0.067460425 239 nips-2008-Tighter Bounds for Structured Estimation
18 0.067104734 228 nips-2008-Support Vector Machines with a Reject Option
19 0.066383995 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
20 0.064753316 170 nips-2008-Online Optimization in X-Armed Bandits
topicId topicWeight
[(0, -0.2), (1, 0.011), (2, -0.174), (3, -0.002), (4, -0.162), (5, 0.072), (6, -0.021), (7, -0.015), (8, 0.044), (9, -0.025), (10, -0.007), (11, 0.224), (12, 0.025), (13, 0.037), (14, 0.105), (15, -0.08), (16, -0.018), (17, 0.016), (18, -0.092), (19, 0.072), (20, 0.018), (21, -0.066), (22, 0.041), (23, 0.123), (24, 0.013), (25, -0.015), (26, 0.062), (27, -0.091), (28, 0.113), (29, 0.051), (30, -0.025), (31, 0.06), (32, 0.11), (33, 0.023), (34, 0.007), (35, -0.071), (36, 0.068), (37, 0.044), (38, 0.004), (39, -0.076), (40, -0.115), (41, 0.055), (42, 0.086), (43, 0.113), (44, -0.036), (45, -0.085), (46, -0.044), (47, -0.032), (48, -0.017), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96862763 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
2 0.69580728 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
Author: Matthew Streeter, Daniel Golovin
Abstract: We present an algorithm for solving a broad class of online resource allocation problems. Our online algorithm can be applied in environments where abstract jobs arrive one at a time, and one can complete the jobs by investing time in a number of abstract activities, according to some schedule. We assume that the fraction of jobs completed by a schedule is a monotone, submodular function of a set of pairs (v, τ ), where τ is the time invested in activity v. Under this assumption, our online algorithm performs near-optimally according to two natural metrics: (i) the fraction of jobs completed within time T , for some fixed deadline T > 0, and (ii) the average time required to complete each job. We evaluate our algorithm experimentally by using it to learn, online, a schedule for allocating CPU time among solvers entered in the 2007 SAT solver competition. 1
3 0.60126448 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
4 0.60015017 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
5 0.59835184 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
6 0.58958501 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
7 0.55158949 178 nips-2008-Performance analysis for L\ 2 kernel classification
8 0.54314035 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
9 0.53218991 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
10 0.48346639 169 nips-2008-Online Models for Content Optimization
11 0.46753973 65 nips-2008-Domain Adaptation with Multiple Sources
12 0.46521357 196 nips-2008-Relative Margin Machines
13 0.44694802 211 nips-2008-Simple Local Models for Complex Dynamical Systems
14 0.44689488 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
15 0.4353883 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.42602396 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
17 0.41477358 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
18 0.41080213 185 nips-2008-Privacy-preserving logistic regression
19 0.37316293 111 nips-2008-Kernel Change-point Analysis
20 0.36315182 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
topicId topicWeight
[(6, 0.162), (7, 0.044), (12, 0.068), (28, 0.167), (38, 0.211), (57, 0.041), (59, 0.017), (63, 0.058), (71, 0.029), (77, 0.048), (83, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.81993037 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results.
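The conversions this abstract contrasts itself with include the simplest baseline: run a conservative online learner (one that updates only on mistakes, e.g. the Perceptron) and output the average of its online hypothesis sequence. A minimal sketch of that baseline (not the cutoff-averaging rule itself; the function name is illustrative):

```python
def perceptron_with_average(examples, dim):
    """Run a conservative Perceptron over (x, y) pairs with y in {-1, +1}
    and return both the final weight vector and the average of the online
    hypothesis sequence.  Plain averaging is the classic online-to-batch
    baseline that cutoff averaging generalizes; this sketch shows only
    the baseline.
    """
    w = [0.0] * dim
    running_sum = [0.0] * dim
    for x, y in examples:
        if y * sum(w_i * x_i for w_i, x_i in zip(w, x)) <= 0:
            # Conservative: the hypothesis changes only on a mistake.
            w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
        running_sum = [s + w_i for s, w_i in zip(running_sum, w)]
    n = len(examples)
    return w, [s / n for s in running_sum]
```

The averaged hypothesis is typically more stable than the final one, at the cost of weighting early, inaccurate hypotheses equally with later ones.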
2 0.81757432 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.
3 0.79612011 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM robot arm.
4 0.78711808 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods.
5 0.75386882 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.
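The gradient descent algorithm this abstract refers to uses the step-size schedule η_t = 1/(λt), which gives O(log T) regret on λ-strongly-convex losses. A minimal one-dimensional, unconstrained sketch of that schedule (function and argument names are illustrative):

```python
def ogd_strongly_convex(subgradients, lam):
    """Online gradient descent with eta_t = 1 / (lam * t), the step-size
    schedule that yields logarithmic regret for lam-strongly-convex
    losses.  `subgradients` is a sequence of callables mapping the
    current iterate w to a subgradient of the round-t loss at w.
    Illustrative unconstrained 1-D sketch.
    """
    w = 0.0
    iterates = [w]
    for t, grad in enumerate(subgradients, start=1):
        w -= grad(w) / (lam * t)
        iterates.append(w)
    return iterates
```

On the fixed loss f(w) = (w - 1)^2, which is 2-strongly convex, the first step already lands on the minimizer because η_1 = 1/λ exactly cancels the curvature.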
6 0.75377774 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
7 0.7368328 202 nips-2008-Robust Regression and Lasso
8 0.73594618 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
9 0.73506516 178 nips-2008-Performance analysis for L2 kernel classification
10 0.73354536 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.72907758 75 nips-2008-Estimating vector fields using sparse basis field expansions
12 0.72317797 85 nips-2008-Fast Rates for Regularized Objectives
13 0.72274709 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
14 0.72097087 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
15 0.72069281 62 nips-2008-Differentiable Sparse Coding
16 0.71785837 228 nips-2008-Support Vector Machines with a Reject Option
17 0.71152484 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
18 0.71077824 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
19 0.7102586 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
20 0.7088908 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias