nips nips2005 nips2005-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
Reference: text
sentIndex sentText sentNum sentScore
1 However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. [sent-6, score-0.536]
2 In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. [sent-7, score-0.32]
3 To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. [sent-8, score-0.702]
4 This set of stored examples is the online equivalent of the support set of SVMs; however, in contrast to the support set, it continually changes as learning progresses. [sent-15, score-0.302]
5 In this paper, we call this set the active set, as it includes those examples that actively define the current classifier. [sent-16, score-0.238]
6 Typically, an example is added to the active set every time the online algorithm makes a prediction mistake, or when its confidence in a prediction is too low. [sent-17, score-0.684]
7 Naturally, since computing devices have bounded memory resources, there is the danger that an online algorithm would require more memory than is physically available. [sent-19, score-0.397]
8 This problem becomes especially evident in cases where the online algorithm is implemented as part of a specialized hardware system with a small memory, such as a mobile telephone or an autonomous robot. [sent-20, score-0.267]
9 Moreover, an excessively large active set can lead to unacceptably long running times, as the time-complexity of each online round scales linearly with the size of the active set. [sent-21, score-0.855]
10 Crammer, Kandola, and Singer [2] first addressed this problem by describing an online kernel-based modification of the Perceptron algorithm in which the active set does not exceed a predefined budget. [sent-22, score-0.456]
11 Their algorithm removes redundant examples from the active set so as to make the best use of the limited memory resource. [sent-23, score-0.349]
12 Weston, Bordes and Bottou [9] followed with their own online kernel machine on a budget. [sent-24, score-0.268]
13 In this paper we present the Forgetron algorithm for online kernel-based classification. [sent-26, score-0.267]
14 To the best of our knowledge, the Forgetron is the first online algorithm with a fixed memory budget which also entertains a formal worst-case mistake bound. [sent-27, score-0.994]
15 We name our algorithm the Forgetron since its update builds on that of the Perceptron and since it gradually forgets active examples as learning progresses. [sent-28, score-0.313]
16 In Sec. 2 we begin with a more formal presentation of our problem and discuss some difficulties in proving mistake bounds for kernel methods on a budget. [sent-31, score-0.42]
17 In Sec. 3 we present an algorithmic framework for online prediction with a predefined budget of active examples. [sent-33, score-0.744]
18 On round t the online algorithm observes an instance xt , which is drawn from some predefined instance domain X . [sent-40, score-0.66]
19 The predictions of the online algorithm are determined by a hypothesis which is stored in its internal memory and is updated from round to round. [sent-44, score-0.693]
20 Our focus in this paper is on margin based hypotheses, namely, ft is a function from X to R where sign(ft (xt )) constitutes the actual binary prediction and |ft (xt )| is the confidence in this prediction. [sent-46, score-0.466]
21 First, we can check whether the hypothesis makes a prediction mistake, namely determine whether y = sign(f (x)) or not. [sent-49, score-0.26]
22 Throughout this paper, we use M to denote the number of prediction mistakes made by an online algorithm on a sequence of examples (x1, y1), . . . , (xT, yT). [sent-50, score-0.56]
23 That is, the hypothesis used on round t takes the form ft(x) = Σ_{i∈It} αi K(xi, x), where It is a subset of {1, . . . , t−1}. [sent-72, score-0.652]
24 Here xi is the example observed by the algorithm on round i. [sent-75, score-0.285]
25 As stated above, It is called the active set, and we say that example xi is active on round t if i ∈ It . [sent-76, score-0.679]
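To make this representation concrete, here is a minimal sketch of evaluating such a kernel-expansion hypothesis. The Gaussian kernel and the names `gaussian_kernel` and `evaluate` are illustrative assumptions, not the paper's notation; any Mercer kernel with K(x, x) ≤ 1 fits the normalization assumed by the bounds below.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    # Illustrative Mercer kernel with K(x, x) = 1, matching the
    # normalization K(xt, xt) <= 1 assumed by the theorems below.
    return np.exp(-gamma * np.linalg.norm(u - v) ** 2)

def evaluate(active, x, kernel=gaussian_kernel):
    # active holds (alpha_i, x_i) pairs for i in the active set I_t;
    # returns f_t(x) = sum over i in I_t of alpha_i * K(x_i, x).
    return sum(alpha * kernel(xi, x) for alpha, xi in active)
```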
26 Perhaps the most well known online algorithm for binary classification is the Perceptron [6]. [sent-77, score-0.298]
27 Stated in the form of a kernel method, the hypotheses generated by the Perceptron take the form ft(x) = Σ_{i∈It} yi K(xi, x). [sent-78, score-0.396]
28 Namely, the weight assigned to each active example is either +1 or −1, depending on the label of that example. [sent-79, score-0.244]
29 The Perceptron then updates its hypothesis only on rounds where a prediction mistake is made. [sent-81, score-0.615]
30 Concretely, on round t, if sign(ft(xt)) ≠ yt then the index t is inserted into the active set. [sent-82, score-0.832]
31 As a consequence, the size of the active set on round t equals the number of prediction mistakes made on previous rounds. [sent-83, score-0.669]
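This description translates into a short online loop; the following is a sketch under the same illustrative assumptions, reusing the hypothetical `evaluate` and `gaussian_kernel` helpers from the previous snippet.

```python
def kernel_perceptron(stream, kernel=gaussian_kernel):
    # stream yields (x_t, y_t) pairs with y_t in {-1, +1}.
    active = []      # (weight, example) pairs; each weight is a label
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if evaluate(active, x, kernel) > 0 else -1
        if y_hat != y:             # update only on prediction mistakes
            active.append((y, x))  # insert index t into the active set
            mistakes += 1          # |I_t| always equals M so far
    return active, mistakes
```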
32 A relative mistake bound can be proven for the Perceptron algorithm. [sent-84, score-0.416]
33 The bound holds for any sequence of instance-label pairs, and compares the number of mistakes made by the Perceptron with the cumulative hinge-loss of any fixed hypothesis g ∈ HK , even one defined with prior knowledge of the sequence. [sent-85, score-0.378]
34 Let (x1, y1), . . . , (xT, yT) be a sequence of examples such that K(xt, xt) ≤ 1 for all t. [sent-90, score-0.239]
35 Then the number of prediction mistakes made by the Perceptron on this sequence is bounded by M ≤ ‖g‖² + 2 Σ_{t=1}^{T} ℓ̂t. [sent-92, score-0.268]
36 Although the Perceptron is guaranteed to be competitive with any fixed hypothesis g ∈ HK, the fact that its active set can grow without a bound poses a serious computational problem. [sent-93, score-0.416]
37 In fact, this problem is common to most kernel-based online methods that do not explicitly monitor the size of It . [sent-94, score-0.253]
38 As discussed above, our goal is to derive and analyze an online prediction algorithm which resolves these problems by enforcing a fixed bound on the size of the active set. [sent-95, score-0.628]
39 Formally, let B be a positive integer, which we refer to as the budget parameter. [sent-96, score-0.262]
40 We would like to devise an algorithm which enforces the constraint |It | ≤ B on every round t. [sent-97, score-0.364]
41 Furthermore, we would like to prove a relative mistake bound for this algorithm, analogous to the bound stated in Thm. 1. [sent-98, score-0.533]
42 We show this inherent limitation by presenting a simple counterexample which applies to any online algorithm which uses a prediction function of the form given in Eq. [sent-101, score-0.383]
43 In this example, we show a hypothesis g ∈ HK and an arbitrarily long sequence of examples such that the algorithm makes a prediction mistake on every single round whereas g suffers no loss at all. [sent-103, score-0.97]
44 Now for every t, ft is a linear combination of at most B vectors from X . [sent-106, score-0.337]
45 Since |X | = B + 1, there exists a vector xt ∈ X which is currently not in the active set. [sent-107, score-0.339]
46 Furthermore, xt is orthogonal to all of the active vectors and therefore ft (xt ) = 0. [sent-108, score-0.64]
47 Assume without loss of generality that the online algorithm we are using predicts yt to be −1 when ft(xt) = 0. [sent-109, score-0.664]
48 If on every round we were to present the online algorithm with the example (xt, +1) then the online algorithm would make a prediction mistake on every round. [sent-110, score-1.266]
49 We have found a sequence of examples and a fixed hypothesis (which is indeed defined by more than B vectors from X ) that attains a cumulative loss of zero on this sequence, while the number of mistakes made by the online algorithm equals the number of rounds. [sent-112, score-0.659]
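This counterexample is easy to check numerically: take the linear kernel and let X be the B + 1 standard basis vectors, so any hypothesis supported on at most B of them has zero margin on whichever vector is absent from the active set. The snippet below is an illustrative verification under those assumptions.

```python
import numpy as np

B = 3
X = np.eye(B + 1)        # B + 1 mutually orthonormal instances
# Any B of them in the active set, with arbitrary weights:
active = [(1.0, X[i]) for i in range(B)]

# The left-out vector is orthogonal to every active example,
# so the hypothesis value on it is exactly zero:
x_left_out = X[B]
f_value = sum(a * np.dot(xi, x_left_out) for a, xi in active)  # linear kernel
assert f_value == 0.0    # predicted label -1, true label +1: a mistake
```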
50 Formally, we wish to devise an online algorithm which is competitive with every hypothesis g ∈ HK for which ‖g‖ ≤ U, for some constant U. [sent-117, score-0.487]
51 Our counterexample indicates that we cannot prove a relative mistake bound with U set to at least √(B+1), since that was the norm of g in our counterexample. [sent-118, score-0.482]
52 In this paper we come close to this upper bound by proving that our algorithms can compete with any hypothesis with a norm bounded by (1/4)√((B+1)/log(B+1)). [sent-119, score-0.317]
53 Recall that whenever the Perceptron makes a prediction mistake, it updates its hypothesis by adding the element t to It . [sent-121, score-0.259]
54 Thus, on any given round, the size of its active set equals the number of prediction mistakes it has made so far. [sent-122, score-0.445]
55 This implies that the Perceptron may violate the budget constraint |It | ≤ B. [sent-123, score-0.262]
56 We can solve this problem by removing an example from the active set whenever its size exceeds B. [sent-124, score-0.263]
57 One simple strategy is to remove the oldest example in the active set whenever |It | > B. [sent-125, score-0.298]
58 Let t be a round on which the Perceptron makes a prediction mistake. [sent-126, score-0.331]
59 Otherwise, we apply a removal step by finding the oldest example in the active set, rt = min I′t, and setting It+1 = I′t \ {rt}. [sent-131, score-0.546]
60 The resulting algorithm is a simple modification of the kernel Perceptron, which conforms with a fixed budget constraint. [sent-132, score-0.312]
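A sketch of this remove-oldest variant follows, again under the illustrative helpers introduced earlier:

```python
def remove_oldest_perceptron(stream, B, kernel=gaussian_kernel):
    # Kernel Perceptron forced to respect |I_t| <= B by discarding the
    # oldest active example whenever the budget would be exceeded.
    active = []   # kept in insertion order, so active[0] is the oldest
    for x, y in stream:
        y_hat = 1 if evaluate(active, x, kernel) > 0 else -1
        if y_hat != y:
            active.append((y, x))     # standard Perceptron insertion
            if len(active) > B:
                active.pop(0)         # removal step: drop r_t = min I'_t
    return active
```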
61 While we are unable to prove a mistake bound for this algorithm, it is nonetheless an important milestone on the path to an algorithm with a fixed budget and a formal mistake bound. [sent-133, score-1.091]
62 The removal of the oldest active example from It may significantly change the hypothesis and affect its accuracy. [sent-134, score-0.519]
63 By controlling the weight of the oldest active example, we can guarantee that the removal step will not significantly affect the accuracy of our predictions. [sent-136, score-0.456]
64 More formally, we redefine our hypothesis to be ft = Σ_{i∈It} σi,t yi K(xi, ·), where each σi,t is a weight in (0, 1]. [sent-137, score-0.457]
65 On round t, if a prediction mistake occurs, a three-step update is performed. [sent-141, score-0.734]
66 The first step is the standard Perceptron update, namely, the index t is inserted into the active set and the weight σt,t is set to 1. [sent-142, score-0.275]
67 Let I′t denote the active set which results from this update, and let f′t denote the resulting hypothesis, f′t(x) = ft(x) + yt K(xt, x). [sent-143, score-0.609]
68 The second step of the update is a shrinking step in which we scale f′t by a coefficient φt ∈ (0, 1]. [sent-144, score-0.309]
69 If the shrinking coefficients φt are sufficiently small, then the example weights σi,t decrease rapidly with t, and in particular the weight of the oldest active example can be made arbitrarily small. [sent-152, score-0.529]
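One simple way to realize such decaying weights, sketched below, is to multiply every stored weight by φt on each mistake round, so that σi,t is the product of all shrinking coefficients applied since example i was inserted; treating φ as a fixed constant here is a simplifying assumption.

```python
def insert_and_shrink(active, x, y, phi):
    # Step 1: standard Perceptron insertion with initial weight 1.
    active.append([1.0, y, x])   # entries are [sigma_i, y_i, x_i]
    # Step 2: shrinking - scaling the whole hypothesis by phi in (0, 1]
    # multiplies every sigma_{i,t}, so older examples end up with
    # geometrically smaller weights than newer ones.
    for entry in active:
        entry[0] *= phi
    return active
```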
70 Alas, aggressively shrinking the online hypothesis on every update might itself degrade its performance, and therefore φt should not be set too small. [sent-154, score-0.991]
71 To formalize this tradeoff, we begin with the mistake bound in Thm. 1. [sent-156, score-0.416]
72 We then investigate how it is affected by the shrinking and removal steps. [sent-157, score-0.319]
73 Let J denote the set of rounds on which the Forgetron makes a prediction mistake and define the function Ψ(σ, φ, µ) = (σφ)² + 2σφ(1 − φµ). [sent-159, score-0.51]
74 On this round, example rt is removed from the active set. [sent-161, score-0.338]
75 Let µt = yrt f′t(xrt) be the signed margin attained by f′t on the active example being removed. [sent-162, score-0.238]
76 Lemma 1 below states that removing example rt from the active set on round t increases the mistake bound by Ψt . [sent-164, score-0.975]
77 In addition, it is clear from the definition of Ψt that µt also plays a key role in determining whether xrt can be safely removed from the active set. [sent-166, score-0.254]
78 We note in passing that [2] used a heuristic criterion similar to µt to dynamically choose which active example to remove on each online round. [sent-167, score-0.42]
79 Turning to the shrinking step, for every t ∈ J we define Φt = 1 if ‖ft+1‖ ≥ U; Φt = φt if ‖f′t‖ ≤ U ∧ ‖ft+1‖ < U; and Φt = φt‖f′t‖/U if ‖f′t‖ > U ∧ ‖ft+1‖ < U. [sent-168, score-0.236]
80 Lemma 1 below also states that applying the shrinking step on round t increases the mistake bound by U² log(1/Φt). [sent-169, score-0.875]
81 Note that if ‖ft+1‖ ≥ U then Φt = 1 and the shrinking step on round t has no effect on our mistake bound. [sent-170, score-0.81]
82 Intuitively, this is due to the fact that, in this case, the shrinking step does not make the norm of ft+1 smaller than the norm of our competitor, g. [sent-171, score-0.305]
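Both damage terms can be computed directly from the definitions above; in the sketch below, `norm_f_prime` and `norm_f_next` stand for ‖f′t‖ and ‖ft+1‖, and the function names are assumptions of this illustration.

```python
import math

def psi(sigma, phi, mu):
    # Psi(sigma, phi, mu) = (sigma*phi)^2 + 2*sigma*phi*(1 - phi*mu),
    # the increase in the mistake bound caused by one removal step.
    return (sigma * phi) ** 2 + 2 * sigma * phi * (1 - phi * mu)

def shrink_cost(phi, norm_f_prime, norm_f_next, U):
    # Returns U^2 * log(1/Phi_t), the cost of one shrinking step,
    # with Phi_t given by the three-case definition in the text.
    if norm_f_next >= U:
        big_phi = 1.0                    # shrinking costs nothing here
    elif norm_f_prime <= U:
        big_phi = phi
    else:
        big_phi = phi * norm_f_prime / U
    return U ** 2 * math.log(1.0 / big_phi)
```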
83 Let (x1, y1), . . . , (xT, yT) be a sequence of examples such that K(xt, xt) ≤ 1 for all t, and assume that this sequence is presented to the Forgetron with a budget constraint B. [sent-176, score-0.541]
84 The first term in the bound of Lemma 1 is identical to the mistake bound of the standard Perceptron, given in Thm. 1. [sent-179, score-0.481]
85 The second term is the consequence of the removal and shrinking steps. [sent-181, score-0.319]
86 If we set the shrinking coefficients in such a way that the second term is at most M/2, then the bound in Lemma 1 reduces to M ≤ ‖g‖² + 2 Σt ℓ̂t + M/2. [sent-182, score-0.265]
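Spelling out this reduction as a one-line rearrangement:

```latex
M \le \|g\|^2 + 2\sum_{t}\hat{\ell}_t + \tfrac{M}{2}
\;\Longrightarrow\;
\tfrac{M}{2} \le \|g\|^2 + 2\sum_{t}\hat{\ell}_t
\;\Longrightarrow\;
M \le 2\|g\|^2 + 4\sum_{t}\hat{\ell}_t ,
```

which matches the form of the Forgetron mistake bound stated below.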
87 In the next section, we define the specific mechanism used by the Forgetron algorithm to choose the shrinking coefficients φt . [sent-188, score-0.236]
88 Then, we conclude our analysis by arguing that this choice satisfies the sufficient conditions stated in Lemma 2, and obtain a mistake bound as described above. [sent-189, score-0.468]
89 In words, Jt is the set of rounds on which the algorithm made a mistake up until round t, and Mt is the size of this set. [sent-193, score-0.712]
90 Let i denote a round on which the algorithm makes a prediction mistake and on which an example must be removed from the active set. [sent-201, score-0.937]
91 To prove a mistake bound it suffices to show that the two conditions stated in Lemma 2 hold. [sent-210, score-0.468]
92 INPUT: Mercer kernel K(·,·); budget parameter B > 0. INITIALIZE: I1 = ∅; f1 ≡ 0; Q1 = 0; M0 = 0. For t = 1, 2, . . . [sent-214, score-0.276]
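Putting the three-step update together, the following schematic sketch mirrors the loop initialized above, reusing the illustrative `gaussian_kernel` from the earlier snippets. The rule that derives φt from the bookkeeping quantities Qt and Mt is the paper's; it is replaced here by a fixed placeholder coefficient, so this illustrates the control flow rather than the analyzed algorithm.

```python
def forgetron(stream, B, phi=0.9, kernel=gaussian_kernel):
    # Schematic Forgetron loop (phi fixed as a placeholder; the paper
    # chooses phi_t adaptively from Q_t and M_t on every mistake round).
    active = []                          # entries: [sigma_i, y_i, x_i]
    for x, y in stream:
        f = sum(s * yi * kernel(xi, x) for s, yi, xi in active)
        if (1 if f > 0 else -1) != y:    # mistake round
            active.append([1.0, y, x])   # step 1: Perceptron insertion
            for entry in active:         # step 2: shrinking by phi_t
                entry[0] *= phi
            if len(active) > B:          # step 3: remove the oldest, r_t
                active.pop(0)
    return active
```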
93 Let (x1, y1), . . . , (xT, yT) be a sequence of examples such that K(xt, xt) ≤ 1 for all t. [sent-222, score-0.239]
94 Then the number of prediction mistakes made by the Forgetron on this sequence is at most M ≤ 2‖g‖² + 4 Σ_{t=1}^{T} ℓ̂t. 5 Experiments and Discussion In this section we present preliminary experimental results which demonstrate the merits of the Forgetron algorithm. [sent-226, score-0.27]
95 When the CKS algorithm exceeds its budget, it removes the active example whose margin would be the largest after the removal. [sent-228, score-0.296]
96 For each budget value, we ran the two algorithms on all 126 binary problems and averaged the results. [sent-232, score-0.291]
97 Figure 2: The error of different budget algorithms as a function of the budget size B on the census-income (adult) dataset (left) and on the MNIST dataset (right). [sent-249, score-0.031]
98 In this paper we described the Forgetron, a kernel-based online learning algorithm with a fixed memory budget. [sent-259, score-0.356]
99 We further argued that no algorithm with a budget of B active examples can be competitive, on every input sequence, with every hypothesis whose norm is √(B+1). [sent-261, score-0.782]
100 The analysis presented in this paper can be used to derive a family of online algorithms of which the Forgetron is only one special case. [sent-263, score-0.252]
wordName wordTfidf (topN-words)
[('forgetron', 0.462), ('mistake', 0.351), ('ft', 0.301), ('perceptron', 0.266), ('budget', 0.239), ('online', 0.231), ('round', 0.224), ('shrinking', 0.2), ('active', 0.189), ('xt', 0.15), ('hypothesis', 0.127), ('rt', 0.119), ('removal', 0.119), ('cks', 0.106), ('yt', 0.096), ('lemma', 0.093), ('hk', 0.092), ('mistakes', 0.092), ('mt', 0.086), ('prediction', 0.085), ('oldest', 0.084), ('qt', 0.07), ('bound', 0.065), ('jt', 0.062), ('hypotheses', 0.058), ('yf', 0.053), ('memory', 0.053), ('rounds', 0.052), ('stated', 0.052), ('margin', 0.049), ('examples', 0.049), ('formal', 0.049), ('sequence', 0.04), ('qi', 0.04), ('mercer', 0.039), ('update', 0.039), ('classi', 0.038), ('kernel', 0.037), ('every', 0.036), ('algorithm', 0.036), ('sign', 0.036), ('entertains', 0.035), ('initializes', 0.035), ('xrt', 0.035), ('mnist', 0.035), ('step', 0.035), ('norm', 0.035), ('competitive', 0.035), ('crammer', 0.032), ('ei', 0.031), ('binary', 0.031), ('innerproduct', 0.031), ('fth', 0.031), ('bordes', 0.031), ('dekel', 0.031), ('rb', 0.031), ('counterexample', 0.031), ('dataset', 0.031), ('equals', 0.03), ('prede', 0.03), ('removed', 0.03), ('weight', 0.029), ('cation', 0.029), ('mi', 0.027), ('removing', 0.027), ('made', 0.027), ('cumulative', 0.027), ('label', 0.026), ('claims', 0.026), ('kandola', 0.026), ('merits', 0.026), ('abbreviate', 0.026), ('namely', 0.026), ('xi', 0.025), ('whenever', 0.025), ('compete', 0.025), ('bounded', 0.024), ('enforces', 0.023), ('constraint', 0.023), ('datasets', 0.023), ('coef', 0.023), ('let', 0.023), ('ready', 0.022), ('devise', 0.022), ('adult', 0.022), ('removes', 0.022), ('labels', 0.022), ('stored', 0.022), ('formally', 0.022), ('makes', 0.022), ('size', 0.022), ('inserted', 0.022), ('cients', 0.021), ('algorithms', 0.021), ('zj', 0.021), ('weston', 0.02), ('proving', 0.02), ('ji', 0.02), ('observes', 0.019), ('hebrew', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
2 0.16877653 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
3 0.16353661 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
4 0.13363706 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
5 0.1140236 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
6 0.097612403 95 nips-2005-Improved risk tail bounds for on-line algorithms
7 0.096195348 74 nips-2005-Faster Rates in Regression via Active Learning
8 0.084962428 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models
9 0.077364117 45 nips-2005-Conditional Visual Tracking in Kernel Space
10 0.067450434 148 nips-2005-Online Discovery and Learning of Predictive State Representations
11 0.066877626 19 nips-2005-Active Learning for Misspecified Models
12 0.064535365 144 nips-2005-Off-policy Learning with Options and Recognizers
13 0.064490855 80 nips-2005-Gaussian Process Dynamical Models
14 0.063996404 117 nips-2005-Learning from Data of Variable Quality
15 0.061581515 160 nips-2005-Query by Committee Made Real
16 0.0596736 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
17 0.057250481 85 nips-2005-Generalization to Unseen Cases
18 0.056946043 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
19 0.055682819 104 nips-2005-Laplacian Score for Feature Selection
20 0.054538801 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
topicId topicWeight
[(0, 0.172), (1, 0.054), (2, 0.014), (3, -0.082), (4, 0.127), (5, 0.106), (6, 0.068), (7, 0.086), (8, -0.065), (9, 0.13), (10, -0.146), (11, 0.278), (12, -0.011), (13, -0.127), (14, 0.021), (15, -0.158), (16, -0.093), (17, 0.001), (18, -0.049), (19, 0.005), (20, -0.045), (21, 0.046), (22, 0.058), (23, 0.005), (24, 0.124), (25, -0.022), (26, 0.002), (27, 0.003), (28, 0.03), (29, -0.057), (30, -0.037), (31, 0.01), (32, -0.0), (33, -0.059), (34, 0.018), (35, -0.05), (36, 0.093), (37, 0.023), (38, -0.05), (39, -0.044), (40, -0.016), (41, 0.001), (42, -0.045), (43, -0.029), (44, 0.102), (45, 0.069), (46, 0.058), (47, 0.007), (48, -0.015), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.96121657 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
2 0.82807404 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
3 0.76878083 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
4 0.64908302 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
5 0.56003928 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
6 0.4257606 160 nips-2005-Query by Committee Made Real
7 0.39478493 74 nips-2005-Faster Rates in Regression via Active Learning
8 0.37909776 117 nips-2005-Learning from Data of Variable Quality
9 0.37090024 45 nips-2005-Conditional Visual Tracking in Kernel Space
10 0.36307901 148 nips-2005-Online Discovery and Learning of Predictive State Representations
11 0.36180335 50 nips-2005-Convex Neural Networks
12 0.34840599 19 nips-2005-Active Learning for Misspecified Models
13 0.34624273 85 nips-2005-Generalization to Unseen Cases
14 0.34532076 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
15 0.33479187 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
16 0.31902632 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models
17 0.31541139 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
18 0.31367835 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers
19 0.29869324 144 nips-2005-Off-policy Learning with Options and Recognizers
20 0.28810847 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
topicId topicWeight
[(3, 0.094), (10, 0.048), (13, 0.277), (27, 0.03), (31, 0.026), (34, 0.104), (41, 0.01), (55, 0.038), (69, 0.047), (73, 0.037), (88, 0.097), (91, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.80553997 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
2 0.58137351 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.
3 0.57422048 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection.
4 0.57348007 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire
Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.
5 0.57278442 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut.
6 0.57218236 50 nips-2005-Convex Neural Networks
7 0.56966704 41 nips-2005-Coarse sample complexity bounds for active learning
8 0.56843513 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
9 0.56769174 160 nips-2005-Query by Committee Made Real
10 0.56766355 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
11 0.56581533 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
12 0.56467783 47 nips-2005-Consistency of one-class SVM and related algorithms
13 0.56450796 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
14 0.56361109 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
15 0.56283087 58 nips-2005-Divergences, surrogate loss functions and experimental design
16 0.56270909 54 nips-2005-Data-Driven Online to Batch Conversions
17 0.5625639 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
18 0.56051183 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
19 0.56013018 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
20 0.5591405 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization