jmlr jmlr2009 jmlr2009-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. [sent-7, score-0.271]
2 We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. [sent-12, score-0.243]
3 The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as is demonstrated for ALMA2. [sent-13, score-0.302]
4 Keywords: online learning, kernel methods, support vector machines, bounded support set 1. [sent-15, score-0.233]
5 On rounds where the online algorithm makes a prediction mistake or when the confidence in the prediction is not sufficient, the algorithm adds the instance to a set of stored instances, called the support set. [sent-20, score-0.351]
6 The online classification function is defined as a weighted sum of kernel evaluations between the input and the instances in the support set. [sent-21, score-0.196]
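For concreteness, the sketch below (not taken from the paper; the kernel choice and all names are illustrative) shows how such a hypothesis, f(x) = ∑_{xi∈St} αi k(xi, x) as in Equation (1), can be evaluated:

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    # k(x1, x2) = exp(-gamma * ||x1 - x2||^2); any Mercer kernel would do
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def evaluate_hypothesis(x, support_set, alphas, kernel=gaussian_kernel):
    # f(x) = sum_i alpha_i * k(x_i, x), summed over the current support set
    return sum(a * kernel(xi, x) for xi, a in zip(support_set, alphas))
```

Both the memory footprint and the cost of a single evaluation grow with the size of the support set, which is exactly the problem the paper addresses.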
7 The very first online algorithm to have a fixed memory budget and a relative mistake bound is the Forgetron algorithm (Dekel et al. [sent-39, score-0.313]
8 A different approach to address this problem for online Gaussian processes is proposed in Csató and Opper (2002), where, in common with our approach, the instances are not discarded, but rather projected onto the space spanned by the instances in the support set. [sent-44, score-0.347]
9 Either they are projected onto the space spanned by the support set, or they are added to the support set. [sent-51, score-0.232]
10 By using this method, we show that the support set and, hence, the online hypothesis, is guaranteed to be bounded, although we cannot predict its size before training. [sent-52, score-0.15]
11 The main advantage of this setup is that by using all training samples, we are able to provide an online hypothesis with high online accuracy. [sent-54, score-0.256]
12 We modify the Perceptron algorithm so that the number of stored samples needed to represent the online hypothesis is always bounded. [sent-57, score-0.179]
13 We present a relative mistake bound for the Projectron algorithm, and deduce from it a new online bounded algorithm which outperforms the Perceptron algorithm, but still retains all of its advantages. [sent-60, score-0.268]
14 As far as we know, this is the first bounded multiclass and structured output online algorithm, with a relative mistake bound. [sent-63, score-0.311]
15 1 Our technique for bounding the size of the support set can be applied to any conservative kernel-based online algorithm and to other online algorithms, as we demonstrate for ALMA2 (Gentile, 2001). [sent-64, score-0.288]
16 Note that converting the budget algorithms presented by other authors, such as the Forgetron, to the multiclass or the structured output setting is not trivial, since these algorithms are inherently binary in nature. [sent-66, score-0.153]
17 The examples form a sequence {(x1, y1), (x2, y2), . . .}, where xt ∈ X is called an instance and yt ∈ {−1, +1} is called a label. [sent-85, score-0.534]
18 We denote the hypothesis estimated after the t-th round by ft . [sent-92, score-0.593]
19 At each round t, an instance xt ∈ X is presented to the algorithm, which predicts a label ŷt ∈ {−1, +1} using the current function, ŷt = sign(ft−1(xt)). [sent-94, score-1.306]
20 If the prediction ŷt differs from the correct label yt, the hypothesis is updated as ft = ft−1 + yt k(xt, ·); otherwise the hypothesis is left intact, ft = ft−1. [sent-96, score-1.84]
21 The hypothesis ft can be written as a kernel expansion according to the representer theorem (Schölkopf et al. [sent-97, score-0.593]
22 , 2001), ft(x) = ∑_{xi∈St} αi k(xi, x), (1) where αi = yi and St is defined to be the set of instances for which an update of the hypothesis occurred, that is, St = {xi, 0 ≤ i ≤ t | ŷi ≠ yi}. [sent-98, score-0.605]
23 Receive new instance xt; predict ŷt = sign(ft−1(xt)); receive label yt; if ŷt ≠ yt, then ft = ft−1 + yt k(xt, ·) (update the hypothesis) and St = St−1 ∪ {xt} (add instance xt to the support set); else ft = ft−1 and St = St−1. Figure 1: The kernel-based Perceptron algorithm. [sent-104, score-3.073]
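A minimal runnable sketch of the kernel-based Perceptron of Figure 1 (illustrative only; the function and variable names are not the authors' code, and any kernel, such as the Gaussian kernel above, can be passed in):

```python
def kernel_perceptron(stream, kernel):
    """stream yields (instance, label) pairs with labels in {-1, +1}."""
    support_set, alphas = [], []   # S_t and the coefficients alpha_i
    mistakes = 0
    for x_t, y_t in stream:
        # f_{t-1}(x_t): kernel expansion over the current support set
        f_xt = sum(a * kernel(xi, x_t) for xi, a in zip(support_set, alphas))
        y_hat = 1 if f_xt >= 0 else -1      # y_hat = sign(f_{t-1}(x_t))
        if y_hat != y_t:                    # prediction mistake
            support_set.append(x_t)         # S_t = S_{t-1} U {x_t}
            alphas.append(y_t)              # alpha_t = y_t
            mistakes += 1
    return support_set, alphas, mistakes
```

Every mistake adds a vector, so on hard or noisy streams the support set, and with it the prediction cost, keeps growing; the Projectron is designed to curb exactly this growth.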
24 Although the Perceptron algorithm is very simple, it produces an online hypothesis with good performance. [sent-105, score-0.179]
25 Recall that the hypothesis ft is represented as a weighted sum over all the instances in the support set. [sent-107, score-0.628]
26 It continues with a description of how to calculate the projected hypothesis and describes some other computational aspects of the algorithm. [sent-111, score-0.161]
27 On a prediction mistake, we check whether the instance xt can be spanned by the support set, namely, whether there exist scalars di ∈ R, 1 ≤ i ≤ |St−1|, not all zero, such that k(xt, ·) = ∑_i di k(xi, ·). [sent-119, score-0.408]
28 On the other hand, if the instance and the support set are linearly independent, the instance is added to the set with αt = yt as before. [sent-122, score-0.317]
29 In particular, let us assume that at round t of the algorithm there is a prediction mistake, and that the mistaken instance xt should be added to the support set, St−1 . [sent-131, score-0.41]
30 Before adding the instance to the support set, we construct two hypotheses: a temporary hypothesis, ft′, using the function k(xt, ·), that is, ft′ = ft−1 + yt k(xt, ·), and a projected hypothesis, ft′′, that is the projection of ft′ onto the space Ht−1. [sent-134, score-0.475]
31 That is, the projected hypothesis is that hypothesis from the space Ht−1 which is the closest to the temporary hypothesis. [sent-135, score-0.262]
32 If the norm of the distance, ‖δt‖, is below some threshold η, we use the projected hypothesis as our next hypothesis, that is, ft = ft′′; otherwise we use the temporary hypothesis as our next hypothesis, that is, ft = ft′. [sent-138, score-1.222]
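A minimal sketch of this mistake-driven decision, with a plain least-squares solve standing in for the incremental computation described later (numpy; all names are illustrative and this is not the authors' implementation):

```python
import numpy as np

def projectron_mistake_update(x_t, y_t, support_set, alphas, kernel, eta):
    """One Projectron update on a prediction mistake (illustrative sketch)."""
    if support_set:
        K = np.array([[kernel(a, b) for b in support_set] for a in support_set])
        k_t = np.array([kernel(xi, x_t) for xi in support_set])
        d_star = np.linalg.lstsq(K, k_t, rcond=None)[0]   # projection coefficients
        delta_sq = kernel(x_t, x_t) - k_t @ d_star        # squared projection error
    else:
        d_star, delta_sq = np.zeros(0), kernel(x_t, x_t)
    if delta_sq <= eta ** 2:
        # ||delta_t|| <= eta: keep the projected hypothesis f''_t; the support
        # set does not grow, y_t is folded into the existing coefficients
        for i in range(len(alphas)):
            alphas[i] += y_t * d_star[i]
    else:
        # otherwise fall back to the Perceptron update: add x_t with alpha_t = y_t
        support_set.append(x_t)
        alphas.append(y_t)
    return support_set, alphas
```

Solving a full least-squares problem on every mistake is shown here only for clarity; the paper's incremental computation with the stored inverse kernel matrix (Figure 4) avoids it.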
33 Practical Considerations. We now consider the problem of deriving the projected hypothesis ft′′ in a Hilbert space H, induced by a kernel function k(·, ·). [sent-151, score-0.189]
34 Recall that ft′ is defined as ft′ = ft−1 + yt k(xt, ·). [sent-152, score-0.716]
35 The projected hypothesis ft′′ is defined as ft′′ = Pt−1 ft′ . [sent-154, score-0.161]
36 Expanding ft′ we have ft′′ = Pt−1 ft′ = Pt−1 ( ft−1 + yt k(xt , ·)) . [sent-156, score-0.236]
37 The projection is a linear operator, hence ft′′ = ft−1 + yt Pt−1 k(xt , ·) . [sent-161, score-0.264]
38 By substituting ft′′ from Equation (3) and ft′ we have δt = ft′′ − ft′ = yt Pt−1 k(xt , ·) − yt k(xt , ·) . [sent-163, score-0.472]
39 Expanding Equation (5) we get ‖δt‖² = min_d ∑_{xi,xj∈St−1} dj di k(xj, xi) − 2 ∑_{xj∈St−1} dj k(xj, xt) + k(xt, xt). [sent-170, score-0.572]
40 Let us also define kt ∈ R^{|St−1|} to be the vector whose i-th element is (kt)i = k(xi, xt). [sent-173, score-0.367]
41 We have ‖δt‖² = min_d (dᵀ Kt−1 d − 2 dᵀ kt + k(xt, xt)). (6) [sent-174, score-0.367]
42 Solving Equation (6), that is, applying the extremum conditions with respect to d, we obtain d⋆ = (Kt−1)⁻¹ kt (7) and, by substituting Equation (7) into Equation (6), ‖δt‖² = k(xt, xt) − ktᵀ d⋆. (8) [sent-175, score-0.367]
43 Furthermore, by substituting Equation (7) back into Equation (3) we get ft′′ = ft−1 + yt ∑_{xj∈St−1} d⋆j k(xj, ·). (9) [sent-176, score-0.236]
44 We have shown how to calculate both the distance δt and the projected hypothesis ft′′. [sent-177, score-0.161]
45 The inverse kernel matrix for the next round, (Kt)⁻¹, can be computed incrementally from (Kt−1)⁻¹, d⋆, and ‖δt‖², so no explicit matrix inversion is needed during training.
46 Figure 4 summarizes the calculation. Input: new instance xt, (Kt−1)⁻¹, and the support set St−1. Set kt = (k(x1, xt), k(x2, xt), . . . , k(x|St−1|, xt)); solve d⋆ = (Kt−1)⁻¹ kt; set ‖δt‖² = k(xt, xt) − ktᵀ d⋆; the projected hypothesis is ft′′ = ft−1 + yt ∑_{xj∈St−1} d⋆j k(xj, ·); finally, update the kernel inverse matrix for the next round.
47 Output: the projected hypothesis ft′′, the measure ‖δt‖, and the updated kernel inverse matrix.
48 Figure 4: Calculation of the projected hypothesis ft′′ . [sent-194, score-0.161]
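A sketch of this incremental computation (numpy; the block-inverse update shown is the standard Schur-complement identity, given here as an assumed form rather than the paper's exact expression, and all names are illustrative):

```python
import numpy as np

def project(x_t, support_set, K_inv, kernel):
    # k_t: kernel values between x_t and the support set
    k_t = np.array([kernel(x_i, x_t) for x_i in support_set])
    d_star = K_inv @ k_t                          # Equation (7): d* = (K_{t-1})^{-1} k_t
    delta_sq = kernel(x_t, x_t) - k_t @ d_star    # Equation (8): ||delta_t||^2
    return d_star, delta_sq

def grow_inverse(K_inv, d_star, delta_sq):
    # Inverse kernel matrix after appending x_t to the support set,
    # via the block-matrix (Schur complement) identity.
    n = K_inv.shape[0]
    out = np.zeros((n + 1, n + 1))
    out[:n, :n] = K_inv + np.outer(d_star, d_star) / delta_sq
    out[:n, n] = -d_star / delta_sq
    out[n, :n] = -d_star / delta_sq
    out[n, n] = 1.0 / delta_sq
    return out
```

With these two helpers, a mistake round either folds yt d⋆ into the existing coefficients (when ‖δt‖ ≤ η) or appends xt, sets αt = yt, and calls grow_inverse.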
49 The main idea is to bound the maximum number of mistakes of the algorithm, relative to any hypothesis g ∈ H , even chosen in hindsight. [sent-221, score-0.17]
50 First, we define the loss with a margin γ ∈ R of the hypothesis g on the example (xt, yt) as ℓγ(g(xt), yt) = max{0, γ − yt g(xt)}, (10) and we define the cumulative loss, Dγ, of g on the first T examples as Dγ = ∑_{t=1}^{T} ℓγ(g(xt), yt). [sent-222, score-1.041]
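These two quantities are straightforward to compute; a tiny illustrative helper (names are not from the paper):

```python
def margin_loss(g_xt, y_t, gamma=1.0):
    # l_gamma(g(x_t), y_t) = max(0, gamma - y_t * g(x_t)), Equation (10)
    return max(0.0, gamma - y_t * g_xt)

def cumulative_loss(predictions, labels, gamma=1.0):
    # D_gamma: sum of the margin losses over the first T examples
    return sum(margin_loss(g, y, gamma) for g, y in zip(predictions, labels))
```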
51 We will use its first statement to bound the scalar product between a projected sample and the competitor, and its second statement to derive the scalar product between the current hypothesis and the projected sample. [sent-224, score-0.268]
52 Theorem 3 Let (x1, y1), · · · , (xT, yT) be a sequence of instance-label pairs where xt ∈ X, yt ∈ {−1, +1}, and k(xt, xt) ≤ 1 for all t. [sent-231, score-0.794]
53 Proof Define the relative progress in each round as ∆t = ‖ft−1 − λg‖² − ‖ft − λg‖², where λ is a positive scalar to be optimized. [sent-234, score-0.534]
54 On rounds where there is a mistake there are two possible updates: either ft = ft−1 + yt Pt−1 k(xt , ·) or ft = ft−1 + yt k(xt , ·). [sent-237, score-1.573]
55 In particular we set q(·) = Pt−1 k(xt , ·) in Lemma 2 and use δt = yt Pt−1 k(xt , ·) − yt k(xt , ·) from Equation (4). [sent-239, score-0.472]
56 Let τt be an indicator function for a mistake on the t-th round, that is, τt is 1 if there is a mistake on round t and 0 otherwise. [sent-240, score-0.237]
57 We have ∆t = ‖ft−1 − λg‖² − ‖ft − λg‖² = 2τt yt ⟨λg − ft−1, Pt−1 k(xt, ·)⟩ − τt² ‖Pt−1 k(xt, ·)‖² ≥ τt (2λ − 2λℓ1(g(xt), yt) − τt ‖Pt−1 k(xt, ·)‖² − 2λ‖g‖ ‖δt‖ − 2yt ft−1(xt)). (11) [sent-241, score-0.952]
58 Moreover, on every projection update ‖δt‖ ≤ η, and ‖Pt−1 k(xt, ·)‖ ≤ 1 by the theorem's assumption, so we have ∆t ≥ τt (2λ − 2λℓ1(g(xt), yt) − τt − 2ηλ‖g‖ − 2yt ft−1(xt)). [sent-242, score-0.284]
59 We can further bound ∆t by noting that on every prediction mistake yt ft−1 (xt ) ≤ 0. [sent-243, score-0.365]
60 Overall we have ‖ft−1 − λg‖² − ‖ft − λg‖² ≥ τt (2λ − 2λℓ1(g(xt), yt) − τt − 2ηλ‖g‖). (12) [sent-244, score-0.716]
61 When there is an update without projection, similar reasoning yields ‖ft−1 − λg‖² − ‖ft − λg‖² ≥ τt (2λ − 2λℓ1(g(xt), yt) − τt); hence the bound in Equation (12) holds in both cases. [sent-245, score-0.754]
62 We refer to this latter case as a margin error, that is, 0 < yt ft−1 (xt ) < 1. [sent-263, score-0.261]
63 A possible solution to this obstacle is not to update on every round in which a margin error occurs, but only when there is a margin error and the new instance can be projected onto the support set. [sent-266, score-0.293]
64 Hence, the update on rounds in which there is a margin error would in general be of the form ft = ft−1 + yt τt Pt−1 k(xt, ·), with 0 < τt ≤ 1. [sent-267, score-0.802]
65 We can even expect solutions with a smaller support set, since new instances can be added to the support set only if misclassified, hence having fewer mistakes should result in a smaller support set. [sent-273, score-0.242]
66 Theorem 4 Let (x1 , y1 ), · · · , (xT , yT ) be a sequence of instance-label pairs where xt ∈ X , yt ∈ {−1, +1}, and k(xt , xt ) ≤ 1 for all t. [sent-276, score-0.794]
67 Proof The proof is similar to the proof of Theorem 3, where the difference is that during rounds in which there is a margin error we update the solution whenever it is possible to project ensuring an improvement of the mistake bound. [sent-279, score-0.186]
68 A sufficient condition to have βt positive is τt < 2 (ℓ1(ft−1(xt), yt) − ‖δt‖/η) / ‖Pt−1 k(xt, ·)‖². [sent-283, score-0.236]
69 In fact, removing the projection step and updating on rounds in which there is a margin error, with ft = ft−1 + yt τt k(xt, ·), we end up with the condition 0 < τt < min{2 ℓ1(ft−1(xt), yt)/k(xt, xt), 1}. [sent-295, score-0.812]
70 Experimentally we have observed that we obtain the best performance if the update is done with the following rule: τt = min{ℓ1(ft−1(xt), yt)/‖Pt−1 k(xt, ·)‖², 2 (ℓ1(ft−1(xt), yt) − ‖δt‖/η), 1}. [sent-299, score-0.256]
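On such rounds the update only rescales the projected direction by τt and never enlarges the support set; a minimal sketch (illustrative names; τt and d⋆ are assumed to have been computed as described above):

```python
def projectron_pp_margin_update(alphas, y_t, tau_t, d_star):
    # Margin-error round: scale the projected direction by tau_t in (0, 1];
    # no new vector is added, so the support set does not grow.
    for i in range(len(alphas)):
        alphas[i] += y_t * tau_t * d_star[i]
    return alphas
```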
71 We note in passing that the condition determining whether xt is projected onto Ht−1 on a margin error may be stated as ℓ1(ft−1(xt), yt) ≥ ‖δt‖/η. [sent-303, score-0.66]
72 The learning task is therefore defined as finding a function f : X × Y → R such that yt = arg max_{y∈Y} f(xt, y). [sent-320, score-0.236]
73 A kernel function in this setting should reflect the dependencies between the instances and the labels, hence we define the structured kernel function as a function on the domain of the instances and the labels, namely, kS : (X × Y )2 → R. [sent-322, score-0.161]
74 As in the binary classification algorithm presented earlier, the structured output online algorithm receives instances in a sequential order. [sent-328, score-0.194]
75 Upon receiving an instance, xt ∈ X , the algorithm predicts a label, yt′ , according to Equation (16). [sent-329, score-0.294]
76 After making its prediction, the algorithm receives the correct label, yt . [sent-330, score-0.251]
77 We define the loss suffered by the algorithm on round t for the example (xt, yt) as ℓS_γ(f, xt, yt) = max{0, γ − f(xt, yt) + max_{y′≠yt} f(xt, y′)}, and the cumulative loss DS_γ as DS_γ = ∑_{t=1}^{T} ℓS_γ(f, xt, yt). [sent-331, score-1.794]
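A sketch of structured prediction (Equation (16)) and of this structured margin loss, assuming a small finite label set that can be enumerated; the function names are illustrative, not the paper's code:

```python
def predict_structured(f, x_t, labels):
    # Equation (16): predicted label = argmax over y in Y of f(x_t, y)
    return max(labels, key=lambda y: f(x_t, y))

def structured_margin_loss(f, x_t, y_t, labels, gamma=1.0):
    # l^S_gamma(f, x_t, y_t) = max(0, gamma - f(x_t, y_t) + max_{y' != y_t} f(x_t, y'))
    best_wrong = max(f(x_t, y) for y in labels if y != y_t)
    return max(0.0, gamma - f(x_t, y_t) + best_wrong)
```

For structured tasks such as phoneme sequences the argmax is of course not a plain enumeration but a combinatorial search; the sketch only fixes the definitions.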
78 As in the binary case, on rounds in which there is a prediction mistake, yt′ ≠ yt, the algorithm updates the hypothesis ft−1 by adding k((xt, yt), ·) − k((xt, yt′), ·) or its projection. [sent-335, score-0.627]
79 When there is a margin mistake, 0 < ℓS_γ(ft−1, xt, yt) < γ, the algorithm updates the hypothesis ft−1 by adding τt Pt−1 (k((xt, yt), ·) − k((xt, yt′), ·)), where 0 < τt < 1 will be defined shortly. [sent-336, score-0.875]
80 Now, for the structured output case, δt is defined as δt = (k((xt, yt), ·) − k((xt, yt′), ·)) − Pt−1 (k((xt, yt), ·) − k((xt, yt′), ·)). [sent-337, score-0.511]
81 Theorem 6 Let (x1, y1), · · · , (xT, yT) be a sequence of instance-label pairs where xt ∈ X, yt ∈ Y, and ‖k((xt, y), ·)‖ ≤ 1/2 for all t and y ∈ Y. [sent-344, score-0.515]
82 As in Theorem 4 there is some freedom in the choice of τt, and again we set it by the rule used in the binary case, with ℓS_1(ft−1, xt, yt) in place of ℓ1(ft−1(xt), yt). [sent-346, score-0.515]
83 This simplifies the projection step; in fact, k((xt, yt), ·) can be projected only onto the functions in St−1 belonging to yt, the scalar product with the other functions being zero. [sent-349, score-0.62]
84 A conservative online algorithm is an algorithm that updates its hypothesis only on rounds on which it makes a prediction error. [sent-358, score-0.281]
85 Again we define two hypotheses: a temporary hypothesis ft′, which is the hypothesis of ALMA2 after its update rule, and a projected hypothesis, which is ft′ projected onto the space Ht−1 as defined in Equation (2). [sent-364, score-0.443]
86 The modified ALMA2 algorithm uses the projected hypothesis ft′′ whenever the projection error is smaller than a parameter η, otherwise it uses the temporary hypothesis ft′ . [sent-366, score-0.305]
87 We can state the following bound. Theorem 7 Let (x1, y1), · · · , (xT, yT) be a sequence of instance-label pairs where xt ∈ X, yt ∈ {−1, +1}, and k(xt, xt) ≤ 1 for all t. [sent-367, score-0.812]
88 Specifically, according to Lemma 2, one can replace the relation yt ⟨g, k(xt, ·)⟩ ≥ γ − ℓγ(g(xt), yt) with yt ⟨g, Pt−1 k(xt, ·)⟩ ≥ γ − η − ℓγ(g(xt), yt), and further substitute γ − η for γ. [sent-371, score-0.708]
89 Table 1: Data sets used in the experiments, including a9a (Platt, 1999), ijcnn1 (Prokhorov, 2001), and news20. [sent-378, score-0.236]
90 In the first set of experiments we compared the online average number of mistakes and the support set size of all algorithms. [sent-403, score-0.23]
91 Both Forgetron and RBP work by discarding vectors from the support set if its size reaches the budget size, B. [sent-404, score-0.178]
92 Figure 6: Average online error (left) and size of the support set (right) for the different algorithms on the a9a data set as a function of the number of training samples (better viewed in color). [sent-431, score-0.15]
93 Because there are no other bounded online algorithms with a mistake bound for the multiclass setting, we extended RBP to multiclass in the natural manner. [sent-478, score-0.235]
94 In particular we used the max-score update in Crammer and Singer (2003), for which a mistake bound exists, discarding a vector at random from the solution each time a new instance is added and the number of support vectors is equal to the budget size. [sent-479, score-0.275]
95 Figure 8: Average online error for the different algorithms as a function of the size of the support set on different binary data sets. [sent-523, score-0.15]
96 Figure 9: Average online error for the different algorithms as a function of the size of the support set on different multiclass data sets. [sent-546, score-0.205]
97 Figure 10: Average online error (a) and test error (b) for the different algorithms as a function of the size of the support set on a subset of the timit data set. [sent-571, score-0.185]
98 Although the size of the support set cannot be determined before training, in practice, for a given target accuracy, the support sets of Projectron and Projectron++ are much smaller than those of other budget algorithms such as Forgetron and RBP. [sent-577, score-0.175]
99 The experimental results suggest that Projectron++ outperforms other online bounded algorithms such as Forgetron and RBP, with a similar hypothesis size. [sent-581, score-0.209]
100 Second, a standard online-to-batch conversion can be applied to the online bounded solution of these algorithms, resulting in a bounded batch solution. [sent-584, score-0.181]
wordName wordTfidf (topN-words)
[('projectron', 0.643), ('ft', 0.48), ('xt', 0.279), ('yt', 0.236), ('perceptron', 0.165), ('rbp', 0.138), ('forgetron', 0.137), ('st', 0.123), ('mistake', 0.098), ('online', 0.092), ('projected', 0.089), ('kt', 0.088), ('aputo', 0.083), ('eshet', 0.083), ('rabona', 0.083), ('mistakes', 0.08), ('pt', 0.08), ('ounded', 0.076), ('hypothesis', 0.072), ('ht', 0.068), ('budget', 0.059), ('multiclass', 0.055), ('crammer', 0.052), ('nline', 0.049), ('rounds', 0.043), ('support', 0.043), ('mrbp', 0.041), ('round', 0.041), ('ernel', 0.04), ('structured', 0.039), ('orabona', 0.035), ('timit', 0.035), ('instances', 0.033), ('onto', 0.031), ('temporary', 0.029), ('projection', 0.028), ('kernel', 0.028), ('csat', 0.028), ('phoneme', 0.028), ('usps', 0.028), ('bounded', 0.027), ('spanned', 0.026), ('pa', 0.026), ('dekel', 0.025), ('margin', 0.025), ('conversion', 0.024), ('rkhs', 0.024), ('gentile', 0.022), ('plain', 0.022), ('vehicle', 0.021), ('ktt', 0.021), ('lemel', 0.021), ('receive', 0.02), ('update', 0.02), ('instance', 0.019), ('earning', 0.019), ('conservative', 0.019), ('outperforms', 0.018), ('bound', 0.018), ('discarding', 0.018), ('conconi', 0.018), ('keshet', 0.018), ('ds', 0.016), ('hypotheses', 0.016), ('engel', 0.016), ('memory', 0.016), ('else', 0.015), ('sparseness', 0.015), ('size', 0.015), ('label', 0.015), ('algorithm', 0.015), ('synthetic', 0.014), ('equation', 0.014), ('caputo', 0.014), ('cauwenberghs', 0.014), ('competitor', 0.014), ('dogma', 0.014), ('downs', 0.014), ('duarte', 0.014), ('francesco', 0.014), ('utterances', 0.014), ('di', 0.014), ('corpus', 0.013), ('kivinen', 0.013), ('theorem', 0.013), ('incremental', 0.013), ('growth', 0.013), ('progress', 0.013), ('opper', 0.013), ('ch', 0.013), ('prediction', 0.013), ('bounding', 0.012), ('gaussian', 0.012), ('updates', 0.012), ('cucker', 0.012), ('mnist', 0.012), ('speakers', 0.012), ('dj', 0.012), ('packing', 0.012), ('batch', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
2 0.42952073 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
Author: Ting Hu, Ding-Xuan Zhou
Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space
3 0.31904858 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
4 0.20745072 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
5 0.13727626 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
6 0.12397009 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
7 0.11496948 90 jmlr-2009-Structure Spaces
8 0.084763832 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
9 0.075868703 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
10 0.070302457 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
11 0.060672496 67 jmlr-2009-Online Learning with Sample Path Constraints
12 0.059562393 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
13 0.056274295 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
14 0.046863705 50 jmlr-2009-Learning When Concepts Abound
15 0.04359987 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
16 0.043488927 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
17 0.040118594 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
18 0.036790136 23 jmlr-2009-Discriminative Learning Under Covariate Shift
19 0.036069427 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.034261011 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
topicId topicWeight
[(0, 0.314), (1, 0.458), (2, -0.323), (3, 0.149), (4, -0.118), (5, -0.142), (6, 0.116), (7, 0.112), (8, 0.142), (9, 0.075), (10, -0.098), (11, -0.031), (12, -0.008), (13, -0.018), (14, 0.027), (15, 0.076), (16, -0.078), (17, -0.054), (18, 0.02), (19, 0.033), (20, -0.062), (21, -0.032), (22, -0.065), (23, 0.017), (24, -0.008), (25, -0.082), (26, -0.018), (27, 0.008), (28, -0.005), (29, 0.019), (30, -0.006), (31, 0.025), (32, -0.022), (33, 0.038), (34, -0.01), (35, -0.035), (36, 0.035), (37, -0.117), (38, -0.016), (39, -0.067), (40, 0.011), (41, 0.08), (42, -0.012), (43, 0.019), (44, -0.017), (45, -0.073), (46, 0.054), (47, -0.04), (48, -0.002), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.96799076 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
2 0.79937571 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
Author: Ting Hu, Ding-Xuan Zhou
Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space
3 0.72142106 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
4 0.48686823 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
5 0.46819142 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
6 0.32196116 90 jmlr-2009-Structure Spaces
7 0.27288374 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
8 0.27099597 67 jmlr-2009-Online Learning with Sample Path Constraints
9 0.25927556 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
10 0.22826096 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
11 0.21321781 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
12 0.19921547 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
13 0.19548237 16 jmlr-2009-Classification with Gaussians and Convex Loss
14 0.17621608 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
15 0.16264924 50 jmlr-2009-Learning When Concepts Abound
16 0.15061617 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
17 0.14975619 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
18 0.13158031 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
19 0.12476692 7 jmlr-2009-An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs
20 0.12187596 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
topicId topicWeight
[(8, 0.011), (11, 0.416), (19, 0.015), (38, 0.023), (47, 0.01), (52, 0.068), (55, 0.068), (58, 0.015), (66, 0.127), (68, 0.027), (90, 0.049), (96, 0.052)]
simIndex simValue paperId paperTitle
1 0.97153264 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
Author: Mathias Drton, Michael Eichler, Thomas S. Richardson
Abstract: In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models’ long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of ‘bow-free’ recursive linear models. The term ‘bow-free’ refers to the condition that the errors for variables i and j be uncorrelated if variable i occurs in the structural equation for variable j. We introduce a new algorithm, termed Residual Iterative Conditional Fitting (RICF), that can be implemented using only least squares computations. In contrast to existing algorithms, RICF has clear convergence properties and yields exact maximum likelihood estimates after the first iteration whenever the MLE is available in closed form. Keywords: linear regression, maximum likelihood estimation, path diagram, structural equation model, recursive semi-Markov model, residual iterative conditional fitting
2 0.94257635 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification
Author: Raanan Yehezkel, Boaz Lerner
Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning
3 0.8123433 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition characterizing this class of matrix regularizers and we highlight some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion of matrix nondecreasing functions. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, regularization
same-paper 4 0.80669701 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
5 0.65943855 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors (Special Topic on Causality)
Author: Changsung Kang, Jin Tian
Abstract: A linear causal model with correlated errors, represented by a DAG with bi-directed edges, can be tested by the set of conditional independence relations implied by the model. A global Markov property specifies, by the d-separation criterion, the set of all conditional independence relations holding in any model associated with a graph. A local Markov property specifies a much smaller set of conditional independence relations which will imply all other conditional independence relations which hold under the global Markov property. For DAGs with bi-directed edges associated with arbitrary probability distributions, a local Markov property is given in Richardson (2003) which may invoke an exponential number of conditional independencies. In this paper, we show that for a class of linear structural equation models with correlated errors, there is a local Markov property which will invoke only a linear number of conditional independence relations. For general linear models, we provide a local Markov property that often invokes far fewer conditional independencies than that in Richardson (2003). The results have applications in testing linear structural equation models with correlated errors. Keywords: Markov properties, linear causal models, linear structural equation models, graphical models
7 0.60146511 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs
8 0.58358288 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.5075438 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
10 0.50497746 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
11 0.50080371 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
12 0.49957883 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
13 0.49674869 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
14 0.48865157 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
15 0.48361075 78 jmlr-2009-Refinement of Reproducing Kernels
16 0.48299205 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
17 0.47922024 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
18 0.47352833 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
19 0.47292188 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
20 0.4635064 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning