jmlr jmlr2006 jmlr2006-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
Reference: text
sentIndex sentText sentNum sentScore
1 DTI, Università di Milano, via Bramante 65, 26013 Crema, Italy. Editor: Manfred Warmuth. Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. [sent-7, score-0.482]
2 In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. [sent-8, score-0.43]
3 The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. [sent-12, score-0.44]
4 Natural real-world scenarios for selective sampling are all those applications where labels are scarce or expensive to obtain. [sent-21, score-0.407]
5 An additional motivation for using selective sampling arises from the widespread use of kernel-based algorithms (Vapnik, 1998; Cristianini and Shawe-Taylor, 2001; Schölkopf and Smola, 2002). [sent-27, score-0.378]
6 In this paper we present a mistake bound analysis for selective sampling versions of Perceptron-like algorithms. [sent-38, score-0.483]
7 Our selective sampling algorithms use a simple randomized rule to decide whether to query the label of the current instance. [sent-46, score-0.506]
8 For each of the algorithms we consider a bound is proven on the expected number of mistakes made in an arbitrary data sequence, where the expectation is with respect to the randomized sampling rule. [sent-52, score-0.285]
9 In all algorithms we analyze, a proper choice of the scaling factor b in the randomized rule yields the same mistake bound as the one achieved by the original algorithm before the introduction of the selective sampling mechanism. [sent-54, score-0.532]
10 In Section 2 we describe and analyze our Perceptron-like selective sampling algorithms. [sent-61, score-0.378]
11 We consider the following selective sampling variant of the standard on-line learning model (Angluin, 1988; Littlestone, 1988). [sent-66, score-0.378]
12 In the generic trial $t$ the algorithm observes instance $x_t$, outputs a prediction $\hat y_t \in \{-1, +1\}$ for the label $y_t$ associated with $x_t$, and decides whether or not to ask for the label $y_t$. [sent-68, score-1.208]
13 No matter what the algorithm decides, we say that the algorithm has made a prediction mistake if $\hat y_t \neq y_t$. [sent-69, score-0.484]
14 We measure the performance of a linear-threshold algorithm by the total number of mistakes it makes on a sequence of examples (including the trials where the true label $y_t$ remains unknown). [sent-70, score-0.317]
15 In this paper we are concerned with selective sampling versions of linear-threshold algorithms. [sent-72, score-0.378]
16 of weight vectors wt ∈ Rd , where wt can only depend on the past examples (x1 , y1 ), . [sent-79, score-0.631]
17 the linear-threshold algorithm predicts $y_t$ using $\hat y_t = \mathrm{SGN}(\hat p_t)$, where $\hat p_t = w_{t-1}^\top x_t$ is the margin of $w_{t-1}$ on the instance $x_t$. [sent-86, score-1.337]
18 If the label yt is queried, then the algorithm (possibly) uses yt to compute a new weight wt ; on the other hand, if yt remains unknown then wt = wt−1 . [sent-87, score-1.261]
19 , $(x_n, y_n)$ of examples and a given margin threshold $\gamma > 0$, we measure the performance of $u$ by its cumulative hinge loss (Freund and Schapire, 1999; Gentile and Warmuth, 1999) $L_{\gamma,n}(u) = \sum_{t=1}^n \ell_{\gamma,t}(u) = \sum_{t=1}^n (\gamma - y_t u^\top x_t)_+$, where we used the notation $(x)_+ = \max\{0, x\}$. [sent-92, score-0.455]
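To make the comparator's loss concrete, the cumulative hinge loss above can be computed as in the following sketch (Python/NumPy; the function and variable names are ours, not the paper's):

```python
import numpy as np

def cumulative_hinge_loss(u, X, y, gamma):
    """Cumulative hinge loss L_{gamma,n}(u) = sum_t (gamma - y_t u^T x_t)_+.

    X: (n, d) array of instances x_t, y: array of labels in {-1, +1},
    gamma: margin threshold (> 0), u: fixed comparison vector in R^d.
    """
    margins = y * (X @ u)                      # y_t u^T x_t for each trial t
    return np.maximum(0.0, gamma - margins).sum()
```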
20 In other words, the source cannot use the knowledge of Zt to determine xt and yt . [sent-99, score-0.39]
21 , Zt−1 ] and Mt to denote the indicator function of the event yt = yt , where yt is the prediction at time t of the algorithm under consideration. [sent-103, score-0.595]
22 Figure 1: A selective sampling version of the classical Perceptron algorithm. [sent-116, score-0.378]
23 1 Selective Sampling Perceptron Our selective sampling variant of the classical Perceptron algorithm is described in Figure 1. [sent-134, score-0.378]
24 In each trial $t$ the algorithm observes an instance vector $x_t \in \mathbb{R}^d$ and predicts the binary label $y_t$ through the sign of the margin value $\hat p_t = w_{t-1}^\top x_t$. [sent-136, score-1.016]
25 Then the algorithm decides whether to query the label $y_t$ through the randomized rule described in the introduction: a coin with bias $b/(b + |\hat p_t|)$ is flipped; if the coin turns up heads ($Z_t = 1$ in Figure 1), then the label $y_t$ is queried. [sent-137, score-0.902]
26 If a prediction mistake is observed ($\hat y_t \neq y_t$), then the algorithm updates vector $w_t$ according to the usual Perceptron additive rule. [sent-138, score-0.631]
27 On the other hand, if either the coin turns up tails or $\hat y_t = y_t$ ($M_t = 0$ in Figure 1), then no update takes place. [sent-139, score-0.44]
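A minimal sketch of the algorithm of Figure 1 is given below (Python/NumPy; function and variable names are ours). The true label is used here only to count mistakes for evaluation; the algorithm itself sees $y_t$ only on queried trials.

```python
import numpy as np

def selective_sampling_perceptron(examples, b, d, seed=0):
    """Selective sampling Perceptron (a sketch of Figure 1).

    examples: iterable of (x_t, y_t) with x_t in R^d and y_t in {-1, +1}.
    b > 0: sampling scale; the label is queried with probability b / (b + |p_t|).
    Returns (number of mistakes, number of queried labels).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    mistakes = queries = 0
    for x, y in examples:
        p = w @ x                            # margin of w_{t-1} on x_t
        y_hat = 1.0 if p >= 0 else -1.0      # prediction SGN(p_t)
        if y_hat != y:
            mistakes += 1                    # counted even when y_t is not queried
        if rng.random() < b / (b + abs(p)):  # coin Z_t with bias b / (b + |p_t|)
            queries += 1
            if y_hat != y:                   # M_t Z_t = 1: standard additive update
                w = w + y * x
        # otherwise (tails, or no mistake) w is left unchanged
    return mistakes, queries
```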
28 The following theorem shows that our selective sampling Perceptron can achieve, in expectation, the same mistake bound as the standard Perceptron’s, but using fewer labels. [sent-140, score-0.5]
29 Furthermore, the expected number of labels queried by the algorithm equals $\sum_{t=1}^n \mathbb{E}\bigl[\frac{b}{b + |\hat p_t|}\bigr]$. [sent-148, score-0.346]
30 However, since b is an input parameter of the selective sampling algorithm, the above setting implies that, at the beginning of the prediction process, the algorithm needs some extra information on the sequence of examples. [sent-157, score-0.418]
31 This is due to the fact that our simple proof produces a mistake bound where the constant ruling the (natural) trade-off between hinge loss term and margin term is directly related to the label sampling rate. [sent-160, score-0.403]
32 We can write $\gamma - \ell_{\gamma,t}(u) = \gamma - (\gamma - y_t u^\top x_t)_+ \le y_t u^\top x_t = y_t (u - w_{t-1} + w_{t-1})^\top x_t = y_t w_{t-1}^\top x_t + \frac{1}{2}\|u - w_{t-1}\|^2 - \frac{1}{2}\|u - w_t\|^2 + \frac{1}{2}\|w_{t-1} - w_t\|^2 = y_t \hat p_t + \frac{1}{2}\|u - w_{t-1}\|^2 - \frac{1}{2}\|u - w_t\|^2 + \frac{1}{2}\|w_{t-1} - w_t\|^2$. [sent-180, score-3.234]
33 Rearranging, and using $y_t \hat p_t \le 0$ implied by $M_t = 1$, yields $\alpha\gamma + |\hat p_t| \le \alpha\,\ell_{\gamma,t}(u) + \frac{1}{2}\|\alpha u - w_{t-1}\|^2 - \frac{1}{2}\|\alpha u - w_t\|^2 + \frac{1}{2}\|w_{t-1} - w_t\|^2$. [sent-182, score-1.314]
34 If t is such that Mt Zt = 0 then no update occurs and wt = wt−1 . [sent-185, score-0.328]
35 Hence we conclude that, for any trial $t$, $M_t Z_t (\alpha\gamma + |\hat p_t|) \le M_t Z_t\,\alpha\,\ell_{\gamma,t}(u) + \frac{1}{2}\|\alpha u - w_{t-1}\|^2 - \frac{1}{2}\|\alpha u - w_t\|^2 + \frac{M_t Z_t}{2}\|w_{t-1} - w_t\|^2$. We now sum the above over $t$, use $M_t Z_t\,\|w_{t-1} - w_t\|^2 \le X^2$, and recall that $w_0 = 0$, obtaining $\sum_{t=1}^n M_t Z_t \bigl(\alpha\gamma + |\hat p_t| - \frac{X^2}{2}\bigr) \le \alpha\sum_{t=1}^n M_t Z_t\,\ell_{\gamma,t}(u) + \frac{\alpha^2}{2}\|u\|^2$. [sent-186, score-1.484]
36 Here, however, we have added the random variable Zt , associated with the selective sampling, and kept the term | pt |. [sent-192, score-0.45]
37 Note that this term also appears in the conditional expectation of Zt , since we have defined Et−1 Zt as b/(b + | pt |). [sent-193, score-0.258]
38 On the left-hand side we obtain $\mathbb{E}\bigl[\sum_{t=1}^n M_t Z_t (b + |\hat p_t|)\bigr] = \mathbb{E}\bigl[\sum_{t=1}^n M_t (b + |\hat p_t|)\,\mathbb{E}_{t-1} Z_t\bigr] = \mathbb{E}\bigl[\sum_{t=1}^n b\, M_t\bigr]$, where the first equality is proven by observing that $M_t$ and $\hat p_t$ are determined by $Z_1, \ldots, Z_{t-1}$. [sent-195, score-0.746]
39 Our adaptive version of the selective sampling Perceptron algorithm is described in Figure 2. [sent-215, score-0.401]
40 However the reader should not conclude from this observation that the label rate bt−1 /(bt−1 + | pt |) converges to 1 as t → ∞, since bt does not scale with time t but with the number of updates made up to time t, which can be far smaller than t. [sent-221, score-0.534]
41 At the same time, the margin | pt | might have an erratic behavior whose oscillations can also grow with the number of updates. [sent-222, score-0.285]
42 (If $Z_t = 1$, $M_t = 1$) then update $w_t = w_{t-1} + y_t x_t$, $K_t = K_{t-1} + 1$, $X_t = X'$; (6) else ($Z_t = 0$) set $w_t = w_{t-1}$, $K_t = K_{t-1}$, $X_t = X_{t-1}$. [sent-235, score-1.027]
43 Figure 2: Adaptive parameter version of the selective sampling Perceptron algorithm. [sent-236, score-0.392]
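The sketch below mirrors the structure of Figure 2 (Python/NumPy, our naming). The exact expression for the adaptive scale is not fully recoverable from the text, so the choice $b_{t-1} = \beta\, X_{t-1}^2\sqrt{1+K_{t-1}}$ used here is an assumption consistent with the $\sqrt{1+K_{t-1}}$ scaling discussed above, not a verbatim copy of the figure.

```python
import numpy as np

def adaptive_selective_perceptron(examples, beta, d, seed=0):
    """Adaptive-parameter selective sampling Perceptron (sketch of Figure 2).

    K counts the updates made so far and Xmax tracks the largest instance
    norm seen on update trials; b = beta * Xmax**2 * sqrt(1 + K) is an
    assumed form of the adaptive scale, not the exact one in the figure.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    K = 0
    Xmax = 1.0        # initialization of the norm bound is also an assumption
    mistakes = queries = 0
    for x, y in examples:
        b = beta * Xmax**2 * np.sqrt(1.0 + K)
        p = w @ x
        y_hat = 1.0 if p >= 0 else -1.0
        if y_hat != y:
            mistakes += 1
        if rng.random() < b / (b + abs(p)):   # query with rate b_{t-1}/(b_{t-1}+|p_t|)
            queries += 1
            if y_hat != y:                    # update w, K and the norm bound
                w = w + y * x
                K += 1
                Xmax = max(Xmax, float(np.linalg.norm(x)))
    return mistakes, queries
```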
44 Moreover, the expected number of labels queried by the algorithm equals $\sum_{t=1}^n \mathbb{E}\bigl[\frac{b_{t-1}}{b_{t-1} + |\hat p_t|}\bigr]$. [sent-244, score-0.346]
45 This yields $M_t Z_t (c_{t-1} + |\hat p_t|) \le M_t Z_t\,\frac{c_{t-1}}{\gamma}\,\ell_{\gamma,t}(u) + \frac{1}{2}\bigl\|\frac{c_{t-1}}{\gamma}u - w_{t-1}\bigr\|^2 - \frac{1}{2}\bigl\|\frac{c_{t-1}}{\gamma}u - w_t\bigr\|^2 + \frac{M_t Z_t}{2}\|w_{t-1} - w_t\|^2$. From the update rule in Figure 2 we have $\frac{M_t Z_t}{2}\|w_{t-1} - w_t\|^2 \le \frac{M_t Z_t}{2}\|x_t\|^2$, and we divide by $b_{t-1}$. [sent-261, score-0.6]
46 This yields $M_t Z_t\,\frac{c_{t-1} + |\hat p_t| - \|x_t\|^2/2}{b_{t-1}} \le M_t Z_t\,\frac{c_{t-1}}{b_{t-1}\gamma}\,\ell_{\gamma,t}(u) + \frac{1}{2 b_{t-1}}\bigl\|\frac{c_{t-1}}{\gamma}u - w_{t-1}\bigr\|^2 - \frac{1}{2 b_{t-1}}\bigl\|\frac{c_{t-1}}{\gamma}u - w_t\bigr\|^2$ (4). [sent-262, score-1.268]
47 We rearrange ct−1 ℓγ,t (u) bt−1 γ 1 2 bt−1 2 ct−1 u − wt−1 γ − 2 ct−1 u − wt γ . [sent-264, score-0.309]
48 We now transform the difference of squared norms in (4) into a pair of telescoping differences, $\frac{1}{2 b_{t-1}}\Bigl(\bigl\|\frac{c_{t-1}}{\gamma}u - w_{t-1}\bigr\|^2 - \bigl\|\frac{c_{t-1}}{\gamma}u - w_t\bigr\|^2\Bigr) = \Bigl(\frac{1}{2 b_{t-1}}\bigl\|\frac{c_{t-1}}{\gamma}u - w_{t-1}\bigr\|^2 - \frac{1}{2 b_t}\bigl\|\frac{c_t}{\gamma}u - w_t\bigr\|^2\Bigr) + \Bigl(\frac{1}{2 b_t}\bigl\|\frac{c_t}{\gamma}u - w_t\bigr\|^2 - \frac{1}{2 b_{t-1}}\bigl\|\frac{c_{t-1}}{\gamma}u - w_t\bigr\|^2\Bigr)$ (5). [sent-265, score-1.87]
49 Recall now the standard way of bounding the norm of a Perceptron weight vector in terms of the number of updates, $\|w_t\|^2 = \|w_{t-1}\|^2 + 2 M_t Z_t\, y_t w_{t-1}^\top x_t + M_t Z_t \|x_t\|^2 \le \|w_{t-1}\|^2 + M_t Z_t \|x_t\|^2 \le \|w_{t-1}\|^2 + M_t Z_t X^2$, which, combined with $w_0 = 0$, implies $\|w_t\| \le X\sqrt{K_t}$ (7) for any $t$. [sent-267, score-1.413]
50 Applying inequality (7) to (6) yields $(5) \le \frac{\|u\|^2}{2\gamma^2}\Bigl(\frac{c_t^2}{b_t} - \frac{c_{t-1}^2}{b_{t-1}}\Bigr) + \frac{\|u\| X}{\gamma}\sqrt{K_t}\Bigl(\frac{c_{t-1}}{b_{t-1}} - \frac{c_t}{b_t}\Bigr)$. [sent-268, score-0.534]
51 Thus we can write $\sqrt{K_t}\Bigl(\frac{c_{t-1}}{b_{t-1}} - \frac{c_t}{b_t}\Bigr) = \frac{\sqrt{K_t}}{2\beta}\Bigl(\frac{1}{\sqrt{1+K_{t-1}}} - \frac{1}{\sqrt{1+K_t}}\Bigr) = \frac{\sqrt{K_t}}{2\beta}\Bigl(\frac{1}{\sqrt{K_t}} - \frac{1}{\sqrt{1+K_t}}\Bigr) = \frac{1}{2\beta}\,\frac{\sqrt{1+K_t} - \sqrt{K_t}}{\sqrt{1+K_t}} \le \frac{1}{4\beta}\,\frac{1}{\sqrt{K_t}\sqrt{1+K_t}} \le \frac{1}{4\beta K_t}$ (using $\sqrt{1+x} - \sqrt{x} \le \frac{1}{2\sqrt{x}}$). [sent-271, score-0.317]
52 On the other hand, if $M_t Z_t = 0$ we have $b_t = b_{t-1}$ and $c_t = c_{t-1}$. [sent-272, score-0.317]
53 Hence, for any trial $t$, we obtain $\sqrt{K_t}\bigl(\frac{c_{t-1}}{b_{t-1}} - \frac{c_t}{b_t}\bigr) \le \frac{M_t Z_t}{4\beta K_t}$. Putting together as in (4) and using $c_{t-1} - \|x_t\|^2/2 \ge b_{t-1}$ on the left-hand side yields [sent-273, score-0.826]
54 $M_t Z_t\,\frac{b_{t-1} + |\hat p_t|}{b_{t-1}} \le M_t Z_t\,\frac{c_{t-1}}{b_{t-1}\gamma}\,\ell_{\gamma,t}(u) + \frac{1}{2 b_{t-1}}\bigl\|\frac{c_{t-1}}{\gamma}u - w_{t-1}\bigr\|^2 - \frac{1}{2 b_t}\bigl\|\frac{c_t}{\gamma}u - w_t\bigr\|^2 + \frac{\|u\|^2}{2\gamma^2}\Bigl(\frac{c_t^2}{b_t} - \frac{c_{t-1}^2}{b_{t-1}}\Bigr) + \frac{\|u\| X}{\gamma}\,\frac{M_t Z_t}{4\beta K_t}$, holding for any trial $t$, any $u \in \mathbb{R}^d$, and any $\gamma > 0$. [sent-274, score-0.728]
55 , $n$, use $w_0 = 0$, and simplify: $\sum_{t=1}^n M_t Z_t\,\frac{b_{t-1} + |\hat p_t|}{b_{t-1}} \le \frac{1}{\gamma}\sum_{t=1}^n M_t Z_t\,\frac{c_{t-1}}{b_{t-1}}\,\ell_{\gamma,t}(u)$ (9) $\;+\; \frac{c_n^2\|u\|^2}{2\gamma^2 b_n} - \frac{1}{2 b_n}\bigl\|\frac{c_n}{\gamma}u - w_n\bigr\|^2$ (10) $\;+\; \frac{\|u\| X}{4\beta\gamma}\sum_{t=1}^n \frac{M_t Z_t}{K_t}$. [sent-278, score-0.328]
56 For (10) we obtain $\frac{c_n^2\|u\|^2}{2\gamma^2 b_n} - \frac{1}{2 b_n}\bigl\|\frac{c_n}{\gamma}u - w_n\bigr\|^2 = \frac{c_n\, u^\top w_n}{b_n \gamma} - \frac{\|w_n\|^2}{2 b_n} \le \frac{c_n\, u^\top w_n}{b_n \gamma} \le \frac{c_n\,\|u\|\,\|w_n\|}{b_n \gamma} \le \Bigl(1 + \frac{1}{2\beta\sqrt{1+K_n}}\Bigr)\frac{\|u\| X \sqrt{K_n}}{\gamma}$, where in the last step we used (7). [sent-281, score-0.278]
57 Using these expressions to bound the right-hand side of (9) yields $\sum_{t=1}^n M_t Z_t\,\frac{b_{t-1} + |\hat p_t|}{b_{t-1}} \le \frac{1}{\gamma}\sum_{t=1}^n M_t Z_t\,\ell_{\gamma,t}(u) + \frac{1}{2\beta}\Bigl(1 + \frac{\|u\| X}{\gamma}\Bigr)\sum_{t=1}^n \frac{M_t Z_t}{\sqrt{1+K_{t-1}}} + \Bigl(1 + \frac{1}{2\beta\sqrt{1+K_n}}\Bigr)\frac{\|u\| X \sqrt{K_n}}{\gamma} + \frac{\|u\| X}{4\beta\gamma}\sum_{t=1}^n \frac{M_t Z_t}{K_t}$ (11). [sent-282, score-0.28]
58 As in the proof of Theorem 1, since $\mathbb{E}_{t-1} Z_t = \frac{b_{t-1}}{b_{t-1} + |\hat p_t|}$ and both $M_t$ and $b_{t-1}$ are measurable with respect to the σ-algebra generated by $Z_1, \ldots, Z_{t-1}$. [sent-289, score-0.257]
59 3 Selective Sampling Second-Order Perceptron We now consider a selective sampling version of the second-order Perceptron algorithm introduced by Cesa-Bianchi et al. [sent-303, score-0.378]
60 In trial $t$, instead of using the sign of $v_{t-1}^\top x_t$ to predict the current instance $x_t$, the second-order algorithm predicts through the sign of the margin $\hat p_t = (M^{-1/2} v_{t-1})^\top (M^{-1/2} x_t) = v_{t-1}^\top M^{-1} x_t$. [sent-307, score-1.166]
61 Here $M = I + \sum_s x_s x_s^\top + x_t x_t^\top$ is a (full-rank) positive definite matrix, where $I$ is the $d \times d$ identity matrix, and the sum $\sum_s x_s x_s^\top$ runs over the mistaken trials $s$ up to time $t-1$. [sent-308, score-0.31]
62 As before, the comparison between the second-order Perceptron’s bound and the one contained in Theorem 3 reveals that the selective sampling algorithm can achieve, in expectation, the same mistake bound using fewer labels. [sent-321, score-0.505]
63 The bound involves the eigenvalues $1+\lambda_i$ of $A_n = I + \sum_{t=1}^n M_t Z_t\, x_t x_t^\top$; moreover, the expected number of labels queried by the algorithm equals $\sum_{t=1}^n \mathbb{E}\bigl[\frac{b}{b + |\hat p_t|}\bigr]$. [sent-329, score-0.738]
64 Figure 3: A selective sampling version of the second-order Perceptron algorithm. [sent-344, score-0.378]
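For concreteness, here is a sketch of the second-order variant of Figure 3 (Python/NumPy, our naming). The prediction follows the margin described above; the update on queried mistaken trials ($v \leftarrow v + y_t x_t$ and the correlation matrix absorbing $x_t x_t^\top$) is our reading of the figure, taken from the second-order Perceptron it is built on.

```python
import numpy as np

def selective_second_order_perceptron(examples, b, d, seed=0):
    """Selective sampling second-order Perceptron (a sketch of Figure 3).

    A = I + sum of x_s x_s^T over past queried mistaken trials; the margin
    on trial t is p_t = v^T (A + x_t x_t^T)^{-1} x_t, as in the text.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(d)
    A = np.eye(d)
    mistakes = queries = 0
    for x, y in examples:
        M = A + np.outer(x, x)               # current instance included
        p = v @ np.linalg.solve(M, x)        # margin v^T M^{-1} x_t
        y_hat = 1.0 if p >= 0 else -1.0
        if y_hat != y:
            mistakes += 1
        if rng.random() < b / (b + abs(p)):  # same randomized query rule as before
            queries += 1
            if y_hat != y:                   # assumed update on queried mistakes
                v = v + y * x
                A = M                        # i.e., A <- A + x x^T
    return mistakes, queries
```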
65 Indeed, we now show that the algorithm incurs on each mistaken trial a square loss $(y_t - \hat p_t)^2$ bounded by the difference $\inf_v \Phi_{t+1}(v) - \inf_v \Phi_t(v)$ plus a quadratic term involving $A_{t-1}$. [sent-352, score-0.628]
66 Hence the equality $\frac{M_t Z_t}{2}(\hat p_t - y_t)^2 = \inf_v \Phi_{t+1}(v) - \inf_v \Phi_t(v) + \frac{M_t Z_t}{2}\, x_t^\top A_t^{-1} x_t - \frac{M_t Z_t}{2}\,\hat p_t^2\, x_t^\top A_{t-1}^{-1} x_t$ holds for all trials $t$. [sent-357, score-1.271]
67 Observing that $\inf_v \Phi_1(v) = 0$, we obtain $\frac{1}{2}\sum_{t=1}^n M_t Z_t (\hat p_t - y_t)^2 \le \inf_v \Phi_{n+1}(v) - \inf_v \Phi_1(v) + \frac{1}{2}\sum_{t=1}^n M_t Z_t\, x_t^\top A_t^{-1} x_t \le \Phi_{n+1}(u) + \frac{1}{2}\sum_{t=1}^n M_t Z_t\, x_t^\top A_t^{-1} x_t \le \frac{1}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^n M_t Z_t\,(u^\top x_t - y_t)^2 + \frac{1}{2}\sum_{t=1}^n M_t Z_t\, x_t^\top A_t^{-1} x_t$, holding for any $u \in \mathbb{R}^d$. [sent-362, score-1.501]
68 Expanding the squares and performing trivial simplifications we arrive at the following inequality: $\frac{1}{2}\sum_{t=1}^n M_t Z_t\bigl(\hat p_t^2 - 2 y_t \hat p_t\bigr) \le \frac{1}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^n M_t Z_t\,(u^\top x_t)^2 - \sum_{t=1}^n M_t Z_t\, y_t u^\top x_t + \frac{1}{2}\sum_{t=1}^n M_t Z_t\, x_t^\top A_t^{-1} x_t$. [sent-363, score-1.026]
69 For the first term we have $\frac{1}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^n M_t Z_t\,(u^\top x_t)^2 = \frac{1}{2}\, u^\top\Bigl(I + \sum_{t=1}^n M_t Z_t\, x_t x_t^\top\Bigr) u = \frac{1}{2}\, u^\top A_n u$. [sent-366, score-0.392]
70 , 2005), $\frac{1}{2}\sum_{t=1}^n M_t Z_t\, x_t^\top A_t^{-1} x_t = \frac{1}{2}\sum_{t=1}^n\Bigl(1 - \frac{|A_{t-1}|}{|A_t|}\Bigr) \le \frac{1}{2}\sum_{t=1}^n \ln\frac{|A_t|}{|A_{t-1}|} = \frac{1}{2}\ln\frac{|A_n|}{|A_0|} = \frac{1}{2}\ln|A_n| = \frac{1}{2}\sum_{i=1}^d \ln(1+\lambda_i)$, where we recall that $1+\lambda_i$ is the $i$-th eigenvalue of $A_n$. [sent-370, score-0.34]
71 Replacing back, observing that $y_t \hat p_t \le 0$ (hence $-y_t \hat p_t = |\hat p_t|$) whenever $M_t = 1$, dropping the term involving $\hat p_t^2$, and rearranging yields $\sum_{t=1}^n M_t Z_t\bigl(|\hat p_t| + y_t u^\top x_t\bigr) \le \frac{1}{2}\, u^\top A_n u + \frac{1}{2}\sum_{i=1}^d \ln(1+\lambda_i)$. [sent-371, score-0.906]
72 Recalling that $\mathbb{E}_{t-1} Z_t = b/(b + |\hat p_t|)$, and proceeding similarly to the proof of Theorem 1, we get the claimed bounds on $\sum_{t=1}^n \mathbb{E}[M_t]$ and $\sum_{t=1}^n \mathbb{E}[Z_t]$. [sent-375, score-0.257]
73 Selective Sampling Winnow The techniques used to prove Theorem 1 can be readily extended to analyze selective sampling versions of algorithms in the general additive family of Grove et al. [sent-377, score-0.395]
74 We now show a concrete example by analyzing the selective sampling version of the Winnow algorithm (Littlestone, 1988), a member of the general additive family based on the exponential potential $\Psi(u) = e^{u_1} + \cdots + e^{u_d}$. [sent-381, score-0.395]
75 Figure 4: A selective sampling version of the Winnow algorithm. [sent-397, score-0.378]
76 To obtain a selective sampling version of Winnow we proceed exactly as we did in the previous cases: we query the label yt with probability b/(b + | pt |), where | pt | is the margin computed by the algorithm. [sent-403, score-1.194]
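A sketch of the resulting selective sampling Winnow (Figure 4) follows, in Python/NumPy with our naming; the multiplicative update and the renormalization onto the probability simplex are our reading of the figure, consistent with the exponential-potential analysis below.

```python
import numpy as np

def selective_sampling_winnow(examples, b, eta, d, seed=0):
    """Selective sampling Winnow (a sketch of Figure 4).

    Weights start at the uniform vector (1/d, ..., 1/d) and stay on the
    probability simplex; eta is the learning rate, b the sampling scale.
    """
    rng = np.random.default_rng(seed)
    w = np.full(d, 1.0 / d)
    mistakes = queries = 0
    for x, y in examples:
        p = w @ x                            # margin of w_{t-1} on x_t
        y_hat = 1.0 if p >= 0 else -1.0
        if y_hat != y:
            mistakes += 1
        if rng.random() < b / (b + abs(p)):  # query with probability b/(b+|p_t|)
            queries += 1
            if y_hat != y:                   # multiplicative (exponential) update
                w = w * np.exp(eta * y * x)
                w = w / w.sum()              # renormalize onto the simplex
    return mistakes, queries
```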
77 The mistake bound we prove for selective sampling Winnow is somewhat atypical since, unlike the Perceptron-like algorithms analyzed so far, the choice of the learning rate η given in this theorem is the same as the one suggested by the original Winnow analysis (see, e. [sent-405, score-0.524]
78 Furthermore, since a meaningful bound in Winnow requires η be chosen in terms of γ, it turns out that in the selective sampling version there is no additional tuning to perform, and we are able to obtain the same mistake bound as the original version. [sent-409, score-0.54]
79 Thus, unlike the other cases, the selective sampling mechanism does not weaken in any respect the original mistake bound, apart from turning a deterministic bound into an expected one. [sent-410, score-0.523]
80 As before, the expected number of labels queried by the algorithm equals $\sum_{t=1}^n \mathbb{E}\bigl[\frac{b}{b + |\hat p_t|}\bigr]$. [sent-418, score-0.346]
81 As in the proof of Theorem 1, we have that $M_t Z_t = 1$ implies $\eta\bigl(\gamma - \ell_{\gamma,t}(u)\bigr) = \eta\bigl(\gamma - (\gamma - y_t u^\top x_t)_+\bigr) \le \eta\, y_t u^\top x_t = \eta\, y_t (u - w_{t-1} + w_{t-1})^\top x_t = \eta\, y_t (u - w_{t-1})^\top x_t + \eta\, y_t\, w_{t-1}^\top x_t$. [sent-426, score-1.769]
82 5), we can rewrite the term $\eta\, y_t (u - w_{t-1})^\top x_t$ as $\eta\, y_t (u - w_{t-1})^\top x_t = \mathrm{KL}(u, w_{t-1}) - \mathrm{KL}(u, w_t) + \ln \sum_{j=1}^d w_{j,t-1}\, e^{\eta y_t v_j}$, where $v_j = x_{j,t} - w_{t-1}^\top x_t$. [sent-429, score-1.333]
83 This equation is similar to the one obtained in the analysis of the selective sampling Perceptron algorithm, but for the relative entropy replacing the squared Euclidean distance. [sent-430, score-0.378]
84 Then, from the Hoeffding inequality (Hoeffding, 1963) applied to $X$, $\ln \sum_{j=1}^d w_{j,t-1}\, e^{\eta y_t v_j} = \ln \mathbb{E}\, e^{\eta y_t (X - \mathbb{E} X)} \le \frac{\eta^2}{2} X_\infty^2$. [sent-433, score-0.484]
85 We plug back, rearrange and note that $w_t = w_{t-1}$ whenever $M_t Z_t = 0$. [sent-434, score-0.309]
86 This gives $M_t Z_t\,\eta\Bigl(\gamma + |\hat p_t| - \frac{\eta}{2} X_\infty^2\Bigr) \le M_t Z_t\,\eta\,\ell_{\gamma,t}(u) + \mathrm{KL}(u, w_{t-1}) - \mathrm{KL}(u, w_t)$, holding for any $t$. [sent-435, score-0.572]
87 Summing over $t$ and dividing by $\eta$ gives $\sum_{t=1}^n M_t Z_t\Bigl(\gamma + |\hat p_t| - \frac{\eta}{2} X_\infty^2\Bigr) \le \sum_{t=1}^n M_t Z_t\,\ell_{\gamma,t}(u) + \frac{\mathrm{KL}(u, w_0) - \mathrm{KL}(u, w_n)}{\eta}$. We drop the last term (which is nonpositive), and use $\mathrm{KL}(u, w_0) \le \ln d$, holding for any $u$ in the probability simplex whenever $w_0 = (1/d, \ldots, 1/d)$. [sent-440, score-0.325]
88 Then the above reduces to $\sum_{t=1}^n M_t Z_t\Bigl(\gamma + |\hat p_t| - \frac{\eta}{2} X_\infty^2\Bigr) \le \sum_{t=1}^n M_t Z_t\,\ell_{\gamma,t}(u) + \frac{\ln d}{\eta}$. [sent-444, score-0.292]
89 Substituting our choice for $\eta$ and $b$ yields $\sum_{t=1}^n M_t Z_t\bigl(b + |\hat p_t|\bigr) \le \sum_{t=1}^n M_t Z_t\,\ell_{\gamma,t}(u) + \frac{X_\infty^2 \ln d}{2(1-\alpha)\gamma}$. [sent-445, score-0.306]
90 To conclude, it suffices to exploit $\mathbb{E}_{t-1} Z_t = b/(b + |\hat p_t|)$ and proceed as in the proof of the previous theorems. [sent-446, score-0.257]
91 We focused on the following three algorithms: the selective sampling Perceptron algorithm of Figure 1 (here abbreviated as SEL - P), its adaptive version of Figure 2 (abbreviated as SEL - ADA), and the selective sampling second-order Perceptron algorithm of Figure 3 (abbreviated as SEL -2 ND). [sent-455, score-0.779]
92 In Figure 5 we check whether our margin-based sampling technique achieves a better performance than the baseline sampling strategy of querying each label with constant probability. [sent-456, score-0.407]
93 Figure 5: Comparison between margin-based sampling and random sampling with pre-specified sampling rate for the Perceptron algorithm (left) and the second-order Perceptron algorithm (right). [sent-486, score-0.564]
94 Note that SEL - P uses an average sampling rate of about 60%, while SEL - ADA needs a larger (and growing with time) sampling rate of about 74%. [sent-579, score-0.392]
95 Conclusions and Open Problems We have introduced a general technique for turning linear-threshold algorithms from the general additive family into selective sampling algorithms. [sent-595, score-0.409]
96 Thus we have also provided a theoretical analysis for an adaptive parameter version of the (first-order) selective sampling Perceptron algorithm. [sent-606, score-0.415]
97 It is worth observing that for the adaptive version of the selective sampling Perceptron (Figure 2) we can easily derive a lower bound on the label sampling rate. [sent-613, score-0.657]
98 Then we can write $\frac{b_{t-1}}{b_{t-1} + |\hat p_t|} = \frac{\beta\sqrt{1+K_{t-1}}}{\beta\sqrt{1+K_{t-1}} + |w_{t-1}^\top x_t|} \ge \frac{\beta\sqrt{1+K_{t-1}}}{\beta\sqrt{1+K_{t-1}} + \|w_{t-1}\|} \ge \frac{\beta\sqrt{1+K_{t-1}}}{\beta\sqrt{1+K_{t-1}} + \sqrt{K_{t-1}}} \ge \frac{\beta}{\beta+1}$ (using Inequality (7)), holding for any trial $t$. [sent-615, score-0.528]
99 At first glance, this requires a lower bound on the margin | pt |. [sent-617, score-0.307]
100 We would like to devise an adaptive parameter version of the selective sampling Perceptron algorithm that both lends itself to formal analysis and is competitive in practice. [sent-628, score-0.428]
wordName wordTfidf (topN-words)
[('mt', 0.473), ('zt', 0.402), ('wt', 0.309), ('pt', 0.244), ('perceptron', 0.229), ('selective', 0.206), ('bt', 0.203), ('xt', 0.196), ('yt', 0.194), ('kt', 0.189), ('sampling', 0.172), ('sel', 0.172), ('ct', 0.114), ('aniboni', 0.094), ('entile', 0.094), ('ampling', 0.087), ('elective', 0.087), ('mistake', 0.083), ('ianchi', 0.08), ('winnow', 0.077), ('gentile', 0.075), ('esa', 0.071), ('trial', 0.069), ('ase', 0.066), ('queried', 0.06), ('rd', 0.055), ('mistaken', 0.049), ('ln', 0.048), ('label', 0.048), ('warmuth', 0.047), ('vt', 0.045), ('query', 0.045), ('ada', 0.043), ('decides', 0.043), ('mistakes', 0.043), ('margin', 0.041), ('grove', 0.038), ('infv', 0.036), ('sgn', 0.035), ('littlestone', 0.033), ('kn', 0.033), ('labels', 0.029), ('kl', 0.027), ('wn', 0.026), ('hinge', 0.024), ('rate', 0.024), ('adaptive', 0.023), ('xs', 0.023), ('forster', 0.022), ('maxt', 0.022), ('bound', 0.022), ('saving', 0.021), ('randomized', 0.021), ('bn', 0.021), ('freund', 0.02), ('tuning', 0.02), ('holding', 0.019), ('update', 0.019), ('trials', 0.019), ('claudio', 0.018), ('coin', 0.018), ('bernoulli', 0.018), ('auer', 0.018), ('theorem', 0.017), ('additive', 0.017), ('helmbold', 0.017), ('azoury', 0.016), ('milano', 0.016), ('cn', 0.016), ('inf', 0.015), ('turns', 0.015), ('updates', 0.015), ('dent', 0.015), ('querying', 0.015), ('predicts', 0.015), ('apple', 0.014), ('corroborated', 0.014), ('jagota', 0.014), ('nonpositive', 0.014), ('perceptronlike', 0.014), ('stretching', 0.014), ('simplex', 0.014), ('turning', 0.014), ('yields', 0.014), ('parameter', 0.014), ('observing', 0.014), ('rule', 0.014), ('expectation', 0.014), ('instance', 0.013), ('expected', 0.013), ('weight', 0.013), ('sequence', 0.013), ('devise', 0.013), ('reuters', 0.013), ('textual', 0.013), ('kivinen', 0.013), ('prediction', 0.013), ('deterministic', 0.013), ('proof', 0.013), ('cristianini', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
2 0.3976768 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
3 0.19001009 75 jmlr-2006-Policy Gradient in Continuous Time
Author: Rémi Munos
Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation
4 0.15183857 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on texual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function
5 0.11919628 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola
Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces
6 0.11412739 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
7 0.10397349 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
8 0.071194254 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
10 0.051298738 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
11 0.048492059 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.048091862 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.043942593 12 jmlr-2006-Active Learning with Feedback on Features and Instances
14 0.039102592 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
16 0.034317169 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections (Special Topic on Machine Learning and Optimization)
17 0.03101597 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
18 0.030484172 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
19 0.029485749 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
20 0.027569477 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
topicId topicWeight
[(0, 0.237), (1, 0.488), (2, 0.337), (3, 0.004), (4, -0.114), (5, 0.064), (6, 0.107), (7, -0.038), (8, 0.175), (9, 0.094), (10, 0.069), (11, -0.058), (12, -0.122), (13, -0.018), (14, -0.093), (15, 0.045), (16, -0.051), (17, 0.127), (18, -0.077), (19, -0.038), (20, 0.109), (21, 0.071), (22, 0.083), (23, 0.025), (24, -0.035), (25, 0.024), (26, -0.034), (27, -0.015), (28, -0.038), (29, -0.042), (30, -0.009), (31, 0.026), (32, 0.026), (33, -0.058), (34, -0.031), (35, 0.025), (36, 0.046), (37, 0.038), (38, 0.002), (39, 0.039), (40, -0.003), (41, -0.011), (42, 0.031), (43, 0.087), (44, -0.055), (45, 0.035), (46, -0.086), (47, -0.024), (48, 0.039), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.98492938 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
2 0.92721248 70 jmlr-2006-Online Passive-Aggressive Algorithms
3 0.54388875 75 jmlr-2006-Policy Gradient in Continuous Time
Author: Rémi Munos
Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation 1. Introduction and Statement of the Problem We consider an optimal control problem with continuous state (xt ∈ IRd )t≥0 whose state dynamics is defined according to the controlled differential equation: dxt = f (xt , ut ), dt (1) where the control (ut )t≥0 is a Lebesgue measurable function with values in a control space U. Note that the state-dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a finite-time horizon T > 0. For simplicity, in the paper, we illustrate the case of a terminal reward c 2006 Rémi Munos. M UNOS only: J(x; (ut )t≥0 ) := r(xT ), (2) where r : IRd → IR is the reward function. Extension to the case of general functional of the kind J(x; (ut )t≥0 ) = Z T 0 r(t, xt )dt + R(xT ), (3) with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IRd → U}α , where α ∈ IRm is the parameter. The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as dxt = fα (xt ), (4) dt where fα (xt ) := f (x, πα (t, x)). In the paper, we will make the assumption that fα is C 2 , with bounded derivatives. Let us define the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w.r.t.) 
the parameter α is emphasized. One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 )|x ∼ µ]. In order to find a local maximum of V (α), one may perform a local search, such as a gradient ascent method α ← α + η∇αV (α), (5) with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient ∇αV (α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter: V (α + εei ) −V (α) , ε for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a singledimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only. ∂αi V (α) ≃ 772 P OLICY G RADIENT IN C ONTINUOUS T IME Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: zt := ∇α xt (i.e. zt is defined as a d × m-matrix whose (i, j)-component is the derivative of the ith component of xt w.r.t. α j ). Our smoothness assumption on fα allows to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics on (zt ): dzt = ∇α fα (xt ) + ∇x fα (xt )zt , dt (6) where the coefficients ∇α fα and ∇x fα are, respectively, the derivatives of f w.r.t. the parameter (matrix of size d × m) and the state (matrix of size d × d). The initial condition for z is z0 = 0. When the reward function r is smooth (i.e. continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. (Bensoussan, 1988) or (Yang and Kushner, 1991) for an extension to the stochastic case): ∇αV (α) = ∇x r(xT )zT . (7) Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula: ∇αV (α) = Z T 0 ∇x r(t, xt )zt dt + ∇x R(xT )zT . What is known from the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f . In case the dynamics is known, the state gradient zt = ∇α xt may be computed from (6) along the trajectory and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows to perform the gradient ascent step (5). However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown from the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the available information to the agent at time t is its own control policy πα and the trajectory (xs )0≤s≤t up to time t. At time T , the agent receives the reward r(xT ) and, in this paper, we assume that the gradient ∇r(xT ) is available to the agent. From this point of view, it seems impossible to derive the state gradient zt from (6), since ∇α f and ∇x f are unknown. 
The term ∇x f (xt ) may be approximated by a least squares method from the observation of past states (xs )s≤t , as this will be explained later on in subsection 3.2. However the term ∇α f (xt ) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (xt ) and the state gradient (zt ) by discrete-time stochastic processes (Xt∆ ) and (Zt∆ ) (with ∆ being some discretization time-step). We show how Zt∆ can be computed without the knowledge of ∇α f , but only from information available to the agent. ∆ ∆ We prove the convergence (with probability one) of the gradient estimate ∇x r(XT )ZT derived from the stochastic processes to ∇αV (α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001). 773 M UNOS y ∆ XT ∆ X t2 ∆ Xt 0 fα ∆ x Xt 1 Figure 1: A trajectory (Xt∆ )0≤n≤N and the state dynamics vector fα of the continuous process n (xt )0≤t≤T . Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called score method) in discrete time (see e.g. (Reiman and Weiss, 1986; Glynn, 1987) or more recently in the reinforcement learning literature (Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003)) for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for that problem lies in the fact that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear with the number of decisions). Let us illustrate this problem on a simple 2 dimensional process. Consider the deterministic continuous process (xt )0≤t≤1 defined by the state dynamics: dxt = fα := dt α 1−α , (8) (0 < α < 1) with initial condition x0 = (0 0)′ (where ′ denotes the transpose operator). The performance measure V (α) is the reward at the terminal state at time T = 1, with the reward function being the first coordinate of the state r((x y)′ ) := x. Thus V (α) = r(xT =1 ) = α and its derivative is ∇αV (α) = 1. Let (Xt∆ )0≤n≤N ∈ IR2 be a discrete time stochastic process (the discrete times being {tn = n ∆ n∆}n=0...N with the discretization time-step ∆ = 1/N) that starts from initial state X0 = x0 = (0 0)′ and makes N random moves of length ∆ towards the right (action u1 ) or the top (action u2 ) (see Figure 1) according to the stochastic policy (i.e., the probability of choosing the actions in each state x) πα (u1 |x) = α, πα (u2 |x) = 1 − α. The process is thus defined according to the dynamics: Xt∆ = Xt∆ + n n+1 Un 1 −Un ∆, (9) where (Un )0≤n < N and all ∞ N > 0), there exists a constant C that does not depend on N such that dn ≤ C/N. Thus we may take D2 = C2 /N. Now, from the previous paragraph, ||E[XN ] − xN || ≤ e(N), with e(N) → 0 when N → ∞. This means that ||h − E[h]|| + e(N) ≥ ||XN − xN ||, thus P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||XN − xN || ≥ ε), and we deduce from (31) that 2 /(2C 2 ) P(||XN − xN || ≥ ε) ≤ 2e−N(ε+e(N)) . Thus, for all ε > 0, the series ∑N≥0 P(||XN − xN || ≥ ε) converges. Now, from Borel-Cantelli lemma, we deduce that for all ε > 0, there exists Nε such that for all N ≥ Nε , ||XN − xN || < ε, which ∆→0 proves the almost sure convergence of XN to xN as N → ∞ (i.e. XT −→ xT almost surely). Appendix C. 
Appendix C. Proof of Proposition 8

First, note that Qt — the average of Xs+ Xs+′ over s ∈ S(t) minus the product of averages X̄ X̄′, where X̄ denotes the average of the Xs+ over s ∈ S(t) — is a symmetric, non-negative matrix, since it may be rewritten as

(1/nt) ∑s∈S(t) (Xs+ − X̄)(Xs+ − X̄)′.

In solving the least squares problem (21), we deduce b = ∆X̄ − A X̄ ∆ (with ∆X̄ the average of the increments ∆Xs); thus

minA,b (1/nt) ∑s∈S(t) ||∆Xs − b − A(Xs + ½ ∆Xs)∆||²
 = minA (1/nt) ∑s∈S(t) ||∆Xs − ∆X̄ − A(Xs+ − X̄)∆||²
 ≤ (1/nt) ∑s∈S(t) ||∆Xs − ∆X̄ − ∇x f(X̄, ut)(Xs+ − X̄)∆||².   (32)

Now, since Xs = X̄ + O(∆), one may obtain as in (19) and (20) (by replacing Xt by X̄) that

∆Xs − ∆X̄ − ∇x f(X̄, ut)(Xs+ − X̄)∆ = O(∆³).   (33)

Writing ∇̂x f(Xt, ut) for the least-squares estimate (the minimizing matrix A), we deduce from (32) and (33) that

(1/nt) ∑s∈S(t) ||(∇̂x f(Xt, ut) − ∇x f(X̄, ut))(Xs+ − X̄)∆||² = O(∆⁶).

By developing each component,

∑i=1..d [∇̂x f(Xt, ut) − ∇x f(X̄, ut)]row i Qt [∇̂x f(Xt, ut) − ∇x f(X̄, ut)]′row i = O(∆⁴).

Now, from the definition of ν(∆), for all vectors u ∈ IRd, u′ Qt u ≥ ν(∆)||u||², thus

ν(∆) ||∇̂x f(Xt, ut) − ∇x f(X̄, ut)||² = O(∆⁴).

Condition (23) yields ∇̂x f(Xt, ut) = ∇x f(X̄, ut) + o(1), and since ∇x f(X̄, ut) = ∇x f(Xt, ut) + O(∆), we deduce

lim∆→0 ∇̂x f(Xt, ut) = ∇x f(Xt, ut).

References

J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.

A. Bensoussan. Perturbation Methods in Optimal Control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson.

A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical Report CSE-04-006, CSEE, OGI School of Science and Engineering, OHSU, 2004.

P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.

E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales. Application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, MD, 1996.

R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. McGraw-Hill, New York, 1969.

P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, 1995.

H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI, 2001.

P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.

M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063. MIT Press, 2000.

M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.
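As a complement to the proof above, here is a small numerical sketch (not from the paper) of the least-squares step it analyses: estimating the Jacobian ∇x f from a short window of observed state increments, in the spirit of the problem referred to as (21). The linear dynamics, window length, and regression setup below are illustrative assumptions.

```python
import numpy as np

# Illustrative linear dynamics (not the paper's system): dx/dt = A_true x + c,
# so the Jacobian grad_x f is the constant matrix A_true.
A_true = np.array([[0.0, 1.0],
                   [-1.0, -0.2]])
c = np.array([0.1, 0.0])

def step(x, dt):
    return x + dt * (A_true @ x + c)        # Euler step of the dynamics (hidden from the "agent")

dt, window = 1e-2, 50                       # discretization step and number of past transitions
x = np.array([1.0, 0.0])
states, increments = [], []
for _ in range(window):
    x_next = step(x, dt)
    states.append(x)
    increments.append(x_next - x)           # observed increments Delta X_s
    x = x_next

X = np.asarray(states)                      # (window, d) observed states
dX = np.asarray(increments)

# Least squares in the spirit of (21): Delta X_s ~ b + A X_s dt,
# with the intercept b absorbing the drift term c dt.
design = np.hstack([np.ones((window, 1)), X * dt])
coef, *_ = np.linalg.lstsq(design, dX, rcond=None)
A_hat = coef[1:].T                          # recover A (rows of coef are per-regressor)

print("estimated grad_x f:\n", np.round(A_hat, 3))
print("true grad_x f:\n", A_true)
```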
4 0.48191085 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola
Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces
5 0.4445363 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function
6 0.40574023 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
7 0.22922738 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
8 0.20883654 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.18791732 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
11 0.16129257 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
12 0.12850884 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.11749347 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
14 0.11433425 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
15 0.10752299 53 jmlr-2006-Learning a Hidden Hypergraph
16 0.10542704 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
17 0.10408635 12 jmlr-2006-Active Learning with Feedback on Features and Instances
18 0.10180714 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery
19 0.096656479 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
20 0.093323775 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
topicId topicWeight
[(8, 0.014), (34, 0.371), (36, 0.063), (40, 0.058), (45, 0.016), (50, 0.053), (63, 0.036), (66, 0.014), (68, 0.011), (76, 0.024), (78, 0.012), (81, 0.02), (84, 0.019), (90, 0.087), (91, 0.021), (96, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.80442804 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
2 0.76488107 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs
Author: Alberto Roverato, Milan Studený
Abstract: This paper deals with chain graph models under alternative AMP interpretation. A new representative of an AMP Markov equivalence class, called the largest deflagged graph, is proposed. The representative is based on revealed internal structure of the AMP Markov equivalence class. More specifically, the AMP Markov equivalence class decomposes into finer strong equivalence classes and there exists a distinguished strong equivalence class among those forming the AMP Markov equivalence class. The largest deflagged graph is the largest chain graph in that distinguished strong equivalence class. A composed graphical procedure to get the largest deflagged graph on the basis of any AMP Markov equivalent chain graph is presented. In general, the largest deflagged graph differs from the AMP essential graph, which is another representative of the AMP Markov equivalence class. Keywords: chain graph, AMP Markov equivalence, strong equivalence, largest deflagged graph, component merging procedure, deflagging procedure, essential graph
3 0.37578097 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function
4 0.36640072 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
5 0.36274472 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola
Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces
7 0.33690673 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
8 0.33684051 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
9 0.33681548 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
10 0.33253798 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
11 0.32373118 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
12 0.32221282 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
13 0.32030958 44 jmlr-2006-Large Scale Transductive SVMs
14 0.31876639 93 jmlr-2006-Universal Kernels
15 0.31771815 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
16 0.31590658 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
17 0.31424314 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
20 0.31109747 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)