jmlr jmlr2006 jmlr2006-70 knowledge-graph by maker-knowledge-mining

70 jmlr-2006-Online Passive-Aggressive Algorithms


Source: pdf

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We assume that xt is associated with a unique label yt ∈ {+1, −1}. [sent-100, score-1.069]

2 We denote by wt the weight vector used by the algorithm on round t, and refer to the term yt (wt · xt ) as the (signed) margin attained on round t. [sent-106, score-1.751]

3 Whenever the margin is a positive number then sign(wt · xt ) = yt and the algorithm has made a correct prediction. [sent-107, score-1.065]

4 We abbreviate the loss suffered on round t by ℓt, that is, ℓt = ℓ(wt; (xt, yt)). [sent-117, score-1.303]

5 (2) has a simple closed form solution, wt+1 = wt + τt yt xt, where τt = ℓt / ‖xt‖². [sent-144, score-1.591]

6 We define the Lagrangian of Eq. (2) to be, L(w, τ) = ½‖w − wt‖² + τ(1 − yt (w · xt)), (4) where τ ≥ 0 is a Lagrange multiplier. [sent-150, score-1.575]

7 • receive instance: xt ∈ Rn • predict: ŷt = sign(wt · xt) • receive correct label: yt ∈ {−1, +1} • suffer loss: ℓt = max{0, 1 − yt (wt · xt)} • update: 1. [sent-159, score-3.11]

8 set: τt = ℓt / ‖xt‖² (PA), τt = min{C, ℓt / ‖xt‖²} (PA-I), τt = ℓt / (‖xt‖² + 1/(2C)) (PA-II) 2. [sent-160, score-1.182]

9 update: wt+1 = wt + τt yt xt. Figure 1: Three variants of the Passive-Aggressive algorithm for binary classification. [sent-161, score-1.615]
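The predict / suffer-loss / update cycle and the three step sizes listed in sentences 7–9 can be sketched directly in code. The following Python snippet is a minimal illustration under our own naming conventions, not the authors' reference implementation.

```python
import numpy as np

def pa_binary_round(w, x, y, variant="PA", C=1.0):
    """One online round of Passive-Aggressive learning for binary classification.

    w: current weight vector, x: instance, y: label in {-1, +1}.
    Returns the updated weight vector and the hinge loss suffered on this round.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))           # suffer loss: max{0, 1 - y (w . x)}
    sq_norm = np.dot(x, x)
    if loss > 0.0 and sq_norm > 0.0:
        if variant == "PA":
            tau = loss / sq_norm                       # tau = loss / ||x||^2
        elif variant == "PA-I":
            tau = min(C, loss / sq_norm)               # tau = min{C, loss / ||x||^2}
        else:  # "PA-II"
            tau = loss / (sq_norm + 1.0 / (2.0 * C))   # tau = loss / (||x||^2 + 1/(2C))
        w = w + tau * y * x                            # update: w <- w + tau * y * x
    return w, loss
```

Running this round by round over a stream of examples (xt, yt) reproduces the three variants of Figure 1; the aggressiveness parameter C only matters for PA-I and PA-II.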

10 Setting the partial derivatives of L with respect to the elements of w to zero gives, 0 = ∇w L (w, τ) = w − wt − τyt xt =⇒ w = wt + τyt xt . [sent-163, score-1.872]

11 Substituting this back into Eq. (4) we get, L(τ) = −½ τ² ‖xt‖² + τ(1 − yt (wt · xt)). [sent-165, score-1.427]

12 Taking the derivative of L(τ) with respect to τ and setting it to zero, we get, 0 = ∂L(τ)/∂τ = −τ ‖xt‖² + 1 − yt (wt · xt) =⇒ τ = (1 − yt (wt · xt)) / ‖xt‖². [sent-166, score-2.46]

13 Since we assumed that ℓt > 0, then ℓt = 1 − yt (wt · xt). [sent-167, score-1.427]

14 The updates of PA-I and PA-II also share the simple closed form wt+1 = wt + τt yt xt, where τt = min{C, ℓt / ‖xt‖²} (PA-I) or τt = ℓt / (‖xt‖² + 1/(2C)) (PA-II). [sent-191, score-2.393]

15 Simply note that for all three PA variants, wt = ∑_{i=1}^{t−1} τi yi xi, and therefore, wt · xt = ∑_{i=1}^{t−1} τi yi (xi · xt). [sent-198, score-3.544]
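Sentence 15 is the observation that lets the algorithms be kernelized: since wt is a weighted sum of the previous examples, the inner product wt · xt needs only kernel evaluations. Below is a hedged sketch of a kernel PA-I round, with a placeholder RBF kernel and naming of our own.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Any positive-definite kernel k(a, b) could be used here; RBF is just an example.
    diff = a - b
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_pa_round(support, x, y, C=1.0, kernel=rbf_kernel):
    """One kernelized PA-I round for binary classification.

    support: list of (tau_i, y_i, x_i) triples accumulated on earlier rounds.
    The margin w . x is computed as sum_i tau_i * y_i * k(x_i, x).
    """
    margin = sum(tau_i * y_i * kernel(x_i, x) for tau_i, y_i, x_i in support)
    loss = max(0.0, 1.0 - y * margin)
    if loss > 0.0:
        k_xx = kernel(x, x)                  # plays the role of ||x||^2
        tau = min(C, loss / k_xx)            # PA-I step size
        support.append((tau, y, x))          # implicitly adds tau * y * x to w
    return support, loss
```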

16 Formally, let u be an arbitrary vector in Rn, and define ℓt = ℓ(wt; (xt, yt)) and ℓt⋆ = ℓ(u; (xt, yt)). [sent-209, score-1.82]

17 Using the definition wt+1 = wt + yt τt xt, we can write ∆t as, ∆t = ‖wt − u‖² − ‖wt+1 − u‖² = ‖wt − u‖² − ‖wt − u + yt τt xt‖² = ‖wt − u‖² − (‖wt − u‖² + 2τt yt (wt − u) · xt + τt² ‖xt‖²) = −2τt yt (wt − u) · xt − τt² ‖xt‖². (11) [sent-232, score-8.172]

18 Since we assumed that ℓt > 0, then ℓt = 1 − yt (wt · xt), or alternatively yt (wt · xt) = 1 − ℓt. [sent-233, score-2.066]

19 In addition, the definition of the hinge loss implies that ℓt⋆ ≥ 1 − yt (u · xt ), hence yt (u · xt ) ≥ 1 − ℓt⋆ . [sent-234, score-2.105]

20 Without loss of generality we can assume that u is scaled such that yt (u · xt) ≥ 1 and therefore u attains a loss of zero on all T examples in the sequence. [sent-246, score-1.111]

21 Let (x1, y1), . . . , (xT, yT) be a sequence of examples where xt ∈ Rn, yt ∈ {+1, −1} and ‖xt‖ ≤ R for all t. [sent-251, score-1.447]

22 Let (x1, y1), . . . , (xT, yT) be a sequence of examples where xt ∈ Rn, yt ∈ {+1, −1} and ‖xt‖ = 1 for all t. [sent-268, score-1.447]

23 Let (x1, y1), . . . , (xT, yT) be a sequence of examples where xt ∈ Rn, yt ∈ {+1, −1} and ‖xt‖ ≤ R for all t. [sent-284, score-1.447]

24 Let (x1, y1), . . . , (xT, yT) be a sequence of examples where xt ∈ Rn, yt ∈ {+1, −1} and ‖xt‖² ≤ R² for all t. [sent-302, score-2.086]

25 Plugging in the definition of α, we obtain the following lower bound, ‖u‖² ≥ ∑_{t=1}^{T} (2τt ℓt − τt² (‖xt‖² + 1/(2C)) − 2C(ℓt⋆)²), where τt = ℓt / (‖xt‖² + 1/(2C)). [sent-307, score-1.182]

26 In the regression setting, every instance xt is associated with a real target value yt ∈ R, which the online algorithm tries to predict. [sent-334, score-1.138]

27 On every round, the algorithm receives an instance xt ∈ Rn and predicts a target value ŷt ∈ R using its internal regression function. [sent-335, score-1.081]

28 We focus on the class of linear regression functions, that is, ŷt = wt · xt where wt is the incrementally learned vector. [sent-336, score-2.133]

29 At the end of every round, the algorithm uses wt and the example (xt, yt) to generate a new weight vector wt+1, which will be used to extend the prediction on the next round. [sent-340, score-1.232]

30 ℓε(w; (xt, yt)) = 0. (19) In the binary classification setting, we gave the PA update the geometric interpretation of projecting wt onto the linear half-space defined by the constraint ℓ(w; (xt, yt)) = 0. [sent-349, score-1.901]

31 (19) has a closed form solution similar to that of the classification PA algorithm of the previous section, namely, wt+1 = wt + sign(yt − ŷt) τt xt, where τt = ℓt / ‖xt‖². [sent-353, score-1.985]

32 The closed form solution for these updates also comes out to be wt+1 = wt + sign(yt − ŷt) τt xt, where τt is defined as in Eq. [sent-358, score-1.605]
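The regression variant (sentences 26–32) follows the same pattern with the ε-insensitive loss and the signed update wt+1 = wt + sign(yt − ŷt) τt xt. A short sketch using the PA-I step size is below; the function name and defaults are ours, not from the paper.

```python
import numpy as np

def pa_regression_round(w, x, y, eps=0.1, C=1.0):
    """One PA-I regression round with the epsilon-insensitive loss."""
    y_hat = np.dot(w, x)                                # predict y_hat = w . x
    loss = max(0.0, abs(y_hat - y) - eps)               # epsilon-insensitive loss
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))               # PA-I: min{C, loss / ||x||^2}
        w = w + np.sign(y - y_hat) * tau * x            # move the prediction toward y
    return w, loss
```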

33 (20) Since wt · xt = ŷt, we have that −sign(yt − ŷt)(wt · xt − yt) = |wt · xt − yt|. [sent-379, score-4.28]

34 We can rewrite Eq. (20) as, ∆t ≥ 2τt (ℓt + ε) + 2τt sign(yt − ŷt)(u · xt − yt) − τt² ‖xt‖². [sent-381, score-2.066]

35 We also know that sign(yt − ŷt)(u · xt − yt) ≥ −|u · xt − yt| and that −|u · xt − yt| ≥ −(ℓt⋆ + ε). [sent-382, score-3.738]

36 Specifically, the algorithm maintains in its memory a vector wt ∈ Rn and simply predicts the next element of the sequence to be wt . [sent-389, score-1.116]

37 ℓε (w; yt ) = 0, (22) Geometrically, wt+1 is set to be the projection of wt onto a ball of radius ε about yt . [sent-404, score-1.84]

38 We now show that the closed form solution of this optimization problem turns out to be, wt+1 = (1 − ℓt/‖wt − yt‖) wt + (ℓt/‖wt − yt‖) yt. (23) [sent-405, score-3.559]

39 First, we rewrite the above equation and express wt+1 by, wt+1 = wt + τt (yt − wt)/‖yt − wt‖, (24) where τt = ℓt. [sent-406, score-2.916]
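In the uniclass setting the update of Eqs. (23)–(24) moves wt a distance τt = ℓt toward yt, where ℓt = ‖wt − yt‖ − ε, i.e. it projects wt onto the ball of radius ε centered at yt. A minimal sketch, assuming plain Euclidean geometry and our own naming:

```python
import numpy as np

def pa_uniclass_round(w, y, eps=0.5):
    """One uniclass PA round: project w onto the ball of radius eps around y."""
    dist = np.linalg.norm(w - y)
    loss = max(0.0, dist - eps)                 # uniclass epsilon-insensitive loss
    if loss > 0.0:
        w = w + loss * (y - w) / dist           # tau_t = loss; unit step direction (y - w)/||y - w||
    return w, loss
```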

40 The Lagrangian of Eq. (22) is, L(w, τ) = ½‖w − wt‖² + τ(‖w − yt‖ − ε), (25) where τ ≥ 0 is a Lagrange multiplier. [sent-412, score-1.181]

41 Differentiating with respect to the elements of w and setting these partial derivatives to zero, we get the first KKT condition, stating that at the optimum (w, τ) must satisfy the equality, 0 = ∇w L(w, τ) = w − wt + τ (w − yt)/‖w − yt‖. (26) [sent-413, score-1.195]

42 In addition, an optimal solution must satisfy the conditions τ ≥ 0 and, τ(‖w − yt‖ − ε) = 0. [sent-414, score-1.278]

43 Plugging this into Eq. (26) gives, wt+1 − wt + τt (wt+1 − yt)/‖wt+1 − yt‖ = τt ((yt − wt)/‖yt − wt‖ + (wt+1 − yt)/‖wt+1 − yt‖). (28) [sent-422, score-5.46]

44 Note that, wt+1 − yt = wt + τt (yt − wt)/‖yt − wt‖ − yt = (wt − yt)(1 − τt/‖yt − wt‖) = ((‖wt − yt‖ − τt)/‖wt − yt‖)(wt − yt) = (ε/‖wt − yt‖)(wt − yt). (29) [sent-423, score-8.461]

45 Combining Eq. [sent-424, score-2.904]

46 (28) we get that, wt+1 − wt + τt (wt+1 − yt)/‖wt+1 − yt‖ = 0, and thus Eq. [sent-426, score-1.834]

47 Namely, the update for PA-I is defined by, wt+1 = argmin_{w ∈ Rn} ½‖w − wt‖² + Cξ s.t. ‖w − yt‖ ≤ ε + ξ, ξ ≥ 0. [sent-436, score-1.27]

48 Using the recursive definition of wt+1, we rewrite ∆t as, ∆t = ‖wt − u‖² − ‖wt+1 − u‖² = ‖wt − u‖² − ‖(wt − u) + τt (yt − wt)/‖wt − yt‖‖² = −2 (τt/‖wt − yt‖)(wt − u) · (yt − wt) − τt². [sent-461, score-8.085]

49 We now add and subtract yt from the term (wt − u) above to get, ∆t = −2 (τt/‖wt − yt‖)(wt − yt + yt − u) · (yt − wt) − τt² = 2τt ‖wt − yt‖ − 2 (τt/‖wt − yt‖)(yt − u) · (yt − wt) − τt². [sent-462, score-6.556]

50 Now, using the Cauchy-Schwartz inequality on the term (yt − u) · (yt − wt), we can bound, ∆t ≥ 2τt ‖wt − yt‖ − 2τt ‖yt − u‖ − τt². [sent-463, score-2.362]

51 We now add and subtract 2τt ε from the right-hand side of the above, to get, ∆t ≥ 2τt (‖wt − yt‖ − ε) − 2τt (‖yt − u‖ − ε) − τt². [sent-464, score-1.832]

52 Since we are dealing with the case where ℓt > 0, it holds that ℓt = ‖wt − yt‖ − ε. [sent-465, score-1.181]

53 Our goal is now to incrementally find wt and εt such that, ‖wt − yt‖ ≤ εt, (31) as often as possible. [sent-486, score-1.723]

54 Eq. (31) can be written equivalently as, ‖wt − yt‖² + (B² − εt²) ≤ B². (32) [sent-491, score-1.181]

55 If we were to concatenate a 0 to the end of every yt (thus increasing its dimension to n + 1) and if we considered the (n + 1)'th coordinate of wt to be equivalent to √(B² − εt²), then Eq. [sent-492, score-1.181]

56 We define the margin attained by the algorithm on round t for the example (xt, Yt) as, γ(wt; (xt, Yt)) = min_{r ∈ Yt} wt · Φ(xt, r) − max_{s ∉ Yt} wt · Φ(xt, s). [sent-551, score-1.733]

57 That is, rt = argmin_{r ∈ Yt} wt · Φ(xt, r) and st = argmax_{s ∉ Yt} wt · Φ(xt, s). [sent-570, score-1.218]

58 The key observation in our analysis is that, ℓMC(wt; (xt, Yt)) = ℓ(wt; (Φ(xt, rt) − Φ(xt, st), +1)). [sent-584, score-1.172]

59 Namely, on round t we find the pair of indices rt, st which corresponds to the largest violation of the margin constraints, rt = argmin_{r ∈ Yt} wt · Φ(xt, r) = argmin_{r ∈ Yt} wt^r · xt, st = argmax_{s ∉ Yt} wt · Φ(xt, s) = argmax_{s ∉ Yt} wt^s · xt. [sent-637, score-2.249]

60 ℓMC((wt^1, . . . , wt^k); (xt, Yt)) = 0 if wt^rt · xt − wt^st · xt ≥ 1, and 1 − wt^rt · xt + wt^st · xt otherwise. (40) [sent-641, score-1.589]

61 Furthermore, applying the same reduction to the update scheme we get that the resulting multi-prototype update is, wt+1^rt = wt^rt + τt xt and wt+1^st = wt^st − τt xt. [sent-642, score-1.094]

62 Since the two non-zero blocks are non-overlapping we get that, ‖Φ(xt, rt) − Φ(xt, st)‖² = ‖xt‖² + ‖−xt‖² = 2‖xt‖². [sent-646, score-1.295]
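Sentences 59–62 describe the multi-prototype reduction: keep one prototype per label, find the lowest-scoring relevant label rt and the highest-scoring irrelevant label st, and update only those two rows; since ‖Φ(xt, rt) − Φ(xt, st)‖² = 2‖xt‖², the PA step size divides the loss by 2‖xt‖². The sketch below uses the PA-I step size and our own names; it illustrates the reduction rather than reproducing code from the paper.

```python
import numpy as np

def pa_multiprototype_round(W, x, relevant, C=1.0):
    """One multi-prototype PA-I round.

    W: (k, n) matrix with one prototype row per label.
    relevant: set of correct labels Y_t for this instance.
    """
    scores = W @ x
    r = min(relevant, key=lambda i: scores[i])                     # lowest-scoring relevant label
    s = max((i for i in range(W.shape[0]) if i not in relevant),
            key=lambda i: scores[i])                               # highest-scoring irrelevant label
    loss = max(0.0, 1.0 - scores[r] + scores[s])                   # multiclass hinge loss, Eq. (40)
    if loss > 0.0:
        tau = min(C, loss / (2.0 * np.dot(x, x)))                  # ||Phi(x,r) - Phi(x,s)||^2 = 2 ||x||^2
        W[r] = W[r] + tau * x                                      # promote the relevant prototype
        W[s] = W[s] - tau * x                                      # demote the irrelevant prototype
    return W, loss
```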

63 Namely, each instance xt is associated with a single correct label yt ∈ Y and the prediction extended by the online algorithm is simply, ŷt = argmax_{y ∈ Y} (wt · Φ(xt, y)). (42) [sent-652, score-1.853]

64 A prediction mistake occurs if ŷt ≠ yt, however in the cost-sensitive setting different mistakes incur different levels of cost. [sent-653, score-1.373]

65 Specifically, on every online round we would like for the following constraints to hold, ∀r ∈ {Y \ yt } wt · Φ(xt , yt ) − wt · Φ(xt , r) ≥ ρ(yt , r). [sent-662, score-2.49]

66 wt · Φ(xt, yt) − wt · Φ(xt, ŷt) ≥ ρ(yt, ŷt), (44) where ŷt is defined in Eq. [sent-676, score-3.64]

67 (45) Note that this loss equals zero if and only if a correct prediction was made, namely if ŷt = yt. [sent-681, score-1.369]

68 On the other hand, if a prediction mistake occurred it means that wt ranked ŷt higher than yt, thus, ρ(yt, ŷt) ≤ wt · Φ(xt, ŷt) − wt · Φ(xt, yt) + ρ(yt, ŷt) = ℓPB(wt; (xt, yt)). [sent-682, score-6.185]

69 (44) has the closed form solution, wt+1 = wt + τt (Φ(xt, yt) − Φ(xt, ŷt)), (47) where, τt = ℓPB(wt; (xt, yt)) / ‖Φ(xt, yt) − Φ(xt, ŷt)‖². (48) [sent-688, score-3.753]

70 As before, we obtain cost-sensitive versions of PA-I and PA-II by setting, τt = min{C, ℓPB(wt; (xt, yt)) / ‖Φ(xt, yt) − Φ(xt, ŷt)‖²} (PA-I) and τt = ℓPB(wt; (xt, yt)) / (‖Φ(xt, yt) − Φ(xt, ŷt)‖² + 1/(2C)) (PA-II), (49) where in both cases C > 0 is a user-defined parameter. [sent-689, score-3.879]
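The prediction-based cost-sensitive update of Eqs. (47)–(49) predicts ŷt = argmax_r wt · Φ(xt, r), suffers a loss that adds the cost term ρ(yt, ŷt), and moves wt along Φ(xt, yt) − Φ(xt, ŷt). With the usual block feature map (the instance copied into the block of its label), this again touches only two prototype rows. A hedged PA-I sketch with our own naming; the cost values are supplied by the caller.

```python
import numpy as np

def pa_cost_sensitive_round(W, x, y, cost_row, C=1.0):
    """One prediction-based cost-sensitive PA-I round (block feature map).

    W: (k, n) prototype matrix, y: index of the correct label,
    cost_row: cost_row[r] = rho(y, r), with cost_row[y] = 0.
    """
    scores = W @ x
    y_hat = int(np.argmax(scores))                               # predicted label
    loss = scores[y_hat] - scores[y] + cost_row[y_hat]           # prediction-based loss
    if y_hat != y and loss > 0.0:
        sq = 2.0 * np.dot(x, x)                                  # ||Phi(x,y) - Phi(x,y_hat)||^2
        tau = min(C, loss / sq)                                  # cost-sensitive PA-I step size
        W[y] = W[y] + tau * x
        W[y_hat] = W[y_hat] - tau * x
    return W, loss
```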

71 Let ỹt be the label in Y defined by, ỹt = argmax_{r ∈ Y} (wt · Φ(xt, r) − wt · Φ(xt, yt) + ρ(yt, r)). [sent-692, score-3.052]

72 Note that the online algorithm continues to predict the label ŷt as before and that ỹt only influences the online update. [sent-697, score-1.452]

73 wt · Φ(xt, yt) − wt · Φ(xt, ỹt) ≥ ρ(yt, ỹt), (51) The update in Eq. [sent-700, score-3.059]

74 Note that since ỹt attains the maximal loss of all other labels, it follows that, ℓPB(wt; (xt, yt)) ≤ ℓML(wt; (xt, yt)). [sent-706, score-1.317]

75 If τt is defined as in Eq. (49) then, ∑_{t=1}^{T} τt (2ℓPB(wt; (xt, yt)) − τt ‖Φ(xt, yt) − Φ(xt, ŷt)‖² − 2ℓML(u; (xt, yt))) ≤ ‖u‖². [sent-726, score-2.556]

76 If τt is defined as in Eq. (49) with ŷt replaced by ỹt then, ∑_{t=1}^{T} τt (2ℓML(wt; (xt, yt)) − τt ‖Φ(xt, yt) − Φ(xt, ỹt)‖² − 2ℓML(u; (xt, yt))) ≤ ‖u‖². [sent-729, score-3.834]

77 The proof of the second statement is identical, except that ŷt is replaced by ỹt and ℓPB(wt; (xt, yt)) is replaced by ℓML(wt; (xt, yt)). [sent-731, score-2.556]

78 Using the recursive definition of wt+1, we rewrite ∆t as, ∆t = ‖wt − u‖² − ‖wt − u + τt (Φ(xt, yt) − Φ(xt, ŷt))‖² = −2τt (wt − u) · (Φ(xt, yt) − Φ(xt, ŷt)) − τt² ‖Φ(xt, yt) − Φ(xt, ŷt)‖². (54) [sent-734, score-4.93]

79 By definition, ℓML(u; (xt, yt)) equals, max_{r ∈ Y} (u · (Φ(xt, r) − Φ(xt, yt)) + ρ(yt, r)). [sent-735, score-1.278]

80 Since ℓML(u; (xt, yt)) is the maximum over Y, it is clearly greater than u · (Φ(xt, ŷt) − Φ(xt, yt)) + ρ(yt, ŷt). [sent-736, score-2.556]

81 This can be written as, u · (Φ(xt, yt) − Φ(xt, ŷt)) ≥ ρ(yt, ŷt) − ℓML(u; (xt, yt)). [sent-737, score-2.556]

82 Plugging this into Eq. (54) we get, ∆t ≥ −2τt wt · (Φ(xt, yt) − Φ(xt, ŷt)) + 2τt (ρ(yt, ŷt) − ℓML(u; (xt, yt))) − τt² ‖Φ(xt, yt) − Φ(xt, ŷt)‖². (55) [sent-739, score-4.376]

83 Rearranging terms in the definition of ℓPB, we get that wt · (Φ(xt, yt) − Φ(xt, ŷt)) = ρ(yt, ŷt) − ℓPB(wt; (xt, yt)). [sent-740, score-2.473]

84 We can therefore rewrite Eq. (55) as, ∆t ≥ −2τt (ρ(yt, ŷt) − ℓPB(wt; (xt, yt))) + 2τt (ρ(yt, ŷt) − ℓML(u; (xt, yt))) − τt² ‖Φ(xt, yt) − Φ(xt, ŷt)‖² = τt (2ℓPB(wt; (xt, yt)) − τt ‖Φ(xt, yt) − Φ(xt, ŷt)‖² − 2ℓML(u; (xt, yt))). [sent-742, score-7.029]

85 The proof of these theorems remains essentially the same as before, however one cosmetic change is required: ‖xt‖² is replaced by either ‖Φ(xt, yt) − Φ(xt, ŷt)‖² or ‖Φ(xt, yt) − Φ(xt, ỹt)‖² in each of the theorems and throughout their proofs. [sent-747, score-2.95]

86 We make two simplifying assumptions: first assume that ‖φ(xt, yt) − φ(xt, ỹt)‖ is upper bounded by 1. [sent-752, score-1.278]

87 Let (x1, y1), . . . , (xT, yT) be a sequence of examples where xt ∈ Rn, yt ∈ Y and ‖φ(xt, yt) − φ(xt, ỹt)‖ ≤ 1 for all t. [sent-757, score-2.331]

88 Then for any vector u ∈ Rn, the cumulative cost obtained by the max-loss cost-sensitive version of PA-I on the sequence is bounded from above by, ∑_{t=1}^{T} ρ(yt, ŷt) ≤ ‖u‖² + 2C ∑_{t=1}^{T} ℓML(u; (xt, yt)). [sent-759, score-1.394]

89 Proof: We abbreviate ρt = ρ(yt, ŷt) and ℓt = ℓML(wt; (xt, yt)) throughout this proof. [sent-760, score-1.278]

90 τt is defined as, τt = min{ℓt / ‖φ(xt, yt) − φ(xt, ỹt)‖², C}, and due to our assumption on ‖φ(xt, yt) − φ(xt, ỹt)‖² we get that τt ≥ min{ℓt, C}. [sent-762, score-2.57]

91 Again using the definition of τt, we know that τt ℓML(u; (xt, yt)) ≤ C ℓML(u; (xt, yt)) and that τt ‖φ(xt, yt) − φ(xt, ỹt)‖² ≤ ℓt. [sent-766, score-2.556]

92 The loss function used to evaluate the prediction-based variant is a function of yt and ŷt, whereas the loss function used to evaluate the max-loss update essentially ignores ŷt. [sent-774, score-2.064]

93 Thus, on round t, the learning algorithm receives an instance xt and then predicts an output sequence ŷt ∈ Y^m. [sent-803, score-1.144]

94 However, obtaining ŷt in the general case may require as many as k^m evaluations of wt · Φ(xt, y). [sent-821, score-1.181]

95 The Lagrangian of the PA-I optimization problem is, L(w, ξ, τ, λ) = ½‖w − wt‖² + Cξ + τ(1 − ξ − yt (w · xt)) − λξ = ½‖w − wt‖² + ξ(C − τ − λ) + τ(1 − yt (w · xt)), (60) where τ ≥ 0 and λ ≥ 0 are Lagrange multipliers. [sent-1075, score-3.15]

96 This condition can be rewritten as C ‖xt‖² < 1 − yt (wt · xt). [sent-1089, score-1.427]

97 Using Eq. (5), we can rewrite this constraint as 1 − yt (wt · xt) − τ ‖xt‖² ≤ ξ. [sent-1093, score-1.449]

98 The Lagrangian of Eq. (7) equals, L(w, ξ, τ) = ½‖w − wt‖² + Cξ² + τ(1 − ξ − yt (w · xt)), (64) where τ ≥ 0 is a Lagrange multiplier. [sent-1106, score-1.575]

99 Replacing w in Eq. (60) with wt + τ yt xt, we rewrite the Lagrangian as, L(τ) = −(τ²/2)(‖xt‖² + 1/(2C)) + τ(1 − yt (wt · xt)). [sent-1111, score-2.375]

100 Setting the derivative of the above to zero gives, 0 = ∂L(τ)/∂τ = −τ (‖xt‖² + 1/(2C)) + 1 − yt (wt · xt) =⇒ τ = (1 − yt (wt · xt)) / (‖xt‖² + 1/(2C)). [sent-1112, score-2.46]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yt', 0.639), ('wt', 0.542), ('xt', 0.394), ('uniclass', 0.116), ('pa', 0.11), ('multiclass', 0.082), ('ekel', 0.076), ('eshet', 0.076), ('nline', 0.076), ('rammer', 0.076), ('online', 0.069), ('pb', 0.068), ('hwartz', 0.064), ('inger', 0.064), ('passive', 0.061), ('round', 0.059), ('update', 0.058), ('rt', 0.054), ('crammer', 0.054), ('mira', 0.053), ('ml', 0.047), ('cumulative', 0.045), ('rn', 0.043), ('prediction', 0.041), ('lgorithms', 0.041), ('loss', 0.039), ('label', 0.036), ('aggressiveness', 0.036), ('st', 0.034), ('perceptron', 0.032), ('margin', 0.032), ('argmin', 0.031), ('lt', 0.03), ('singer', 0.029), ('mistakes', 0.027), ('mistake', 0.027), ('variants', 0.027), ('rounds', 0.024), ('kivinen', 0.024), ('suffered', 0.024), ('squared', 0.022), ('dekel', 0.022), ('classi', 0.02), ('plugging', 0.02), ('radius', 0.02), ('instance', 0.02), ('sequence', 0.02), ('concreteness', 0.019), ('instantaneous', 0.019), ('cost', 0.018), ('wtr', 0.018), ('ranked', 0.018), ('lagrangian', 0.017), ('multilabel', 0.017), ('schapire', 0.017), ('herbster', 0.017), ('lemma', 0.016), ('regression', 0.016), ('closed', 0.016), ('attained', 0.016), ('bounds', 0.016), ('bound', 0.016), ('tsochantaridis', 0.016), ('sign', 0.015), ('argmax', 0.015), ('novikoff', 0.015), ('sensitive', 0.015), ('get', 0.014), ('updates', 0.014), ('labels', 0.014), ('kkt', 0.014), ('gentile', 0.014), ('keshet', 0.014), ('binary', 0.013), ('nonseparable', 0.013), ('verbatim', 0.013), ('wtk', 0.013), ('warmuth', 0.013), ('predictor', 0.013), ('classifier', 0.012), ('aggressive', 0.012), ('huji', 0.012), ('versions', 0.012), ('predicts', 0.012), ('subtract', 0.012), ('mc', 0.012), ('taskar', 0.012), ('rewrite', 0.012), ('jornal', 0.011), ('norma', 0.011), ('ranking', 0.011), ('blocks', 0.011), ('equals', 0.011), ('ut', 0.011), ('variant', 0.011), ('summing', 0.011), ('suffer', 0.011), ('freund', 0.011), ('constraint', 0.01), ('weight', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

2 0.3976768 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

3 0.21804105 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on texual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

4 0.19862406 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola

Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces

5 0.18560642 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

6 0.10660606 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

7 0.083831705 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

8 0.076013036 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

9 0.066177078 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

10 0.060674656 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

11 0.058841825 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

12 0.050548404 12 jmlr-2006-Active Learning with Feedback on Features and Instances

13 0.050075721 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

14 0.049017046 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

15 0.047554728 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

16 0.03728557 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

17 0.033502541 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

18 0.032747176 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

19 0.032388221 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

20 0.031228773 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.261), (1, 0.513), (2, 0.406), (3, 0.045), (4, -0.107), (5, 0.016), (6, 0.098), (7, -0.018), (8, 0.127), (9, 0.096), (10, 0.019), (11, -0.017), (12, -0.088), (13, 0.04), (14, -0.0), (15, 0.026), (16, -0.025), (17, 0.087), (18, -0.066), (19, -0.074), (20, 0.083), (21, 0.022), (22, 0.096), (23, 0.014), (24, -0.028), (25, -0.036), (26, -0.061), (27, 0.034), (28, -0.02), (29, -0.022), (30, -0.019), (31, 0.037), (32, 0.004), (33, -0.011), (34, -0.019), (35, 0.001), (36, 0.006), (37, 0.035), (38, 0.002), (39, 0.0), (40, 0.001), (41, 0.005), (42, 0.035), (43, 0.051), (44, -0.023), (45, 0.003), (46, -0.036), (47, -0.065), (48, -0.026), (49, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98893517 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

2 0.95362937 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

3 0.61632097 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola

Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces

4 0.57959098 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation 1. Introduction and Statement of the Problem We consider an optimal control problem with continuous state (xt ∈ IRd )t≥0 whose state dynamics is defined according to the controlled differential equation: dxt = f (xt , ut ), dt (1) where the control (ut )t≥0 is a Lebesgue measurable function with values in a control space U. Note that the state-dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a finite-time horizon T > 0. For simplicity, in the paper, we illustrate the case of a terminal reward c 2006 Rémi Munos. M UNOS only: J(x; (ut )t≥0 ) := r(xT ), (2) where r : IRd → IR is the reward function. Extension to the case of general functional of the kind J(x; (ut )t≥0 ) = Z T 0 r(t, xt )dt + R(xT ), (3) with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IRd → U}α , where α ∈ IRm is the parameter. The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as dxt = fα (xt ), (4) dt where fα (xt ) := f (x, πα (t, x)). In the paper, we will make the assumption that fα is C 2 , with bounded derivatives. Let us define the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w.r.t.) 
the parameter α is emphasized. One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 )|x ∼ µ]. In order to find a local maximum of V (α), one may perform a local search, such as a gradient ascent method α ← α + η∇αV (α), (5) with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient ∇αV (α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter: V (α + εei ) −V (α) , ε for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a singledimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only. ∂αi V (α) ≃ 772 P OLICY G RADIENT IN C ONTINUOUS T IME Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: zt := ∇α xt (i.e. zt is defined as a d × m-matrix whose (i, j)-component is the derivative of the ith component of xt w.r.t. α j ). Our smoothness assumption on fα allows to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics on (zt ): dzt = ∇α fα (xt ) + ∇x fα (xt )zt , dt (6) where the coefficients ∇α fα and ∇x fα are, respectively, the derivatives of f w.r.t. the parameter (matrix of size d × m) and the state (matrix of size d × d). The initial condition for z is z0 = 0. When the reward function r is smooth (i.e. continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. (Bensoussan, 1988) or (Yang and Kushner, 1991) for an extension to the stochastic case): ∇αV (α) = ∇x r(xT )zT . (7) Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula: ∇αV (α) = Z T 0 ∇x r(t, xt )zt dt + ∇x R(xT )zT . What is known from the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f . In case the dynamics is known, the state gradient zt = ∇α xt may be computed from (6) along the trajectory and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows to perform the gradient ascent step (5). However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown from the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the available information to the agent at time t is its own control policy πα and the trajectory (xs )0≤s≤t up to time t. At time T , the agent receives the reward r(xT ) and, in this paper, we assume that the gradient ∇r(xT ) is available to the agent. From this point of view, it seems impossible to derive the state gradient zt from (6), since ∇α f and ∇x f are unknown. 
The term ∇x f (xt ) may be approximated by a least squares method from the observation of past states (xs )s≤t , as this will be explained later on in subsection 3.2. However the term ∇α f (xt ) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (xt ) and the state gradient (zt ) by discrete-time stochastic processes (Xt∆ ) and (Zt∆ ) (with ∆ being some discretization time-step). We show how Zt∆ can be computed without the knowledge of ∇α f , but only from information available to the agent. ∆ ∆ We prove the convergence (with probability one) of the gradient estimate ∇x r(XT )ZT derived from the stochastic processes to ∇αV (α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001). 773 M UNOS y ∆ XT ∆ X t2 ∆ Xt 0 fα ∆ x Xt 1 Figure 1: A trajectory (Xt∆ )0≤n≤N and the state dynamics vector fα of the continuous process n (xt )0≤t≤T . Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called score method) in discrete time (see e.g. (Reiman and Weiss, 1986; Glynn, 1987) or more recently in the reinforcement learning literature (Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003)) for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for that problem lies in the fact that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear with the number of decisions). Let us illustrate this problem on a simple 2 dimensional process. Consider the deterministic continuous process (xt )0≤t≤1 defined by the state dynamics: dxt = fα := dt α 1−α , (8) (0 < α < 1) with initial condition x0 = (0 0)′ (where ′ denotes the transpose operator). The performance measure V (α) is the reward at the terminal state at time T = 1, with the reward function being the first coordinate of the state r((x y)′ ) := x. Thus V (α) = r(xT =1 ) = α and its derivative is ∇αV (α) = 1. Let (Xt∆ )0≤n≤N ∈ IR2 be a discrete time stochastic process (the discrete times being {tn = n ∆ n∆}n=0...N with the discretization time-step ∆ = 1/N) that starts from initial state X0 = x0 = (0 0)′ and makes N random moves of length ∆ towards the right (action u1 ) or the top (action u2 ) (see Figure 1) according to the stochastic policy (i.e., the probability of choosing the actions in each state x) πα (u1 |x) = α, πα (u2 |x) = 1 − α. The process is thus defined according to the dynamics: Xt∆ = Xt∆ + n n+1 Un 1 −Un ∆, (9) where (Un )0≤n < N and all ∞ N > 0), there exists a constant C that does not depend on N such that dn ≤ C/N. Thus we may take D2 = C2 /N. Now, from the previous paragraph, ||E[XN ] − xN || ≤ e(N), with e(N) → 0 when N → ∞. This means that ||h − E[h]|| + e(N) ≥ ||XN − xN ||, thus P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||XN − xN || ≥ ε), and we deduce from (31) that 2 /(2C 2 ) P(||XN − xN || ≥ ε) ≤ 2e−N(ε+e(N)) . Thus, for all ε > 0, the series ∑N≥0 P(||XN − xN || ≥ ε) converges. Now, from Borel-Cantelli lemma, we deduce that for all ε > 0, there exists Nε such that for all N ≥ Nε , ||XN − xN || < ε, which ∆→0 proves the almost sure convergence of XN to xN as N → ∞ (i.e. XT −→ xT almost surely). Appendix C. 
Proof of Proposition 8 ′ First, note that Qt = X X ′ − X X is a symmetric, non-negative matrix, since it may be rewritten as 1 nt ∑ (Xs+ − X)(Xs+ − X)′ . s∈S(t) In solving the least squares problem (21), we deduce b = ∆X + AX∆, thus min A,b 1 1 ∑ ∆Xs − b −A(Xs+2 ∆Xs )∆ nt s∈S(t) ≤ 2 = min A 1 ∑ ∆Xs − ∆X − A(Xs+ − X)∆ nt s∈S(t) 1 ∑ ∆Xs− ∆X− ∇x f (X, ut )(Xs+− X)∆ nt s∈S(t) 2 2 . (32) Now, since Xs = X + O(∆) one may obtain like in (19) and (20) (by replacing Xt by X) that: ∆Xs − ∆X − ∇x f (X, ut )(Xs+ − X)∆ = O(∆3 ). (33) We deduce from (32) and (33) that 1 nt ∑ ∇x f (Xt , ut ) − ∇x f (X, ut ) (Xs+ − X)∆ 2 = O(∆6 ). s∈S(t) By developing each component, d ∑ ∇x f (Xt , ut ) − ∇x f (X, ut ) i=1 row i Qt ∇x f (Xt , ut ) − ∇x f (X, ut ) ′ row i = O(∆4 ). Now, from the definition of ν(∆), for all vector u ∈ IRd , u′ Qt u ≥ ν(∆)||u||2 , thus ν(∆)||∇x f (Xt , ut ) − ∇x f (X, ut )||2 = O(∆4 ). Condition (23) yields ∇x f (Xt , ut ) = ∇x f (X, ut ) + o(1), and since ∇x f (Xt , ut ) = ∇x f (X, ut ) + O(∆), we deduce lim ∇x f (Xt , ut ) = ∇x f (Xt , ut ). ∆→0 789 M UNOS References J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001. A. Bensoussan. Perturbation methods in optimal control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson. A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical report CSE-04006, CSEE, OGI School of Science and Engineering, OHSU, 2004. P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987. E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales. application to stochastic optimal control. SIAM journal on Control and Optimization, 43(5):1676–1713, 2005. G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins, 1996. R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. New York: McGraw Hill, 1969. P. E. Kloeden and E. Platen. Numerical Solutions of Stochastic Differential Equations. SpringerVerlag, 1995. H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997. S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006. M. Ledoux. The concentration of measure phenomenon. American Mathematical Society, Providence, RI, 2001. P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003. B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987. M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986. R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. Bradford Book, 1998. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems. MIT Press, pages 1057–1063, 2000. 790 P OLICY G RADIENT IN C ONTINUOUS T IME M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996. R. J. Williams. 
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.
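Returning to the least-squares approximation of ∇x f analyzed in Proposition 8: the sketch below is an illustration under assumptions (my own helper estimate_state_jacobian, the midpoint regressor X_{s+} = X_s + ½ΔX_s as written in (32), and illustrative linear dynamics), not the paper's implementation. It fits ΔX_s ≈ b + A X_{s+} Δ over a window of transitions and reads off A as the Jacobian estimate.

```python
import numpy as np

def estimate_state_jacobian(X, dX, delta):
    """Least-squares estimate of grad_x f from a window of transitions.

    Fits dX_s ~ b + A (X_s + 0.5 * dX_s) * delta over all samples s in the
    window and returns A, an estimate of the Jacobian of the dynamics f.
    X, dX: arrays of shape (n, d); delta: discretization time-step.
    """
    X_mid = X + 0.5 * dX                                    # midpoint regressor X_{s+}
    n, d = X.shape
    design = np.hstack([np.ones((n, 1)), X_mid * delta])    # columns: [1, X_{s+} * delta]
    coeffs, *_ = np.linalg.lstsq(design, dX, rcond=None)    # shape (d + 1, d)
    A = coeffs[1:].T                                        # rows of A act on the state
    return A

# Illustration on linear dynamics f(x) = M x + c, whose Jacobian is exactly M.
# The window states are scattered around a nominal point (as a stochastic
# policy would scatter them), which keeps the regression well conditioned.
rng = np.random.default_rng(1)
M = np.array([[-0.5, 1.0], [0.0, -0.2]])
c = np.array([0.1, 0.3])
delta, n = 0.01, 200
X = np.array([1.0, -1.0]) + 0.1 * rng.standard_normal((n, 2))
dX = (X @ M.T + c) * delta                                  # Euler step of the dynamics
print(np.round(estimate_state_jacobian(X, dX, delta), 3))   # close to M
```

On these dynamics the printed matrix matches M up to O(Δ) discretization error; keeping the window states spread out plays the role of the eigenvalue condition on ν(Δ) in the proposition, which rules out a degenerate Q_t.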

5 0.56379741 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function
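To make the H-loss described here concrete, the following is a small sketch (my own code, assuming unit costs per node and a 0/1 label on every node; the authors' definition may attach per-node cost coefficients): a node is charged only when it is mislabelled while all of its ancestors are labelled correctly, so no extra loss accrues inside the subtree of a mispredicted node.

```python
def h_loss(predicted, true, parent):
    """Hierarchical H-loss with unit costs.

    predicted, true: dicts mapping node -> 0/1 label.
    parent: dict mapping node -> its parent node, or None for a root.
    A node is charged one unit iff it is mislabelled while all of its
    ancestors are labelled correctly.
    """
    loss = 0
    for node in true:
        if predicted[node] == true[node]:
            continue
        ancestor, ancestors_correct = parent[node], True
        while ancestor is not None:
            if predicted[ancestor] != true[ancestor]:
                ancestors_correct = False
                break
            ancestor = parent[ancestor]
        if ancestors_correct:
            loss += 1
    return loss

# Tiny taxonomy: root 0 with children 1 and 2; node 2 has child 3.
parent = {0: None, 1: 0, 2: 0, 3: 2}
true = {0: 1, 1: 0, 2: 1, 3: 1}
pred = {0: 1, 1: 0, 2: 0, 3: 0}      # wrong at node 2 and at its child 3
print(h_loss(pred, true, parent))    # 1: node 3's mistake is not charged
```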

6 0.36409253 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

7 0.23076601 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

8 0.22593784 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

9 0.17048611 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

10 0.16203047 12 jmlr-2006-Active Learning with Feedback on Features and Instances

11 0.15720482 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

12 0.14417417 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

13 0.13852148 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

14 0.12960134 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

15 0.12893414 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

16 0.12309863 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

17 0.11585063 83 jmlr-2006-Sparse Boosting

18 0.1060656 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

19 0.10549423 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

20 0.10203819 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.016), (22, 0.256), (36, 0.064), (40, 0.026), (45, 0.018), (50, 0.042), (63, 0.045), (68, 0.012), (76, 0.016), (78, 0.013), (81, 0.031), (84, 0.032), (90, 0.136), (91, 0.098), (96, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84276158 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis

Author: Enrique Castillo, Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, Amparo Alonso-Betanzos

Abstract: This paper introduces a learning method for two-layer feedforward neural networks based on sensitivity analysis, which uses a linear training algorithm for each of the two layers. First, random values are assigned to the outputs of the first layer; later, these initial values are updated based on sensitivity formulas, which use the weights in each of the layers; the process is repeated until convergence. Since these weights are learnt by solving a linear system of equations, there is an important saving in computational time. The method also gives the local sensitivities of the least-squares errors with respect to input and output data, with no extra computational cost, because the necessary information becomes available without extra calculations. This method, called the Sensitivity-Based Linear Learning Method, can also be used to provide an initial set of weights, which significantly improves the behavior of other learning algorithms. The theoretical basis for the method is given and its performance is illustrated by its application to several examples in which it is compared with several learning algorithms on well-known data sets. The results have shown a learning speed generally faster than other existing methods. In addition, it can be used as an initialization tool for other well-known methods with significant improvements. Keywords: supervised learning, neural networks, linear optimization, least-squares, initialization method, sensitivity analysis

same-paper 2 0.79468822 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

3 0.58749133 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola

Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces

4 0.54318643 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf

Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundreds of thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming

5 0.52112401 93 jmlr-2006-Universal Kernels

Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang

Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels

6 0.49365485 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

7 0.49150416 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

8 0.48575598 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

9 0.48034197 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

10 0.46953958 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

11 0.46394092 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

12 0.45923445 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

13 0.45841756 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

14 0.4558472 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

15 0.45362902 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

16 0.45084465 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

17 0.45033503 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

18 0.44878215 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

19 0.44605809 44 jmlr-2006-Large Scale Transductive SVMs

20 0.44500273 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs    (Special Topic on Machine Learning and Optimization)