nips nips2012 nips2012-293 knowledge-graph by maker-knowledge-mining

293 nips-2012-Relax and Randomize : From Value to Algorithms


Source: pdf

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Relax and Randomize: From Value to Algorithms. Alexander Rakhlin (University of Pennsylvania), Ohad Shamir (Microsoft Research), Karthik Sridharan (University of Pennsylvania). Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. [sent-1, score-0.195]

2 Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. [sent-2, score-0.161]

3 New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. [sent-6, score-0.242]

4 This paper studies the online learning framework, where the goal of the player is to incur small regret while observing a sequence of data on which we place no distributional assumptions. [sent-7, score-0.263]

5 More recently, a non-algorithmic minimax approach has been developed to study the inherent complexities of sequential problems [2, 1, 15, 20]. [sent-9, score-0.208]

6 Just as classical learning theory is concerned with the study of the supremum of an empirical or Rademacher process, online learning is concerned with the study of the supremum of martingale processes. [sent-11, score-0.255]

7 Based on these developments, we exhibit an efficient method for the trace norm matrix completion problem, novel Follow the Perturbed Leader algorithms, and efficient methods for the problems of online transductive learning. [sent-21, score-0.289]

8 A sequence $x_1, \ldots, x_t$ is often denoted by $x_{1:t}$, and the set of all distributions on some set $A$ by $\Delta(A)$. [sent-29, score-0.311]

9 The online protocol dictates that on every round $t = 1, \ldots, T$ the learner and Nature simultaneously choose $f_t \in \mathcal{F}$, $x_t \in \mathcal{X}$, and observe each other’s actions. [sent-43, score-0.211] [sent-46, score-0.559]

11 The learner aims to minimize the regret $\mathrm{Reg}_T \triangleq \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t)$, where $\ell : \mathcal{F} \times \mathcal{X} \to \mathbb{R}$ is a known loss function. [sent-47, score-1.027]
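
To make this definition concrete, here is a minimal Python sketch for a finite comparator class; the function and variable names are ours, not the paper's:

```python
def regret(loss, plays, outcomes, F):
    """Reg_T = sum_t loss(f_t, x_t) - min_{f in F} sum_t loss(f, x_t)."""
    incurred = sum(loss(f_t, x_t) for f_t, x_t in zip(plays, outcomes))
    best_fixed = min(sum(loss(f, x_t) for x_t in outcomes) for f in F)
    return incurred - best_fixed

# Example: two actions, absolute loss against binary outcomes.
loss = lambda f, x: abs(f - x)
print(regret(loss, plays=[0, 1, 1], outcomes=[1, 1, 0], F=[0, 1]))  # -> 1
```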

12 The starting point of our development is the minimax value of the associated online learning game: [sent-51, score-0.57]

13 $$\mathcal{V}_T(\mathcal{F}) = \inf_{q_1 \in \Delta(\mathcal{F})} \sup_{x_1 \in \mathcal{X}} \mathop{\mathbb{E}}_{f_1 \sim q_1} \cdots \inf_{q_T \in \Delta(\mathcal{F})} \sup_{x_T \in \mathcal{X}} \mathop{\mathbb{E}}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \right] \qquad (1)$$ where $\Delta(\mathcal{F})$ is the set of distributions on $\mathcal{F}$. [sent-54, score-1.149]

14 The minimax formulation immediately gives rise to the optimal algorithm that solves the minimax expression at every round $t$ and returns [sent-55, score-1.064]

15 $$\underset{q \in \Delta(\mathcal{F})}{\operatorname{argmin}} \sup_{x_t} \mathop{\mathbb{E}}_{f_t \sim q} \left[ \ell(f_t, x_t) + \inf_{q_{t+1}} \sup_{x_{t+1}} \mathop{\mathbb{E}}_{f_{t+1}} \cdots \inf_{q_T} \sup_{x_T} \mathop{\mathbb{E}}_{f_T} \left[ \sum_{i=t+1}^{T} \ell(f_i, x_i) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{T} \ell(f, x_i) \right] \right]$$
Henceforth, if the quantification in $\inf$ and $\sup$ is omitted, it will be understood that $x_t, f_t, p_t, q_t$ range over $\mathcal{X}, \mathcal{F}, \Delta(\mathcal{X}), \Delta(\mathcal{F})$, respectively. [sent-58, score-2.245]

16 Moreover, $\mathbb{E}_{x_t}$ is with respect to $p_t$ while $\mathbb{E}_{f_t}$ is with respect to $q_t$. [sent-59, score-0.245]

17 The conditional value $\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_t)$ is defined analogously given the prefix $x_1, \ldots, x_t$, with the comparator term $-\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t)$, and $\mathcal{V}_T(\mathcal{F}) = \mathcal{V}_T(\mathcal{F} \mid \{\})$. [sent-73, score-0.463]

18 The minimax optimal algorithm specifying the mixed strategy of the player can be written succinctly as
$$q_t = \underset{q \in \Delta(\mathcal{F})}{\operatorname{argmin}} \sup_{x \in \mathcal{X}} \left\{ \mathop{\mathbb{E}}_{f \sim q}[\ell(f, x)] + \mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x) \right\}$$ [sent-74, score-0.65]

19 Equivalently, the conditional value can be written with the adversary playing mixed strategies:
$$\sup_{p_{t+1}} \mathop{\mathbb{E}}_{x_{t+1}} \cdots \sup_{p_T} \mathop{\mathbb{E}}_{x_T} \left[ \sum_{i=t+1}^{T} \inf_{f_i \in \mathcal{F}} \mathop{\mathbb{E}}_{x_i \sim p_i} \ell(f_i, x_i) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{T} \ell(f, x_i) \right] \qquad (3)$$ [sent-91, score-0.667]

20 i=t+1 fi ∈F xi ∼pi f ∈F i=1 (3) The idea now is to come up with more manageable, yet tight, upper bounds on the conditional value. [sent-92, score-0.158]

21 We call a relaxation admissible if for any $x_1, \ldots, x_T \in \mathcal{X}$, [sent-97, score-0.278]

22 $$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) \geq \inf_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \left\{ \mathop{\mathbb{E}}_{f \sim q}[\ell(f, x)] + \mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t, x) \right\} \qquad (4)$$
for all $t \in [T-1]$, and $\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_T) \geq -\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t)$. [sent-106, score-0.311]

23 A strategy $q$ that minimizes the expression in (4) defines an optimal Meta-Algorithm for an admissible relaxation $\mathrm{Rel}_T$: on round $t$, compute [sent-111, score-0.859]

24 $$q_t = \underset{q \in \Delta(\mathcal{F})}{\operatorname{argmin}} \sup_{x \in \mathcal{X}} \left\{ \mathop{\mathbb{E}}_{f \sim q}[\ell(f, x)] + \mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x) \right\}, \qquad (5)$$
play $f_t \sim q_t$, and receive $x_t$ from the opponent. [sent-114, score-0.697]

25 Importantly, minimization need not be exact: any qt that satisfies the admissibility condition (4) is a valid method, and we will say that such an algorithm is admissible with respect to the relaxation RelT . [sent-115, score-0.54]

26 For any algorithm admissible with respect to $\mathrm{Rel}_T$ (including the Meta-Algorithm), irrespective of the strategy of the adversary,
$$\sum_{t=1}^{T} \mathop{\mathbb{E}}_{f_t \sim q_t} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \leq \mathrm{Rel}_T(\mathcal{F}), \qquad (6)$$
and therefore $\mathbb{E}[\mathrm{Reg}_T] \leq \mathrm{Rel}_T(\mathcal{F})$. [sent-118, score-0.961]

27 Further, if for all $t \in [T]$ the admissible strategies $q_t$ are deterministic, then $\mathrm{Reg}_T \leq \mathrm{Rel}_T(\mathcal{F})$. [sent-121, score-0.374]
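
When both $\mathcal{F}$ and $\mathcal{X}$ are finite, one round of the Meta-Algorithm (5) is a small linear program over the probability simplex. The sketch below is an illustration under that assumption, not code from the paper; `rel_after` is a user-supplied vector of relaxation values:

```python
import numpy as np
from scipy.optimize import linprog

def meta_algorithm_round(loss_matrix, rel_after):
    """One round of the Meta-Algorithm (5) for finite F and X.

    loss_matrix[i, j] -- loss of action f_i against outcome x_j;
    rel_after[j]      -- Rel_T(F | x_1..x_{t-1}, x_j), the relaxation value
                         if the adversary plays x_j next.
    Returns q_t, the distribution over F minimising
        sup_x { E_{f~q} loss(f, x) + Rel_T(F | ..., x) }.
    """
    n_f, n_x = loss_matrix.shape
    # Variables (q_1..q_{n_f}, c): minimise c subject to
    #   q . loss_matrix[:, j] + rel_after[j] <= c  for every outcome j,
    #   q on the probability simplex.
    objective = np.zeros(n_f + 1)
    objective[-1] = 1.0
    A_ub = np.hstack([loss_matrix.T, -np.ones((n_x, 1))])
    b_ub = -np.asarray(rel_after, dtype=float)
    A_eq = np.hstack([np.ones((1, n_f)), np.zeros((1, 1))])
    res = linprog(objective, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, None)] * n_f + [(None, None)])
    return res.x[:n_f]
```

In practice the exact LP can be replaced by any $q_t$ satisfying the admissibility inequality (4), as noted above.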

28 It is known that one can derive regret bounds by coming up with a potential such that the current loss of the player is related to the difference in the potentials at successive steps, and that the regret can be extracted from the final potential. [sent-123, score-0.359]

29 The conditional sequential Rademacher complexity is defined as
$$\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \sup_{\mathbf{x}} \mathop{\mathbb{E}}_{\epsilon_{t+1:T}} \sup_{f \in \mathcal{F}} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell(f, \mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - \sum_{s=1}^{t} \ell(f, x_s) \right] \qquad (7)$$ [sent-130, score-0.81]

30 One may view this complexity as a partially symmetrized version of the sequential Rademacher complexity
$$\mathfrak{R}_T(\mathcal{F}) \triangleq \mathfrak{R}_T(\mathcal{F} \mid \{\}) = \sup_{\mathbf{x}} \mathop{\mathbb{E}}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left[ 2 \sum_{s=1}^{T} \epsilon_s\, \ell(f, \mathbf{x}_s(\epsilon_{1:s-1})) \right] \qquad (8)$$
defined in [15]. [sent-132, score-0.686]

31 Intuitively, for the past we plug in the actually observed moves $x_1, \ldots, x_t$, while for the future we have the worst possible binary tree. [sent-137, score-0.311]

32 In this case, a (tight) upper bound on sequential Rademacher complexity leads to the following relaxation: [sent-142, score-0.157]

33 $$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \inf_{\lambda > 0} \left\{ \frac{1}{\lambda} \log \left( \sum_{f \in \mathcal{F}} \exp\!\left( -\lambda \sum_{i=1}^{t} \ell(f, x_i) \right) \right) + 2 \lambda (T - t) \right\} \qquad (9)$$ [sent-145, score-0.533]

34 Proposition 3. The relaxation (9) is admissible and $\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) \leq \mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t)$. [sent-146, score-0.278]

35 Furthermore, it leads to a parameter-free version of the Exponential Weights algorithm, defined on round $t+1$ by the mixed strategy $q_{t+1}(f) \propto \exp\!\left( -\lambda_t^* \sum_{s=1}^{t} \ell(f, x_s) \right)$ with $\lambda_t^*$ the optimal value in (9). [sent-153, score-0.211]

36 The algorithm’s regret is bounded by $\mathrm{Rel}_T(\mathcal{F}) \leq 2\sqrt{2 T \log |\mathcal{F}|}$. [sent-154, score-0.163]
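
A sketch of this parameter-free Exponential Weights step for a finite class, solving the inner optimisation over $\lambda$ in (9) numerically; the bracketing interval for $\lambda$ is an assumption of ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def relaxation_9(cum_loss, t, T):
    """Evaluate (9) for finite F:
    inf_{lam>0} (1/lam) log sum_f exp(-lam L_t(f)) + 2 lam (T - t),
    where cum_loss[f] = L_t(f) = sum_{i<=t} loss(f, x_i)."""
    cum = np.asarray(cum_loss, dtype=float)
    def objective(lam):
        z = -lam * cum
        logsumexp = z.max() + np.log(np.exp(z - z.max()).sum())  # stable
        return logsumexp / lam + 2.0 * lam * (T - t)
    # Bracketing interval for lambda is an arbitrary choice for illustration.
    res = minimize_scalar(objective, bounds=(1e-8, 10.0), method="bounded")
    return res.x, res.fun  # (optimal lambda, relaxation value)

def exp_weights_mixed_strategy(cum_loss, t, T):
    """q_{t+1}(f) proportional to exp(-lam* L_t(f)), lam* optimal in (9)."""
    cum = np.asarray(cum_loss, dtype=float)
    lam_star, _ = relaxation_9(cum, t, T)
    w = np.exp(-lam_star * (cum - cum.min()))  # shift for numerical stability
    return w / w.sum()
```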

37 Using the notation $\tilde{x}_{t-1} = \sum_{s=1}^{t-1} x_s$, a straightforward upper bound on sequential Rademacher complexity is the following relaxation: [sent-160, score-0.21]

38 $$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \sqrt{ \|\tilde{x}_{t-1}\|^2 + \langle \nabla \|\tilde{x}_{t-1}\|^2, x_t \rangle + C(T - t + 1) } \qquad (10)$$
(It is cumbersome to write out the indices on $\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})$ in (7), so we will instead use $\mathbf{x}_s(\epsilon)$ whenever this doesn’t cause confusion.) [sent-163, score-0.675]

39 The relaxation (10) is admissible and $\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) \leq \mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t)$. [sent-165, score-0.278]

40 It yields the update $f_t = -\nabla \|\tilde{x}_{t-1}\|^2 \Big/ \left( 2 \sqrt{ \|\tilde{x}_{t-1}\|^2 + C(T - t + 1) } \right)$ with regret bound $\mathrm{Rel}_T(\mathcal{F}) \leq \sqrt{2 C T}$. [sent-172, score-0.334]

41 There are two closely related protocols for the online interaction between the learner and Nature, so let us outline them. [sent-178, score-0.159]

42 The “proper” version of supervised learning follows the protocol presented in Section 2: at time $t$, the learner selects $f_t \in \mathcal{F}$, Nature simultaneously selects $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, and the learner suffers the loss $\ell(f_t(x_t), y_t)$. [sent-179, score-0.874]

43 The “improper” version is as follows: at time $t$, Nature chooses $x_t \in \mathcal{X}$ and presents it to the learner as “side information”, the learner then picks $\hat{y}_t \in \mathcal{Y}$, and Nature simultaneously chooses $y_t \in \mathcal{Y}$. [sent-180, score-0.991]

44 In the improper version, the loss of the learner is $\ell(\hat{y}_t, y_t)$, and it is easy to see that we may equivalently state this protocol as the learner choosing any function $f_t \in \mathcal{Y}^{\mathcal{X}}$ (not necessarily in $\mathcal{F}$), and Nature simultaneously choosing $(x_t, y_t)$. [sent-181, score-0.875]

45 For the improper version, we may write the value in (1) as [sent-183, score-0.413]

46 $$\mathcal{V}_T(\mathcal{F}) = \sup_{x_1 \in \mathcal{X}} \inf_{q_1 \in \Delta(\mathcal{Y})} \sup_{y_1 \in \mathcal{Y}} \mathop{\mathbb{E}}_{\hat{y}_1 \sim q_1} \cdots \sup_{x_T \in \mathcal{X}} \inf_{q_T \in \Delta(\mathcal{Y})} \sup_{y_T \in \mathcal{Y}} \mathop{\mathbb{E}}_{\hat{y}_T \sim q_T} \left[ \sum_{t=1}^{T} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x_t), y_t) \right]$$
and a relaxation $\mathrm{Rel}_T$ is admissible if for any $(x_1, y_1), \ldots, (x_T, y_T) \in \mathcal{X} \times \mathcal{Y}$, [sent-186, score-1.705]

47 $$\sup_{x \in \mathcal{X}} \inf_{q \in \Delta(\mathcal{Y})} \sup_{y \in \mathcal{Y}} \left\{ \mathop{\mathbb{E}}_{\hat{y} \sim q}\, \ell(\hat{y}, y) + \mathrm{Rel}_T\!\left( \mathcal{F} \,\middle|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, y) \right) \right\} \leq \mathrm{Rel}_T\!\left( \mathcal{F} \,\middle|\, \{(x_i, y_i)\}_{i=1}^{t} \right) \qquad (11)$$
Let us now focus on binary prediction. [sent-189, score-0.71]

48 Suppose the loss function is $\ell(\hat{y}, y) = |\hat{y} - y|$, and consider the “mixed” conditional Rademacher complexity
$$\sup_{\mathbf{x}} \mathop{\mathbb{E}}_{\epsilon} \sup_{f \in \mathcal{F}} \left[ 2 \sum_{i=1}^{T-t} \epsilon_i f(\mathbf{x}_i(\epsilon)) - \sum_{i=1}^{t} |f(x_i) - y_i| \right] \qquad (13)$$
as a possible relaxation. [sent-195, score-0.599]

49 The following proposition gives a relaxation and an algorithm which achieves an $O\!\left(\sqrt{\mathrm{Ldim}(\mathcal{F})\, T \log T}\right)$ regret bound. [sent-217, score-0.291]

50 4 Randomized Algorithms and Follow the Perturbed Leader We now develop a class of admissible randomized methods that arise through sampling. [sent-223, score-0.265]

51 The ideas of random playout have been discussed in previous literature for estimating the utility of a move in a game (see also [3]). [sent-228, score-0.165]

52 In such cases, the relaxation RelT does not involve the supremum over a tree, and the randomized method only needs to draw a sequence of coin flips and compute a solution to an optimization problem slightly more complicated than ERM. [sent-232, score-0.35]

53 Our analysis shows that it arises through a natural relaxation based on the sequential (and thus the classical) Rademacher complexity, coupled with the random playout idea. [sent-235, score-0.294]

54 As a new algorithmic contribution, we provide a version of the FPL algorithm for the case of the decision sets being $\ell_2$ balls, with a regret bound that is independent of the dimension. [sent-236, score-0.184]

55 Suppose there exists a distribution $D \in \Delta(\mathcal{X})$ and constant $C \geq 2$ such that, for any $\epsilon_{t+1}, \ldots, \epsilon_T \in \{\pm 1\}$,
$$\sup_{p \in \Delta(\mathcal{X})} \mathop{\mathbb{E}}_{x_t \sim p} \sup_{f \in \mathcal{F}} \left[ C A_{t+1}(f) - L_{t-1}(f) + \mathop{\mathbb{E}}_{x \sim p}[\ell(f, x)] - \ell(f, x_t) \right] \leq \mathop{\mathbb{E}}_{\epsilon_t,\, x_t \sim D} \sup_{f \in \mathcal{F}} \left[ C A_t(f) - L_{t-1}(f) \right]$$
where $A_t(f) = \sum_{i=t}^{T} \epsilon_i \ell(f, x_i)$, $L_{t-1}(f) = \sum_{i=1}^{t-1} \ell(f, x_i)$, and the $\epsilon_t$'s are i.i.d. Rademacher random variables. [sent-251, score-1.291]

56 The corresponding relaxation is
$$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathop{\mathbb{E}}_{x_{t+1}, \ldots, x_T \sim D}\, \mathop{\mathbb{E}}_{\epsilon} \sup_{f \in \mathcal{F}} \left[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f, x_i) - \sum_{i=1}^{t} \ell(f, x_i) \right] \qquad (16)$$
which is a partially symmetrized version of the classical Rademacher averages. [sent-261, score-0.439]

57 The proof of admissibility for the randomized methods is quite curious – the forecaster can be seen as mimicking the sequential Rademacher complexity by sampling from the “equivalently bad” classical Rademacher complexity under the specific distribution D specified by the above assumption. [sent-262, score-0.348]

58 The relaxation (16) is admissible, and a randomized strategy that ensures admissibility is given by: at time $t$, draw $x_{t+1}, \ldots, x_T \sim D$ and Rademacher signs $\epsilon_{t+1}, \ldots, \epsilon_T$, then play a minimizer of the resulting perturbed objective (cf. (20)). [sent-265, score-0.341]

59 In the transductive case, the $x_t$’s are pre-specified before the game, and in the static expert case they are effectively absent. [sent-272, score-0.464]

60 We easily see that in these cases, the expected regret bound is simply two times the transductive Rademacher complexity. [sent-274, score-0.283]

61 Suppose $\mathcal{X}$ is the unit ball in some norm $\|\cdot\|$ in a vector space $B$, and $\mathcal{F}$ is the unit ball in the dual norm $\|\cdot\|_*$. [sent-276, score-0.218]

62 There exists a distribution $D \in \Delta(\mathcal{X})$ and constant $C \geq 2$ such that for any $w \in B$,
$$\sup_{x \in \mathcal{X}} \mathop{\mathbb{E}}_{\epsilon_t} \|w + 2 \epsilon_t x\| \leq \mathop{\mathbb{E}}_{x_t \sim D} \mathop{\mathbb{E}}_{\epsilon_t} \|w + C \epsilon_t x_t\| \qquad (19)$$
At round $t$, the generic algorithm specified by Lemma 18 draws fresh Rademacher random variables $\epsilon$ and $x_{t+1}, \ldots, x_T \sim D$ and picks [sent-278, score-1.594]

63 $$f_t = \underset{f \in \mathcal{F}}{\operatorname{argmin}} \sup_{x \in \mathcal{X}} \left\{ \langle f, x \rangle + \left\| C \sum_{i=t+1}^{T} \epsilon_i x_i - \sum_{i=1}^{t-1} x_i - x \right\| \right\} \qquad (20)$$
We now look at the $\ell_2/\ell_2$ and $\ell_1/\ell_\infty$ cases and provide corresponding randomized algorithms. [sent-281, score-0.702]

64 Let $\mathcal{F} \subset \mathbb{R}^N$ be the $\ell_1$ unit ball and $\mathcal{X}$ the (dual) $\ell_\infty$ unit ball in $\mathbb{R}^N$. [sent-283, score-0.156]

65 The update in Equation (20), however, is not in a convenient form, and the following lemma shows a simple Follow the Perturbed Leader style algorithm with the associated regret bound. [sent-295, score-0.177]

66 Consider the randomized algorithm that at each round $t$ freshly draws Rademacher random variables $\epsilon_{t+1}, \ldots, \epsilon_T$ and $x_{t+1}, \ldots, x_T \sim D^N$, and picks $f_t = \underset{f \in \mathcal{F}}{\operatorname{argmin}} \left\langle f,\; \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i x_i \right\rangle$ where $C = 6 / \mathbb{E}_{x \sim D} |x|$. [sent-298, score-0.243] [sent-304, score-0.393]
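
A sketch of one round of this $\ell_1/\ell_\infty$ Follow-the-Perturbed-Leader method. We take $D$ to be uniform on $\{\pm 1\}$ per coordinate, so $\mathbb{E}_{x \sim D}|x| = 1$ and the constant becomes $C = 6$; both the choice of $D$ and our reading of the garbled constant are assumptions:

```python
import numpy as np

def fpl_l1_round(past_xs, t, T, N, rng, C=6.0):
    """past_xs: array of shape (t-1, N) with the adversary's moves so far.
    Freshly draw eps_{t+1..T} and x_{t+1..T} ~ D^N, then play the vertex of
    the l1 ball minimising <f, sum_past - C * sum_i eps_i x_i>."""
    eps = rng.choice([-1.0, 1.0], size=(T - t, 1))
    future = rng.choice([-1.0, 1.0], size=(T - t, N))  # D = uniform on {-1,+1}
    g = past_xs.sum(axis=0) - C * (eps * future).sum(axis=0)
    # A linear function over the l1 ball is minimised at a signed basis vector.
    f = np.zeros(N)
    j = int(np.argmax(np.abs(g)))
    f[j] = -np.sign(g[j]) if g[j] != 0 else 1.0
    return f

# Usage: rng = np.random.default_rng(0); f1 = fpl_l1_round(np.zeros((0, 5)), 1, 10, 5, rng)
```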

68 Consider the randomized algorithm that at each round $t$ freshly draws $x_{t+1}, \ldots, x_T \sim D$ and picks $f_t = \left( -\sum_{i=1}^{t-1} x_i + C \sum_{i=t+1}^{T} x_i \right) \big/ L$, where $C = 4\sqrt{2}$ and the scaling factor $L = \left( \left\| -\sum_{i=1}^{t-1} x_i + C \sum_{i=t+1}^{T} x_i \right\|_2^2 + 1 \right)^{1/2}$. [sent-316, score-0.352] [sent-319, score-0.496]

70 The randomized algorithm enjoys a bound on the expected regret given by $\mathbb{E}[\mathrm{Reg}_T] \leq C\, \mathbb{E}_{x_1, \ldots, x_T \sim D} \left\| \sum_{t=1}^{T} x_t \right\|_2$, which is independent of the dimension. [sent-320, score-0.275]
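
A sketch of one round of the $\ell_2$ variant. Drawing the future points uniformly from the unit sphere is our assumption for $D$, and the update follows our reconstruction of the garbled lemma above:

```python
import numpy as np

def fpl_l2_round(past_xs, t, T, d, rng):
    """past_xs: array of shape (t-1, d). Draw x_{t+1..T} uniformly on the
    unit sphere, form w = -sum_past + C * sum_future with C = 4*sqrt(2),
    and scale by L = (||w||_2^2 + 1)^(1/2) so f_t stays in the unit ball."""
    future = rng.normal(size=(T - t, d))
    future /= np.linalg.norm(future, axis=1, keepdims=True)  # uniform on sphere
    C = 4.0 * np.sqrt(2.0)
    w = -past_xs.sum(axis=0) + C * future.sum(axis=0)
    return w / np.sqrt(np.dot(w, w) + 1.0)
```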

71 5 Static Experts with Convex Losses and Transductive Online Learning We show how to recover a variant of the R2 forecaster of [7], for static experts and transductive online learning. [sent-327, score-0.294]

72 At each round, the learner makes a prediction $q_t \in [-1, 1]$, observes the outcome $y_t \in [-1, 1]$, and suffers a convex $L$-Lipschitz loss $\ell(q_t, y_t)$. [sent-328, score-0.855]

73 Regret is defined as the difference between the learner’s cumulative loss and $\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f[t], y_t)$, where $\mathcal{F} \subset [-1, 1]^T$ can be seen as a set of static experts. [sent-329, score-0.447]

74 The transductive setting is equivalent to this: the sequence of $x_t$’s is known before the game starts, and hence the effective function class is once again a subset of $[-1, 1]^T$. [sent-330, score-0.506]

75 It turns out that in these cases, sequential Rademacher complexity becomes the classical Rademacher complexity (see [17]), which can thus be taken as a relaxation. [sent-331, score-0.183]

76 For general convex loss, one possible admissible relaxation is just a conditional version of the classical Rademacher averages: [sent-333, score-0.382]

77 $$\mathrm{Rel}_T(\mathcal{F} \mid y_1, \ldots, y_t) = \mathop{\mathbb{E}}_{\epsilon_{t+1:T}} \sup_{f \in \mathcal{F}} \left[ 2L \sum_{s=t+1}^{T} \epsilon_s f[s] - L_t(f) \right] \qquad (21)$$
where $L_t(f) = \sum_{s=1}^{t} \ell(f[s], y_s)$. [sent-336, score-0.45]
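
When $\mathcal{F}$ is a finite set of static experts, the relaxation (21) can be estimated by straightforward Monte Carlo over the Rademacher signs; a minimal sketch (the sample count is arbitrary):

```python
import numpy as np

def rel_21_estimate(F, cum_loss, t, L, rng, n_samples=500):
    """Estimate (21): E_eps sup_f [ 2L sum_{s>t} eps_s f[s] - L_t(f) ].
    F: array (num_experts, T) with entries in [-1, 1];
    cum_loss[f] = L_t(f) = sum_{s<=t} loss(f[s], y_s)."""
    T = F.shape[1]
    total = 0.0
    for _ in range(n_samples):
        eps = rng.choice([-1.0, 1.0], size=T - t)
        total += np.max(2.0 * L * (F[:, t:] @ eps) - cum_loss)
    return total / n_samples
```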

78 If (21) is used as a relaxation, the calculation of the prediction $\hat{y}_t$ involves a supremum over $f \in \mathcal{F}$ with (potentially nonlinear) loss functions of instances seen so far. [sent-337, score-0.373]

79 More precisely, the new loss is $\ell'(\hat{y}, r) = r \cdot \hat{y}$; we first pick the prediction $\hat{y}_t$ (deterministically) and the adversary picks $r_t$ (corresponding to $r_t = \partial\ell(\hat{y}_t, y_t)$ for the choice of $y_t$ picked by the adversary). [sent-340, score-0.922]

80 Now note that $\ell'$ is indeed convex in its first argument and is $L$-Lipschitz because $|\partial\ell(\hat{y}_t, y_t)| \leq L$. [sent-341, score-0.252]

81 This is a one-dimensional convex learning game where we pick $\hat{y}_t$ and the regret is given by the right-hand side of (22). [sent-342, score-0.453]

82 We use
$$\mathrm{Rel}_T\!\left( \mathcal{F} \mid \partial\ell(\hat{y}_1, y_1), \ldots, \partial\ell(\hat{y}_t, y_t) \right) = \mathop{\mathbb{E}}_{\epsilon_{t+1:T}} \sup_{f \in \mathcal{F}} \left[ 2L \sum_{i=t+1}^{T} \epsilon_i f[i] - \sum_{i=1}^{t} \partial\ell(\hat{y}_i, y_i) \cdot f[i] \right] \qquad (23)$$
as a linearized form of (21). [sent-346, score-0.528]

83 At round $t$, the prediction of the algorithm is then
$$\hat{y}_t = \mathop{\mathbb{E}}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \left( \sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial\ell(\hat{y}_i, y_i) f[i] + \frac{1}{2} f[t] \right) - \sup_{f \in \mathcal{F}} \left( \sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial\ell(\hat{y}_i, y_i) f[i] - \frac{1}{2} f[t] \right) \right] \qquad (24)$$ [sent-347, score-0.698]

84 Lemma 11. Further, the regret of the strategy is bounded as $\mathrm{Reg}_T \leq 2L\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f[t] \right]$. [sent-353, score-0.194]

85 While we need to evaluate the expectation over the $\epsilon$’s on each round, we can estimate $\hat{y}_t$ by sampling $\epsilon$’s and, using McDiarmid’s inequality, argue that the estimate is close to $\hat{y}_t$ with high probability. [sent-355, score-0.454]

86 The randomized prediction is now given simply as: on round $t$, draw $\epsilon_{t+1}, \ldots, \epsilon_T$ and output the quantity inside the expectation in (24) for this single draw; this defines Equation (25). [sent-356, score-0.246]

87 The relaxation specified in Equation (21) is admissible w.r.t. the randomized prediction strategy specified in Equation (25), and enjoys the bound $\mathbb{E}[\mathrm{Reg}_T] \leq 2L\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f[t] \right]$. [sent-360, score-0.278] [sent-363, score-0.192]
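
A sketch of the single-sample randomized prediction for a finite set of static experts, replacing the expectation in (24) by one draw of $\epsilon$ as in (25); the array layout and names are ours:

```python
import numpy as np

def randomized_prediction(F, subgrads, t, L, rng):
    """F: array (num_experts, T) with entries in [-1, 1];
    subgrads[i] = dl(yhat_{i+1}, y_{i+1}) for the first t-1 rounds.
    Returns yhat_t for round t (1-indexed)."""
    T = F.shape[1]
    eps = rng.choice([-1.0, 1.0], size=T - t)      # eps_{t+1..T}
    past = np.asarray(subgrads, dtype=float)       # length t-1
    base = F[:, t:] @ eps - (F[:, :t - 1] @ past) / (2.0 * L)
    y_plus = np.max(base + 0.5 * F[:, t - 1])
    y_minus = np.max(base - 0.5 * F[:, t - 1])
    return float(y_plus - y_minus)                 # lies in [-1, 1]
```

A single draw suffices for the expected-regret bound; averaging several draws approximates (24), in line with the McDiarmid argument above.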

89 At each round t the adversary picks an entry in an m × n matrix and a value yt for that entry. [sent-365, score-0.438]

90 The learner then chooses a predicted value $\hat{y}_t$, and suffers the loss $\ell(\hat{y}_t, y_t)$, assumed to be $\rho$-Lipschitz. [sent-366, score-0.585]

91 We define the regret with respect to the class $\mathcal{F}$ of all matrices whose trace norm is at most $B$ (namely, we can use any such matrix to predict just by returning its relevant entry at each round). [sent-367, score-0.159]

92 We show how to develop an algorithm whose regret is bounded by the (transductive) Rademacher complexity of $\mathcal{F}$, which by Theorem 6 of [18] is $O(B\sqrt{n})$, independent of $T$. [sent-370, score-0.203]

93 Moreover, in [7], it was shown how one can convert algorithms with such guarantees to obtain the same regret even in a “fully” online case, where the set of entry locations is unknown in advance. [sent-371, score-0.238]

94 As a simple corollary of Lemma 12, together with the bound on the Rademacher complexity mentioned earlier, we get that the expected regret of either variant is $O\!\left( B \rho\, (\sqrt{m} + \sqrt{n}) \right)$. [sent-384, score-0.206]

95 7 Conclusion In [2, 1, 15, 20] the minimax value of the online learning game has been analyzed and nonconstructive bounds on the value have been provided. [sent-385, score-0.305]

96 In this paper, we provide a general constructive recipe for deriving new (and old) online learning algorithms, using techniques from the apparently non-constructive minimax analysis. [sent-386, score-0.237]

97 The recipe is rather simple: we start with the notion of conditional sequential Rademacher complexity, and find an “admissible” relaxation which upper bounds it. [sent-387, score-0.301]

98 This relaxation immediately leads to an online learning algorithm, as well as to an associated regret guarantee. [sent-388, score-0.342]

99 A stochastic view of optimal regret through minimax duality. [sent-397, score-0.257]

100 Optimal strategies and minimax lower bounds for online convex games. [sent-405, score-0.24]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('relt', 0.573), ('xt', 0.311), ('rademacher', 0.232), ('yt', 0.227), ('sup', 0.223), ('qt', 0.218), ('ft', 0.168), ('admissible', 0.156), ('inf', 0.152), ('regret', 0.141), ('regt', 0.136), ('leader', 0.127), ('relaxation', 0.122), ('transductive', 0.117), ('minimax', 0.116), ('round', 0.109), ('playout', 0.105), ('ldim', 0.09), ('littlestone', 0.09), ('perturbed', 0.088), ('randomized', 0.086), ('lt', 0.08), ('learner', 0.08), ('online', 0.079), ('vt', 0.076), ('supremum', 0.07), ('xi', 0.07), ('sequential', 0.067), ('game', 0.06), ('yi', 0.056), ('rakhlin', 0.055), ('xs', 0.053), ('sridharan', 0.05), ('rt', 0.049), ('picks', 0.048), ('unorthodox', 0.045), ('admissibility', 0.044), ('colt', 0.043), ('recipe', 0.042), ('ball', 0.04), ('fpl', 0.04), ('complexity', 0.04), ('improper', 0.038), ('unit', 0.038), ('argmin', 0.037), ('follow', 0.036), ('classical', 0.036), ('static', 0.036), ('relaxations', 0.036), ('lemma', 0.036), ('adversary', 0.036), ('completion', 0.035), ('forecaster', 0.035), ('ips', 0.033), ('loss', 0.032), ('norm', 0.031), ('abernethy', 0.031), ('mirror', 0.031), ('strategy', 0.031), ('coin', 0.03), ('eft', 0.03), ('freshly', 0.03), ('nonconstructive', 0.03), ('balls', 0.028), ('proposition', 0.028), ('pt', 0.027), ('prediction', 0.027), ('experts', 0.027), ('trace', 0.027), ('conditional', 0.025), ('convex', 0.025), ('bound', 0.025), ('player', 0.025), ('complexities', 0.025), ('upper', 0.025), ('games', 0.024), ('draw', 0.024), ('enjoys', 0.023), ('banach', 0.023), ('supf', 0.023), ('arise', 0.023), ('protocol', 0.023), ('bounded', 0.022), ('linearized', 0.022), ('symmetrized', 0.022), ('averages', 0.022), ('sphere', 0.021), ('surface', 0.02), ('bounds', 0.02), ('manageable', 0.02), ('suffers', 0.019), ('style', 0.018), ('fi', 0.018), ('sequence', 0.018), ('version', 0.018), ('draws', 0.018), ('entry', 0.018), ('pennsylvania', 0.017), ('involves', 0.017), ('shamir', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

2 0.29656285 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.2677663 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show $O(T^{1/2} \log T)$ regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.25134164 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an $O(1/\sqrt{T})$ convergence rate for general convex optimization, and an $O(\ln T / T)$ rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

5 0.21139567 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time $O(t^2)$ up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

6 0.19394943 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

7 0.16874535 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

8 0.14748119 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

9 0.14720088 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

10 0.14198744 292 nips-2012-Regularized Off-Policy TD-Learning

11 0.1153986 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

12 0.10756842 305 nips-2012-Selective Labeling via Error Bound Minimization

13 0.09902367 208 nips-2012-Matrix reconstruction with the local max norm

14 0.096158907 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

15 0.096124932 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

16 0.092484906 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

17 0.090713605 102 nips-2012-Distributed Non-Stochastic Experts

18 0.090008423 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

19 0.08973515 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

20 0.083073184 283 nips-2012-Putting Bayes to sleep


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, -0.04), (2, 0.246), (3, 0.309), (4, 0.169), (5, -0.113), (6, -0.026), (7, -0.028), (8, -0.032), (9, -0.002), (10, 0.106), (11, -0.007), (12, -0.044), (13, 0.044), (14, -0.03), (15, 0.033), (16, 0.002), (17, 0.001), (18, -0.019), (19, 0.007), (20, -0.024), (21, -0.057), (22, 0.004), (23, -0.092), (24, -0.013), (25, 0.061), (26, 0.042), (27, 0.028), (28, 0.003), (29, 0.008), (30, -0.07), (31, -0.051), (32, -0.033), (33, 0.024), (34, -0.013), (35, 0.08), (36, -0.015), (37, -0.027), (38, -0.022), (39, -0.01), (40, -0.005), (41, -0.021), (42, 0.005), (43, 0.014), (44, -0.006), (45, 0.059), (46, -0.036), (47, -0.029), (48, -0.009), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96431494 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

2 0.82119983 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.80293828 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

4 0.78053063 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show $O(T^{1/2} \log T)$ regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

5 0.75056708 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an $O(1/\sqrt{T})$ convergence rate for general convex optimization, and an $O(\ln T / T)$ rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

6 0.71192503 292 nips-2012-Regularized Off-Policy TD-Learning

7 0.69593388 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

8 0.66982335 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

9 0.64541489 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

10 0.61194563 11 nips-2012-A Marginalized Particle Gaussian Process Regression

11 0.60978538 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

12 0.56835085 305 nips-2012-Selective Labeling via Error Bound Minimization

13 0.56802946 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

14 0.54215556 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

15 0.50537801 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

16 0.49098754 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

17 0.47974268 161 nips-2012-Interpreting prediction markets: a stochastic approach

18 0.47141963 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

19 0.46373659 102 nips-2012-Distributed Non-Stochastic Experts

20 0.44664186 283 nips-2012-Putting Bayes to sleep


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.015), (7, 0.198), (17, 0.012), (21, 0.017), (38, 0.123), (42, 0.06), (53, 0.025), (54, 0.027), (55, 0.016), (68, 0.011), (74, 0.029), (76, 0.115), (80, 0.173), (84, 0.011), (92, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86050344 170 nips-2012-Large Scale Distributed Deep Networks

Author: Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng

Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. 1

same-paper 2 0.84972614 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

3 0.80633265 194 nips-2012-Learning to Discover Social Circles in Ego Networks

Author: Jure Leskovec, Julian J. Mcauley

Abstract: Our personal social networks are big and cluttered, and currently there is no good way to organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g. ‘circles’ on Google+, and ‘lists’ on Facebook and Twitter), however they are laborious to construct and must be updated whenever a user’s network grows. We define a novel machine learning task of identifying users’ social circles. We pose the problem as a node clustering problem on a user’s ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure as well as user profile information. For each circle we learn its members and the circle-specific user profile similarity metric. Modeling node membership to multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter for all of which we obtain hand-labeled ground-truth. 1

4 0.80327165 358 nips-2012-Value Pursuit Iteration

Author: Amir M. Farahmand, Doina Precup

Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1

5 0.76969045 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

6 0.76862323 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

7 0.76084489 200 nips-2012-Local Supervised Learning through Space Partitioning

8 0.75841808 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

9 0.75505155 100 nips-2012-Discriminative Learning of Sum-Product Networks

10 0.75399524 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

11 0.75271523 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

12 0.75236076 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

13 0.75134158 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

14 0.74996442 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.7473222 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

16 0.74687064 65 nips-2012-Cardinality Restricted Boltzmann Machines

17 0.74572456 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.74444205 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

19 0.7425108 330 nips-2012-Supervised Learning with Similarity Functions

20 0.74179673 292 nips-2012-Regularized Off-Policy TD-Learning