nips nips2012 nips2012-216 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Mirror Descent Meets Fixed Share (and feels no regret) Nicolò Cesa-Bianchi Università degli Studi di Milano nicolo. [sent-1, score-0.046]
2 Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. [sent-9, score-0.646]
3 This is done using either a carefully designed projection or by a weight sharing technique. [sent-10, score-0.048]
4 Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. [sent-11, score-0.364]
5 Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. [sent-12, score-0.214]
6 1 Introduction Online convex optimization is a sequential prediction paradigm in which, at each time step, the learner chooses an element from a fixed convex set S and then is given access to a convex loss function defined on the same set. [sent-13, score-0.295]
7 Many problems such as prediction with expert advice, sequential investment, and online regression/classification can be viewed as special cases of this general framework. [sent-15, score-0.189]
8 The standard notion of regret is the difference between the learner’s cumulative loss and the cumulative loss of the single best element in S. [sent-17, score-0.497]
9 A much harder criterion to minimize is shifting regret, which is defined as the difference between the learner’s cumulative loss and the cumulative loss of an arbitrary sequence of elements in S. [sent-18, score-0.486]
10 Shifting regret bounds are typically expressed in terms of the shift, a notion of regularity measuring the length of the trajectory in S described by the comparison sequence (i. [sent-19, score-0.48]
11 e., the sequence of elements against which the regret is evaluated). [sent-21, score-0.3]
12 In online convex optimization, shifting regret bounds for convex subsets S ⊆ Rd are obtained for the projected online mirror descent (or follow-the-regularized-leader) algorithm. [sent-22, score-0.893]
13 In this case the shift is typically computed in terms of the p-norm of the difference of consecutive elements in the comparison sequence —see [1, 2] and [3]. [sent-23, score-0.108]
14 In [1] shifting bounds are shown for projected mirror descent with entropic regularizers using a 1-norm to measure the shift. [sent-25, score-0.514]
15 However, without using entropic regularizers it is not clear how to achieve a logarithmic dependence on the dimension, which is one of the advantages of working in the simplex. [sent-27, score-0.107]
16 , [4, 5, 1, 6, 7], and it is well known that exponential weights with weight sharing, which corresponds to the fixed-share algorithm of [4], achieves a good shifting bound in this setting. [sent-31, score-0.284]
17 In [6] the authors introduce a generalization of the fixed-share algorithm, and prove various shifting bounds for any trajectory in the simplex. [sent-32, score-0.31]
18 However, their bounds are expressed using a quantity that corresponds to a proper shift only for trajectories on the simplex corners. [sent-33, score-0.183]
19 In this paper we offer a unified analysis of mirror descent, fixed share, and the generalized fixed share of [6] for the setting of online convex optimization in the simplex. [sent-34, score-0.3]
20 Our bounds are expressed in terms of a notion of shift based on the total variation distance. [sent-35, score-0.171]
21 Our analysis relies on a generalized notion of shifting regret which includes, as special cases, related notions of regret such as adaptive regret, discounted regret, and regret with time-selection functions. [sent-36, score-1.181]
22 Perhaps surprisingly, we show that projected mirror descent and fixed share achieve essentially the same generalized regret bound. [sent-37, score-0.511]
23 Finally, we show that widespread techniques in online learning, such as improvements for small losses and adaptive tuning of parameters, are all easily captured by our analysis. [sent-38, score-0.199]
24 2 Preliminaries For simplicity, we derive our results in the setting of online linear optimization. [sent-39, score-0.065]
25 As we show in the supplementary material, these results can be easily extended to the more general setting of online convex optimization through a standard linearization step. [sent-40, score-0.133]
26 Online linear optimization may be cast as a repeated game between the forecaster and the environment as follows. [sent-41, score-0.176]
27 We use ∆_d to denote the simplex {q ∈ [0, 1]^d : ‖q‖_1 = 1}. [sent-42, score-0.058]
28 The goal of the forecaster is to minimize the accumulated loss, e.g. [sent-57, score-0.157]
29 In the now classical problem of prediction with expert advice, the goal of the forecaster is to compete with the best fixed component (often called “expert”) chosen in hindsight, that is, with min_{i=1,...,d} Σ_{t=1}^T ℓ_{i,t}; [sent-60, score-0.334]
30 or even to compete with a richer class of sequences of components. [sent-63, score-0.178]
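To make the comparison concrete, here is a minimal sketch (not from the paper) that computes a forecaster's cumulative loss and its regret against the best fixed expert in hindsight; the uniform forecaster and the random losses are placeholders used only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
losses = rng.uniform(0.0, 1.0, size=(T, d))          # loss vectors l_t in [0, 1]^d

p = np.full(d, 1.0 / d)                               # a fixed uniform forecaster (placeholder)
forecaster_loss = float((losses @ p).sum())           # sum_t p^T l_t
best_expert_loss = float(losses.sum(axis=0).min())    # min_i sum_t l_{i,t}
print("regret against the best fixed expert:", forecaster_loss - best_expert_loss)
```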
31 We start by introducing our main algorithmic tool, described in Figure 1, a share algorithm whose formulation generalizes the seemingly unrelated formulations of the algorithms studied in [4, 1, 6]. [sent-65, score-0.121]
32 In all examples discussed in this paper, these mixing functions are quite simple, but working with such a general model makes the main ideas more transparent. [sent-67, score-0.035]
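A minimal sketch of this two-step structure is given below: an exponential-weights loss update producing pre-weights v_{t+1}, followed by an arbitrary mixing step ψ_{t+1}. This is an illustration, not the paper's pseudocode; the names (generalized_share, mixing_fn) and the uniform initialization are assumptions, and the mixing function here only sees the current pre-weights, whereas the paper's mixing functions may also depend on past pre-weights.

```python
import numpy as np

def generalized_share(losses, eta, mixing_fn):
    """Sketch of a generalized share algorithm.

    losses    : array of shape (T, d) with entries in [0, 1]
    eta       : learning rate > 0
    mixing_fn : maps (t, pre-weights v_{t+1}) to new weights p_{t+1} in the simplex;
                different choices recover projected, fixed-share, or generalized
                shared updates.
    """
    T, d = losses.shape
    p = np.full(d, 1.0 / d)                  # p_1 uniform (assumed initialization)
    total_loss = 0.0
    for t in range(T):
        loss_t = losses[t]
        total_loss += float(p @ loss_t)      # forecaster suffers p_t^T l_t
        v = p * np.exp(-eta * loss_t)        # loss update: v_{i,t+1} proportional to p_{i,t} e^{-eta l_{i,t}}
        v /= v.sum()
        p = mixing_fn(t, v)                  # sharing / mixing step psi_{t+1}
    return total_loss
```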
33 We then provide a simple lemma that serves as the starting point for analyzing different instances of this generalized share algorithm. [sent-68, score-0.151]
34 For all t ≥ 1 and for all q_t ∈ ∆_d, Algorithm 1 satisfies
$$ p_t^\top \ell_t - q_t^\top \ell_t \;\le\; \frac{1}{\eta}\sum_{i=1}^{d} q_{i,t}\,\ln\frac{v_{i,t+1}}{p_{i,t}} \;+\; \frac{\eta}{8}\,. $$ [sent-70, score-0.548]
35 By Hoeffding's inequality,
$$ p_t^\top \ell_t \;\le\; -\frac{1}{\eta}\ln\sum_{j=1}^{d} p_{j,t}\,e^{-\eta\,\ell_{j,t}} \;+\; \frac{\eta}{8}\,. $$
By definition of v_{i,t+1}, for all i = 1, ... [sent-77, score-0.35]
36 , d we then have $\sum_{j=1}^{d} p_{j,t}\,e^{-\eta\,\ell_{j,t}} = p_{i,t}\,e^{-\eta\,\ell_{i,t}}/v_{i,t+1}$ by the loss update (1), which implies $p_t^\top \ell_t \le \ell_{i,t} + (1/\eta)\ln(v_{i,t+1}/p_{i,t}) + \eta/8$. [sent-80, score-0.198]
37 The proof is concluded by taking a convex aggregation with respect to q t . [sent-81, score-0.063]
38 However, it is straightforward that for sequences of η-exp-concave loss functions, the additional term η/8 in the bound is no longer needed. [sent-83, score-0.217]
39 3 A generalized shifting regret for the simplex We now introduce a generalized notion of shifting regret which unifies and generalizes the notions of discounted regret (see [3, Section 2. [sent-102, score-1.522]
40 11]), adaptive regret (see [8]), and shifting regret (see [2]). [sent-103, score-0.767]
41 For a fixed horizon T , a sequence of discount factors βt,T 0 for t = 1, . [sent-104, score-0.073]
42 , T assigns varying weights to the instantaneous losses suffered at each round. [sent-107, score-0.082]
43 We compare the total loss of the forecaster with the loss of an arbitrary sequence of vectors q 1 , . [sent-108, score-0.36]
44 Our goal is to bound the regret
$$ \sum_{t=1}^{T} \beta_{t,T}\, p_t^\top \ell_t \;-\; \sum_{t=1}^{T} \beta_{t,T}\, q_t^\top \ell_t $$
in terms of the “regularity” of the comparison sequence q_1, ... [sent-112, score-0.526]
45 , q T and of the variations of the discounting weights βt,T . [sent-115, score-0.043]
46 By setting u_t = β_{t,T} q_t ∈ R_+^d, we can rephrase the above regret as
$$ \sum_{t=1}^{T} \|u_t\|_1\, p_t^\top \ell_t \;-\; \sum_{t=1}^{T} u_t^\top \ell_t\,. \qquad (2) $$ [sent-116, score-2.569]
47 In the literature on tracking the best expert [4, 5, 1, 6], the regularity of the sequence u_1, ... [sent-117, score-0.262]
48 , u_T is measured as the number of times u_t ≠ u_{t+1}. [sent-120, score-0.707]
49 We introduce the following regularity measure
$$ m(u_1^T) \;=\; \sum_{t=2}^{T} D_{\mathrm{TV}}(u_t, u_{t-1}) \qquad (3) $$
where for x = (x_1, ... [sent-121, score-0.059]
50 Note that when x, y ∈ ∆_d, we recover the total variation distance D_TV(x, y) = ½ ‖x − y‖_1, while for general x, y ∈ R_+^d, the quantity D_TV(x, y) is not necessarily symmetric and is always bounded by ‖x − y‖_1. [sent-128, score-0.043]
51 The traditional shifting regret of [4, 5, 1, 6] is obtained from (2) when all ut are such that ut 1 = 1. [sent-129, score-1.898]
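For comparison sequences that stay in the simplex, the regularity measure (3) reduces to a sum of total variation distances ½‖u_t − u_{t−1}‖_1, which is easy to compute; the sketch below assumes this simplex case (for general nonnegative vectors the paper's D_TV is asymmetric and only upper bounded by the 1-norm distance, so this function would not apply as is).

```python
import numpy as np

def shift_measure_simplex(u):
    """m(u_1^T) = sum_{t>=2} D_TV(u_t, u_{t-1}) for rows u_t in the simplex,
    where D_TV(x, y) = 0.5 * ||x - y||_1 in that case."""
    u = np.asarray(u, dtype=float)
    return float(0.5 * np.abs(np.diff(u, axis=0)).sum())

# Example: a sequence over d = 3 experts that switches once
u = np.array([[1.0, 0.0, 0.0]] * 5 + [[0.0, 1.0, 0.0]] * 5)
print(shift_measure_simplex(u))   # 1.0: one switch
```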
52 4 Projected update The shifting variant of the EG algorithm analyzed in [1] is a special case of the generalized share algorithm in which the function ψt+1 performs a projection of the pre-weights on the convex set ∆α = [α/d, 1]d ∩ ∆d . [sent-130, score-0.469]
53 We can prove (using techniques similar to the ones shown in the next section; see the supplementary material) the following bound, which generalizes [1, Theorem 16]. [sent-132, score-0.084]
54 , ℓ_t ∈ [0, 1]^d of loss vectors, and for all u_1, ... [sent-137, score-0.064]
55 , u_T ∈ R_+^d, if Algorithm 1 is run with the above update, then
$$ \sum_{t=1}^{T} \|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \frac{\|u_1\|_1 \ln d}{\eta} + \frac{m(u_1^T)}{\eta}\,\ln\frac{d}{\alpha} + \Bigl(\frac{\eta}{8}+\alpha\Bigr)\sum_{t=1}^{T}\|u_t\|_1\,. \qquad (4) $$ [sent-140, score-3.019]
56 This bound can be optimized by a proper tuning of the α and η parameters. [sent-141, score-0.069]
57 We show a similarly tuned (and slightly better) bound in Corollary 1. [sent-142, score-0.056]
58 We now show that this update delivers a bound on the regret almost equivalent to (though slightly better than) that achieved by projection on the subset ∆α of the simplex. [sent-144, score-0.377]
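For concreteness, a sketch of the fixed-share mixing step is given below, assuming the classical update of [4] in which each pre-weight is shrunk towards the uniform distribution, p_{j,t+1} = α/d + (1 − α) v_{j,t+1}; it plugs directly into the mixing_fn slot of the generalized_share sketch above.

```python
def fixed_share_mixing(alpha, d):
    """Return a mixing function implementing p_{j,t+1} = alpha/d + (1 - alpha) * v_{j,t+1},
    the fixed-share update of [4], with alpha in [0, 1]."""
    def mix(t, v):
        return alpha / d + (1.0 - alpha) * v
    return mix

# Usage with the earlier sketch (names are assumptions):
# generalized_share(losses, eta=0.5, mixing_fn=fixed_share_mixing(alpha=0.05, d=losses.shape[1]))
```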
59 With the above update, for all T ≥ 1, for all sequences ℓ_1, ... [sent-146, score-0.125]
60 , ℓ_T of loss vectors ℓ_t ∈ [0, 1]^d, and for all u_1, ... [sent-149, score-0.089]
61 , u_T ∈ R_+^d,
$$ \sum_{t=1}^{T}\|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \frac{\|u_1\|_1\ln d}{\eta} + \frac{\eta}{8}\sum_{t=1}^{T}\|u_t\|_1 + \frac{m(u_1^T)}{\eta}\ln\frac{d}{\alpha} + \Bigl(\sum_{t=2}^{T}\|u_t\|_1 - m(u_1^T)\Bigr)\frac{1}{\eta}\ln\frac{1}{1-\alpha}\,. $$ [sent-152, score-4.076]
62 Note that if we only consider vectors of the form u_t = q_t = (0, ... [sent-153, score-0.732]
63 , 0) then m(q_1^T) corresponds to the number of times q_{t+1} ≠ q_t in the sequence q_1^T. [sent-159, score-0.05]
64 The fixed-share forecaster does not need to “know” anything in advance about the sequence of the norms ut for the bound above to be valid. [sent-161, score-0.942]
65 Of course, in order to minimize the obtained upper bound, the tuning parameters α, η need to be optimized and their values will depend on the maximal values of m(u_1^T) and Σ_{t=1}^T ‖u_t‖_1 for the sequences one wishes to compete against. [sent-162, score-0.945]
66 Therein, h(x) = −x ln x − (1 − x) ln(1 − x) denotes the binary entropy function for x ∈ [0, 1]. [sent-164, score-0.35]
67 , ℓ_T of loss vectors ℓ_t ∈ [0, 1]^d, and for all sequences u_1, ... [sent-172, score-0.214]
68 , u_T ∈ R_+^d with ‖u_1‖_1 + m(u_1^T) ≤ m_0 and Σ_{t=1}^T ‖u_t‖_1 ≤ U_0,
$$ \sum_{t=1}^{T}\|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \sqrt{\tfrac{U_0}{2}\Bigl(m_0\ln d + U_0\, h\bigl(\tfrac{m_0}{U_0}\bigr)\Bigr)} \;\le\; \sqrt{\tfrac{U_0\, m_0}{2}\Bigl(\ln d + \ln\tfrac{e\,U_0}{m_0}\Bigr)} $$
whenever η and α are optimally chosen in terms of m_0 and U_0. [sent-175, score-3.412]
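In practice, one way to pick η and α given prior bounds m_0 and U_0 is to minimize a Theorem-2-style upper bound numerically over a grid. The sketch below does this; the bound expression follows the reconstruction of Theorem 2 given above (simplified using m_0 and U_0) and is an assumption for illustration, not the paper's closed-form tuning.

```python
import numpy as np

def tune_eta_alpha(m0, U0, d, n_grid=200):
    """Grid-minimize a Theorem-2-style bound (as reconstructed above, then upper
    bounded using m0 >= ||u_1||_1 + m(u_1^T) and U0 >= sum_t ||u_t||_1):
        m0/eta * ln(d/alpha) + eta * U0 / 8 + U0/eta * ln(1/(1-alpha))."""
    best = (np.inf, None, None)
    for eta in np.logspace(-3, 1, n_grid):
        for alpha in np.linspace(1e-4, 0.5, n_grid):
            bound = m0 / eta * np.log(d / alpha) + eta * U0 / 8.0 \
                    + U0 / eta * np.log(1.0 / (1.0 - alpha))
            if bound < best[0]:
                best = (float(bound), float(eta), float(alpha))
    return best  # (bound value, eta, alpha)

print(tune_eta_alpha(m0=10.0, U0=1000.0, d=100))
```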
69 Applying Lemma 1 with q_t = u_t/‖u_t‖_1, and multiplying by ‖u_t‖_1, we get, for all t ≥ 1 and u_t ∈ R_+^d,
$$ \|u_t\|_1\, p_t^\top \ell_t - u_t^\top \ell_t \;\le\; \frac{1}{\eta}\sum_{i=1}^{d} u_{i,t}\ln\frac{v_{i,t+1}}{p_{i,t}} + \frac{\eta}{8}\,\|u_t\|_1\,. \qquad (6) $$
(As can be seen by noting that ln 1/(1 − x) < x/(1 − x).) [sent-177, score-5.847]
70 We now examine
$$ \sum_{i=1}^{d} u_{i,t}\ln\frac{v_{i,t+1}}{p_{i,t}} \;=\; \sum_{i=1}^{d}\Bigl( u_{i,t}\ln\frac{1}{p_{i,t}} - u_{i,t-1}\ln\frac{1}{v_{i,t}} \Bigr) + \sum_{i=1}^{d}\Bigl( u_{i,t-1}\ln\frac{1}{v_{i,t}} - u_{i,t}\ln\frac{1}{v_{i,t+1}} \Bigr)\,. \qquad (7) $$ [sent-178, score-1.75]
71 For the first term on the right-hand side, splitting the sum according to whether u_{i,t} ≥ u_{i,t−1} or u_{i,t} < u_{i,t−1}, we have
$$ \sum_{i=1}^{d}\Bigl( u_{i,t}\ln\frac{1}{p_{i,t}} - u_{i,t-1}\ln\frac{1}{v_{i,t}} \Bigr) \;=\; \sum_{i\,:\,u_{i,t}\ge u_{i,t-1}}\Bigl( (u_{i,t}-u_{i,t-1})\ln\frac{1}{p_{i,t}} + u_{i,t-1}\ln\frac{v_{i,t}}{p_{i,t}} \Bigr) \;+\; \sum_{i\,:\,u_{i,t}< u_{i,t-1}}(\,\cdots\,) $$
... for m_0 > 0, U_0 > 0, and L_0 > 0. [sent-179, score-1.75]
72 , ℓ_T of loss vectors ℓ_t ∈ [0, 1]^d, and for all sequences u_1, ... [sent-183, score-0.214]
73 , u_T ∈ R_+^d with ‖u_1‖_1 + m(u_1^T) ≤ m_0, Σ_{t=1}^T ‖u_t‖_1 ≤ U_0, and Σ_{t=1}^T u_t^⊤ℓ_t ≤ L_0,
$$ \sum_{t=1}^{T}\|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \sqrt{L_0\, m_0\Bigl(\ln d + \ln\tfrac{e\,U_0}{m_0}\Bigr)} + m_0\Bigl(\ln d + \ln\tfrac{e\,U_0}{m_0}\Bigr) $$
whenever η and α are optimally chosen in terms of m_0, U_0, and L_0. [sent-186, score-4.469]
74 Here again, the parameters α and η may be tuned online using the techniques shown in Section 7. [sent-187, score-0.093]
75 The above refinement is obtained by mimicking the analysis of Hedge forecasters for small losses (see, e. [sent-189, score-0.137]
76 In particular, one should substitute Lemma 1 with the following lemma in the analysis carried out in Section 5; its proof follows from the mere replacement of Hoeffding’s inequality by [3, Lemma A. [sent-193, score-0.078]
77 3], which states that for all η ∈ R and for every random variable X taking values in [0, 1], one has ln E[e^{−ηX}] ≤ (e^{−η} − 1) E[X]. [sent-194, score-0.35]
78 $$ \frac{1-e^{-\eta}}{\eta}\, p_t^\top \ell_t - q_t^\top \ell_t \;\le\; \frac{1}{\eta}\sum_{i=1}^{d} q_{i,t}\ln\frac{v_{i,t+1}}{p_{i,t}} $$ for all q_t ∈ ∆_d. [sent-197, score-0.548]
79 Sparse target sequences The work [6] introduced forecasters that are able to efficiently compete with the best sequence of experts among all those sequences that only switch a bounded number of times and also take a small number of different values. [sent-198, score-0.476]
80 Such “sparse” sequences of experts appear naturally in many applications. [sent-199, score-0.171]
81 In this section we show that their algorithms in fact work very well in comparison with a much larger class of sequences u1 , . [sent-200, score-0.125]
82 Note that when q_t ∈ ∆_d for all t, then two interesting upper bounds can be provided. [sent-207, score-0.046]
83 First, denoting the union of the supports of these convex combinations by S ⊆ {1, . [sent-208, score-0.043]
84 , T}, the cardinality of the pool of convex combinations. [sent-215, score-0.065]
85 Thus, n(uT ) generalizes the notion of sparsity of [6]. [sent-216, score-0.076]
86 Here we consider a family of shared updates of the form
$$ p_{j,t} = (1-\alpha)\,v_{j,t} + \alpha\,\frac{w_{j,t}}{Z_t}\,, \qquad 0 \le \alpha \le 1, \qquad (11) $$
where the w_{j,t} are nonnegative weights that may depend on past and current pre-weights and $Z_t = \sum_{i=1}^{d} w_{i,t}$ is a normalization constant. [sent-217, score-0.113]
87 Apart from generalizing the regret bounds of [6], we believe that the analysis given below is significantly simpler and more transparent. [sent-220, score-0.319]
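A sketch of the generic shared update (11) follows. The signature and the example choice of w are assumptions made only for illustration; in particular, the example weights are not claimed to satisfy the paper's conditions (12).

```python
import numpy as np

def shared_update(v, w, alpha):
    """Generic shared update (11): p_j = (1 - alpha) * v_j + alpha * w_j / Z_t,
    where Z_t = sum_i w_i and the w_j are nonnegative weights."""
    w = np.asarray(w, dtype=float)
    return (1.0 - alpha) * np.asarray(v, dtype=float) + alpha * w / w.sum()

# Example (an assumption, not from the paper): mix towards a running average of
# past pre-weights rather than towards the uniform distribution.
# w = past_pre_weights.mean(axis=0); p = shared_update(v, w, alpha=0.05)
```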
88 The next result improves on Theorem 2 when T ≤ d and n(u_1^T) ≤ m(u_1^T), that is, when the dimension (or number of experts) d is large but the sequence u_1^T is sparse. [sent-229, score-0.05]
89 Its proof can be found in the supplementary material; it is a variation on the proof of Theorem 2. [sent-230, score-0.087]
90 Suppose Algorithm 1 is run with the shared update (11) with weights satisfying the conditions (12). [sent-232, score-0.116]
91 , ℓ_T of loss vectors ℓ_t ∈ [0, 1]^d, and for all sequences u_1, ... [sent-236, score-0.214]
92 , u_T ∈ R_+^d,
$$ \sum_{t=1}^{T}\|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \frac{n(u_1^T)\ln d}{\eta} + \frac{n(u_1^T)}{\eta}\ln\frac{T}{C} + \frac{1}{\eta} + \frac{\eta}{8}\sum_{t=1}^{T}\|u_t\|_1 + \frac{m(u_1^T)}{\eta}\ln\frac{\max_{t\le T} Z_t}{\alpha} + \Bigl(\sum_{t=2}^{T}\|u_t\|_1 - m(u_1^T)\Bigr)\frac{1}{\eta}\ln\frac{1}{1-\alpha}\,. $$ [sent-239, score-4.455]
93 Corollaries 8 and 9 of [6] can now be generalized (and even improved); we do so—in the supplementary material—by showing two specific instances of the generic update (11) that satisfy (12). [sent-240, score-0.135]
94 3 Online tuning of the parameters The forecasters studied above need their parameters η and α to be tuned according to various quantities, including the time horizon T . [sent-242, score-0.169]
95 We respectively replace steps 3 and 4 of its description by the loss and shared updates
$$ v_{j,t+1} \;=\; \frac{p_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t\, \ell_{j,t}}}{\sum_{i=1}^{d} p_{i,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t\, \ell_{i,t}}} \qquad\text{and}\qquad p_{j,t+1} \;=\; \frac{\alpha_t}{d} + (1-\alpha_t)\, v_{j,t+1}\,, \qquad (13) $$
for all t ≥ 1 and all j ∈ {1, ... [sent-247, score-0.124]
96 , d}, where (η_τ) and (α_τ) are two sequences of positive numbers, indexed by τ ≥ 1. [sent-250, score-0.125]
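A sketch of fixed share with online tuning, following the reading of update (13) reconstructed above: the previous weights are raised to the power η_t/η_{t−1} before the exponential loss update, and the sharing step uses α_t/d. The decreasing schedules in the usage comment are placeholders, not the paper's recommended choices.

```python
import numpy as np

def adaptive_fixed_share(losses, eta_schedule, alpha_schedule):
    """Fixed share with time-varying parameters, following update (13) as read above."""
    T, d = losses.shape
    p = np.full(d, 1.0 / d)
    eta_prev = eta_schedule(1)               # so that the exponent is 1 at t = 1
    total_loss = 0.0
    for t in range(1, T + 1):
        loss_t = losses[t - 1]
        total_loss += float(p @ loss_t)
        eta_t, alpha_t = eta_schedule(t), alpha_schedule(t)
        v = p ** (eta_t / eta_prev) * np.exp(-eta_t * loss_t)   # loss update of (13)
        v /= v.sum()
        p = alpha_t / d + (1.0 - alpha_t) * v                   # shared update of (13)
        eta_prev = eta_t
    return total_loss

# Placeholder schedules (assumptions): eta_t ~ 1/sqrt(t), alpha_t ~ 1/t
# adaptive_fixed_share(losses, lambda t: 1.0 / np.sqrt(t), lambda t: 1.0 / (t + 1.0))
```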
97 The forecaster based on the updates (13) is such that whenever η_t ≤ η_{t−1} and α_t ≤ α_{t−1} for all t ≥ 1, the following performance bound is achieved. [sent-254, score-0.232]
98 , ℓ_T of loss vectors ℓ_t ∈ [0, 1]^d, and for all u_1, ... [sent-258, score-0.089]
99 , u_T ∈ R_+^d,
$$ \sum_{t=1}^{T}\|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^{T} u_t^\top \ell_t \;\le\; \Bigl(\frac{\|u_1\|_1}{\eta_1} + \sum_{t=2}^{T}\|u_t\|_1\Bigl(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Bigr)\Bigr)\ln d \;+\; \frac{m(u_1^T)}{\eta_T}\ln\frac{d\,(1-\alpha_T)}{\alpha_T} \;+\; \sum_{t=2}^{T}\frac{\|u_t\|_1}{\eta_{t-1}}\ln\frac{1}{1-\alpha_t} \;+\; \frac{1}{8}\sum_{t=1}^{T}\eta_{t-1}\,\|u_t\|_1\,. $$
Due to space constraints, we provide an illustration of this bound only in the supplementary material. [sent-261, score-5.543]
100 Tracking a small set of experts by mixing past posteriors. [sent-293, score-0.112]
wordName wordTfidf (topN-words)
[('ut', 0.707), ('ln', 0.35), ('regret', 0.25), ('shifting', 0.234), ('pt', 0.198), ('forecaster', 0.157), ('sequences', 0.125), ('expert', 0.093), ('dtv', 0.09), ('mirror', 0.078), ('forecasters', 0.077), ('advice', 0.069), ('online', 0.065), ('loss', 0.064), ('share', 0.063), ('tracking', 0.06), ('losses', 0.06), ('regularity', 0.059), ('rieure', 0.059), ('entropic', 0.059), ('update', 0.059), ('shift', 0.058), ('simplex', 0.058), ('normale', 0.056), ('compete', 0.053), ('generalized', 0.051), ('xedshare', 0.051), ('sequence', 0.05), ('paris', 0.046), ('discounted', 0.046), ('experts', 0.046), ('bounds', 0.046), ('herbster', 0.045), ('notion', 0.045), ('ecole', 0.044), ('rd', 0.044), ('convex', 0.043), ('learner', 0.042), ('tuning', 0.041), ('lemma', 0.037), ('cumulative', 0.037), ('descent', 0.037), ('shared', 0.035), ('mixing', 0.035), ('adaptive', 0.033), ('projected', 0.032), ('theorem', 0.032), ('zt', 0.032), ('past', 0.031), ('generalizes', 0.031), ('prediction', 0.031), ('trajectory', 0.03), ('sharing', 0.029), ('material', 0.029), ('hoeffding', 0.029), ('bousquet', 0.029), ('maxt', 0.029), ('chooses', 0.029), ('regularizers', 0.028), ('tuned', 0.028), ('bound', 0.028), ('seemingly', 0.027), ('uni', 0.027), ('sup', 0.026), ('vectors', 0.025), ('supplementary', 0.025), ('updates', 0.025), ('vt', 0.024), ('horizon', 0.023), ('corollary', 0.023), ('generalizing', 0.023), ('degli', 0.023), ('milano', 0.023), ('studi', 0.023), ('feels', 0.023), ('linder', 0.023), ('chernov', 0.023), ('conventionally', 0.023), ('universitat', 0.023), ('cardinality', 0.022), ('whenever', 0.022), ('notions', 0.022), ('weights', 0.022), ('variation', 0.022), ('round', 0.022), ('quantity', 0.021), ('optimally', 0.021), ('hedge', 0.021), ('stoltz', 0.021), ('mere', 0.021), ('delivers', 0.021), ('discounting', 0.021), ('proof', 0.02), ('logarithmic', 0.02), ('barcelona', 0.02), ('investment', 0.02), ('environment', 0.019), ('projection', 0.019), ('eg', 0.019), ('wishes', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
2 0.23886609 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva
Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search. 1
3 0.19661753 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
Author: Vasiliy Karasev, Alessandro Chiuso, Stefano Soatto
Abstract: We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of “visual search” of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a “passive” agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an “omnipotent” agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically). In between these limiting cases, the tradeoff can be characterized empirically. 1
4 0.17642699 324 nips-2012-Stochastic Gradient Descent with Only One Projection
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T / T) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1
5 0.1636012 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
Author: Odalric-ambrym Maillard
Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1
6 0.16232616 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
7 0.16044839 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
8 0.15403453 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
9 0.14360748 102 nips-2012-Distributed Non-Stochastic Experts
10 0.13742816 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
11 0.13199832 295 nips-2012-Risk-Aversion in Multi-armed Bandits
12 0.11940207 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
13 0.1074124 280 nips-2012-Proper losses for learning from partial labels
14 0.10384415 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
15 0.10188077 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
16 0.096419148 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
17 0.096158907 293 nips-2012-Relax and Randomize : From Value to Algorithms
18 0.09536542 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
19 0.093673281 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
20 0.09262199 160 nips-2012-Imitation Learning by Coaching
topicId topicWeight
[(0, 0.16), (1, -0.075), (2, 0.159), (3, 0.09), (4, 0.14), (5, 0.001), (6, -0.012), (7, 0.045), (8, -0.067), (9, 0.028), (10, 0.215), (11, 0.158), (12, -0.172), (13, -0.122), (14, -0.125), (15, 0.11), (16, -0.029), (17, 0.025), (18, -0.051), (19, -0.067), (20, 0.037), (21, -0.064), (22, -0.021), (23, 0.08), (24, -0.034), (25, 0.096), (26, -0.111), (27, -0.016), (28, 0.039), (29, -0.056), (30, -0.027), (31, 0.05), (32, 0.12), (33, -0.023), (34, 0.102), (35, -0.06), (36, 0.054), (37, -0.021), (38, -0.15), (39, 0.089), (40, 0.066), (41, -0.014), (42, -0.027), (43, 0.022), (44, -0.033), (45, 0.018), (46, 0.071), (47, 0.149), (48, -0.069), (49, -0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.97633648 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
2 0.61789 102 nips-2012-Distributed Non-Stochastic Experts
Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic
Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(√(log(n) T)) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O(√(log(n) k T)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(√(k^{5(1+ε)/6} T)) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in the regret vs. communication trade-off. 1
3 0.59264928 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric
Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1
4 0.59097785 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva
Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search. 1
5 0.55020046 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
Author: Brendan Mcmahan, Matthew Streeter
Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1
6 0.53931308 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
7 0.52706081 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
8 0.52553558 295 nips-2012-Risk-Aversion in Multi-armed Bandits
9 0.49978116 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
10 0.48717833 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
11 0.47421697 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
12 0.42463583 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
13 0.41601378 280 nips-2012-Proper losses for learning from partial labels
14 0.39921933 217 nips-2012-Mixability in Statistical Learning
15 0.39782968 293 nips-2012-Relax and Randomize : From Value to Algorithms
16 0.39732873 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
17 0.37567544 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
18 0.37527007 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
19 0.37475374 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
20 0.37256187 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
topicId topicWeight
[(0, 0.025), (17, 0.017), (21, 0.035), (38, 0.124), (42, 0.054), (49, 0.176), (53, 0.031), (54, 0.023), (55, 0.021), (74, 0.036), (76, 0.092), (80, 0.155), (92, 0.098)]
simIndex simValue paperId paperTitle
1 0.82949358 231 nips-2012-Multiple Operator-valued Kernel Learning
Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach
Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1
same-paper 2 0.8207258 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
3 0.79396188 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
4 0.77080774 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
Author: Ehsan Elhamifar, Guillermo Sapiro, René Vidal
Abstract: Given pairwise dissimilarities between data points, we consider the problem of finding a subset of data points, called representatives or exemplars, that can efficiently describe the data collection. We formulate the problem as a row-sparsity regularized trace minimization problem that can be solved efficiently using convex programming. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated with each one of the representatives. We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative for all data points to selecting all data points as representatives. When data points are distributed around multiple clusters according to the dissimilarities, we show that the data points in each cluster select representatives only from that cluster. Unlike metric-based methods, our algorithm can be applied to dissimilarities that are asymmetric or violate the triangle inequality, i.e., it does not require that the pairwise dissimilarities come from a metric. We demonstrate the effectiveness of the proposed algorithm on synthetic data as well as real-world image and text data. 1
5 0.76468176 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
6 0.76403773 65 nips-2012-Cardinality Restricted Boltzmann Machines
7 0.7620157 293 nips-2012-Relax and Randomize : From Value to Algorithms
8 0.76130879 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
9 0.75956661 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
10 0.75739062 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
11 0.75555074 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
12 0.75448644 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
13 0.75417221 197 nips-2012-Learning with Recursive Perceptual Representations
14 0.75406855 200 nips-2012-Local Supervised Learning through Space Partitioning
15 0.74673426 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
16 0.74637181 292 nips-2012-Regularized Off-Policy TD-Learning
17 0.74597228 100 nips-2012-Discriminative Learning of Sum-Product Networks
18 0.74585992 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
19 0.74445629 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
20 0.74241894 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization