nips nips2011 nips2011-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. [sent-5, score-0.409]
2 In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. [sent-6, score-0.452]
3 The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. [sent-7, score-0.271]
4 This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. [sent-8, score-0.611]
5 We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. [sent-9, score-0.313]
6 This process is iterated for T rounds, and our goal is to minimize the regret, namely the difference between the total reward of the single best action in hindsight, and our own accumulated reward. [sent-14, score-0.488]
7 A crucial assumption in this setting is that we get to see the rewards of all actions at the end of each round. [sent-16, score-0.345]
8 A canonical example is web advertising, where at any time point one may choose only a single ad (or a small number of ads) to display, and observe whether it was clicked, but not whether other ads would have been clicked or not if presented to the user. [sent-18, score-0.202]
9 This partial information constraint has led to a flourishing literature on multi-armed bandits problems, which model the setting where we can only observe the reward of the action we chose. [sent-19, score-0.691]
10 While this setting has been long studied under stochastic assumptions, the landmark paper [4] showed that this setting can also be dealt with under adversarial conditions, making the setting comparable to the experts setting discussed above. [sent-20, score-0.323]
11 The price in terms of the provable regret is usually an extra $\sqrt{k}$ multiplicative factor in the bound. [sent-21, score-0.275]
12 The intuition for this factor has long been that in the bandit setting, we only get “1/k of the information” obtained in the expert setting (as we observe just a single reward rather than k). [sent-22, score-0.35]
13 In this paper, we formalize and initiate a study on a range of settings that interpolates between the bandits setting and the experts setting. [sent-26, score-0.464]
14 This subset may depend on action i in an arbitrary way, and may change from round to round. [sent-28, score-0.338]
15 This information feedback structure can be modeled as a sequence of directed graphs $G_1, \ldots, G_T$ (one per round $t$), [sent-29, score-0.231]
16 so that an edge from action $i$ to action $j$ implies that by choosing action $i$, "sufficiently good" information is revealed on the reward of action $j$ as well. [sent-32, score-1.367]
17 The case of Gt being the complete graph corresponds to the experts setting. [sent-33, score-0.214]
18 The case of Gt being the empty graph corresponds to the bandit setting. [sent-34, score-0.204]
19 In the standard multi-armed bandits setting, we assume that we have no information whatsoever on whether undisplayed ads would have been clicked on. [sent-37, score-0.408]
20 For instance, if two ads i, j are for similar vacation packages in Hawaii, and ad i was displayed and clicked on by some user, it is likely that the other ad j would have been clicked on as well. [sent-39, score-0.319]
21 In contrast, if ad $i$ is for running shoes, and ad $j$ is for wheelchair accessories, then a user who clicked on one ad is unlikely to click on the other. [sent-40, score-0.351]
22 Since the area covered by each of the sensors overlaps the area covered by other sensors, the reward obtained when choosing sensor i provides an indication of the reward that would have been obtained when sampling sensor j. [sent-46, score-0.694]
23 Our results portray an interesting picture, with the attainable regret depending on non-trivial properties of these graphs. [sent-49, score-0.302]
24 We provide two practical algorithms with regret guarantees: the ExpBan algorithm that is based on a combination of existing methods, and the more fundamentally novel ELP algorithm that has superior guarantees. [sent-50, score-0.301]
25 In the case of undirected graphs, we show that the information-theoretically attainable regret is precisely characterized by the average independence number (or stability number) of the graph, namely the size of its largest independent set. [sent-52, score-0.489]
26 For the case of directed graphs, we obtain a weaker regret bound, which depends on the average clique-partition number of the graphs. [sent-53, score-0.351]
27 • The framework we consider assumes that by choosing each action, other than just obtaining that action’s reward, we can also observe some side-information about the rewards of other actions. [sent-56, score-0.238]
28 We formalize this as a graph Gt over the actions, where an edge between two actions means that by choosing one action, we can also get a “sufficiently good” estimate of the reward of the other action. [sent-57, score-0.535]
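To make this feedback structure concrete, here is a minimal sketch (our own illustration, not the authors' code) of how a per-round observation graph can be stored through the neighborhoods $N_j(t)$ used later in the paper:

```python
# Illustrative sketch only: building the neighborhoods N_j = {i : choosing i reveals
# a "sufficiently good" estimate of action j's reward} from an edge list.
from typing import Dict, List, Set, Tuple

def neighborhoods(k: int, edges: List[Tuple[int, int]], directed: bool = False) -> Dict[int, Set[int]]:
    N = {j: {j} for j in range(k)}   # choosing j always reveals g_j(t) itself
    for i, j in edges:
        N[j].add(i)                  # choosing i gives information on j
        if not directed:
            N[i].add(j)              # symmetric (undirected) observation structure
    return N

# Example with 4 actions: actions {0, 1} inform each other, and so do {2, 3}.
N = neighborhoods(4, edges=[(0, 1), (2, 3)])
# N == {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}
```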
29 Ignoring computational constraints, the algorithm achieves a regret bound of $O(\sqrt{\bar{\chi}(G)\,\log(k)\,T})$. [sent-62, score-0.333]
30 With computational constraints, its regret bound is $O(\sqrt{c\,\log(k)\,T})$, where $c$ is the size of the minimal clique partition one can efficiently find for $G$. [sent-63, score-0.448]
31 For undirected graphs, where sampling $i$ gives an observation on $j$ and vice versa, it achieves a regret bound of $O\big(\sqrt{\log(k)\sum_{t=1}^{T}\alpha(G_t)}\big)$. [sent-66, score-0.381]
32 For directed graphs (where the observation structure is not symmetric), our regret bound is at most $O\big(\sqrt{\log(k)\sum_{t=1}^{T}\bar{\chi}(G_t)}\big)$. [sent-67, score-0.503]
33 This is in contrast to the ExpBan algorithm, which in the worst case cannot efficiently achieve regret significantly better than $O(\sqrt{k\,\log(k)\,T})$. [sent-69, score-0.255]
34 • For the case of a fixed graph $G_t = G$, we present an information-theoretic $\Omega(\sqrt{\alpha(G)\,T})$ lower bound on the regret, which holds regardless of computational efficiency. [sent-70, score-0.192]
35 1 Related Work The standard multi-armed bandits problem assumes no relationship between the actions. [sent-73, score-0.238]
36 Examples include [11], where the actions' rewards are assumed to be drawn from a statistical distribution, with correlations between the actions; and [1, 8], where the actions' rewards are assumed to satisfy some Lipschitz continuity property with respect to a distance measure between the actions. [sent-76, score-0.316]
37 In terms of other approaches, the combinatorial bandits framework [7] considers a setting slightly similar to ours, in that one chooses and observes the rewards of some subset of actions. [sent-77, score-0.397]
38 However, it is crucially assumed that the reward obtained is the sum of the rewards of all actions in the subset. [sent-78, score-0.486]
39 In other words, there is no separation between earning a reward and obtaining information on its value. [sent-79, score-0.211]
40 A related line of work is bandits with side information (e.g., [9, 10]), where the standard multi-armed bandits setting is augmented with some side-information provided in each round, which can be used to determine which action to pick. [sent-84, score-0.501]
41 Choosing an action $i$ at round $t$ results in receiving a reward $g_i(t)$, which we shall assume without loss of generality to be bounded in $[0, 1]$. [sent-98, score-0.57]
42 Following the standard adversarial framework, we make no assumptions whatsoever about how the rewards are selected, and they might even be chosen by an adversary. [sent-99, score-0.227]
43 We denote our choice of action at round $t$ as $i_t$. [sent-100, score-0.338]
44 Our goal is to minimize regret with respect to the best single action in hindsight, namely $\max_i \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} g_{i_t}(t)$. [sent-101, score-0.706]
45 In each round $t$, the learning algorithm chooses a single action $i_t$. [sent-105, score-0.361]
46 In the standard multi-armed bandits setting, this results in $g_{i_t}(t)$ being revealed to the algorithm, while $g_j(t)$ remains unknown for any $j \neq i_t$. [sent-106, score-0.455]
47 In our setting, we assume that by choosing an action $i$, besides getting $g_i(t)$, we also get some side-observations about the rewards of the other actions. [sent-107, score-0.501]
48 Formally, we assume that one receives $g_i(t)$, and for some fixed parameter $b$ is able to construct unbiased estimates $\hat{g}_j(t)$ for all actions $j$ in some subset of $[k]$, such that $\mathbb{E}[\hat{g}_j(t) \mid \text{action } i \text{ chosen}] = g_j(t)$ and $\Pr(|\hat{g}_j(t)| \le b) = 1$. [sent-108, score-0.463]
49 For any action $j$, we let $N_j(t)$ be the set of actions for which we can get such an estimate $\hat{g}_j(t)$ of the reward of action $j$. [sent-109, score-0.836]
50 This is essentially the “neighborhood” of action j, which receives sufficiently good information (as parameterized by b) on the reward of action j. [sent-110, score-0.746]
51 Intuitively, one can think of this setting as a sequence of graphs, one graph per round t, which captures the information feedback structure between the actions. [sent-113, score-0.275]
52 The idea of the algorithm is to split the actions into c cliques, such that choosing an action in a clique reveals unbiased estimates of the rewards of all the other actions in the clique. [sent-121, score-0.912]
53 By running a standard experts algorithm (such as the exponentially weighted forecaster - see [6, Chapter 2]), we can get low regret with respect to any action in that clique. [sent-122, score-0.692]
54 We then treat each such expert algorithm as a meta-action, and run a standard bandits algorithm (such as the EXP3 [4]) over these c meta-actions. [sent-123, score-0.275]
55 We denote this algorithm as ExpBan, since it combines an experts algorithm with a bandit algorithm. [sent-124, score-0.22]
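The two-level structure can be sketched as follows. This is an illustrative outline under our own assumptions (the clique partition is given, rewards lie in [0, 1], and the step sizes are generic placeholders rather than the tuned values from the analysis); it is not the authors' implementation.

```python
import numpy as np

class Hedge:
    """Exponentially weighted forecaster over n experts (full-information feedback)."""
    def __init__(self, n, eta):
        self.w = np.ones(n)
        self.eta = eta
    def probs(self):
        return self.w / self.w.sum()
    def draw(self, rng):
        return rng.choice(len(self.w), p=self.probs())
    def update(self, rewards):                      # one reward per expert
        self.w *= np.exp(self.eta * np.asarray(rewards))

class Exp3:
    """EXP3 over n arms (bandit feedback)."""
    def __init__(self, n, eta, gamma):
        self.w = np.ones(n); self.eta = eta; self.gamma = gamma
    def probs(self):
        p = self.w / self.w.sum()
        return (1 - self.gamma) * p + self.gamma / len(self.w)
    def draw(self, rng):
        return rng.choice(len(self.w), p=self.probs())
    def update(self, arm, reward):
        p = self.probs()
        est = np.zeros(len(self.w)); est[arm] = reward / p[arm]   # importance-weighted estimate
        self.w *= np.exp(self.eta * est)

def expban(cliques, T, get_reward, get_clique_rewards, eta=0.05, gamma=0.05, seed=0):
    """cliques: list of lists of action indices covering [k].
    get_reward(t, i): reward of the chosen action i at round t.
    get_clique_rewards(t, clique): estimated rewards of all actions in the chosen
    clique (available thanks to the side-observation assumption)."""
    rng = np.random.default_rng(seed)
    inner = [Hedge(len(c), eta) for c in cliques]   # one experts algorithm per clique
    outer = Exp3(len(cliques), eta, gamma)          # bandit algorithm over meta-actions
    total = 0.0
    for t in range(T):
        m = outer.draw(rng)                         # pick a clique (meta-action)
        a = cliques[m][inner[m].draw(rng)]          # pick an action inside that clique
        r = get_reward(t, a)
        total += r
        inner[m].update(get_clique_rewards(t, cliques[m]))  # full information inside the clique
        outer.update(m, r)                          # bandit feedback across cliques
    return total
```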
56 The following result provides a bound on the expected regret of the algorithm. [sent-125, score-0.31]
57 If we run ExpBan using the exponentially weighted forecaster and the EXP3 algorithm, then the expected regret is bounded as follows: for any fixed action $j$, $\sum_{t=1}^{T} g_j(t) - \mathbb{E}\big[\sum_{t=1}^{T} g_{i_t}(t)\big] \le 4b\sqrt{c\,\log(k)\,T}$. [sent-129, score-0.514]
58 The case $\bar{\chi}(G) = 1$ corresponds to $G$ being a clique, namely, that choosing any action allows us to estimate the rewards of all other actions. [sent-132, score-0.435]
59 The case $\bar{\chi}(G) = k$ corresponds to $G$ being the empty graph, namely, that choosing any action only reveals the reward of that action. [sent-138, score-0.531]
60 Like all multi-armed bandits algorithms, it is based on a tradeoff between exploration and exploitation. [sent-148, score-0.226]
61 Below, we present the pseudo-code as well as a couple of theorems that bound the expected regret of the algorithm under appropriate parameter choices. [sent-151, score-0.333]
62 The first theorem concerns the symmetric observation case, where if choosing action i gives information on action j, then choosing action j must also give information on i. [sent-153, score-0.897]
63 For $t = 1, \ldots, T$ do: $\forall i \in [k]$, $p_i(t) := (1 - \gamma(t))\,\frac{w_i(t)}{\sum_{l=1}^{k} w_l(t)} + \gamma(t)\,s_i(t)$. Choose action $i_t$ with probability $p_{i_t}(t)$, and receive reward $g_{i_t}(t)$. Compute $\hat{g}_j(t)$ for all $j \in N_{i_t}(t)$. For all $j \in [k]$, let $\tilde{g}_j(t) = \hat{g}_j(t) \big/ \sum_{l \in N_j(t)} p_l(t)$ if $i_t \in N_j(t)$, and $\tilde{g}_j(t) = 0$ otherwise. [sent-161, score-1.025]
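A minimal sketch of one round of this update, as reconstructed above (our own illustrative code; the exploration distribution $s$, the mixing coefficient $\gamma$, and the learning rate $\beta$ are assumed to be given, and the side-observation $\hat{g}_j(t)$ is taken to be the observed reward itself):

```python
import numpy as np

# Hypothetical single-round ELP update. w: current weights (array of length k);
# s: exploration distribution over actions (sums to 1); gamma: mixing weight;
# beta: learning rate; N[j]: set of actions whose choice reveals an estimate of g_j;
# rewards: this round's rewards (only the observed entries are actually used).
def elp_round(w, s, gamma, beta, N, rewards, rng):
    k = len(w)
    p = (1 - gamma) * w / w.sum() + gamma * s              # p_i(t)
    i_t = rng.choice(k, p=p)                               # play i_t
    g_hat = {j: rewards[j] for j in range(k) if i_t in N[j]}   # side-observations
    g_tilde = np.zeros(k)
    for j, gj in g_hat.items():
        obs_prob = sum(p[l] for l in N[j])                 # prob. some neighbor of j is played
        g_tilde[j] = gj / obs_prob                         # importance-weighted estimate
    w_new = w * np.exp(beta * g_tilde)                     # exponential-weights update
    return w_new, i_t, rewards[i_t]
```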
64 1 Undirected Graphs The following theorem provides a regret bound for the algorithm, as well as appropriate parameter choices, in the case of undirected graphs. [sent-163, score-0.381]
65 In a nutshell, the theorem shows that the regret bound depends on the average independence number α(Gt ) of each graph Gt - namely, the size of its largest independent set. [sent-165, score-0.486]
66 We note that for any graph $G_t$, its independence number $\alpha(G_t)$ lower bounds its clique-partition number $\bar{\chi}(G_t)$. [sent-173, score-0.23]
67 Thus, the attainable regret using the ELP algorithm is better than the one attained by the ExpBan algorithm. [sent-176, score-0.325]
68 Note that one of these values must be wrong by a factor of at most 2, i.e., one of the guesses is within a factor of 2 of $\sum_{t=1}^{T}\alpha(G_t)$, so the regret of the algorithm using that value would be larger by a factor of at most 2. [sent-182, score-0.309]
69 But this can be easily solved by treating each such copy as a “meta-action”, and running a standard multi-armed bandits algorithm (such as EXP3) over these log(k) actions. [sent-184, score-0.249]
70 Since there are log(k) meta-actions, the additional regret incurred is $O(\sqrt{\log^2(k)\,T})$. [sent-186, score-0.255]
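One way to make the guessing argument concrete is sketched below; the specific guessing grid is our own choice and may differ from the paper's.

```latex
% Since 1 <= alpha(G_t) <= k for every t, the unknown sum S = \sum_{t=1}^{T} \alpha(G_t)
% lies in [T, kT]. Run one copy of ELP for each guess S_j = 2^j T, j = 0, ..., ceil(log2 k).
\[
  \exists\, j \in \{0, 1, \ldots, \lceil \log_2 k \rceil\} : \quad
  S_j \;\le\; \sum_{t=1}^{T} \alpha(G_t) \;\le\; 2 S_j ,
\]
% so the copy tuned for S_j loses at most a constant factor relative to the optimally
% tuned bound, while the EXP3 master over these O(log k) meta-actions adds at most
% O(sqrt(log^2(k) T)) regret, as stated above.
```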
71 So up to logarithmic factors in k, we get the same regret as if we could actually compute the optimal value of β. [sent-187, score-0.279]
72 However, a natural extension of this setting is to assume a directed graph, where choosing an action i may give us information on the reward of action j, but not vice-versa. [sent-190, score-0.872]
73 Under the same conditions (with the relaxation that the graphs $G_t$ may be directed), it holds for any fixed action $j$ that $\sum_{t=1}^{T} g_j(t) - \mathbb{E}\big[\sum_{t=1}^{T} g_{i_t}(t)\big] \le \frac{\log(k)}{\beta} + 3\beta b^2 \sum_{t=1}^{T} \bar{\chi}(G_t)$, where $\bar{\chi}(G_t)$ is the clique-partition number of $G_t$. [sent-194, score-0.602]
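Assuming the two-term form above, balancing the terms over $\beta$ recovers the $O\big(\sqrt{\log(k)\sum_{t=1}^{T}\bar{\chi}(G_t)}\big)$ rate quoted earlier (a standard calculation, shown here as a sketch):

```latex
% Write S = \sum_{t=1}^{T} \bar{\chi}(G_t). Minimizing  log(k)/beta + 3 beta b^2 S  over beta:
\[
  \beta \;=\; \sqrt{\frac{\log(k)}{3 b^{2} S}}
  \quad\Longrightarrow\quad
  \frac{\log(k)}{\beta} + 3 \beta b^{2} S
  \;=\; 2 b \sqrt{3 \log(k)\, S}
  \;=\; O\!\Big(\sqrt{\log(k) \textstyle\sum_{t=1}^{T} \bar{\chi}(G_t)}\Big).
\]
```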
74 We do not know whether this bound (relying on the clique-partition number) is tight, but we conjecture that the independence number, which appears to be the key quantity in undirected graphs, is not the correct combinatorial measure for the case of directed graphs. [sent-199, score-0.294]
75 In any case, we note that even with the weaker bound above, the ELP algorithm still seems superior to the ExpBan algorithm, in the sense that it allows us to deal with time-changing graphs, and that an explicit clique decomposition of the graph is not required. [sent-200, score-0.347]
76 Suppose Gt = G for all t, and that actions which are not linked in G get no side-observations whatsoever between them. [sent-207, score-0.273]
77 Then there exists a (randomized) adversary strategy, such that for every $T \ge 374\,\alpha(G)^3$ and any learning strategy, the expected regret is at least a constant multiple of $\sqrt{\alpha(G)\,T}$. [sent-208, score-0.288]
78 The intuition of the proof is that if the graph $G$ has $\alpha(G)$ independent vertices, then an adversary can make this problem as hard as a standard multi-armed bandits problem played on $\alpha(G)$ actions. [sent-211, score-0.343]
79 Using a known lower bound of $\Omega(\sqrt{nT})$ for multi-armed bandits on $n$ actions, our result follows. [sent-212, score-0.29]
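A rough one-line version of this reduction (a sketch with constants omitted): fix a maximum independent set $S \subseteq [k]$ with $|S| = \alpha(G)$ and let the adversary make all actions outside $S$ unattractive; since no action in $S$ yields side-observations on any other action in $S$, the learner effectively faces a standard bandit problem over $S$, so

```latex
\[
  \mathbb{E}[\mathrm{regret}] \;\ge\; \Omega\big(\sqrt{|S|\,T}\big) \;=\; \Omega\big(\sqrt{\alpha(G)\,T}\big).
\]
```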
80 For constant undirected graphs, this lower bound matches the regret upper bound for the ELP algorithm (Thm. [sent-213, score-0.49]
81 6 Examples Here, we briefly discuss some concrete examples of graphs $G$, and show how the regret performance of our algorithms depends on their structure. [sent-218, score-0.376]
82 First, consider the case where there exists a single action such that choosing it reveals the rewards of all the other actions. [sent-220, score-0.211]
83 In contrast, choosing any of the other actions only reveals its own reward. [sent-221, score-0.215]
84 We can think of each action i as being in the center of a sphere of radius r, such that the reward of action i is propagated to every other action in that sphere. [sent-226, score-1.003]
85 Both numbers shrink rapidly as r increases, improving the regret of our algorithms. [sent-229, score-0.255]
86 For example, if the actions are placed at the elements of $\{0, 1/2, 1\}^n$, we use the $\ell_\infty$ metric, and $r \in (1/2, 1)$, then it is easily seen that the sphere packing number is just 1. [sent-231, score-0.236]
87 Third, consider the random Erdős-Rényi graph $G = G(k, p)$, which is formed by linking every action $i$ to every action $j$ with probability $p$ independently. [sent-234, score-0.635]
88 This translates to a regret bound of $O(\sqrt{kT})$ for the ExpBan algorithm, and to a substantially smaller bound for the ELP algorithm; such a gap would also hold for a directed random graph. [sent-236, score-0.382]
89 Empirical Performance Gap between ExpBan and ELP: In this section, we show that the gap between the performance of the ExpBan algorithm and the ELP algorithm can be real, and is not just an artifact of our analysis. [sent-238, score-0.193]
90 To show this, we performed the following simple experiment: we created a random Erdős-Rényi graph over 300 nodes, where each pair of nodes was linked independently with probability $p$. [sent-248, score-0.185]
91 Choosing any action results in observing the rewards of neighboring actions in the graph. [sent-249, score-0.528]
92 The reward of each action at each round was chosen randomly and independently to be 1 with probability 1/2 and 0 with probability 1/2, except for a single node, whose reward equals 1 with a higher probability of 3/4. [sent-250, score-0.749]
93 For comparison, we also implemented the standard EXP3 multi-armed bandits algorithm [4], which doesn’t use any side-observations. [sent-252, score-0.227]
94 As $p$ increases, the value of side-observations increases, and the performance of our two algorithms, which utilize side-observations, improves over the standard multi-armed bandits algorithm. [sent-260, score-0.204]
95 This is exactly the regime where the gap between the clique-partition number (governing the regret bound of ExpBan) and the independence number (governing the regret bound for the ELP algorithm) tends to be larger as well5 . [sent-262, score-0.752]
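This gap can also be checked numerically with simple greedy heuristics; the sketch below (our own, using networkx) only produces an upper bound on the clique-partition number and a lower bound on the independence number, which is enough to see the trend.

```python
import networkx as nx

# Illustrative check of the gap between the clique-partition number (governing ExpBan's
# bound) and the independence number (governing ELP's bound) on G(300, p).
for p in [0.05, 0.2, 0.5, 0.9]:
    G = nx.erdos_renyi_graph(300, p, seed=0)
    # A proper coloring of the complement of G partitions G into cliques, so the
    # number of colors used is an efficiently computable clique cover of G.
    clique_cover = 1 + max(nx.greedy_color(nx.complement(G)).values())
    # Any maximal independent set lower-bounds the independence number alpha(G).
    indep_lower = len(nx.maximal_independent_set(G, seed=0))
    print(f"p={p}: greedy clique cover={clique_cover}, maximal independent set={indep_lower}")
```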
96 Finally, for large p, the graph is almost complete, and the advantage of ELP over ExpBan becomes small again (since most actions give information on most other actions). [sent-263, score-0.26]
97 In particular, we studied the broad regime which interpolates between the experts setting and the bandits setting of online learning. [sent-265, score-0.498]
98 Second, our lower bounds depend on a reduction which essentially assumes that the graph is constant over time. [sent-272, score-0.215]
99 5 Intuitively, this can be seen by considering the extreme cases - for a complete graph over k nodes, both numbers equal 1, and for an empty graph over k nodes, both numbers equal k. [sent-280, score-0.244]
100 The epoch-greedy algorithm for multi-armed bandits with side information. [sent-333, score-0.248]
wordName wordTfidf (topN-words)
[('gt', 0.452), ('elp', 0.387), ('expban', 0.387), ('regret', 0.255), ('action', 0.252), ('bandits', 0.204), ('reward', 0.19), ('actions', 0.154), ('rewards', 0.122), ('graphs', 0.121), ('clique', 0.118), ('gj', 0.118), ('git', 0.111), ('experts', 0.108), ('graph', 0.106), ('clicked', 0.089), ('round', 0.086), ('nj', 0.078), ('directed', 0.072), ('undirected', 0.071), ('whatsoever', 0.07), ('interpolates', 0.07), ('independence', 0.07), ('bandit', 0.066), ('gap', 0.062), ('choosing', 0.061), ('sphere', 0.057), ('bound', 0.055), ('maker', 0.051), ('ad', 0.048), ('attainable', 0.047), ('namely', 0.046), ('ads', 0.045), ('setting', 0.045), ('sensor', 0.044), ('gi', 0.042), ('kt', 0.039), ('feedback', 0.038), ('log', 0.037), ('chromatic', 0.037), ('initiate', 0.037), ('wideband', 0.037), ('covered', 0.036), ('adversarial', 0.035), ('ni', 0.035), ('advice', 0.034), ('assumes', 0.034), ('sensors', 0.033), ('cliques', 0.033), ('adversary', 0.033), ('empty', 0.032), ('lower', 0.031), ('receives', 0.031), ('equals', 0.031), ('ultra', 0.03), ('forecaster', 0.03), ('spheres', 0.03), ('nodes', 0.029), ('node', 0.028), ('neighborhood', 0.028), ('reveals', 0.028), ('si', 0.027), ('shie', 0.027), ('endowed', 0.027), ('combinatorial', 0.026), ('online', 0.026), ('rounds', 0.025), ('expert', 0.025), ('hindsight', 0.025), ('packing', 0.025), ('governing', 0.025), ('nyi', 0.025), ('erd', 0.025), ('linked', 0.025), ('sl', 0.024), ('weaker', 0.024), ('get', 0.024), ('advertising', 0.023), ('bounds', 0.023), ('algorithm', 0.023), ('artifact', 0.023), ('revealed', 0.022), ('copy', 0.022), ('indication', 0.022), ('policies', 0.022), ('exploration', 0.022), ('advance', 0.022), ('obtaining', 0.021), ('side', 0.021), ('wj', 0.021), ('essentially', 0.021), ('deal', 0.021), ('web', 0.02), ('provable', 0.02), ('assumed', 0.02), ('partition', 0.02), ('channel', 0.019), ('motivating', 0.019), ('concerns', 0.019), ('area', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
2 0.258863 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
3 0.24276848 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
4 0.20274286 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
5 0.17833924 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
6 0.1682979 61 nips-2011-Contextual Gaussian Process Bandit Optimization
7 0.13359478 32 nips-2011-An Empirical Evaluation of Thompson Sampling
8 0.13340464 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
9 0.1333701 25 nips-2011-Adaptive Hedge
10 0.12373648 80 nips-2011-Efficient Online Learning via Randomized Rounding
11 0.12371741 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
12 0.12198334 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
13 0.12160888 56 nips-2011-Committing Bandits
14 0.10605519 272 nips-2011-Stochastic convex optimization with bandit feedback
15 0.10386361 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
16 0.1004485 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
17 0.095037676 271 nips-2011-Statistical Tests for Optimization Efficiency
18 0.09236642 306 nips-2011-t-divergence Based Approximate Inference
19 0.090273783 145 nips-2011-Learning Eigenvectors for Free
20 0.084512085 177 nips-2011-Multi-armed bandits on implicit metric spaces
topicId topicWeight
[(0, 0.187), (1, -0.312), (2, 0.027), (3, 0.121), (4, 0.122), (5, -0.033), (6, -0.039), (7, 0.115), (8, 0.089), (9, -0.01), (10, -0.087), (11, 0.036), (12, -0.129), (13, -0.035), (14, 0.03), (15, 0.059), (16, 0.028), (17, -0.012), (18, -0.011), (19, 0.016), (20, -0.002), (21, 0.088), (22, -0.066), (23, 0.067), (24, 0.154), (25, -0.004), (26, 0.021), (27, -0.018), (28, -0.045), (29, 0.092), (30, 0.045), (31, -0.077), (32, -0.128), (33, -0.079), (34, -0.016), (35, -0.062), (36, 0.031), (37, -0.113), (38, 0.016), (39, -0.031), (40, 0.05), (41, -0.076), (42, -0.008), (43, -0.065), (44, 0.027), (45, -0.018), (46, 0.079), (47, -0.071), (48, 0.08), (49, -0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.96080494 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
2 0.72810012 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
3 0.70877415 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
4 0.70441902 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
5 0.70236218 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
6 0.67177671 272 nips-2011-Stochastic convex optimization with bandit feedback
7 0.57302737 61 nips-2011-Contextual Gaussian Process Bandit Optimization
8 0.54384112 25 nips-2011-Adaptive Hedge
9 0.53956562 56 nips-2011-Committing Bandits
10 0.50676274 220 nips-2011-Prediction strategies without loss
11 0.468775 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
12 0.44393697 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
13 0.43436506 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
14 0.43301097 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
15 0.43270114 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
16 0.42637917 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
17 0.42207566 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
18 0.38745093 80 nips-2011-Efficient Online Learning via Randomized Rounding
19 0.38364112 41 nips-2011-Autonomous Learning of Action Models for Planning
20 0.38291427 177 nips-2011-Multi-armed bandits on implicit metric spaces
topicId topicWeight
[(0, 0.013), (4, 0.046), (10, 0.017), (20, 0.03), (26, 0.033), (31, 0.107), (33, 0.019), (40, 0.229), (43, 0.089), (45, 0.126), (57, 0.04), (74, 0.067), (83, 0.019), (99, 0.058)]
simIndex simValue paperId paperTitle
Author: Shinichi Nakajima, Masashi Sugiyama, S. D. Babacan
Abstract: Variational Bayesian matrix factorization (VBMF) efficiently approximates the posterior distribution of factorized matrices by assuming matrix-wise independence of the two factors. A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis. 1
same-paper 2 0.79971582 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
3 0.78821099 30 nips-2011-Algorithms for Hyper-Parameter Optimization
Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1
4 0.76532233 180 nips-2011-Multiple Instance Filtering
Author: Kamil A. Wnuk, Stefano Soatto
Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1
5 0.69006383 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
6 0.68712932 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.67578757 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
8 0.6747511 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
9 0.67439926 186 nips-2011-Noise Thresholds for Spectral Clustering
10 0.67365444 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.67156267 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
12 0.67126912 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
13 0.67014855 276 nips-2011-Structured sparse coding via lateral inhibition
14 0.67013597 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
15 0.66749883 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
16 0.66740674 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
17 0.66691983 66 nips-2011-Crowdclustering
18 0.66669297 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
19 0.66553694 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
20 0.66530073 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis