nips nips2013 nips2013-29 knowledge-graph by maker-knowledge-mining

29 nips-2013-Adaptive Submodular Maximization in Bandit Setting


Source: pdf

Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan

Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. [sent-11, score-0.339]

2 Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. [sent-12, score-1.582]

3 In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. [sent-13, score-0.344]

4 We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. [sent-14, score-0.301]

5 Our regret bound captures the inherent property of submodular maximization: earlier mistakes are more costly than later ones. [sent-15, score-0.576]

6 We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. [sent-16, score-0.063]

7 We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. [sent-17, score-0.236]

8 1 Introduction Maximization of submodular functions [14] has wide applications in machine learning and artificial intelligence, such as social network analysis [9], sensor placement [10], and recommender systems [7, 2]. [sent-18, score-0.385]

9 In this paper, we study the problem of adaptive submodular maximization [5]. [sent-19, score-0.617]

10 This problem is a variant of submodular maximization where each item has a state and this state is revealed when the item is chosen. [sent-20, score-1.649]

11 The goal is to learn a policy that maximizes the expected return for choosing K items. [sent-21, score-0.381]

12 Adaptive submodular maximization has been traditionally studied in the setting where the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. [sent-22, score-1.582]

13 This is the first paper that studies the setting where the model is initially unknown, and it is learned by interacting repeatedly with the environment. [sent-23, score-0.135]

14 We bring together the concepts of adaptive submodular maximization and bandits, and the result is an efficient solution to our problem. [sent-24, score-0.617]

15 First, we propose a model where the expected gain of choosing an item can be learned efficiently. [sent-26, score-0.785]

16 The main assumption in the model is that the state of each item is distributed independently of the other states. [sent-27, score-0.578]

17 Second, we propose Optimistic Adaptive Submodular Maximization (OASM), a bandit algorithm that selects items with the highest upper confidence bound on the expected gain. [sent-28, score-0.513]

18 Third, we prove that the expected cumulative regret of our algorithm increases logarithmically with time. [sent-30, score-0.301]

19 Our regret bound captures the inherent property of adaptive submodular maximization: earlier mistakes are more costly than later ones. [sent-31, score-0.7]

20 Finally, we apply our approach to a real-world preference elicitation 1 problem and show that non-trivial policies can be learned from just a few hundred interactions with the problem. [sent-32, score-0.236]

21 2 Adaptive Submodularity In adaptive submodular maximization, the objective is to maximize, under constraints, a function of the form: f : 2^I × {−1, 1}^L → R, (1) where I = {1, . [sent-33, score-0.463]

22 , L} is a set of L items and 2^I is its power set. [sent-36, score-0.301]

23 The first argument of f is a subset of chosen items A ⊆ I. [sent-37, score-0.332]

24 The second argument is the state φ ∈ {−1, 1}^L of all items. [sent-38, score-0.079]

25 The i-th entry of φ, φ[i], is the state of item i. [sent-39, score-0.578]

26 The reward for choosing items A in state φ is f (A, φ). [sent-44, score-0.429]

27 In problems of our interest, the state is only partially observed. [sent-46, score-0.079]

28 An observation is a vector y ∈ {−1, 0, 1}^L whose non-zero entries are the observed states of items. [sent-48, score-0.064]

29 We say that y is an observation of state φ, and write φ ∼ y, if y[i] = φ[i] in all non-zero entries of y. [sent-49, score-0.1]

30 Alternatively, the state φ can be viewed as a realization of y, one of many. [sent-50, score-0.079]

31 We denote by dom(y) = {i : y[i] ≠ 0} the observed items in y and by φ⟨A⟩ the observation of items A in state φ. [sent-51, score-0.94]

32 We define a partial ordering on observations and write y′ ⪰ y if y′[i] = y[i] in all non-zero entries of y; that is, y′ is a more specific observation than y. [sent-52, score-0.057]

33 Then all of the following claims are true: φ ∼ y1, φ ∼ y2, y2 ⪰ y1, dom(y2) = {1, 3}, φ⟨{1, 3}⟩ = y2, φ⟨dom(y1)⟩ = y1. [sent-56, score-0.497]
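
To make the notation above concrete, here is a minimal Python sketch (not from the paper) that represents states and observations as arrays over {−1, 0, 1} and checks the relations φ ∼ y, y′ ⪰ y, dom(y), and φ⟨A⟩; it is a 0-indexed analogue of the example above.

    import numpy as np

    def dom(y):
        # Observed items of y: indices with non-zero entries.
        return {i for i in range(len(y)) if y[i] != 0}

    def realizes(phi, y):
        # phi ~ y: y agrees with phi on all of y's non-zero entries.
        return all(y[i] == phi[i] for i in dom(y))

    def more_specific(y2, y1):
        # y2 >= y1 in the partial ordering: y2 agrees with y1 on dom(y1).
        return all(y2[i] == y1[i] for i in dom(y1))

    def restrict(phi, A):
        # phi<A>: the observation of items A in state phi.
        y = np.zeros(len(phi), dtype=int)
        for i in A:
            y[i] = phi[i]
        return y

    # y1 observes item 0, y2 observes items 0 and 2 (0-indexed).
    phi = np.array([1, -1, 1])
    y1 = np.array([1, 0, 0])
    y2 = np.array([1, 0, 1])
    assert realizes(phi, y1) and realizes(phi, y2) and more_specific(y2, y1)
    assert dom(y2) == {0, 2} and (restrict(phi, {0, 2}) == y2).all()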

34 Our goal is to maximize the expected value of f by adaptively choosing K items. [sent-57, score-0.188]

35 This problem can be viewed as a K-step game, where at each step we choose an item according to some policy π and then observe its state. [sent-58, score-0.808]

36 A policy π : {−1, 0, 1}^L → I is a function from observations y to items. [sent-59, score-0.267]

37 A k-step policy in state φ, πk (φ), is a collection of the first k items chosen by policy π. [sent-61, score-0.873]

38 The policy is defined recursively as: π_k(φ) = π_{k−1}(φ) ∪ π_[k](φ), π_[k](φ) = π(φ⟨π_{k−1}(φ)⟩), π_0(φ) = ∅, (2) where π_[k](φ) is the k-th item chosen by policy π in state φ. [sent-62, score-1.071]
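
Read operationally, Equation 2 says: run the policy for K steps, revealing the state of each chosen item and feeding the growing observation back into the policy. A minimal sketch, under the assumption that a policy is simply a function from an observation vector to an item index (names are illustrative):

    import numpy as np

    def run_policy(policy, phi, K):
        # Execute an adaptive policy for K steps in state phi.
        # Returns pi_K(phi) (the chosen items, in order) and phi<pi_K(phi)>.
        y = np.zeros(len(phi), dtype=int)   # phi<pi_0(phi)> = the empty observation
        chosen = []
        for _ in range(K):
            i = policy(y)                   # pi_[k](phi) = pi(phi<pi_{k-1}(phi)>)
            chosen.append(i)
            y[i] = phi[i]                   # the state of item i is revealed
        return chosen, y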

39 The optimal K-step policy satisfies: π* = arg max_π E_φ[f(π_K(φ), φ)]. [sent-63, score-0.231]

40 However, near-optimal policies can be computed efficiently when the maximized function has a diminishing return property. [sent-65, score-0.124]

41 Formally, we require that the function is adaptive submodular and adaptive monotonic [5]. [sent-66, score-0.62]

42 Function f is adaptive submodular if: E_φ[ f(A ∪ {i}, φ) − f(A, φ) | φ ∼ y_A ] ≥ E_φ[ f(B ∪ {i}, φ) − f(B, φ) | φ ∼ y_B ] for all items i ∈ I \ B and observations y_B ⪰ y_A, where A = dom(y_A) and B = dom(y_B). [sent-68, score-1.276]

43 Function f is adaptive monotonic if E_φ[ f(A ∪ {i}, φ) − f(A, φ) | φ ∼ y_A ] ≥ 0 for all items i ∈ I \ A and observations y_A, where A = dom(y_A). [sent-70, score-0.732]

44 In other words, the expected gain of choosing an item is always non-negative and does not increase as the observations become more specific. [sent-71, score-0.793]

45 Let π^g be the greedy policy for maximizing f, a policy that always selects the item with the highest expected gain: π^g(y) = arg max_{i ∈ I \ dom(y)} g_i(y), (4) where: g_i(y) = E_φ[ f(dom(y) ∪ {i}, φ) − f(dom(y), φ) | φ ∼ y ] (5) is the expected gain of choosing item i after observing y. [sent-72, score-2.339]
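
A sketch of the greedy policy π^g in Equations 4 and 5, assuming the expected gain g_i(y) is supplied by a function gain(i, y) (how that expectation is evaluated depends on f and P(Φ); Section 3 specializes it to the independent-states model):

    def greedy_policy(gain, L):
        # Return pi^g: in context y, pick the unchosen item with the largest expected gain.
        def pi(y):
            unchosen = [i for i in range(L) if y[i] == 0]
            return max(unchosen, key=lambda i: gain(i, y))
        return pi

Combined with the run_policy sketch above, greedy_policy(gain, L) can be rolled out for K steps in any state φ.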

46 Then, based on the result of Golovin and Krause [5], π^g is a (1 − 1/e)-approximation to π*, E_φ[f(π^g_K(φ), φ)] ≥ (1 − 1/e) E_φ[f(π*_K(φ), φ)], if f is adaptive submodular and adaptive monotonic. [sent-73, score-0.587]

47 In the rest of this paper, we say that an observation y is a context if it can be observed under the greedy policy π g . [sent-74, score-0.353]

48 3 Adaptive Submodularity in Bandit Setting The greedy policy π^g can be computed only if the objective function f and the distribution of states P(Φ) are known, because both of these quantities are needed to compute the marginal benefit g_i(y) (Equation 5). [sent-76, score-0.554]

49 In practice, the distribution P (Φ) is often unknown, for instance in a newly deployed sensor network where the failure rates of the sensors are unknown. [sent-77, score-0.093]

50 In this paper, we study a natural variant of adaptive submodular maximization that can model such problems. [sent-78, score-0.617]

51 The distribution P (Φ) is assumed to be unknown and we learn it by interacting repeatedly with the problem. [sent-79, score-0.103]

52 First, the number of states φ is exponential in the number of items L. [sent-84, score-0.344]

53 Second, the state of our problem is observed only partially. [sent-85, score-0.079]

54 Another possibility is to learn the probability of individual states φ[i] conditioned on context, i.e., on observations y under the greedy policy π^g in up to K steps. [sent-87, score-0.387]

55 In this paper, we assume that the states of items are independent of the context in which the items are chosen. [sent-90, score-0.669]

56 In particular, the state φ[i] of each item i is drawn i.i.d. from a Bernoulli distribution with mean p_i. [sent-91, score-0.578]

57 In this setting, the joint probability distribution factors as: P(Φ = φ) = ∏_{i=1}^{L} p_i^{1{φ[i]=1}} (1 − p_i)^{1 − 1{φ[i]=1}} (6) and the problem of learning P(Φ) reduces to estimating L parameters, the means of the Bernoullis. [sent-95, score-0.106]
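
Under this independence assumption, Equation 6 is just a product of L Bernoullis over {−1, 1}, so sampling a state and the maximum-likelihood estimation of the means are one-liners. A hedged sketch (names are illustrative, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_state(p):
        # Draw phi from Equation 6: phi[i] = 1 with probability p[i], else -1.
        return np.where(rng.random(len(p)) < p, 1, -1)

    def estimate_means(states):
        # MLE of p from fully observed states (rows of +/-1): fraction of +1 per item.
        return (np.asarray(states) == 1).mean(axis=0)

    p = np.array([0.9, 0.5, 0.2])
    data = [sample_state(p) for _ in range(1000)]
    print(estimate_means(data))   # close to p; OASM only observes the items it chooses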

58 For instance, consider a sensor network where the sensors fail at random due to manufacturing defects. [sent-98, score-0.12]

59 The failures of these sensors are independent of each other and thus can be modeled in our framework. [sent-99, score-0.068]

60 Based on the independence assumption, we rewrite the expected gain (Equation 5) as: g_i(y) = p_i ḡ_i(y), (7) where: ḡ_i(y) = E_φ[ f(dom(y) ∪ {i}, φ) − f(dom(y), φ) | φ ∼ y, φ[i] = 1 ] (8) is the expected gain when item i is in state 1. [sent-102, score-1.682]

61 For simplicity of exposition, we assume that the gain is zero when the item is in state −1. [sent-103, score-0.712]

62 In general, the gain gi (y) depends on P (Φ) and thus cannot be computed when P (Φ) is unknown. [sent-105, score-0.337]

63 In this paper, we assume that ḡ_i(y) can be computed without knowing P(Φ). [sent-106, score-0.203]

64 In maximum coverage problems, for instance, it is quite reasonable to assume that the covered area is only a function of the chosen items and their states. [sent-108, score-0.363]

65 In other words, the gain can be computed as ḡ_i(y) = f(dom(y) ∪ {i}, φ) − f(dom(y), φ), where φ is any state such that φ ∼ y and φ[i] = 1. [sent-109, score-0.416]
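
For example, in a simple (assumed) maximum-coverage model where f(A, φ) counts the targets covered by the chosen items that are in state 1, ḡ_i(y) can be evaluated exactly as described above, from any φ consistent with y. A sketch; covers[i] is a hypothetical set of targets item i covers when it works:

    def f(A, phi, covers):
        # f(A, phi): number of targets covered by chosen items in state 1.
        covered = set()
        for i in A:
            if phi[i] == 1:
                covered |= covers[i]
        return len(covered)

    def gain_bar(i, y, covers):
        # gbar_i(y): marginal coverage of item i assuming phi[i] = 1,
        # computed from a state that agrees with y on its observed entries.
        A = {j for j in range(len(y)) if y[j] != 0}
        phi = list(y)     # unobserved entries stay 0 and are never tested, since A is observed
        before = f(A, phi, covers)
        phi[i] = 1
        return f(A | {i}, phi, covers) - before

    covers = [{0, 1}, {1, 2}, {2, 3}]
    y = [1, 0, 0]                                             # item 0 was chosen and works
    print(gain_bar(1, y, covers), gain_bar(2, y, covers))     # 1 2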

66 In episode t, we adaptively choose K items according to some policy π t , which may differ from episode to episode. [sent-111, score-0.794]

67 The quality of the policy is measured by the expected cumulative K-step return E_{φ_1,…,φ_n}[ Σ_{t=1}^{n} f(π^t_K(φ_t), φ_t) ]. [sent-112, score-0.399]

68 We compare this return to that of the greedy policy π^g and measure the difference between the two returns by the expected cumulative regret: R(n) = E_{φ_1,…,φ_n}[ Σ_{t=1}^{n} ( f(π^g_K(φ_t), φ_t) − f(π^t_K(φ_t), φ_t) ) ]. (9) [sent-116, score-0.476]

69 In maximum coverage problems, the greedy policy π^g is a good surrogate for the optimal policy π* because it is a (1 − 1/e)-approximation to π* (Section 2). [sent-123, score-0.57]
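
Measured empirically over n episodes, Equation 9 is simply the summed gap between the per-episode K-step returns of π^g and π^t on the same states; a small sketch of that bookkeeping:

    def cumulative_regret(returns_greedy, returns_learned):
        # Empirical Equation 9: sum over episodes of
        # f(pi^g_K(phi_t), phi_t) - f(pi^t_K(phi_t), phi_t).
        return sum(g - l for g, l in zip(returns_greedy, returns_learned))

    print(cumulative_regret([4.0, 4.5, 5.0], [3.0, 4.0, 5.0]))   # 1.5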

70 3 Algorithm 1 OASM: Optimistic adaptive submodular maximization. [sent-124, score-0.463]

71 , φn for all i ∈ I do Select item i and set pi,1 to its state, Ti (0) ← 1 end for ˆ for all t = 1, 2, . [sent-128, score-0.499]

Update statistics. Algorithm: Our algorithm is designed based on the optimism in the face of uncertainty principle, a strategy that is at the core of many bandit algorithms [1, 8, 13]. [sent-135, score-0.091]

73 More specifically, it is a greedy policy where the expected gain g_i(y) (Equation 7) is replaced by its optimistic estimate. [sent-136, score-0.807]

74 The algorithm adaptively maximizes a submodular function in an optimistic fashion and therefore we refer to it as Optimistic Adaptive Submodular Maximization (OASM). [sent-137, score-0.466]

75 At each step, we compute the index (p̂_{i,T_i(t−1)} + c_{t−1,T_i(t−1)}) ḡ_i(y) of each item that has not been selected yet and then choose the item with the highest index. [sent-140, score-1.096]

76 The terms p̂_{i,T_i(t−1)} and c_{t−1,T_i(t−1)} are the maximum-likelihood estimate of the probability p_i from the first t − 1 episodes and the radius of the confidence interval around this estimate, respectively. [sent-141, score-0.117]

77 Formally: p̂_{i,s} = (1/s) Σ_{z=1}^{s} (φ_{τ(i,z)}[i] + 1)/2, c_{t,s} = √(2 log(t)/s), (10) where s is the number of times that item i is chosen and τ(i, z) is the index of the episode in which item i is chosen for the z-th time. [sent-142, score-1.185]
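
A sketch of the statistics in Equation 10, assuming the per-item observations are kept as a list of 0/1 values, i.e., (φ[i] + 1)/2 for each episode in which item i was chosen (the names are illustrative):

    import math

    def p_hat(obs01):
        # MLE p_hat_{i,s}: average of (phi_tau(i,z)[i] + 1)/2 over the s pulls of item i.
        return sum(obs01) / len(obs01)

    def c(t, s):
        # Confidence radius c_{t,s} = sqrt(2 log(t) / s).
        return math.sqrt(2.0 * math.log(t) / s)

    # Item i was chosen s = 4 times and worked 3 times; optimistic estimate at episode t = 100.
    obs = [1, 1, 0, 1]
    print(p_hat(obs) + c(99, len(obs)))   # upper confidence bound on p_i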

78 In episode t, we set s to Ti (t − 1), the number of times that item i is selected in the first t − 1 episodes. [sent-143, score-0.593]

79 The radius ct,s is designed such that each index is with high probability an upper bound on the corresponding gain. [sent-144, score-0.109]

80 The index enforces exploration of items that have not been chosen very often. [sent-145, score-0.386]

81 The log(t) term guarantees that each item is explored infinitely often as t → ∞, to avoid linear regret. [sent-147, score-0.499]

82 Second, it is guaranteed to behave near optimally as our estimates of the gain gi (y) become more accurate. [sent-151, score-0.337]

83 Specifically, note that if an item is chosen in one context, it helps in refining the estimate of the gain gi (y) in all other contexts. [sent-155, score-0.867]
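
Putting the pieces together, the following is a hedged end-to-end sketch of OASM under the assumptions above: item states are independent Bernoullis, ḡ_i(y) is computable from y alone (gain_bar below), and a fresh state is drawn each episode by sample_state. All helper names are illustrative and this is not the authors' implementation.

    import math
    import numpy as np

    def oasm(gain_bar, sample_state, L, K, n):
        # Run OASM for n episodes; return the items chosen in each episode.
        T = np.ones(L)                                  # T_i: number of pulls of item i
        S = (sample_state() == 1).astype(float)         # sums of (phi[i]+1)/2; one initial pull per item
        history = []
        for t in range(1, n + 1):
            phi = sample_state()                        # phi_t, revealed only as items are chosen
            y = np.zeros(L, dtype=int)
            chosen = []
            for _ in range(K):
                best, best_index = None, -np.inf
                for i in range(L):
                    if y[i] != 0:                       # item already selected in this episode
                        continue
                    ucb = S[i] / T[i] + math.sqrt(2.0 * math.log(max(t - 1, 1)) / T[i])
                    index = ucb * gain_bar(i, y)
                    if index > best_index:
                        best, best_index = i, index
                y[best] = phi[best]                     # observe the state of the chosen item
                chosen.append(best)
                T[best] += 1                            # update statistics
                S[best] += (phi[best] + 1) / 2
            history.append(chosen)
        return history

    # Tiny usage example with a coverage-style gain (hypothetical numbers).
    p = np.array([0.8, 0.5, 0.3])
    covers = [{0, 1}, {1, 2}, {2, 3}]
    rng = np.random.default_rng(0)

    def sample():
        return np.where(rng.random(len(p)) < p, 1, -1)

    def gbar(i, y):
        working = {j for j in range(len(y)) if y[j] == 1}
        covered = set().union(*(covers[j] for j in working)) if working else set()
        return len(covers[i] - covered)

    print(oasm(gbar, sample, L=3, K=2, n=5))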

84 3 Analysis In this section, we prove an upper bound on the expected cumulative regret of Algorithm OASM in n episodes. [sent-157, score-0.315]

85 We denote by i∗ (y) = π g (y) the item chosen by the greedy policy π g in context y. [sent-159, score-0.862]

86 Without loss of generality, we assume that this item is unique in all contexts. [sent-160, score-0.499]

87 The hardness of discriminating between items i and i*(y) is measured by a gap between the expected gains of the items: ∆_i(y) = g_{i*(y)}(y) − g_i(y). (11) [sent-161, score-0.624]

88 Our analysis is based on counting how many times the policies π^t and π^g choose a different item at step k. [sent-162, score-0.653]

89 Therefore, we define several variables that describe the state of our problem at this step. [sent-163, score-0.079]

90 We denote by Y_k(π) = ∪_φ {φ⟨π_{k−1}(φ)⟩} the set of all possible observations after policy π is executed for k − 1 steps. [sent-164, score-0.267]

91 We write Y_k = Y_k(π^g) and Y^t_k = Y_k(π^t) when we refer to the policies π^g and π^t, respectively. [sent-165, score-0.098]

92 Finally, we denote by Y_{k,i} = Y_k ∩ {y : i ≠ i*(y)} the set of contexts where item i is suboptimal at step k. [sent-166, score-0.549]

93 The terms item and arm are treated as synonyms, and we use whichever is more appropriate in a given context. [sent-169, score-0.581]

94 In practice, the first step where the policies π t and π g choose a different item is unknown, because π g is unknown. [sent-172, score-0.653]

95 Second, in Lemma 1 we bound the expected loss associated with choosing the first different item at step k by the probability of this event and an upper bound on the expected loss Gk , which does not depend on π t and φt . [sent-175, score-0.822]

96 Based on this result, we bound the expected cumulative regret as: R(n) = E_{φ_1,…,φ_n}[ Σ_{t=1}^{n} R_t(φ_t) ], where R_t(φ_t) = Σ_{i=1}^{L} Σ_{k=1}^{K} 1_{i,k,t}(φ_t) (F^g_{k→}(φ_t) − F^t_{k→}(φ_t)) … [sent-176, score-0.29]

97 In Lemma 4, we show how to choose ℓ_{i,k} such that arm i at step k is pulled suboptimally a constant number of times in expectation after ℓ_{i,k} pulls. [sent-193, score-0.215]

98 Based on this result, the regret corresponding to the events 1{T_i(t − 1) > ℓ_{i,k}} is bounded as: E_{φ_1,…,φ_n}[ Σ_{t=1}^{n} Σ_{i=1}^{L} Σ_{k=1}^{K} G_k 1_{i,k,t}(φ_t) 1{T_i(t − 1) > ℓ_{i,k}} ] … [sent-194, score-0.144]

99 ,φn i,k } t=1 i=1 k=1 ≤ 2 2 π L(L + 1) 3 On the other hand, the regret associated with the events 1{Ti (t − 1) ≤ L K i=1 k=1 Gk i,k . [sent-197, score-0.144]

100 A tighter upper bound is proved below: Σ_{i=1}^{L} Σ_{k=1}^{K} 1_{i,k,t}(φ_t) 1{T_i(t − 1) ≤ ℓ_{i,k}} … [sent-198, score-0.053]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('item', 0.499), ('submodular', 0.339), ('items', 0.301), ('dom', 0.238), ('policy', 0.231), ('oasm', 0.211), ('gi', 0.203), ('gk', 0.187), ('ti', 0.157), ('maximization', 0.154), ('gain', 0.134), ('fk', 0.133), ('adaptive', 0.124), ('regret', 0.12), ('ya', 0.113), ('yk', 0.111), ('policies', 0.098), ('episode', 0.094), ('optimistic', 0.087), ('state', 0.079), ('greedy', 0.077), ('expected', 0.075), ('cumulative', 0.067), ('yb', 0.066), ('alto', 0.06), ('palo', 0.06), ('technicolor', 0.06), ('arm', 0.06), ('pi', 0.053), ('suboptimally', 0.053), ('bandit', 0.051), ('elicitation', 0.049), ('interacting', 0.049), ('choosing', 0.049), ('sensors', 0.047), ('golovin', 0.046), ('pulled', 0.046), ('sensor', 0.046), ('states', 0.043), ('adaptively', 0.04), ('optimism', 0.04), ('episodes', 0.039), ('logarithmically', 0.039), ('labs', 0.038), ('mistakes', 0.038), ('ct', 0.037), ('observations', 0.036), ('submodularity', 0.036), ('choose', 0.034), ('repeatedly', 0.033), ('monotonic', 0.033), ('highest', 0.033), ('hundred', 0.032), ('krause', 0.031), ('traditionally', 0.031), ('index', 0.031), ('rt', 0.031), ('chosen', 0.031), ('coverage', 0.031), ('max', 0.03), ('exposition', 0.029), ('preference', 0.029), ('bound', 0.028), ('contexts', 0.028), ('learned', 0.028), ('costly', 0.027), ('kveton', 0.027), ('branislav', 0.027), ('manufacturing', 0.027), ('wen', 0.027), ('zhengwen', 0.027), ('return', 0.026), ('dence', 0.025), ('radius', 0.025), ('upper', 0.025), ('initially', 0.025), ('eriksson', 0.024), ('muthukrishnan', 0.024), ('synonyms', 0.024), ('inherent', 0.024), ('events', 0.024), ('maximize', 0.024), ('indicator', 0.024), ('rewrite', 0.024), ('context', 0.024), ('disagree', 0.023), ('discriminating', 0.023), ('gabillon', 0.023), ('pulls', 0.023), ('exploration', 0.023), ('step', 0.022), ('gains', 0.022), ('whichever', 0.022), ('observation', 0.021), ('unknown', 0.021), ('rutgers', 0.021), ('claims', 0.021), ('failures', 0.021), ('ca', 0.021), ('event', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan

Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1

2 0.27758539 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes

Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1

3 0.27032372 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

Author: Rishabh K. Iyer, Jeff A. Bilmes

Abstract: We investigate two new optimization problems — minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 25] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms. 1

4 0.21398129 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

5 0.21297601 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?

Author: Qiang Liu, Alex Ihler, Mark Steyvers

Abstract: We study the problem of estimating continuous quantities, such as prices, probabilities, and point spreads, using a crowdsourcing approach. A challenging aspect of combining the crowd’s answers is that workers’ reliabilities and biases are usually unknown and highly diverse. Control items with known answers can be used to evaluate workers’ performance, and hence improve the combined results on the target items with unknown answers. This raises the problem of how many control items to use when the total number of items each workers can answer is limited: more control items evaluates the workers better, but leaves fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide theoretical analysis of the accuracy of different consensus methods. 1

6 0.20124066 257 nips-2013-Projected Natural Actor-Critic

7 0.19222341 268 nips-2013-Reflection methods for user-friendly submodular optimization

8 0.1905095 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

9 0.18875444 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

10 0.15376025 348 nips-2013-Variational Policy Search via Trajectory Optimization

11 0.15228337 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

12 0.14724986 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

13 0.14187266 241 nips-2013-Optimizing Instructional Policies

14 0.13806416 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

15 0.12518413 7 nips-2013-A Gang of Bandits

16 0.12124507 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.12110037 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

18 0.11782211 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

19 0.1168965 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

20 0.11475965 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.199), (1, -0.269), (2, -0.006), (3, 0.071), (4, 0.023), (5, 0.063), (6, -0.213), (7, -0.31), (8, 0.112), (9, -0.12), (10, -0.048), (11, 0.013), (12, -0.119), (13, 0.014), (14, 0.023), (15, -0.019), (16, 0.058), (17, -0.002), (18, -0.039), (19, 0.008), (20, 0.083), (21, -0.037), (22, -0.033), (23, -0.014), (24, 0.026), (25, 0.021), (26, 0.011), (27, -0.05), (28, -0.022), (29, -0.091), (30, -0.03), (31, 0.034), (32, -0.055), (33, -0.086), (34, -0.129), (35, -0.043), (36, 0.019), (37, -0.02), (38, 0.019), (39, 0.058), (40, -0.055), (41, 0.004), (42, 0.018), (43, -0.085), (44, -0.014), (45, 0.0), (46, 0.01), (47, -0.009), (48, -0.15), (49, 0.126)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96568751 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan

Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1

2 0.67532325 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

Author: Rishabh K. Iyer, Jeff A. Bilmes

Abstract: We investigate two new optimization problems — minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 25] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms. 1

3 0.66495711 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes

Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1

4 0.57643175 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Author: Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, Andreas Krause

Abstract: Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable yet representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol G REE D I, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show, that under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop. 1

5 0.55838448 268 nips-2013-Reflection methods for user-friendly submodular optimization

Author: Stefanie Jegelka, Francis Bach, Suvrit Sra

Abstract: Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1

6 0.48428518 257 nips-2013-Projected Natural Actor-Critic

7 0.48004732 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

8 0.47763452 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

9 0.46816409 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

10 0.45471451 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

11 0.45030233 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

12 0.44136062 241 nips-2013-Optimizing Instructional Policies

13 0.43885913 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

14 0.43485329 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

15 0.42305642 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?

16 0.41382718 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

17 0.40089428 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

18 0.39515895 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

19 0.39166689 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

20 0.38033307 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.024), (16, 0.018), (33, 0.112), (34, 0.111), (41, 0.071), (49, 0.042), (56, 0.181), (58, 0.193), (70, 0.042), (85, 0.064), (89, 0.018), (93, 0.017), (95, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84331357 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan

Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1

2 0.7719264 184 nips-2013-Marginals-to-Models Reducibility

Author: Tim Roughgarden, Michael Kearns

Abstract: We consider a number of classical and new computational problems regarding marginal distributions, and inference in models specifying a full joint distribution. We prove general and efficient reductions between a number of these problems, which demonstrate that algorithmic progress in inference automatically yields progress for “pure data” problems. Our main technique involves formulating the problems as linear programs, and proving that the dual separation oracle required by the ellipsoid method is provided by the target problem. This technique may be of independent interest in probabilistic inference. 1

3 0.77126783 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1

4 0.77100164 249 nips-2013-Polar Operators for Structured Sparse Estimation

Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans

Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1

5 0.77024496 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

Author: Zheng Wen, Benjamin Van Roy

Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1

6 0.76857489 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

7 0.7667712 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

8 0.76653916 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

9 0.76570761 224 nips-2013-On the Sample Complexity of Subspace Learning

10 0.76388836 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

11 0.76376569 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

12 0.76375824 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

13 0.76220167 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

14 0.75973088 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

15 0.75939167 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

16 0.75906873 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

17 0.75875926 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

18 0.75868446 11 nips-2013-A New Convex Relaxation for Tensor Completion

19 0.75816989 63 nips-2013-Cluster Trees on Manifolds

20 0.75756758 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings