jmlr jmlr2011 jmlr2011-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
Reference: text
sentIndex sentText sentNum sentScore
1 Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. [sent-9, score-0.4]
2 Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates 1. [sent-15, score-0.532]
3 Introduction In the classical stochastic bandit problem a gambler tries to maximize his revenue by sequentially playing one of a finite number of slot machines that are associated with initially unknown (and potentially different) payoff distributions (Robbins, 1952). [sent-16, score-0.35]
4 Maximizing the total cumulative payoff is equivalent to minimizing the (total) regret, that is, minimizing the difference between the total cumulative payoff of the gambler and the one of another clairvoyant gambler who chooses the arm with the best mean-payoff in every round. [sent-22, score-0.462]
5 The quality of the gambler’s strategy can be characterized as the rate of growth of his expected regret with time. [sent-23, score-0.322]
6 While in the Bayesian case the question is whether the optimal actions can be computed efficiently, in the frequentist case the question is how to achieve a low rate of growth of the regret in the absence of prior information; that is, it is a statistical question. [sent-29, score-0.293]
7 Although the first papers studied bandits with a finite number of arms, researchers soon realized that bandits with infinitely many arms are also interesting, as well as practically significant. [sent-31, score-0.536]
8 A special case of interest, which forms a bridge between the case of a finite number of arms and the continuum-armed setting, is formed by bandit linear optimization, see Abernethy et al. [sent-39, score-0.349]
9 This is because when the set of arms is uncountably infinite and absolutely no assumptions are made on the payoff function, it is impossible to construct a strategy that simultaneously achieves sublinear regret for all bandit problems (see, e. [sent-48, score-0.774]
10 When the set of arms is a metric space (possibly with the power of the continuum) previous works have assumed either the global smoothness of the payoff function (Agrawal, 1995b; Kleinberg, 2004; Kleinberg et al. [sent-52, score-0.337]
11 In this paper, we assume that there exists a dissimilarity function that constrains the behavior of the mean-payoff function, where a dissimilarity function is a measure of the discrepancy between two arms that is not required to be symmetric or reflexive, nor to satisfy the triangle inequality. [sent-57, score-0.494]
12 We also assume that the decision maker can construct a recursive covering of the space of arms in such a way that the diameters of the sets in the covering shrink at a known geometric rate when measured with this dissimilarity. [sent-62, score-0.306]
13 Kleinberg et al. (2008b) proposed a novel algorithm that achieves essentially the best possible regret bound in a minimax sense with respect to the environments studied, as well as a much better regret bound if the mean-payoff function has a small “zooming dimension”. [sent-71, score-0.723]
14 Thus, in this case, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. [sent-94, score-0.293]
15 Assuming the knowledge of the horizon, we thus propose a computationally more efficient variant of the basic algorithm, called truncated HOO and prove that it enjoys a regret bound identical to the one of the basic version (Section 4. [sent-112, score-0.355]
16 Problem Setup A stochastic bandit problem B is a pair B = (X , M), where X is a measurable space of arms and M determines the distribution of rewards associated with each arm. [sent-124, score-0.405]
17 A decision maker (the gambler of the introduction) that interacts with a stochastic bandit problem B plays a game at discrete time steps according to the following rules. [sent-132, score-0.314]
18 In the first round the decision maker selects an arm X1 ∈ X and receives a reward Y1 drawn at random from M_{X1}. [sent-133, score-0.319]
19 At round n, the cumulative regret of a decision maker playing B is R_n = n f* − ∑_{t=1}^{n} Y_t, that is, the difference between the maximum expected payoff in n rounds and the actual total payoff. [sent-151, score-0.66]
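To make the definition concrete, the following minimal Python sketch computes R_n for a toy run; the reward values and f* below are hypothetical placeholders, not data from the paper.

```python
# Cumulative regret R_n = n * f_star - sum(Y_t for t = 1..n).
f_star = 0.9                      # maximal mean payoff, assumed known here
rewards = [0.7, 0.85, 0.6, 0.9]   # hypothetical rewards Y_1, ..., Y_n
n = len(rewards)
R_n = n * f_star - sum(rewards)
print(R_n)  # ≈ 0.55
```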
20 In the sequel, we shall restrict our attention to the expected cumulative regret, which is defined as the expectation E[Rn ] of the cumulative regret Rn . [sent-152, score-0.405]
21 that is, the actual rewards used in the definition of the regret are replaced by the mean-payoffs of the arms pulled. [sent-155, score-0.525]
22 Since (by the tower rule) E[Y_t] = E[ E[Y_t | X_t] ] = E[f(X_t)], the expected values E[R_n] of the cumulative regret and E[R̄_n] of the cumulative pseudo-regret are the same. [sent-156, score-0.373]
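A tiny Monte Carlo illustration of this tower-rule identity; the mean-payoff function and the noise model below are hypothetical.

```python
import random

random.seed(1)
f = lambda x: 0.5 + 0.4 * x                           # toy mean-payoff on [0, 1]
xs = [random.random() for _ in range(200_000)]        # arm choices X_t
ys = [f(x) + random.uniform(-0.1, 0.1) for x in xs]   # E[Y_t | X_t] = f(X_t)
# The two averages agree up to Monte Carlo error, since E[Y_t] = E[f(X_t)].
print(sum(ys) / len(ys), sum(f(x) for x in xs) / len(xs))
```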
23 (2011), in many real-world problems, the decision maker is not interested in his cumulative regret but rather in the simple regret. [sent-159, score-0.391]
24 After n rounds of play in a stochastic bandit problem B , the decision maker is asked to make a recommendation Zn ∈ X based on the n obtained rewards Y1 , . [sent-161, score-0.316]
25 The simple regret of this recommendation equals rn = f ∗ − f (Zn ) . [sent-165, score-0.346]
26 In this paper we focus on the cumulative regret Rn , but all the results can be readily extended to the simple regret by considering the recommendation Zn = XTn , where Tn is drawn uniformly at random in {1, . [sent-166, score-0.626]
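A short sketch of this recommendation rule; the played points below are hypothetical.

```python
import random

random.seed(0)
played_points = [0.12, 0.48, 0.51, 0.50, 0.49]     # X_1, ..., X_n played so far
T_n = random.randrange(1, len(played_points) + 1)  # T_n uniform in {1, ..., n}
Z_n = played_points[T_n - 1]                       # recommend Z_n = X_{T_n}
print(T_n, Z_n)
```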
27 To implement this idea, HOO maintains a binary tree whose nodes are associated with measurable regions of the arm-space X such that the regions associated with nodes deeper in the tree (further away from the root) represent increasingly smaller subsets of X . [sent-176, score-0.358]
28 In particular, HOO keeps track of the number of times a node was traversed up to round n and the corresponding empirical average of the rewards received so far. [sent-179, score-0.328]
29 The tree of coverings which HOO needs to receive as an input is an infinite binary tree whose nodes are associated with subsets of X . [sent-186, score-0.343]
30 The nodes in this tree are indexed by pairs of integers (h, i); node (h, i) is located at depth h ≥ 0 from the root. [sent-187, score-0.472]
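A minimal sketch of this indexing, under the usual convention (also visible in Figure 1) that node (h, i) has children (h + 1, 2i − 1) and (h + 1, 2i):

```python
def children(h, i):
    # Children of node (h, i), both at depth h + 1.
    return (h + 1, 2 * i - 1), (h + 1, 2 * i)

def parent(h, i):
    # Inverse map: both children point back to (h - 1, ceil(i / 2)).
    return (h - 1, (i + 1) // 2)

root = (0, 1)
assert children(*root) == ((1, 1), (1, 2))
assert all(parent(*c) == (3, 5) for c in children(3, 5))
```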
31 Thus, the initial tree is T0 = {(0, 1)} and it is expanded round after round as Tn = Tn−1 ∪ {(Hn , In )} , where (Hn , In ) is the node selected in line 15. [sent-202, score-0.499]
32 We use Xn to denote the point selected by HOO in the region associated with the node played in round n, while Yn denotes the received reward. [sent-204, score-0.366]
33 The B-value, Bh,i (n), at node (h, i) by the end of round n is an estimated upper bound on the mean-payoff function at node (h, i). [sent-206, score-0.433]
34 To define it we first need to introduce the average of the rewards received in rounds when some descendant of node (h, i) was chosen (by convention, each node is a descendant of itself): µ_{h,i}(n) = (1/T_{h,i}(n)) ∑_{t=1}^{n} Y_t I{(H_t, I_t) ∈ C(h, i)}. [sent-207, score-0.387]
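A sketch of how these per-node statistics can be maintained incrementally; the node keys and rewards are illustrative, and `update_path` is a hypothetical helper name.

```python
from collections import defaultdict

counts = defaultdict(int)    # T_{h,i}(n): times a descendant of (h, i) was played
sums = defaultdict(float)    # running reward sums, so mu_{h,i} = sums / counts

def update_path(path, reward):
    """path = list of (h, i) from the root down to the played node."""
    for node in path:
        counts[node] += 1
        sums[node] += reward

update_path([(0, 1), (1, 2), (2, 3)], 0.8)
update_path([(0, 1), (1, 2)], 0.4)
print(sums[(1, 2)] / counts[(1, 2)])  # empirical mean at node (1, 2): 0.6
```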
35 Note that at the beginning of round n, the algorithm uses Bh,i (n − 1) to select the node (Hn , In ) to be played (since Bh,i (n) will only be available at the end of round n). [sent-228, score-0.484]
36 It does so by following a path from the root node to an inner node with only one child or a leaf and finally considering a child (Hn , In ) of the latter; at each node of the path, the child with highest B-value is chosen, till the node (Hn , In ) with infinite B-value is reached. [sent-229, score-0.691]
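A sketch of this selection step; `b_value` is a hypothetical lookup returning the current B-value of a node (+∞ for a node not played yet).

```python
import math

def select_node(b_value, root=(0, 1)):
    # Descend from the root, always moving to the child with the larger
    # B-value, until a node with infinite B-value (unplayed) is reached.
    h, i = root
    while b_value((h, i)) != math.inf:
        left, right = (h + 1, 2 * i - 1), (h + 1, 2 * i)
        # ties may be broken arbitrarily; here we favor the left child
        h, i = left if b_value(left) >= b_value(right) else right
    return h, i

# Toy B-values: only the root and its left child have been played so far.
table = {(0, 1): 0.9, (1, 1): 0.7}
print(select_node(lambda node: table.get(node, math.inf)))  # (1, 2)
```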
37 1 Illustrations Figure 1 illustrates the computation done by HOO in round n, as well as the correspondence between the nodes of the tree constructed by the algorithm and their associated regions. [sent-231, score-0.325]
38 3 we modify HOO so that if the time horizon n0 is known in advance, the total running time is O(n0 ln n0 ), while the modified algorithm will be shown to enjoy essentially the same regret bound as the original version. [sent-238, score-0.61]
39 Figure 1: Illustration of the node selection procedure in round n (followed path with B-values B_{h,i}, B_{h+1,2i−1}, B_{h+1,2i}; selected node (Hn, In); pulled point Xn). [sent-239, score-0.435]
40 Main Results We start by describing and commenting on the assumptions that we need to analyze the regret of HOO. [sent-243, score-0.319]
41 However, somewhat unconventionally, we shall use a notion of smoothness that is built around dissimilarity functions rather than distances, allowing us to deal with function classes of highly different smoothness degrees in a unified manner. [sent-248, score-0.301]
42 Assumptions Given the parameters of HOO, that is, the real numbers ν1 > 0 and ρ ∈ (0, 1) and the tree of coverings (Ph,i ), there exists a dissimilarity function ℓ such that the following two assumptions are satisfied. [sent-263, score-0.349]
43 Example 2 In the same setting, with the same tree of coverings (Ph,i) over X = [0, 1]^D, but now with the dissimilarity ℓ(x, y) = b ‖x − y‖^a, we get that for all integers u ≥ 0 and 0 ≤ k ≤ D − 1, diam(P_{uD+k,1}) = b (1/2^u)^a. [sent-302, score-0.372]
44 (2007, Assumption 2) observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volumes of the sets of ε-optimal arms shrink as ε → 0. [sent-339, score-0.642]
45 the ε-packing number N(X, ℓ, ε) of X w.r.t. the dissimilarity ℓ is the size of the largest packing of X with disjoint ℓ-open balls of radius ε. [sent-347, score-0.351]
46 The near-optimality dimension is then lim sup_{ε→0} ln N(X_{cε}, ℓ, ε) / ln(ε^{−1}). The following example shows that using a dissimilarity (rather than a metric, for instance) may sometimes allow for a significant reduction of the near-optimality dimension. [sent-355, score-0.4]
47 First, the quantities U_{h,i}(n) of (3) are redefined by replacing the factor ln n by ln n0, that is, now U_{h,i}(n) = µ_{h,i}(n) + √( (2 ln n0) / T_{h,i}(n) ) + ν1 ρ^h. [sent-411, score-0.723]
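A one-line sketch of the modified U-value; all numeric inputs below are illustrative.

```python
import math

# U_{h,i}(n) with the truncated-HOO modification: the exploration term uses
# ln(n0) for a known horizon n0 instead of ln(n).
def u_value(mu, T, n0, nu1, rho, h):
    return mu + math.sqrt(2.0 * math.log(n0) / T) + nu1 * rho ** h

print(u_value(mu=0.6, T=10, n0=1000, nu1=1.0, rho=0.5, h=3))
```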
48 However, and this is the reason for the second modification, in the basic algorithm, a path at round n may be of length linear in n (because the tree could have a depth linear in n). [sent-416, score-0.346]
49 This is why we also truncate the trees Tn at a depth Dn0 of the order of ln n0 . [sent-417, score-0.323]
50 Note that since no child of a node (D_{n0}, i) located at depth D_{n0} will ever be explored, its B-value at round n ≤ n0 simply equals U_{D_{n0},i}(n). [sent-420, score-0.432]
51 The computational complexity of updating all B-values at each round n is of the order of Dn0 and thus of the order of ln n0 . [sent-422, score-0.387]
52 The total computational complexity up to round n0 is therefore of the order of n0 ln n0 , as claimed in the introduction of this section. [sent-423, score-0.387]
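A sketch of this accounting; the proportionality constant in D_{n0} is illustrative.

```python
import math

# With truncation depth D_{n0} proportional to ln(n0), each round updates at
# most D_{n0} B-values, so n0 rounds cost O(n0 ln n0) overall.
def truncation_depth(n0, c=1.0):
    return max(1, math.ceil(c * math.log(n0)))

n0 = 10_000
D_n0 = truncation_depth(n0)
print(D_n0, n0 * D_n0)  # per-round work bound and a crude total-work proxy
```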
53 As the next theorem indicates this new procedure enjoys almost the same cumulative regret bound as the basic HOO algorithm. [sent-424, score-0.394]
54 Theorem 8 (Upper bound on the regret of truncated HOO) Fix a horizon n0 such that D_{n0} ≥ 1. [sent-425, score-0.396]
55 Then, the regret bound of Theorem 6 still holds true at round n0 for truncated HOO up to an additional additive 4√n0 term. [sent-426, score-0.501]
56 Doing so, we will be able to deal with an even larger class of functions and in fact we will show that the algorithm studied in this section achieves a O(√n) bound on the regret when it is used for functions that are smooth around their maxima (e. [sent-429, score-0.7]
57 Algorithm z-HOO works as follows: it never plays any node with depth smaller than or equal to z − 1 and directly starts the selection of a new node at depth z. [sent-466, score-0.416]
58 Note in particular that the initialization of this algorithm consists (in the first 2^z rounds) in playing once each of the 2^z nodes located at depth z in the tree (since by definition a node that has not been played yet has a B-value equal to +∞). [sent-468, score-0.513]
59 Note that each fresh start needs to pull each of the 2^{z_r} nodes located at depth z_r at least once (the number of these nodes is ≈ r). [sent-475, score-0.358]
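A sketch of this fresh-start schedule; the printed regimes are purely illustrative.

```python
import math

# Regime r lasts 2**r rounds and restarts z-HOO at depth z_r = ceil(log2(r)),
# so each fresh start first plays the 2**z_r nodes at that depth, roughly r
# of them.
for r in range(1, 9):
    z_r = math.ceil(math.log2(r))
    print(f"r={r}: regime length 2^{r}={2 ** r}, depth z_r={z_r}, "
          f"initial pulls 2^z_r={2 ** z_r}")
```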
60 In the rest of this section, we first propose an upper bound on the regret of z-HOO (with exact and explicit constants). [sent-477, score-0.328]
61 If, in addition, z ≥ h0 and n ≥ 2 is large enough so that z ≤ (1/(d+2)) · (ln(4Lν1 n) − ln(γ ln n)) / ln(1/ρ), where γ = ( 4CLν1 ν2^{−d} / ((1/ρ)^{d+1} − 1) ) · ( 16/(ν1² ρ²) + 9 ), then the following bound holds for the expected regret of z-HOO: E[Rn] ≤ (1 + 1/ρ^{d+2}) (4Lν1 n)^{(d+1)/(d+2)} (γ ln n)^{1/(d+2)} + (2^z − 1) ( 8 ln n/(ν1² ρ^{2z}) + 4 ). [sent-484, score-1.051]
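An illustrative evaluation of this bound; every constant below (L, ν1, ρ, γ, d, z) is a hypothetical placeholder, chosen only to exhibit the n^{(d+1)/(d+2)} (ln n)^{1/(d+2)} growth.

```python
import math

def z_hoo_bound(n, d, L, nu1, rho, gamma, z):
    # Main term: (1 + 1/rho^{d+2}) (4 L nu1 n)^{(d+1)/(d+2)} (gamma ln n)^{1/(d+2)}
    main = (1 + rho ** -(d + 2)) * (4 * L * nu1 * n) ** ((d + 1) / (d + 2)) * (
        gamma * math.log(n)) ** (1 / (d + 2))
    # Extra term from the 2^z - 1 nodes above depth z.
    extra = (2 ** z - 1) * (8 * math.log(n) / (nu1 ** 2 * rho ** (2 * z)) + 4)
    return main + extra

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, round(z_hoo_bound(n, d=1, L=1.0, nu1=1.0, rho=0.5, gamma=2.0, z=2)))
```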
62 In particular, this term disappears when z = 0, in which case we get a bound on the regret of HOO provided that 2ν1 < ε0 holds. [sent-490, score-0.328]
63 E[Rn] ≤ √(32 γ n ln n). The following theorem is an almost straightforward consequence of Theorem 9 (the detailed proof can be found in Section A. [sent-503, score-0.299]
64 a dissimilarity ℓ) is defined as lim sup_{ε→0} ln N(X, ℓ, ε) / ln(ε^{−1}). [sent-514, score-0.4]
65 For the sake of clarity, we now denote, for a bandit strategy ϕ and a bandit environment M on X , the expectation of the cumulative regret of ϕ over M at time n by EM Rn (ϕ) . [sent-520, score-0.737]
66 The following theorem provides a uniform upper bound on the regret of HOO over this class of environments. [sent-521, score-0.354]
67 Theorem 12 (Uniform upper bound on the regret of HOO) Assume that X has a finite ℓ-packing dimension D and that the parameters of HOO are such that A1 is satisfied. [sent-525, score-0.364]
68 Theorem 10 thus implies that the expected regret of local-HOO is O(√n), that is, the rate of the bound is independent of the dimension D. [sent-549, score-0.364]
69 Then f is still locally weak-Lipschitz with respect to the dissimilarity ℓ_{c,β} and the near-optimality dimension is d = D(1/β − 1/α), as shown in Example 3; the regret of HOO is O( n^{(d+1)/(d+2)} ). [sent-551, score-0.523]
70 , 2008a) have considered continuum-armed bandits in Euclidean or, more generally, normed or metric spaces and provided upper and lower bounds on the regret for given classes of environments. [sent-561, score-0.509]
71 • Cope (2009) derived a O(√n) bound on the regret for compact and convex subsets of R^d and mean-payoff functions with a unique minimum and second-order smoothness. [sent-562, score-0.328]
72 They derived the regret bound Θ( n^{(1+α−αβ)/(1+2α−αβ)} ), where the parameter β is such that the Lebesgue measure of the set of ε-optimal arms is O(ε^β). [sent-567, score-0.404]
73 The obtained regret bound is O( n^{(d+1)/(d+2)} ), where d is the zooming dimension. [sent-574, score-0.427]
74 The latter is defined similarly to our near-optimality dimension with the exceptions that in the definition of zooming dimension (i) covering numbers instead of packing numbers are used and (ii) sets of the form Xε \ Xε/2 are considered instead of the set Xcε . [sent-575, score-0.302]
75 Our result extends the √n rate of the regret bound to any dimension D. [sent-581, score-0.364]
76 Since up to round t no more than t nodes have been played (including the suboptimal node (h, i)), we know that (t, i_t^*) has not been played so far and thus has a B-value equal to +∞. [sent-633, score-0.534]
77 Applying the Hoeffding-Azuma inequality for martingale differences (see Hoeffding, 1963), using the boundedness of the ranges of the induced martingale difference sequence, we then get, for each t ≥ 1, P( ∑_{j=1}^{t} (f(X_j) − Y_j) ≥ 2√(2t ln n) ) ≤ exp( −2 (2√(2t ln n))² / (t · 2²) ) = n^{−4}, which concludes the proof. [sent-673, score-0.536]
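A numeric sanity check of this step: with threshold 2√(2t ln n) and martingale-difference ranges bounded by 2, the exponent equals −4 ln n exactly.

```python
import math

n, t = 100, 37
eps = 2.0 * math.sqrt(2.0 * t * math.log(n))
# Hoeffding-Azuma bound exp(-2 eps^2 / (t * range^2)) with range = 2.
bound = math.exp(-2.0 * eps ** 2 / (t * 2 ** 2))
print(bound, n ** -4.0)  # both equal 1e-08 up to floating point
```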
78 Lemma 16 For all integers t ≤ n, for all suboptimal nodes (h, i) such that ∆_{h,i} > ν1 ρ^h, and for all integers u ≥ 1 such that u ≥ 8 ln n / (∆_{h,i} − ν1 ρ^h)², one has P( U_{h,i}(t) > f* and T_{h,i}(t) > u ) ≤ t n^{−4}. [sent-674, score-0.469]
79 Lemma 17 Under Assumptions A1 and A2, for all suboptimal nodes (h, i) with ∆_{h,i} > ν1 ρ^h, we have, for all n ≥ 1, E[T_{h,i}(n)] ≤ 8 ln n / (∆_{h,i} − ν1 ρ^h)² + 4. [sent-679, score-0.371]
80 (∆h,i − ν1 ρh )2 Proof We take u as the upper integer part of (8 ln n)/(∆h,i − ν1 ρh )2 and use union bounds to get from Lemma 14 the bound E Th,i (n) 8 ln n +1 (∆h,i − ν1 ρh )2 n + ∑ t=u+1 t−1 P Th,i (t) > u and Uh,i (t) > f ∗ + ∑ P Us,i∗ (t) s s=1 1681 f∗ . [sent-680, score-0.517]
81 we denote by Jh the nodes in J that are located at depth h in the tree (i. [sent-695, score-0.297]
82 Since for these nodes ∆_{h,i} > 2ν1 ρ^h, we get E[T_{h,i}(n)] ≤ 8 ln n / (ν1² ρ^{2h}) + 4. [sent-699, score-0.339]
83 Let the set T^1 contain the descendants of the nodes in I_H (by convention, a node is considered its own descendant, hence the nodes of I_H are included in T^1); let T^2 = ∪_{0 ≤ h < H} I_h. [sent-717, score-0.36]
84 Now, by choosing H such that ρ^{−H(d′+2)} is of the order of n/ln n, that is, ρ^H is of the order of (n/ln n)^{−1/(d′+2)}, we get the desired result, namely, E[Rn] = O( n^{(d′+1)/(d′+2)} (ln n)^{1/(d′+2)} ). [sent-719, score-0.482]
85 Let (h, i) be a suboptimal node with h ≤ D_{n0} and let 0 ≤ k ≤ h − 1 be the largest depth such that (k, i_k^*) is on the path from the root (0, 1) to (h, i). [sent-725, score-0.301]
86 For all integers t ≤ n0, for all suboptimal nodes (h, i) such that h ≤ D_{n0} and ∆_{h,i} > ν1 ρ^h, and for all integers u ≥ 1 such that u ≥ 8 ln n0 / (∆_{h,i} − ν1 ρ^h)², one has P( U_{h,i}(t) > f* and T_{h,i}(t) > u ) ≤ t (n0)^{−4}. [sent-729, score-0.469]
87 The instantaneous regret incurred when playing any of these nodes is therefore bounded by 4/√n0; and the associated cumulative regret (over n0 rounds) can be bounded by 4√n0. [sent-741, score-0.748]
88 The rest of the proof goes through and only this additional additive factor of 4√n0 is suffered in the final regret bound. [sent-743, score-0.325]
89 Under Assumptions A1 and A2’, the algorithm z-HOO satisfies that for all n ≥ 1 and all suboptimal nodes (h, i) with ∆_{h,i} > ν1 ρ^h and h ≥ z, E[T_{h,i}(n)] ≤ 8 ln n / (∆_{h,i} − ν1 ρ^h)² + 4. [sent-763, score-0.371]
90 Since for these nodes ∆_{h,i} > 2ν1 ρ^h, we get E[T_{h,i}(n)] ≤ 8 ln n / (ν1² ρ^{2h}) + 4. [sent-788, score-0.339]
91 As in the proof of Theorem 6 we denote by R_{n,i} the regret resulting from the selection of nodes in T^i, for i ∈ {1, 2, 3, 4}. [sent-797, score-0.423]
92 Finally, for T^4, we use that it contains at most 2^z − 1 nodes, each of them being associated with a regret controlled by Lemma 19; therefore, R_{n,4} ≤ (2^z − 1) ( 8 ln n / (ν1² ρ^{2z}) + 4 ). [sent-802, score-0.534]
93 Denoting γ = ( 4CLν1 ν2^{−d} / ((1/ρ)^{d+1} − 1) ) · ( 16/(ν1² ρ²) + 9 ), it follows that for n ≥ 2, R_{n,2} + R_{n,3} ≤ γ ρ^{−H(d+1)} ln n. [sent-805, score-0.723]
94 To this end, let H be the smallest integer k such that 4Lν1 ρ^k n ≤ γ ρ^{−k(d+1)} ln n; in particular, ρ^H ≤ ( (γ ln n) / (4Lν1 n) )^{1/(d+2)} and 4Lν1 ρ^{H−1} n > γ ρ^{−(H−1)(d+1)} ln n, implying γ ρ^{−H(d+1)} ln n ≤ ρ^{−(d+2)} · 4Lν1 ρ^H n. [sent-808, score-0.964]
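A one-line check of the last implication, using the minimality of H:

```latex
% From the minimality of H: 4 L \nu_1 \rho^{H-1} n > \gamma \rho^{-(H-1)(d+1)} \ln n.
% Multiplying both sides by \rho^{-(d+1)} and using H - 1 - (d+1) = H - (d+2):
\[
\gamma \rho^{-H(d+1)} \ln n
  \;<\; \rho^{-(d+1)}\, 4 L \nu_1 \rho^{H-1} n
  \;=\; \rho^{-(d+2)}\, 4 L \nu_1 \rho^{H} n .
\]
```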
95 The final bound on the regret is then E[Rn] ≤ 4Lν1 ρ^H n + γ ρ^{−H(d+1)} ln n + (2^z − 1)( 8 ln n/(ν1² ρ^{2z}) + 4 ) ≤ (1 + 1/ρ^{d+2}) 4Lν1 ρ^H n + (2^z − 1)( 8 ln n/(ν1² ρ^{2z}) + 4 ) ≤ (1 + 1/ρ^{d+2}) 4Lν1 n ( (γ ln n)/(4Lν1 n) )^{1/(d+2)} + (2^z − 1)( 8 ln n/(ν1² ρ^{2z}) + 4 ) = (1 + 1/ρ^{d+2}) (4Lν1 n)^{(d+1)/(d+2)} (γ ln n)^{1/(d+2)} + (2^z − 1)( 8 ln n/(ν1² ρ^{2z}) + 4 ). [sent-810, score-2.015]
96 Let r0 be a positive integer such that for all r ≥ r0, one has z_r := ⌈log2 r⌉ ≥ h0 and z_r ≤ (1/(d+2)) · (ln(4Lν1 2^r) − ln(γ ln 2^r)) / ln(1/ρ); we can therefore apply the result of Theorem 9 in regimes indexed by r ≥ r0. [sent-814, score-0.329]
97 For previous regimes, we simply upper bound the regret by the number of rounds, that is, 2^{r0} − 2 ≤ 2^{r0}. [sent-815, score-0.328]
98 We use the notations I_t′ and Y_t′ for, respectively, the arms pulled and the rewards obtained by the strategy ψ(K+1) at each round t. [sent-906, score-0.432]
99 The proof is concluded by noting that • the left-hand side is smaller than the maximal regret w. [sent-1004, score-0.325]
100 Sample mean based index policies with o(log n) regret for the multi-armed bandit problem. [sent-1023, score-0.466]
wordName wordTfidf (topN-words)
[('hoo', 0.546), ('regret', 0.293), ('ln', 0.241), ('bandits', 0.18), ('arms', 0.176), ('bandit', 0.173), ('dissimilarity', 0.159), ('toltz', 0.157), ('ubeck', 0.157), ('unos', 0.157), ('zepesv', 0.157), ('kleinberg', 0.153), ('round', 0.146), ('node', 0.126), ('ih', 0.122), ('ari', 0.12), ('bh', 0.099), ('zooming', 0.099), ('nodes', 0.098), ('packing', 0.095), ('coverings', 0.083), ('gambler', 0.083), ('depth', 0.082), ('tree', 0.081), ('maxima', 0.079), ('arm', 0.076), ('rmed', 0.074), ('payoff', 0.07), ('played', 0.066), ('doubling', 0.064), ('diam', 0.064), ('munos', 0.063), ('ht', 0.059), ('bubeck', 0.058), ('maker', 0.058), ('jh', 0.057), ('rewards', 0.056), ('smoothness', 0.055), ('auer', 0.055), ('rn', 0.053), ('suboptimality', 0.05), ('fx', 0.049), ('integers', 0.049), ('xt', 0.048), ('zr', 0.044), ('ph', 0.043), ('child', 0.042), ('gelly', 0.041), ('balls', 0.041), ('horizon', 0.041), ('cumulative', 0.04), ('reward', 0.039), ('trick', 0.039), ('hn', 0.039), ('environments', 0.039), ('mx', 0.039), ('lipschitz', 0.038), ('tn', 0.038), ('descendants', 0.038), ('path', 0.037), ('yt', 0.036), ('covering', 0.036), ('dimension', 0.036), ('metric', 0.036), ('located', 0.036), ('locally', 0.035), ('bound', 0.035), ('coquelin', 0.033), ('lemma', 0.033), ('radius', 0.032), ('proof', 0.032), ('shall', 0.032), ('suboptimal', 0.032), ('optimistic', 0.031), ('xc', 0.031), ('xl', 0.031), ('rounds', 0.029), ('bk', 0.029), ('strategy', 0.029), ('environment', 0.029), ('region', 0.028), ('stoltz', 0.028), ('uct', 0.028), ('minimax', 0.028), ('weakly', 0.027), ('martingale', 0.027), ('truncated', 0.027), ('assumptions', 0.026), ('theorem', 0.026), ('descendant', 0.025), ('pulled', 0.025), ('vicinity', 0.025), ('adaptations', 0.025), ('kocsis', 0.025), ('lipschitzness', 0.025), ('disjoint', 0.024), ('playing', 0.024), ('root', 0.024), ('oracle', 0.023), ('agrawal', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
2 0.17368273 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
3 0.11495207 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n−1/3 ), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
4 0.10756797 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f, using as few observations f(x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f. We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuum-armed bandit, Bayesian optimization
5 0.090686917 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.089308612 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.086560637 54 jmlr-2011-Learning Latent Tree Graphical Models
8 0.086432345 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
9 0.078383453 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
10 0.066933356 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
11 0.063522018 97 jmlr-2011-Union Support Recovery in Multi-task Learning
12 0.062939882 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
13 0.057183586 59 jmlr-2011-Learning with Structured Sparsity
14 0.046799172 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
15 0.044963695 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
17 0.041268013 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
18 0.039227266 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
19 0.038355343 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
20 0.035631761 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
topicId topicWeight
[(0, 0.217), (1, 0.149), (2, -0.04), (3, 0.131), (4, 0.168), (5, 0.179), (6, 0.183), (7, 0.032), (8, -0.054), (9, 0.078), (10, -0.069), (11, -0.063), (12, -0.079), (13, 0.13), (14, 0.037), (15, -0.209), (16, -0.127), (17, 0.133), (18, 0.044), (19, -0.149), (20, -0.168), (21, -0.11), (22, 0.056), (23, 0.151), (24, 0.098), (25, 0.12), (26, 0.001), (27, 0.101), (28, 0.09), (29, 0.075), (30, 0.015), (31, 0.004), (32, -0.006), (33, 0.059), (34, 0.03), (35, 0.093), (36, 0.076), (37, -0.043), (38, 0.132), (39, 0.021), (40, 0.015), (41, -0.005), (42, 0.136), (43, -0.082), (44, 0.06), (45, 0.06), (46, 0.08), (47, 0.023), (48, -0.024), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.95222592 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
2 0.5196256 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
3 0.49596456 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n−1/3 ), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
4 0.43718889 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f, using as few observations f(x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f. We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuum-armed bandit, Bayesian optimization
5 0.42887032 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction
6 0.42251235 14 jmlr-2011-Better Algorithms for Benign Bandits
7 0.39591146 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.36254606 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.33688229 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
10 0.30506179 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.28061378 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.2642163 97 jmlr-2011-Union Support Recovery in Multi-task Learning
13 0.25444481 59 jmlr-2011-Learning with Structured Sparsity
14 0.2257181 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
15 0.22479008 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
16 0.21459617 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
17 0.20775282 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
18 0.20558164 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.19243138 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
20 0.18775104 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
topicId topicWeight
[(4, 0.086), (9, 0.033), (10, 0.035), (24, 0.037), (31, 0.055), (32, 0.025), (41, 0.046), (60, 0.017), (63, 0.389), (65, 0.034), (67, 0.017), (71, 0.015), (73, 0.026), (78, 0.07), (90, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.66941947 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
2 0.33182466 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
3 0.33097789 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
4 0.32339266 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.31887999 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning
6 0.31659013 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
8 0.3153908 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.31487527 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.30973607 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
11 0.30964649 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
12 0.30728799 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
13 0.30664742 97 jmlr-2011-Union Support Recovery in Multi-task Learning
14 0.30663261 6 jmlr-2011-A Simpler Approach to Matrix Completion
15 0.30393827 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
16 0.30374181 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
17 0.3031826 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
18 0.3027367 16 jmlr-2011-Clustering Algorithms for Chains
19 0.30268973 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.30172706 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments