jmlr jmlr2013 jmlr2013-28 knowledge-graph by maker-knowledge-mining

28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Source: pdf

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modiﬁcation to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modiﬁed SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. [sent-14, score-0.458]

2 Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). [sent-19, score-0.621]

3 Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. [sent-21, score-0.796]

4 The goal of RL is to compute a policy which selects actions that maximize the expected future reward (called value). [sent-27, score-0.359]

5 The former encode the propagated reward function and the latter aim for eigenvectors of the transition matrix. [sent-109, score-0.305]

6 Given such a basis (dashed) one task is used to train a control policy with a linear RL algorithm. [sent-126, score-0.32]

7 In ﬁnite state spaces PVF are the eigenvectors to the smallest eigenvalues of the normalized graph Laplacian of an undirected graph representing the transition possibility (not probability) between states. [sent-129, score-0.296]

8 Euclidean distances in these spaces resemble diffusion distances closely. [sent-162, score-0.369]

9 To learn a control policy (dashed arrows) the agent treats the representation φ(z) ∈ IR p of each observed image z ∈ Z as the representation of the corresponding state x ∈ X. [sent-174, score-0.284]

10 We investigate the role of the metric in approximation space Fφ on a visual robot navigation experiment, both in a realistic simulation and on a robot (Sections 5. [sent-199, score-0.452]

11 We demonstrate than SFA can sometimes fail to encode the whole state space due to its dependence on the sampling policy and address this problem with a novel importance sampling modiﬁcation to the SFA algorithm. [sent-202, score-0.391]

12 We compare the performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA for least-squares policy iteration (LSPI, Lagoudakis and Parr, 2003). [sent-204, score-0.335]

13 Results suggest that (i) the modiﬁed SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (ii) approximation spaces that encode a diffusion metric facilitate LSPI performance. [sent-205, score-0.543]

14 R : X × A × X × B(IR) → [0, 1] is a distribution over rewards: R(B|x, a, y) is the probability to receive a reward within set B ∈ B(IR) after a transition from state x to state y, executing action a. [sent-230, score-0.434]

15 The goal of reinforcement learning is to ﬁnd a policy that maximizes the value V π (x) at each state x, that is the expected sum of discounted future rewards V π (x0 ) := IE ∞ ∑ γ t r(xt , at ) t=0 at ∼ π(·|xt ) xt+1 ∼ P(·|xt , at ) , ∀x0 ∈ X . [sent-233, score-0.347]

16 Note that for ﬁxed V π (·) the equation is linear in the policy π(·|·) and vice versa, allowing an expectation maximization type algorithm called policy iteration (PI, Sutton and Barto, 1998) to ﬁnd the best policy. [sent-237, score-0.438]

17 ˆ For a ﬁxed policy π, this yields the transition operator11 Pπ : L2 (X, ξ) → L2 (X, ξ), ˆ Pπ [ f ](x) := f (y) P(dy|x, a) π(da|x) , ∀x ∈ X , ∀ f ∈ L2 (X, ξ) . [sent-260, score-0.373]

18 The operator is called ergodic if every Markov chain sampled by the underlying transition kernel P and policy π is ergodic. [sent-261, score-0.6]

19 If one policy that assigns a nonzero probability to each action yields ergodic Markov chains, then every such policy does. [sent-279, score-0.558]

20 Note that for this property the samples must be drawn from steady state distribution ξ and policy π, usually by a long Markov chain executing π. [sent-287, score-0.464]

21 This problem is tackled by least-squares policy iteration (LSPI, Lagoudakis and Parr, 2003), which alternates between Q-value estimation (the expectation step) and policy improvement (the maximization step). [sent-293, score-0.438]

22 At iteration i with current policy πi , the Q-value function Qπi : X × A → IR is deﬁned as the value of state x ∈ X conditioned on the next action a ∈ A: Qπi (x, a) := r(x, a) + γ V πi (y) P(dy|x, a) = r(x, a) + γ Qπi (y, b) πi (db|y) P(dy|x, a) . [sent-294, score-0.326]

23 The greedy policy πi+1 in the i’th policy improvement step will for each state x draw one of the actions with the highest Q-value, that is, aπi (x) ∼ arg max Qπi (x, a), and stick with it: a∈A πi+1 (a|x) := 1 , if a = aπi (x) , 0 , else ∀x ∈ X , ∀a ∈ A . [sent-297, score-0.535]

24 Subspace-invariant features aim for eigenfunctions of transition operator Pπ to achieve no r , however, can still induce Bellman errors. [sent-353, score-0.334]

25 However, the transition operator must be approximated from observations and the resulting basis ΦK is not orthonormal. [sent-371, score-0.285]

26 3 Proto-value Functions In ﬁnite state spaces Z, |Z| < ∞, proto-value functions (PVF, Mahadevan and Maggioni, 2007) are motivated by diffusion maps on graphs (Coifman et al. [sent-418, score-0.298]

27 1 shows that this approach, also known as spectral encoding (Belkin and Niyogi, 2003), yields approximation spaces in which Euclidean distances are equivalent to diffusion distances on the connection graph. [sent-425, score-0.445]

28 Note, however, that these are not exactly the diffusion distances of the transition kernel, as the transition possibility rather than probability is encoded in matrix W. [sent-426, score-0.532]

29 1 demonstrates that these solutions are endowed with the diffusion metric of the symmetrized transition kernel, not unlike PVF in ﬁnite state spaces. [sent-480, score-0.442]

30 Note also that SFA encodes the actual transition probabilities, which requires more samples to converge than the transition possibilities encoded by PVF. [sent-482, score-0.331]

31 n Let {(zt , at )}t=0 denote a training sequence sampled by policy π with a steady state distribution ξ, which induces the joint distribution µ(B, A) = B π(A|z) ξ(dz), ∀B ∈ B(Z), ∀A ∈ B(A). [sent-498, score-0.424]

32 To switch to another policy τ and state distribution ζ, that is, the joint distribution η(B, A) = B τ (A|z) ζ(dz), ∀B ∈ B(Z), ∀A ∈ B(A), one can weight each transition with the Radon-Nikodym 17. [sent-499, score-0.438]

33 PVF are thus based on diffusion distances dt (x, y) of a graph representing the symmetrized transition possibilities Txy := Wxy /(∑z∈Z Wxz ) between discrete states (see Section 3. [sent-553, score-0.464]

34 These diffusion distances are equal to Euclidean distances in a space spanned by the eigenvectors φi ∈ IR|Z| and eigenvalues λi ∈ IR of connectivity matrix T (e. [sent-555, score-0.292]

35 An extension to potentially continuous observation (or state) spaces Z with ergodic transition kernels Pπ : Z × B(Z) → [0, 1] is not trivial. [sent-559, score-0.335]

36 Deﬁnition 1 The diffusion distance dt : Z × Z → IR+ based on ergodic transition kernel Pπ with steady state distribution ξ is deﬁned as dt (x, y) := µtx − µty ξ , µtx := d(Pπ )t (·|x) dξ ∈ L2 (Z, ξ) , ∀x, y ∈ Z , ∀t ∈ IN \ {0} . [sent-572, score-0.648]

37 Given a particular scaling, however, diffusion distances can equal Euclidean distances in Fφ . [sent-587, score-0.292]

38 If the analogy between value function generalization and diffusion distances is correct, one should aim for a set of basis functions that at least approximates an encoding of diffusion distances dt (·, ·), if possible for all forecast parameters t ∈ IN \ {0} at once. [sent-593, score-0.562]

39 Lemma 3 Let ξ denote the steady state distribution of ergodic transition kernel Pπ , which has ˆ ˆ a self-adjoint transition operator Pπ = (Pπ )∗ : L2 (Z, ξ) → L2 (Z, ξ). [sent-594, score-0.7]

40 Note that the full set of eigenfunctions φi encodes all diffusion distances dt (·, ·), ∀t ∈ IN \ {0}. [sent-597, score-0.284]

41 Lemma 3 shows that the above relationship between diffusion and Euclidean distances in the ˆ eigenspace of the transition operator Pπ : L2 (Z, ξ) → L2 (Z, ξ) also holds, but only if this opera23 This does not hold for most transition operators, however. [sent-598, score-0.586]

42 In this light one can interpret the symmetrized transition possibilities encoded by PVF as a selfadjoint approximation of the transition probabilities of Pπ . [sent-603, score-0.379]

43 Lemma 4 Let Pπ be an ergodic transition kernel in Z with steady state distribution ξ. [sent-605, score-0.492]

44 The kernel ˆ induced by adjoint transition operator (Pπ )∗ in L2 (Z, ξ) is ξ-almost-everywhere an ergodic transition kernel with steady state distribution ξ. [sent-606, score-0.782]

45 To obtain a self-adjoint transition kernel, Lemma 4 shows that the kernel of the adjoint operator ˆ ˆ ˆ (Pπ )∗ is a transition kernel as well. [sent-608, score-0.499]

46 The goal of encoding diffusion distances based on Pπ appears thus best served by the eigenfunctions of ˆ the symmetrized transition operator Psπ . [sent-620, score-0.538]

47 Theorem 6 SFA features {φi }∞ simultaneously encode all diffusion distances dt (·, ·), ∀t ∈ IN \ i=1 1 {0}, based on the symmetrized transition kernel Psπ = 2 Pπ + 1 (Pπ )∗ . [sent-629, score-0.597]

48 A similar proposition can be made for PVF features and diffusion distances based on transition possibilities. [sent-632, score-0.467]

49 It is, however, possible to show the encoding of diffusion distances on average, given the reward function is drawn from a white noise functional25 ρ. [sent-636, score-0.369]

50 In conclusion, reward-based features encode diffusion distances exactly up to some time horizon, whereas SFA and PVF approximate an encoding for all possible distances. [sent-650, score-0.393]

51 Constructing a real-valued, orthonormal basis of subspace-invariant features is thus only possible in some rare cases of self-adjoint transition operators. [sent-657, score-0.32]

52 Reversely, one can construct a transition kernel Pπ with a self-adjoint transition operator but without uniform ˆ transition probabilities. [sent-664, score-0.571]

53 Theorem 11 Let ξ be the steady state distribution on Z of a MDP with policy π and a selfp adjoint transition operator in L2 (Z, ξ). [sent-691, score-0.659]

54 With the same state (observation) space Z, action space A and transition kernel P. [sent-774, score-0.316]

55 This optimality is still restricted to the sampling policy π, but SFA features should have an advantage over PVF here, as they are subspace-invariant for a much larger class of MDPs. [sent-786, score-0.34]

56 Using such features effectively for transfer learning or within policy iteration requires them to perform well for other policies π ′ , in other words to induce little per-feature errors when applied to ˆ ′ Pπ . [sent-787, score-0.4]

57 Note that depending on the transition kernel Pπ , this can still yield an arbitrary steady state distribution ξ. [sent-789, score-0.414]

58 Theoretical statements about the inﬂuence of sampling policy and steady state distribution on SFA and PVF per-feature errors with other policies, however, are beyond the scope of this article and left for future works. [sent-792, score-0.496]

59 Although SFA is more sensitive to the sampling policy than PVF, the presented analysis suggests that it can provide better approximation spaces for value estimation, that is, LSTD. [sent-795, score-0.367]

60 To test the inﬂuence of the observation space metric, we designed a simple but realistic robot navigation task with continuous state and observation spaces (Section 5. [sent-812, score-0.403]

61 1 we stated our intuition that the symmetrized transition operator of SFA approximates the true transition operator better than the neighborhood operator of PVF. [sent-821, score-0.502]

62 As all states could be connected by the transition kernel, transition possibility (as in Section 3. [sent-834, score-0.337]

63 Let Pπ be the transition matrix and Ξ a diagonal matrix of steady state distribution ξ, which is the left eigenvector to the largest Ξ eigenvalue of Pπ . [sent-867, score-0.359]

64 Also, SFA features with uniform distribution ξ are roughly 3 times as subspace-invariant as original SFA features with steady state distribution ξ. [sent-892, score-0.383]

65 Furthermore, SFA features based on a uniform distribution ξ are on average more subspace-invariant than those based on the steady state distribution. [sent-895, score-0.294]

66 Note that transition probabilities of a random policy equal the neighborhood relationships. [sent-911, score-0.373]

67 SFA features for this policy should therefore equal PVF features. [sent-912, score-0.308]

68 To make sure all states are visited, an optimal policy trial is sampled in 40 trajectories with random start states and 100 samples each. [sent-931, score-0.32]

69 Performance of a (deterministic) policy learned with LSPI is measured by the mean accumulated reward of 1000 trajectories starting at random states. [sent-1017, score-0.37]

70 The different encoding of diffusion distances (Theorems 6 & 7, Page 2085) provides a good explanation for this effect: Krylov bases represent ﬁnite-horizon value functions perfectly (Corollary 10, Page 2087), 34. [sent-1036, score-0.316]

71 As convergence to the optimal policy can not be guaranteed for LSPI, the eventual policy depends also on the initial (randomly chosen) policy π0 . [sent-1040, score-0.657]

72 In comparison with theoretic SFA features, their performance resembles the solution with uniform distribution ξ (circles, upper row), which clearly outperforms SFA features based on the very similar steady state distribution ξ (crosses, upper row). [sent-1070, score-0.294]

73 While robot coordinates come close to encoding diffusion distances of random policies, these observations clearly do not. [sent-1089, score-0.384]

74 Learning and control are based on the current camera image zt and a corresponding reward signal rt ∈ {−1, 0, +1} indicating whether the robot is in the goal area (rt = +1), close to a wall (rt = −1) or none of the above (rt = 0). [sent-1118, score-0.308]

75 6 Visual Navigation-Comparison of Algorithms The left plot of Figure 9 shows a comparison of the effect of all discussed basis function construction algorithms on the control policy learned by LSPI. [sent-1206, score-0.296]

76 We attribute this advantage to approximation spaces encoding diffusion distances rather than similarities in Z. [sent-1238, score-0.377]

77 2101 ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER 10 15 13 12 21 15 9 20 16 12 26 10 6 9 10 5 10 7 7 13 15 17 12 15 12 12 8 17 11 13 15 Figure 10: The recorded robot trajectories of a control policy learned by LSPI using RSK-SFA features. [sent-1254, score-0.385]

78 It also implies an answer to our third question on Page 2090: LSPI performance is facilitated by approximation spaces that encode diffusion distances of a uniform random policy. [sent-1268, score-0.383]

79 After 10 LSPI iterations with 128 RSK-SFA basis functions, the control policy achieved a success rate of 75% in two separate tasks depicted in Figure 10. [sent-1277, score-0.327]

80 Whenever the policy changes from one action to another, the respective Q-values must be close-by and small deviations from the true value can drastically inﬂuence the greedy action selection. [sent-1283, score-0.303]

81 Using steady state distribution ξ of some sampling policy π implies that decisions at often visited states should be more reliable that those at seldom visited states. [sent-1306, score-0.485]

82 Take the optimal policy in a navigation task (like Section 5. [sent-1307, score-0.331]

83 However, Theorem 13, Page 2089, implies that SFA features are only optimal for sampling policy π. [sent-1323, score-0.34]

84 This optimality is lost when policy iteration varies π, but optimizing the basis functions w. [sent-1324, score-0.296]

85 As LSPI is based on a dictionary of transitions, however, the distribution ξ of the weighted ˆ projection operator Πφ corresponds to the sampling distribution instead of steady state distriξ bution of policy π. [sent-1332, score-0.51]

86 6 demonstrate how sensitive the LSPI performance based on SFA features is to sampling policy and sampling state-distribution. [sent-1356, score-0.372]

87 However, compatible MDPs are not very common: SFA features are subspace-invariant for all MDPs with a self-adjoint transition operator and PVF are for all MDPs with a transition kernel visiting all neighbors uniformly. [sent-1428, score-0.506]

88 It states that SFA minimizes a mean bound over all tasks in the same environment, which means an arbitrary but ﬁxed transition kernel and all possible reward functions. [sent-1434, score-0.377]

89 It is therefore an empirical question how the discussed approximation spaces will fare when least-squares policy iteration (LSPI) changes the policy. [sent-1436, score-0.335]

90 6: “LSPI performance is facilitated by approximation spaces that encode diffusion distances of a uniform random policy” because “SFA essentially performs PCA based on diffusion distances”. [sent-1451, score-0.539]

91 Lemma 3 (repeated): Let ξ denote the steady state distribution of ergodic transition kernel Pπ , ˆ ˆ which has a self-adjoint transition operator Pπ = (Pπ )∗ : L2 (Z, ξ) → L2 (Z, ξ). [sent-1460, score-0.7]

92 2 C ONSTRUCTION OF A PPROXIMATION S PACES FOR R EINFORCEMENT L EARNING Lemma 4: Let Pπ be an ergodic transition kernel in Z with steady state distribution ξ. [sent-1472, score-0.492]

93 The ˆ kernel induced by adjoint transition operator (Pπ )∗ in L2 (Z, ξ) is ξ-almost-everywhere an ergodic transition kernel with steady state distribution ξ. [sent-1473, score-0.782]

94 ξ(dy) (ergodicity) ˆ This proves that the kernel of (Pπ )∗ is a transition kernel ξ-almost-everywhere in Z, which means the kernel adheres to (ii). [sent-1486, score-0.319]

95 ˆ ˆ 1, (Pπ )∗ [ f ] ξ = 1, f ξ , ∀ f ∈ L2 (X, ξ), and therefore the transition kernel of (Pπ )∗ is ergodic with steady state distribution ξ. [sent-1488, score-0.492]

96 Lemma 5: In the limit of an inﬁnite ergodic Markov chain drawn by transition kernel Pπ in Z ˆ ˆ with steady state distribution ξ holds S( f ) = 2 f , (I − Pπ )[ f ] ξ , ∀ f ∈ L2 (Z, ξ) . [sent-1489, score-0.532]

97 Proof Let ξ denote the steady state distribution of transition kernel Pπ . [sent-1495, score-0.414]

98 C ONSTRUCTION OF A PPROXIMATION S PACES FOR R EINFORCEMENT L EARNING Theorem 11: Let ξ be the steady state distribution on Z of a MDP with policy π and a selfp adjoint transition operator in L2 (Z, ξ). [sent-1507, score-0.659]

99 ˆ Lemma 14 For an ergodic transition operator Pπ in L2 (Z, ξ) with steady state distribution ξ and ˆ ˆ − γ Pπ ) is invertible. [sent-1540, score-0.491]

100 2112 C ONSTRUCTION OF A PPROXIMATION S PACES FOR R EINFORCEMENT L EARNING ˆ Lemma 15 An ergodic transition operator Pπ in L2 (Z, ξ) with steady state distribution ξ is a nonexpansion, deﬁned as ˆ , ∀ f ∈ L2 (Z, ξ) . [sent-1544, score-0.491]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sfa', 0.551), ('pvf', 0.327), ('policy', 0.219), ('lspi', 0.196), ('lstd', 0.195), ('diffusion', 0.156), ('transition', 0.154), ('steady', 0.14), ('robot', 0.123), ('alder', 0.122), ('bermayer', 0.122), ('ohmer', 0.122), ('unew', 0.122), ('usial', 0.122), ('einforcement', 0.117), ('ir', 0.112), ('reward', 0.108), ('krylov', 0.108), ('onstruction', 0.1), ('mdps', 0.096), ('parr', 0.092), ('paces', 0.09), ('features', 0.089), ('mdp', 0.088), ('navigation', 0.088), ('pproximation', 0.083), ('hen', 0.081), ('ergodic', 0.078), ('spaces', 0.077), ('basis', 0.077), ('mahadevan', 0.076), ('distances', 0.068), ('state', 0.065), ('policies', 0.064), ('reinforcement', 0.063), ('rl', 0.056), ('kernel', 0.055), ('bases', 0.055), ('operator', 0.054), ('bellman', 0.053), ('zt', 0.052), ('bebf', 0.051), ('dy', 0.048), ('petrik', 0.047), ('page', 0.045), ('maggioni', 0.044), ('visual', 0.044), ('ie', 0.043), ('sutton', 0.043), ('trajectories', 0.043), ('feature', 0.043), ('encode', 0.043), ('action', 0.042), ('earning', 0.04), ('chain', 0.04), ('article', 0.04), ('hmer', 0.04), ('slowness', 0.04), ('approximation', 0.039), ('eigenfunctions', 0.037), ('encoding', 0.037), ('wiskott', 0.036), ('ird', 0.036), ('franzius', 0.036), ('metric', 0.035), ('barto', 0.032), ('sampling', 0.032), ('symmetrized', 0.032), ('slow', 0.032), ('iet', 0.032), ('velocities', 0.032), ('actions', 0.032), ('tasks', 0.031), ('markov', 0.03), ('pca', 0.03), ('ty', 0.029), ('states', 0.029), ('euclidean', 0.029), ('guestrin', 0.028), ('kaelbling', 0.028), ('kit', 0.028), ('proto', 0.028), ('ergodicity', 0.028), ('discount', 0.028), ('tx', 0.028), ('transfer', 0.028), ('transitions', 0.027), ('factored', 0.027), ('adjoint', 0.027), ('continuous', 0.026), ('camera', 0.025), ('tsitsiklis', 0.025), ('discrete', 0.025), ('lagoudakis', 0.024), ('task', 0.024), ('temporal', 0.023), ('legenstein', 0.023), ('luciw', 0.023), ('inf', 0.023), ('encodes', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000014 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

2 0.37924325 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

Author: Alberto N. Escalante-B., Laurenz Wiskott

Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the ﬁnal label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classiﬁcation, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufﬁcient as well as when feature selection is difﬁcult. Keywords: slow feature analysis, feature extraction, classiﬁcation, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis

3 0.2057858 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

Author: Bruno Scherrer

Abstract: We consider the discrete-time inﬁnite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualiﬁed as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms

4 0.096666895 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

Author: Antoine Salomon, Jean-Yves Audibert, Issam El Alaoui

Abstract: This paper is devoted to regret lower bounds in the classical model of stochastic multi-armed bandit. A well-known result of Lai and Robbins, which has then been extended by Burnetas and Katehakis, has established the presence of a logarithmic bound for all consistent policies. We relax the notion of consistency, and exhibit a generalisation of the bound. We also study the existence of logarithmic bounds in general and in the case of Hannan consistency. Moreover, we prove that it is impossible to design an adaptive policy that would select the best of two algorithms by taking advantage of the properties of the environment. To get these results, we study variants of popular Upper Conﬁdence Bounds (UCB) policies. Keywords: stochastic bandits, regret lower bounds, consistency, selectivity, UCB policies 1. Introduction and Notations Multi-armed bandits are a classical way to illustrate the difﬁculty of decision making in the case of a dilemma between exploration and exploitation. The denomination of these models comes from an analogy with playing a slot machine with more than one arm. Each arm has a given (and unknown) reward distribution and, for a given number of rounds, the agent has to choose one of them. As the goal is to maximize the sum of rewards, each round decision consists in a trade-off between exploitation (i.e., playing the arm that has been the more lucrative so far) and exploration (i.e., testing another arm, hoping to discover an alternative that beats the current best choice). One possible application is clinical trial: a doctor wants to heal as many patients as possible, the patients arrive sequentially, and the effectiveness of each treatment is initially unknown (Thompson, 1933). Bandit problems have initially been studied by Robbins (1952), and since then they have been applied to many ﬁelds such as economics (Lamberton et al., 2004; Bergemann and Valimaki, 2008), games (Gelly and Wang, 2006), and optimisation (Kleinberg, 2005; Coquelin and Munos, 2007; Kleinberg et al., 2008; Bubeck et al., 2009). ∗. Also at Willow, CNRS/ENS/INRIA - UMR 8548. c 2013 Antoine Salomon, Jean-Yves Audibert and Issam El Alaoui. S ALOMON , AUDIBERT AND E L A LAOUI 1.1 Setting In this paper, we consider the following model. A stochastic multi-armed bandit problem is deﬁned by: • a number of rounds n, • a number of arms K ≥ 2, • an environment θ = (ν1 , · · · , νK ), where each νk (k ∈ {1, · · · , K}) is a real-valued measure that represents the distribution reward of arm k. The number of rounds n may or may not be known by the agent, but this will not affect the present study. We assume that rewards are bounded. Thus, for simplicity, each νk is a probability on [0, 1]. Environment θ is initially unknown by the agent but lies in some known set Θ. For the problem to be interesting, the agent should not have great knowledges of its environment, so that Θ should not be too small and/or only contain too trivial distributions such as Dirac measures. To make it simple, we may assume that Θ contains all environments where each reward distribution is a Dirac distribution or a Bernoulli distribution. This will be acknowledged as Θ having the Dirac/Bernoulli property. For technical reason, we may also assume that Θ is of the form Θ1 × . . . × ΘK , meaning that Θk is the set of possible reward distributions of arm k. This will be acknowledged as Θ having the product property. The game is as follows. At each round (or time step) t = 1, · · · , n, the agent has to choose an arm It in the set {1, · · · , K}. This decision is based on past actions and observations, and the agent may also randomize his choice. Once the decision is made, the agent gets and observes a reward that is drawn from νIt independently from the past. Thus a policy (or strategy) can be described by a sequence (σt )t≥1 (or (σt )1≤t≤n if the number of rounds n is known) such that each σt is a mapping from the set {1, . . . , K}t−1 × [0, 1]t−1 of past decisions and rewards into the set of arm {1, . . . , K} (or into the set of probabilities on {1, . . . , K}, in case the agent randomizes his choices). For each arm k and all time step t, let Tk (t) = ∑ts=1 ½Is =k denote the sampling time, that is, the number of times arm k was pulled from round 1 to round t, and Xk,1 , Xk,2 , . . . , Xk,Tk (t) the corresponding sequence of rewards. We denote by Pθ the distribution on the probability space such that for any k ∈ {1, . . . , K}, the random variables Xk,1 , Xk,2 , . . . , Xk,n are i.i.d. realizations of νk , and such that these K sequences of random variables are independent. Let Eθ denote the associated expectation. Let µk = xdνk (x) be the mean reward of arm k. Introduce µ∗ = maxk∈{1,...,K} µk and ﬁx an arm ∗ ∈ argmax ∗ k k∈{1,...,K} µk , that is, k has the best expected reward. The agent aims at minimizing its regret, deﬁned as the difference between the cumulative reward he would have obtained by always drawing the best arm and the cumulative reward he actually received. Its regret is thus n n Rn = ∑ Xk∗ ,t − ∑ XIt ,TIt (t) . t=1 t=1 As most of the publications on this topic, we focus on the expected regret, for which one can check that: K E θ Rn = ∑ ∆k Eθ [Tk (n)], k=1 188 (1) L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS where ∆k is the optimality gap of arm k, deﬁned by ∆k = µ∗ − µk . We also deﬁne ∆ as the gap between the best arm and the second best arm, that is, ∆ := mink:∆k >0 ∆k . Other notions of regret exist in the literature. One of them is the quantity n max ∑ Xk,t − XIt ,TIt (t) , k t=1 which is mostly used in adversarial settings. Results and ideas we want to convey here are more suited to expected regret, and considering other deﬁnitions of regret would only bring some more technical intricacies. 1.2 Consistency and Regret Lower Bounds Former works have shown the existence of lower bounds on the expected regret of a large class of policies: intuitively, to perform well the agent has to explore all arms, and this requires a signiﬁcant amount of suboptimal choices. In this way, Lai and Robbins (1985) proved a lower bound of order log n in a particular parametric framework, and they also exhibited optimal policies. This work has then been extended by Burnetas and Katehakis (1996). Both papers deal with consistent policies, meaning that they only consider policies such that: ∀a > 0, ∀θ ∈ Θ, Eθ [Rn ] = o(na ). (2) Let us detail the bound of Burnetas and Katehakis, which is valid when Θ has the product property. Given an environment θ = (ν1 , · · · , νK ) and an arm k ∈ {1, . . . , K}, deﬁne: Dk (θ) := inf ˜ νk ∈Θk :E[˜ k ]>µ∗ ν ˜ KL(νk , νk ), where KL(ν, µ) denotes the Kullback-Leibler divergence of measures ν and µ. Now ﬁx a consistent policy and an environment θ ∈ Θ. If k is a suboptimal arm (i.e., µk = µ∗ ) such that 0 < Dk (θ) < ∞, then (1 − ε) log n ∀ε > 0, lim P Tk (n) ≥ = 1. n→+∞ Dk (θ) This readily implies that: lim inf n→+∞ Eθ [Tk (n)] 1 ≥ . log n Dk (θ) Thanks to Formula (1), it is then easy to deduce a lower bound of the expected regret. One contribution of this paper is to generalize the study of regret lower bounds, by considering weaker notions of consistency: α-consistency and Hannan consistency. We will deﬁne αconsistency (α ∈ [0, 1)) as a variant of Equation (2), where equality Eθ [Rn ] = o(na ) only holds for all a > α. We show that the logarithmic bound of Burnetas and Katehakis still holds, but coefﬁcient 1−α 1 Dk (θ) is turned into Dk (θ) . We also prove that the dependence of this new bound with respect to the term 1 − α is asymptotically optimal when n → +∞ (up to a constant). We will also consider the case of Hannan consistency. Indeed, any policy achieves at most an expected regret of order n: because of the equality ∑K Tk (n) = n and thanks to Equation (1), one k=1 can show that Eθ Rn ≤ n maxk ∆k . More intuitively, this comes from the fact that the average cost of pulling an arm k is a constant ∆k . As a consequence, it is natural to wonder what happens when 189 S ALOMON , AUDIBERT AND E L A LAOUI dealing with policies whose expected regret is only required to be o(n), which is equivalent to Hannan consistency. This condition is less restrictive than any of the previous notions of consistency. In this larger class of policy, we show that the lower bounds on the expected regret are no longer logarithmic, but can be much smaller. Finally, even if no logarithmic lower bound holds on the whole set Θ, we show that there necessarily exist some environments θ for which the expected regret is at least logarithmic. The latter result is actually valid without any assumptions on the considered policies, and only requires a simple property on Θ. 1.3 Selectivity As we exhibit new lower bounds, we want to know if it is possible to provide optimal policies that achieve these lower bounds, as it is the case in the classical class of consistent policies. We answer negatively to this question, and for this we solve the more general problem of selectivity. Given a set of policies, we deﬁne selectivity as the ability to perform at least as good as the policy that is best suited to the current environment θ. Still, such an ability can not be implemented. As a by-product it is not possible to design a procedure that would speciﬁcally adapt to some kinds of environments, for example by selecting a particular policy. This contribution is linked with selectivity in on-line learning problem with perfect information, commonly addressed by prediction with expert advice (see, e.g., Cesa-Bianchi et al., 1997). In this spirit, a closely related problem is the one of regret against the best strategy from a pool studied by Auer et al. (2003). The latter designed an algorithm in the context of adversarial/nonstochastic bandit whose decisions are based on a given number of recommendations (experts), which are themselves possibly the rewards received by a set of given policies. To a larger extent, model selection has been intensively studied in statistics, and is commonly solved by penalization methods (Mallows, 1973; Akaike, 1973; Schwarz, 1978). 1.4 UCB Policies Some of our results are obtained using particular Upper Conﬁdence Bound algorithms. These algorithms were introduced by Lai and Robbins (1985): they basically consist in computing an index for each arm, and selecting the arm with the greatest index. A simple and efﬁcient way to design such policies is as follows: choose each index as low as possible such that, conditional to past observations, it is an upper bound of the mean reward of the considered arm with high probability (or, say, with high conﬁdence level). This idea can be traced back to Agrawal (1995), and has been popularized by Auer et al. (2002), who notably described a policy called UCB1. In this policy, each index Bk,s,t is deﬁned by an arm k, a time step t, an integer s that indicates the number of times arm k has been pulled before round t, and is given by: ˆ Bk,s,t = Xk,s + 2 logt , s ˆ ˆ where Xk,s is the empirical mean of arm k after s pulls, that is, Xk,s = 1 ∑s Xk,u . s u=1 To summarize, UCB1 policy ﬁrst pulls each arm once and then, at each round t > K, selects an arm k that maximizes Bk,Tk (t−1),t . Note that, by means of Hoeffding’s inequality, the index Bk,Tk (t−1),t is indeed an upper bound of µk with high probability (i.e., the probability is greater than 1 − 1/t 4 ). ˆ Another way to understand this index is to interpret the empiric mean Xk,Tk (t−1) as an ”exploitation” 190 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS term, and the square root 2 logt/s as an ”exploration” term (because the latter gradually increases when arm k is not selected). Policy UCB1 achieves the logarithmic bound (up to a multiplicative constant), as it was shown that: ∀θ ∈ Θ, ∀n ≥ 3, Eθ [Tk (n)] ≤ 12 K log n log n log n ≤ 12K . and Eθ Rn ≤ 12 ∑ 2 ∆ ∆k k=1 ∆k Audibert et al. (2009) studied some variants of UCB1 policy. Among them, one consists in changing the 2 logt in the exploration term into ρ logt, where ρ > 0. This can be interpreted as a way to tune exploration: the smaller ρ is, the better the policy will perform in simple environments where information is disclosed easily (for example when all reward distributions are Dirac measures). On the contrary, ρ has to be greater to face more challenging environments (typically when reward distributions are Bernoulli laws with close parameters). This policy, that we denote UCB(ρ), was proven by Audibert et al. to achieve the logarithmic bound when ρ > 1, and the optimality was also obtained when ρ > 1 for a variant of UCB(ρ). 2 Bubeck (2010) showed in his PhD dissertation that their ideas actually enable to prove optimality 1 of UCB(ρ) for ρ > 1 . Moreover, the case ρ = 2 corresponds to a conﬁdence level of 1 (because 2 t of Hoeffding’s inequality, as above), and several studies (Lai and Robbins, 1985; Agrawal, 1995; Burnetas and Katehakis, 1996; Audibert et al., 2009; Honda and Takemura, 2010) have shown that this level is critical. We complete these works by a precise study of UCB(ρ) when ρ ≤ 1 . We prove that UCB(ρ) 2 is (1 − 2ρ)-consistent and that it is not α-consistent for any α < 1 − 2ρ (in view of the deﬁnition above, this means that the expected regret is roughly of order n1−2ρ ). Thus it does not achieve the logarithmic bound, but it performs well in simple environments, for example, environments where all reward distributions are Dirac measures. Moreover, we exhibit expected regret bounds of general UCB policies, with the 2 logt in the exploration term of UCB1 replaced by an arbitrary function. We give sufﬁcient conditions for such policies to be Hannan consistent and, as mentioned before, ﬁnd that lower bounds need not be logarithmic any more. 1.5 Outline The paper is organized as follows: in Section 2, we give bounds on the expected regret of general 1 UCB policies and of UCB (ρ) (ρ ≤ 2 ), as preliminary results. In Section 3, we focus on α-consistent policies. Then, in Section 4, we study the problem of selectivity, and we conclude in Section 5 by general results on the existence of logarithmic lower bounds. Throughout the paper ⌈x⌉ denotes the smallest integer not less than x whereas ⌊x⌋ denotes the largest integer not greater than x, ½A stands for the indicator function of event A, Ber(p) is the Bernoulli law with parameter p, and δx is the Dirac measure centred on x. 2. Preliminary Results In this section, we estimate the expected regret of the paper. UCB 191 policies. This will be useful for the rest of S ALOMON , AUDIBERT AND E L A LAOUI 2.1 Bounds on the Expected Regret of General UCB Policies We ﬁrst study general UCB policies, deﬁned by: • Draw each arm once, • then, at each round t, draw an arm It ∈ argmax Bk,Tk (t−1),t , k∈{1,...,K} ˆ where Bk,s,t is deﬁned by Bk,s,t = Xk,s + creasing. fk (t) s and where functions fk (1 ≤ k ≤ K) are in- This deﬁnition is inspired by popular UCB1 algorithm, for which fk (t) = 2 logt for all k. The following lemma estimates the performances of UCB policies in simple environments, for which reward distributions are Dirac measures. Lemma 1 Let 0 ≤ b < a ≤ 1 and n ≥ 1. For θ = (δa , δb ), the random variable T2 (n) is uniformly 1 upper bounded by ∆2 f2 (n) + 1. Consequently, the expected regret of UCB is upper bounded by 1 ∆ f 2 (n) + 1. Proof In environment θ, best arm is arm 1 and ∆ = ∆2 = a − b. Let us ﬁrst prove the upper bound of the sampling time. The assertion is true for n = 1 and n = 2: the ﬁrst two rounds consists in 1 drawing each arm once, so that T2 (n) ≤ 1 ≤ ∆2 f2 (n) + 1 for n ∈ {1, 2}. If, by contradiction, the as1 1 sertion is false, then there exists t ≥ 3 such that T2 (t) > ∆2 f2 (t) + 1 and T2 (t − 1) ≤ ∆2 f2 (t − 1) + 1. Since f2 (t) ≥ f2 (t − 1), this leads to T2 (t) > T2 (t − 1), meaning that arm 2 is drawn at round t. Therefore, we have a + f1 (t) T1 (t−1) ≤ b+ f2 (t) T2 (t−1) , hence a − b = ∆ ≤ f2 (t) T2 (t−1) , which implies 1 1 T2 (t − 1) ≤ ∆2 f2 (t) and thus T2 (t) ≤ ∆2 f2 (t) + 1. This contradicts the deﬁnition of t, and this ends the proof of the ﬁrst statement. The second statement is a direct consequence of Formula (1). Remark: throughout the paper, we will often use environments with K = 2 arms to provide bounds on expected regrets. However, we do not lose generality by doing so, because all corresponding proofs can be written almost identically to suit to any K ≥ 2, by simply assuming that the distribution of each arm k ≥ 3 is δ0 . We now give an upper bound of the expected sampling time of any arm such that ∆k > 0. This bound is valid in any environment, and not only those of the form (δa , δb ). Lemma 2 For any θ ∈ Θ and any β ∈ (0, 1), if ∆k > 0 the following upper bound holds: n Eθ [Tk (n)] ≤ u + where u = 4 fk (n) ∆2 k ∑ t=u+1 1+ logt 1 log( β ) . 192 e−2β fk (t) + e−2β fk∗ (t) , L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS An upper bound of the expected regret can be deduced from this lemma thanks to Formula 1. Proof The core of the proof is a peeling argument and the use of Hoeffding’s maximal inequality (see, e.g., Cesa-Bianchi and Lugosi, 2006, section A.1.3 for details). The idea is originally taken from Audibert et al. (2009), and the following is an adaptation of the proof of an upper bound of UCB (ρ) in the case ρ > 1 which can be found in S. Bubeck’s PhD dissertation. 2 First, let us notice that the policy selects an arm k such that ∆k > 0 at time step t ≤ n only if at least one of the three following equations holds: Bk∗ ,Tk∗ (t−1),t ≤ µ∗ , (3) fk (t) , Tk (t − 1) (4) ˆ Xk,t ≥ µk + Tk (t − 1) < 4 fk (n) . ∆2 k (5) Indeed, if none of the equations is true, then: fk (n) ˆ > Xk,t + Tk (t − 1) Bk∗ ,Tk∗ (t−1),t > µ∗ = µk + ∆k ≥ µk + 2 fk (t) = Bk,Tk (t−1),t , Tk (t − 1) which implies that arm k can not be chosen at time step t. We denote respectively by ξ1,t , ξ2,t and ξ3,t the events corresponding to Equations (3), (4) and (5). We have: n ∑ ½I =k Eθ [Tk (n)] = Eθ t n n ∑ ½{I =k}∩ξ = Eθ t t=1 + Eθ 3,t ∑ ½{I =k}\ξ t 3,t . t=1 t=1 n Let us show that the sum ∑t=1 ½{It =k}∩ξ3,t is almost surely lower than u := ⌈4 fk (n)/∆2 ⌉. We assume k m−1 n by contradiction that ∑t=1 ½{It =k}∩ξ3,t > u. Then there exists m < n such that ∑t=1 ½{It =k}∩ξ3,t < m 4 fk (n)/∆2 and ∑t=1 ½{It =k}∩ξ3,t = ⌈4 fk (n)/∆2 ⌉. Therefore, for any s > m, we have: k k m m t=1 t=1 Tk (s − 1) ≥ Tk (m) = ∑ ½{It =k} ≥ ∑ ½{It =k}∩ξ3,t = 4 fk (n) 4 fk (n) ≥ , 2 ∆k ∆2 k so that ½{Is =k}∩ξ3,s = 0. But then m n ∑ ½{I =k}∩ξ t t=1 3,t = ∑ ½{It =k}∩ξ3,t = t=1 4 fk (n) ≤ u, ∆2 k which is the contradiction expected. n n We also have ∑t=1 ½{It =k}\ξ3,t = ∑t=u+1 ½{It =k}\ξ3,t : since Tk (t − 1) ≤ t − 1, event ξ3,t always happens at time step t ∈ {1, . . . , u}. And then, since event {It = k} is included in ξ1,t ∪ ξ2,t ∪ ξ3,t : n Eθ ∑ ½{It =k}\ξ3,t ≤ Eθ t=u+1 n n t=u+1 t=u+1 ∑ ½ξ1,t ∪ξ2,t ≤ ∑ Pθ (ξ1,t ) + Pθ (ξ2,t ). 193 S ALOMON , AUDIBERT AND E L A LAOUI It remains to ﬁnd upper bounds of Pθ (ξ1,t ) and Pθ (ξ2,t ). To this aim, we apply the peeling argument with a geometric grid over the time interval [1,t]: fk∗ (t) ≤ µ∗ Tk∗ (t − 1) ˆ Pθ (ξ1,t ) = Pθ Bk∗ ,Tk∗ (t−1),t ≤ µ∗ = Pθ Xk∗ ,Tk∗ (t−1) + ˆ ≤ Pθ ∃s ∈ {1, · · · ,t}, Xk∗ ,s + fk∗ (t) ≤ µ∗ s logt log(1/β) ≤ ∑ j=0 ˆ Pθ ∃s : {β j+1t < s ≤ β j t}, Xk∗ ,s + logt log(1/β) ≤ ∑ j=0 s Pθ ∃s : {β j+1t < s ≤ β j t}, logt log(1/β) ≤ ∑ j=0 ∑ j=0 ∑ (Xk ,l − µ∗ ) ≤ − ∗ s fk∗ (t) β j+1t fk∗ (t) l=1 ∑ (µ∗ − Xk ,l ) ≥ t < s ≤ β j t}, logt log(1/β) = fk∗ (t) ≤ µ∗ s j ∗ β j+1t fk∗ (t) l=1 s Pθ max ∑ (µ∗ − Xk∗ ,l ) ≥ s≤β j t l=1 β j+1t fk∗ (t) . As the range of the random variables (Xk∗ ,l )1≤l≤s is [0, 1], Hoeffding’s maximal inequality gives:   2 logt log(1/β) β j+1t fk∗ (t)  2 logt  Pθ (ξ1,t ) ≤ + 1 e−2β fk∗ (t) . ≤ ∑ exp − jt β log(1/β) j=0 Similarly, we have: logt + 1 e−2β fk (t) , log(1/β) and the result follows from the combination of previous inequalities. Pθ (ξ2,t ) ≤ 2.2 Bounds on the Expected Regret of UCB(ρ), ρ ≤ We study the performances of UCB (ρ) 1 2 1 policy, with ρ ∈ (0, 2 ]. We recall that ρ logt s . UCB (ρ) is the UCB ˆ policy deﬁned by fk (t) = ρ log(t) for all k, that is, Bk,s,t = Xk,s + Small values of ρ can be interpreted as a low level of experimentation in the balance between exploration and exploitation. 1 Precise regret bound orders of UCB(ρ) when ρ ∈ (0, 2 ] are not documented in the literature. We ﬁrst give an upper bound of expected regret in simple environments, where it is supposed to perform well. As stated in the following proposition (which is a direct consequence of Lemma 1), the order of the bound is ρ log n . ∆ 194 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS Lemma 3 Let 0 ≤ b < a ≤ 1 and n ≥ 1. For θ = (δa , δb ), the random variable T2 (n) is uniformly ρ upper bounded by ∆2 log(n) + 1. Consequently, the expected regret of UCB(ρ) is upper bounded by ρ ∆ log(n) + 1. One can show that the expected regret of UCB(ρ) is actually equivalent to ρ log n as n goes to ∆ inﬁnity. These good performances are compensated by poor results in more complex environments, as showed in the following theorem. We exhibit an expected regret upper bound which is valid for any θ ∈ Θ, and which is roughly of order n1−2ρ . We also show that this upper bound is asymptot1 ically optimal. Thus, with ρ ∈ (0, 2 ), UCB(ρ) does not perform enough exploration to achieve the logarithmic bound, as opposed to UCB(ρ) with ρ ∈ ( 1 , +∞). 2 1 Theorem 4 For any ρ ∈ (0, 2 ], any θ ∈ Θ and any β ∈ (0, 1), one has Eθ [Rn ] ≤ 4ρ log n ∑ ∆k + ∆k + 2∆k k:∆k >0 log n n1−2ρβ +1 . log(1/β) 1 − 2ρβ Moreover, if Θ has the Dirac/Bernoulli property, then for any ε > 0 there exists θ ∈ Θ such that Eθ [Rn ] lim n→+∞ n1−2ρ−ε = +∞. 1 1 The value ρ = 2 is critical, but we can deduce from the upper bound of this theorem that UCB( 2 ) is consistent in the classical sense of Lai and Robbins (1985) and of Burnetas and Katehakis (1996). log Proof We set u = 4ρ∆2 n . By Lemma 2 we get: k n Eθ [Tk (n)] ≤ u + 2 = u+2 ∑ logt + 1 e−2βρ log(t) log(1/β) ∑ logt 1 + 1 2ρβ log(1/β) t t=u+1 n t=u+1 n 1 ≤ u+2 log n +1 log(1/β) ≤ u+2 log n +1 log(1/β) 1+ ∑ ≤ u+2 log n +1 log(1/β) 1+ ≤ u+2 log n +1 . log(1/β) 1 − 2ρβ ∑ t 2ρβ t=1 n 1 t 2ρβ t=2 n−1 1 1−2ρβ n 1 t 2ρβ dt As usual, the upper bound of the expected regret follows from Formula (1). Now, let us show the lower bound. The result is obtained by considering an environment θ of the √ 1 form Ber( 1 ), δ 1 −∆ , where ∆ lies in (0, 2 ) and is such that 2ρ(1 + ∆)2 < 2ρ + ε. This notation is 2 2 obviously consistent with the deﬁnition of ∆ as an optimality gap. We set Tn := ⌈ ρ log n ⌉, and deﬁne ∆ the event ξn by: 1 1 ˆ ξn = X1,Tn < − (1 + √ )∆ . 2 ∆ 195 S ALOMON , AUDIBERT AND E L A LAOUI When event ξn occurs, one has for any t ∈ {Tn , . . . , n} ˆ X1,Tn + ρ logt Tn ˆ ≤ X1,Tn + ≤ √ ρ log n 1 1 < − (1 + √ )∆ + ∆ Tn 2 ∆ 1 − ∆, 2 so that arm 1 is chosen no more than Tn times by UCB(ρ) policy. Consequently: Eθ [T2 (n)] ≥ Pθ (ξn )(n − Tn ). We will now ﬁnd a lower bound of the probability of ξn thanks to Berry-Esseen inequality. We denote by C the corresponding constant, and by Φ the c.d.f. of the standard normal distribution. For convenience, we also deﬁne the following quantities: σ := E X1,1 − Using the fact that Φ(−x) = e− √ 2 β(x) 2πx 1 2 2 1 = , M3 := E 2 X1,1 − 1 2 3 1 = . 8 x2 with β(x) − − → 1, we have: −− x→+∞ ˆ √ X1,Tn − 1 √ 1 2 Tn ≤ −2 1 + √ ∆ Tn σ ∆ √ √ CM3 Φ −2(∆ + ∆) Tn − 3 √ σ Tn √ 2 exp −2(∆ + ∆) Tn √ √ CM3 √ √ √ β 2(∆ + ∆) Tn − 3 √ σ Tn 2 2π(∆ + ∆) Tn √ 2 ρ log n exp −2(∆ + ∆) ( ∆ + 1) √ √ CM3 √ √ √ β 2(∆ + ∆) Tn − 3 √ σ Tn 2 2π(∆ + ∆) Tn √ √ −2ρ(1+ ∆)2 exp −2(∆ + ∆)2 √ √ CM3 n √ √ √ β 2(∆ + ∆) Tn − 3 √ . Tn σ Tn 2 2π(∆ + ∆) Pθ (ξn ) = Pθ ≥ ≥ ≥ ≥ Previous calculations and Formula (1) give Eθ [Rn ] = ∆Eθ [T2 (n)] ≥ ∆Pθ (ξn )(n − Tn ), √ 1−2ρ(1+ ∆)2 so that we ﬁnally obtain a lower bound of Eθ [Rn ] of order n √log n . Therefore, nEθ [Rn ] is at least 1−2ρ−ε √ 2 √ 2 n2ρ+ε−2ρ(1+ ∆) √ of order . Since 2ρ + ε − 2ρ(1 + ∆) > 0, the numerator goes to inﬁnity, faster than log n √ log n. This concludes the proof. 196 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS 3. Bounds on the Class α-consistent Policies In this section, our aim is to ﬁnd how the classical results of Lai and Robbins (1985) and of Burnetas and Katehakis (1996) can be generalised if we do not restrict the study to consistent policies. As a by-product, we will adapt their results to the present setting, which is simpler than their parametric frameworks. We recall that a policy is consistent if its expected regret is o(na ) for all a > 0 in all environments θ ∈ Θ. A natural way to relax this deﬁnition is the following. Deﬁnition 5 A policy is α-consistent if ∀a > α, ∀θ ∈ Θ, Eθ [Rn ] = o(na ). For example, we showed in the previous section that, for any ρ ∈ (0, 1 ], UCB(ρ) is (1−2ρ)-consistent 2 and not α-consistent if α < 1 − 2ρ. Note that the relevant range of α in this deﬁnition is [0, 1): the case α = 0 corresponds to the standard deﬁnition of consistency (so that throughout the paper the term ”consistent” also means ”0-consistent”), and any value α ≥ 1 is pointless as any policy is then α-consistent. Indeed, the expected regret of any policy is at most of order n. This also lead us to wonder what happens if we only require the expected regret to be o(n): ∀θ ∈ Θ, Eθ [Rn ] = o(n). This requirement corresponds to the deﬁnition of Hannan consistency. The class of Hannan consistent policies includes consistent policies and α-consistent policies for any α ∈ [0, 1). Some results about this class will be obtained in Section 5. We focus on regret lower bounds on α-consistent policies. We ﬁrst show that the main result of Burnetas and Katehakis can be extended in the following way. Theorem 6 Assume that Θ has the product property. Fix an α-consistent policy and θ ∈ Θ. If ∆k > 0 and if 0 < Dk (θ) < ∞, then ∀ε > 0, lim Pθ Tk (n) ≥ (1 − ε) n→+∞ (1 − α) log n = 1. Dk (θ) Consequently lim inf n→+∞ 1−α Eθ [Tk (n)] ≥ . log n Dk (θ) Remind that the lower bound of the expected regret is then deduced from Formula (1), and that coefﬁcient Dk (θ) is deﬁned by: Dk (θ) := inf ˜ νk ∈Θk :E[˜ k ]>µ∗ ν ˜ KL(νk , νk ), where KL(ν, µ) denotes the Kullback-Leibler divergence of measures ν and µ. Note that, as opposed to Burnetas and Katehakis (1996), there is no optimal policy in general (i.e., a policy that would achieve the lower bound in all environment θ). This can be explained intuitively as follows. If by contradiction there existed such a policy, its expected regret would be of order log n and consequently it would be (0-)consistent. Then the lower bounds in the case of 197 S ALOMON , AUDIBERT AND E L A LAOUI 1−α 0-consistency would necessarily hold. This can not happen if α > 0 because Dk (θ) < Dk1 . (θ) Nevertheless, this argument is not rigorous because the fact that the regret would be of order log n is only valid for environments θ such that 0 < Dk (θ) < ∞. The non-existence of optimal policies is implied by a stronger result of the next section (yet, only if α > 0.2). Proof We adapt Proposition 1 in Burnetas and Katehakis (1996) and its proof. Let us denote θ = (ν1 , . . . , νK ). We ﬁx ε > 0, and we want to show that: lim Pθ n→+∞ Set δ > 0 and δ′ > α such that ˜ that E[νk ] > µ∗ and 1−δ′ 1+δ Tk (n) (1 − ε)(1 − α) < log n Dk (θ) = 0. ˜ > (1 − ε)(1 − α). By deﬁnition of Dk (θ), there exists νk such ˜ Dk (θ) < KL(νk , νk ) < (1 + δ)Dk (θ). ˜ ˜ ˜ Let us set θ = (ν1 , . . . , νk−1 , νk , νk+1 , . . . , νK ). Environment θ lies in Θ by the product property and δ = KL(ν , ν ) and arm k is its best arm. Deﬁne I k ˜k ′ Aδ := n Tk (n) 1 − δ′ < δ log n I ′′ δ , Cn := log LTk (n) ≤ 1 − δ′′ log n , where δ′′ is such that α < δ′′ < δ′ and Lt is deﬁned by log Lt = ∑ts=1 log δ′ δ′ δ′′ δ′ dνk ˜ d νk (Xk,s ) . δ′′ Now, we show that Pθ (An ) = Pθ (An ∩Cn ) + Pθ (An \Cn ) − − → 0. −− n→+∞ On the one hand, one has: ′′ ′′ ′ ′′ ′ δ δ Pθ (Aδ ∩Cn ) ≤ n1−δ Pθ (Aδ ∩Cn ) ˜ n n ′′ ′ (6) ′′ ≤ n1−δ Pθ (Aδ ) = n1−δ Pθ n − Tk (n) > n − ˜ ˜ n 1 − δ′ Iδ log n ′′ ≤ n1−δ Eθ [n − Tk (n)] ˜ (7) ′ n − 1−δ log n Iδ ′′ = n−δ Eθ ∑K Tℓ (n) − Tk (n) ˜ l=1 ′ n − 1−δ Iδ log n n ′′ ≤ ∑ℓ=k n−δ Eθ [Tℓ (n)] ˜ ′ 1 − 1−δ Iδ log n n − − → 0. −− (8) n→+∞ ′ Equation (6) results from a partition of Aδ into events {Tk (n) = a}, 0 ≤ a < n ′′ 1−δ′ Iδ log n . Each event ′′ δ {Tk (n) = a} ∩ Cn equals {Tk (n) = a} ∩ ∏a dνk (Xk,s ) ≤ n1−δ and is measurable with respect s=1 d νk ˜ to Xk,1 , . . . , Xk,a and to Xℓ,1 , . . . , Xℓ,n (ℓ = k). Thus, ½{Tk (n)=a}∩Cn ′′ can be written as a function f of δ the latter r.v. and we have: ′′ δ Pθ {Tk (n) = a} ∩Cn = f (xk,s )1≤s≤a , (xℓ,s )ℓ=k,1≤s≤n ∏ ℓ=k 1≤s≤n ≤ f (xk,s )1≤s≤a , (xℓ,s )ℓ=k,1≤s≤n ∏ ℓ=k 1≤s≤n ′′ ′′ δ = n1−δ Pθ {Tk (n) = a} ∩Cn ˜ 198 . dνℓ (xℓ,s ) ∏ dνk (xk,s ) 1≤s≤a ′′ dνℓ (xℓ,s )n1−δ ∏ 1≤s≤a ˜ d νk (xk,s ) L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS Equation (7) is a consequence of Markov’s inequality, and the limit in (8) is a consequence of α-consistency. ′ On the other hand, we set bn := 1−δ log n, so that Iδ ′ ′′ δ Pθ (Aδ \Cn ) ≤ P n ≤ P max log L j > (1 − δ′′ ) log n j≤⌊bn ⌋ 1 1 − δ′′ max log L j > I δ bn j≤⌊bn ⌋ 1 − δ′ . This term tends to zero, as a consequence of the law of large numbers. ′ Now that Pθ (Aδ ) tends to zero, the conclusion results from n 1 − δ′ 1 − δ′ (1 − ε)(1 − α) > ≥ . δ (1 + δ)Dk (θ) Dk (θ) I The previous lower bound is asymptotically optimal with respect to its dependence in α, as claimed in the following proposition. Proposition 7 Assume that Θ has the Dirac/Bernoulli property. There exist θ ∈ Θ and a constant c > 0 such that, for any α ∈ [0, 1), there exists an α-consistent policy such that: lim inf n→+∞ Eθ [Tk (n)] ≤ c, (1 − α) log n for any k satisfying ∆k > 0. Proof In any environment of the form θ = (δa , δb ) with a = b, Lemma 3 implies the following estimate for UCB(ρ): Eθ Tk (n) ρ lim inf ≤ 2, n→+∞ log n ∆ where k = k∗ . Because 1−α ∈ (0, 1 ) and since UCB(ρ) is (1 − 2ρ)-consistent for any ρ ∈ (0, 1 ] (Theorem 4), we 2 2 2 1 obtain the result by choosing the α-consistent policy UCB( 1−α ) and by setting c = 2∆2 . 2 4. Selectivity In this section, we address the problem of selectivity. By selectivity, we mean the ability to adapt to the environment as and when rewards are observed. More precisely, a set of two (or more) policies is given. The one that performs the best depends on environment θ. We wonder if there exists an adaptive procedure that, given any environment θ, would be as good as any policy in the given set. Two major reasons motivate this study. On the one hand this question was answered by Burnetas and Katehakis within the class of consistent policies. They exhibits an asymptotically optimal policy, that is, that achieves the regret 199 S ALOMON , AUDIBERT AND E L A LAOUI lower bounds they have proven. The fact that a policy performs as best as any other one obviously solves the problem of selectivity. On the other hand, this problem has already been studied in the context of adversarial bandit by Auer et al. (2003). Their setting differs from our not only because their bandits are nonstochastic, but also because their adaptive procedure takes only into account a given number of recommendations, whereas in our setting the adaptation is supposed to come from observing rewards of the chosen arms (only one per time step). Nevertheless, one can wonder if an ”exponentially weighted forecasters” procedure like E XP 4 could be transposed to our context. The answer is negative, as stated in the following theorem. To avoid confusion, we make the notations of the regret and of sampling time more precise by adding the considered policy: under policy A , Rn and Tk (n) will be respectively denoted Rn (A ) and Tk (n, A ). ˜ Theorem 8 Let A be a consistent policy and let ρ be a real number in (0, 0.4). If Θ has the ˜ Dirac/Bernoulli property and the product property, there is no policy which can both beat A and UCB (ρ), that is: ∀A , ∃θ ∈ Θ, lim sup n→+∞ Eθ [Rn (A )] > 1. ˜ min(Eθ [Rn (A )], Eθ [Rn (UCB(ρ))]) Thus the existence of optimal policies does not hold when we extend the notion of consistency. Precisely, as UCB(ρ) is (1 − 2ρ)-consistent, we have shown that there is no optimal policy within the class of α-consistent policies, with α > 0.2. Consequently, there do not exist optimal policies in the class of Hannan consistent policies either. Moreover, Theorem 8 shows that methods that would be inspired by related literature in adversarial bandit can not apply to our framework. As we said, this impossibility may come from the fact that we can not observe at each time step the decisions and rewards of more than one algorithm. If we were able to observe a given set of policies from step to step, then it would be easy to beat them all: it would be sufﬁcient to aggregate all the observations and simply pull the arm with the greater empiric mean. The case where we only observe decisions (and not rewards) of a set of policies may be interesting, but is left outside of the scope of this paper. Proof Assume by contradiction that ∃A , ∀θ ∈ Θ, lim sup un,θ ≤ 1, n→+∞ [Rn where un,θ = min(E [R (Eθ)],E(A )](UCB(ρ))]) . ˜ θ n A θ [Rn For any θ, we have Eθ [Rn (A )] = Eθ [Rn (A )] ˜ ˜ Eθ [Rn (A )] ≤ un,θ Eθ [Rn (A )], ˜ Eθ [Rn (A )] (9) ˜ so that the fact that A is a consistent policy implies that A is also consistent. Consequently the lower bound of Theorem 6 also holds for policy A . For the rest of the proof, we focus on environments of the form θ = (δ0 , δ∆ ) with ∆ > 0. In this case, arm 2 is the best arm, so that we have to compute D1 (θ). On the one hand, we have: D1 (θ) = inf ˜ ν1 ∈Θ1 :E[˜ 1 ν ]>µ∗ ˜ KL(ν1 , ν1 ) = inf ˜ ν1 ∈Θ1 :E[˜ 1 ]>∆ ν 200 ˜ KL(δ0 , ν1 ) = inf ˜ ν1 ∈Θ1 :E[˜ 1 ]>∆ ν log 1 . ˜ ν1 (0) L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS ˜ ˜ As E[ν1 ] ≤ 1 − ν1 (0), we get: D1 (θ) ≥ inf ˜ ν1 ∈Θ1 :1−˜ 1 (0)≥∆ ν log 1 ˜ ν1 (0) ≥ log 1 . 1−∆ One the other hand, we have for any ε > 0: D1 (θ) ≤ KL(δ0 , Ber(∆ + ε)) = log Consequently D1 (θ) = log 1 1−∆ 1 1−∆−ε , and the lower bound of Theorem 6 reads: lim inf n→+∞ 1 Eθ [T1 (n, A )] . ≥ 1 log n log 1−∆ Just like Equation (9), we have: Eθ [Rn (A )] ≤ un,θ Eθ [Rn (UCB(ρ))]. Moreover, Lemma 3 provides: Eθ [Rn (UCB(ρ))] ≤ 1 + ρ log n . ∆ Now, by gathering the three previous inequalities and Formula (1), we get: 1 log 1 1−∆ ≤ lim inf n→+∞ Eθ [T1 (n, A )] Eθ [Rn (A )] = lim inf n→+∞ log n ∆ log n un,θ Eθ [Rn (UCB(ρ))] un,θ ρ log n 1+ ≤ lim inf n→+∞ ∆ log n ∆ log n ∆ ρun,θ un,θ ρ ρ + lim inf 2 = 2 lim inf un,θ ≤ 2 lim sup un,θ ≤ lim inf n→+∞ ∆ n→+∞ ∆ log n ∆ n→+∞ ∆ n→+∞ ρ . ≤ ∆2 ≤ lim inf n→+∞ This means that ρ has to be lower bounded by ∆2 , 1 log( 1−∆ ) but this is greater than 0.4 if ∆ = 0.75, hence the contradiction. Note that this proof gives a simple alternative to Theorem 4 to show that UCB(ρ) is not consistent (if ρ ≤ 0.4). Indeed if it were consistent, then in environment θ = (δ0 , δ∆ ) the same contradiction between the lower bound of Theorem 6 and the upper bound of Lemma 3 would hold. 5. General Bounds In this section, we study lower bounds on the expected regret with few requirements on Θ and on the class of policies. With a simple property on Θ but without any assumption on the policy, we show that there always exist logarithmic lower bounds for some environments θ. Then, still with a 201 S ALOMON , AUDIBERT AND E L A LAOUI simple property on Θ, we show that there exists a Hannan consistent policy for which the expected regret is sub-logarithmic for some environment θ. Note that the policy that always pulls arm 1 has a 0 expected regret in environments where arm 1 has the best mean reward, and an expected regret of order n in other environments. So, for this policy, expected regret is sub-logarithmic in some environments. Nevertheless, this policy is not Hannan consistent because its expected regret is not always o(n). 5.1 The Necessity of a Logarithmic Regret in Some Environments The necessity of a logarithmic regret in some environments can be explained by a simple sketch proof. Assume that the agent knows the number of rounds n, and that he balances exploration and exploitation in the following way: he ﬁrst pulls each arm s(n) times, and then selects the arm that has obtained the best empiric mean for the rest of the game. Denote by ps(n) the probability that the best arm does not have the best empiric mean after the exploration phase (i.e., after the ﬁrst Ks(n) rounds). The expected regret is then of the form c1 (1 − ps(n) )s(n) + c2 ps(n) n. (10) Indeed, if the agent manages to match the best arm then he only suffers the pulls of suboptimal arms during the exploration phase. That represents an expected regret of order s(n). If not, the number of pulls of suboptimal arms is of order n, and so is the expected regret. Now, let us approximate ps(n) . It has the same order as the probability that the best arm gets X ∗ −µ∗ an empiric mean lower than the second best mean reward. Moreover, k ,s(n) s(n) (where σ is σ ∗ ,1 ) has approximately a standard normal distribution by the central limit theorem. the variance of Xk Therefore, we have: ps(n) ≈ Pθ (Xk∗ ,s(n) ≤ µ∗ − ∆) = Pθ ≈ ≈  σ 1 1 √ exp − 2 2π ∆ s(n) Xk∗ ,s(n) − µ∗ σ  2 ∆ s(n)  σ s(n) ≤ − ∆ s(n) σ 1 σ ∆2 s(n) √ . exp − 2σ2 2π ∆ s(n) It follows that the expected regret has to be at least logarithmic. Indeed, to ensure that the second term c2 ps(n) n of Equation (10) is sub-logarithmic, s(n) has to be greater than log n. But then ﬁrst term c1 (1 − ps(n) )s(n) is greater than log n. Actually, the necessity of a logarithmic regret can be written as a consequence of Theorem 6. n Indeed, if we assume by contradiction that lim supn→+∞ Eθ Rn = 0 for all θ (i.e., Eθ Rn = o(log n)), log the considered policy is consistent. Consequently, Theorem 6 implies that lim sup n→+∞ E θ Rn E θ Rn ≥ lim inf > 0. n→+∞ log n log n Yet, this reasoning needs Θ having the product property, and conditions of the form 0 < Dk (θ) < ∞ also have to hold. The following proposition is a rigorous version of our sketch, and it shows that the necessity of a logarithmic lower bound can be based on a simple property on Θ. 202 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS ˜ ˜ ˜ Proposition 9 Assume that there exist two environments θ = (ν1 , . . . , νK ) ∈ Θ, θ = (ν1 , . . . , νK ) ∈ Θ, and an arm k ∈ {1, . . . , K} such that 1. k has the best mean reward in environment θ, ˜ 2. k is not the winning arm in environment θ, ˜ 3. νk = νk and there exists η ∈ (0, 1) such that dνℓ ∏ d νℓ (Xℓ,1 ) ≥ η ˜ ℓ=k Pθ − a.s. ˜ (11) ˆ Then, for any policy, there exists θ ∈ Θ such that lim sup n→+∞ E θ Rn ˆ > 0. log n ˜ Let us explain the logic of the three conditions of the proposition. If νk = νk , and in case νk seems to be the reward distribution of arm k, then arm k has to be pulled often enough for the regret to be small if the environment is θ. Nevertheless, one has to explore other arms to know ˜ whether the environment is actually θ. Moreover, Inequality (11) makes sure that the distinction ˜ is tough to make: it ensures that pulling any arm ℓ = k gives a reward which is between θ and θ likely in both environments. Without such an assumption the problem may be very simple, and providing a logarithmic lower bound is hopeless. Indeed, the distinction between any pair of tricky ˜ environments (θ, θ) may be solved in only one pull of a given arm ℓ = k, that would almost surely give a reward that is possible in only one of the two environments. The third condition can be seen as an alternate version of condition 0 < Dk (θ) < ∞ in Theorem 6, though there is no logical connection with it. Finally, let us remark that one can check that any set Θ that has the Dirac/Bernoulli property satisﬁes the conditions of Proposition 9. Proof The proof consists in writing a proper version of Expression (10). To this aim we compute a lower bound of Eθ Rn , expressed as a function of Eθ Rn and of an arbitrary function g(n). ˜ ˜ ˜ In the following, ∆k denotes the optimality gap of arm k in environment θ. As event ∑ℓ=k Tℓ (n) ≤ g(n) is measurable with respect to Xℓ,1 , . . . , Xℓ,⌊g(n)⌋ (ℓ = k) and to Xk,1 , . . . , Xk,n , we also introduce the function q such that ½{∑ℓ=k Tℓ (n)≤g(n)} = q (Xℓ,s )ℓ=k, s=1..⌊g(n)⌋ , (Xk,s )s=1..n . 203 S ALOMON , AUDIBERT AND E L A LAOUI We have: ˜ ˜ ˜ Eθ Rn ≥ ∆k Eθ [Tk (n)] ≥ ∆k (n − g(n))Pθ (Tk (n) ≥ n − g(n)) ˜ ˜ (12) ˜ = ∆k (n − g(n))Pθ n − ∑ Tℓ (n) ≥ n − g(n) ˜ ℓ=k ˜ = ∆k (n − g(n))Pθ ˜ ˜ = ∆k (n − g(n)) ∑ Tℓ (n) ≤ g(n) ℓ=k q (xℓ,s )ℓ=k, s=1..⌊g(n)⌋ , (xk,s )s=1..n ˜ ˜ ∏ d νℓ (xℓ,s )∏d νk (xk,s ) ℓ=k s = 1..⌊g(n)⌋ s=1..n ˜ ≥ ∆k (n − g(n)) q (xℓ,s )ℓ=k, s=1..⌊g(n)⌋ , (xk,s )s=1..n η⌊g(n)⌋∏ dνℓ (xℓ,s )∏dνk (xk,s ) ℓ=k s = 1..⌊g(n)⌋ ˜ ≥ ∆k (n − g(n))ηg(n) q (xℓ,s )ℓ=k, s=1..⌊g(n)⌋ , (xk,s )s=1..n ∏ dνℓ (xℓ,s )∏dνk (xk,s ) ℓ=k s = 1..⌊g(n)⌋ ˜ = ∆k (n − g(n))ηg(n) Pθ (13) s=1..n s=1..n ∑ Tℓ (n) ≤ g(n) ℓ=k ˜ = ∆k (n − g(n))ηg(n) 1 − Pθ ∑ Tℓ (n) > g(n) ℓ=k ˜ ≥ ∆k (n − g(n))ηg(n) 1 − Eθ ∑ℓ=k Tℓ (n) g(n) (14) ˜ ≥ ∆k (n − g(n))ηg(n) 1 − Eθ ∑ℓ=k ∆ℓ Tℓ (n) ∆g(n) (15) E θ Rn ˜ ≥ ∆k (n − g(n))ηg(n) 1 − , ∆g(n) where the ﬁrst inequality of (12) is a consequence of Formula (1), the second inequality of (12) and inequality (14) come from Markov’s inequality, Inequality (13) is a consequence of (11), and Inequality (15) results from the fact that ∆ℓ ≥ ∆ for all ℓ. n θ −− Now, let us conclude. If Eθ Rn − − → 0, we set g(n) = 2E∆Rn , so that log n→+∞ g(n) ≤ min n − log n 2 , 2 log η for n large enough. Then, we have: √ − log n ˜ k n − g(n) ηg(n) ≥ ∆k n η 2 log η = ∆k n . ˜ ˜ E θ Rn ≥ ∆ ˜ 2 4 4 In particular, Eθ Rn ˜ −− log n − − → n→+∞ +∞, and the result follows. 204 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS 5.2 Hannan Consistency We will prove that there exists a Hannan consistent policy such that there can not be a logarithmic lower bound for every environment θ of Θ. To this aim, we make use of general UCB policies again (cf. Section 2.1). Let us ﬁrst give sufﬁcient conditions on the fk for UCB policy to be Hannan consistent. Proposition 10 Assume that fk (n) = o(n) for all k ∈ {1, . . . , K}. Assume also that there exist γ > 1 2 and N ≥ 3 such that fk (n) ≥ γ log log n for all k ∈ {1, . . . , K} and for all n ≥ N. Then UCB is Hannan consistent. Proof Fix an arm k such that ∆k > 0 and choose β ∈ (0, 1) such that 2βγ > 1. By means of Lemma 2, we have for n large enough: n Eθ [Tk (n)] ≤ u + 2 ∑ 1+ t=u+1 logt 1 log( β ) e−2βγ log logt , k where u = 4 f∆(n) . 2 k Consequently, we have: n Eθ [Tk (n)] ≤ u + 2 ∑ t=2 1 1 1 + 1 (logt)2βγ−1 2βγ (logt) log( β ) . (16) n n 1 Sums of the form ∑t=2 (logt)c with c > 0 are equivalent to (log n)c as n goes to inﬁnity. Indeed, on the one hand we have n n n 1 dx 1 ∑ (logt)c ≤ 2 (log x)c ≤ ∑ (logt)c , t=2 t=3 n 1 so that ∑t=2 (logt)c ∼ n dx 2 (log x)c . n 2 On the other hand, we have n x dx = c (log x) (log x)c n dx 2 (log x)c+1 n 1 n ∑t=2 (logt)c ∼ (log n)c n +c 2 2 dx . (log x)c+1 n dx 2 (log x)c n dx 2 (log x)c n (log n)c . As both integrals are divergent we have =o Combining the fact that constant C > 0 such that with Equation (16), we get the existence of a Eθ [Tk (n)] ≤ , so that ∼ Cn 4 fk (n) + . 2 ∆ (log n)2βγ−1 Since fk (n) = o(n) and 2βγ − 1 > 0, the latter inequality shows that Eθ [Tk (n)] = o(n). The result follows. We are now in the position to prove the main result of this section. Theorem 11 If Θ has the Dirac/Bernoulli property, there exist Hannan consistent policies for which the expected regret can not be lower bounded by a logarithmic function in all environments θ. 205 S ALOMON , AUDIBERT AND E L A LAOUI Proof If f1 (n) = f2 (n) = log log n for n ≥ 3, UCB is Hannan consistent by Proposition 10. According to Lemma 1, the expected regret is then of order log log n in environments of the form (δa , δb ), a = b. Hence the conclusion on the non-existence of logarithmic lower bounds. Thus we have obtained a lower bound of order log log n. This order is critical regarding the methods we used. Yet, we do not know if this order is optimal. Acknowledgments This work has been supported by the French National Research Agency (ANR) through the COSINUS program (ANR-08-COSI-004: EXPLO-RA project). References R. Agrawal. Sample mean based index policies with o(log n) regret for the multi-armed bandit problem. Advances in Applied Mathematics, 27:1054–1078, 1995. H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281. Springer Verlag, 1973. J.-Y. Audibert, R. Munos, and C. Szepesv´ ri. Exploration-exploitation tradeoff using variance estia mates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002. P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. D. Bergemann and J. Valimaki. Bandit problems. In The New Palgrave Dictionary of Economics, 2nd ed. Macmillan Press, 2008. S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Universit´ Lille 1, France, e 2010. S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 21, pages 201–208. 2009. A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, 2006. N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997. 206 L OWER B OUNDS , W EAK C ONSISTENCY AND S ELECTIVITY IN BANDIT P ROBLEMS P.A. Coquelin and R. Munos. Bandit algorithms for tree search. In Uncertainty in Artiﬁcial Intelligence, 2007. S. Gelly and Y. Wang. Exploration exploitation in go: UCT for Monte-Carlo go. In Online Trading between Exploration and Exploitation Workshop, Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006), 2006. J. Honda and A. Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the Twenty-Third Annual Conference on Learning Theory (COLT), 2010. R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690, 2008. R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, pages 697–704. 2005. T. L. Lai and H. Robbins. Asymptotically efﬁcient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985. D. Lamberton, G. Pag` s, and P. Tarr` s. When can the two-armed bandit algorithm be trusted? e e Annals of Applied Probability, 14(3):1424–1454, 2004. C.L. Mallows. Some comments on cp. Technometrics, pages 661–675, 1973. H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society, 58:527–535, 1952. G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978. W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. 207

5 0.076807991 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

Author: Hervé Frezza-Buet, Matthieu Geist

Abstract: This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good ﬁt between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example. Keywords: reinforcement learning, C++, generic programming 1. C++ Genericity for Fitting the Mathematics of Reinforcement Learning Reinforcement learning (RL) is a ﬁeld of machine learning that beneﬁts from a rigorous mathematical formalism, as shown for example by Bertsekas (1995). Although this formalism is well accepted in the ﬁeld, its translation into efﬁcient computer science tools has surprisingly not led to any standard yet, as mentioned by Kovacs and Egginton (2011). The claim of this paper is that genericity enables a natural expression of the mathematics of RL. The rllib (2011) library implements this idea in the C++ language, where genericity relies on templates. Templates automate the re-writing of some generic code involving user types, offering a strong type checking at compile time that improves the code safety. Using the rllib templates requires that the user-deﬁned types ﬁt some documented concepts. For example, some class C deﬁning an agent should be designed so that C::state type is the type used for the states, C::action type is the type used for the actions, and the method C::action type policy(const C::state type& s) const is implemented, in order to compute the action to be performed in a given state. This concept deﬁnition speciﬁes what is required for an agent mathematically. Note that C does not need to inherit from any kind of abstract rl::Agent class to be used by the rllib tools. It can be directly provided as a type argument to any rllib template requiring an argument ﬁtting the concept of an agent, so that the re-written code actually compiles. 2. A Short Example Let us consider the following toy-example. The state space contains values from 0 to 9, and actions consist in increasing or decreasing the value. When the value gets out of bounds, a reward is returned ∗. Also at UMI 2958 Georgia Tech / CNRS, 2-3, rue Marconi, 57070 Metz, France. c 2013 Herv´ Frezza-Buet and Matthieu Geist. e F REZZA -B UET AND G EIST (-1 for bound 0, 1 for bound 9). Otherwise, a null reward is returned. Let us deﬁne this problem and run Sarsa. First, a simulator class ﬁtting the concept Simulator described in the documentation is needed. c l a s s Sim { / / Our s i m u l a t o r c l a s s . No i n h e r i t a n c e r e q u i r e d . private : i n t c u r r e n t ; double r ; public : typedef int phase type ; typedef int observation type ; t y p e d e f enum { up , down} a c t i o n t y p e ; t y p e d e f d o u b l e r e w a r d t y p e ; Sim ( v o i d ) : c u r r e n t ( 0 ) , r ( 0 ) {} v o i d s e t P h a s e ( c o n s t p h a s e t y p e &s; ) { c u r r e n t = s %10;} c o n s t o b s e r v a t i o n t y p e& s e n s e ( v o i d ) c o n s t { r e t u r n c u r r e n t ; } reward type reward ( void ) const { return r ;} v o i d t i m e S t e p ( c o n s t a c t i o n t y p e &a; ) { i f ( a == up ) c u r r e n t ++; e l s e c u r r e n t −−; i f ( c u r r e n t < 0 ) r =−1; e l s e i f ( c u r r e n t > 9 ) r = 1 ; e l s e r = 0 ; i f ( r ! = 0 ) throw r l : : e x c e p t i o n : : T e r m i n a l ( ” Out o f r a n g e ” ) ; } }; Following the concept requirements, the class Sim naturally implements a sensor method sense that provides an observation from the current phase of the controlled dynamical system, and a method timeStep that computes a transition consecutive to some action. Note the use of exceptions for terminal states. For the sake of simplicity in further code, the following is added. typedef typedef typedef typedef typedef Sim : : p h a s e t y p e Sim : : a c t i o n t y p e r l : : I t e r a t o r t y p e d e f r l : : a g e n t : : o n l i n e : : E p s i l o n G r e e d y Critic ; ArgmaxCritic ; TestAgent ; LearnAgent ; The rllib expresses that Sarsa provides a critic, offering a Q-function. As actions are discrete, the best action (i.e., argmaxa∈A Q(s, a)) can be found by considering all the actions sequentially. This is what ArgmaxCritic offers thanks to the action enumerator Aenum, in order to deﬁne greedy and ε-greedy agents. The main function then only consists in running episodes with the appropriate agents. i n t main ( i n t a r g c , char ∗ a r g v [ ] ) { Sim simulator ; Transition transition ; ArgmaxCritic c r i t i c ; LearnAgent learner ( critic ); TestAgent tester ( critic ); A a; S s; int episode , length , s t e p =0; // // // // // // // T h i s i s what t h e a g e n t c o n t r o l s . T h i s i s some s , a , r , s ’ , a ’ d a t a . T h i s c o m p u t e s Q and argmax a Q( s , a ) . SARSA u s e s t h i s a g e n t t o l e a r n t h e p o l i c y . This behaves according to the c r i t i c . Some a c t i o n . Some s t a t e . f o r ( e p i s o d e = 0 ; e p i s o d e < 1 0 0 0 0 ; ++ e p i s o d e ) { / / L e a r n i n g p h a s e s i m u l a t o r . setPhase ( rand ()%10); r l : : episode : : sa : : r u n a n d l e a r n ( simulator , l e a r n e r , t r a n s i t i o n , 0 , length ) ; } try { / / T e s t phase simulator . setPhase (0); while ( true ) { s = simulator . sense ( ) ; a = t e s t e r . policy ( s ) ; s t e p ++; s i m u l a t o r . t i m e S t e p ( a ) ; } } c a t c h ( r l : : e x c e p t i o n : : T e r m i n a l e ) { s t d : : c o u t << s t e p << ” s t e p s . ” << s t d : : e n d l ; } return 0 ; / / t h e message p r i n t e d i s ‘ ‘10 s t e p s . ’ ’ } 3. Features of the Library Using the library requires to deﬁne the features that are speciﬁc to the problem (the simulator and the Q-function architecture in our example) from scratch, but with the help of concepts. Then, the speciﬁc features can be handled by generic code provided by the library to implement RL techniques with value function estimation. 627 F REZZA -B UET AND G EIST Currently, Q-learing, Sarsa, KTD-Q, LSTD, and policy iteration are available, as well as a multi-layer perceptron architecture. Moreover, some benchmark problems (i.e., simulators) are also provided: the mountain car, the cliff walking, the inverted pendulum and the Boyan chain. Extending the library with new algorithms is allowed, since it consists in deﬁning new templates. This is a bit more technical than only using the existing algorithms, but the structure of existing concepts helps, since it reﬂects the mathematics of RL. For example, concepts like Feature, for linear approaches mainly (i.e., Q(s, a) = θT ϕ(s, a)) and Architecture (i.e., Q(s, a) = fθ (s, a) for more general approximation) orient the design toward functional approaches of RL. The algorithms implemented so far rely on the GNU Scientiﬁc Library (see GSL, 2011) for linear algebra computation, so the GPL licence of GSL propagates to the rllib. 4. Conclusion The rllib relies only on the C++ standard and the availability of the GSL on the system. It offers state-action function approximation tools for applying RL to real problems, as well as a design that ﬁts the mathematics. The latter allows for extensions, but is also compliant with pedagogical purpose. The design of the rllib aims at allowing the user to build (using C++ programming) its own experiment, using several algorithms, several agents, on-line or batch learning, and so on. Actually, the difﬁcult part of RL is the algorithms themselves, not the script-like part of the experiment where things are put together (see the main function in our example). With a framework, in the sense of Kovacs and Egginton (2011), the experiment is not directly accessible to the user programs, since it is handled by some libraries in order to offer graphical interface or analyzing tools. The user code is then called by the framework when required. We advocate that allowing the user to call the rllib functionality at his/her convenience provides an open and extensible access to RL for students, researchers and engineers. Last, the rllib ﬁts the requirements expressed by Kovacs and Egginton (2011, Section 4.3): support of good scientiﬁc research, formulation compliant with the domain, allowing for any kind of agents and any kind of approximators, interoperability of components (the Q function of the example can be used for different algorithms and agents), maximization of run-time speed (use of C++ and templates that inline massively the code), open source, etc. Extensions of rllib can be considered, for example for handling POMDPs, and contributions of users are expected. The use of templates is unfortunately unfamiliar to many programmers, but the effort is worth it, since it brings the code at the level of the mathematical formalism, increasing readability (by a rational use of typedefs) and reducing bugs. Even if the approach is dramatically different from existing frameworks, wrappings with frameworks can be considered in further development. References Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientiﬁc, 3rd (20052007) edition, 1995. GSL, 2011. http://http://www.gnu.org/software/gsl. Tim Kovacs and Robert Egginton. On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning, 84:7–49, 2011. rllib, 2011. http://ims.metz.supelec.fr/spip.php?article122. 628

6 0.076689444 68 jmlr-2013-Machine Learning with Operational Costs

7 0.069386013 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

8 0.047983356 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

9 0.041469511 76 jmlr-2013-Nonparametric Sparsity and Regularization

10 0.04067741 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

11 0.040077921 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

12 0.039850689 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

13 0.039736938 86 jmlr-2013-Parallel Vector Field Embedding

14 0.038446687 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

15 0.037820764 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

16 0.036873721 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

17 0.036503494 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

18 0.035522785 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

19 0.035194211 59 jmlr-2013-Large-scale SVD and Manifold Learning

20 0.035060652 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.214), (1, 0.096), (2, -0.163), (3, 0.126), (4, -0.264), (5, 0.217), (6, -0.457), (7, 0.025), (8, 0.059), (9, -0.168), (10, 0.095), (11, 0.196), (12, -0.212), (13, 0.036), (14, -0.309), (15, -0.104), (16, -0.103), (17, -0.075), (18, 0.091), (19, -0.049), (20, 0.035), (21, 0.073), (22, 0.036), (23, 0.028), (24, 0.025), (25, -0.012), (26, 0.065), (27, 0.004), (28, 0.004), (29, 0.016), (30, -0.002), (31, 0.052), (32, 0.026), (33, 0.009), (34, 0.004), (35, 0.048), (36, -0.021), (37, -0.0), (38, -0.004), (39, 0.0), (40, 0.01), (41, -0.015), (42, 0.002), (43, 0.001), (44, -0.047), (45, -0.016), (46, -0.001), (47, 0.033), (48, 0.006), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93728137 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

2 0.81749541 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

Author: Alberto N. Escalante-B., Laurenz Wiskott

3 0.41119266 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

Author: Bruno Scherrer

4 0.30316725 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

Author: Hervé Frezza-Buet, Matthieu Geist

5 0.23002806 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

Author: Antoine Salomon, Jean-Yves Audibert, Issam El Alaoui

6 0.20117439 68 jmlr-2013-Machine Learning with Operational Costs

7 0.17859833 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

8 0.1759581 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

9 0.16363512 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

10 0.16274068 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

11 0.15887251 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

12 0.15856631 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

13 0.15595613 76 jmlr-2013-Nonparametric Sparsity and Regularization

14 0.15547797 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

15 0.15098733 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

16 0.14175579 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

17 0.13725278 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

18 0.13288601 59 jmlr-2013-Large-scale SVD and Manifold Learning

19 0.13043794 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

20 0.12941234 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (5, 0.135), (6, 0.047), (10, 0.092), (15, 0.315), (20, 0.022), (23, 0.05), (29, 0.016), (53, 0.013), (68, 0.023), (70, 0.039), (75, 0.047), (85, 0.023), (87, 0.022), (89, 0.011), (90, 0.019), (93, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72445852 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

2 0.50585359 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

Author: Alberto N. Escalante-B., Laurenz Wiskott

3 0.50062346 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efﬁcient algorithms for distributed optimization in statistical settings involving large-scale data sets. The ﬁrst algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efﬁciently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

4 0.49943894 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

Author: Jun Wang, Tony Jebara, Shih-Fu Chang

Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classiﬁcation function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classiﬁcation function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efﬁcient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artiﬁcial and standard benchmark data sets where it obtains superior classiﬁcation accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy

5 0.49553716 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classiﬁcation methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classiﬁcation and concern highly dependent time series. Speciﬁcally, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions

6 0.49461088 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

7 0.48866636 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

8 0.48792249 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

9 0.48547673 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

10 0.48484191 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.48473716 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

12 0.48266217 59 jmlr-2013-Large-scale SVD and Manifold Learning

13 0.48150334 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

14 0.4807013 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

15 0.48068202 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

16 0.48016939 73 jmlr-2013-Multicategory Large-Margin Unified Machines

17 0.47984433 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

18 0.47961843 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

19 0.47918221 68 jmlr-2013-Machine Learning with Operational Costs

20 0.47876123 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning