nips nips2012 nips2012-149 knowledge-graph by maker-knowledge-mining

149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity


Source: pdf

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Hierarchical Optimistic Region Selection driven by Curiosity Odalric-Ambrym Maillard Lehrstuhl f¨ r Informationstechnologie u Montanuniversit¨ t Leoben a Leoben, A-8700, Austria odalricambrym. [sent-1, score-0.082]

2 com Abstract This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. [sent-3, score-0.098]

3 The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. [sent-5, score-0.205]

4 The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE. [sent-7, score-0.082]

5 1 Introduction In this paper, we focus on the setting of intrinsically motivated reinforcement learning (see Oudeyer and Kaplan [2007], Baranes and Oudeyer [2009], Schmidhuber [2010], Graziano et al. [sent-9, score-0.146]

6 Indeed, if some formal objective criteria have been proposed to implement specific notions of intrinsic rewards (see Jung et al. [sent-11, score-0.152]

7 [2011], Mugan [2010], Konidaris [2011]), where a precise algorithm is defined together with an experimental study, yet no formal goal is defined, and no def analysis is performed as well. [sent-18, score-0.481]

8 There is no reward, yet one can consider that the goal is to actively select and sample subregions of X for which a notion of “learning progress” - this intuitively measures the decay of some notion of error when successively sampling into one subregion - is maximal. [sent-20, score-0.182]

9 Two key components are advocated in Baranes and Oudeyer [2009], in order to achieve successful results (despite that success is a fuzzy notion): • The use of a hierarchy of regions, where each region is progressively split into sub-regions. [sent-21, score-0.271]

10 We believe it is possible to go one step towards a full performance analysis of such algorithms, by relating the corresponding active sampling problem to existing frameworks. [sent-24, score-0.079]

11 This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. [sent-26, score-0.098]

12 We introduce a mathematical framework in which a metric space (which intuitively plays the role of the state-action space) is divided into regions and a learner has to sample from an unknown random function in a way that reduces a notion of error measure the most. [sent-27, score-0.272]

13 This error consists of two terms, the first one is a robust measure of the quadratic error between the observed samples and their unknown mean, the second one penalizes regions with non constant learning complexity, thus enforcing the notion of curiosity. [sent-28, score-0.241]

14 The paper focuses on how to choose the region to sample from, when a partition of the space is provided. [sent-29, score-0.421]

15 The resulting problem formulation can be seen as a non trivial extension of the setting of active learning in multi-armed bandits (see Carpentier et al. [sent-30, score-0.234]

16 [2010]), where the main idea is to estimate the variance of each arm and sample proportionally to it, to the case when each arm is a region as opposed to a point. [sent-32, score-0.465]

17 In order to deal with this difficulty, the maximal and minimal variance inside each region is tracked by means of a hierarchical optimization procedure, in the spirit of the HOO algorithm from Bubeck et al. [sent-33, score-0.454]

18 This leads to a new procedure called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE. [sent-35, score-0.082]

19 2 Setting: Robust region allocation with curiosity-inducing penalty. [sent-44, score-0.394]

20 We consider an unknown Y-valued process defined on X , written ν : X → M+ (Y), 1 where M+ (Y) refers to the set of all probability measures on Y, such that for all x ∈ X , the random 1 variable Y ∼ ν(x) has mean µ(x) ∈ Rd and covariance matrix Σ(x) ∈ Md,d (R) assumed to be def diagonal. [sent-46, score-0.493]

21 We thus introduce for convenience the notation ρ(x) = trace(Σ(x)), where trace is the trace operator (this corresponds to the variance in dimension 1). [sent-47, score-0.122]

22 We call X the input space or sampling space, and Y the output space or value space. [sent-48, score-0.088]

23 def Intuition Intuitively when applied to the setting of Baranes and Oudeyer [2009], then X = S × A is the space of state-action pairs, where S is a continuous state space and A a continuous action def space, ν is the transition kernel of an unknown MDP, and finally Y = S. [sent-49, score-1.069]

24 One difference is that we assume (see Section 3) that we can sample anywhere in X , which is a restrictive yet common assumption in the reinforcement learning literature. [sent-51, score-0.087]

25 E[ ||ηt || ] = E t t s=1 2 A similar property holds for a region R ⊂ X that has been sampled nt (R) times, and in order to be robust against a bad sampling strategy inside a region, it is natural to look at the worst case error, that we define as def supx∈R ρ(x) . [sent-56, score-1.118]

26 eR (nt ) = nt (R) One reason for looking at robustness is that for instance, in the case we work with an MDP, we are generally not completely free to choose the sample Xs ∈ S × A: we can only choose the action and the next state is generally given by Nature. [sent-57, score-0.414]

27 Goal Now let P be a fixed, known partition of the space X and consider the following game. [sent-59, score-0.149]

28 The goal of an algorithm is, at each time step t, to propose one point xt where to sample the space X , so that its allocation of samples {nt (R)}R∈P (that is, the number of points sampled in each region) minimizes some objective function. [sent-60, score-0.248]

29 Thus, the algorithm is free to sample everywhere in each region, with the goal that the total number of points chosen in each region is optimal in some sense. [sent-61, score-0.272]

30 A simple candidate for this objective function would be the following � � def LP (nt ) = max eR (nt ) ; R ∈ P , however, in order to incorporate a notion of curiosity, we would also like to penalize regions that have a variance term ρ that is non homogeneous (i. [sent-62, score-0.797]

31 Indeed, if a region has constant variance, then we do not really need to understand more its internal structure, and thus its better to focus on an other region that has very heterogeneous variance. [sent-65, score-0.516]

32 For instance, one would like to split such a region in several homogeneous parts, which is essentially the idea behind section C. [sent-66, score-0.306]

33 We thus add a curiositypenalization term to the previous objective function, which leads us to define the pseudo-loss of an def allocation nt = {nt (R)}R∈P in the following way: � � def (1) LP (nt ) = max eR (nt ) + λ|R|(max ρ(x) − min ρ(x)) ; R ∈ P . [sent-68, score-1.493]

34 x∈R x∈R Indeed, this means that we do not want to focus just on regions with high variance, but also trade-off with highly heterogeneous regions, which is coherent with the notion of curiosity (see Oudeyer and Kaplan [2007]). [sent-69, score-0.332]

35 For convenience, we also define the pseudo-loss of a region R by def LR (nt ) = eR (nt ) + λ|R|(max ρ(x) − min ρ(x)) . [sent-70, score-0.698]

36 n� ∈ argmin LP (nt ) ; {nt (R)}R∈P is such that t R∈P Note that the sum starts at t = |P| for a technical reason, since for t < |P|, whatever the allocation, there is always at least one region with no sample, and thus LP (nt ) = ∞. [sent-72, score-0.244]

37 , K} is finite with K � T , and when P is the complete partition (each cell corresponds to exactly one point), the penalization term is canceled. [sent-76, score-0.203]

38 Thus the problem reduces to the choice of the quantities nt (i) for each arm i, and the loss of an � � allocation simply becomes ρ(i) def L(nt ) = max ;1≤i≤K . [sent-77, score-1.171]

39 nt (i) This almost corresponds to the already challenging setting analyzed for instance in Carpentier et al. [sent-78, score-0.444]

40 The difference is that we are interested in the cumulative regret of our allocation instead of only the regret suffered for the last round as considered in Carpentier et al. [sent-81, score-0.52]

41 Also we directly target nt (i) whereas they consider the mean sampling error (but both terms are actually of the same order). [sent-84, score-0.447]

42 Thus the setting we consider can be seen as a generalization of these works to the case when each arm corresponds to a continuous sampling domain. [sent-85, score-0.136]

43 We essentially assume that the unknown distribution is such that it has a sub-Gaussian noise, and a smooth mean and variance functions. [sent-87, score-0.096]

44 Concerning the algorithm, we consider it can use a partition tree of the space, and that this one is essentially not degenerated (a typical binary tree that satisfies all the following assumptions is such that each cell is split in two children of equal volume). [sent-89, score-0.414]

45 Such assumptions on trees have been extensively discussed for instance in Bubeck et al. [sent-90, score-0.098]

46 Unknown distribution We assume that ν is sub-Gaussian, meaning that for all fixed x ∈ X ∀λ ∈ Rd ln E exp[�λ, Y − µ(X)�] ≤ and has diagonal covariance matrix in each point2 . [sent-95, score-0.143]

47 We consider an infinite binary tree T whose nodes correspond to regions of X . [sent-107, score-0.126]

48 A node is indexed by a pair (h, i), where h ≥ 0 is the depth of the nodes in T and 0 ≤ i < 2h is the position of the node at depth h. [sent-108, score-0.196]

49 We write R(h, i) ⊂ X the region associated with node (h, i). [sent-109, score-0.304]

50 In dimension d, a standard way to define such a tree is to split each parent node in half along the largest side of the corresponding hyper-rectangle, see Bubeck et al. [sent-111, score-0.235]

51 For a region (h, i) ∈ Tt , we denote by Ct (h, i) the set of its children in Tt , and by Tt (h, i) the subtree of Tt starting with root node (h, i). [sent-114, score-0.374]

52 Algorithm and partition The partition P is assumed to be such that each of its regions R corresponds to one region R(h, i) ∈ T ; equivalently, there exists a finite sub-tree T0 ⊂ T such that Leaf (T0 ) = P. [sent-115, score-0.563]

53 In the sequel, we write indifferently P ∈ T and (h, i) ∈ T or P and R(h, i) ⊂ X to refer to the partition or one of its cell. [sent-117, score-0.122]

54 � x ,x∈R(h,i) x ,x∈R(h,i) Similarly, we assume that there exists positive constants c� ≤ c, c� ≤ c1 and c� ≤ c2 such that for 1 2 all h ≥ 0, then |R(h, i)| ≥ c� γ h , h h max �1 (x, x� ) ≥ c� γ1 and � max �2 (x, x� ) ≥ c� γ2 . [sent-119, score-0.098]

55 this assumption is only here to make calculations easier and avoid nasty technical considerations that anyway do not affect the order of the final regret bound but only concern second order terms. [sent-123, score-0.184]

56 It is called Hierarchical Optimistic Region SElection driven by Curiosity. [sent-125, score-0.082]

57 1 High-probability upper-bound and lower-bound estimations Let us consider the following (biased) estimator t t 1 � 1 � def ˆt σ 2 (R) = ||Ys ||2 I{Xs ∈ R} − || Ys I{Xs ∈ R}||2 . [sent-128, score-0.485]

58 Nt (R) s=1 Nt (R) s=1 t Apart from a small multiplicative biased by a factor NN(R)−1 , it has more importantly a positive bias t (R) due to the fact that the random variables do not share the same mean; this phenomenon is the same as the estimation of the average variance for independent but� i. [sent-129, score-0.084]

59 2Nt (R) Nt (R) s=1 Note that we would have preferred to replace the terms involving ln(2d/δ) with a term depending on the empirical variance, in the spirit of Carpentier et al. [sent-138, score-0.098]

60 Similarly, under the same assumptions, then � � P ∀x ∈ X ; Lt (R, x, δ) ≤ ρ(x) − b(x, R, Nt (R), δ) ≤ tδ , where we introduced for convenience the quantity b(x, R, n, δ) def = √ 2 max �2 (x, x ) + d1 (R) + 2(1 + 2 d) � x ∈R � 2 � d ln(2d/δ) d ln(2d/δ) + . [sent-151, score-0.556]

61 2 Hierarchical Optimistic Region SElection driven by Curiosity (HORSE. [sent-162, score-0.082]

62 It is chosen by expanding a leaf of a hierarchical tree Tt ⊂ T , in an optimistic way, starting with a tree T0 with leaves corresponding to the partition P. [sent-167, score-0.535]

63 The intuition is the following: let us consider a node (h, i) of the tree Tt expanded by the algorithm at time t. [sent-168, score-0.111]

64 The maximum value of ρ in R(h, i) is thus achieved for one of its children node (h� , i� ) ∈ Ct (h, i). [sent-169, score-0.102]

65 These values are used in order to build an optimistic estimate of the quantity LR(h,i) (Nt ) in region (h, i) (step 4), and then to select in which cell of the partition we should sample (step 5). [sent-172, score-0.658]

66 Then the ˆt algorithm chooses where to sample in the selected region so as to improve the estimations of ρ+ and ˆt ρ− . [sent-173, score-0.303]

67 ) between expanding a leaf following a path that is optimistic ˆt ˆt according to ρ+ (step 7,8,9), or according to ρ− (step 11. [sent-175, score-0.256]

68 ) Thus, at a high level, the algorithm performs on each cell (h, i) ∈ P of the given partition two hierarchical searches, one for the maximum value of ρ in region R(h, i) and one for its minimal value. [sent-176, score-0.502]

69 On the other hand, there is a strong link between step 5, where we decide to allocate samples between regions {R(h, i)}(h,i)∈P , and the CH-AS algorithm from Carpentier et al. [sent-179, score-0.163]

70 To this end, we make use of the notion of near-optimality dimension, introduced in Bubeck et al. [sent-186, score-0.118]

71 [2011], and that measures a notion of intrinsic dimension of the maximization problem. [sent-187, score-0.166]

72 Let d+ (h0 , i0 ) be the c-optimality dimension of ρ restricted to the region R(h0 , i0 ) (see e. [sent-189, score-0.283]

73 Similarly, let d− (h0 , i0 ) be the c-optimality 1 2 dimension of −ρ restricted to the region R(h0 , i0 ). [sent-193, score-0.283]

74 Let us finally define the biggest near-optimality dimension of ρ over each cell of the partition P to be � � � � def dρ = max max d+ (h0 , i0 ), d− (h0 , i0 ) ; (h0 , i0 ) ∈ P . [sent-194, score-0.794]

75 C) Under the assumptions of Section 3 and if moreover 2 γ1 ≤ γ2 , then for all δ ∈ [0, 1], the regret of the Hierarchical Optimistic Region SElection driven by Curiosity procedure parameterized with δ is bounded with probability higher than 1 − 2δ as follows. [sent-196, score-0.278]

76 � � T � � � 1 + 2λcγ h0 B h0 , n� (h0 , i0 ), δt , RT ≤ max t � (h , i ) nt 0 0 (h0 ,i0 )∈P t=|P| 6 Algorithm 1 The HORSE. [sent-197, score-0.435]

77 Require: An infinite binary tree T , a partition P ⊂ T , δ ∈ [0, 1], λ ≥ 0 6δ 1: Let T0 be such that Leaf (T0 ) = P, and δi,t = π2 i2 (2t+1)|P|t3 , t := 0. [sent-199, score-0.173]

78 2: while true do 3: define for each region (h, i) ∈ Tt the estimated loss � � ˆ+ def ρ (h, i; δ) ˆ ˆt ˆt + λ|R(h, i)| ρ+ (h, i; δ) − ρ− (h, i; δ) , Lt (h, i) = t Nt (R(h, i)) ˆ where δ = δNt (R(h,i)),t , where by convention Lt (h, i) if it is undefined. [sent-200, score-0.731]

79 4: choose the next region of the current partition P ⊂ T to sample � � def ˆ (Ht+1 , It+1 ) = argmax Lt (h, i) ; (h, i) ⊂ P . [sent-201, score-0.889]

80 jt+1 jt+1 expand the node (Ht+1 , It+1 ) in order to define Tt+1 and then define the candidate child � � def jt+1 jt+1 ˆt (ht+1 , it+1 ) = argmax ρ+ (h, i; δn,t ) ; (h, i) ∈ Ct+1 (Ht+1 , It+1 ) . [sent-203, score-0.555]

81 sample at point Xt+1 and receive the value Yt+1 ∼ ν(Xt+1 ), where � � def Xt+1 = argmax Ut (R(ht+1 , it+1 ), x, δn,t ) ; x ∈ R(ht+1 , it+1 ) , else ˆt ˆt proceed similarly than steps 6,7,8 with ρ+ replaced with ρ− . [sent-204, score-0.552]

82 Then, we lower-bound the depth of the subtree of each region (h0 , i0 ) ∈ P that contains a maximal point argmaxx∈R(h0 ,i0 ) ρ(x), and proceed similarly for a minimal point. [sent-215, score-0.339]

83 This uses the near-optimality dimension of ρ and −ρ in the region R(h0 , i0 ), and enables to provide an ˆt ˆt upper bound on ρ+ (h, i; δ) as well as a lower bound on ρ− (h, i; δ). [sent-216, score-0.339]

84 Finally, we relate the true loss of the current allocation to the one using the optimal one n� (h0 , i0 ) by discussing whether a t+1 region has been over or under sampled. [sent-218, score-0.427]

85 This final part is closed in spirit to the proof of the regret bound for CH-AS in Carpentier et al. [sent-219, score-0.282]

86 7 � def −d � Corollary 1 Let β = 1+ln max{2, γ2 ρ } . [sent-222, score-0.454]

87 Under the assumptions of Theorem 1, assuming that the partition P of the space X is well behaved, i. [sent-223, score-0.189]

88 Thus, (h this shows that, after normalization, the relative regret on each cell (h0 , i0 ) is roughly of order 1 � ln(t) � 2β − 1 1 , i. [sent-227, score-0.237]

89 This shows that we are not only able t ρ+ (h0 ,i0 ) n� (h0 ,i0 ) t to compete with the performance of the best allocation strategy, but we actually achieve the exact same performance with multiplicative factor 1, up to a second order term. [sent-230, score-0.204]

90 Note also that, when specified to the case of Example 1, the order of this regret is competitive with the standard results from Carpentier et al. [sent-231, score-0.214]

91 ρ+ 0 ,i0 ) (h The lost of the variance term ρ+ (h0 , i0 )−1 (that is actually a constant) here comes from the fact that we are only able to use Hoeffding’s like bounds for the estimation of the variance. [sent-233, score-0.084]

92 In order to remove it, one would need empirical Bernstein’s bounds for variance estimation in the case of martingale difference sequences. [sent-234, score-0.086]

93 6 Discussion In this paper, we have provided an algorithm together with a regret analysis for a problem of online allocation of samples in a fixed partition, where the objective is to minimize a loss that contains a penalty term that is driven by a notion of curiosity. [sent-236, score-0.481]

94 We have considered an extension of this problem to a continuous domain where a fixed partition of the space as well as a generative model of the unknown dynamic are given, using our curiosity-driven loss function as a measure of performance. [sent-242, score-0.255]

95 Our main result is a regret bound for that problem, that shows that our procedure is first order optimal, i. [sent-243, score-0.184]

96 achieves the same performance as the best possible allocation (thus with multiplicative constant 1). [sent-245, score-0.177]

97 We believe this result contributes to filling the important gap that exists between existing algorithms for the challenging setting of intrinsic reinforcement learning and a theoretical analysis of such, the HORSE. [sent-246, score-0.126]

98 Indeed, in order to achieve the objective that tries to address RIAC, one should first remove the assumption that the partition is given: One trivial solution is to run the HORSE. [sent-248, score-0.155]

99 that we can sample wherever we want), a question that shares links with a problem called autonomous exploration. [sent-251, score-0.115]

100 Formal theory of creativity, fun, and intrinsic motivation (1990-2010). [sent-304, score-0.095]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('def', 0.454), ('nt', 0.386), ('region', 0.244), ('tt', 0.185), ('oudeyer', 0.175), ('curiosity', 0.169), ('carpentier', 0.165), ('optimistic', 0.156), ('regret', 0.156), ('allocation', 0.15), ('ln', 0.143), ('ht', 0.141), ('baranes', 0.136), ('lt', 0.127), ('partition', 0.122), ('bubeck', 0.116), ('antos', 0.104), ('jt', 0.097), ('xs', 0.096), ('graziano', 0.096), ('autonomous', 0.087), ('driven', 0.082), ('cell', 0.081), ('regions', 0.075), ('lp', 0.075), ('leaf', 0.071), ('ut', 0.069), ('arm', 0.068), ('intrinsic', 0.067), ('non', 0.067), ('ct', 0.066), ('node', 0.06), ('rc', 0.06), ('notion', 0.06), ('reinforcement', 0.059), ('et', 0.058), ('lipschitz', 0.057), ('variance', 0.057), ('hierarchical', 0.055), ('lr', 0.055), ('maillard', 0.052), ('tree', 0.051), ('max', 0.049), ('hoo', 0.048), ('martius', 0.048), ('mugan', 0.048), ('riac', 0.048), ('active', 0.045), ('metric', 0.043), ('xt', 0.043), ('csaba', 0.043), ('kaplan', 0.042), ('children', 0.042), ('argmax', 0.041), ('spirit', 0.04), ('assumptions', 0.04), ('forwards', 0.039), ('founded', 0.039), ('jung', 0.039), ('konidaris', 0.039), ('leoben', 0.039), ('unknown', 0.039), ('dimension', 0.039), ('depth', 0.038), ('tobias', 0.035), ('homogeneous', 0.035), ('szepesv', 0.035), ('sampling', 0.034), ('continuous', 0.034), ('trivial', 0.033), ('loss', 0.033), ('mdp', 0.031), ('quantities', 0.031), ('bandits', 0.031), ('estimations', 0.031), ('allocate', 0.03), ('ys', 0.03), ('similarly', 0.029), ('maxx', 0.029), ('validating', 0.029), ('intrinsically', 0.029), ('expanding', 0.029), ('martingale', 0.029), ('rt', 0.029), ('motivation', 0.028), ('subtree', 0.028), ('minx', 0.028), ('sample', 0.028), ('bound', 0.028), ('heterogeneous', 0.028), ('heidelberg', 0.028), ('actually', 0.027), ('munos', 0.027), ('formal', 0.027), ('multiplicative', 0.027), ('split', 0.027), ('space', 0.027), ('quantity', 0.027), ('decays', 0.026), ('convenience', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

2 0.35698459 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd . Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α−H¨ lder functions. o 1 Setting and Previous work Let us consider some space X ⊂ Rd , and Y ⊂ R. We call X the input space or sampling space, Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ Rd → Y ⊂ R. We assume that we can query n times the function f , anywhere in the domain, and observe noisy samples of this function. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally: Observed process We consider an unknown Y-valued process defined on X , written ν : X → M+ (Y), where M+ (Y) refers to the set of all probability measures on Y, such that for all x ∈ X , 1 1 def the random variable Y (x) ∼ ν(x) has mean f (x) = E[Y (x)|x] ∈ R. We write for convenience the model in the following way Y (x) = f (x) + noise(x) , def where noise(x) = Y (x) − E[Y (x)|x] is the centered random variable corresponding to the noise, o with unknown variance σ 2 (x). We assume throughout this paper that f is α-H¨ lder. Partition We consider we can define a partition P of the input space X , with finitely many P regions {Rp }1≤p≤P that are assumed to be convex and not degenerated, i.e. such that the interior of each region Rp has positive Lebesgue volume vp . Moreover, with each region Rp is associated a sampling distribution in that region, written µp ∈ M+ (Rp ). Thus, when we decide to sample in 1 region Rp , a new sample X ∈ Rp is generated according to X ∼ µp . Allocation. We consider that we have a finite budget of n ∈ N samples that we can use in order to allocate samples as we wish among the regions {Rp }1≤p≤P . For illustration, let us assume that we deterministically allocate Tp,n ∈ N samples in region Rp , with the constraint that the allocation {Tp,n }1≤p≤P must some to n. In region Rp , we thus sample points {Xp,i }1≤p≤P at random 1 according to the sampling distribution µp , and then get the corresponding values {Yp,i }1≤i≤Tp,n , where Yp,i ∼ ν(Xp,i ). In the sequel, the distribution µp is assumed to be the uniform distribution dλ(x)1x∈R over region Rp , i.e. the density of µp is λ(Rp ) p where λ denotes the Lebesgue measure. Note that this is not restrictive since we are in an active, not passive setting. Piecewise constant mean-approximation. We use the collected samples in order to build a pieceˆ wise constant approximation fn of the mean f , and measure the accuracy of approximation on a region Rp with the expected quadratic norm of the approximation error, namely � � � � � ˆ (x))2 λ(dx) = Eµ ,ν (f (X) − mp,n )2 , ˆ (f (x) − fn E p λ(Rp ) Rp ˆ where mp,n is the constant value that takes fn on the region Rp . A natural choice for the estimator ˆ mp,n is to use the empirical mean that is unbiased and asymptotically optimal for this criterion. ˆ Thus we consider the following estimate (histogram) ˆ fn (x) = P � p=1 mp,n I{x ∈ Rp } where mp,n = ˆ ˆ Tp,n 1 � Tp,n Yp,i . i=1 Pseudo loss Note that, since the Tp,n are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form � � � � � � ˆ Eµp ,ν (f (X) − mp,n )2 ˆ = Eµp ,ν (f (X) − Eµp [f (X)])2 + Eµp ,ν (Eµp [f (X)] − mp,n )2 � � � � = Vµp f (X) + Vµp ,ν mp,n ˆ � � � � 1 Vµp ,ν Y (X) . = Vµp f (X) + Tp,n Now, using the following immediate decomposition � � � � � Vµp ,ν Y (X) = Vµp f (X) + σ 2 (x)µp (dx) , Rp we deduce that the maximal expected quadratic norm of the approximation error over the regions def {Rp }1≤p≤P , that depends on the choice of the considered allocation strategy A = {Tp,n }1≤p≤P is thus given by the following so-called pseudo-loss � � � � � � Tp,n + 1 1 def 2 (1) Vµp f (X) + Eµ σ (X) . Ln (A) = max 1≤ p ≤P Tp,n Tp,n p Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of �� �2 � ˆ cells tends to ∞, this local measure of performance approaches supx∈X Eν f (x) − fn (x) . At this point, let us also introduce, for convenience, the notation Qp (Tp,n ) that denotes the term inside the max, in order to emphasize the dependency on the quadratic error with the allocation. Previous work There is a huge literature on the topic of functional estimation in batch setting. Since it is a rather old and well studied question in statistics, many books have been written on this topic, such as Bosq and Lecoutre [1987], Rosenblatt [1991], Gy¨ rfi et al. [2002], where piecewise constant meano approximation are also called “partitioning estimate” or “regressogram” (first introduced by Tukey [1947]). The minimax-optimal rate of approximation on the class of α-H¨ lder functions is known o 2α to be in O(n− 2α+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy¨ rfi et al. [2002]). o In such setting, a dataset {(Xi , Yi )}i≤n is given to the learner, and a typical question is thus to try to find the best possible histogram in order to minimize a approximation error. Thus the dataset is fixed and we typically resort to techniques such as model selection where each model corresponds to one histogram (see Arlot [2007] for an extensive study of such). However, we here ask a very different question, that is how to optimally sample in an online setting in order to minimize the approximation error of some histogram. Thus we choose the histogram 2 before we see any sample, then it is fixed and we need to decide which cell to sample from at each time step. Motivation for this setting comes naturally from some recent works in the setting of active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011]. In these works, the objective is to estimate with equal precision the mean of a finite number of distributions (arms), which would correspond to the special case when X = {1, . . . , P } is a finite set in our setting. Intuitively, we reduce the problem to such bandit problem with finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when the arms do not correspond to a singleton, but rather to a continuous region. We show that the answer is positive, yet non trivial. This is non trivial due to the variance estimation in each region: points x in some region may have different means f(x), so that standard estimators for the variance are biased, contrary to the point-wise case and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and the noise level σ 2 in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward but optimizing the goal to understand the unknown environment. Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α−H¨ lder o functions. Section 3 presents our assumptions, that are basically to have a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. Finally, the performance of this procedure is provided and discussed in Section 4. 2 The pseudo-loss: study and optimal strategies 2.1 More intuition on each term in the pseudo-loss It is natural to look at what happens to each of the two terms that appear in equation 1 when one makes Rp shrink towards a point. More precisely, let xp be the mean of X ∼ µp and let us look at the limit of Vµp (f (X)) when vp goes to 0. Assuming that f is differentiable, we get �2 � �� lim Vµp (f (X)) = lim Eµp f (X) − f (xp ) − E[f (X) − f (xp )] vp →0 vp →0 = = = lim Eµp �� �X − xp , ∇f (xp )� − E[�X − xp , ∇f (xp )�] vp →0 � � lim Eµp �X − xp , ∇f (xp )�2 vp →0 � � lim ∇f (xp )T Eµp (X − xp )(X − xp )T ∇f (xp ) . �2 � vp →0 Therefore, if we introduce Σp to be the covariance matrix of the random variable X ∼ µp , then we simply have lim Vµp (f (X)) = lim ||∇f (xp )||2 p . Σ vp →0 vp →0 Example with hyper-cubic regions An important example is when Rp is a hypercube with side 1/d length vp and µp is the uniform distribution over the region Rp . In that case (see Lemma 1), we dx have µp (dx) = , and 2/d vp vp . ||∇f (xp )||2 p = ||∇f (xp )||2 Σ 12 More generally, when f is α−differentiable, i.e. that ∀a ∈ X , ∃∇α f (a, ·) ∈ Sd (0, 1)R such that ∀x ∈ Sd (0, 1), limh→0 f (a+hx)−f (a) = ∇α f (a, x), then it is not too difficult to show that for such hα hyper-cubic regions, we have � � � 2α � Vµp f (X) = O vpd sup |∇α f (xp , u)|2 . S(0,1) � � On the other hand, by direct computation, the second term is such that limvp →0 Eµp σ 2 (X) = � � � � σ 2 (xp ). Thus, while Vµp f (X) vanishes, Eµp σ 2 (X) stays bounded away from 0 (unless ν is deterministic). 3 2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation. We now assume that we are allowed to choose the partition P depending on n, thus P = Pn , amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume, and come from a regular grid of the space. Thus the only free parameter is the number of cells Pn of the partition. An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield to a closed-form solution in general. However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of minimax yet non adaptive oracle � algorithm given in closed-form solution): � � −β � � � −α� � � Lemma 1 In the case when Vµp f (X) = Ω Pn and Rp σ 2 (x)µp (dx) = Ω Pn , then an � optimal allocation and partitioning strategy An satisfies that� � � � Vµp f (X) + Eµp σ 2 (X) � � , L − Vµp f (X) � as soon as there exists, for such range of Pn , a constant L such that � � � � � Pn � Vµp f (X) + Eµp σ 2 (X) � � = n. L − Vµp f (X) p=1 1 � Pn = Ω(n max(1+α� −β� ,1) ) and def � Tp,n = The pseudo-loss of such an algorithm A� , optimal amongst the allocations strategies that use the n � partition Pn in Pn regions, is then given by � � � � def max(1 − β , 1 − α ) − 1. where γ = Ln (A� ) = Ω nγ n max(1 + α� − β � , 1) The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough. This Lemma applies to the important class W 1,2 (R) of functions that admit a weak derivative that o belongs to L2 (R). Indeed these functions are H¨ lder with coefficient α = 1/2, i.e. we have o W 1,2 (R) ⊂ C 0,1/2 (R). The standard Brownian motion is an example of function that is 1/2-H¨ lder. More generally, for k = d + α with α = 1/2 when d is odd and α = 1 when d is even, we have the 2 inclusion W k,2 (Rd ) ⊂ C 0,α (Rd ) , where W k,2 (Rd ) is the set of functions that admit a k th weak derivative belonging to L2 (Rd ). Thus the previous Lemma applies to sufficiently smooth functions with smoothness linearly increasing with the dimension d of the input space X . Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then to provide the best oracle way to allocate the budget. We can deduce the following immediate corollary on the class of α−H¨ lder functions observed in a o non-negligible noise of bounded variance (i.e. in the setting β � = 0 and α� = 2α ). d Corollary 1 Consider that f is α−H¨ lder and the noise is of bounded variance. Then a minimaxo d � d+2α ) and an optimal allocation achieves the rate L (A� ) = optimal partition satisfies Pn = Ω(n n n � −2α � Ω n d+2α . Moreover, the strategy of Lemma 1 is optimal amongst the allocations strategies that � use the partition Pn in Pn regions. � −2α � The rate Ω n d+2α is minimax-optimal on the class of α−H¨ lder functions (see Gy¨ rfi et al. [2002], o o Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial numd � � d+2α ). After having built the partition, if the quantities ber �� � 2 �� � � of�regions Pn that is of order Pn = Ω(n Vµp f p≤P and Eµp σ p≤P are known to the learner, it is optimal, in the aim of minimizing � the pseudo-loss, to allocate to each region the number of samples Tp,n provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate 4 the samples properly in the regions, without having any access to those quantities. It is then �� � � necessary to balance between exploration, i.e. allocating the samples in order to estimate Vµp f p≤P � � �� and Eµp σ 2 p≤P , and exploitation, i.e. use the estimates to target the optimal allocation. 3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation In this section, we now turn to the design of algorithms that are fully online, with the goal to be competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]d is divided in Pn hyper-cubic regions of same measure (the Lebesgue measure on 1 [0, 1]d ) vp = v = Pn . The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as � � � � � � 2 λ(dx) ˆ (x))2 λ(dx) = max E , max E (f (x) − fn (f (x) − mp,n ) ˆ 1≤p≤Pn 1≤p≤Pn λ(Rp ) λ(Rp ) Rp Rp ˆ where fn is the histogram estimate of the function f on the partition P and mp,n is the empirical ˆ mean defined on region Rp with the samples (Xi , Yi ) such that Xi ∈ Rp . To do so, an algorithm is only allowed to specify at each time step t, the next point Xt where to sample, based on all the past samples {(Xs , Ys )}s < ∞ satisfies that λ2 σ 2 (x) , ∀λ ∈ R+ log E exp[λ noise(x)] ≤ 2 and we further assume that it satisfies the following slightly stronger second property (that is for instance exactly verified for a Gaussian variable, looking at the moment generating function): � � � � 1 λ2 σ 2 (x) ∀λ, γ ∈ R+ log E exp λnoise(x) + γnoise(x)2 ≤ − log 1 − 2γσ 2 (x) . 2(1 − 2γσ 2 (x)) 2 5 The function f is assumed to be (L, α)-H¨ lder, meaning that it satifies o � ∀x, x ∈ X f (x) − f (x� ) ≤ L||x − x� ||α . Similarly, the function σ 2 is assumed to be (M, β)-H¨ lder i.e. it satisfies o � 2 2 � ∀x, x ∈ X σ (x) − σ (x ) ≤ M ||x − x� ||β . We assume that Y is a convex and compact subset of R, thus w.l.g. that it is [0, 1], and that it is known that ||σ 2 ||∞ , which is thus finite, is bounded by the constant 1. 3.2 Empirical estimation of the quadratic approximation error on each cell We define the sampling distribution µp in the region Rp for each p ∈ {1, . . . , Pn } as a quasi-uniform ˜ sampling scheme using the uniform distribution over the sub-regions. More precisely at time t ≤ n, if we decide to sample in the region Rp according to µp , we sample uniformly in each sub-region ˜ one sample, resulting in a new batch of samples {(Xt,k , Yt,k )}1≤k≤K , where Xt,k ∼ µp,k . Note that due to this sampling process, the number of points Tp,t sampled in sub-region Rp at time t is always Tp,t a multiple of K and that moreover for all k, k � ∈ {1, . . . , K} we have that Tp,k,t = Tp,k� ,t = K . Now this specific sampling is used in order to be able to estimate the variances Vµp f and Eµp σ 2 , � so that the best proportions Tp,n can be computed as accurately as possible. Indeed, as explained in � � � � Lemma 1, we have that Vµp f (X) + Eµp σ 2 (X) � def � � . Tp,n = L − Vµp f (X) ˆ Variance estimation We now introduce two estimators. The first estimator is written Vp,t and is def ˆ built in the following way. First,let us introduce the empirical estimate fp,k,t of the mean fp,k = � � Eµp,k f (X) of f in sub-region Rp,k . Similarly, to avoid some cumbersome notations, we introduce � � � � � � def def def 2 fp = Eµp f (X) and vp,k = Vµp,k f (X) for the function f , and then σp,k = Eµp,k σ 2 (X) for the variance of the noise σ 2 . We now define the empirical variance estimator to be K 1 � ˆ ˆ (fp,k,t − mp,t )2 , ˆ Vp,t = K −1 k=1 that is a biased estimator. Indeed, for a deterministic Tp,t , it is not difficult to show that we have � K K � � � � � � �� � � � � 2 1 �� 1 � ˆ E Vp,t + Eµp,k f − Eµp f = Vµp,k f + Eµp,k σ 2 . K −1 Tp,t k=1 k=1 � � The leading term in this decomposition, that is given by the first sum, is closed to Vµp f since, by using the assumption that f is (L, α)−H¨ lder, we have the following inequality o � � K � �� �� � �1 � � � � 2 2L2 dα � Eµp,k f − Eµp f − Vµp f (X) � ≤ , � �K (KPn )2α/d k=1 where we also used that the diameter of a sub-region Rp,k is given by diam(Rp,k ) = d1/2 . (KPn )1/d ˆ Then, the second term also contributes to the bias, essentially due to the fact that V[fp,k,t ] = � � � � 2 def def 1 1 2 2 2 Tp,k,t (vp,k + σp,k ) and not Tp,t (vk + σk ) (with vp = Vµp f (X) and σp = Eµp σ (X) ). ˆ p,k,t In order to correct this term, we now introduce the second estimator σ 2 that estimates the variance � � � � � � of the outputs in a region Rp,k , i.e. Vµp,k ,ν Y (X) = Vµp,k f (X) + Eµp,k σ 2 . It is defined as �2 t t �� 1 1 � def ˆ p,k,t = Yi − Yj I{Xj ∈ Rp,k } I{Xi ∈ Rp,k } . σ2 Tp,k,t − 1 i=1 Tp,k,t j=1 Now, we combine the two previous estimators to form the following estimator K 1 �� 1 1 � 2 ˆ ˆ ˆ σ − . Qp,t = Vp,t − K Tp,k,t Tp,t p,k,t k=1 ˆ The following proposition provides a high-probability bound on the difference between Qp,t and the quantity we want to estimate. We report the detailed proof in [Carpentier and Maillard, 2012]. 6 ˆ Proposition 1 By the assumption that f is (L, α)-H¨ lder, the bias of the estimator Qp,t , and for o deterministic Tp,t , is given by � K � � � � � � � � � 2 1 � 2L2 dα ˆ − Vµp f (X) ≤ . Eµp,k f − Eµp f E Qp,t − Qp (Tp,t ) = K (KPn )2α/d k=1 Moreover, it satisfies that for all δ ∈ [0, 1], there exists an event of probability higher than 1 − δ such that on this event, we have � � � � � � K K � � � � 8 log(4/δ) � σ 2 �1 � � � � ˆ p,k,t 1 � 2 ˆ ˆ � Qp,t − E Qp,t � ≤ � √ +o σ p,k . � � (K − 1)2 T2 T K K k=1 p,k,t p,k,t k=1 We also state the following Lemma that we are going to use in the analysis, and that takes into account randomness of the stopping times Tp,k,t . Lemma 2 Let {Xp,k,u }p≤P, k≤K, u≤n be samples potentially sampled in region Rp,k . We introduce qp,u to be the�equivalent of Qp (Tp,t ) with explicitly fixed value of Tp,t = u. Let also qp,u be the ˆ � ˆ p,t but computed with the first u samples in estimate of E qp,u , that is to say the equivalent of Q each region Rp,k (i.e. Tp,t = u). Let us define the event � � � � � � � AK log(4nP/δ)V � � ˆp,t 2L2 dα � � ξn,P,K (δ) = + ω : � qp,u (ω) − E qp,u � ≤ ˆ , u K −1 (KPn )2α/d p≤P u≤n �K 1 ˆ ˆ ˆ p,k,t and where A ≤ 4 is a numerical constant. Then it where Vp,t = Vp (Tp,t ) = K−1 k=1 σ 2 holds that � � P ξn,P,K (δ) ≥ 1 − δ . Note that, with the notations of this Lemma, Proposition 1 above is thus about qp,u . ˆ 3.3 The Online allocation and homogeneous partitioning algorithm for piecewise constant mean-approximation (OAHPA-pcma) We are now ready to state the algorithm that we propose for minimizing the quadratic error of approximation of f . The algorithm is described in Figure 1. Although it looks similar, this algorithm is ˆ quite different from a normal UCB algorithm since Qp,t decreases in expectation with Tp,t . Indeed, � � � � � �� �K � 1 its expectation is close to Vµp f + KTp,t k=1 Vµp,k f + Eµp,k σ 2 . Algorithm 1 OAHPA-pcma. 1: Input: A, L, α, Horizon n; Partition {Rp }p≤P , with sub-partitions {Rp,k }k≤K . 2: Initialization: Sample K points in every sub-region {Rp,k }p≤P,k≤K 3: for t = K 2 P + 1; t ≤ n; t = t + K do ˆ 4: Compute ∀p, Qp,t . � ˆ ˆ p,t + AK log(4nP/δ)Vp,t + 2L2 dα . 5: Compute ∀p, Bp,t = Q 2α/d Tp,t K−1 (KPn ) 6: Select the region pt = argmax1≤p≤Pn Bp,t where to sample. 7: Sample K samples in region Rpt one per sub-region Rpt ,k according to µpt ,k . 8: end for 4 Performance of the allocation strategy and discussion Here is the main result of the paper; see the full version [Carpentier and Maillard, 2012] for the proof. We remind that the objective is to minimize for an algorithm A the pseudo-loss Ln (A). Theorem 1 (Main result) Let γ = � maxp Tp,n � minp Tp,n be the distortion factor of the optimal allocation stratdef d d egy, and let � > 0. Then with the choice of the number of regions Pn = n 2α+d �2+ 2α , and of the 2d d def def 8L2 α number of sub-regions K = C 4α+d �−2− α , where C = Ad1−α then the pseudo-loss of the OAHPApcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than 1 − δ, � � � � � 2α 1 + �γC � log(1/δ) Ln (A� ) + o n− 2α+d , Ln (A) ≤ n for some numerical constant C � not depending on n, where A� is the oracle of Lemma 1. n 7 Minimax-optimal partitioning and �-adaptive performance Theorem 1 provides a high probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function f , see Lemma 1) allocation d A� on a partition with a number of cells Pn chosen to be of minimax order n 2α+d for the class of 2α α-H¨ lder functions. In particular, since Ln (A� ) = O(n d+2α ) on that class, we recover the same o n minimax order as what is obtained in the batch learning setting, when using for instance wavelets, or Kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). But moreover, due to the adaptivity of A� to the function itself, this procedure is also �-adaptive to the function and not n only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, in the same way than for any classical functional estimation method, when the smoothness of the function increases. Similarly, in agreement with the classical curse of dimension, the higher the dimension of the domain, the less efficient the method. Limitations In this work, we assume that the smoothness α of the function is available to the learner, which enables her to calibrate Pn properly. Now it makes sense to combine the OAHPApcma procedure with existing methods that enable to estimate this smoothness online (under a slightly stronger assumption than H¨ lder, such as H¨ lder functions that attain their exponents, o o see Gin´ and Nickl [2010]). It is thus interesting, when no preliminary knowledge on the smoothness e of f is available, to spend some of the initial budget in order to estimate α. We have seen that the OAHPA-pcma procedure, although very simple, manages to get minimax optimal results. Now the downside of the simplicity of the OAHPA-pcma strategy is two-fold. � The first limitation is that the factor (1 + �γC � log(1/δ)) = (1 + O(�)) appearing in the bound before Ln (A� ) is not 1, but higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: If � is chosen to be small, then the error with respect to the optimal allocation is small. However, since Pn is expressed as an increasing function of �, this implies that the minimax bound on the loss for partition P increases also with �. That said, in the view of the work on active learning multi-armed bandit that we extend, we would still prefer to get the optimal constant 1. The second limitation is more problematic: since K is chosen irrespective of the region Rp , this causes the presence of the factor γ. Thus the algorithm will essentially no longer enjoy near-optimal performance guarantees when the optimal allocation strategy is highly not homogeneous. Conclusion and future work In this paper, we considered online regression with histograms in an active setting (we select in which bean to sample), and when we can choose the histogram in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on α-H¨ lder functions but with the adaptive oracle o that uses a minimax optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non trivial extension of the setting of active learning for multi-armed bandits to the case when each arm corresponds to one continuous region of a sampling space, as opposed to a singleton, which can also be seen as a problem of non parametric function approximation. This new setting offers interesting challenges: We provided a simple procedure, based on the computation of upper confidence bounds of the estimation of the local quadratic error of approximation, and provided a performance analysis that shows that OAHPA-pcma is first order �-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of α-H¨ lder functions. However, this simplicity also o has a drawback if one is interested in building exactly first order optimal procedure, and going beyond these limitations is definitely not trivial: A more optimal but much more complex algorithm would indeed need to tune a different factor Kp in each cell in an online way, i.e. define some Kp,t that evolves with time, and redefine sub-regions accordingly. Now, the analysis of the OAHPA-pcma already makes use of powerful tools such as empirical-Bernstein bounds for variance estimation (and not only for mean estimation), which make it non trivial; in order to handle possibly evolving subregions and deal with the progressive refinement of the regions, we would need even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work. Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), the European Communitys Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327 (CompLACS) and no 216886 (PASCAL2). 8 References Andr` s Antos, Varun Grover, and Csaba Szepesv` ri. Active learning in heteroscedastic noise. Thea a oretical Computer Science, 411(29-30):2712–2728, 2010. Sylvain Arlot. R´ echantillonnage et S´ lection de mod` les. PhD thesis, Universit´ Paris Sud - Paris e´ e e e XI, 2007. A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009. D. Bosq and J.P. Lecoutre. Th´ orie de l’estimation fonctionnelle, volume 21. Economica, 1987. e Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893. Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rmi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesv` ri, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, a volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011. E. Gin´ and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2): e 1122–1170, 2010. L. Gy¨ rfi, M. Kohler, A. Krzy´ ak, and Walk H. A distribution-free theory of nonparametric regreso z sion. Springer Verlag, 2002. I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981. M. Rosenblatt. Stochastic curve estimation, volume 3. Inst of Mathematical Statistic, 1991. J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (19902010). Autonomous Mental Development, IEEE Transactions on, 2(3):230–247, 2010. C.J. Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980. J.W. Tukey. Non-parametric estimation ii. statistically equivalent blocks and tolerance regions–the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947. 9

3 0.19624518 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward agorithm. 1

4 0.1636012 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

5 0.15642057 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

6 0.1435857 295 nips-2012-Risk-Aversion in Multi-armed Bandits

7 0.13794222 260 nips-2012-Online Sum-Product Computation Over Trees

8 0.13085774 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

9 0.12370677 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

10 0.11776807 324 nips-2012-Stochastic Gradient Descent with Only One Projection

11 0.11533494 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

12 0.090201735 195 nips-2012-Learning visual motion in recurrent neural networks

13 0.087859072 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

14 0.087559797 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

15 0.083149388 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

16 0.08033298 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

17 0.079178639 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

18 0.07916005 293 nips-2012-Relax and Randomize : From Value to Algorithms

19 0.078952052 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

20 0.075305395 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.214), (1, -0.067), (2, 0.093), (3, 0.045), (4, 0.061), (5, -0.025), (6, -0.015), (7, -0.004), (8, -0.2), (9, 0.032), (10, 0.134), (11, 0.078), (12, -0.143), (13, -0.142), (14, -0.228), (15, 0.024), (16, -0.075), (17, 0.117), (18, -0.091), (19, 0.095), (20, 0.059), (21, 0.02), (22, 0.039), (23, -0.022), (24, -0.018), (25, 0.079), (26, -0.01), (27, 0.05), (28, 0.085), (29, 0.024), (30, 0.015), (31, 0.087), (32, 0.078), (33, -0.035), (34, 0.04), (35, -0.026), (36, -0.093), (37, -0.029), (38, 0.022), (39, -0.078), (40, -0.091), (41, 0.075), (42, -0.063), (43, 0.039), (44, 0.11), (45, -0.146), (46, 0.118), (47, 0.026), (48, 0.022), (49, 0.15)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95373452 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

2 0.81004095 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd . Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α−H¨ lder functions. o 1 Setting and Previous work Let us consider some space X ⊂ Rd , and Y ⊂ R. We call X the input space or sampling space, Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ Rd → Y ⊂ R. We assume that we can query n times the function f , anywhere in the domain, and observe noisy samples of this function. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally: Observed process We consider an unknown Y-valued process defined on X , written ν : X → M+ (Y), where M+ (Y) refers to the set of all probability measures on Y, such that for all x ∈ X , 1 1 def the random variable Y (x) ∼ ν(x) has mean f (x) = E[Y (x)|x] ∈ R. We write for convenience the model in the following way Y (x) = f (x) + noise(x) , def where noise(x) = Y (x) − E[Y (x)|x] is the centered random variable corresponding to the noise, o with unknown variance σ 2 (x). We assume throughout this paper that f is α-H¨ lder. Partition We consider we can define a partition P of the input space X , with finitely many P regions {Rp }1≤p≤P that are assumed to be convex and not degenerated, i.e. such that the interior of each region Rp has positive Lebesgue volume vp . Moreover, with each region Rp is associated a sampling distribution in that region, written µp ∈ M+ (Rp ). Thus, when we decide to sample in 1 region Rp , a new sample X ∈ Rp is generated according to X ∼ µp . Allocation. We consider that we have a finite budget of n ∈ N samples that we can use in order to allocate samples as we wish among the regions {Rp }1≤p≤P . For illustration, let us assume that we deterministically allocate Tp,n ∈ N samples in region Rp , with the constraint that the allocation {Tp,n }1≤p≤P must some to n. In region Rp , we thus sample points {Xp,i }1≤p≤P at random 1 according to the sampling distribution µp , and then get the corresponding values {Yp,i }1≤i≤Tp,n , where Yp,i ∼ ν(Xp,i ). In the sequel, the distribution µp is assumed to be the uniform distribution dλ(x)1x∈R over region Rp , i.e. the density of µp is λ(Rp ) p where λ denotes the Lebesgue measure. Note that this is not restrictive since we are in an active, not passive setting. Piecewise constant mean-approximation. We use the collected samples in order to build a pieceˆ wise constant approximation fn of the mean f , and measure the accuracy of approximation on a region Rp with the expected quadratic norm of the approximation error, namely � � � � � ˆ (x))2 λ(dx) = Eµ ,ν (f (X) − mp,n )2 , ˆ (f (x) − fn E p λ(Rp ) Rp ˆ where mp,n is the constant value that takes fn on the region Rp . A natural choice for the estimator ˆ mp,n is to use the empirical mean that is unbiased and asymptotically optimal for this criterion. ˆ Thus we consider the following estimate (histogram) ˆ fn (x) = P � p=1 mp,n I{x ∈ Rp } where mp,n = ˆ ˆ Tp,n 1 � Tp,n Yp,i . i=1 Pseudo loss Note that, since the Tp,n are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form � � � � � � ˆ Eµp ,ν (f (X) − mp,n )2 ˆ = Eµp ,ν (f (X) − Eµp [f (X)])2 + Eµp ,ν (Eµp [f (X)] − mp,n )2 � � � � = Vµp f (X) + Vµp ,ν mp,n ˆ � � � � 1 Vµp ,ν Y (X) . = Vµp f (X) + Tp,n Now, using the following immediate decomposition � � � � � Vµp ,ν Y (X) = Vµp f (X) + σ 2 (x)µp (dx) , Rp we deduce that the maximal expected quadratic norm of the approximation error over the regions def {Rp }1≤p≤P , that depends on the choice of the considered allocation strategy A = {Tp,n }1≤p≤P is thus given by the following so-called pseudo-loss � � � � � � Tp,n + 1 1 def 2 (1) Vµp f (X) + Eµ σ (X) . Ln (A) = max 1≤ p ≤P Tp,n Tp,n p Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of �� �2 � ˆ cells tends to ∞, this local measure of performance approaches supx∈X Eν f (x) − fn (x) . At this point, let us also introduce, for convenience, the notation Qp (Tp,n ) that denotes the term inside the max, in order to emphasize the dependency on the quadratic error with the allocation. Previous work There is a huge literature on the topic of functional estimation in batch setting. Since it is a rather old and well studied question in statistics, many books have been written on this topic, such as Bosq and Lecoutre [1987], Rosenblatt [1991], Gy¨ rfi et al. [2002], where piecewise constant meano approximation are also called “partitioning estimate” or “regressogram” (first introduced by Tukey [1947]). The minimax-optimal rate of approximation on the class of α-H¨ lder functions is known o 2α to be in O(n− 2α+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy¨ rfi et al. [2002]). o In such setting, a dataset {(Xi , Yi )}i≤n is given to the learner, and a typical question is thus to try to find the best possible histogram in order to minimize a approximation error. Thus the dataset is fixed and we typically resort to techniques such as model selection where each model corresponds to one histogram (see Arlot [2007] for an extensive study of such). However, we here ask a very different question, that is how to optimally sample in an online setting in order to minimize the approximation error of some histogram. Thus we choose the histogram 2 before we see any sample, then it is fixed and we need to decide which cell to sample from at each time step. Motivation for this setting comes naturally from some recent works in the setting of active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011]. In these works, the objective is to estimate with equal precision the mean of a finite number of distributions (arms), which would correspond to the special case when X = {1, . . . , P } is a finite set in our setting. Intuitively, we reduce the problem to such bandit problem with finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when the arms do not correspond to a singleton, but rather to a continuous region. We show that the answer is positive, yet non trivial. This is non trivial due to the variance estimation in each region: points x in some region may have different means f(x), so that standard estimators for the variance are biased, contrary to the point-wise case and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and the noise level σ 2 in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward but optimizing the goal to understand the unknown environment. Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α−H¨ lder o functions. Section 3 presents our assumptions, that are basically to have a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. Finally, the performance of this procedure is provided and discussed in Section 4. 2 The pseudo-loss: study and optimal strategies 2.1 More intuition on each term in the pseudo-loss It is natural to look at what happens to each of the two terms that appear in equation 1 when one makes Rp shrink towards a point. More precisely, let xp be the mean of X ∼ µp and let us look at the limit of Vµp (f (X)) when vp goes to 0. Assuming that f is differentiable, we get �2 � �� lim Vµp (f (X)) = lim Eµp f (X) − f (xp ) − E[f (X) − f (xp )] vp →0 vp →0 = = = lim Eµp �� �X − xp , ∇f (xp )� − E[�X − xp , ∇f (xp )�] vp →0 � � lim Eµp �X − xp , ∇f (xp )�2 vp →0 � � lim ∇f (xp )T Eµp (X − xp )(X − xp )T ∇f (xp ) . �2 � vp →0 Therefore, if we introduce Σp to be the covariance matrix of the random variable X ∼ µp , then we simply have lim Vµp (f (X)) = lim ||∇f (xp )||2 p . Σ vp →0 vp →0 Example with hyper-cubic regions An important example is when Rp is a hypercube with side 1/d length vp and µp is the uniform distribution over the region Rp . In that case (see Lemma 1), we dx have µp (dx) = , and 2/d vp vp . ||∇f (xp )||2 p = ||∇f (xp )||2 Σ 12 More generally, when f is α−differentiable, i.e. that ∀a ∈ X , ∃∇α f (a, ·) ∈ Sd (0, 1)R such that ∀x ∈ Sd (0, 1), limh→0 f (a+hx)−f (a) = ∇α f (a, x), then it is not too difficult to show that for such hα hyper-cubic regions, we have � � � 2α � Vµp f (X) = O vpd sup |∇α f (xp , u)|2 . S(0,1) � � On the other hand, by direct computation, the second term is such that limvp →0 Eµp σ 2 (X) = � � � � σ 2 (xp ). Thus, while Vµp f (X) vanishes, Eµp σ 2 (X) stays bounded away from 0 (unless ν is deterministic). 3 2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation. We now assume that we are allowed to choose the partition P depending on n, thus P = Pn , amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume, and come from a regular grid of the space. Thus the only free parameter is the number of cells Pn of the partition. An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield to a closed-form solution in general. However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of minimax yet non adaptive oracle � algorithm given in closed-form solution): � � −β � � � −α� � � Lemma 1 In the case when Vµp f (X) = Ω Pn and Rp σ 2 (x)µp (dx) = Ω Pn , then an � optimal allocation and partitioning strategy An satisfies that� � � � Vµp f (X) + Eµp σ 2 (X) � � , L − Vµp f (X) � as soon as there exists, for such range of Pn , a constant L such that � � � � � Pn � Vµp f (X) + Eµp σ 2 (X) � � = n. L − Vµp f (X) p=1 1 � Pn = Ω(n max(1+α� −β� ,1) ) and def � Tp,n = The pseudo-loss of such an algorithm A� , optimal amongst the allocations strategies that use the n � partition Pn in Pn regions, is then given by � � � � def max(1 − β , 1 − α ) − 1. where γ = Ln (A� ) = Ω nγ n max(1 + α� − β � , 1) The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough. This Lemma applies to the important class W 1,2 (R) of functions that admit a weak derivative that o belongs to L2 (R). Indeed these functions are H¨ lder with coefficient α = 1/2, i.e. we have o W 1,2 (R) ⊂ C 0,1/2 (R). The standard Brownian motion is an example of function that is 1/2-H¨ lder. More generally, for k = d + α with α = 1/2 when d is odd and α = 1 when d is even, we have the 2 inclusion W k,2 (Rd ) ⊂ C 0,α (Rd ) , where W k,2 (Rd ) is the set of functions that admit a k th weak derivative belonging to L2 (Rd ). Thus the previous Lemma applies to sufficiently smooth functions with smoothness linearly increasing with the dimension d of the input space X . Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then to provide the best oracle way to allocate the budget. We can deduce the following immediate corollary on the class of α−H¨ lder functions observed in a o non-negligible noise of bounded variance (i.e. in the setting β � = 0 and α� = 2α ). d Corollary 1 Consider that f is α−H¨ lder and the noise is of bounded variance. Then a minimaxo d � d+2α ) and an optimal allocation achieves the rate L (A� ) = optimal partition satisfies Pn = Ω(n n n � −2α � Ω n d+2α . Moreover, the strategy of Lemma 1 is optimal amongst the allocations strategies that � use the partition Pn in Pn regions. � −2α � The rate Ω n d+2α is minimax-optimal on the class of α−H¨ lder functions (see Gy¨ rfi et al. [2002], o o Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial numd � � d+2α ). After having built the partition, if the quantities ber �� � 2 �� � � of�regions Pn that is of order Pn = Ω(n Vµp f p≤P and Eµp σ p≤P are known to the learner, it is optimal, in the aim of minimizing � the pseudo-loss, to allocate to each region the number of samples Tp,n provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate 4 the samples properly in the regions, without having any access to those quantities. It is then �� � � necessary to balance between exploration, i.e. allocating the samples in order to estimate Vµp f p≤P � � �� and Eµp σ 2 p≤P , and exploitation, i.e. use the estimates to target the optimal allocation. 3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation In this section, we now turn to the design of algorithms that are fully online, with the goal to be competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]d is divided in Pn hyper-cubic regions of same measure (the Lebesgue measure on 1 [0, 1]d ) vp = v = Pn . The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as � � � � � � 2 λ(dx) ˆ (x))2 λ(dx) = max E , max E (f (x) − fn (f (x) − mp,n ) ˆ 1≤p≤Pn 1≤p≤Pn λ(Rp ) λ(Rp ) Rp Rp ˆ where fn is the histogram estimate of the function f on the partition P and mp,n is the empirical ˆ mean defined on region Rp with the samples (Xi , Yi ) such that Xi ∈ Rp . To do so, an algorithm is only allowed to specify at each time step t, the next point Xt where to sample, based on all the past samples {(Xs , Ys )}s < ∞ satisfies that λ2 σ 2 (x) , ∀λ ∈ R+ log E exp[λ noise(x)] ≤ 2 and we further assume that it satisfies the following slightly stronger second property (that is for instance exactly verified for a Gaussian variable, looking at the moment generating function): � � � � 1 λ2 σ 2 (x) ∀λ, γ ∈ R+ log E exp λnoise(x) + γnoise(x)2 ≤ − log 1 − 2γσ 2 (x) . 2(1 − 2γσ 2 (x)) 2 5 The function f is assumed to be (L, α)-H¨ lder, meaning that it satifies o � ∀x, x ∈ X f (x) − f (x� ) ≤ L||x − x� ||α . Similarly, the function σ 2 is assumed to be (M, β)-H¨ lder i.e. it satisfies o � 2 2 � ∀x, x ∈ X σ (x) − σ (x ) ≤ M ||x − x� ||β . We assume that Y is a convex and compact subset of R, thus w.l.g. that it is [0, 1], and that it is known that ||σ 2 ||∞ , which is thus finite, is bounded by the constant 1. 3.2 Empirical estimation of the quadratic approximation error on each cell We define the sampling distribution µp in the region Rp for each p ∈ {1, . . . , Pn } as a quasi-uniform ˜ sampling scheme using the uniform distribution over the sub-regions. More precisely at time t ≤ n, if we decide to sample in the region Rp according to µp , we sample uniformly in each sub-region ˜ one sample, resulting in a new batch of samples {(Xt,k , Yt,k )}1≤k≤K , where Xt,k ∼ µp,k . Note that due to this sampling process, the number of points Tp,t sampled in sub-region Rp at time t is always Tp,t a multiple of K and that moreover for all k, k � ∈ {1, . . . , K} we have that Tp,k,t = Tp,k� ,t = K . Now this specific sampling is used in order to be able to estimate the variances Vµp f and Eµp σ 2 , � so that the best proportions Tp,n can be computed as accurately as possible. Indeed, as explained in � � � � Lemma 1, we have that Vµp f (X) + Eµp σ 2 (X) � def � � . Tp,n = L − Vµp f (X) ˆ Variance estimation We now introduce two estimators. The first estimator is written Vp,t and is def ˆ built in the following way. First,let us introduce the empirical estimate fp,k,t of the mean fp,k = � � Eµp,k f (X) of f in sub-region Rp,k . Similarly, to avoid some cumbersome notations, we introduce � � � � � � def def def 2 fp = Eµp f (X) and vp,k = Vµp,k f (X) for the function f , and then σp,k = Eµp,k σ 2 (X) for the variance of the noise σ 2 . We now define the empirical variance estimator to be K 1 � ˆ ˆ (fp,k,t − mp,t )2 , ˆ Vp,t = K −1 k=1 that is a biased estimator. Indeed, for a deterministic Tp,t , it is not difficult to show that we have � K K � � � � � � �� � � � � 2 1 �� 1 � ˆ E Vp,t + Eµp,k f − Eµp f = Vµp,k f + Eµp,k σ 2 . K −1 Tp,t k=1 k=1 � � The leading term in this decomposition, that is given by the first sum, is closed to Vµp f since, by using the assumption that f is (L, α)−H¨ lder, we have the following inequality o � � K � �� �� � �1 � � � � 2 2L2 dα � Eµp,k f − Eµp f − Vµp f (X) � ≤ , � �K (KPn )2α/d k=1 where we also used that the diameter of a sub-region Rp,k is given by diam(Rp,k ) = d1/2 . (KPn )1/d ˆ Then, the second term also contributes to the bias, essentially due to the fact that V[fp,k,t ] = � � � � 2 def def 1 1 2 2 2 Tp,k,t (vp,k + σp,k ) and not Tp,t (vk + σk ) (with vp = Vµp f (X) and σp = Eµp σ (X) ). ˆ p,k,t In order to correct this term, we now introduce the second estimator σ 2 that estimates the variance � � � � � � of the outputs in a region Rp,k , i.e. Vµp,k ,ν Y (X) = Vµp,k f (X) + Eµp,k σ 2 . It is defined as �2 t t �� 1 1 � def ˆ p,k,t = Yi − Yj I{Xj ∈ Rp,k } I{Xi ∈ Rp,k } . σ2 Tp,k,t − 1 i=1 Tp,k,t j=1 Now, we combine the two previous estimators to form the following estimator K 1 �� 1 1 � 2 ˆ ˆ ˆ σ − . Qp,t = Vp,t − K Tp,k,t Tp,t p,k,t k=1 ˆ The following proposition provides a high-probability bound on the difference between Qp,t and the quantity we want to estimate. We report the detailed proof in [Carpentier and Maillard, 2012]. 6 ˆ Proposition 1 By the assumption that f is (L, α)-H¨ lder, the bias of the estimator Qp,t , and for o deterministic Tp,t , is given by � K � � � � � � � � � 2 1 � 2L2 dα ˆ − Vµp f (X) ≤ . Eµp,k f − Eµp f E Qp,t − Qp (Tp,t ) = K (KPn )2α/d k=1 Moreover, it satisfies that for all δ ∈ [0, 1], there exists an event of probability higher than 1 − δ such that on this event, we have � � � � � � K K � � � � 8 log(4/δ) � σ 2 �1 � � � � ˆ p,k,t 1 � 2 ˆ ˆ � Qp,t − E Qp,t � ≤ � √ +o σ p,k . � � (K − 1)2 T2 T K K k=1 p,k,t p,k,t k=1 We also state the following Lemma that we are going to use in the analysis, and that takes into account randomness of the stopping times Tp,k,t . Lemma 2 Let {Xp,k,u }p≤P, k≤K, u≤n be samples potentially sampled in region Rp,k . We introduce qp,u to be the�equivalent of Qp (Tp,t ) with explicitly fixed value of Tp,t = u. Let also qp,u be the ˆ � ˆ p,t but computed with the first u samples in estimate of E qp,u , that is to say the equivalent of Q each region Rp,k (i.e. Tp,t = u). Let us define the event � � � � � � � AK log(4nP/δ)V � � ˆp,t 2L2 dα � � ξn,P,K (δ) = + ω : � qp,u (ω) − E qp,u � ≤ ˆ , u K −1 (KPn )2α/d p≤P u≤n �K 1 ˆ ˆ ˆ p,k,t and where A ≤ 4 is a numerical constant. Then it where Vp,t = Vp (Tp,t ) = K−1 k=1 σ 2 holds that � � P ξn,P,K (δ) ≥ 1 − δ . Note that, with the notations of this Lemma, Proposition 1 above is thus about qp,u . ˆ 3.3 The Online allocation and homogeneous partitioning algorithm for piecewise constant mean-approximation (OAHPA-pcma) We are now ready to state the algorithm that we propose for minimizing the quadratic error of approximation of f . The algorithm is described in Figure 1. Although it looks similar, this algorithm is ˆ quite different from a normal UCB algorithm since Qp,t decreases in expectation with Tp,t . Indeed, � � � � � �� �K � 1 its expectation is close to Vµp f + KTp,t k=1 Vµp,k f + Eµp,k σ 2 . Algorithm 1 OAHPA-pcma. 1: Input: A, L, α, Horizon n; Partition {Rp }p≤P , with sub-partitions {Rp,k }k≤K . 2: Initialization: Sample K points in every sub-region {Rp,k }p≤P,k≤K 3: for t = K 2 P + 1; t ≤ n; t = t + K do ˆ 4: Compute ∀p, Qp,t . � ˆ ˆ p,t + AK log(4nP/δ)Vp,t + 2L2 dα . 5: Compute ∀p, Bp,t = Q 2α/d Tp,t K−1 (KPn ) 6: Select the region pt = argmax1≤p≤Pn Bp,t where to sample. 7: Sample K samples in region Rpt one per sub-region Rpt ,k according to µpt ,k . 8: end for 4 Performance of the allocation strategy and discussion Here is the main result of the paper; see the full version [Carpentier and Maillard, 2012] for the proof. We remind that the objective is to minimize for an algorithm A the pseudo-loss Ln (A). Theorem 1 (Main result) Let γ = � maxp Tp,n � minp Tp,n be the distortion factor of the optimal allocation stratdef d d egy, and let � > 0. Then with the choice of the number of regions Pn = n 2α+d �2+ 2α , and of the 2d d def def 8L2 α number of sub-regions K = C 4α+d �−2− α , where C = Ad1−α then the pseudo-loss of the OAHPApcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than 1 − δ, � � � � � 2α 1 + �γC � log(1/δ) Ln (A� ) + o n− 2α+d , Ln (A) ≤ n for some numerical constant C � not depending on n, where A� is the oracle of Lemma 1. n 7 Minimax-optimal partitioning and �-adaptive performance Theorem 1 provides a high probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function f , see Lemma 1) allocation d A� on a partition with a number of cells Pn chosen to be of minimax order n 2α+d for the class of 2α α-H¨ lder functions. In particular, since Ln (A� ) = O(n d+2α ) on that class, we recover the same o n minimax order as what is obtained in the batch learning setting, when using for instance wavelets, or Kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). But moreover, due to the adaptivity of A� to the function itself, this procedure is also �-adaptive to the function and not n only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, in the same way than for any classical functional estimation method, when the smoothness of the function increases. Similarly, in agreement with the classical curse of dimension, the higher the dimension of the domain, the less efficient the method. Limitations In this work, we assume that the smoothness α of the function is available to the learner, which enables her to calibrate Pn properly. Now it makes sense to combine the OAHPApcma procedure with existing methods that enable to estimate this smoothness online (under a slightly stronger assumption than H¨ lder, such as H¨ lder functions that attain their exponents, o o see Gin´ and Nickl [2010]). It is thus interesting, when no preliminary knowledge on the smoothness e of f is available, to spend some of the initial budget in order to estimate α. We have seen that the OAHPA-pcma procedure, although very simple, manages to get minimax optimal results. Now the downside of the simplicity of the OAHPA-pcma strategy is two-fold. � The first limitation is that the factor (1 + �γC � log(1/δ)) = (1 + O(�)) appearing in the bound before Ln (A� ) is not 1, but higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: If � is chosen to be small, then the error with respect to the optimal allocation is small. However, since Pn is expressed as an increasing function of �, this implies that the minimax bound on the loss for partition P increases also with �. That said, in the view of the work on active learning multi-armed bandit that we extend, we would still prefer to get the optimal constant 1. The second limitation is more problematic: since K is chosen irrespective of the region Rp , this causes the presence of the factor γ. Thus the algorithm will essentially no longer enjoy near-optimal performance guarantees when the optimal allocation strategy is highly not homogeneous. Conclusion and future work In this paper, we considered online regression with histograms in an active setting (we select in which bean to sample), and when we can choose the histogram in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on α-H¨ lder functions but with the adaptive oracle o that uses a minimax optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non trivial extension of the setting of active learning for multi-armed bandits to the case when each arm corresponds to one continuous region of a sampling space, as opposed to a singleton, which can also be seen as a problem of non parametric function approximation. This new setting offers interesting challenges: We provided a simple procedure, based on the computation of upper confidence bounds of the estimation of the local quadratic error of approximation, and provided a performance analysis that shows that OAHPA-pcma is first order �-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of α-H¨ lder functions. However, this simplicity also o has a drawback if one is interested in building exactly first order optimal procedure, and going beyond these limitations is definitely not trivial: A more optimal but much more complex algorithm would indeed need to tune a different factor Kp in each cell in an online way, i.e. define some Kp,t that evolves with time, and redefine sub-regions accordingly. Now, the analysis of the OAHPA-pcma already makes use of powerful tools such as empirical-Bernstein bounds for variance estimation (and not only for mean estimation), which make it non trivial; in order to handle possibly evolving subregions and deal with the progressive refinement of the regions, we would need even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work. Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), the European Communitys Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327 (CompLACS) and no 216886 (PASCAL2). 8 References Andr` s Antos, Varun Grover, and Csaba Szepesv` ri. Active learning in heteroscedastic noise. Thea a oretical Computer Science, 411(29-30):2712–2728, 2010. Sylvain Arlot. R´ echantillonnage et S´ lection de mod` les. PhD thesis, Universit´ Paris Sud - Paris e´ e e e XI, 2007. A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009. D. Bosq and J.P. Lecoutre. Th´ orie de l’estimation fonctionnelle, volume 21. Economica, 1987. e Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893. Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rmi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesv` ri, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, a volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011. E. Gin´ and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2): e 1122–1170, 2010. L. Gy¨ rfi, M. Kohler, A. Krzy´ ak, and Walk H. A distribution-free theory of nonparametric regreso z sion. Springer Verlag, 2002. I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981. M. Rosenblatt. Stochastic curve estimation, volume 3. Inst of Mathematical Statistic, 1991. J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (19902010). Autonomous Mental Development, IEEE Transactions on, 2(3):230–247, 2010. C.J. Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980. J.W. Tukey. Non-parametric estimation ii. statistically equivalent blocks and tolerance regions–the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947. 9

3 0.55503905 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are H¨ lder continuity of rewards and transition o probabilities. 1

4 0.53592253 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

5 0.53054035 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

6 0.52687347 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

7 0.52560228 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

8 0.50580722 295 nips-2012-Risk-Aversion in Multi-armed Bandits

9 0.47515333 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

10 0.45014533 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

11 0.43023586 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

12 0.4213731 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

13 0.40589368 267 nips-2012-Perceptron Learning of SAT

14 0.38862205 260 nips-2012-Online Sum-Product Computation Over Trees

15 0.38179976 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

16 0.37865999 217 nips-2012-Mixability in Statistical Learning

17 0.37394163 179 nips-2012-Learning Manifolds with K-Means and K-Flats

18 0.37013537 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

19 0.35657641 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

20 0.3554728 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.065), (15, 0.164), (17, 0.011), (21, 0.046), (28, 0.043), (36, 0.025), (38, 0.124), (39, 0.016), (42, 0.023), (53, 0.011), (54, 0.025), (55, 0.018), (74, 0.094), (76, 0.121), (80, 0.082), (92, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89994019 31 nips-2012-Action-Model Based Multi-agent Plan Recognition

Author: Hankz H. Zhuo, Qiang Yang, Subbarao Kambhampati

Abstract: Multi-Agent Plan Recognition (MAPR) aims to recognize dynamic team structures and team behaviors from the observed team traces (activity sequences) of a set of intelligent agents. Previous MAPR approaches required a library of team activity sequences (team plans) be given as input. However, collecting a library of team plans to ensure adequate coverage is often difficult and costly. In this paper, we relax this constraint, so that team plans are not required to be provided beforehand. We assume instead that a set of action models are available. Such models are often already created to describe domain physics; i.e., the preconditions and effects of effects actions. We propose a novel approach for recognizing multi-agent team plans based on such action models rather than libraries of team plans. We encode the resulting MAPR problem as a satisfiability problem and solve the problem using a state-of-the-art weighted MAX-SAT solver. Our approach also allows for incompleteness in the observed plan traces. Our empirical studies demonstrate that our algorithm is both effective and efficient in comparison to state-of-the-art MAPR methods based on plan libraries. 1

2 0.86388552 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models. 1

same-paper 3 0.84460002 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

4 0.78052968 260 nips-2012-Online Sum-Product Computation Over Trees

Author: Mark Herbster, Stephen Pasteris, Fabio Vitale

Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-productlike algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting. 1

5 0.77820927 308 nips-2012-Semi-Supervised Domain Adaptation with Non-Parametric Copulas

Author: David Lopez-paz, Jose M. Hernández-lobato, Bernhard Schölkopf

Abstract: A new framework based on the theory of copulas is proposed to address semisupervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model accross different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques. 1

6 0.77700549 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

7 0.77498877 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

8 0.77318829 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.77188545 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

10 0.76621783 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

11 0.76613045 168 nips-2012-Kernel Latent SVM for Visual Recognition

12 0.76447946 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

13 0.76420069 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

14 0.76380885 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

15 0.76289189 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

16 0.76245838 69 nips-2012-Clustering Sparse Graphs

17 0.76163369 210 nips-2012-Memorability of Image Regions

18 0.75975651 8 nips-2012-A Generative Model for Parts-based Object Segmentation

19 0.75891131 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

20 0.75850958 193 nips-2012-Learning to Align from Scratch