nips nips2012 nips2012-343 knowledge-graph by maker-knowledge-mining

343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability


Source: pdf

Author: Jayadev Acharya, Hirakendu Das, Alon Orlitsky

Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. [sent-5, score-0.264]

2 In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. [sent-6, score-0.675]

3 In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. [sent-7, score-0.145]

4 In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. [sent-8, score-0.281]

5 A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. [sent-13, score-0.193]

6 Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power. [sent-14, score-0.652]

7 The two main objects of this paper are the minimax KL divergence—a fundamental measure of, among other things, how difficult distributions are to compress, predict, and classify—and profiles—a relatively new approach to compression, classification, and property testing over large alphabets. [sent-22, score-0.197]

8 Minimax KL divergence: As is well known in information theory, the expected number of bits required to compress data X generated according to a known distribution P is the distribution's entropy, H(P) = E_P log 1/P(X), and is achieved by encoding X using roughly log 1/P(X) bits. [sent-28, score-0.362]

9 This uncertainty typically raises the number of bits above the entropy and is studied in Universal compression [9, 13]. [sent-33, score-0.26]

10 Hence the increase in the expected number of bits used to encode the output of P is EP log 1/Q(X) − H(P ) = D(P ||Q), the KL divergence between P and Q. [sent-35, score-0.253]

11 The resulting quantity, called the (expected) redundancy of P, e.g. [sent-37, score-0.427]

12 [9, 13], is therefore the KL minimax, R(P) = min_Q max_{P∈P} D(P||Q). [sent-40, score-0.32]
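
To make the minimax definition concrete, here is a minimal Python sketch (not from the paper) that approximates R(P) for a tiny finite collection by brute-force search over candidate distributions Q on a grid; the function names and the toy collection are illustrative assumptions, and the grid search is only feasible for very small alphabets.

```python
import itertools
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in bits between finite distributions p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def redundancy_bruteforce(collection, steps=200):
    """Approximate R(P) = min_Q max_{P in collection} D(P||Q) by searching Q
    over an interior grid on the probability simplex (tiny alphabets only)."""
    k = len(collection[0])
    best = float("inf")
    for counts in itertools.product(range(1, steps), repeat=k - 1):
        rest = steps - sum(counts)
        if rest < 1:
            continue
        q = np.array(list(counts) + [rest]) / steps
        best = min(best, max(kl(p, q) for p in collection))
    return best

# Toy collection: two distributions over a binary alphabet.
P = [np.array([0.2, 0.8]), np.array([0.8, 0.2])]
print(redundancy_bruteforce(P))   # approximately 0.28 bits for this symmetric pair
```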

13 One of the most popular measures for the performance of an estimator Q is the per-symbol log loss (1/n) Σ_{i=1}^{n} log 1/Q(X_i | X^{i−1}). [sent-48, score-0.128]

14 As in compression, for underlying distribution P ∈ P, the expected log loss is EP log 1/Q(X), and the log-loss regret is EP log 1/Q(X) − H(P ) = D(P ||Q). [sent-49, score-0.288]
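
As an illustration of the per-symbol log loss (not taken from the paper), the following sketch evaluates the classical add-alpha (Laplace) sequential predictor on a short string; the function name, the alphabet, and the choice of predictor are assumptions made here for illustration only.

```python
import math
from collections import Counter

def per_symbol_log_loss(seq, alphabet, alpha=1.0):
    """Average log loss (1/n) * sum_i log2( 1 / Q(x_i | x^{i-1}) ) of the
    add-alpha (Laplace) sequential predictor on the sequence."""
    counts = Counter()
    loss = 0.0
    for seen, symbol in enumerate(seq):
        q = (counts[symbol] + alpha) / (seen + alpha * len(alphabet))
        loss += -math.log2(q)
        counts[symbol] += 1
    return loss / len(seq)

print(per_symbol_log_loss("abracadabra", alphabet="abcdr"))
```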

15 In statistics, redundancy arises in multiple hypothesis testing. [sent-51, score-0.427]

16 For example, it bounds the largest number of topics distinguishable based on text of a given length. [sent-53, score-0.136]

17 Let P be a collection of distributions over a support set X . [sent-54, score-0.143]

18 Let M(P, ε) be the largest number of ε-distinguishable distributions in P, and let h(ε) be the binary entropy function. [sent-56, score-0.162]

19 Redundancy has many other connections to data compression [27, 28], the minimum-description-length principle [3, 16, 17], sequential prediction [21], and gambling [20]. [sent-58, score-0.155]

20 For i.i.d. distributions over alphabets of size k, a string of works [7, 10, 11, 28, 33, 35, 36] determined the redundancy up to a diminishing additive term, R(I_k^n) = ((k − 1)/2) log n + C_k + o(1), (2) where the constant C_k was determined exactly in terms of k. [sent-73, score-0.763]

21 For compression this shows that the extra number of bits per symbol required to encode an i.i.d. [sent-74, score-0.306]

22 sequence when the underlying distribution is unknown diminishes to zero as (k − 1) log n/(2n). [sent-77, score-0.22]

23 In hypothesis testing this shows that there are roughly n^{(k−1)/2} distinguishable i.i.d. distributions. [sent-79, score-0.143]

24 Unfortunately, while R(I_k^n) increases logarithmically in the sequence length n, it grows linearly in the alphabet size k. [sent-83, score-0.255]

25 Patterns: Partly motivated by redundancy's fast increase with the alphabet size, a new approach was recently proposed to address compression, estimation, classification, and property testing over large alphabets. [sent-86, score-0.186]

26 The pattern [25] of a sequence represents the relative order in which its symbols appear. [sent-87, score-0.149]

27 A natural method to compress a sequence over a large alphabet is to compress its pattern as well as the dictionary that maps the order to the original symbols. [sent-89, score-0.354]

28 For i.i.d. distributions, over any alphabet, even infinitely large, as the sequence length increases, essentially all the entropy lies in the pattern, and practically none is in the dictionary. [sent-94, score-0.155]

29 Hence [25] focused on the redundancy of compressing patterns. [sent-95, score-0.427]

30 Even though i.i.d. sequences over large alphabets have arbitrarily high per-symbol redundancy, and even though, as above, patterns contain essentially all the information of long sequences, the per-symbol redundancy of patterns diminishes to zero at a uniform rate independent of the alphabet size. [sent-102, score-0.867]

31 For example, after observing the sequence dad, with pattern 121, we estimate the probabilities of 1, 2, and 3. [sent-104, score-0.127]

32 It is easy to see that the probability of a pattern is determined by the fingerprint [4] or profile [25] of the pattern, the multiset of the number of appearances of the symbols in the pattern. [sent-114, score-0.319]

33 For example, the profile of the pattern 121 is {1, 2} and all patterns with this profile, 112, 121, 122 will have the same probability under any distribution P . [sent-115, score-0.133]

34 Hence pattern redundancy is the same as profile redundancy [25]. [sent-118, score-0.903]

35 The multiplicity µ(a) of a symbol a in a sequence is the number of times it appears. [sent-120, score-0.164]

36 The profile ϕ(x) of a sequence x is the multiset of multiplicities of all symbols appearing in it [24, 25]. [sent-121, score-0.342]

37 The profile of the sequence is the multiset of multiplicities. [sent-122, score-0.218]
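
The two definitions above translate directly into code; the following minimal sketch (the function names are ours, not the paper's) computes the pattern and the profile of a string, matching the dad/121 example above.

```python
from collections import Counter

def pattern(seq):
    """Relative order of first appearances: 'dad' -> [1, 2, 1]."""
    first_seen = {}
    return [first_seen.setdefault(s, len(first_seen) + 1) for s in seq]

def profile(seq):
    """Multiset of multiplicities, returned as a sorted tuple: 'dad' -> (1, 2)."""
    return tuple(sorted(Counter(seq).values()))

print(pattern("abracadabra"))   # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
print(profile("abracadabra"))   # (1, 1, 2, 2, 5)
```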

38 The prevalence ϕµ of a multiplicity µ is the number of elements with multiplicity µ. [sent-124, score-0.162]

39 For example, for sequences of length one there is a single element appearing once, hence Φ1 = {{1}}; for length two, either one element appears twice or each of two elements appears once, hence Φ2 = {{2}, {1, 1}}; similarly Φ3 = {{3}, {2, 1}, {1, 1, 1}}, etc. [sent-126, score-0.243]

40 The probability of a profile ϕ ∈ Φ_n is the sum of the probabilities of all sequences with this profile, P(ϕ) = Σ_{x: ϕ(x)=ϕ} P(x). [sent-135, score-0.316]
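
For very small n, both Φ_n and P(ϕ) can be computed exhaustively. The sketch below (an illustration, not the paper's algorithm; helper names and the example source are our assumptions) enumerates Φ3 as the integer partitions of 3 and evaluates one profile probability by summing over all length-n sequences.

```python
import itertools
from collections import Counter

def partitions(n, max_part=None):
    """All integer partitions of n; each partition is one profile in Phi_n."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def profile_probability(phi, p, n):
    """P(phi) = sum of P(x) over all length-n sequences x with profile phi,
    for an i.i.d. source with symbol probabilities p (brute force, tiny n)."""
    target = tuple(sorted(phi))
    total = 0.0
    for x in itertools.product(range(len(p)), repeat=n):
        if tuple(sorted(Counter(x).values())) == target:
            prob = 1.0
            for s in x:
                prob *= p[s]
            total += prob
    return total

print(list(partitions(3)))                         # [(3,), (2, 1), (1, 1, 1)]
print(profile_probability((2, 1), [0.5, 0.3, 0.2], 3))
```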

41 Let I_Φ^n be the collection of all distributions on Φ_n induced by discrete i.i.d. distributions. [sent-141, score-0.176]

42 Profile redundancy arises in at least two other machine-learning applications, closeness-testing and classification. [sent-151, score-0.427]

43 In closeness testing [4], we try to determine whether two sequences are generated by the same or by different distributions. [sent-152, score-0.128]

44 Joint profiles and quantities related to profile redundancy are used to construct competitive closeness tests and classifiers that perform almost as well as the best possible [1, 2]. [sent-154, score-0.474]

45 All these properties depend only on the multiset of probability values in the distribution. [sent-160, score-0.169]

46 Instead of the expected redundancy R(I_Φ^n), which reflects the increase in the expected number of bits, [25] bounded the more stringent but closely related worst-case redundancy, R̂(I_Φ^n), reflecting the increase in the worst-case number of bits, namely over all sequences. [sent-169, score-0.479]

47 Unlike for i.i.d. distributions, where the redundancy (2) was determined up to a diminishing additive constant, here not even the power was known. [sent-175, score-0.491]

48 Since expected redundancy is at most the worst-case redundancy, the upper bound applies also for expected redundancy. [sent-177, score-0.551]

49 New results: In Theorem 15 we use error-correcting codes to exhibit a larger class of distinguishable distributions in I_Φ^n than was known before, thereby removing the log n factor from the lower bound in (3). [sent-182, score-0.351]

50 In Theorem 11 we demonstrate a small number of distributions such that every distribution in I_Φ^n is within a small KL divergence from one of them, thereby reducing the upper bound to have the same power as the lower bound. [sent-183, score-0.223]

51 They show that when a pattern is compressed or a sequence is estimated (with all unseen elements combined into "new"), the per-symbol redundancy and log-loss decrease to 0 uniformly over all distributions faster than log2 n / n^{2/3}, a rate that is optimal up to a log2 n factor. [sent-187, score-0.7]

52 They also show that for length-n profiles, the redundancy R(I_Φ^n) is essentially the logarithm log M(I_Φ^n, ε) of the number of ε-distinguishable distributions. [sent-188, score-0.602]

53 Outline: In the next section we describe properties of Poisson sampling and redundancy that will be used later in the paper. [sent-190, score-0.427]

54 A standard approach [22] to overcome the dependence is to sample the distribution a random poi(n) number of times, where poi(n) denotes the Poisson distribution with parameter n, resulting in sequences of random length close to n. [sent-199, score-0.179]
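
A quick way to see why Poisson sampling is convenient (a sketch of the standard device, not code from the paper; the names and the toy distribution are ours): drawing a sequence of length poi(n) i.i.d. from p gives each symbol an independent poi(n·p_i) multiplicity, so the two constructions below produce profiles with the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_sampled_profile(p, n):
    """Draw N ~ Poisson(n), then an i.i.d. sequence of length N from p,
    and return the profile of that sequence."""
    length = rng.poisson(n)
    counts = rng.multinomial(length, p)
    return tuple(sorted(int(c) for c in counts if c > 0))

def independent_poisson_profile(p, n):
    """Equivalent construction: each symbol's multiplicity is an independent
    Poisson(n * p_i) variable."""
    counts = rng.poisson(np.asarray(p) * n)
    return tuple(sorted(int(c) for c in counts if c > 0))

print(poisson_sampled_profile([0.5, 0.3, 0.2], 20))
print(independent_poisson_profile([0.5, 0.3, 0.2], 20))
```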

55 We now express profile probabilities and redundancy under Poisson sampling. [sent-208, score-0.456]

56 As we saw, the probability of a profile is determined by just the multiset of probability values, and the symbol labels are irrelevant. [sent-209, score-0.247]

57 For convenience, we assume that the distribution is over the positive integers, and we replace the distribution parameters {p_i} by the Poisson parameters {n·p_i}. [sent-210, score-0.316]

58 The profile generated by this distribution is a multiset ϕ = {µ_1, µ_2, . . .}. [sent-218, score-0.208]

59 The redundancy R(I_Φ^{poi(n)}) and the ε-distinguishability M(I_Φ^{poi(n)}, ε) are defined as before. [sent-236, score-0.427]

60 For any fixed ε > 0, (1 − o(1)) · R(I_Φ^{n − √(n log n)}) ≤ R(I_Φ^{poi(n)}) and M(I_Φ^{poi(n)}, ε) ≤ M(I_Φ^{n + √(n log n)}, 2ε). [sent-239, score-0.128]

61 Combining this with the fact that the probability that poi(n) is less than n − √(n log n) or greater than n + √(n log n) goes to 0 yields the bounds. [sent-242, score-0.128]

62 Similarly, for a collection P of distributions over A, let f (P) = {f (P ) : P ∈ P}. [sent-249, score-0.143]

63 For a collection P of distributions over A × B, let PA and PB be the collection of marginal distributions over A and B, respectively. [sent-252, score-0.286]

64 However, when P consists of product distributions, namely P (a, b) = PA (a) · PB (b), the redundancy of the product is at most the sum of the marginal redundancies. [sent-254, score-0.427]

65 If P is a collection of product distributions over A × B, then R(P) ≤ R(P_A) + R(P_B). [sent-257, score-0.143]

66 For a prefix-free code C : A → {0, 1}∗ , let EP [|C|] be the expected length of C under distribution P . [sent-258, score-0.144]

67 Redundancy is the extra number of bits above the entropy needed to encode the output of any distribution in P. [sent-259, score-0.248]

68 Upper bound: A distribution Λ ∈ I_Φ^{poi(n)} is a multiset of λ's adding to n. [sent-267, score-0.249]

69 For any such distribution, let Λ_low = {λ ∈ Λ : λ ≤ n^{1/3}}, Λ_med = {λ ∈ Λ : n^{1/3} < λ ≤ n^{2/3}}, and Λ_high = {λ ∈ Λ : λ > n^{2/3}}, and let ϕ_low, ϕ_med, ϕ_high denote the corresponding profile each subset generates. [sent-268, score-0.714]
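
The three-way split is simple enough to state in code; here is a small sketch (our own notation, for illustration only) that groups the Poisson parameters by the n^{1/3} and n^{2/3} cutoffs used above.

```python
def split_lambdas(lams, n):
    """Split the Poisson parameters into the low/medium/high groups
    delimited by n**(1/3) and n**(2/3)."""
    cut1, cut2 = n ** (1 / 3), n ** (2 / 3)
    low = [lam for lam in lams if lam <= cut1]
    med = [lam for lam in lams if cut1 < lam <= cut2]
    high = [lam for lam in lams if lam > cut2]
    return low, med, high

print(split_lambdas([2, 7, 40, 950], n=1000))  # ([2, 7], [40], [950])
```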

70 A distribution in I_{ϕ_low} is a multiset of λ's such that each is at most n^{1/3} and they sum either to n or to at most n − n^{1/3}. [sent-272, score-0.208]

71 In Subsection 3.1 we show that R(I_{ϕ_low}) < 4n^{1/3} log n and R(I_{ϕ_high}) < 4n^{1/3} log n. [sent-277, score-0.128]

72 Bounds on R(I_{ϕ_low}) and R(I_{ϕ_high}): Elias codes [12] are prefix-free codes that encode a positive integer n using at most log n + log(log n + 1) + 1 bits. [sent-282, score-0.161]
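
For concreteness, here is a minimal sketch of the simplest member of the Elias family, the gamma code (our own function names; the bound quoted above is for a more refined Elias variant, so this only conveys the idea of a prefix-free integer code).

```python
def elias_gamma_encode(n):
    """Elias gamma code: floor(log2 n) zeros followed by the binary form of n.
    Prefix-free; uses 2*floor(log2 n) + 1 bits."""
    assert n >= 1
    b = bin(n)[2:]                      # binary representation of n
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Decode one gamma codeword from the front of a bit string;
    returns (value, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2), bits[2 * zeros + 1:]

code = elias_gamma_encode(13)
print(code)                             # '0001101'
print(elias_gamma_decode(code))         # (13, '')
```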

73 We use Elias codes and design explicit coding schemes for distributions in Iϕlow and Iϕhigh , and prove the following result. [sent-283, score-0.171]

74 R(I_{ϕ_low}) < 4n^{1/3} log n, and R(I_{ϕ_high}) < 2n^{1/3} log n. [sent-285, score-0.128]

75 For λ > 10, the number of bits needed to encode a poi(λ) random variable using Elias codes can be shown to be at most 2 log λ. [sent-291, score-0.271]

76 We encode distinct multiplicities along with their prevalences, using two integers for each distinct multiplicity. [sent-295, score-0.15]

77 Using Poisson tail bounds, we bound the largest multiplicity in ϕ_low, and use arguments similar to those for I_{ϕ_high} to obtain R(I_{ϕ_low}) < 4n^{1/3} log n. [sent-297, score-0.19]

78 We show that within each interval, there is a uniform distribution such that the KL divergence between the underlying distribution and the induced uniform distribution is small. [sent-301, score-0.179]

79 We then show that the number of uniform distributions needed is at most exp(n^{1/3} log n). [sent-302, score-0.168]

80 We partition I_{ϕ_med} into T ≤ exp(n^{1/3} log n) classes, upper bound the redundancy of each class, and then invoke Lemma 7 to obtain an upper bound on R(I_{ϕ_med}). [sent-304, score-0.662]

81 Consider any partition of (n^{1/3}, n^{2/3}] into B = n^{1/3} consecutive intervals I_1, I_2, . . . , I_B. [sent-309, score-0.297]

82 Let Λ_j = {λ : λ ∈ Λ ∩ I_j} be the set of elements of Λ in I_j, where m_j = m_j(Λ) = |Λ_j| is the number of elements of Λ in I_j. [sent-319, score-0.911]

83 Let T = T(∆) be the set of all possible different τ (such that I_τ is non-empty), and let T = |T| be the number of classes. [sent-332, score-0.238]

84 Observe that for any Λ ∈ I_{ϕ_med} and any j, we have m_j < n^{2/3}; otherwise Σ_{λ∈Λ} λ > m_j · n^{1/3} ≥ n^{2/3} · n^{1/3} = n. [sent-334, score-0.234]

85 For any choice of ∆, let λ⁻_j = n^{1/3} + Σ_{i=1}^{j−1} ∆_i be the left end point of the interval I_j for j = 1, 2, . . . , B. [sent-337, score-0.238]

86 For any τ = (m_1, . . . , m_B) ∈ T(∆), we exhibit a distribution Λ* ∈ I_τ such that for all Λ ∈ I_τ, D(Λ||Λ*) ≤ Σ_{j=1}^{B} m_j ∆_j² / λ*_j. [sent-357, score-0.156]
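
The quadratic form m_j ∆_j² / λ*_j is the kind of term one gets from the closed-form KL divergence between nearby Poisson distributions. As a side illustration (our code, not the paper's), the standard formula and the elementary quadratic upper bound, which follows from log x ≤ x − 1, are sketched below; since the multiplicities are independent Poisson variables under Poisson sampling, such per-symbol divergences add, and by data processing their sum upper-bounds the divergence between the induced profile distributions, which is roughly where sums of this shape come from.

```python
import math

def kl_poisson(lam, lam_star):
    """D( poi(lam) || poi(lam_star) ) = lam*log(lam/lam_star) + lam_star - lam  (nats)."""
    return lam * math.log(lam / lam_star) + lam_star - lam

def quadratic_bound(lam, lam_star):
    """Upper bound (lam - lam_star)**2 / lam_star, valid since log x <= x - 1."""
    return (lam - lam_star) ** 2 / lam_star

lam, lam_star = 130.0, 125.0
print(kl_poisson(lam, lam_star), quadratic_bound(lam, lam_star))  # ~0.099 <= 0.2
```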

87 A solution to the following optimization problem yields a bound: min_∆ max_{(m_1,...,m_B)} Σ_{j=1}^{B} m_j ∆_j² / λ⁻_j, subject to Σ_{j=1}^{B} m_j λ⁻_j ≤ n. [sent-374, score-0.275]

88 For every (m_1, . . . , m_B) ∈ T(∆), Σ_{j=1}^{B} m_j ∆_j² / λ⁻_j ≤ c² (2 log(n^{1/3}) / n^{1/3})² Σ_{j=1}^{B} m_j λ⁻_j ≤ c² (2 log(n^{1/3}) / n^{1/3})² · n. This, along with Lemma 7, gives the following corollary for sufficiently large n. [sent-385, score-0.128]

89 Lower bound: We use error-correcting codes to construct a collection of 2^{0.3·n^{1/3}} [sent-391, score-0.153]

90 distinguishable distributions, improving by a logarithmic factor. The convexity of KL-divergence can be used to show Lemma 12. [sent-393, score-0.134]

91 We use this result to show that (1 − δ) log M(P, δ) ≤ R(P). [sent-397, score-0.302]

92 We now describe the class of distinguishable distributions. [sent-413, score-0.587]

93 Let λ*_i = C·i², K = (3n/C)^{1/3}, and S = {λ*_i : 1 ≤ i ≤ K}. [sent-415, score-0.238]

94 Let x = x_1 x_2 · · · x_K be a binary string and Λ_x = {λ*_i : x_i = 1} ∪ {n − Σ_i λ*_i x_i}. [sent-420, score-0.262]

95 A binary code of length k and minimum distance d_min is a collection of length-k binary strings such that the Hamming distance between any two strings is at least d_min. [sent-422, score-0.335]

96 Let C be a code satisfying Lemma 13 for k = K and let L = {Λc : c ∈ C} be the set of distributions generated by using the strings in C. [sent-428, score-0.161]
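
The existence statement of Lemma 13 is not reproduced here, but the object it asserts is easy to build greedily on small instances; the sketch below (a Gilbert–Varshamov-style illustration with our own names, exponential in k and not the paper's construction) collects binary strings that are pairwise at least d_min apart, and each resulting codeword c would index one distribution Λ_c as above.

```python
import itertools

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_code(k, dmin):
    """Greedily collect length-k binary strings whose pairwise Hamming
    distance is at least dmin (illustration only; exponential in k)."""
    code = []
    for word in itertools.product((0, 1), repeat=k):
        if all(hamming(word, c) >= dmin for c in code):
            code.append(word)
    return code

C = greedy_code(k=8, dmin=3)
print(len(C), C[:3])   # a small code with minimum distance 3
```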

97 The following result shows that the distributions in L are distinguishable; it is proved in the Appendix. [sent-429, score-0.194]

98 A lower-bound for the maximin redundancy in pattern coding. [sent-526, score-0.476]

99 A new upper bound on the redundancy of unknown alphabets. [sent-622, score-0.523]

100 Asymptotic minimax regret for data compression, gambling and prediction. [sent-650, score-0.151]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('redundancy', 0.427), ('poi', 0.421), ('pro', 0.361), ('med', 0.252), ('def', 0.238), ('le', 0.175), ('multiset', 0.169), ('alphabet', 0.154), ('les', 0.134), ('orlitsky', 0.134), ('mj', 0.117), ('compression', 0.117), ('distinguishable', 0.111), ('universal', 0.105), ('mb', 0.09), ('bits', 0.089), ('elias', 0.084), ('distributions', 0.083), ('minimax', 0.082), ('lemma', 0.08), ('alphabets', 0.078), ('multiplicities', 0.073), ('poisson', 0.064), ('log', 0.064), ('collection', 0.06), ('multiplicity', 0.06), ('santhanam', 0.058), ('acharya', 0.057), ('ucsd', 0.057), ('symbol', 0.055), ('entropy', 0.054), ('length', 0.052), ('codes', 0.052), ('compress', 0.051), ('pb', 0.051), ('symbols', 0.051), ('jafarpour', 0.051), ('strings', 0.051), ('kl', 0.05), ('low', 0.05), ('sequences', 0.049), ('sequence', 0.049), ('transactions', 0.049), ('pattern', 0.049), ('dmin', 0.047), ('closeness', 0.047), ('patterns', 0.045), ('encode', 0.045), ('ep', 0.044), ('diminishes', 0.044), ('pm', 0.043), ('elements', 0.042), ('diminishing', 0.041), ('bound', 0.041), ('distribution', 0.039), ('das', 0.038), ('abracadabra', 0.038), ('gambling', 0.038), ('npi', 0.038), ('rissanen', 0.038), ('coding', 0.036), ('gr', 0.034), ('viswanathan', 0.034), ('pre', 0.033), ('induced', 0.033), ('testing', 0.032), ('intervals', 0.032), ('integers', 0.032), ('theory', 0.032), ('upper', 0.031), ('regret', 0.031), ('ib', 0.031), ('bounds', 0.031), ('ij', 0.03), ('pan', 0.029), ('probabilities', 0.029), ('divergence', 0.029), ('ieee', 0.028), ('ece', 0.028), ('pa', 0.028), ('code', 0.027), ('appearances', 0.027), ('alon', 0.027), ('partition', 0.027), ('ik', 0.026), ('expected', 0.026), ('compressed', 0.025), ('high', 0.025), ('unseen', 0.025), ('subsection', 0.025), ('largest', 0.025), ('unknown', 0.024), ('element', 0.024), ('string', 0.024), ('determined', 0.023), ('pi', 0.023), ('improving', 0.023), ('assign', 0.023), ('growth', 0.022), ('needed', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

Author: Jayadev Acharya, Hirakendu Das, Alon Orlitsky

Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power. 1

2 0.17244807 194 nips-2012-Learning to Discover Social Circles in Ego Networks

Author: Jure Leskovec, Julian J. Mcauley

Abstract: Our personal social networks are big and cluttered, and currently there is no good way to organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g. ‘circles’ on Google+, and ‘lists’ on Facebook and Twitter), however they are laborious to construct and must be updated whenever a user’s network grows. We define a novel machine learning task of identifying users’ social circles. We pose the problem as a node clustering problem on a user’s ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure as well as user profile information. For each circle we learn its members and the circle-specific user profile similarity metric. Modeling node membership to multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter for all of which we obtain hand-labeled ground-truth. 1

3 0.13085774 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

4 0.09455508 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α-Hölder functions.
Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), the European Communitys Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327 (CompLACS) and no 216886 (PASCAL2). 8 References Andr` s Antos, Varun Grover, and Csaba Szepesv` ri. Active learning in heteroscedastic noise. Thea a oretical Computer Science, 411(29-30):2712–2728, 2010. Sylvain Arlot. R´ echantillonnage et S´ lection de mod` les. PhD thesis, Universit´ Paris Sud - Paris e´ e e e XI, 2007. A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009. D. Bosq and J.P. Lecoutre. Th´ orie de l’estimation fonctionnelle, volume 21. Economica, 1987. e Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893. Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rmi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesv` ri, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, a volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011. E. Gin´ and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2): e 1122–1170, 2010. L. Gy¨ rfi, M. Kohler, A. Krzy´ ak, and Walk H. A distribution-free theory of nonparametric regreso z sion. Springer Verlag, 2002. I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981. M. Rosenblatt. Stochastic curve estimation, volume 3. Inst of Mathematical Statistic, 1991. J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (19902010). Autonomous Mental Development, IEEE Transactions on, 2(3):230–247, 2010. C.J. Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980. J.W. Tukey. Non-parametric estimation ii. statistically equivalent blocks and tolerance regions–the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947. 9

5 0.084312573 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

6 0.081940793 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

7 0.076453172 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

8 0.073814429 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

9 0.068328559 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

10 0.065326832 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

11 0.057663523 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

12 0.051971812 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

13 0.050828855 179 nips-2012-Learning Manifolds with K-Means and K-Flats

14 0.050249025 188 nips-2012-Learning from Distributions via Support Measure Machines

15 0.049932756 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

16 0.048946347 148 nips-2012-Hamming Distance Metric Learning

17 0.048279088 247 nips-2012-Nonparametric Reduced Rank Regression

18 0.04635416 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

19 0.045782927 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

20 0.045115273 211 nips-2012-Meta-Gaussian Information Bottleneck


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, 0.01), (2, 0.025), (3, -0.005), (4, 0.006), (5, 0.007), (6, -0.014), (7, 0.049), (8, -0.041), (9, 0.029), (10, 0.045), (11, 0.022), (12, -0.066), (13, -0.069), (14, -0.084), (15, -0.027), (16, -0.008), (17, -0.01), (18, -0.05), (19, 0.004), (20, -0.024), (21, 0.003), (22, 0.032), (23, -0.005), (24, -0.046), (25, 0.058), (26, 0.036), (27, -0.006), (28, 0.055), (29, 0.03), (30, 0.062), (31, 0.047), (32, -0.036), (33, -0.038), (34, -0.073), (35, 0.002), (36, 0.045), (37, -0.02), (38, 0.011), (39, 0.081), (40, -0.081), (41, 0.017), (42, -0.084), (43, -0.059), (44, -0.04), (45, -0.09), (46, 0.094), (47, -0.104), (48, 0.106), (49, 0.12)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91673881 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

Author: Jayadev Acharya, Hirakendu Das, Alon Orlitsky

Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power. 1

2 0.62255317 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ R^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α-Hölder functions. 1
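
As an illustration of the kind of upper-confidence-bound allocation this abstract refers to, the following is only a generic sketch, not the procedure analysed in the paper: the variance-over-count error proxy, the confidence-width constant, and the toy target function are all assumptions made for the example.

import math, random

def ucb_allocate(cells, budget, noisy_f, init=2, c=2.0):
    """Allocate a sampling budget over the cells of a partition, repeatedly
    sampling the cell whose estimated error plus confidence width is largest.
    'cells' is a list of functions, each drawing a random point in its cell."""
    samples = [[noisy_f(cell()) for _ in range(init)] for cell in cells]
    for t in range(init * len(cells), budget):
        scores = []
        for ys in samples:
            n = len(ys)
            mean = sum(ys) / n
            var = sum((y - mean) ** 2 for y in ys) / (n - 1)
            # crude per-cell error proxy plus an optimistic confidence width
            scores.append(var / n + c * math.sqrt(math.log(t + 1) / n))
        p = max(range(len(cells)), key=lambda i: scores[i])
        samples[p].append(noisy_f(cells[p]()))
    return [len(ys) for ys in samples]

# toy usage: f(x) = x**2 on [0, 1] split into 4 equal cells, Gaussian noise
cells = [lambda a=a: random.uniform(a, a + 0.25) for a in (0.0, 0.25, 0.5, 0.75)]
print(ucb_allocate(cells, 200, lambda x: x ** 2 + random.gauss(0, 0.1)))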

3 0.57694715 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

4 0.52914381 194 nips-2012-Learning to Discover Social Circles in Ego Networks

Author: Jure Leskovec, Julian J. Mcauley

Abstract: Our personal social networks are big and cluttered, and currently there is no good way to organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g. ‘circles’ on Google+, and ‘lists’ on Facebook and Twitter), however they are laborious to construct and must be updated whenever a user’s network grows. We define a novel machine learning task of identifying users’ social circles. We pose the problem as a node clustering problem on a user’s ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure as well as user profile information. For each circle we learn its members and the circle-specific user profile similarity metric. Modeling node membership to multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter for all of which we obtain hand-labeled ground-truth. 1

5 0.51596451 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1
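
The numerical identifiability check mentioned above can be illustrated with a hedged sketch: estimate the Jacobian of a parameter-to-moments map by finite differences and compare its rank with the number of parameters. The toy moment map below is invented for the example and is not one of the parsing models from the paper.

import numpy as np

def numerical_rank(moment_map, theta, eps=1e-6, tol=1e-6):
    """Estimate the rank of the Jacobian of 'moment_map' at 'theta' by central
    finite differences; a rank equal to the number of parameters suggests
    local identifiability of the parameters from the moments."""
    theta = np.asarray(theta, dtype=float)
    m = len(moment_map(theta))
    J = np.zeros((m, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        J[:, j] = (moment_map(theta + e) - moment_map(theta - e)) / (2 * eps)
    return int(np.linalg.matrix_rank(J, tol=tol)), theta.size

# toy map: only theta[0] + theta[1] and theta[2] can be recovered, so the
# Jacobian rank (2) is smaller than the number of parameters (3)
toy = lambda th: np.array([th[0] + th[1], th[2], (th[0] + th[1]) ** 2])
print(numerical_rank(toy, [0.3, 0.2, 0.5]))   # (2, 3)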

6 0.50579447 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

7 0.50021267 161 nips-2012-Interpreting prediction markets: a stochastic approach

8 0.48606834 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

9 0.47817644 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

10 0.46696302 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

11 0.46677065 179 nips-2012-Learning Manifolds with K-Means and K-Flats

12 0.4506278 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

13 0.45049 267 nips-2012-Perceptron Learning of SAT

14 0.43403599 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

15 0.42894676 217 nips-2012-Mixability in Statistical Learning

16 0.42789137 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

17 0.42648557 34 nips-2012-Active Learning of Multi-Index Function Models

18 0.41749382 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

19 0.4050529 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

20 0.39907187 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (21, 0.037), (38, 0.106), (39, 0.012), (42, 0.05), (54, 0.013), (55, 0.041), (74, 0.028), (76, 0.214), (80, 0.08), (82, 0.255), (92, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81371075 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

Author: Jayadev Acharya, Hirakendu Das, Alon Orlitsky

Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power. 1

2 0.80241847 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

Author: Lei Shi

Abstract: For modeling data matrices, this paper introduces Probabilistic Co-Subspace Addition (PCSA) model by simultaneously capturing the dependent structures among both rows and columns. Briefly, PCSA assumes that each entry of a matrix is generated by the additive combination of the linear mappings of two low-dimensional features, which distribute in the row-wise and column-wise latent subspaces respectively. In consequence, PCSA captures the dependencies among entries intricately, and is able to handle non-Gaussian and heteroscedastic densities. By formulating the posterior updating into the task of solving Sylvester equations, we propose an efficient variational inference algorithm. Furthermore, PCSA is extended to tackling and filling missing values, to adapting model sparseness, and to modelling tensor data. In comparison with several state-of-the-art methods, experiments demonstrate the effectiveness and efficiency of Bayesian (sparse) PCSA on modeling matrix (tensor) data and filling missing values.
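
A toy generative sketch of the additive row/column structure described above follows; the dimensions, linear maps, and noise level are arbitrary choices for illustration, not values or notation from the paper.

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, k_row, k_col = 50, 40, 3, 2   # arbitrary example sizes

U = rng.normal(size=(n_rows, k_row))   # row-wise latent features
V = rng.normal(size=(n_cols, k_col))   # column-wise latent features
a = rng.normal(size=k_row)             # linear map applied to a row feature
b = rng.normal(size=k_col)             # linear map applied to a column feature

# each entry is an additive combination of the two mapped features plus noise
X = U @ a[:, None] + (V @ b[:, None]).T + 0.1 * rng.normal(size=(n_rows, n_cols))
print(X.shape)   # (50, 40)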

3 0.76193994 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

Author: Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, Pascal Fua

Abstract: In this paper we apply boosting to learn complex non-linear local visual feature representations, drawing inspiration from its successful application to visual object detection. The main goal of local feature descriptors is to distinctively represent a salient image region while remaining invariant to viewpoint and illumination changes. This representation can be improved using machine learning, however, past approaches have been mostly limited to learning linear feature mappings in either the original input or a kernelized input feature space. While kernelized methods have proven somewhat effective for learning non-linear local feature descriptors, they rely heavily on the choice of an appropriate kernel function whose selection is often difficult and non-intuitive. We propose to use the boosting-trick to obtain a non-linear mapping of the input to a high-dimensional feature space. The non-linear feature mapping obtained with the boosting-trick is highly intuitive. We employ gradient-based weak learners resulting in a learned descriptor that closely resembles the well-known SIFT. As demonstrated in our experiments, the resulting descriptor can be learned directly from intensity patches achieving state-of-the-art performance. 1

4 0.70761198 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
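
The matrix-completion idea mentioned above can be illustrated with a generic low-rank imputation sketch (iterated truncated SVD); this is not the paper's metric-learning algorithm, and the synthetic similarity matrix, rank, and observation rate are invented for the example.

import numpy as np

def complete_low_rank(M, mask, rank=2, iters=200):
    """Fill the unobserved entries of M (observed where mask is True) by
    repeatedly replacing them with the best rank-'rank' approximation."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, M, low_rank)   # keep observed entries, fill the rest
    return low_rank

rng = np.random.default_rng(2)
A, B = rng.normal(size=(30, 2)), rng.normal(size=(30, 2))
S = A @ B.T + 0.05 * rng.normal(size=(30, 30))   # noisy rank-2 "similarity" matrix
mask = rng.random((30, 30)) < 0.3                # only 30% of the pairs are annotated
S_hat = complete_low_rank(S, mask)
print(round(float(np.linalg.norm(S_hat - A @ B.T) / np.linalg.norm(A @ B.T)), 3))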

5 0.70741785 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable, however computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of ℓ1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable. 1
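
The ℓ1-regularised sparse approximate inverse mentioned above is in the same spirit as the graphical lasso; the sketch below is a generic illustration rather than the paper's scheme, it assumes scikit-learn's GraphicalLasso is available, and the synthetic draws merely stand in for samples used to approximate a metric tensor.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
d, n = 10, 500   # arbitrary example sizes

# synthetic stand-in: draws whose true precision (inverse covariance) is sparse
true_prec = np.eye(d) + np.diag(0.4 * np.ones(d - 1), 1) + np.diag(0.4 * np.ones(d - 1), -1)
samples = rng.multivariate_normal(np.zeros(d), np.linalg.inv(true_prec), size=n)

# the l1 penalty zeroes out weak entries, giving a sparse approximate inverse
est = GraphicalLasso(alpha=0.05).fit(samples)
print(int(np.sum(np.abs(est.precision_) > 1e-3)), "nonzero entries out of", d * d)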

6 0.70334494 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

7 0.70246446 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

8 0.70246315 188 nips-2012-Learning from Distributions via Support Measure Machines

9 0.70212609 142 nips-2012-Generalization Bounds for Domain Adaptation

10 0.70086378 338 nips-2012-The Perturbed Variation

11 0.69916028 291 nips-2012-Reducing statistical time-series problems to binary classification

12 0.69729471 294 nips-2012-Repulsive Mixtures

13 0.69695318 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

14 0.69673091 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

15 0.6965549 95 nips-2012-Density-Difference Estimation

16 0.69633728 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

17 0.69628656 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

18 0.69598651 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

19 0.69561195 126 nips-2012-FastEx: Hash Clustering with Exponential Families

20 0.69512713 247 nips-2012-Nonparametric Reduced Rank Regression