nips nips2012 nips2012-163 knowledge-graph by maker-knowledge-mining

163 nips-2012-Isotropic Hashing


Source: pdf

Author: Weihao Kong, Wu-jun Li

Abstract: Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. [sent-4, score-0.714]

2 Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). [sent-5, score-0.512]

3 Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. [sent-6, score-0.463]

4 Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. [sent-7, score-0.348]

5 In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). [sent-8, score-1.047]

6 Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. [sent-9, score-0.641]

7 1 Introduction Due to its fast query speed and low storage cost, hashing [1, 5] has been successfully used for approximate nearest neighbor (ANN) search [28]. [sent-10, score-0.295]

8 The basic idea of hashing is to learn similarity-preserving binary codes for data representation. [sent-11, score-0.36]

9 More specifically, each data point will be hashed into a compact binary string, and similar points in the original feature space should be hashed into close points in the hashcode space. [sent-12, score-0.119]
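
As a minimal illustration of this idea (the 8-bit codes below are hypothetical values, not from the paper; numpy is assumed), closeness in the hashcode space is simply a small Hamming distance:

```python
import numpy as np

# Two hypothetical 8-bit codes for a pair of similar points (values made up for illustration).
y1 = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
y2 = np.array([1, 0, 1, 0, 0, 0, 1, 0], dtype=np.uint8)

# Similarity preservation means similar points should end up at a small Hamming distance.
hamming_distance = int(np.count_nonzero(y1 != y2))
print(hamming_distance)  # 1
```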

10 Compared with the original feature representation, hashing has two advantages. [sent-13, score-0.247]

11 These two advantages make hashing become a promising choice for efficient ANN search in massive data sets [1, 5, 6, 9, 10, 14, 15, 17, 20, 21, 23, 26, 29, 30, 31, 32, 33, 34]. [sent-15, score-0.247]

12 Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. [sent-16, score-0.714]

13 Locality-sensitive hashing (LSH) [1, 5] and its extensions [4, 18, 19, 22, 25] use simple random projections for hash functions. [sent-17, score-0.317]

14 These methods are called data-independent methods because the projection functions are independent of training data. [sent-18, score-0.175]

15 Another class of methods are called data-dependent methods, whose projection functions are learned from training data. [sent-19, score-0.175]

16 Representative data-dependent methods include spectral hashing (SH) [31], anchor graph hashing (AGH) [21], sequential projection learning (SPL) [29], principal component analysis [13] based hashing (PCAH) [7], and iterative quantization (ITQ) [7, 8]. [sent-20, score-0.963]

17 SH learns the hashing functions based on spectral graph partitioning. [sent-21, score-0.266]

18 AGH adopts anchor graphs to speed up the computation of graph Laplacian eigenvectors, based on which the Nyström method is used to compute the projection functions. [sent-22, score-0.167]

19 SPL learns the projection functions in a sequential way, where each function is designed to correct the errors caused by the previous one. [sent-23, score-0.151]

20 PCAH adopts principal component analysis (PCA) to learn the projection functions. [sent-24, score-0.2]

21 ITQ tries to learn an orthogonal rotation matrix to refine the initial projection matrix learned by PCA so that the quantization error of mapping the data to the vertices of the binary hypercube is minimized. [sent-25, score-0.379]

22 Compared to the data-dependent methods, the data-independent methods need longer codes to achieve satisfactory performance [7]. [sent-26, score-0.05]

23 For most existing projection functions such as those mentioned above, the variances of different projected dimensions are different. [sent-27, score-0.493]

24 Many researchers [7, 12, 21] have argued that using the same number of bits for different projected dimensions with unequal variances is unreasonable because larger-variance dimensions will carry more information. [sent-28, score-0.711]

25 Some methods [7, 12] apply an orthogonal transformation to the PCA-projected data with the expectation of balancing the variances of different PCA dimensions, and achieve better performance than the original PCA based hashing. [sent-29, score-0.239]

26 However, to the best of our knowledge, there exist no methods which can guarantee to learn a projection with equal variances for different dimensions. [sent-30, score-0.334]

27 Hence, the viewpoint that using the same number of bits for different projected dimensions is unreasonable has still not been verified by either theory or experiment. [sent-31, score-0.409]

28 In this paper, a novel hashing method, called isotropic hashing (IsoHash), is proposed to learn a projection function which can produce projected dimensions with isotropic variances (equal variances). [sent-32, score-1.275]

29 To the best of our knowledge, this is the first work which can learn projections with isotropic variances for hashing. [sent-33, score-0.368]

30 Experimental results on real data sets show that IsoHash can outperform its counterpart with anisotropic variances for different dimensions, which verifies the intuitive viewpoint that projections with isotropic variances will be better than those with anisotropic variances. [sent-34, score-0.729]

31 The basic idea of hashing is to map each point x_i into a binary string y_i ∈ {0, 1}^m, with m denoting the code size. [sent-39, score-0.323]

32 Furthermore, close points in the original space Rd should be hashed into similar binary codes in the code space {0, 1}m to preserve the similarity structure in the original space. [sent-40, score-0.151]

33 In general, we compute the binary code of x_i as y_i = [h_1(x_i), h_2(x_i), · · · , h_m(x_i)]^T with m binary hash functions {h_k(·)}_{k=1}^m. [sent-41, score-0.142]

34 Because it is NP-hard to directly compute the best binary functions h_k(·) for a given data set [31], most hashing methods adopt a two-stage strategy to learn h_k(·). [sent-42, score-0.399]

35 In the projection stage, m real-valued projection functions {f_k(x)}_{k=1}^m are learned, and each function generates one real value. [sent-43, score-0.283]

36 Hence, we have m projected dimensions, each of which corresponds to one projection function. [sent-44, score-0.303]

37 In the quantization stage, the real values are quantized into a binary string by thresholding. [sent-45, score-0.138]
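
A rough sketch of this two-stage strategy (projection, then one-bit quantization by thresholding). The random Gaussian projection here is only a stand-in in the spirit of LSH; the data matrix, function name, and seed are illustrative assumptions, not from the paper:

```python
import numpy as np

def two_stage_hash(X, m, seed=0):
    """Toy projection-then-threshold hashing.

    X: (n, d) data matrix, one point per row; m: number of bits.
    Returns an (n, m) array of {0, 1} codes.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, m))          # projection stage: m projection directions
    projected = (X - X.mean(axis=0)) @ W     # one real value per projected dimension
    return (projected > 0).astype(np.uint8)  # quantization stage: one bit per dimension

X = np.random.default_rng(1).standard_normal((100, 32))
codes = two_stage_hash(X, m=16)
print(codes.shape)  # (100, 16)
```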

38 Currently, most methods use one bit to quantize each projected dimension. [sent-46, score-0.137]

39 The only exceptions among the quantization methods are AGH [21], DBQ [14], and MH [15], which use two bits to quantize each dimension. [sent-48, score-0.2]

40 In sum, all of these methods adopt the same number (either one or two) of bits for different projected dimensions. [sent-49, score-0.225]

41 However, the variances of different projected dimensions are unequal, and larger-variance dimensions typically carry more information. [sent-50, score-0.441]

42 Hence, using the same number of bits for different projected dimensions with unequal variances is unreasonable, which has also been argued by many researchers [7, 12, 21]. [sent-51, score-0.537]

43 Unfortunately, there exist no methods which can learn projection functions with equal variances for different dimensions. [sent-52, score-0.353]

44 In the remainder of this section, we present a novel model to learn projections with isotropic variances. [sent-53, score-0.197]

45 2 Model Formulation The idea of our IsoHash method is to learn an orthogonal matrix to rotate the PCA projection matrix. [sent-55, score-0.305]

46 To generate a code of m bits, PCAH performs PCA on X and then uses the top m eigenvectors of the covariance matrix XX^T as the columns of the projection matrix W ∈ R^{d×m}. [sent-56, score-0.236]

47 Here, the top m eigenvectors are those corresponding to the m largest eigenvalues {λ_k}_{k=1}^m, generally arranged in non-increasing order λ_1 ≥ λ_2 ≥ · · · ≥ λ_m. [sent-57, score-0.054]

48 Hence, the projection functions of PCAH are defined as follows: f_k(x) = w_k^T x, where w_k is the kth column of W. [sent-58, score-0.215]

49 Let λ = [λ1 , λ2 , · · · , λm ]T and Λ = diag(λ), where diag(λ) denotes the diagonal matrix whose diagonal entries are formed from the vector λ. [sent-59, score-0.121]

50 Hence, the variance of the values {f_k(x_i)}_{i=1}^n on the kth projected dimension, which corresponds to the kth row of W^T X, is λ_k. [sent-61, score-0.149]

51 Obviously, the variances for different PCA dimensions are anisotropic. [sent-62, score-0.253]
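
A quick numerical check of this observation, using a synthetic anisotropic data matrix X with points in columns (matching the W^T X convention above): the per-dimension variances of the PCA projection equal the top eigenvalues of XX^T and are clearly unequal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 8, 2000, 4
# Synthetic anisotropic data: columns are points, feature k has standard deviation k+1.
X = rng.standard_normal((d, n)) * np.arange(1, d + 1)[:, None]
X = X - X.mean(axis=1, keepdims=True)

eigvals, eigvecs = np.linalg.eigh(X @ X.T)        # eigendecomposition of the (unnormalized) covariance
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:m]]                          # top-m eigenvectors as columns of W

projected = W.T @ X
per_dim_variance = (projected ** 2).sum(axis=1)    # = diag(W^T X X^T W), i.e. the top eigenvalues
print(np.round(per_dim_variance, 1))
print(np.round(eigvals[order[:m]], 1))             # same values, and clearly unequal (anisotropic)
```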

52 To get isotropic projection functions, the idea of our IsoHash method is to learn an orthogonal matrix Q ∈ R^{m×m} which makes Q^T W^T XX^T W Q a matrix with equal diagonal values. [sent-63, score-0.476]

53 Here, A_{ii} denotes the ith diagonal entry of a square matrix A, and a matrix Q is said to be orthogonal if Q^T Q = I, where I is an identity matrix whose dimensionality depends on the context. [sent-66, score-0.206]

54 The effect of the orthogonal matrix Q is to rotate the coordinate axes while keeping the Euclidean distances between any two points unchanged. [sent-67, score-0.142]

55 It is easy to prove that the new projection functions of IsoHash are f_k(x) = (W Q)^T x, which have the same (isotropic) variance. [sent-68, score-0.202]

56 Hence, to make Q^T W^T XX^T W Q a matrix with equal diagonal values, we should set this diagonal value to a = (∑_{i=1}^m λ_i)/m. [sent-74, score-0.121]

57 Let a = [a_1, a_2, · · · , a_m] with a_i = a = (∑_{i=1}^m λ_i)/m (1), and let T(z) = {T ∈ R^{m×m} | diag(T) = diag(z)}, where z is a vector of length m and diag(T) is overloaded to denote a diagonal matrix with the same diagonal entries as the matrix T. [sent-75, score-0.168]

58 The problem of IsoHash is to find an orthogonal matrix Q making Q^T W^T XX^T W Q ∈ T(a), where a is defined in (1). [sent-77, score-0.099]
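
To make the target concrete: since W^T XX^T W = Λ, the IsoHash condition is that Q^T Λ Q has every diagonal entry equal to a from (1). A small sketch with hypothetical eigenvalues (the values and the helper name in_T_a are illustrative assumptions):

```python
import numpy as np

lam = np.array([4.0, 2.0, 1.0, 1.0])   # hypothetical PCA eigenvalues λ (variances of the PCA dimensions)
Lambda = np.diag(lam)
a = lam.sum() / lam.size               # target diagonal value from (1): a = (sum_i λ_i) / m
print(a)                               # 2.0

def in_T_a(Q, tol=1e-8):
    """True if Q^T Λ Q has all diagonal entries equal to a, i.e. Q solves the IsoHash problem."""
    return np.allclose(np.diag(Q.T @ Lambda @ Q), a, atol=tol)

Q_random = np.linalg.qr(np.random.default_rng(0).standard_normal((4, 4)))[0]
print(in_T_a(np.eye(4)))   # False: PCA itself leaves the variances anisotropic
print(in_T_a(Q_random))    # almost surely False: an arbitrary rotation does not balance them either
```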

59 As in [2], we define M(Λ) = {Q^T Λ Q | Q ∈ O(m)} (2), where O(m) is the set of all orthogonal matrices in R^{m×m}. [sent-84, score-0.068]

60 According to Theorem 1, the problem of IsoHash is equivalent to finding an orthogonal matrix Q for the following equation [2]: ||T − Z||_F = 0 (3), where T ∈ T(a), Z ∈ M(Λ), and || · ||_F denotes the Frobenius norm. [sent-87, score-0.099]

61 There exists a Hermitian matrix H with eigenvalues c and diagonal values b if and only if ∑_{i=1}^k b_i ≤ ∑_{i=1}^k c_i for any k = 1, 2, . . . , m, with equality when k = m (b and c sorted in non-increasing order). [sent-94, score-0.105]

62 Furthermore, since ∑_{i=1}^m λ_i = ∑_{i=1}^m a_i, we can prove that there exists a Hermitian matrix H with eigenvalues λ and diagonal values a. [sent-108, score-0.122]

63 Moreover, we can prove that H is in the intersection of T(a) and M(Λ), i.e., H ∈ T(a) ∩ M(Λ). [sent-109, score-0.054]

64 The problem is thus reformulated as argmin_{Q: T ∈ T(a), Z ∈ M(Λ)} ||T − Z||_F (4). As in [2], we propose two algorithms to learn Q: one is called lift and projection (LP), and the other is called gradient flow (GF). [sent-115, score-0.276]

65 We call Z^{(k)} a lift of T^{(k)} onto M(Λ) and T^{(k+1)} a projection of Z^{(k)} onto T(a). [sent-125, score-0.221]

66 Suppose Z^{(k)} = [z_{ij}]; then T^{(k+1)} = [t_{ij}] must be given by t_{ij} = z_{ij} if i ≠ j, and t_{ij} = a_i if i = j (5). For the lift operation, we have the following Theorem 3. [sent-127, score-0.146]

67 Then the nearest neighbor of T in M(Λ) is given by Z = Q^T Λ Q. [sent-133, score-0.048]

68 Algorithm 1 (Lift and projection based IsoHash, IsoHash-LP). Input: X ∈ R^{d×n}, m ∈ N+, t ∈ N+. • [Λ, W] = PCA(X, m), as stated in Section 2. [sent-140, score-0.132]

69 • Generate a random orthogonal matrix Q0 ∈ Rm×m . [sent-142, score-0.099]

70 For example, if we set Z^{(0)} = Λ, the lift and projection learning algorithm would make no progress because Z and T are already at a stationary point. [sent-149, score-0.221]

71 To avoid such degenerate solutions, we initialize Z as Λ transformed by a random orthogonal matrix Q_0, as illustrated in Algorithm 1. [sent-150, score-0.099]
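
A compact sketch of the lift-and-projection iteration as it is summarized here. This is an illustrative reconstruction from the sentences above, not the authors' code; the function name and eigenvalues are assumptions. It alternates between projecting onto T(a) by resetting the diagonal to a, and lifting onto M(Λ) via a spectral decomposition whose eigenvalues are replaced by λ in matching order.

```python
import numpy as np

def isohash_lp(lam, iters=100, seed=0):
    """Illustrative lift-and-projection loop for the IsoHash problem.

    lam: eigenvalues λ in non-increasing order.
    Returns Z, which (if the iteration converges) lies in M(Λ) with diag(Z) ≈ a.
    """
    lam = np.asarray(lam, dtype=float)
    m = lam.size
    a = lam.sum() / m
    Lambda = np.diag(lam)
    rng = np.random.default_rng(seed)
    Q0 = np.linalg.qr(rng.standard_normal((m, m)))[0]  # random rotation: avoids the degenerate start Z = Λ
    Z = Q0.T @ Lambda @ Q0

    for _ in range(iters):
        T = Z.copy()
        np.fill_diagonal(T, a)              # projection onto T(a), cf. equation (5)
        w, U = np.linalg.eigh(T)            # spectral decomposition, eigenvalues ascending
        U = U[:, ::-1]                      # match the non-increasing order of λ
        Z = U @ Lambda @ U.T                # lift onto M(Λ): keep eigenvectors, swap in the eigenvalues λ
    return Z

Z = isohash_lp([4.0, 2.0, 1.0, 1.0])
print(np.round(np.diag(Z), 3))                              # close to [2, 2, 2, 2] when the iteration converges
print(np.round(np.sort(np.linalg.eigvalsh(Z))[::-1], 3))    # eigenvalues remain [4, 2, 1, 1]
```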

72 2 Gradient Flow Another learning algorithm is a continuous one based on the construction of a gradient flow (GF) on the surface M(Λ) that moves towards the desired intersection point. [sent-153, score-0.061]

73 Once we have computed the gradient of F , it can be projected onto the manifold O(m) according to the following Theorem 4. [sent-158, score-0.113]

74 The projection of the gradient ∇F(Q) onto O(m) is given by g(Q) = Q[Q^T Λ Q, β(Q)] (9), where [A, B] = AB − BA is the Lie bracket. [sent-160, score-0.132]
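
A small check of why (9) stays on the manifold: for symmetric A and B, the Lie bracket [A, B] = AB − BA is skew-symmetric, and Q times a skew-symmetric matrix is a tangent direction of O(m) at Q. Since β(Q) is not spelled out in this digest, a generic symmetric matrix stands in for it below (an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m)); A = (A + A.T) / 2   # symmetric, playing the role of Q^T Λ Q
B = rng.standard_normal((m, m)); B = (B + B.T) / 2   # placeholder for β(Q), assumed symmetric here

bracket = A @ B - B @ A                              # Lie bracket [A, B] = AB - BA
print(np.allclose(bracket, -bracket.T))              # True: the bracket of two symmetric matrices is skew-symmetric

Q = np.linalg.qr(rng.standard_normal((m, m)))[0]
G = Q @ bracket                                      # a direction of the form Q * (skew-symmetric)
# First-order orthogonality check: d/dt [(Q + tG)^T (Q + tG)] at t = 0 equals Q^T G + G^T Q = 0.
print(np.allclose(Q.T @ G + G.T @ Q, 0.0))           # True, so the flow direction is tangent to O(m)
```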

75 Hence, the gradient flow method can always find an intersection point as the solution. [sent-168, score-0.061]

76 • Generate a random orthogonal matrix Q0 ∈ Rm×m . [sent-172, score-0.099]

77 Since all diagonal matrices in M(Λ) result in Ż = 0, one should not use Λ as the starting point. [sent-179, score-0.045]

78 Instead, a random orthogonal transformation matrix Q_0 is used to rotate Λ. [sent-182, score-0.142]

79 The relative error of the algorithm is computed by comparing the diagonal entries of Z to the target diag(a). [sent-184, score-0.045]

80 One promising property of both LP and GF is that the time complexity after PCA is independent of the number of training points, which makes them scalable to large-scale data sets. [sent-193, score-0.041]

81 3 Relation to Existing Works The method most closely related to IsoHash is ITQ [7], because both ITQ and IsoHash have to learn an orthogonal matrix. [sent-194, score-0.099]

82 More specifically, a threshold of the average distance to the 50th nearest neighbor is used to define whether a point is a true positive or not. [sent-207, score-0.048]

83 For all experiments, we randomly select 1000 points as queries and leave the rest as the training set to learn the hash functions. [sent-209, score-0.097]

84 Although a lot of hashing methods have been proposed, some of them are either supervised [23] or semi-supervised [29]. [sent-211, score-0.247]

85 The main difference between IsoHash and PCAH is that the PCAH dimensions have anisotropic variances while IsoHash dimensions have isotropic variances. [sent-220, score-0.561]

86 Hence, the intuitive viewpoint that using the same number of bits for different projected dimensions with anisotropic variances is unreasonable has been successfully verified by our experiments. [sent-221, score-0.668]

87 [Figure residue] Precision values comparing IsoHash-GF, IsoHash-LP, PCAH, ITQ, SH, SIKH, and LSH at several code lengths; the raw numbers are omitted here.

88 Figure 1: Precision-recall curves on the LabelMe data set (panels include (c) 96 bits and (d) 256 bits).

89 The training of ITQ has two parts: the first is PCA, the same as that in IsoHash, and the second is an iterative algorithm that rotates the original PCA matrix, with time complexity O(nm^2), where n is the number of training points and m is the number of bits in the binary code. [sent-334, score-0.265]

90 Hence, the second-part time complexity of ITQ increases linearly with the number of training points. [sent-335, score-0.065]

91 This is clearly shown in Figure 2, which illustrates the training time as the number of training points is varied. [sent-338, score-0.048]

92 [Table residue] Training-time comparison of IsoHash-GF, IsoHash-LP, PCAH, ITQ, SH, SIKH, and LSH at different numbers of bits; the numeric values are omitted here.

93 The IsoHash method proposed in this paper is the first work to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). [sent-377, score-0.662]

94 Experimental results on real data sets have successfully verified the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. [sent-378, score-0.47]

95 Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. [sent-387, score-0.295]

96 Constructing a Hermitian matrix from its diagonal entries and eigenvalues. [sent-393, score-0.076]

97 The projected gradient method for least squares matrix approximations with spectral constraints. [sent-400, score-0.144]

98 Angular quantization based binary codes for fast similarity search. [sent-421, score-0.153]

99 Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. [sent-433, score-0.082]

100 Doubly stochastic matrices and the diagonal of a rotation matrix. [sent-459, score-0.045]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('isohash', 0.696), ('itq', 0.254), ('hashing', 0.247), ('qt', 0.23), ('pcah', 0.227), ('variances', 0.171), ('isotropic', 0.138), ('projection', 0.132), ('sikh', 0.128), ('bits', 0.118), ('gf', 0.103), ('xx', 0.101), ('sh', 0.098), ('lsh', 0.095), ('pca', 0.092), ('projected', 0.089), ('lift', 0.089), ('anisotropic', 0.088), ('dimensions', 0.082), ('labelme', 0.079), ('unreasonable', 0.075), ('lp', 0.07), ('orthogonal', 0.068), ('cifar', 0.065), ('quantization', 0.054), ('diag', 0.052), ('codes', 0.05), ('diagonal', 0.045), ('viewpoint', 0.045), ('dist', 0.043), ('rotate', 0.043), ('agh', 0.043), ('hash', 0.042), ('ow', 0.039), ('intersection', 0.037), ('veri', 0.036), ('hashed', 0.035), ('sgn', 0.035), ('fk', 0.034), ('binary', 0.032), ('learn', 0.031), ('hermitian', 0.031), ('matrix', 0.031), ('unequal', 0.03), ('kth', 0.03), ('eigenvalues', 0.029), ('projections', 0.028), ('quantize', 0.028), ('shanghai', 0.028), ('string', 0.027), ('hk', 0.026), ('kong', 0.026), ('tr', 0.026), ('neighbor', 0.025), ('kumar', 0.025), ('quantized', 0.025), ('rm', 0.025), ('eigenvectors', 0.025), ('spl', 0.025), ('tij', 0.025), ('researchers', 0.025), ('gradient', 0.024), ('training', 0.024), ('gong', 0.024), ('kulis', 0.024), ('lemma', 0.023), ('nearest', 0.023), ('abbreviated', 0.023), ('argued', 0.022), ('please', 0.022), ('procrustean', 0.022), ('flow', 0.022), ('china', 0.021), ('gist', 0.021), ('cvpr', 0.02), ('bit', 0.02), ('wang', 0.019), ('functions', 0.019), ('hence', 0.019), ('principal', 0.019), ('precision', 0.019), ('adopts', 0.018), ('indyk', 0.018), ('protocols', 0.018), ('rd', 0.018), ('descriptors', 0.018), ('theorem', 0.018), ('adopt', 0.018), ('comparable', 0.017), ('similarity', 0.017), ('complexity', 0.017), ('anchor', 0.017), ('prove', 0.017), ('compact', 0.017), ('code', 0.017), ('carry', 0.017), ('reformulated', 0.016), ('zij', 0.016), ('ann', 0.016), ('ai', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 163 nips-2012-Isotropic Hashing

Author: Weihao Kong, Wu-jun Li

Abstract: Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. 1

2 0.23614731 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

3 0.21115224 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

4 0.14141832 257 nips-2012-One Permutation Hashing

Author: Ping Li, Art Owen, Cun-hui Zhang

Abstract: Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.

5 0.13238209 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

6 0.11417497 329 nips-2012-Super-Bit Locality-Sensitive Hashing

7 0.097380348 71 nips-2012-Co-Regularized Hashing for Multimodal Data

8 0.075647734 126 nips-2012-FastEx: Hash Clustering with Exponential Families

9 0.074364699 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

10 0.059938841 293 nips-2012-Relax and Randomize : From Value to Algorithms

11 0.053247958 279 nips-2012-Projection Retrieval for Classification

12 0.049638201 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

13 0.041686602 254 nips-2012-On the Sample Complexity of Robust PCA

14 0.041426759 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

15 0.040489681 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

16 0.039535508 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

17 0.038138859 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

18 0.036986396 204 nips-2012-MAP Inference in Chains using Column Generation

19 0.03698054 237 nips-2012-Near-optimal Differentially Private Principal Components

20 0.035791248 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.11), (1, 0.02), (2, -0.019), (3, -0.053), (4, 0.07), (5, -0.033), (6, -0.005), (7, 0.075), (8, 0.129), (9, 0.055), (10, 0.079), (11, -0.152), (12, 0.063), (13, 0.18), (14, -0.178), (15, 0.105), (16, 0.101), (17, -0.053), (18, -0.162), (19, -0.015), (20, 0.013), (21, 0.092), (22, 0.044), (23, -0.088), (24, -0.045), (25, 0.037), (26, 0.038), (27, -0.058), (28, -0.03), (29, -0.005), (30, -0.02), (31, 0.086), (32, 0.005), (33, 0.028), (34, -0.031), (35, -0.054), (36, -0.038), (37, -0.098), (38, 0.01), (39, 0.033), (40, -0.057), (41, 0.005), (42, -0.0), (43, -0.023), (44, 0.025), (45, 0.049), (46, 0.018), (47, -0.011), (48, -0.044), (49, -0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92870474 163 nips-2012-Isotropic Hashing

Author: Weihao Kong, Wu-jun Li

Abstract: Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. 1

2 0.84938693 257 nips-2012-One Permutation Hashing

Author: Ping Li, Art Owen, Cun-hui Zhang

Abstract: Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.

3 0.84887248 329 nips-2012-Super-Bit Locality-Sensitive Hashing

Author: Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian

Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, π/2]. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments. 1

4 0.84719098 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

5 0.78316277 71 nips-2012-Co-Regularized Hashing for Multimodal Data

Author: Yi Zhen, Dit-Yan Yeung

Abstract: Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted coregularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. 1

6 0.71533221 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

7 0.55879432 148 nips-2012-Hamming Distance Metric Learning

8 0.32003444 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

9 0.3185648 202 nips-2012-Locally Uniform Comparison Image Descriptor

10 0.29763353 126 nips-2012-FastEx: Hash Clustering with Exponential Families

11 0.296437 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

12 0.26854354 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

13 0.26641089 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

14 0.26520547 279 nips-2012-Projection Retrieval for Classification

15 0.25820065 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

16 0.25032645 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

17 0.24973759 267 nips-2012-Perceptron Learning of SAT

18 0.24657191 221 nips-2012-Multi-Stage Multi-Task Feature Learning

19 0.24645227 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

20 0.24028936 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (17, 0.012), (21, 0.017), (28, 0.234), (38, 0.092), (39, 0.037), (42, 0.035), (44, 0.034), (54, 0.039), (55, 0.01), (74, 0.054), (76, 0.133), (80, 0.056), (92, 0.123)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.74910963 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

same-paper 2 0.74682188 163 nips-2012-Isotropic Hashing

Author: Weihao Kong, Wu-jun Li

Abstract: Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. 1

3 0.6967864 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd . Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α−H¨ lder functions. o 1 Setting and Previous work Let us consider some space X ⊂ Rd , and Y ⊂ R. We call X the input space or sampling space, Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ Rd → Y ⊂ R. We assume that we can query n times the function f , anywhere in the domain, and observe noisy samples of this function. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally: Observed process We consider an unknown Y-valued process defined on X , written ν : X → M+ (Y), where M+ (Y) refers to the set of all probability measures on Y, such that for all x ∈ X , 1 1 def the random variable Y (x) ∼ ν(x) has mean f (x) = E[Y (x)|x] ∈ R. We write for convenience the model in the following way Y (x) = f (x) + noise(x) , def where noise(x) = Y (x) − E[Y (x)|x] is the centered random variable corresponding to the noise, o with unknown variance σ 2 (x). We assume throughout this paper that f is α-H¨ lder. Partition We consider we can define a partition P of the input space X , with finitely many P regions {Rp }1≤p≤P that are assumed to be convex and not degenerated, i.e. such that the interior of each region Rp has positive Lebesgue volume vp . Moreover, with each region Rp is associated a sampling distribution in that region, written µp ∈ M+ (Rp ). Thus, when we decide to sample in 1 region Rp , a new sample X ∈ Rp is generated according to X ∼ µp . Allocation. We consider that we have a finite budget of n ∈ N samples that we can use in order to allocate samples as we wish among the regions {Rp }1≤p≤P . For illustration, let us assume that we deterministically allocate Tp,n ∈ N samples in region Rp , with the constraint that the allocation {Tp,n }1≤p≤P must some to n. In region Rp , we thus sample points {Xp,i }1≤p≤P at random 1 according to the sampling distribution µp , and then get the corresponding values {Yp,i }1≤i≤Tp,n , where Yp,i ∼ ν(Xp,i ). In the sequel, the distribution µp is assumed to be the uniform distribution dλ(x)1x∈R over region Rp , i.e. the density of µp is λ(Rp ) p where λ denotes the Lebesgue measure. Note that this is not restrictive since we are in an active, not passive setting. Piecewise constant mean-approximation. 
We use the collected samples in order to build a pieceˆ wise constant approximation fn of the mean f , and measure the accuracy of approximation on a region Rp with the expected quadratic norm of the approximation error, namely � � � � � ˆ (x))2 λ(dx) = Eµ ,ν (f (X) − mp,n )2 , ˆ (f (x) − fn E p λ(Rp ) Rp ˆ where mp,n is the constant value that takes fn on the region Rp . A natural choice for the estimator ˆ mp,n is to use the empirical mean that is unbiased and asymptotically optimal for this criterion. ˆ Thus we consider the following estimate (histogram) ˆ fn (x) = P � p=1 mp,n I{x ∈ Rp } where mp,n = ˆ ˆ Tp,n 1 � Tp,n Yp,i . i=1 Pseudo loss Note that, since the Tp,n are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form � � � � � � ˆ Eµp ,ν (f (X) − mp,n )2 ˆ = Eµp ,ν (f (X) − Eµp [f (X)])2 + Eµp ,ν (Eµp [f (X)] − mp,n )2 � � � � = Vµp f (X) + Vµp ,ν mp,n ˆ � � � � 1 Vµp ,ν Y (X) . = Vµp f (X) + Tp,n Now, using the following immediate decomposition � � � � � Vµp ,ν Y (X) = Vµp f (X) + σ 2 (x)µp (dx) , Rp we deduce that the maximal expected quadratic norm of the approximation error over the regions def {Rp }1≤p≤P , that depends on the choice of the considered allocation strategy A = {Tp,n }1≤p≤P is thus given by the following so-called pseudo-loss � � � � � � Tp,n + 1 1 def 2 (1) Vµp f (X) + Eµ σ (X) . Ln (A) = max 1≤ p ≤P Tp,n Tp,n p Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of �� �2 � ˆ cells tends to ∞, this local measure of performance approaches supx∈X Eν f (x) − fn (x) . At this point, let us also introduce, for convenience, the notation Qp (Tp,n ) that denotes the term inside the max, in order to emphasize the dependency on the quadratic error with the allocation. Previous work There is a huge literature on the topic of functional estimation in batch setting. Since it is a rather old and well studied question in statistics, many books have been written on this topic, such as Bosq and Lecoutre [1987], Rosenblatt [1991], Gy¨ rfi et al. [2002], where piecewise constant meano approximation are also called “partitioning estimate” or “regressogram” (first introduced by Tukey [1947]). The minimax-optimal rate of approximation on the class of α-H¨ lder functions is known o 2α to be in O(n− 2α+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy¨ rfi et al. [2002]). o In such setting, a dataset {(Xi , Yi )}i≤n is given to the learner, and a typical question is thus to try to find the best possible histogram in order to minimize a approximation error. Thus the dataset is fixed and we typically resort to techniques such as model selection where each model corresponds to one histogram (see Arlot [2007] for an extensive study of such). However, we here ask a very different question, that is how to optimally sample in an online setting in order to minimize the approximation error of some histogram. Thus we choose the histogram 2 before we see any sample, then it is fixed and we need to decide which cell to sample from at each time step. Motivation for this setting comes naturally from some recent works in the setting of active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011]. 
In these works, the objective is to estimate with equal precision the mean of a finite number of distributions (arms), which would correspond to the special case when X = {1, . . . , P } is a finite set in our setting. Intuitively, we reduce the problem to such bandit problem with finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when the arms do not correspond to a singleton, but rather to a continuous region. We show that the answer is positive, yet non trivial. This is non trivial due to the variance estimation in each region: points x in some region may have different means f(x), so that standard estimators for the variance are biased, contrary to the point-wise case and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and the noise level σ 2 in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward but optimizing the goal to understand the unknown environment. Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α−H¨ lder o functions. Section 3 presents our assumptions, that are basically to have a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. Finally, the performance of this procedure is provided and discussed in Section 4. 2 The pseudo-loss: study and optimal strategies 2.1 More intuition on each term in the pseudo-loss It is natural to look at what happens to each of the two terms that appear in equation 1 when one makes Rp shrink towards a point. More precisely, let xp be the mean of X ∼ µp and let us look at the limit of Vµp (f (X)) when vp goes to 0. Assuming that f is differentiable, we get �2 � �� lim Vµp (f (X)) = lim Eµp f (X) − f (xp ) − E[f (X) − f (xp )] vp →0 vp →0 = = = lim Eµp �� �X − xp , ∇f (xp )� − E[�X − xp , ∇f (xp )�] vp →0 � � lim Eµp �X − xp , ∇f (xp )�2 vp →0 � � lim ∇f (xp )T Eµp (X − xp )(X − xp )T ∇f (xp ) . �2 � vp →0 Therefore, if we introduce Σp to be the covariance matrix of the random variable X ∼ µp , then we simply have lim Vµp (f (X)) = lim ||∇f (xp )||2 p . Σ vp →0 vp →0 Example with hyper-cubic regions An important example is when Rp is a hypercube with side 1/d length vp and µp is the uniform distribution over the region Rp . In that case (see Lemma 1), we dx have µp (dx) = , and 2/d vp vp . ||∇f (xp )||2 p = ||∇f (xp )||2 Σ 12 More generally, when f is α−differentiable, i.e. that ∀a ∈ X , ∃∇α f (a, ·) ∈ Sd (0, 1)R such that ∀x ∈ Sd (0, 1), limh→0 f (a+hx)−f (a) = ∇α f (a, x), then it is not too difficult to show that for such hα hyper-cubic regions, we have � � � 2α � Vµp f (X) = O vpd sup |∇α f (xp , u)|2 . S(0,1) � � On the other hand, by direct computation, the second term is such that limvp →0 Eµp σ 2 (X) = � � � � σ 2 (xp ). 
Thus, while Vµp f (X) vanishes, Eµp σ 2 (X) stays bounded away from 0 (unless ν is deterministic). 3 2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation. We now assume that we are allowed to choose the partition P depending on n, thus P = Pn , amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume, and come from a regular grid of the space. Thus the only free parameter is the number of cells Pn of the partition. An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield to a closed-form solution in general. However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of minimax yet non adaptive oracle � algorithm given in closed-form solution): � � −β � � � −α� � � Lemma 1 In the case when Vµp f (X) = Ω Pn and Rp σ 2 (x)µp (dx) = Ω Pn , then an � optimal allocation and partitioning strategy An satisfies that� � � � Vµp f (X) + Eµp σ 2 (X) � � , L − Vµp f (X) � as soon as there exists, for such range of Pn , a constant L such that � � � � � Pn � Vµp f (X) + Eµp σ 2 (X) � � = n. L − Vµp f (X) p=1 1 � Pn = Ω(n max(1+α� −β� ,1) ) and def � Tp,n = The pseudo-loss of such an algorithm A� , optimal amongst the allocations strategies that use the n � partition Pn in Pn regions, is then given by � � � � def max(1 − β , 1 − α ) − 1. where γ = Ln (A� ) = Ω nγ n max(1 + α� − β � , 1) The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough. This Lemma applies to the important class W 1,2 (R) of functions that admit a weak derivative that o belongs to L2 (R). Indeed these functions are H¨ lder with coefficient α = 1/2, i.e. we have o W 1,2 (R) ⊂ C 0,1/2 (R). The standard Brownian motion is an example of function that is 1/2-H¨ lder. More generally, for k = d + α with α = 1/2 when d is odd and α = 1 when d is even, we have the 2 inclusion W k,2 (Rd ) ⊂ C 0,α (Rd ) , where W k,2 (Rd ) is the set of functions that admit a k th weak derivative belonging to L2 (Rd ). Thus the previous Lemma applies to sufficiently smooth functions with smoothness linearly increasing with the dimension d of the input space X . Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then to provide the best oracle way to allocate the budget. We can deduce the following immediate corollary on the class of α−H¨ lder functions observed in a o non-negligible noise of bounded variance (i.e. in the setting β � = 0 and α� = 2α ). d Corollary 1 Consider that f is α−H¨ lder and the noise is of bounded variance. Then a minimaxo d � d+2α ) and an optimal allocation achieves the rate L (A� ) = optimal partition satisfies Pn = Ω(n n n � −2α � Ω n d+2α . Moreover, the strategy of Lemma 1 is optimal amongst the allocations strategies that � use the partition Pn in Pn regions. � −2α � The rate Ω n d+2α is minimax-optimal on the class of α−H¨ lder functions (see Gy¨ rfi et al. [2002], o o Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial numd � � d+2α ). 
After having built the partition, if the quantities ber �� � 2 �� � � of�regions Pn that is of order Pn = Ω(n Vµp f p≤P and Eµp σ p≤P are known to the learner, it is optimal, in the aim of minimizing � the pseudo-loss, to allocate to each region the number of samples Tp,n provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate 4 the samples properly in the regions, without having any access to those quantities. It is then �� � � necessary to balance between exploration, i.e. allocating the samples in order to estimate Vµp f p≤P � � �� and Eµp σ 2 p≤P , and exploitation, i.e. use the estimates to target the optimal allocation. 3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation In this section, we now turn to the design of algorithms that are fully online, with the goal to be competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]d is divided in Pn hyper-cubic regions of same measure (the Lebesgue measure on 1 [0, 1]d ) vp = v = Pn . The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as � � � � � � 2 λ(dx) ˆ (x))2 λ(dx) = max E , max E (f (x) − fn (f (x) − mp,n ) ˆ 1≤p≤Pn 1≤p≤Pn λ(Rp ) λ(Rp ) Rp Rp ˆ where fn is the histogram estimate of the function f on the partition P and mp,n is the empirical ˆ mean defined on region Rp with the samples (Xi , Yi ) such that Xi ∈ Rp . To do so, an algorithm is only allowed to specify at each time step t, the next point Xt where to sample, based on all the past samples {(Xs , Ys )}s < ∞ satisfies that λ2 σ 2 (x) , ∀λ ∈ R+ log E exp[λ noise(x)] ≤ 2 and we further assume that it satisfies the following slightly stronger second property (that is for instance exactly verified for a Gaussian variable, looking at the moment generating function): � � � � 1 λ2 σ 2 (x) ∀λ, γ ∈ R+ log E exp λnoise(x) + γnoise(x)2 ≤ − log 1 − 2γσ 2 (x) . 2(1 − 2γσ 2 (x)) 2 5 The function f is assumed to be (L, α)-H¨ lder, meaning that it satifies o � ∀x, x ∈ X f (x) − f (x� ) ≤ L||x − x� ||α . Similarly, the function σ 2 is assumed to be (M, β)-H¨ lder i.e. it satisfies o � 2 2 � ∀x, x ∈ X σ (x) − σ (x ) ≤ M ||x − x� ||β . We assume that Y is a convex and compact subset of R, thus w.l.g. that it is [0, 1], and that it is known that ||σ 2 ||∞ , which is thus finite, is bounded by the constant 1. 3.2 Empirical estimation of the quadratic approximation error on each cell We define the sampling distribution µp in the region Rp for each p ∈ {1, . . . , Pn } as a quasi-uniform ˜ sampling scheme using the uniform distribution over the sub-regions. More precisely at time t ≤ n, if we decide to sample in the region Rp according to µp , we sample uniformly in each sub-region ˜ one sample, resulting in a new batch of samples {(Xt,k , Yt,k )}1≤k≤K , where Xt,k ∼ µp,k . Note that due to this sampling process, the number of points Tp,t sampled in sub-region Rp at time t is always Tp,t a multiple of K and that moreover for all k, k � ∈ {1, . . . , K} we have that Tp,k,t = Tp,k� ,t = K . Now this specific sampling is used in order to be able to estimate the variances Vµp f and Eµp σ 2 , � so that the best proportions Tp,n can be computed as accurately as possible. Indeed, as explained in � � � � Lemma 1, we have that Vµp f (X) + Eµp σ 2 (X) � def � � . Tp,n = L − Vµp f (X) ˆ Variance estimation We now introduce two estimators. 
The first estimator is written Vp,t and is def ˆ built in the following way. First,let us introduce the empirical estimate fp,k,t of the mean fp,k = � � Eµp,k f (X) of f in sub-region Rp,k . Similarly, to avoid some cumbersome notations, we introduce � � � � � � def def def 2 fp = Eµp f (X) and vp,k = Vµp,k f (X) for the function f , and then σp,k = Eµp,k σ 2 (X) for the variance of the noise σ 2 . We now define the empirical variance estimator to be K 1 � ˆ ˆ (fp,k,t − mp,t )2 , ˆ Vp,t = K −1 k=1 that is a biased estimator. Indeed, for a deterministic Tp,t , it is not difficult to show that we have � K K � � � � � � �� � � � � 2 1 �� 1 � ˆ E Vp,t + Eµp,k f − Eµp f = Vµp,k f + Eµp,k σ 2 . K −1 Tp,t k=1 k=1 � � The leading term in this decomposition, that is given by the first sum, is closed to Vµp f since, by using the assumption that f is (L, α)−H¨ lder, we have the following inequality o � � K � �� �� � �1 � � � � 2 2L2 dα � Eµp,k f − Eµp f − Vµp f (X) � ≤ , � �K (KPn )2α/d k=1 where we also used that the diameter of a sub-region Rp,k is given by diam(Rp,k ) = d1/2 . (KPn )1/d ˆ Then, the second term also contributes to the bias, essentially due to the fact that V[fp,k,t ] = � � � � 2 def def 1 1 2 2 2 Tp,k,t (vp,k + σp,k ) and not Tp,t (vk + σk ) (with vp = Vµp f (X) and σp = Eµp σ (X) ). ˆ p,k,t In order to correct this term, we now introduce the second estimator σ 2 that estimates the variance � � � � � � of the outputs in a region Rp,k , i.e. Vµp,k ,ν Y (X) = Vµp,k f (X) + Eµp,k σ 2 . It is defined as �2 t t �� 1 1 � def ˆ p,k,t = Yi − Yj I{Xj ∈ Rp,k } I{Xi ∈ Rp,k } . σ2 Tp,k,t − 1 i=1 Tp,k,t j=1 Now, we combine the two previous estimators to form the following estimator K 1 �� 1 1 � 2 ˆ ˆ ˆ σ − . Qp,t = Vp,t − K Tp,k,t Tp,t p,k,t k=1 ˆ The following proposition provides a high-probability bound on the difference between Qp,t and the quantity we want to estimate. We report the detailed proof in [Carpentier and Maillard, 2012]. 6 ˆ Proposition 1 By the assumption that f is (L, α)-H¨ lder, the bias of the estimator Qp,t , and for o deterministic Tp,t , is given by � K � � � � � � � � � 2 1 � 2L2 dα ˆ − Vµp f (X) ≤ . Eµp,k f − Eµp f E Qp,t − Qp (Tp,t ) = K (KPn )2α/d k=1 Moreover, it satisfies that for all δ ∈ [0, 1], there exists an event of probability higher than 1 − δ such that on this event, we have � � � � � � K K � � � � 8 log(4/δ) � σ 2 �1 � � � � ˆ p,k,t 1 � 2 ˆ ˆ � Qp,t − E Qp,t � ≤ � √ +o σ p,k . � � (K − 1)2 T2 T K K k=1 p,k,t p,k,t k=1 We also state the following Lemma that we are going to use in the analysis, and that takes into account randomness of the stopping times Tp,k,t . Lemma 2 Let {Xp,k,u }p≤P, k≤K, u≤n be samples potentially sampled in region Rp,k . We introduce qp,u to be the�equivalent of Qp (Tp,t ) with explicitly fixed value of Tp,t = u. Let also qp,u be the ˆ � ˆ p,t but computed with the first u samples in estimate of E qp,u , that is to say the equivalent of Q each region Rp,k (i.e. Tp,t = u). Let us define the event � � � � � � � AK log(4nP/δ)V � � ˆp,t 2L2 dα � � ξn,P,K (δ) = + ω : � qp,u (ω) − E qp,u � ≤ ˆ , u K −1 (KPn )2α/d p≤P u≤n �K 1 ˆ ˆ ˆ p,k,t and where A ≤ 4 is a numerical constant. Then it where Vp,t = Vp (Tp,t ) = K−1 k=1 σ 2 holds that � � P ξn,P,K (δ) ≥ 1 − δ . Note that, with the notations of this Lemma, Proposition 1 above is thus about qp,u . 
3.3 The Online allocation and homogeneous partitioning algorithm for piecewise constant mean-approximation (OAHPA-pcma)

We are now ready to state the algorithm that we propose for minimizing the quadratic error of approximation of $f$. The algorithm is described in Figure 1. Although it looks similar, this algorithm is quite different from a normal UCB algorithm, since $\hat Q_{p,t}$ decreases in expectation with $T_{p,t}$. Indeed, its expectation is close to $\mathbb{V}_{\mu_p} f + \frac{1}{K T_{p,t}}\sum_{k=1}^{K}\big(\mathbb{V}_{\mu_{p,k}} f + \mathbb{E}_{\mu_{p,k}}\sigma^2\big)$.

Algorithm 1 OAHPA-pcma.
1: Input: $A$, $L$, $\alpha$, horizon $n$; partition $\{R_p\}_{p\le P}$, with sub-partitions $\{R_{p,k}\}_{k\le K}$.
2: Initialization: Sample $K$ points in every sub-region $\{R_{p,k}\}_{p\le P,\,k\le K}$.
3: for $t = K^2 P + 1$; $t \le n$; $t = t + K$ do
4:   Compute $\forall p$, $\hat Q_{p,t}$.
5:   Compute $\forall p$, $B_{p,t} = \hat Q_{p,t} + \sqrt{\frac{AK\log(4nP/\delta)\,\hat V_{p,t}}{(K-1)\,T_{p,t}}} + \frac{2L^2 d^\alpha}{(KP_n)^{2\alpha/d}}$.
6:   Select the region $p_t = \arg\max_{1\le p\le P_n} B_{p,t}$ where to sample.
7:   Sample $K$ samples in region $R_{p_t}$, one per sub-region $R_{p_t,k}$, according to $\mu_{p_t,k}$.
8: end for

4 Performance of the allocation strategy and discussion

Here is the main result of the paper; see the full version [Carpentier and Maillard, 2012] for the proof. We remind that the objective is to minimize for an algorithm $\mathcal{A}$ the pseudo-loss $L_n(\mathcal{A})$.

Theorem 1 (Main result) Let $\gamma = \frac{\max_p T^\star_{p,n}}{\min_p T^\star_{p,n}}$ be the distortion factor of the optimal allocation strategy, and let $\epsilon > 0$. Then with the choice of the number of regions $P_n = n^{\frac{d}{2\alpha+d}}\epsilon^{2+\frac{d}{2\alpha}}$, and of the number of sub-regions $K = C^{\frac{2d}{4\alpha+d}}\epsilon^{-2-\frac{d}{\alpha}}$, where $C \stackrel{\mathrm{def}}{=} \frac{8L^2 d^{\alpha}}{A\,d^{1-\alpha}}$, the pseudo-loss of the OAHPA-pcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than $1-\delta$,
$$L_n(\mathcal{A}) \le \Big(1 + \epsilon\gamma C' \sqrt{\log(1/\delta)}\Big)\, L_n(\mathcal{A}^\star_n) + o\big(n^{-\frac{2\alpha}{2\alpha+d}}\big),$$
for some numerical constant $C'$ not depending on $n$, where $\mathcal{A}^\star_n$ is the oracle of Lemma 1.

Minimax-optimal partitioning and $\epsilon$-adaptive performance. Theorem 1 provides a high-probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function $f$, see Lemma 1) allocation $\mathcal{A}^\star_n$ on a partition with a number of cells $P_n$ chosen to be of minimax order $n^{\frac{d}{2\alpha+d}}$ for the class of $\alpha$-Hölder functions. In particular, since $L_n(\mathcal{A}^\star_n) = O(n^{-\frac{2\alpha}{d+2\alpha}})$ on that class, we recover the same minimax order as what is obtained in the batch learning setting, when using for instance wavelets or kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). But moreover, due to the adaptivity of $\mathcal{A}^\star_n$ to the function itself, this procedure is also $\epsilon$-adaptive to the function and not only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, in the same way as for any classical functional estimation method, when the smoothness of the function increases. Similarly, in agreement with the classical curse of dimensionality, the higher the dimension of the domain, the less efficient the method.

Limitations. In this work, we assume that the smoothness $\alpha$ of the function is available to the learner, which enables her to calibrate $P_n$ properly. Now it makes sense to combine the OAHPA-pcma procedure with existing methods that enable to estimate this smoothness online (under a slightly stronger assumption than Hölder, such as Hölder functions that attain their exponents, see Giné and Nickl [2010]).
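The loop of Algorithm 1 can be sketched in Python as follows, under the reconstruction of the index $B_{p,t}$ given above. The sampling callback, the data structures, and the requirement $K \ge 2$ (so that sample variances are defined) are assumptions made for illustration, not part of the paper.

import numpy as np

def oahpa_pcma(sample, P, K, n, A, L, alpha, d, delta):
    """Sketch of the OAHPA-pcma loop. `sample(p)` is a hypothetical callback
    drawing one point uniformly in each of the K sub-regions of R_p and
    returning the K observed outputs."""
    ys = [[np.array([]) for _ in range(K)] for _ in range(P)]  # outputs per (p, k)

    def pull(p):
        batch = sample(p)                        # one output per sub-region
        for k in range(K):
            ys[p][k] = np.append(ys[p][k], batch[k])

    # Initialization: K points in every sub-region, i.e. K batches per region.
    for p in range(P):
        for _ in range(K):
            pull(p)

    t = K * K * P
    while t + K <= n:
        B = np.empty(P)
        for p in range(P):
            T_pk = np.array([len(y) for y in ys[p]])
            T_p = T_pk.sum()
            f_hat = np.array([y.mean() for y in ys[p]])
            sigma2_hat = np.array([y.var(ddof=1) for y in ys[p]])
            Q_hat = (np.sum((f_hat - f_hat.mean()) ** 2) / (K - 1)
                     - np.sum((1.0 / T_pk - 1.0 / T_p) * sigma2_hat) / K)
            V_hat = sigma2_hat.sum() / (K - 1)   # V_hat_{p,t} as in Lemma 2
            bonus = np.sqrt(A * K * np.log(4 * n * P / delta) * V_hat
                            / ((K - 1) * T_p))
            bias = 2 * L ** 2 * d ** alpha / (K * P) ** (2 * alpha / d)
            B[p] = Q_hat + bonus + bias          # upper confidence index B_{p,t}
        pull(int(np.argmax(B)))                  # sample K points in region p_t
        t += K

    return ys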
It is thus interesting, when no preliminary knowledge on the smoothness of $f$ is available, to spend some of the initial budget in order to estimate $\alpha$. We have seen that the OAHPA-pcma procedure, although very simple, manages to get minimax-optimal results. Now the downside of the simplicity of the OAHPA-pcma strategy is two-fold. The first limitation is that the factor $\big(1 + \epsilon\gamma C'\sqrt{\log(1/\delta)}\big) = \big(1 + O(\epsilon)\big)$ appearing in the bound before $L_n(\mathcal{A}^\star_n)$ is not 1, but higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: if $\epsilon$ is chosen to be small, then the error with respect to the optimal allocation is small. However, since $P_n$ is expressed as an increasing function of $\epsilon$, this implies that the minimax bound on the loss for partition $\mathcal{P}$ also increases with $\epsilon$. That said, in view of the work on active learning in multi-armed bandits that we extend, we would still prefer to get the optimal constant 1. The second limitation is more problematic: since $K$ is chosen irrespective of the region $R_p$, this causes the presence of the factor $\gamma$. Thus the algorithm will essentially no longer enjoy near-optimal performance guarantees when the optimal allocation strategy is highly non-homogeneous.

Conclusion and future work. In this paper, we considered online regression with histograms in an active setting (we select in which bin to sample), and when we can choose the histogram in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on $\alpha$-Hölder functions but with the adaptive oracle that uses a minimax-optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non-trivial extension of the setting of active learning for multi-armed bandits to the case where each arm corresponds to one continuous region of a sampling space, as opposed to a singleton, which can also be seen as a problem of non-parametric function approximation. This new setting offers interesting challenges: we provided a simple procedure, based on the computation of upper confidence bounds of the estimation of the local quadratic error of approximation, and provided a performance analysis that shows that OAHPA-pcma is first-order $\epsilon$-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of $\alpha$-Hölder functions. However, this simplicity also has a drawback if one is interested in building an exactly first-order optimal procedure, and going beyond these limitations is definitely not trivial: a more optimal but much more complex algorithm would indeed need to tune a different factor $K_p$ in each cell in an online way, i.e. define some $K_{p,t}$ that evolves with time, and redefine sub-regions accordingly. Now, the analysis of OAHPA-pcma already makes use of powerful tools such as empirical Bernstein bounds for variance estimation (and not only for mean estimation), which already makes it non-trivial; in order to handle possibly evolving sub-regions and deal with the progressive refinement of the regions, we would need an even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work.
Acknowledgements This research was partially supported by the Nord-Pas-de-Calais Regional Council, the French ANR EXPLO-RA (ANR-08-COSI-004), and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 270327 (CompLACS) and no. 216886 (PASCAL2).

References

András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.

Sylvain Arlot. Rééchantillonnage et Sélection de modèles. PhD thesis, Université Paris-Sud - Paris XI, 2007.

A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009.

D. Bosq and J.P. Lecoutre. Théorie de l'estimation fonctionnelle, volume 21. Economica, 1987.

Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893.

Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011.

E. Giné and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2):1122–1170, 2010.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Verlag, 2002.

I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981.

M. Rosenblatt. Stochastic curve estimation, volume 3. Institute of Mathematical Statistics, 1991.

J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

C.J. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pages 1348–1360, 1980.

J.W. Tukey. Non-parametric estimation II. Statistically equivalent blocks and tolerance regions – the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947.

4 0.67054117 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1
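The basic step described in this abstract, mapping a non-negative vector to the binary hypercube vertex with the smallest angle, can be sketched in a few lines consistent with the stated O(d log d) cost: sort the coordinates once; for each candidate number k of ones, the best vertex keeps the k largest coordinates, and the cosine reduces to the ratio of their sum to the square root of k. The function name and interface below are hypothetical, and the learned linear transformation mentioned in the abstract is not included.

import numpy as np

def smallest_angle_binary_vertex(x):
    """Return the binary hypercube vertex with the largest cosine similarity
    to the non-negative vector x (equivalently, the smallest angle)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)                      # coordinates by decreasing value
    prefix = np.cumsum(x[order])                # sum of the k largest entries
    k_range = np.arange(1, x.size + 1)
    scores = prefix / np.sqrt(k_range)          # cosine up to the constant 1/||x||
    k_best = int(np.argmax(scores)) + 1         # number of ones in the best vertex
    b = np.zeros(x.size, dtype=np.uint8)
    b[order[:k_best]] = 1
    return b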

5 0.66170156 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

6 0.65947425 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

7 0.65932029 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

8 0.65273345 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

9 0.64975965 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

10 0.64657217 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

11 0.64047641 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

12 0.63758719 148 nips-2012-Hamming Distance Metric Learning

13 0.63334942 329 nips-2012-Super-Bit Locality-Sensitive Hashing

14 0.63283962 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

15 0.62937814 193 nips-2012-Learning to Align from Scratch

16 0.62474334 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

17 0.62445509 260 nips-2012-Online Sum-Product Computation Over Trees

18 0.62282199 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

19 0.62195855 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

20 0.61973947 197 nips-2012-Learning with Recursive Perceptual Representations