nips nips2012 nips2012-237 knowledge-graph by maker-knowledge-mining

237 nips-2012-Near-optimal Differentially Private Principal Components


Source: pdf

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many current data sets of interest contain private or sensitive information about individuals. [sent-7, score-0.375]

2 Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. [sent-8, score-0.596]

3 Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. [sent-9, score-1.173]

4 In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. [sent-10, score-0.728]

5 We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. [sent-12, score-0.119]

6 One of the oldest and most classical methods for dimensionality reduction is principal components analysis (PCA), which computes a low-rank approximation to the second moment matrix of a set of points in Rd . [sent-16, score-0.128]

7 The rank k of the approximation is chosen to be the intrinsic dimension of the data. [sent-17, score-0.083]

8 We view this procedure as specifying a k-dimensional subspace of Rd . [sent-18, score-0.102]

9 Much of today’s machine-learning is performed on the vast amounts of personal information collected by private companies and government agencies about individuals, such as customers, users, and subjects. [sent-19, score-0.333]

10 These datasets contain sensitive information about individuals and typically involve a large number of features. [sent-20, score-0.072]

11 We study approximations to PCA which guarantee differential privacy, a cryptographically motivated definition of privacy [9] that has gained significant attention over the past few years in the machine-learning and data-mining communities [19, 21, 20, 10, 23]. [sent-22, score-0.7]

12 Differential privacy measures privacy risk by a parameter α that bounds the log-likelihood ratio of the output of a (private) algorithm under two databases differing in a single individual. [sent-23, score-1.14]

13 There are many general tools for providing differential privacy. [sent-24, score-0.118]

14 The sensitivity method [9] computes the desired algorithm (PCA) on the data and then adds noise proportional to the maximum change that can be induced by changing a single point in the data set. [sent-25, score-0.179]
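
To make the sensitivity method concrete, here is a minimal sketch (not from the paper; the function names and the bounded-mean example are illustrative) that releases a scalar statistic with Laplace noise scaled to sensitivity/α, the standard calibration for α-differential privacy of a scalar release.

```python
import numpy as np

def laplace_mechanism(data, query, sensitivity, alpha, rng=None):
    """Release query(data) with Laplace noise of scale sensitivity / alpha.

    `query` maps a data set to a scalar; `sensitivity` must bound the maximum
    change in query(data) when a single record is modified.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_value = query(data)
    noise = rng.laplace(loc=0.0, scale=sensitivity / alpha)
    return true_value + noise

# Example: privately release the mean of n records bounded in [0, 1].
# Changing one record moves the mean by at most 1/n, so sensitivity = 1/n.
data = np.random.rand(1000)
private_mean = laplace_mechanism(data, np.mean, sensitivity=1.0 / len(data), alpha=0.5)
```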

15 The PCA algorithm is very sensitive in this sense because the top eigenvector can change by 90° by changing one point in the data set. [sent-26, score-0.124]

16 [2] adds noise to the second moment matrix and then runs PCA on the noisy matrix. [sent-29, score-0.109]

17 The general SULQ method does not take into account the quality of approximation to the nonprivate PCA output. [sent-31, score-0.123]

18 We address this by proposing a new method, PPCA, that is an instance of the exponential mechanism of McSherry and Talwar [22]. [sent-32, score-0.07]

19 For any k < d, this differentially private method outputs a k-dimensional subspace; the output is biased towards subspaces which are close to the output of PCA. [sent-33, score-0.673]

20 In order to understand the performance gap, we prove sample complexity bounds in case of k = 1 for SULQ and PPCA, as well as a general lower bound on the sample complexity for any differentially private algorithm. [sent-36, score-0.746]

21 Furthermore, any differentially private algorithm requires Ω(d) samples, showing that PPCA is nearly optimal in terms of sample complexity as a function of data dimension. [sent-38, score-0.694]

22 These theoretical results suggest that our experiments exhibit the limit of how well α-differentially private algorithms can perform, and our experiments show that this gap should persist for general k. [sent-39, score-0.391]

23 Differential privacy is a mathematical definition, but algorithms must be implemented using finite-precision machines. [sent-42, score-0.533]

24 A second set of issues is theoretical – while the privacy guarantees of PPCA hold for all k, our theoretical analysis of sample complexity applies only to k = 1 in which the distance and angles between vectors are related. [sent-44, score-0.641]

25 An interesting direction is to develop theoretical bounds for general k; challenges here are providing the right notion of approximation of PCA, and extending the theory using packings of Grassman or Stiefel manifolds. [sent-45, score-0.104]

26 The data set is D = {x1 , . . . , xn }, where each xi corresponds to the private value of one individual, xi ∈ Rd , and ∥xi ∥ ≤ 1 for all i. [sent-49, score-0.333]

27 Let X = [x1 , . . . , xn ] be the matrix whose columns are the data vectors {xi }. [sent-53, score-0.072]

28 A popular solution is to compute a rank-k matrix Â which minimizes the norm ∥A − Â∥F , where k is much lower than the data dimension d. [sent-57, score-0.087]

29 The matrix A decomposes as A = V ΛV T , (1) where V is an orthonormal matrix of eigenvectors. [sent-62, score-0.106]

30 The top-k subspace of A is the matrix Vk (A) = [v1 v2 · · · vk ] , (2) where vi is the i-th column of V in (1). [sent-63, score-0.239]

31 Given the top-k subspace and the eigenvalue matrix Λ, we can form an approximation A(k) = Vk (A) Λk Vk (A)T to A, where Λk contains the k largest eigenvalues in Λ. [sent-64, score-0.213]

32 For a d × k matrix V̂ with orthonormal columns, the quality of V̂ in approximating A can be measured by qF (V̂ ) = tr(V̂ T A V̂ ). [sent-67, score-0.071]

33 For these results, we measure the inner product between the output vector v̂1 and the true top eigenvector v1 : qA (v̂1 ) = |⟨v̂1 , v1 ⟩| . (4) [sent-70, score-0.088]
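
As a concrete reference for these quantities, here is a minimal numpy sketch (variable and function names are ours, not the paper's) that builds A from norm-bounded columns, extracts the top-k subspace of (1)-(2), and evaluates the utilities qF and qA defined above.

```python
import numpy as np

def second_moment_matrix(X):
    """X is d x n with columns x_i, ||x_i|| <= 1; returns A = (1/n) X X^T."""
    return X @ X.T / X.shape[1]

def top_k_subspace(A, k):
    """Return V_k(A) = [v_1 ... v_k], the top-k eigenvectors of symmetric A."""
    eigvals, eigvecs = np.linalg.eigh(A)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:k]]

def utility_qF(V_hat, A):
    """q_F(V_hat) = tr(V_hat^T A V_hat)."""
    return float(np.trace(V_hat.T @ A @ V_hat))

def utility_qA(v_hat, v1):
    """q_A(v_hat) = |<v_hat, v1>| for the k = 1 case."""
    return abs(float(v_hat.ravel() @ v1.ravel()))
```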

34 A randomized algorithm A(·) is a (ρ, η)-close approximation to the top eigenvector if for all data sets D of n points, P (qA (A(D)) ≥ ρ) ≥ 1 − η, (5) where the probability is taken over A(·). [sent-75, score-0.127]

35 We study approximations Â to A that preserve the privacy of the underlying data. [sent-76, score-0.561]

36 The notion of privacy that we use is differential privacy, which quantifies the privacy guaranteed by a randomized algorithm P applied to a data set D. [sent-77, score-1.219]

37 An algorithm A(B) taking values in a set T provides (α, δ)-differential privacy if P (A(D) ∈ S) ≤ eα P (A(D′ ) ∈ S) + δ, (7) for all measurable S ⊆ T and all data sets D and D′ differing in a single entry. [sent-82, score-0.574]

38 Here α and δ are privacy parameters, where low α and δ ensure more privacy. [sent-83, score-0.533]

39 The second privacy guarantee is weaker; the parameter δ bounds the probability of failure, and is typically chosen to be quite small. [sent-85, score-0.579]

40 In this paper we are interested in proving results on the sample complexity of differentially private algorithms that approximate PCA. [sent-86, score-0.657]

41 That is, for a given α and ρ, how large must the number of individuals n in the data set be such that it is α-differentially private and also a (ρ, η)-close approximation to PCA? [sent-87, score-0.426]

42 It is well known that as the number of individuals n grows, it is easier to guarantee the same level of privacy with relatively less noise or perturbation, and therefore the utility of the approximation also improves. [sent-88, score-0.765]

43 Our results characterize how privacy and utility scale with n and the tradeoff between them for fixed n. [sent-89, score-0.64]

44 Related Work Differential privacy was proposed by Dwork et al. [sent-90, score-0.533]

45 [9], and has spawned an extensive literature of general methods and applications [1, 21, 27, 6, 24, 3, 22, 10]. Differential privacy has been shown to have strong semantic guarantees [9, 17] and is resistant to many attacks [12] that succeed against some other definitions of privacy. [sent-91, score-0.571]

46 There are several standard approaches for designing differentially-private data-mining algorithms, including input perturbation [2], output perturbation [9], the exponential mechanism [22], and objective perturbation [6]. [sent-92, score-0.278]

47 Finally, dimension reduction through random projection has been considered as a technique for sanitizing data prior to publication [18]; our work differs from this line of work in that we offer differential privacy guarantees, and we only release the PCA subspace, not actual data. [sent-100, score-0.754]

48 Independently, Kapralov and Talwar [16] have proposed a dynamic programming algorithm for differentially private low rank matrix approximation which involves sampling from a distribution induced by the exponential mechanism. [sent-101, score-0.717]

49 3 Algorithms and results In this section we describe differentially private techniques for approximating (2). [sent-103, score-0.593]

50 Both procedures provide differentially private approximations to the top-k subspace: SULQ provides (α, δ)-differential privacy and PPCA provides α-differential privacy. [sent-106, score-1.154]

51 The SULQ method perturbs each entry of the empirical second moment matrix A to ensure differential privacy and releases the top k eigenvectors of this perturbed matrix. [sent-109, score-0.76]

52 This guarantees a weaker privacy definition known as (α, δ)-differential privacy. [sent-114, score-0.55]

53 One problem with this approach is that with probability 1 the matrix A + N is not symmetric, so the largest eigenvalue may not be real and the entries of the corresponding eigenvector may be complex. [sent-115, score-0.097]

54 However, a simple modification to the basic SULQ approach does guarantee (α, δ)-differential privacy. [sent-117, score-0.139]

55 Note that this matrix is symmetric but not necessarily positive semidefinite, so some eigenvalues may be negative but the eigenvectors are all real. [sent-123, score-0.09]

56 Algorithm 1: Algorithm MOD-SULQ (input perturbation). Inputs: d × n data matrix X, privacy parameter α, parameter δ. Outputs: d × k matrix V̂k = [v̂1 v̂2 · · · v̂k ] with orthonormal columns. 1: Set A = (1/n) XX T . [sent-125, score-0.76]
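
A minimal sketch of this input-perturbation step is given below (our naming; the calibration of the noise level β to (α, δ) follows the paper's formula, which is not reproduced here, so β is treated as a given input).

```python
import numpy as np

def mod_sulq_sketch(X, k, beta, rng=None):
    """Input-perturbation sketch in the spirit of MOD-SULQ.

    Adds a symmetric Gaussian noise matrix N (built from entries with standard
    deviation beta, then symmetrized) to A = (1/n) X X^T and returns the top-k
    eigenvectors of A + N.  The calibration of beta to (alpha, delta) is NOT
    reproduced here; beta is an input treated as already chosen correctly.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    A = X @ X.T / n
    G = rng.normal(scale=beta, size=(d, d))
    N = (G + G.T) / 2.0                     # symmetric noise, so eigenvectors are real
    eigvals, eigvecs = np.linalg.eigh(A + N)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:k]]
```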

57 Our new method, PPCA, randomly samples a k-dimensional subspace from a distribution that ensures differential privacy and is biased towards high utility. [sent-133, score-0.753]

58 The distribution from which our released subspace is sampled is known in the statistics literature as the matrix Bingham distribution [7], which we denote by BMFk (B). [sent-134, score-0.137]

59 The algorithm is stated in terms of general k < d but our theoretical results focus on the special case k = 1 where we wish to release a one-dimensional approximation to the data covariance matrix. [sent-135, score-0.097]

60 The matrix Bingham distribution takes values on the set of all k-dimensional subspaces of Rd and has a density equal to f (V ) = exp(tr(V T BV )) / 1F1( k/2, d/2, B ), (8) where V is a d × k matrix whose columns are orthonormal and 1F1 is a hypergeometric function [7, p. [sent-136, score-0.156]
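
For k = 1 the density (8) reduces to f(v) ∝ exp(vT B v) on the unit sphere. The sketch below samples from it with a simple random-walk Metropolis chain rather than the Gibbs sampler of Hoff used in the paper; the scaling B = (nα/2)A shown in the usage comment is our reading of the exponential mechanism with the paper's utility, and should be checked against the paper before use.

```python
import numpy as np

def sample_bingham_k1(B, n_steps=5000, step=0.1, rng=None):
    """Metropolis sketch for the k = 1 Bingham density f(v) ∝ exp(v^T B v)
    on the unit sphere.  This is a conceptual stand-in for the Gibbs sampler
    used in the paper; the random-walk proposal is symmetric on the sphere,
    so the plain Metropolis acceptance rule applies.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = B.shape[0]
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    log_f = v @ B @ v
    for _ in range(n_steps):
        proposal = v + step * rng.normal(size=d)
        proposal /= np.linalg.norm(proposal)
        log_f_prop = proposal @ B @ proposal
        if np.log(rng.uniform()) < log_f_prop - log_f:
            v, log_f = proposal, log_f_prop
    return v

# PPCA-style use for k = 1 (assumed scaling): B = (n * alpha / 2) * A, where
# A = (1/n) X X^T, so the sampled direction is biased toward the top eigenvector.
```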

61 There are other general algorithmic strategies for guaranteeing differential privacy. [sent-145, score-0.118]

62 The sensitivity method [9] adds noise proportional to the maximum change that can be induced by changing a single point in the data set. [sent-146, score-0.162]

63 Consider a data set D with m + 1 copies of a unit vector u and m copies of a unit vector u′ with u ⊥ u′ . [sent-147, score-0.081]

64 Thus the global sensitivity does not scale with the number of data points, so as n increases the variance of the noise required by the Laplace mechanism [9] will not decrease. [sent-150, score-0.151]
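
The two-cluster example above is easy to verify numerically; this snippet (ours, with d = 2 for clarity) shows that moving a single point flips the top eigenvector from u to u′ regardless of m.

```python
import numpy as np

# Worst-case sensitivity of the top eigenvector: with m + 1 copies of u and m
# copies of an orthogonal u', moving one copy of u to u' rotates the top
# eigenvector by 90 degrees, no matter how large m is.
m = 1000
u = np.array([1.0, 0.0])
u_prime = np.array([0.0, 1.0])

def top_eigvec(counts):
    cu, cup = counts
    A = (cu * np.outer(u, u) + cup * np.outer(u_prime, u_prime)) / (cu + cup)
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argmax(vals)]

print(top_eigvec((m + 1, m)))   # ≈ ±u
print(top_eigvec((m, m + 1)))   # ≈ ±u'  (after changing a single point)
```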

65 An alternative to global sensitivity is smooth sensitivity [24]; except for special cases, such as the sample median, smooth sensitivity is difficult to compute for general functions. [sent-151, score-0.206]

66 Our theoretical results are sample complexity bounds for PPCA and MOD-SULQ as well as a general lower bound on the sample complexity for any α-differentially private algorithm. [sent-154, score-0.508]

67 These results show that PPCA is nearly optimal in terms of the scaling of the sample complexity with respect to the data dimension d, privacy parameter α, and the eigengap. [sent-155, score-0.687]

68 We further show that MOD-SULQ requires more samples as a function of d, despite having a slightly weaker privacy guarantee. [sent-156, score-0.55]

69 Even though both algorithms can output the top-k PCA subspace for general k ≤ d, we prove results for the case k = 1. [sent-158, score-0.127]

70 Finding the scaling behavior of the sample complexity with k is an interesting open problem that we leave for future work; challenges here are finding the right notion of approximation of the PCA, and extending the theory using packings of Grassman or Stiefel manifolds. [sent-159, score-0.139]

71 For the noise calibration given in Algorithm 1, the MOD-SULQ algorithm is (α, δ)-differentially private. [sent-161, score-0.26]

72 The fact that these two algorithms are differentially private follows from some simple calculations. [sent-164, score-0.593]

73 Our first sample complexity result provides an upper bound on the number of samples required by PPCA to guarantee a certain level of privacy and accuracy. [sent-165, score-0.618]

74 The sample complexity n of PPCA grows linearly with the dimension d, inversely with α, and inversely with the correlation gap (1 − ρ) and eigenvalue gap λ1 (A) − λ2 (A). [sent-166, score-0.229]

75 For any ρ, no α-differentially private algorithm A can approximate PCA with expected utility greater than ρ on all databases with n points in dimension d having eigenvalue gap ∆, where n < max{ d/(α∆), (180/(α∆)) · √d/√(1 − ρ) }. [sent-172, score-0.531]

76 Theorem 3 shows that if n scales like (d/(α∆(1 − ρ))) log(1/(1 − ρ2 )) then PPCA produces an approximation v̂1 that has correlation ρ with v1 , whereas Theorem 4 shows that n must scale like √d/(α∆ √(1 − ρ)) for any α-differentially private algorithm. [sent-173, score-0.362]

77 There are constants c and c′ such that if n < c (d3/2 √log(d/δ) / α) (1 − c′ (1 − ρ)), then there is a dataset of size n in dimension d such that the top PCA direction v1 and the output v̂1 of MOD-SULQ satisfy E[|⟨v̂1 , v1 ⟩|] ≤ ρ. [sent-177, score-0.081]

78 We chose k to be 4, 8, 10, and 11 such that the top-k PCA subspace had qF (Vk ) at least 80% of ∥A∥F . [sent-184, score-0.122]

79 Standard PCA is non-private; changing a single data point will change the output, and hence violate differential privacy. [sent-191, score-0.154]

80 We measured the utility qF (U ), where U is the k-dimensional subspace output by the algorithm; qF (U ) is maximized when U is the top-k PCA subspace, and thus this reflects how close the output subspace is to the true PCA subspace in terms of representing the data. [sent-192, score-0.463]

81 Figures 1(a), 1(b), 1(c), and 1(d) show qF (U ) as a function of sample size for the k-dimensional subspace output by PPCA, MOD-SULQ, non-private PCA, and random projections. [sent-194, score-0.156]
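
A hypothetical harness in the same spirit (reusing the sketches defined earlier in this summary; the synthetic data, β, and seed are arbitrary placeholders rather than the paper's experimental setup) would compare the utilities as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5000))
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))      # enforce ||x_i|| <= 1
A = second_moment_matrix(X)
k = 2

V_pca = top_k_subspace(A, k)                          # non-private PCA
V_sulq = mod_sulq_sketch(X, k, beta=0.05, rng=rng)    # beta is illustrative only
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))         # random-projection baseline
for name, V in [("non-private PCA", V_pca), ("MOD-SULQ sketch", V_sulq), ("random", Q)]:
    print(name, utility_qF(V, A))
```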

82 Figure 1: Utility qF (U ) for the four data sets; panels (c) localization and (d) insurance. [sent-218, score-0.094]

83 Our theoretical results suggest that the performance of differentially private PCA cannot be significantly improved over these experiments. [sent-238, score-0.615]

84 A common use of a dimension reduction algorithm is as a precursor to classification or clustering; to evaluate the effectiveness of the different algorithms, we projected the data onto the subspace output by the algorithms, and measured the classification accuracy using the projected data. [sent-240, score-0.201]

85 We projected the classification data onto the subspace computed based on the holdout set; 10% of this data was used for training and parameter-tuning, and the rest for testing. [sent-247, score-0.161]
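
A sketch of this projection-then-classify evaluation follows (nearest centroid is used as a simple stand-in for the classifiers in the paper's experiments; all names are ours).

```python
import numpy as np

def project(X, V):
    """Project d x n data X onto the k-dimensional subspace with basis V (d x k)."""
    return V.T @ X                          # k x n matrix of coefficients

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, V):
    """Train and test a nearest-centroid classifier on data projected onto V."""
    Z_train, Z_test = project(X_train, V), project(X_test, V)
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[:, y_train == c].mean(axis=1) for c in classes])
    dists = np.linalg.norm(Z_test.T[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[np.argmin(dists, axis=1)]
    return float(np.mean(predictions == y_test))
```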

86 To check the effect of the privacy requirement, we generated a synthetic data set of n = 5,000 points drawn from a Gaussian distribution in d = 10 with mean 0 and whose covariance matrix had eigenvalues {0. [sent-265, score-0.612]
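
A sketch of such a synthetic set is below; the eigenvalue list in the sentence above is truncated by the extraction, so the values used here are placeholders chosen only so that the top two eigenvalues carry most of the energy, not the paper's actual values.

```python
import numpy as np

def synthetic_gaussian(n=5000, eigenvalues=None, rng=None):
    """Draw n points from N(0, Sigma), where Sigma has the given eigenvalues
    and a random orthonormal eigenbasis.  Default eigenvalues are placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    if eigenvalues is None:
        eigenvalues = [0.5, 0.3, 0.04, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
    d = len(eigenvalues)
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # random orthonormal basis
    Sigma = Q @ np.diag(eigenvalues) @ Q.T
    X = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=n).T   # d x n
    return X
```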

87 In this case the space spanned by the top two eigenvectors has most of the energy, so we chose k = 2 and plotted the utility qF (·) for non-private PCA, MOD-SULQ with δ = 0. [sent-276, score-0.27]

88 We drew 100 samples from each privacy-preserving algorithm and the plot of the average utility versus α is shown in Figure 2. [sent-278, score-0.107]

89 As α increases, the privacy requirement is relaxed and both MOD-SULQ and PPCA approach the utility of PCA without privacy constraints. [sent-279, score-1.173]

90 5 Conclusion In this paper we investigated the theoretical and empirical performance of differentially private approximations to PCA. [sent-281, score-0.643]

91 Empirically, we showed that MOD-SULQ and PPCA differ markedly in how well they approximate the top-k subspace of the data. [sent-282, score-0.102]

92 Because PPCA uses the exponential mechanism with qF (·) as the utility function, it is not surprising that it performs well. [sent-284, score-0.177]

93 We furthermore showed that PPCA is nearly optimal, in that any differentially private approximation to PCA must use Ω(d) samples. [sent-286, score-0.642]

94 The description of differentially private algorithms assumes an ideal model of computation: real systems require additional security assumptions that have to be verified. [sent-288, score-0.593]

95 The difference between truly random noise and pseudorandomness and the effects of finite precision can lead to a gap between the theoretical ideal and practice. [sent-289, score-0.117]

96 Numerical optimization methods used in objective perturbation [6] can only produce approximate solutions, and have complex termination conditions unaccounted for in the theoretical analysis. [sent-290, score-0.083]

97 For k = 1 the utility functions qF (·) and qA (·) are related, but for general k it is not immediately clear what metric best captures the idea of “approximating” PCA. [sent-294, score-0.107]

98 A note on differential privacy: Defining resistance to arbitrary side information. [sent-408, score-0.118]

99 Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. [sent-414, score-0.628]

100 Differentially private recommender systems: Building privacy into the netflix prize contenders. [sent-434, score-0.866]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('privacy', 0.533), ('ppca', 0.453), ('private', 0.333), ('sulq', 0.275), ('differentially', 0.26), ('pca', 0.207), ('herry', 0.125), ('qf', 0.119), ('differential', 0.118), ('utility', 0.107), ('subspace', 0.102), ('nonprivate', 0.094), ('vk', 0.084), ('talwar', 0.083), ('dwork', 0.072), ('qa', 0.065), ('bingham', 0.062), ('haudhuri', 0.062), ('mith', 0.062), ('perturbation', 0.061), ('sensitivity', 0.059), ('individuals', 0.047), ('mechanism', 0.047), ('issim', 0.047), ('eigenvector', 0.042), ('census', 0.041), ('localization', 0.039), ('attacks', 0.038), ('insurance', 0.038), ('projections', 0.036), ('orthonormal', 0.036), ('gap', 0.036), ('complexity', 0.035), ('dimension', 0.035), ('matrix', 0.035), ('ka', 0.033), ('copies', 0.032), ('asiviswanathan', 0.031), ('ironov', 0.031), ('lum', 0.031), ('pseudorandomness', 0.031), ('ung', 0.031), ('subspaces', 0.03), ('sample', 0.029), ('classi', 0.029), ('approximation', 0.029), ('release', 0.029), ('approximations', 0.028), ('noise', 0.028), ('xx', 0.028), ('eigenvectors', 0.028), ('icde', 0.028), ('stiefel', 0.028), ('packings', 0.028), ('grassman', 0.028), ('hen', 0.028), ('kddcup', 0.028), ('eigenvalues', 0.027), ('kdd', 0.026), ('output', 0.025), ('holdout', 0.025), ('hoff', 0.025), ('bounds', 0.025), ('sensitive', 0.025), ('moment', 0.025), ('differing', 0.024), ('mcsherry', 0.024), ('pods', 0.024), ('stoc', 0.023), ('exponential', 0.023), ('singular', 0.023), ('reduction', 0.022), ('theoretical', 0.022), ('top', 0.021), ('adds', 0.021), ('publishing', 0.021), ('diego', 0.021), ('guarantee', 0.021), ('mcmc', 0.021), ('eigenvalue', 0.02), ('columns', 0.02), ('chose', 0.02), ('nearly', 0.02), ('changing', 0.019), ('gibbs', 0.019), ('inversely', 0.019), ('rank', 0.019), ('scaling', 0.018), ('semide', 0.018), ('induced', 0.018), ('randomized', 0.018), ('vi', 0.018), ('dimensionality', 0.017), ('sampler', 0.017), ('weaker', 0.017), ('cation', 0.017), ('avenue', 0.017), ('data', 0.017), ('composition', 0.017), ('roth', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

2 0.5458079 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1

3 0.36087579 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry

Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1

4 0.11553983 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

Author: Sejong Yoon, Vladimir Pavlovic

Abstract: Probabilistic approaches to computer vision typically assume a centralized setting, with the algorithm granted access to all observed data points. However, many problems in wide-area surveillance can benefit from distributed modeling, either because of physical or computational constraints. Most distributed models to date use algebraic approaches (such as distributed SVD) and as a result cannot explicitly deal with missing data. In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA and missing-data PPCA, can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods while being able to handle challenging situations such as missing or noisy observations. 1

5 0.099516079 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

6 0.076324187 254 nips-2012-On the Sample Complexity of Robust PCA

7 0.052488837 279 nips-2012-Projection Retrieval for Classification

8 0.050151713 289 nips-2012-Recognizing Activities by Attribute Dynamics

9 0.048762213 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

10 0.048053756 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

11 0.047582112 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

12 0.043616049 234 nips-2012-Multiresolution analysis on the symmetric group

13 0.042406842 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

14 0.041864064 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

15 0.040172968 351 nips-2012-Transelliptical Component Analysis

16 0.038727034 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

17 0.036986325 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

18 0.03698054 163 nips-2012-Isotropic Hashing

19 0.035980411 187 nips-2012-Learning curves for multi-task Gaussian process regression

20 0.035207652 34 nips-2012-Active Learning of Multi-Index Function Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.02), (2, 0.034), (3, -0.054), (4, 0.016), (5, 0.042), (6, 0.008), (7, 0.1), (8, -0.015), (9, -0.048), (10, 0.035), (11, -0.061), (12, -0.112), (13, -0.084), (14, 0.005), (15, 0.091), (16, 0.064), (17, -0.517), (18, 0.16), (19, 0.094), (20, -0.335), (21, 0.018), (22, -0.015), (23, -0.196), (24, 0.252), (25, -0.05), (26, 0.006), (27, -0.083), (28, 0.047), (29, 0.038), (30, -0.031), (31, 0.024), (32, 0.068), (33, -0.038), (34, 0.109), (35, -0.055), (36, -0.066), (37, 0.026), (38, -0.011), (39, -0.046), (40, -0.014), (41, 0.023), (42, 0.033), (43, 0.028), (44, 0.017), (45, 0.007), (46, 0.02), (47, -0.008), (48, -0.035), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92634189 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

2 0.90767992 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry

Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1

3 0.77614278 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1

4 0.36140949 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

Author: Sejong Yoon, Vladimir Pavlovic

Abstract: Probabilistic approaches to computer vision typically assume a centralized setting, with the algorithm granted access to all observed data points. However, many problems in wide-area surveillance can benefit from distributed modeling, either because of physical or computational constraints. Most distributed models to date use algebraic approaches (such as distributed SVD) and as a result cannot explicitly deal with missing data. In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA and missing-data PPCA, can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods while being able to handle challenging situations such as missing or noisy observations. 1

5 0.25867265 254 nips-2012-On the Sample Complexity of Robust PCA

Author: Matthew Coudron, Gilad Lerman

Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N −0.5+ε ) for arbitrarily small ε > 0 (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., O(N −0.5 ). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D2+δ ) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with Frobenius norm is O(D2 )). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm. 1

6 0.22525014 46 nips-2012-Assessing Blinding in Clinical Trials

7 0.22096749 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

8 0.21412314 285 nips-2012-Query Complexity of Derivative-Free Optimization

9 0.21020666 279 nips-2012-Projection Retrieval for Classification

10 0.21000028 34 nips-2012-Active Learning of Multi-Index Function Models

11 0.19929126 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

12 0.19714198 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

13 0.19606197 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

14 0.19556314 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

15 0.19494937 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

16 0.19216359 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

17 0.1835317 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

18 0.17940451 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

19 0.1723599 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

20 0.17226966 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (17, 0.012), (21, 0.036), (28, 0.3), (38, 0.112), (39, 0.026), (42, 0.057), (54, 0.031), (55, 0.015), (74, 0.047), (76, 0.123), (80, 0.078), (86, 0.012), (92, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73523343 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

2 0.66529852 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd . Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α−H¨ lder functions. o 1 Setting and Previous work Let us consider some space X ⊂ Rd , and Y ⊂ R. We call X the input space or sampling space, Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ Rd → Y ⊂ R. We assume that we can query n times the function f , anywhere in the domain, and observe noisy samples of this function. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally: Observed process We consider an unknown Y-valued process defined on X , written ν : X → M+ (Y), where M+ (Y) refers to the set of all probability measures on Y, such that for all x ∈ X , 1 1 def the random variable Y (x) ∼ ν(x) has mean f (x) = E[Y (x)|x] ∈ R. We write for convenience the model in the following way Y (x) = f (x) + noise(x) , def where noise(x) = Y (x) − E[Y (x)|x] is the centered random variable corresponding to the noise, o with unknown variance σ 2 (x). We assume throughout this paper that f is α-H¨ lder. Partition We consider we can define a partition P of the input space X , with finitely many P regions {Rp }1≤p≤P that are assumed to be convex and not degenerated, i.e. such that the interior of each region Rp has positive Lebesgue volume vp . Moreover, with each region Rp is associated a sampling distribution in that region, written µp ∈ M+ (Rp ). Thus, when we decide to sample in 1 region Rp , a new sample X ∈ Rp is generated according to X ∼ µp . Allocation. We consider that we have a finite budget of n ∈ N samples that we can use in order to allocate samples as we wish among the regions {Rp }1≤p≤P . For illustration, let us assume that we deterministically allocate Tp,n ∈ N samples in region Rp , with the constraint that the allocation {Tp,n }1≤p≤P must some to n. In region Rp , we thus sample points {Xp,i }1≤p≤P at random 1 according to the sampling distribution µp , and then get the corresponding values {Yp,i }1≤i≤Tp,n , where Yp,i ∼ ν(Xp,i ). In the sequel, the distribution µp is assumed to be the uniform distribution dλ(x)1x∈R over region Rp , i.e. the density of µp is λ(Rp ) p where λ denotes the Lebesgue measure. Note that this is not restrictive since we are in an active, not passive setting. Piecewise constant mean-approximation. 
We use the collected samples in order to build a pieceˆ wise constant approximation fn of the mean f , and measure the accuracy of approximation on a region Rp with the expected quadratic norm of the approximation error, namely � � � � � ˆ (x))2 λ(dx) = Eµ ,ν (f (X) − mp,n )2 , ˆ (f (x) − fn E p λ(Rp ) Rp ˆ where mp,n is the constant value that takes fn on the region Rp . A natural choice for the estimator ˆ mp,n is to use the empirical mean that is unbiased and asymptotically optimal for this criterion. ˆ Thus we consider the following estimate (histogram) ˆ fn (x) = P � p=1 mp,n I{x ∈ Rp } where mp,n = ˆ ˆ Tp,n 1 � Tp,n Yp,i . i=1 Pseudo loss Note that, since the Tp,n are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form � � � � � � ˆ Eµp ,ν (f (X) − mp,n )2 ˆ = Eµp ,ν (f (X) − Eµp [f (X)])2 + Eµp ,ν (Eµp [f (X)] − mp,n )2 � � � � = Vµp f (X) + Vµp ,ν mp,n ˆ � � � � 1 Vµp ,ν Y (X) . = Vµp f (X) + Tp,n Now, using the following immediate decomposition � � � � � Vµp ,ν Y (X) = Vµp f (X) + σ 2 (x)µp (dx) , Rp we deduce that the maximal expected quadratic norm of the approximation error over the regions def {Rp }1≤p≤P , that depends on the choice of the considered allocation strategy A = {Tp,n }1≤p≤P is thus given by the following so-called pseudo-loss � � � � � � Tp,n + 1 1 def 2 (1) Vµp f (X) + Eµ σ (X) . Ln (A) = max 1≤ p ≤P Tp,n Tp,n p Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of �� �2 � ˆ cells tends to ∞, this local measure of performance approaches supx∈X Eν f (x) − fn (x) . At this point, let us also introduce, for convenience, the notation Qp (Tp,n ) that denotes the term inside the max, in order to emphasize the dependency on the quadratic error with the allocation. Previous work There is a huge literature on the topic of functional estimation in batch setting. Since it is a rather old and well studied question in statistics, many books have been written on this topic, such as Bosq and Lecoutre [1987], Rosenblatt [1991], Gy¨ rfi et al. [2002], where piecewise constant meano approximation are also called “partitioning estimate” or “regressogram” (first introduced by Tukey [1947]). The minimax-optimal rate of approximation on the class of α-H¨ lder functions is known o 2α to be in O(n− 2α+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy¨ rfi et al. [2002]). o In such setting, a dataset {(Xi , Yi )}i≤n is given to the learner, and a typical question is thus to try to find the best possible histogram in order to minimize a approximation error. Thus the dataset is fixed and we typically resort to techniques such as model selection where each model corresponds to one histogram (see Arlot [2007] for an extensive study of such). However, we here ask a very different question, that is how to optimally sample in an online setting in order to minimize the approximation error of some histogram. Thus we choose the histogram 2 before we see any sample, then it is fixed and we need to decide which cell to sample from at each time step. Motivation for this setting comes naturally from some recent works in the setting of active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011]. 
In these works, the objective is to estimate with equal precision the mean of a finite number of distributions (arms), which would correspond to the special case when X = {1, . . . , P } is a finite set in our setting. Intuitively, we reduce the problem to such bandit problem with finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when the arms do not correspond to a singleton, but rather to a continuous region. We show that the answer is positive, yet non trivial. This is non trivial due to the variance estimation in each region: points x in some region may have different means f(x), so that standard estimators for the variance are biased, contrary to the point-wise case and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and the noise level σ 2 in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward but optimizing the goal to understand the unknown environment. Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α−H¨ lder o functions. Section 3 presents our assumptions, that are basically to have a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. Finally, the performance of this procedure is provided and discussed in Section 4. 2 The pseudo-loss: study and optimal strategies 2.1 More intuition on each term in the pseudo-loss It is natural to look at what happens to each of the two terms that appear in equation 1 when one makes Rp shrink towards a point. More precisely, let xp be the mean of X ∼ µp and let us look at the limit of Vµp (f (X)) when vp goes to 0. Assuming that f is differentiable, we get �2 � �� lim Vµp (f (X)) = lim Eµp f (X) − f (xp ) − E[f (X) − f (xp )] vp →0 vp →0 = = = lim Eµp �� �X − xp , ∇f (xp )� − E[�X − xp , ∇f (xp )�] vp →0 � � lim Eµp �X − xp , ∇f (xp )�2 vp →0 � � lim ∇f (xp )T Eµp (X − xp )(X − xp )T ∇f (xp ) . �2 � vp →0 Therefore, if we introduce Σp to be the covariance matrix of the random variable X ∼ µp , then we simply have lim Vµp (f (X)) = lim ||∇f (xp )||2 p . Σ vp →0 vp →0 Example with hyper-cubic regions An important example is when Rp is a hypercube with side 1/d length vp and µp is the uniform distribution over the region Rp . In that case (see Lemma 1), we dx have µp (dx) = , and 2/d vp vp . ||∇f (xp )||2 p = ||∇f (xp )||2 Σ 12 More generally, when f is α−differentiable, i.e. that ∀a ∈ X , ∃∇α f (a, ·) ∈ Sd (0, 1)R such that ∀x ∈ Sd (0, 1), limh→0 f (a+hx)−f (a) = ∇α f (a, x), then it is not too difficult to show that for such hα hyper-cubic regions, we have � � � 2α � Vµp f (X) = O vpd sup |∇α f (xp , u)|2 . S(0,1) � � On the other hand, by direct computation, the second term is such that limvp →0 Eµp σ 2 (X) = � � � � σ 2 (xp ). 
Thus, while Vµp f (X) vanishes, Eµp σ 2 (X) stays bounded away from 0 (unless ν is deterministic). 3 2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation. We now assume that we are allowed to choose the partition P depending on n, thus P = Pn , amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume, and come from a regular grid of the space. Thus the only free parameter is the number of cells Pn of the partition. An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield to a closed-form solution in general. However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of minimax yet non adaptive oracle � algorithm given in closed-form solution): � � −β � � � −α� � � Lemma 1 In the case when Vµp f (X) = Ω Pn and Rp σ 2 (x)µp (dx) = Ω Pn , then an � optimal allocation and partitioning strategy An satisfies that� � � � Vµp f (X) + Eµp σ 2 (X) � � , L − Vµp f (X) � as soon as there exists, for such range of Pn , a constant L such that � � � � � Pn � Vµp f (X) + Eµp σ 2 (X) � � = n. L − Vµp f (X) p=1 1 � Pn = Ω(n max(1+α� −β� ,1) ) and def � Tp,n = The pseudo-loss of such an algorithm A� , optimal amongst the allocations strategies that use the n � partition Pn in Pn regions, is then given by � � � � def max(1 − β , 1 − α ) − 1. where γ = Ln (A� ) = Ω nγ n max(1 + α� − β � , 1) The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough. This Lemma applies to the important class W 1,2 (R) of functions that admit a weak derivative that o belongs to L2 (R). Indeed these functions are H¨ lder with coefficient α = 1/2, i.e. we have o W 1,2 (R) ⊂ C 0,1/2 (R). The standard Brownian motion is an example of function that is 1/2-H¨ lder. More generally, for k = d + α with α = 1/2 when d is odd and α = 1 when d is even, we have the 2 inclusion W k,2 (Rd ) ⊂ C 0,α (Rd ) , where W k,2 (Rd ) is the set of functions that admit a k th weak derivative belonging to L2 (Rd ). Thus the previous Lemma applies to sufficiently smooth functions with smoothness linearly increasing with the dimension d of the input space X . Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then to provide the best oracle way to allocate the budget. We can deduce the following immediate corollary on the class of α−H¨ lder functions observed in a o non-negligible noise of bounded variance (i.e. in the setting β � = 0 and α� = 2α ). d Corollary 1 Consider that f is α−H¨ lder and the noise is of bounded variance. Then a minimaxo d � d+2α ) and an optimal allocation achieves the rate L (A� ) = optimal partition satisfies Pn = Ω(n n n � −2α � Ω n d+2α . Moreover, the strategy of Lemma 1 is optimal amongst the allocations strategies that � use the partition Pn in Pn regions. � −2α � The rate Ω n d+2α is minimax-optimal on the class of α−H¨ lder functions (see Gy¨ rfi et al. [2002], o o Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial numd � � d+2α ). 
After having built the partition, if the quantities ber �� � 2 �� � � of�regions Pn that is of order Pn = Ω(n Vµp f p≤P and Eµp σ p≤P are known to the learner, it is optimal, in the aim of minimizing � the pseudo-loss, to allocate to each region the number of samples Tp,n provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate 4 the samples properly in the regions, without having any access to those quantities. It is then �� � � necessary to balance between exploration, i.e. allocating the samples in order to estimate Vµp f p≤P � � �� and Eµp σ 2 p≤P , and exploitation, i.e. use the estimates to target the optimal allocation. 3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation In this section, we now turn to the design of algorithms that are fully online, with the goal to be competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]d is divided in Pn hyper-cubic regions of same measure (the Lebesgue measure on 1 [0, 1]d ) vp = v = Pn . The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as � � � � � � 2 λ(dx) ˆ (x))2 λ(dx) = max E , max E (f (x) − fn (f (x) − mp,n ) ˆ 1≤p≤Pn 1≤p≤Pn λ(Rp ) λ(Rp ) Rp Rp ˆ where fn is the histogram estimate of the function f on the partition P and mp,n is the empirical ˆ mean defined on region Rp with the samples (Xi , Yi ) such that Xi ∈ Rp . To do so, an algorithm is only allowed to specify at each time step t, the next point Xt where to sample, based on all the past samples {(Xs , Ys )}s < ∞ satisfies that λ2 σ 2 (x) , ∀λ ∈ R+ log E exp[λ noise(x)] ≤ 2 and we further assume that it satisfies the following slightly stronger second property (that is for instance exactly verified for a Gaussian variable, looking at the moment generating function): � � � � 1 λ2 σ 2 (x) ∀λ, γ ∈ R+ log E exp λnoise(x) + γnoise(x)2 ≤ − log 1 − 2γσ 2 (x) . 2(1 − 2γσ 2 (x)) 2 5 The function f is assumed to be (L, α)-H¨ lder, meaning that it satifies o � ∀x, x ∈ X f (x) − f (x� ) ≤ L||x − x� ||α . Similarly, the function σ 2 is assumed to be (M, β)-H¨ lder i.e. it satisfies o � 2 2 � ∀x, x ∈ X σ (x) − σ (x ) ≤ M ||x − x� ||β . We assume that Y is a convex and compact subset of R, thus w.l.g. that it is [0, 1], and that it is known that ||σ 2 ||∞ , which is thus finite, is bounded by the constant 1. 3.2 Empirical estimation of the quadratic approximation error on each cell We define the sampling distribution µp in the region Rp for each p ∈ {1, . . . , Pn } as a quasi-uniform ˜ sampling scheme using the uniform distribution over the sub-regions. More precisely at time t ≤ n, if we decide to sample in the region Rp according to µp , we sample uniformly in each sub-region ˜ one sample, resulting in a new batch of samples {(Xt,k , Yt,k )}1≤k≤K , where Xt,k ∼ µp,k . Note that due to this sampling process, the number of points Tp,t sampled in sub-region Rp at time t is always Tp,t a multiple of K and that moreover for all k, k � ∈ {1, . . . , K} we have that Tp,k,t = Tp,k� ,t = K . Now this specific sampling is used in order to be able to estimate the variances Vµp f and Eµp σ 2 , � so that the best proportions Tp,n can be computed as accurately as possible. Indeed, as explained in � � � � Lemma 1, we have that Vµp f (X) + Eµp σ 2 (X) � def � � . Tp,n = L − Vµp f (X) ˆ Variance estimation We now introduce two estimators. 
The first estimator is written Vp,t and is def ˆ built in the following way. First,let us introduce the empirical estimate fp,k,t of the mean fp,k = � � Eµp,k f (X) of f in sub-region Rp,k . Similarly, to avoid some cumbersome notations, we introduce � � � � � � def def def 2 fp = Eµp f (X) and vp,k = Vµp,k f (X) for the function f , and then σp,k = Eµp,k σ 2 (X) for the variance of the noise σ 2 . We now define the empirical variance estimator to be K 1 � ˆ ˆ (fp,k,t − mp,t )2 , ˆ Vp,t = K −1 k=1 that is a biased estimator. Indeed, for a deterministic Tp,t , it is not difficult to show that we have � K K � � � � � � �� � � � � 2 1 �� 1 � ˆ E Vp,t + Eµp,k f − Eµp f = Vµp,k f + Eµp,k σ 2 . K −1 Tp,t k=1 k=1 � � The leading term in this decomposition, that is given by the first sum, is closed to Vµp f since, by using the assumption that f is (L, α)−H¨ lder, we have the following inequality o � � K � �� �� � �1 � � � � 2 2L2 dα � Eµp,k f − Eµp f − Vµp f (X) � ≤ , � �K (KPn )2α/d k=1 where we also used that the diameter of a sub-region Rp,k is given by diam(Rp,k ) = d1/2 . (KPn )1/d ˆ Then, the second term also contributes to the bias, essentially due to the fact that V[fp,k,t ] = � � � � 2 def def 1 1 2 2 2 Tp,k,t (vp,k + σp,k ) and not Tp,t (vk + σk ) (with vp = Vµp f (X) and σp = Eµp σ (X) ). ˆ p,k,t In order to correct this term, we now introduce the second estimator σ 2 that estimates the variance � � � � � � of the outputs in a region Rp,k , i.e. Vµp,k ,ν Y (X) = Vµp,k f (X) + Eµp,k σ 2 . It is defined as �2 t t �� 1 1 � def ˆ p,k,t = Yi − Yj I{Xj ∈ Rp,k } I{Xi ∈ Rp,k } . σ2 Tp,k,t − 1 i=1 Tp,k,t j=1 Now, we combine the two previous estimators to form the following estimator K 1 �� 1 1 � 2 ˆ ˆ ˆ σ − . Qp,t = Vp,t − K Tp,k,t Tp,t p,k,t k=1 ˆ The following proposition provides a high-probability bound on the difference between Qp,t and the quantity we want to estimate. We report the detailed proof in [Carpentier and Maillard, 2012]. 6 ˆ Proposition 1 By the assumption that f is (L, α)-H¨ lder, the bias of the estimator Qp,t , and for o deterministic Tp,t , is given by � K � � � � � � � � � 2 1 � 2L2 dα ˆ − Vµp f (X) ≤ . Eµp,k f − Eµp f E Qp,t − Qp (Tp,t ) = K (KPn )2α/d k=1 Moreover, it satisfies that for all δ ∈ [0, 1], there exists an event of probability higher than 1 − δ such that on this event, we have � � � � � � K K � � � � 8 log(4/δ) � σ 2 �1 � � � � ˆ p,k,t 1 � 2 ˆ ˆ � Qp,t − E Qp,t � ≤ � √ +o σ p,k . � � (K − 1)2 T2 T K K k=1 p,k,t p,k,t k=1 We also state the following Lemma that we are going to use in the analysis, and that takes into account randomness of the stopping times Tp,k,t . Lemma 2 Let {Xp,k,u }p≤P, k≤K, u≤n be samples potentially sampled in region Rp,k . We introduce qp,u to be the�equivalent of Qp (Tp,t ) with explicitly fixed value of Tp,t = u. Let also qp,u be the ˆ � ˆ p,t but computed with the first u samples in estimate of E qp,u , that is to say the equivalent of Q each region Rp,k (i.e. Tp,t = u). Let us define the event � � � � � � � AK log(4nP/δ)V � � ˆp,t 2L2 dα � � ξn,P,K (δ) = + ω : � qp,u (ω) − E qp,u � ≤ ˆ , u K −1 (KPn )2α/d p≤P u≤n �K 1 ˆ ˆ ˆ p,k,t and where A ≤ 4 is a numerical constant. Then it where Vp,t = Vp (Tp,t ) = K−1 k=1 σ 2 holds that � � P ξn,P,K (δ) ≥ 1 − δ . Note that, with the notations of this Lemma, Proposition 1 above is thus about qp,u . 
3.3 The Online allocation and homogeneous partitioning algorithm for piecewise constant mean-approximation (OAHPA-pcma)

We are now ready to state the algorithm that we propose for minimizing the quadratic error of approximation of $f$; it is described as Algorithm 1 below. Although it looks similar, this algorithm is quite different from a standard UCB algorithm, since $\hat Q_{p,t}$ decreases in expectation with $T_{p,t}$. Indeed, its expectation is close to $\mathbb{V}_{\mu_p}f+\frac{1}{K\,T_{p,t}}\sum_{k=1}^{K}\big(\mathbb{V}_{\mu_{p,k}}f+\mathbb{E}_{\mu_{p,k}}\sigma^2\big)$.

Algorithm 1 OAHPA-pcma.
1: Input: $A$, $L$, $\alpha$, horizon $n$; partition $\{R_p\}_{p\le P}$, with sub-partitions $\{R_{p,k}\}_{k\le K}$.
2: Initialization: Sample $K$ points in every sub-region $\{R_{p,k}\}_{p\le P,\,k\le K}$.
3: for $t=K^2P+1$; $t\le n$; $t=t+K$ do
4: Compute $\hat Q_{p,t}$ for every $p$.
5: Compute, for every $p$, $B_{p,t}=\hat Q_{p,t}+\sqrt{\dfrac{A\,K\log(4nP/\delta)\,\hat V_{p,t}}{(K-1)\,T_{p,t}}}+\dfrac{2L^2 d^{\alpha}}{(KP_n)^{2\alpha/d}}$.
6: Select the region $p_t=\arg\max_{1\le p\le P_n}B_{p,t}$ where to sample.
7: Sample $K$ points in region $R_{p_t}$, one per sub-region $R_{p_t,k}$, according to $\mu_{p_t,k}$.
8: end for

4 Performance of the allocation strategy and discussion

Here is the main result of the paper; see the full version [Carpentier and Maillard, 2012] for the proof. We recall that the objective is to minimize, for an algorithm $\mathcal{A}$, the pseudo-loss $L_n(\mathcal{A})$.

Theorem 1 (Main result) Let $\gamma=\frac{\max_p T^\star_{p,n}}{\min_p T^\star_{p,n}}$ be the distortion factor of the optimal allocation strategy, and let $\epsilon>0$. Then, with the choice of the number of regions $P_n=n^{\frac{d}{2\alpha+d}}\epsilon^{2+\frac{d}{2\alpha}}$ and of the number of sub-regions $K=C^{\frac{2d}{4\alpha+d}}\epsilon^{-2-\frac{d}{\alpha}}$, where $C\stackrel{\mathrm{def}}{=}\frac{8L^2}{A\,d^{1-\alpha}}$, the pseudo-loss of the OAHPA-pcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than $1-\delta$,
$$L_n(\mathcal{A})\;\le\;\Big(1+\epsilon\gamma C'\sqrt{\log(1/\delta)}\Big)\,L_n(\mathcal{A}^\star_n)\;+\;o\big(n^{-\frac{2\alpha}{2\alpha+d}}\big),$$
for some numerical constant $C'$ not depending on $n$, where $\mathcal{A}^\star_n$ is the oracle of Lemma 1.

Minimax-optimal partitioning and $\epsilon$-adaptive performance. Theorem 1 provides a high-probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function $f$, see Lemma 1) allocation $\mathcal{A}^\star_n$ on a partition whose number of cells $P_n$ is chosen to be of minimax order $n^{\frac{d}{2\alpha+d}}$ for the class of $\alpha$-Hölder functions. In particular, since $L_n(\mathcal{A}^\star_n)=O\big(n^{-\frac{2\alpha}{d+2\alpha}}\big)$ on that class, we recover the same minimax order as what is obtained in the batch learning setting, when using for instance wavelets or kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). Moreover, due to the adaptivity of $\mathcal{A}^\star_n$ to the function itself, this procedure is also $\epsilon$-adaptive to the function, and not only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, in the same way as for any classical functional estimation method, when the smoothness of the function increases. Similarly, in agreement with the classical curse of dimensionality, the higher the dimension of the domain, the less efficient the method.
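To make the control flow of Algorithm 1 concrete, here is a self-contained Python sketch of the OAHPA-pcma loop under our reading of the index $B_{p,t}$ above. The `sample(p, k, rng)` callback, the explicit $\delta$ parameter, and all variable names are our own assumptions for illustration, not the paper's interface; it also assumes $K\ge 2$.

```python
import numpy as np

def oahpa_pcma(sample, n, P, K, d, L, alpha, A=4.0, delta=0.05, rng=None):
    """Sketch of the OAHPA-pcma loop (Algorithm 1), our reading of B_{p,t}.
    sample(p, k, rng) is an assumed callback returning one noisy observation
    drawn uniformly from sub-region R_{p,k}. Returns the collected samples."""
    if rng is None:
        rng = np.random.default_rng()
    Y = [[[] for _ in range(K)] for _ in range(P)]
    # Initialization: K observations in every sub-region (K^2 * P samples in total).
    for p in range(P):
        for k in range(K):
            Y[p][k].extend(sample(p, k, rng) for _ in range(K))
    t = K * K * P
    while t + K <= n:
        B = np.empty(P)
        for p in range(P):
            T_pk = np.array([len(y) for y in Y[p]], float)
            T_p = T_pk.sum()
            f_hat = np.array([np.mean(y) for y in Y[p]])
            m_hat = np.mean(np.concatenate([np.asarray(y) for y in Y[p]]))
            sig2 = np.array([np.var(y, ddof=1) for y in Y[p]])
            V_hat = np.sum((f_hat - m_hat) ** 2) / (K - 1)
            Q_hat = V_hat - np.mean((1.0 / T_pk - 1.0 / T_p) * sig2)
            V_bonus = np.sum(sig2) / (K - 1)             # \hat V_{p,t} as in Lemma 2
            bonus = np.sqrt(A * K * np.log(4 * n * P / delta) * V_bonus
                            / ((K - 1) * T_p))
            bias = 2 * L**2 * d**alpha / (K * P) ** (2 * alpha / d)
            B[p] = Q_hat + bonus + bias
        p_t = int(np.argmax(B))            # optimistically largest residual error
        for k in range(K):                 # one new point per sub-region of R_{p_t}
            Y[p_t][k].append(sample(p_t, k, rng))
        t += K
    return Y
```

The index is an upper confidence bound on the residual quadratic error of cell $R_p$, so the loop keeps refining the cell whose error is, optimistically, still the largest.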
Limitations. In this work we assume that the smoothness $\alpha$ of the function is available to the learner, which enables her to calibrate $P_n$ properly. It makes sense to combine the OAHPA-pcma procedure with existing methods that estimate this smoothness online (under a slightly stronger assumption than Hölder continuity, such as Hölder functions that attain their exponents; see Giné and Nickl [2010]). It is thus interesting, when no preliminary knowledge on the smoothness of $f$ is available, to spend some of the initial budget in order to estimate $\alpha$.

We have seen that the OAHPA-pcma procedure, although very simple, manages to obtain minimax-optimal results. The downside of the simplicity of the OAHPA-pcma strategy is two-fold. The first limitation is that the factor $\big(1+\epsilon\gamma C'\sqrt{\log(1/\delta)}\big)=(1+O(\epsilon))$ appearing in the bound in front of $L_n(\mathcal{A}^\star_n)$ is not 1, but higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: if $\epsilon$ is chosen to be small, then the error with respect to the optimal allocation is small. However, since $P_n$ is expressed as an increasing function of $\epsilon$, this implies that the minimax bound on the loss for partition $\mathcal{P}$ also increases with $\epsilon$. That said, in view of the work on active learning in multi-armed bandits that we extend, we would still prefer to get the optimal constant 1. The second limitation is more problematic: since $K$ is chosen irrespective of the region $R_p$, this causes the presence of the factor $\gamma$. Thus the algorithm essentially no longer enjoys near-optimal performance guarantees when the optimal allocation strategy is highly non-homogeneous.

Conclusion and future work. In this paper we considered online regression with histograms in an active setting (we select in which bin to sample), where we can choose the histogram in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on $\alpha$-Hölder functions but with the adaptive oracle that uses a minimax-optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non-trivial extension of the setting of active learning for multi-armed bandits to the case where each arm corresponds to one continuous region of a sampling space, as opposed to a singleton; it can also be seen as a problem of non-parametric function approximation. This new setting offers interesting challenges: we provided a simple procedure, based on the computation of upper confidence bounds on the estimated local quadratic error of approximation, and a performance analysis showing that OAHPA-pcma is first-order $\epsilon$-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of $\alpha$-Hölder functions. However, this simplicity also has a drawback if one is interested in building an exactly first-order optimal procedure, and going beyond these limitations is definitely not trivial: a more optimal but much more complex algorithm would need to tune a different factor $K_p$ in each cell in an online way, i.e. define some $K_{p,t}$ that evolves with time, and redefine the sub-regions accordingly. The analysis of OAHPA-pcma already makes use of powerful tools such as empirical Bernstein bounds for variance estimation (and not only for mean estimation), which makes it non-trivial; handling possibly evolving sub-regions and the progressive refinement of the regions would require an even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work.
Acknowledgements This research was partially supported by the Nord-Pas-de-Calais Regional Council, the French ANR EXPLO-RA project (ANR-08-COSI-004), and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 270327 (CompLACS) and no. 216886 (PASCAL2).

References

András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.

Sylvain Arlot. Rééchantillonnage et Sélection de modèles. PhD thesis, Université Paris Sud - Paris XI, 2007.

A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009.

D. Bosq and J.P. Lecoutre. Théorie de l'estimation fonctionnelle, volume 21. Economica, 1987.

Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893.

Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011.

E. Giné and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2):1122–1170, 2010.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Verlag, 2002.

I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981.

M. Rosenblatt. Stochastic curve estimation, volume 3. Institute of Mathematical Statistics, 1991.

J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

C.J. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pages 1348–1360, 1980.

J.W. Tukey. Non-parametric estimation II. Statistically equivalent blocks and tolerance regions – the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947.

3 0.65245652 163 nips-2012-Isotropic Hashing

Author: Weihao Kong, Wu-jun Li

Abstract: Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. 1
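To make the variance-imbalance point of this abstract concrete, here is a small, self-contained Python sketch (ours, not the authors' code): it projects toy data with PCA, shows that the per-dimension variances are far from equal, and applies a random orthogonal rotation as a crude stand-in for the isotropic idea before thresholding to bits. It does not implement the IsoHash optimization itself, only the phenomenon it targets.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32)) @ rng.normal(size=(32, 32))   # toy correlated data
X -= X.mean(axis=0)

# PCA projection to k dimensions: variances equal the top eigenvalues, hence unequal.
k = 8
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                       # top-k principal directions
Z = X @ W
print("PCA variances:    ", np.round(Z.var(axis=0), 2))

# Crude stand-in for the isotropic idea: an orthogonal rotation of the PCA
# subspace typically flattens the per-dimension variance profile before thresholding.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
Zr = Z @ Q
print("rotated variances:", np.round(Zr.var(axis=0), 2))

bits = (Zr > 0).astype(np.uint8)   # one bit per projected dimension
```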

4 0.60686898 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry

Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1
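For intuition on how the two ingredients of this abstract interact, here is a condensed Python sketch of an MWEM-style loop for counting queries over a small discrete domain. This is our own simplified rendering, not the authors' code; refinements such as averaging the iterates are omitted.

```python
import numpy as np

def mwem(true_hist, queries, T, eps, rng):
    """Sketch of an MWEM-style loop for linear (counting) queries.

    true_hist: histogram of the private dataset over N domain elements (sums to n).
    queries:   (Q, N) 0/1 matrix; query q applied to a histogram h is queries[q] @ h.
    Per round, eps/(2T) goes to the exponential mechanism and eps/(2T) to the
    Laplace measurement, so the whole run is eps-differentially private.
    """
    n = true_hist.sum()
    A = np.full(true_hist.size, n / true_hist.size)   # uniform synthetic histogram
    for _ in range(T):
        err = np.abs(queries @ A - queries @ true_hist)   # per-query error, sensitivity 1
        scores = (eps / (2 * T)) * err / 2                # exponential mechanism scores
        probs = np.exp(scores - scores.max())
        q = rng.choice(len(queries), p=probs / probs.sum())
        m = queries[q] @ true_hist + rng.laplace(scale=2 * T / eps)   # noisy measurement
        # Multiplicative-weights update toward consistency with the measurement.
        A = A * np.exp(queries[q] * (m - queries[q] @ A) / (2 * n))
        A = A * (n / A.sum())                             # renormalize to total mass n
    return A

rng = np.random.default_rng(0)
true_hist = rng.integers(0, 50, size=64).astype(float)      # toy private histogram
queries = rng.integers(0, 2, size=(100, 64)).astype(float)  # random counting queries
A = mwem(true_hist, queries, T=10, eps=1.0, rng=rng)
print(np.abs(queries @ A - queries @ true_hist).max())      # worst-case query error
```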

5 0.57712615 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1

6 0.56907296 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

7 0.55309916 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.55033433 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.54908997 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.54753071 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

11 0.54609972 292 nips-2012-Regularized Off-Policy TD-Learning

12 0.54588079 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

13 0.54430962 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

14 0.54410607 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

15 0.54403496 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

16 0.54321647 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

17 0.54311985 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

18 0.54292786 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

19 0.54270881 348 nips-2012-Tractable Objectives for Robust Policy Optimization

20 0.54265928 364 nips-2012-Weighted Likelihood Policy Search with Model Selection