nips nips2012 nips2012-275 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. [sent-6, score-0.903]
2 In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. [sent-7, score-1.154]
3 As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. [sent-8, score-0.92]
4 1 Introduction There are natural tensions between learning and privacy that arise whenever a learner must aggregate data across multiple individuals. [sent-9, score-0.879]
5 The learner wishes to make optimal use of each data point, but the providers of the data may wish to limit detailed exposure, either to the learner or to other individuals. [sent-10, score-0.154]
6 First, the use of loss functions and risk functions provides a compelling formal foundation for defining “learning,” one that dates back to Wald [28] in the 1930’s, and which has seen continued development in the context of research on machine learning over the past two decades. [sent-14, score-0.183]
7 Third, an appeal to decision theory permits abstraction over the details of specific learning procedures, providing (under certain conditions) minimax lower bounds that apply to any specific procedure. [sent-16, score-0.155]
8 Finally, the use of loss functions, in particular convex loss functions, in the design of a learning system allows powerful tools of optimization theory to be brought to bear. [sent-17, score-0.187]
9 Given a compact convex set Θ ⊂ Rd , we wish to find a parameter value θ ∈ Θ achieving good average performance under a loss function ℓ : X × Rd → R+ . [sent-19, score-0.204]
10 Instead of allowing M access to the samples Xi , however, we study the effect of giving only a perturbed view Zi of each datum Xi , quantifying the rate of convergence of R(θn ) to inf θ∈Θ R(θ) as a function of both the number of samples n and the amount of privacy Zi provides for Xi . [sent-26, score-1.12]
11 1 There is a long history of research at the intersection of privacy and statistics, where there is a natural competition between maintaining the privacy of elements in a dataset {X1 , . [sent-27, score-1.622]
12 Recently, there has been substantial work on privacy—focusing on a measure known as differential privacy [12]—in statistics, computer science, and other fields. [sent-32, score-0.849]
13 In this paper, we study local privacy [13, 17], in which each datum Xi is kept private from the method M. [sent-34, score-1.098]
14 The goal of many types of privacy is to guarantee that the output θn of the method M based on the data cannot be used to discover information about the individual samples X1 , . [sent-35, score-0.85]
15 , Xn , but locally private algorithms access only disguised views of each datum Xi . [sent-38, score-0.322]
16 Locally private algorithms are natural when the providers of the data—the population sampled to give X1 , . . . , Xn —wish to keep their data private even from the learner. [sent-40, score-0.239]
17 For example, in medical applications, a participant may be embarrassed about his use of drugs, but if the loss is able to measure the likelihood of developing cancer, the participant has high utility for access to the optimal parameters θ∗ . [sent-44, score-0.23]
18 Our goal is to understand the fundamental tradeoffs between maintaining privacy while still retaining the utility of the statistical inference method M. [sent-49, score-0.936]
19 [7] develop differentially private empirical risk minimization algorithms, and Dwork and Lei [11] and Smith [26] analyze similar statistical procedures, but do not show that there must be negative effects of privacy. [sent-52, score-0.351]
20 [15] gives sharp minimax rates of convergence for differentially private histogram estimation. [sent-56, score-0.444]
21 [5] also give lower bounds on the closeness of certain statistical quantities computed from the dataset, though their upper and lower bounds do not match. [sent-58, score-0.218]
22 This line of work includes [17], who show that locally private algorithms coincide with concepts that can be learned with polynomial sample complexity in Kearns’s statistical query (SQ) model. [sent-62, score-0.202]
23 In contrast, our analysis addresses sharp rates of convergence, and applies to estimators for a broad class of convex risks (1). [sent-63, score-0.13]
24 2 Main results and approach Our approach to local privacy is based on a worst-case measure of mutual information, where we view privacy preservation as a game between the providers of the data—who wish to preserve privacy—and nature. [sent-64, score-1.886]
25 Recalling that the method sees only the perturbed version Zi of Xi , we adopt a uniform variant of the mutual information I(Zi ; Xi ) between the random variables Xi and Zi as our measure for privacy. [sent-65, score-0.154]
26 This use of mutual information is by no means original [13, 25], but because standard mutual information has deficiencies as a measure of privacy [e. [sent-66, score-1.017]
27 13], we say the distribution Q generating Z from X is private only if I(X; Z) is small for all possible distributions P on X (possibly subject to constraints). [sent-68, score-0.192]
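To make this worst-case requirement concrete, here is a small numerical sketch (our own illustration, not a construction from the paper): for a binary randomized-response channel Q that flips X with a fixed probability, we compute I(X; Z) in closed form and approximate supP I(P, Q) by a grid search over Bernoulli input distributions P.

```python
import math

def mutual_information(p, q_flip):
    # X ~ Bernoulli(p); the channel Q outputs Z = X with probability
    # 1 - q_flip and Z = 1 - X with probability q_flip.
    # I(X; Z) = H(Z) - H(Z | X), computed in nats.
    def h(r):  # binary entropy, safe at the endpoints
        return 0.0 if r in (0.0, 1.0) else -r * math.log(r) - (1 - r) * math.log(1 - r)
    pz1 = p * (1 - q_flip) + (1 - p) * q_flip  # P(Z = 1)
    return h(pz1) - (p * h(1 - q_flip) + (1 - p) * h(q_flip))

# Worst-case leakage sup_P I(P, Q) for a fixed channel, approximated
# over a grid of input distributions P:
worst = max(mutual_information(k / 1000, 0.3) for k in range(1001))
```

For this symmetric channel the supremum is attained at the uniform input, and a noiseless channel (q_flip = 0) leaks the full log 2 nats.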
28 (In the long version of this paper [9] we also consider differentially private algorithms. [sent-71, score-0.229]
29 ) The central consequences of our main results are, under standard conditions on the loss functions ℓ, sharp upper and lower bounds on the possible convergence rates for estimation procedures when we wish to guarantee a level of privacy I(Xi ; Zi ) ≤ I ∗ . [sent-72, score-1.369]
30 2 We show that stochastic gradient descent is one procedure that achieves the optimal convergence rates, which means additionally that our upper bounds apply in streaming and online settings, requiring only a fixed-size memory footprint. [sent-75, score-0.25]
31 Our subsequent analysis builds on this favorable property of gradient-based methods, whence we focus on statistical estimation procedures that access data through the subgradients ∂ℓ(X, θ) of the loss functions. [sent-76, score-0.274]
32 Our mechanism gives M access to a vector Zi that is a stochastic (sub)gradient of the loss evaluated on the sample Xi at a parameter θ of the method’s choosing: E[Zi | Xi , θ] ∈ ∂ℓ(Xi , θ), (2) where ∂ℓ(Xi , θ) denotes the subgradient set of the function θ → ℓ(Xi , θ). [sent-81, score-0.214]
33 To obtain upper and lower bounds on the convergence rate of estimation procedures, we provide a two-part analysis. [sent-83, score-0.129]
34 One part requires studying saddle points of the mutual information I(X; Z) (as a function of the distributions P of X and Q(· | X) of Z) under natural constraints that allow inference of the optimal parameters θ∗ for the risk R. [sent-84, score-0.331]
35 We show that for certain classes of loss functions and constraints on the communicated version Zi of the data Xi , there is a unique distribution Q(· | Xi ) that attains the smallest possible mutual information I(X; Z) for all distributions on X. [sent-85, score-0.33]
36 The uniqueness results for the conditional distribution Q show that no algorithm guaranteeing privacy between M and the samples Xi can do better. [sent-87, score-0.849]
37 We can obtain matching upper bounds by application of known convergence rates for stochastic gradient and mirror descent algorithms [20, 21], which are computationally efficient. [sent-88, score-0.304]
38 Before doing so, we require some formalization of our notions of privacy and error measures, which we now provide. [sent-90, score-0.811]
39 We may obtain privacy by allowing a perturbation to the subgradient g so long as the perturbation lives in a (larger) norm ball of radius M ≥ L, so that C = {g ∈ Rd : ‖g‖ ≤ L} ⊂ D = {g ∈ Rd : ‖g‖ ≤ M }. [sent-96, score-1.018]
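One simple unbiased channel of this form is coordinate-wise randomized response on the corners of the ℓ∞ ball: each coordinate of Z is ±M, biased so that E[Z | g] = g. This is a sketch of a natural construction under the constraint above, not necessarily the optimal Q∗ characterized in Section 4.

```python
import random

def perturb_linf(g, M):
    # Map a subgradient g with max_j |g_j| <= L <= M to a corner of the
    # l_inf ball of radius M, coordinate-wise, so that E[Z_j | g] = g_j:
    # P(Z_j = +M) = 1/2 + g_j / (2M), P(Z_j = -M) = 1/2 - g_j / (2M).
    assert all(abs(gj) <= M for gj in g)
    return [M if random.random() < 0.5 + gj / (2 * M) else -M for gj in g]

# Empirical check of unbiasedness on one subgradient:
random.seed(0)
g, M, n = [0.5, -1.0, 0.0], 2.0, 200_000
sums = [0.0] * len(g)
for _ in range(n):
    for j, v in enumerate(perturb_linf(g, M)):
        sums[j] += v
mean = [s / n for s in sums]
```

Each output coordinate takes only the two extreme values ±M, so the view Zi reveals little about the underlying subgradient while remaining unbiased.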
40 (3) We view the problem of privacy as a game between the adversary controlling P and the data owners, who use Q to obscure the samples X. [sent-100, score-0.811]
41 In particular, we say a distribution Q guarantees a level of privacy I ∗ if and only if supP I(P, Q) ≤ I ∗ . [sent-101, score-0.811]
42 Indeed, if we can find P ∗ and Q∗ satisfying the saddle point (4), then the trivial direction of the max-min inequality yields supP infQ I(P, Q) = I(P ∗ , Q∗ ) = infQ supP I(P, Q). [sent-105, score-0.366]
43 For sets C ⊂ D ⊂ Rd , we define the source set P (C) := {distributions P such that supp P ⊂ C} (5a) and the set of regular conditional distributions (r.c.d.’s) Q (C, D) (5b). [sent-107, score-0.18]
44 The conditional distribution Q∗ satisfies optimal local privacy for the sets C ⊂ D ⊂ Rd at level I ∗ if supP I(P, Q∗ ) = infQ supP I(P, Q) = I ∗ , where the supremum is taken over distributions P ∈ P (C) and the infimum is taken over regular conditional distributions Q ∈ Q (C, D). [sent-119, score-1.233]
45 If a distribution Q∗ satisfies optimal local privacy, then it guarantees that even for the worst possible distribution on X, the information communicated about X is limited. [sent-120, score-0.13]
46 In a sense, Definition 1 captures the natural competition between privacy and learnability. [sent-121, score-0.811]
47 The method M specifies the set D to which the data Z it receives must belong; the “teachers,” or owners of the data X, choose the distribution Q to guarantee as much privacy as possible subject to this constraint. [sent-122, score-0.918]
48 Using this mechanism, if we can characterize a unique distribution Q∗ attaining the infimum (4) for P ∗ (and by extension, for any P ), then we may study the effects of privacy between the method M and X. [sent-123, score-0.845]
49 2 Minimax error and loss functions Having defined our privacy metric, we now turn to our original goal: quantification of the effect privacy has on statistical estimation rates. [sent-125, score-1.793]
50 We define the (random) error of the method M on the risk R(θ) = E[ℓ(X, θ)] after receiving n sample gradients as εn (M, ℓ, Θ, P ) := R(θn ) − inf θ∈Θ R(θ) = EP [ℓ(X, θn )] − inf θ∈Θ EP [ℓ(X, θ)]. [sent-128, score-0.233]
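As a toy illustration of this error measure (our own example, not from the paper), take the squared loss ℓ(x, θ) = (x − θ)² under a finite distribution P: the risk is minimized at θ∗ = EP[X], and the excess risk of any estimate reduces to (θn − θ∗)².

```python
# Excess risk R(theta_hat) - inf_theta R(theta) for the squared loss
# l(x, theta) = (x - theta)^2 under a uniform distribution on four points.
xs = [0.0, 1.0, 2.0, 5.0]

def risk(theta):
    return sum((x - theta) ** 2 for x in xs) / len(xs)

theta_star = sum(xs) / len(xs)   # E[X] minimizes the squared-loss risk
theta_hat = 1.5                  # an estimate returned by some method M
excess = risk(theta_hat) - risk(theta_star)  # equals (theta_hat - theta_star)^2
```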
51 Let Q denote the regular conditional probability—the channel distribution— whose conditional part is defined on the range of the subgradient mapping ∂ (X, ·). [sent-130, score-0.246]
52 Now, let L be a collection of loss functions, where L(P ) denotes the losses ℓ : supp P × Θ → R belonging to L. [sent-132, score-0.196]
53 We define the minimax error ε∗n (L, Θ) := infM sup ℓ∈L(P ),P EP,Q [εn (M, ℓ, Θ, P )], (7) where the expectation is taken over the random samples X ∼ P and Z ∼ Q(· | X). [sent-133, score-0.208]
54 We characterize the minimax error (7) for several classes of loss functions L(P ), giving sharp results when the privacy distribution Q satisfies optimal local privacy. [sent-134, score-1.146]
55 A loss ℓ satisfies the condition (8) if and only if for all θ ∈ Θ, we have the inequality ‖g‖p ≤ L for any subgradient g ∈ ∂ℓ(x, θ) (e.g. [sent-140, score-0.139]
56 The privacy-guaranteeing channel distributions Q∗ we construct in Section 4 are motivated by our concern with the (L, p) families of loss functions. [sent-148, score-0.175]
57 In our model of computation, the learning method M queries the loss ℓ(Xi , ·) at the point θ; the owner of the datum Xi then computes the subgradient ∂ℓ(Xi , θ) and returns a masked version Zi with the property that E[Zi | Xi , θ] ∈ ∂ℓ(Xi , θ). [sent-149, score-0.227]
58 Our first theorem applies to the class of (L, ∞) loss functions (recall Definition 2). [sent-153, score-0.149]
59 Let L be the collection of (L, ∞) loss functions and assume the conditions of the preceding paragraph. [sent-157, score-0.137]
60 Let Q be optimally private for the collection L. [sent-158, score-0.162]
61 Then (a) if Θ contains the ℓ∞ ball of radius r, ε∗n (L, Θ) ≥ (1/17) · M∞ r√d/√n; (b) if Θ = {θ ∈ Rd : ‖θ‖1 ≤ r}, an analogous lower bound holds.
62 17 n For our second theorem, we assume that the loss functions L consist of (L, 1) losses, and that the perturbed data must belong to the 1 ball of radius M1 , i. [sent-161, score-0.287]
63 Setting M = M1 /L, we define (these constants relate to the optimal local privacy distribution for ℓ1 -balls) γ := log( (2d − 2 + √((2d − 2)^2 + 4(M ^2 − 1))) / (2(M − 1)) ), and ∆(γ) := eγ − e−γ . [sent-164, score-0.133]
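A quick numerical sketch of these constants (assuming M = M1/L > 1, so that γ is well defined; the closed form is our reading of the display above):

```python
import math

def gamma_const(d, M):
    # gamma = log( (2d - 2 + sqrt((2d - 2)^2 + 4(M^2 - 1))) / (2(M - 1)) );
    # requires M = M1 / L > 1.
    num = 2 * d - 2 + math.sqrt((2 * d - 2) ** 2 + 4 * (M ** 2 - 1))
    return math.log(num / (2 * (M - 1)))

def delta(gamma):
    # Delta(gamma) = e^gamma - e^(-gamma)
    return math.exp(gamma) - math.exp(-gamma)

g = gamma_const(d=10, M=3.0)
```

For d = 1 the expression collapses to γ = log √((M + 1)/(M − 1)), a convenient sanity check.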
64 Let L be the collection of (L, 1) loss functions and assume the conditions of the preceding paragraph. [sent-166, score-0.137]
65 Let Q be optimally locally private for the collection L. [sent-167, score-0.162]
66 Theorem 2 provides a similar result when we take M1 ↓ L, though in this case stochastic gradient descent attains the matching upper bound. [sent-176, score-0.168]
67 In these, we show that the lower bounds in Theorems 1 and 2 give sharp tradeoffs between the statistical rate of convergence for any statistical procedure and the desired privacy of a user. [sent-178, score-1.1]
68 In each corollary, we look ahead to Section 4 and use one of Propositions 1 or 2 to derive a bijection between the size M∞ or M1 of the perturbation set and the amount of privacy—as measured by the worst case mutual information I ∗ —provided. [sent-180, score-0.21]
69 Assume Q∗ satisfies optimal local privacy at information level I ∗ . [sent-184, score-0.878]
70 Proof Since Θ ⊆ {θ ∈ Rd : ‖θ‖1 ≤ r}, mirror descent [2, 21, 20, Chapter 5], using n unbiased stochastic gradient samples whose ℓ∞ norms are bounded by M∞ , obtains convergence rate O(M∞ r √(log d)/√n). [sent-186, score-0.21]
71 Now fix our desired amount of mutual information I ∗ . [sent-188, score-0.133]
72 From the remarks following Proposition 1, if we must guarantee that I ∗ ≥ supP I(P, Q) for any distribution P and loss function whose gradients are bounded in ℓ∞ -norm by L, we must have I ∗ ≥ dL^2 /M∞^2 . [sent-189, score-0.361]
73 Up to higher-order terms, to guarantee a level of privacy with mutual information I ∗ , we must allow gradient noise up to M∞ = L√(d/I ∗ ). [sent-190, score-1.013]
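The resulting tradeoff is easy to tabulate; the sketch below is purely illustrative, with hypothetical values of L, d, and I∗:

```python
import math

def noise_radius_linf(L, d, I_star):
    # Up to higher-order terms, guaranteeing worst-case mutual information
    # at most I_star forces a perturbation radius M_inf = L * sqrt(d / I_star).
    return L * math.sqrt(d / I_star)

# Halving the privacy budget inflates the required radius by sqrt(2):
r1 = noise_radius_linf(L=1.0, d=100, I_star=1.0)   # M_inf = 10
r2 = noise_radius_linf(L=1.0, d=100, I_star=0.5)
```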
74 Using the bijection between M∞ and the maximal allowed mutual information I ∗ under local privacy that we have shown, we substitute M∞ = L√d/√I ∗ into the upper and lower bounds that we have already attained. [sent-191, score-1.096]
75 Similar upper and lower bounds can be obtained under the conditions of part (a) of Theorem 1, where we need not assume Θ is an ℓ1 -ball, but we lose a factor of √(log d) in the lower bound. [sent-192, score-0.197]
76 Assume that Q∗ satisfies optimal local privacy at information level I ∗ . [sent-196, score-0.878]
77 Proof By the conditions of optimal local privacy (Proposition 2 and Corollary 3), to have I ∗ ≥ supP I(P, Q) for any loss whose gradients are bounded in ℓ1 -norm by L, we must have I ∗ ≥ dL^2 /(2M1^2 ). [sent-198, score-1.064]
78 Rewriting this, we see that we must have M1 = L√(d/(2I ∗ )) (up to higher-order terms) to be able to guarantee an amount of privacy I ∗ . [sent-199, score-0.91]
79 Indeed, stochastic gradient descent (SGD) enjoys the following convergence guarantees (e.g. [sent-201, score-0.142]
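A minimal sketch of such a loop (our own toy example): projected SGD on the squared-error risk, where each gradient is masked by zero-mean noise so that condition (2), E[Zi | Xi, θ] ∈ ∂ℓ(Xi, θ), holds. The additive-uniform mask here is only a stand-in for the optimal channel.

```python
import random

def private_sgd(samples, radius, M, c=0.5):
    # Projected SGD on R(theta) = E[(X - theta)^2] from perturbed gradients.
    theta = 0.0
    for t, x in enumerate(samples, start=1):
        z = 2 * (theta - x) + random.uniform(-M, M)  # unbiased noisy gradient
        theta -= (c / t) * z                         # 1/t step sizes
        theta = max(-radius, min(radius, theta))     # project onto [-r, r]
    return theta

random.seed(1)
data = [random.gauss(2.0, 1.0) for _ in range(20_000)]
theta_hat = private_sgd(data, radius=5.0, M=4.0)     # approaches E[X] = 2
```

The method sees only the masked gradients z, never the samples x themselves, matching the streaming, fixed-memory setting described above.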
80 Let Θ ⊆ Rd be contained in the ℓ∞ ball of radius r and let the gradients of the loss belong to the ℓ1 -ball of radius M1 . [sent-204, score-0.263]
81 4 Saddle points, optimal privacy, and mutual information In this section, we explore conditions for a distribution Q∗ to satisfy optimal local privacy, as given by Definition 1. [sent-207, score-0.234]
82 We give characterizations of necessary and sufficient conditions based on the compact sets C ⊂ D for distributions P ∗ and Q∗ to achieve the saddle point (4). [sent-208, score-0.225]
83 Our results can be viewed as rate distortion theorems [14, 8] (with source P and channel Q) for certain compact alphabets, though as far as we know, they are all new. [sent-209, score-0.17]
84 Thus we sometimes refer to the conditional distribution Q, which is designed to maintain the privacy of the data X by communication of Z, as the channel distribution. [sent-210, score-0.95]
85 Since we wish to bound I(X; Z) for arbitrary losses ℓ, we must address the case when ℓ(X, θ) = ⟨θ, X⟩, in which case ∇ℓ(X, θ) = X; by the data-processing inequality [14, Chapter 5] it is thus no loss of generality to assume that X ∈ C and that E[Z | X] = X. [sent-211, score-0.197]
86 Let C ⊂ Rd be a compact convex set with extreme points ui ∈ Rd , i ∈ I for some index set I. [sent-215, score-0.189]
87 Then C is rotationally invariant through its extreme points if ‖ui ‖2 = ‖uj ‖2 for each i, j, and if for any unitary matrix U such that U ui = uj for some i ≠ j, we have U C = C. [sent-216, score-0.248]
88 Some examples of convex sets rotationally invariant through their extreme points include ℓp -norm balls for p = 1, 2, ∞, though ℓp -balls for p ∉ {1, 2, ∞} are not. [sent-217, score-0.193]
89 The following theorem gives a general characterization of the minimax mutual information for rotationally invariant norm balls with finite numbers of extreme points by providing saddle point distributions P ∗ and Q∗ . [sent-218, score-0.534]
90 Moreover, the distribution P ∗ uniform on the extreme points {u1 , . . . , um } uniquely attains the saddle point (4). [sent-228, score-0.147]
91 In fact, we may introduce a random variable X̃ between X and Z: let X̃ be distributed among the extreme points {u1 , . . . , um } of C in any way such that E[X̃ | X] = X, then use the maximum entropy distribution Q∗ (· | ui ) defined in the theorem when X̃ ∈ {u1 , . . . , um } to sample Z from X̃. [sent-230, score-0.18]
92 The information processing inequality [14, Chapter 5] guarantees that the Markov chain X → X̃ → Z satisfies the minimax bound I(X; Z) ≤ infQ supP I(P, Q). [sent-231, score-0.15]
93 With Theorem 3 in place, we can explicitly characterize the distributions achieving optimal local privacy (recall Definition 1) for ℓ1 and ℓ∞ balls. [sent-232, score-0.942]
94 Here the arguments are slightly more complicated, as the coordinates of the random variables are no longer independent, but Theorem 3 still allows us to explicitly characterize the saddle point of the mutual information. [sent-246, score-0.253]
95 (9), and let Q∗ be the distribution on Z | X such that Z is supported on {±M e1 , . . . , ±M ed }, with Q∗ (Z = M ei | X = ei ) = eγ /(eγ + e−γ + (2d − 2)), (10a) Q∗ (Z = −M ei | X = ei ) = e−γ /(eγ + e−γ + (2d − 2)), (10b) and Q∗ (Z = ±M ej | X = ei ) = 1/(eγ + e−γ + (2d − 2)) for each j ≠ i. [sent-250, score-0.43]
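Under this display, Z given X = ei puts weight eγ on +M ei, weight e−γ on −M ei, and (our reading of the truncated normalization, an assumption) weight 1 on each of the remaining 2d − 2 signed coordinate vectors, all divided by eγ + e−γ + (2d − 2). A sampling sketch:

```python
import math
import random

def q_star_probs(d, gamma, i):
    # Channel Q*(. | X = e_i) supported on the 2d points {+-M e_j}; the
    # weights on the off-coordinate points (j != i) are our assumption
    # from the truncated text.
    z = math.exp(gamma) + math.exp(-gamma) + (2 * d - 2)
    return {(j, s): (math.exp(s * gamma) if j == i else 1.0) / z
            for j in range(d) for s in (+1, -1)}

def sample_z(d, gamma, i, M):
    # Draw one masked view Z: a single signed coordinate scaled to M.
    probs = q_star_probs(d, gamma, i)
    keys = list(probs)
    (j, s), = random.choices(keys, weights=[probs[k] for k in keys])
    z = [0.0] * d
    z[j] = s * M
    return z

p = q_star_probs(d=5, gamma=1.2, i=0)
```

The weights sum to one by construction, and the channel favors the coordinate actually held by the data owner only by the factor eγ, which caps the information leaked.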
96 Additionally, an asymptotic expansion allows us to gain a somewhat clearer picture of the values of the mutual information, though we do not derive upper bounds as we did for Proposition 1. [sent-255, score-0.181]
97 We study procedures that access each datum only once and through a perturbed view Zi of the subgradient ∂ℓ(Xi , θ), which allows us to use (essentially) any convex loss. [sent-262, score-0.343]
98 These questions do not appear to have easy answers, especially when we wish to allow each provider of a single datum to be able to guarantee his or her own privacy. [sent-274, score-0.174]
99 Nevertheless, we hope that our view of privacy and the techniques we have developed herein prove fruitful, and we hope to investigate some of the above issues in future work. [sent-275, score-0.811]
100 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-294, score-0.133]
wordName wordTfidf (topN-words)
[('privacy', 0.811), ('private', 0.162), ('saddle', 0.116), ('mutual', 0.103), ('zi', 0.092), ('rd', 0.09), ('datum', 0.088), ('ei', 0.086), ('minimax', 0.083), ('providers', 0.077), ('rotationally', 0.077), ('supp', 0.076), ('loss', 0.075), ('dwork', 0.074), ('channel', 0.07), ('remarks', 0.07), ('procedures', 0.069), ('inf', 0.067), ('differentially', 0.067), ('subgradient', 0.064), ('ui', 0.064), ('communicated', 0.063), ('sup', 0.058), ('mievski', 0.058), ('xi', 0.057), ('theorems', 0.055), ('risk', 0.052), ('perturbed', 0.051), ('tradeoffs', 0.05), ('ni', 0.05), ('sharp', 0.048), ('gradients', 0.047), ('wish', 0.047), ('theorem', 0.046), ('proposition', 0.046), ('rates', 0.045), ('losses', 0.045), ('compact', 0.045), ('bounds', 0.044), ('extreme', 0.043), ('stochastic', 0.041), ('corollary', 0.04), ('statistical', 0.04), ('mirror', 0.039), ('ev', 0.039), ('bijection', 0.039), ('convergence', 0.039), ('guarantee', 0.039), ('deferring', 0.038), ('disguised', 0.038), ('owners', 0.038), ('rld', 0.038), ('sankar', 0.038), ('tensions', 0.038), ('zdq', 0.038), ('differential', 0.038), ('conditional', 0.038), ('perturbation', 0.038), ('radius', 0.038), ('convex', 0.037), ('local', 0.037), ('regular', 0.036), ('balls', 0.036), ('belong', 0.036), ('satis', 0.035), ('utility', 0.035), ('characterize', 0.034), ('conditions', 0.034), ('upper', 0.034), ('mum', 0.034), ('access', 0.034), ('nissim', 0.034), ('kasiviswanathan', 0.034), ('warner', 0.034), ('rl', 0.033), ('nition', 0.033), ('descent', 0.032), ('chapter', 0.032), ('evidently', 0.031), ('attains', 0.031), ('communication', 0.031), ('optimal', 0.03), ('amount', 0.03), ('must', 0.03), ('distributions', 0.03), ('gradient', 0.03), ('ep', 0.03), ('symposium', 0.029), ('ball', 0.029), ('log', 0.029), ('lower', 0.028), ('participant', 0.028), ('rubinstein', 0.028), ('estimation', 0.028), ('functions', 0.028), ('eq', 0.027), ('entropy', 0.027), ('survey', 0.027), ('wasserman', 0.027), ('chaudhuri', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 275 nips-2012-Privacy Aware Learning
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1
2 0.5458079 237 nips-2012-Near-optimal Differentially Private Principal Components
Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha
Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1
3 0.36220768 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release
Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry
Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1
4 0.16689122 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1
5 0.097008251 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
6 0.079842485 16 nips-2012-A Polynomial-time Form of Robust Regression
7 0.073143899 293 nips-2012-Relax and Randomize : From Value to Algorithms
8 0.062364608 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
9 0.059714902 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
10 0.059186369 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
11 0.058050822 206 nips-2012-Majorization for CRFs and Latent Likelihoods
12 0.057119727 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
13 0.056983285 227 nips-2012-Multiclass Learning with Simplex Coding
14 0.054217137 324 nips-2012-Stochastic Gradient Descent with Only One Projection
15 0.053970683 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
16 0.052604459 361 nips-2012-Volume Regularization for Binary Classification
17 0.052403178 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
18 0.05181038 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
19 0.050725609 254 nips-2012-On the Sample Complexity of Robust PCA
20 0.050219007 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
topicId topicWeight
[(0, 0.172), (1, 0.013), (2, 0.097), (3, -0.046), (4, 0.05), (5, 0.056), (6, -0.001), (7, 0.12), (8, -0.025), (9, -0.046), (10, 0.049), (11, -0.001), (12, -0.129), (13, -0.07), (14, -0.007), (15, 0.075), (16, 0.018), (17, -0.532), (18, 0.129), (19, 0.131), (20, -0.316), (21, -0.013), (22, 0.022), (23, -0.178), (24, 0.222), (25, -0.046), (26, 0.009), (27, -0.053), (28, 0.059), (29, -0.0), (30, 0.043), (31, -0.026), (32, 0.007), (33, -0.037), (34, 0.12), (35, 0.006), (36, -0.008), (37, -0.004), (38, -0.005), (39, -0.071), (40, -0.016), (41, 0.007), (42, 0.019), (43, 0.047), (44, -0.012), (45, -0.011), (46, 0.003), (47, 0.014), (48, 0.001), (49, 0.04)]
simIndex simValue paperId paperTitle
1 0.91347528 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release
Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry
Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1
2 0.90238428 237 nips-2012-Near-optimal Differentially Private Principal Components
Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha
Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1
same-paper 3 0.88527864 275 nips-2012-Privacy Aware Learning
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1
4 0.3829174 285 nips-2012-Query Complexity of Derivative-Free Optimization
Author: Ben Recht, Kevin G. Jamieson, Robert Nowak
Abstract: This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same. 1
5 0.37384865 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1
6 0.34146926 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
7 0.3226679 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
8 0.31914079 217 nips-2012-Mixability in Statistical Learning
9 0.31356248 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
10 0.29735234 254 nips-2012-On the Sample Complexity of Robust PCA
11 0.28619504 161 nips-2012-Interpreting prediction markets: a stochastic approach
12 0.27516839 46 nips-2012-Assessing Blinding in Clinical Trials
13 0.27438816 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
14 0.26719081 16 nips-2012-A Polynomial-time Form of Robust Regression
15 0.25594783 34 nips-2012-Active Learning of Multi-Index Function Models
16 0.25473902 361 nips-2012-Volume Regularization for Binary Classification
17 0.24892882 206 nips-2012-Majorization for CRFs and Latent Likelihoods
18 0.24883407 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
19 0.24869849 271 nips-2012-Pointwise Tracking the Optimal Regression Function
20 0.24386115 338 nips-2012-The Perturbed Variation
topicId topicWeight
[(0, 0.019), (14, 0.013), (21, 0.019), (28, 0.029), (38, 0.149), (39, 0.01), (42, 0.119), (53, 0.011), (54, 0.018), (55, 0.03), (74, 0.038), (76, 0.142), (80, 0.087), (86, 0.18), (92, 0.044)]
simIndex simValue paperId paperTitle
1 0.86739874 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects
Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black
Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.
2 0.84672666 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Author: Evan Archer, Il M. Park, Jonathan W. Pillow
Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions.
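The analytic tractability mentioned in the abstract can be illustrated with the standard closed form for the posterior mean of H under a fixed symmetric Dirichlet prior. This is the textbook formula for a finite alphabet, not the mixture-of-priors estimator the paper constructs:

```python
import numpy as np
from scipy.special import digamma

def dirichlet_entropy_mean(counts, a=1.0):
    # Posterior mean of Shannon entropy (in nats) under a symmetric
    # Dirichlet(a) prior: with c_k = n_k + a and C = sum_k c_k,
    #   E[H | counts] = psi(C + 1) - sum_k (c_k / C) * psi(c_k + 1),
    # where psi is the digamma function.
    c = np.asarray(counts, dtype=float) + a
    C = c.sum()
    return digamma(C + 1) - np.sum((c / C) * digamma(c + 1))

def plugin_entropy(counts):
    # Naive maximum-likelihood ("plug-in") estimate, for comparison.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

counts = [100] * 8          # well-sampled uniform distribution over 8 symbols
bayes = dirichlet_entropy_mean(counts)
mle = plugin_entropy(counts)
```

With ample data both estimates approach log 8; the abstract's point is that in the under-sampled regime the fixed prior dominates the estimate, which is what motivates the flat mixture prior over H.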
same-paper 3 0.83856767 275 nips-2012-Privacy Aware Learning
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.
4 0.81987137 303 nips-2012-Searching for objects driven by context
Author: Bogdan Alexe, Nicolas Heess, Yee W. Teh, Vittorio Ferrari
Abstract: The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image, exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while achieving higher object detection performance.
5 0.81505007 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi
Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches.
6 0.80298162 242 nips-2012-Non-linear Metric Learning
7 0.80167472 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
8 0.80063039 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
9 0.79930574 289 nips-2012-Recognizing Activities by Attribute Dynamics
10 0.78829008 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes
11 0.78422254 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
12 0.77315021 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
13 0.76885772 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
14 0.76643854 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
15 0.76572913 292 nips-2012-Regularized Off-Policy TD-Learning
16 0.76528871 237 nips-2012-Near-optimal Differentially Private Principal Components
17 0.76503903 358 nips-2012-Value Pursuit Iteration
18 0.76463234 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
19 0.76377696 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
20 0.76236463 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications