nips nips2012 nips2012-275 knowledge-graph by maker-knowledge-mining

275 nips-2012-Privacy Aware Learning

Source: pdf

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept conﬁdential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We study statistical risk minimization problems under a version of privacy in which the data is kept conﬁdential even from the learner. [sent-6, score-0.903]

2 In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. [sent-7, score-1.154]

3 As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. [sent-8, score-0.92]

4 1 Introduction There are natural tensions between learning and privacy that arise whenever a learner must aggregate data across multiple individuals. [sent-9, score-0.879]

5 The learner wishes to make optimal use of each data point, but the providers of the data may wish to limit detailed exposure, either to the learner or to other individuals. [sent-10, score-0.154]

6 First, the use of loss functions and risk functions provides a compelling formal foundation for deﬁning “learning,” one that dates back to Wald [28] in the 1930’s, and which has seen continued development in the context of research on machine learning over the past two decades. [sent-14, score-0.183]

7 Third, an appeal to decision theory permits abstraction over the details of speciﬁc learning procedures, providing (under certain conditions) minimax lower bounds that apply to any speciﬁc procedure. [sent-16, score-0.155]

8 Finally, the use of loss functions, in particular convex loss functions, in the design of a learning system allows powerful tools of optimization theory to be brought to bear. [sent-17, score-0.187]

9 Given a compact convex set Θ ⊂ Rd , we wish to ﬁnd a parameter value θ ∈ Θ achieving good average performance under a loss function : X × Rd → R+ . [sent-19, score-0.204]

10 Instead of allowing M access to the samples Xi , however, we study the effect of giving only a perturbed view Zi of each datum Xi , quantifying the rate of convergence of R(θn ) to inf θ∈Θ R(θ) as a function of both the number of samples n and the amount of privacy Zi provides for Xi . [sent-26, score-1.12]

11 1 There is a long history of research at the intersection of privacy and statistics, where there is a natural competition between maintaining the privacy of elements in a dataset {X1 , . [sent-27, score-1.622]

12 Recently, there has been substantial work on privacy—focusing on a measure known as differential privacy [12]—in statistics, computer science, and other ﬁelds. [sent-32, score-0.849]

13 In this paper, we study local privacy [13, 17], in which each datum Xi is kept private from the method M. [sent-34, score-1.098]

14 The goal of many types of privacy is to guarantee that the output θn of the method M based on the data cannot be used to discover information about the individual samples X1 , . [sent-35, score-0.85]

15 , Xn , but locally private algorithms access only disguised views of each datum Xi . [sent-38, score-0.322]

16 Locally private algorithms are natural when the providers of the data—the population sampled to give X1 , . [sent-40, score-0.239]

17 For example, in medical applications, a participant may be embarrassed about his use of drugs, but if the loss is able to measure the likelihood of developing cancer, the participant has high utility for access to the optimal parameters θ∗ . [sent-44, score-0.23]

18 Our goal is to understand the fundamental tradeoffs between maintaining privacy while still retaining the utility of the statistical inference method M. [sent-49, score-0.936]

19 [7] develop differentially private empirical risk minimization algorithms, and Dwork and Lei [11] and Smith [26] analyze similar statistical procedures, but do not show that there must be negative effects of privacy. [sent-52, score-0.351]

20 [15] gives sharp minimax rates of convergence for differentially private histogram estimation. [sent-56, score-0.444]

21 [5] also give lower bounds on the closeness of certain statistical quantities computed from the dataset, though their upper and lower bounds do not match. [sent-58, score-0.218]

22 [17], who show that that locally private algorithms coincide with concepts that can be learned with polynomial sample complexity in Kearns’s statistical query (SQ) model. [sent-62, score-0.202]

23 In contrast, our analysis addresses sharp rates of convergence, and applies to estimators for a broad class of convex risks (1). [sent-63, score-0.13]

24 2 Main results and approach Our approach to local privacy is based on a worst-case measure of mutual information, where we view privacy preservation as a game between the providers of the data—who wish to preserve privacy—and nature. [sent-64, score-1.886]

25 Recalling that the method sees only the perturbed version Zi of Xi , we adopt a uniform variant of the mutual information I(Zi ; Xi ) between the random variables Xi and Zi as our measure for privacy. [sent-65, score-0.154]

26 This use of mutual information is by no means original [13, 25], but because standard mutual information has deﬁciencies as a measure of privacy [e. [sent-66, score-1.017]

27 13], we say the distribution Q generating Z from X is private only if I(X; Z) is small for all possible distributions P on X (possibly subject to constraints). [sent-68, score-0.192]

28 (In the long version of this paper [9] we also consider differentially private algorithms. [sent-71, score-0.229]

29 ) The central consequences of our main results are, under standard conditions on the loss functions , sharp upper and lower bounds on the possible convergence rates for estimation procedures when we wish to guarantee a level of privacy I(Xi ; Zi ) ≤ I ∗ . [sent-72, score-1.369]

30 2 We show that stochastic gradient descent is one procedure that achieves the optimal convergence rates, which means additionally that our upper bounds apply in streaming and online settings, requiring only a ﬁxed-size memory footprint. [sent-75, score-0.25]

31 Our subsequent analysis builds on this favorable property of gradient-based methods, whence we focus on statistical estimation procedures that access data through the subgradients of the loss functions ∂ (X, θ). [sent-76, score-0.274]

32 Our mechanism gives M access to a vector Zi that is a stochastic (sub)gradient of the loss evaluated on the sample Xi at a parameter θ of the method’s choosing: E[Zi | Xi , θ] ∈ ∂ (Xi , θ), (2) where ∂ (Xi , θ) denotes the subgradient set of the function θ → (Xi , θ). [sent-81, score-0.214]

33 To obtain upper and lower bound on the convergence rate of estimation procedures, we provide a two-part analysis. [sent-83, score-0.129]

34 One part requires studying saddle points of the mutual information I(X; Z) (as a function of the distributions P of X and Q(· | X) of Z) under natural constraints that allow inference of the optimal parameters θ∗ for the risk R. [sent-84, score-0.331]

35 We show that for certain classes of loss functions and constraints on the communicated version Zi of the data Xi , there is a unique distribution Q(· | Xi ) that attains the smallest possible mutual information I(X; Z) for all distributions on X. [sent-85, score-0.33]

36 The uniqueness results for the conditional distribution Q show that no algorithm guaranteeing privacy between M and the samples Xi can do better. [sent-87, score-0.849]

37 We can obtain matching upper bounds by application of known convergence rates for stochastic gradient and mirror descent algorithms [20, 21], which are computationally efﬁcient. [sent-88, score-0.304]

38 Before doing so, we require some formalization of our notions of privacy and error measures, which we now provide. [sent-90, score-0.811]

39 We may obtain privacy by allowing a perturbation to the subgradient g so long as the perturbation lives in a (larger) norm ball of radius M ≥ L, so that C = {g ∈ Rd : g ≤ L} ⊂ D = {g ∈ Rd : g ≤ M }. [sent-96, score-1.018]

40 (3) We view the problem of privacy as a game between the adversary controlling P and the data owners, who use Q to obscure the samples X. [sent-100, score-0.811]

41 In particular, we say a distribution Q guarantees a level of privacy I ∗ if and only if supP I(P, Q) ≤ I ∗ . [sent-101, score-0.811]

42 Indeed, if we can ﬁnd P ∗ and Q∗ satisfying the saddle point (4), then the trivial direction of the max-min inequality yields sup inf I(P, Q) = I(P ∗ , Q∗ ) = inf sup I(P, Q). [sent-105, score-0.366]

43 For sets C ⊂ D ⊂ Rd , we deﬁne the source set P (C) := {Distributions P such that supp P ⊂ C} (5a) and the set of regular conditional distributions (r. [sent-107, score-0.18]

44 The conditional distribution Q∗ satisﬁes optimal local privacy for the sets C ⊂ D ⊂ Rd at level I ∗ if sup I(P, Q∗ ) = inf sup I(P, Q) = I ∗ , Q P P where the supremum is taken over distributions P ∈ P (C) and the inﬁmum is taken over regular conditional distributions Q ∈ Q (C, D). [sent-119, score-1.233]

45 If a distribution Q∗ satisﬁes optimal local privacy, then it guarantees that even for the worst possible distribution on X, the information communicated about X is limited. [sent-120, score-0.13]

46 In a sense, Deﬁnition 1 captures the natural competition between privacy and learnability. [sent-121, score-0.811]

47 The method M speciﬁes the set D to which the data Z it receives must belong; the “teachers,” or owners of the data X, choose the distribution Q to guarantee as much privacy as possible subject to this constraint. [sent-122, score-0.918]

48 Using this mechanism, if we can characterize a unique distribution Q∗ attaining the inﬁmum (4) for P ∗ (and by extension, for any P ), then we may study the effects of privacy between the method M and X. [sent-123, score-0.845]

49 2 Minimax error and loss functions Having deﬁned our privacy metric, we now turn to our original goal: quantiﬁcation of the effect privacy has on statistical estimation rates. [sent-125, score-1.793]

50 We deﬁne the (random) error of the method M on the risk R(θ) = E[ (X, θ)] after receiving n sample gradients as n (M, , Θ, P ) := R(θn ) − inf R(θ) = EP [ (X, θn )] − inf EP [ (X, θ)]. [sent-128, score-0.233]

51 Let Q denote the regular conditional probability—the channel distribution— whose conditional part is deﬁned on the range of the subgradient mapping ∂ (X, ·). [sent-130, score-0.246]

52 Now, let L be a collection of loss functions, where L(P ) denotes the losses : supp P × Θ → R belonging to L. [sent-132, score-0.196]

53 We deﬁne the minimax error ∗ n (L, Θ) := inf sup M ∈L(P ),P EP,Q [ n (M, , Θ, P )], (7) where the expectation is taken over the random samples X ∼ P and Z ∼ Q(· | X). [sent-133, score-0.208]

54 We characterize the minimax error (7) for several classes of loss functions L(P ), giving sharp results when the privacy distribution Q satisﬁes optimal local privacy. [sent-134, score-1.146]

55 A loss satisﬁes the condition (8) if and only if for all θ ∈ Θ, we have the inequality g p ≤ L for any subgradient g ∈ ∂ (x, θ) (e. [sent-140, score-0.139]

56 The privacy-guaranteeing channel distributions Q∗ we construct in Section 4 are motivated by our concern with the (L, p) families of loss functions. [sent-148, score-0.175]

57 In our model of computation, the learning method M queries the loss (Xi , ·) at the point θ; the owner of the datum Xi then computes the subgradient ∂ (Xi , θ) and returns a masked version Zi with the property that E[Zi | Xi , θ] ∈ ∂ (Xi , θ). [sent-149, score-0.227]

58 Our ﬁrst theorem applies to the class of (L, ∞) loss functions (recall Deﬁnition 2). [sent-153, score-0.149]

59 Let L be the collection of (L, ∞) loss functions and assume the conditions of the preceding paragraph. [sent-157, score-0.137]

60 Let Q satisfy be optimally private for the collection L. [sent-158, score-0.162]

61 Then (a) If Θ contains the ∞ ball of radius r, ∗ n (L, Θ) (b) If Θ = {θ ∈ Rd : θ 1 ≥ 1 M∞ rd · √ . [sent-159, score-0.157]

62 17 n For our second theorem, we assume that the loss functions L consist of (L, 1) losses, and that the perturbed data must belong to the 1 ball of radius M1 , i. [sent-161, score-0.287]

63 Setting M = M1 /L, we deﬁne (these constants relate to the optimal local privacy distribution for 1 -balls) γ := log 2d − 2 + (2d − 2)2 + 4(M 2 − 1) 2(M − 1) , and ∆(γ) := eγ − e−γ . [sent-164, score-0.907]

64 Let L be the collection of (L, 1) loss functions and assume the conditions of the preceding paragraph. [sent-166, score-0.137]

65 Let Q be optimally locally private for the collection L. [sent-167, score-0.162]

66 Theorem 2 provides a similar result when we take M1 ↓ L, though in this case stochastic gradient descent attains the matching upper bound. [sent-176, score-0.168]

67 In these, we show that the lower bounds in Theorems 1 and 2 give sharp tradeoffs between the statistical rate of convergence for any statistical procedure and the desired privacy of a user. [sent-178, score-1.1]

68 In each corollary, we look ahead to Section 4 and use one of Propositions 1 or 2 to derive a bijection between the size M∞ or M1 of the perturbation set and the amount of privacy—as measured by the worst case mutual information I ∗ —provided. [sent-180, score-0.21]

69 Assume Q∗ satisﬁes optimal local privacy at information level I ∗ . [sent-184, score-0.878]

70 nI ∗ nI ∗ Proof Since Θ ⊆ {θ ∈ Rd : θ 1 ≤ r}, mirror descent [2, 21, 20, Chapter 5], using n unbiased stochastic gradient samples whose ∞ norms are bounded by M∞ , obtains convergence √ √ rate O(M∞ r log d/ n). [sent-186, score-0.21]

71 Now ﬁx our desired amount of mutual information I ∗ . [sent-188, score-0.133]

72 From the remarks following Proposition 1, if we must guarantee that I ∗ ≥ supP I(P, Q) for any distribution P and loss function whose gradients are bounded in ∞ -norm by L, we must (by the remarks following Proposition 1) have dL2 . [sent-189, score-0.361]

73 I∗ 2 M∞ Up to higher-order terms, to guarantee a level of privacy with mutual information I ∗ , we must allow gradient noise up to M∞ = L d/I ∗ . [sent-190, score-1.013]

74 Using the bijection between M∞ and the maximal allowed √ √ mutual information I ∗ under local privacy that we have shown, we substitute M∞ = L d/ I ∗ into the upper and lower bounds that we have already attained. [sent-191, score-1.096]

75 Similar upper and lower bounds can be obtained under the conditions of part (a) of Theorem 1, √ where we need not assume Θ is an 1 -ball, but we lose a factor of log d in the lower bound. [sent-192, score-0.197]

76 Assume that Q∗ satisﬁes optimal local privacy at information level I ∗ . [sent-196, score-0.878]

77 n nI ∗ nI ∗ Proof By the conditions of optimal local privacy (Proposition 2 and Corollary 3), to have I ∗ ≥ supP I(P, Q) for any loss whose gradients are bounded in 1 -norm by L, we must have I∗ dL2 2, 2M1 using Corollary 3. [sent-198, score-1.064]

78 Rewriting this, we see that we must have M1 = L d/2I ∗ (to higher-order terms) to be able to guarantee an amount of privacy I ∗ . [sent-199, score-0.91]

79 Indeed, stochastic gradient descent (SGD) enjoys the following convergence guarantees (e. [sent-201, score-0.142]

80 Let Θ ⊆ Rd be contained in the ∞ ball of radius r and the gradients of the loss belong to the √ √ ∗ 1 -ball of radius M1 . [sent-204, score-0.263]

81 6 4 Saddle points, optimal privacy, and mutual information In this section, we explore conditions for a distribution Q∗ to satisfy optimal local privacy, as given by Deﬁnition 1. [sent-207, score-0.234]

82 We give characterizations of necessary and sufﬁcient conditions based on the compact sets C ⊂ D for distributions P ∗ and Q∗ to achieve the saddle point (4). [sent-208, score-0.225]

83 Our results can be viewed as rate distortion theorems [14, 8] (with source P and channel Q) for certain compact alphabets, though as far as we know, they are all new. [sent-209, score-0.17]

84 Thus we sometimes refer to the conditional distribution Q, which is designed to maintain the privacy of the data X by communication of Z, as the channel distribution. [sent-210, score-0.95]

85 Since we wish to bound I(X; Z) for arbitrary losses , we must address the case when (X, θ) = θ, X , in which case (X, θ) = X; by the data-processing inequality [14, Chapter 5] it is thus no loss of generality to assume that X ∈ C and that E[Z | X] = X. [sent-211, score-0.197]

86 Let C ⊂ Rd be a compact convex set with extreme points ui ∈ Rd , i ∈ I for some index set I. [sent-215, score-0.189]

87 Then C is rotationally invariant through its extreme points if ui 2 = uj 2 for each i, j, and for any unitary matrix U such that U ui = uj for some i = j, then U C = C. [sent-216, score-0.248]

88 Some examples of convex sets rotationally invariant through their extreme points include p -norm balls for p = 1, 2, ∞, though p -balls for p ∈ {1, 2, ∞} are not. [sent-217, score-0.193]

89 The following theorem gives a general characterization of the minimax mutual information for rotationally invariant norm balls with ﬁnite numbers of extreme points by providing saddle point distributions P ∗ and Q∗ . [sent-218, score-0.534]

90 Moreover, the distribution P ∗ uniform on {ui }m uniquely attains the saddle point (4). [sent-228, score-0.147]

91 In fact, we may introduce a random variable X between X and Z: let X be distributed among the extreme points {ui }m of C in any way such that E[X | X] = i=1 X, then use the maximum entropy distribution Q∗ (· | ui ) deﬁned in the theorem when X ∈ {ui }m i=1 to sample Z from X . [sent-230, score-0.18]

92 The information processing inequality [14, Chapter 5] guarantees the Markov chain X → X → Z satisﬁes the minimax bound I(X; Z) ≤ inf Q supP I(P, Q). [sent-231, score-0.15]

93 With Theorem 3 in place, we can explicitly characterize the distributions achieving optimal local privacy (recall Deﬁnition 1) for 1 and ∞ balls. [sent-232, score-0.942]

94 Here the arguments are slightly more complicated, as the coordinates of the random variables are no longer independent, but Theorem 3 still allows us to explicitly characterize the saddle point of the mutual information. [sent-246, score-0.253]

95 (9), and let Q∗ be the distribution on Z | X such that Z is supported on {±M ei }d , and i=1 eγ ∗ , (10a) Q (Z = M ei | X = ei ) = γ −γ + (2d − 2) e +e e−γ Q∗ (Z = −M ei | X = ei ) = γ , (10b) −γ + (2d − 2) e +e 1 . [sent-250, score-0.43]

96 Additionally, an asymptotic expansion allows us to gain a somewhat clearer picture of the values of the mutual information, though we do not derive upper bounds as we did for Proposition 1. [sent-255, score-0.181]

97 We study procedures that access each datum only once and through a perturbed view Zi of the subgradient ∂ (Xi , θ), which allows us to use (essentially) any convex loss. [sent-262, score-0.343]

98 These questions do not appear to have easy answers, especially when we wish to allow each provider of a single datum to be able to guarantee his or her own privacy. [sent-274, score-0.174]

99 Nevertheless, we hope that our view of privacy and the techniques we have developed herein prove fruitful, and we hope to investigate some of the above issues in future work. [sent-275, score-0.811]

100 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-294, score-0.133]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('privacy', 0.811), ('private', 0.162), ('saddle', 0.116), ('mutual', 0.103), ('zi', 0.092), ('rd', 0.09), ('datum', 0.088), ('ei', 0.086), ('minimax', 0.083), ('providers', 0.077), ('rotationally', 0.077), ('supp', 0.076), ('loss', 0.075), ('dwork', 0.074), ('channel', 0.07), ('remarks', 0.07), ('procedures', 0.069), ('inf', 0.067), ('differentially', 0.067), ('subgradient', 0.064), ('ui', 0.064), ('communicated', 0.063), ('sup', 0.058), ('mievski', 0.058), ('xi', 0.057), ('theorems', 0.055), ('risk', 0.052), ('perturbed', 0.051), ('tradeoffs', 0.05), ('ni', 0.05), ('sharp', 0.048), ('gradients', 0.047), ('wish', 0.047), ('theorem', 0.046), ('proposition', 0.046), ('rates', 0.045), ('losses', 0.045), ('compact', 0.045), ('bounds', 0.044), ('extreme', 0.043), ('stochastic', 0.041), ('corollary', 0.04), ('statistical', 0.04), ('mirror', 0.039), ('ev', 0.039), ('bijection', 0.039), ('convergence', 0.039), ('guarantee', 0.039), ('deferring', 0.038), ('disguised', 0.038), ('owners', 0.038), ('rld', 0.038), ('sankar', 0.038), ('tensions', 0.038), ('zdq', 0.038), ('differential', 0.038), ('conditional', 0.038), ('perturbation', 0.038), ('radius', 0.038), ('convex', 0.037), ('local', 0.037), ('regular', 0.036), ('balls', 0.036), ('belong', 0.036), ('satis', 0.035), ('utility', 0.035), ('characterize', 0.034), ('conditions', 0.034), ('upper', 0.034), ('mum', 0.034), ('access', 0.034), ('nissim', 0.034), ('kasiviswanathan', 0.034), ('warner', 0.034), ('rl', 0.033), ('nition', 0.033), ('descent', 0.032), ('chapter', 0.032), ('evidently', 0.031), ('attains', 0.031), ('communication', 0.031), ('optimal', 0.03), ('amount', 0.03), ('must', 0.03), ('distributions', 0.03), ('gradient', 0.03), ('ep', 0.03), ('symposium', 0.029), ('ball', 0.029), ('log', 0.029), ('lower', 0.028), ('participant', 0.028), ('rubinstein', 0.028), ('estimation', 0.028), ('functions', 0.028), ('eq', 0.027), ('entropy', 0.027), ('survey', 0.027), ('wasserman', 0.027), ('chaudhuri', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

2 0.5458079 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good lowdimensional approximations to data sets in high dimension. Many current data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We demonstrate that on real data, there is a large performance gap between the existing method and our method. We show that the sample complexity for the two procedures differs in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. 1

3 0.36220768 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry

Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Multiplicative Weights update rule with the Exponential Mechanism. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques. 1

4 0.16689122 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their ﬁnite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

5 0.097008251 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also conﬁrmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

6 0.079842485 16 nips-2012-A Polynomial-time Form of Robust Regression

7 0.073143899 293 nips-2012-Relax and Randomize : From Value to Algorithms

8 0.062364608 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

9 0.059714902 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

10 0.059186369 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

11 0.058050822 206 nips-2012-Majorization for CRFs and Latent Likelihoods

12 0.057119727 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

13 0.056983285 227 nips-2012-Multiclass Learning with Simplex Coding

14 0.054217137 324 nips-2012-Stochastic Gradient Descent with Only One Projection

15 0.053970683 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

16 0.052604459 361 nips-2012-Volume Regularization for Binary Classification

17 0.052403178 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

18 0.05181038 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

19 0.050725609 254 nips-2012-On the Sample Complexity of Robust PCA

20 0.050219007 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.172), (1, 0.013), (2, 0.097), (3, -0.046), (4, 0.05), (5, 0.056), (6, -0.001), (7, 0.12), (8, -0.025), (9, -0.046), (10, 0.049), (11, -0.001), (12, -0.129), (13, -0.07), (14, -0.007), (15, 0.075), (16, 0.018), (17, -0.532), (18, 0.129), (19, 0.131), (20, -0.316), (21, -0.013), (22, 0.022), (23, -0.178), (24, 0.222), (25, -0.046), (26, 0.009), (27, -0.053), (28, 0.059), (29, -0.0), (30, 0.043), (31, -0.026), (32, 0.007), (33, -0.037), (34, 0.12), (35, 0.006), (36, -0.008), (37, -0.004), (38, -0.005), (39, -0.071), (40, -0.016), (41, 0.007), (42, 0.019), (43, 0.047), (44, -0.012), (45, -0.011), (46, 0.003), (47, 0.014), (48, 0.001), (49, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91347528 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

Author: Moritz Hardt, Katrina Ligett, Frank Mcsherry

2 0.90238428 237 nips-2012-Near-optimal Differentially Private Principal Components

Author: Kamalika Chaudhuri, Anand Sarwate, Kaushik Sinha

same-paper 3 0.88527864 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

4 0.3829174 285 nips-2012-Query Complexity of Derivative-Free Optimization

Author: Ben Recht, Kevin G. Jamieson, Robert Nowak

Abstract: This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same. 1

5 0.37384865 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

6 0.34146926 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

7 0.3226679 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

8 0.31914079 217 nips-2012-Mixability in Statistical Learning

9 0.31356248 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

10 0.29735234 254 nips-2012-On the Sample Complexity of Robust PCA

11 0.28619504 161 nips-2012-Interpreting prediction markets: a stochastic approach

12 0.27516839 46 nips-2012-Assessing Blinding in Clinical Trials

13 0.27438816 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

14 0.26719081 16 nips-2012-A Polynomial-time Form of Robust Regression

15 0.25594783 34 nips-2012-Active Learning of Multi-Index Function Models

16 0.25473902 361 nips-2012-Volume Regularization for Binary Classification

17 0.24892882 206 nips-2012-Majorization for CRFs and Latent Likelihoods

18 0.24883407 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

19 0.24869849 271 nips-2012-Pointwise Tracking the Optimal Regression Function

20 0.24386115 338 nips-2012-The Perturbed Variation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (14, 0.013), (21, 0.019), (28, 0.029), (38, 0.149), (39, 0.01), (42, 0.119), (53, 0.011), (54, 0.018), (55, 0.03), (74, 0.038), (76, 0.142), (80, 0.087), (86, 0.18), (92, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86739874 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via afﬁne transformations. By placing a matrix normal-inverse-Wishart prior on these afﬁne transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.

2 0.84672666 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably inﬁnite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably inﬁnite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a ﬁxed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore deﬁne a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately ﬂat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

same-paper 3 0.83856767 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

4 0.81987137 303 nips-2012-Searching for objects driven by context

Author: Bogdan Alexe, Nicolas Heess, Yee W. Teh, Vittorio Ferrari

Abstract: The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image, exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while achieving higher object detection performance. 1

5 0.81505007 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

6 0.80298162 242 nips-2012-Non-linear Metric Learning

7 0.80167472 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

8 0.80063039 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

9 0.79930574 289 nips-2012-Recognizing Activities by Attribute Dynamics

10 0.78829008 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

11 0.78422254 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

12 0.77315021 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

13 0.76885772 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

14 0.76643854 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

15 0.76572913 292 nips-2012-Regularized Off-Policy TD-Learning

16 0.76528871 237 nips-2012-Near-optimal Differentially Private Principal Components

17 0.76503903 358 nips-2012-Value Pursuit Iteration

18 0.76463234 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

19 0.76377696 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

20 0.76236463 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications