nips nips2013 nips2013-177 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Duchi, Martin J. Wainwright, Michael Jordan
Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. [sent-6, score-0.472]
2 We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. [sent-7, score-1.696]
3 One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. [sent-8, score-0.895]
4 With this motivation, Warner considered the problem of estimating the fractions of the population belonging to certain strata, which can be viewed as probability estimation within a multinomial model. [sent-10, score-0.258]
5 In this paper, we revisit Warner’s probability estimation problem, doing so within a theoretical framework that allows us to characterize optimal estimation under constraints on privacy. [sent-11, score-0.202]
6 We also apply our theoretical tools to a further probability estimation problem—that of nonparametric density estimation. [sent-12, score-0.261]
7 In the large body of research on privacy and statistical inference, [sent-13, score-0.715]
8 The literature has stopped short, however, of providing a formal treatment of disclosure risk that would permit decision-theoretic tools to be used in characterizing trade-offs between the utility of achieving privacy and the utility associated with an inferential goal. [sent-16, score-1.067]
9 Recently, a formal treatment of disclosure risk known as “differential privacy” has been proposed and studied in the cryptography, database and theoretical computer science literatures [11, 1]. [sent-17, score-0.222]
10 Differential privacy has strong semantic privacy guarantees that make it a good candidate for declaring a statistical procedure or data collection mechanism private, and it has been the focus of a growing body of recent work [13, 16, 24, 21, 6, 18, 8, 5, 9]. [sent-18, score-1.41]
11 In this paper, we bring together the formal treatment of disclosure risk provided by differential privacy with the tools of minimax decision theory to provide a theoretical treatment of probability estimation under privacy constraints. [sent-19, score-2.103]
12 Just as in classical minimax theory, we are able to provide lower bounds on the convergence rates of any estimator, in our case under a restriction to estimators that guarantee privacy. [sent-20, score-0.524]
13 We thus bring classical notions of privacy, as introduced by Warner [23], into contact with differential privacy and statistical decision theory, obtaining quantitative trade-offs between privacy and statistical efficiency. [sent-22, score-1.645]
14 We study procedures that receive private views Z1 , . . . , Zn of the original data X1 , . . . , Xn . [sent-25, score-0.403]
15 We assume each of these private views Zi is α-differentially private for the original data Xi . [sent-35, score-0.742]
16 Then Q provides α-local differential privacy if sup { Q(S | Xi = x) / Q(S | Xi = x′) : S ∈ σ(Z), x, x′ ∈ X } ≤ exp(α). (1) [sent-37, score-0.851]
17 This formulation of local privacy was first proposed by Evfimievski et al. [sent-38, score-0.706]
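To make condition (1) concrete for discrete data, here is a minimal numerical check of the likelihood-ratio bound for a finite channel represented as a row-stochastic matrix; the function name and the binary randomized-response example are illustrative and not taken from the paper. For finite output spaces, bounding the per-outcome ratio is equivalent to bounding the ratio over all sets S.

```python
import numpy as np

def local_dp_level(Q):
    """Smallest alpha such that the finite channel Q (rows = inputs x,
    columns = outputs z, entries = Q(z | x)) satisfies condition (1):
    Q(z | x) / Q(z | x') <= exp(alpha) for all z, x, x'."""
    Q = np.asarray(Q, dtype=float)
    ratios = Q[:, None, :] / Q[None, :, :]      # entry [x, x', z] = Q(z|x) / Q(z|x')
    return float(np.log(ratios.max()))

# Example: binary randomized response that reports the truth with
# probability e^alpha / (1 + e^alpha) and flips the answer otherwise.
alpha = 1.0
p = np.exp(alpha) / (1 + np.exp(alpha))
Q_rr = np.array([[p, 1 - p],
                 [1 - p, p]])
print(local_dp_level(Q_rr))                     # approximately alpha = 1.0
```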
18 In the current paper, we study minimax convergence rates when the data provided satisfies the local privacy guarantee (1). [sent-43, score-1.031]
19 Our two main results quantify the penalty that must be paid when local privacy at a level α is provided in multinomial estimation and density estimation problems. [sent-44, score-1.122]
20 At a high level, our first result implies that for estimation of a d-dimensional multinomial probability mass function, the effective sample size of any statistical estimation procedure decreases from n to nα2 /d whenever α is a sufficiently small constant. [sent-45, score-0.374]
21 Our second main result, on density estimation, exhibits an interesting departure from standard minimax estimation results. [sent-47, score-0.417]
22 If the density being estimated has β continuous derivatives, then classical results on density estimation [e.g., 26, 25, 22] [sent-48, score-0.351]
23 show that the minimax integrated squared error scales (in the sample size n) as n^(−2β/(2β+1)). [sent-50, score-0.225]
24 In the locally private case, we show that there is a difference in the polynomial rate of convergence: we obtain a scaling of (α2 n)−2β/(2β+2) . [sent-51, score-0.451]
25 We give efficiently implementable algorithms that attain sharp upper bounds as companions to our lower bounds, which in some cases exhibit the necessity of non-trivial sampling strategies to guarantee privacy. [sent-52, score-0.211]
26 We write an ≲ bn to denote an = O(bn ) and an ≍ bn to denote an = O(bn ) and bn = O(an ). [sent-58, score-0.183]
27 2 Background and Problem Formulation In this section, we provide the necessary background on the minimax framework used throughout the paper, more details of which can be found in standard sources. [sent-60, score-0.225]
28 We also reference our paper [9] on statistical inference under differential privacy constraints; we restate two theorems from [9] to keep our presentation self-contained. [sent-63, score-0.862]
29 The range Θ depends on the underlying statistical model; for example, for density estimation, Θ may consist of the set of probability densities defined on [0, 1]. [sent-66, score-0.233]
30 Recalling that Z is the domain of the private variables Zi , let θ̂ : Z^n → Θ denote an arbitrary estimator for θ. [sent-68, score-0.447]
31 Let Qα denote the set of conditional (or channel) distributions guaranteeing α-local privacy (1). [sent-69, score-0.666]
32 Looking uniformly over all channels Q ∈ Qα , we define the central object of interest for this paper, the α-private minimax rate for the family θ(P): Mn(θ(P), Φ ◦ ρ, α) := inf_{Q ∈ Qα} inf_{θ̂} sup_{P ∈ P} E_{P,Q}[ Φ( ρ( θ̂(Z1 , . . . , Zn ), θ(P) ) ) ]. (2) [sent-70, score-0.374]
33 We remark here (see also the discussion in [9]) that the private minimax risk (2) is different from previous work on optimality in differential privacy (e.g., [sent-78, score-1.468]
34 [2, 16, 8]): prior work focuses on accurate estimation of a sample quantity θ(x1:n ) based on the sample x1:n , while we provide lower bounds on the error in estimating the population quantity θ(P ). [sent-80, score-0.299]
35 Lower bounds on population estimation imply those on sample estimation, so our lower bounds are stronger than most of those in prior work. [sent-81, score-0.277]
36 A standard route for lower bounding the minimax risk (2) is by reducing the estimation problem to the testing problem of identifying a point θ ∈ Θ from a collection of well-separated points [26, 25]. [sent-82, score-0.446]
37 In this work we have the additional complication that all the statistician observes are the private samples Z1 , . . . , Zn . [sent-92, score-0.459]
38 Letting ψ : Z^n → V denote an arbitrary testing procedure, we have the following minimax bound, whose two parts are known as Le Cam’s two-point method [26, 22] and Fano’s inequality [25, 7, 22]. [sent-109, score-0.252]
39 For the previously described estimation and testing problems, Mn(θ(P), Φ ◦ ρ, Q) ≥ Φ(δ) inf_ψ P( ψ(Z1 , . . . , Zn ) ≠ V ). [sent-111, score-0.16]
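The displayed two-part inequalities themselves are garbled in this extraction; as a reference point, the standard forms of the two bounds are sketched below in LaTeX, with M_v^n denoting the marginal law of the private samples Z_1, ..., Z_n when the data are drawn from P_v and V uniform on a δ-packing. The exact constants in the paper's inequalities (5a)-(5b) may differ.

```latex
% Le Cam's two-point method, for P_1, P_2 with \rho(\theta(P_1), \theta(P_2)) \ge 2\delta:
\[
  M_n\bigl(\theta(\mathcal{P}), \Phi \circ \rho, Q\bigr)
  \;\ge\; \tfrac{1}{2}\,\Phi(\delta)\Bigl(1 - \bigl\lVert M_1^n - M_2^n \bigr\rVert_{\mathrm{TV}}\Bigr).
\]
% Fano's inequality, for a \delta-packing \{P_v\}_{v \in \mathcal{V}} with V uniform on \mathcal{V}:
\[
  M_n\bigl(\theta(\mathcal{P}), \Phi \circ \rho, Q\bigr)
  \;\ge\; \Phi(\delta)\Bigl(1 - \frac{I(Z_1, \dots, Z_n; V) + \log 2}{\log \lvert \mathcal{V} \rvert}\Bigr).
\]
```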
40 2.2 Information bounds The main step in proving minimax lower bounds is to control the divergences involved in the lower bounds (5a) and (5b). [sent-126, score-0.455]
41 We review two results from our work [9] that obtain such bounds as a function of the amount of privacy provided. [sent-127, score-0.72]
42 By combining Proposition 1 with Lemma 1, it is possible to derive sharp lower bounds on arbitrary estimation procedures under α-local privacy. [sent-143, score-0.291]
43 3 Multinomial Estimation under Local Privacy In this section we return to the classical problem of avoiding answer bias in surveys, the original motivation for studying local privacy [23]. [sent-145, score-0.801]
44 3.1 Minimax rates of convergence for multinomial estimation Let ∆d := { θ ∈ Rd | θ ≥ 0, Σ_{j=1}^d θj = 1 } denote the probability simplex in Rd . [sent-147, score-0.364]
45 The samples Xi are drawn i.i.d. from a multinomial with parameters θ, where Pθ (X = j) = θj for j ∈ {1, . . . , d}. [sent-152, score-0.123]
46 In one of the earliest evaluations of privacy, Warner [23] studied the Bernoulli variant of this problem and proposed randomized response: for a given survey question, respondents provide a truthful answer with probability p > 1/2 and lie with probability 1 − p. [sent-156, score-0.163]
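As a concrete illustration (a minimal sketch; the names and the choice p = 0.75 are ours, not the paper's), the following simulates Warner's randomized response for a single yes/no question and debiases the observed frequency; answering truthfully with probability p corresponds to α = log(p/(1−p)) in the sense of (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth, p, rng):
    """Each respondent answers truthfully with probability p > 1/2
    and gives the opposite answer with probability 1 - p."""
    tell_truth = rng.random(truth.shape) < p
    return np.where(tell_truth, truth, 1 - truth)

def debias(responses, p):
    """E[Z] = p*pi + (1 - p)*(1 - pi), so pi = (mean(Z) - (1 - p)) / (2p - 1)."""
    return (responses.mean() - (1 - p)) / (2 * p - 1)

pi_true = 0.3                                    # true fraction answering "yes"
truth = (rng.random(100_000) < pi_true).astype(int)
z = randomized_response(truth, p=0.75, rng=rng)
print(debias(z, p=0.75))                         # close to 0.3
```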
47 In our setting, we assume the statistician sees α-locally private (1) random variables Zi for the corresponding samples Xi from the multinomial. [sent-157, score-0.459]
48 In this case, we have the following result, which characterizes the minimax rate of estimation of a multinomial in both mean-squared error E[ ‖θ̂ − θ‖₂² ] and absolute error E[ ‖θ̂ − θ‖₁ ]; the latter may be more relevant for probability estimation problems. [sent-158, score-0.602]
49 There exist universal constants 0 < cℓ ≤ cu < 5 such that for all α ∈ [0, 1], the minimax rate for multinomial estimation satisfies the bounds cℓ min{ 1, 1/√(nα²), d/(nα²) } ≤ Mn(∆d , ‖·‖₂² , α) ≤ cu min{ 1, d/(nα²) }, (8) and cℓ min{ 1, d/√(nα²) } ≤ Mn(∆d , ‖·‖₁ , α) ≤ cu min{ 1, d/√(nα²) }. (9) [sent-160, score-0.672]
50 Theorem 1 shows that providing local privacy can sometimes be quite detrimental to the quality of statistical estimators. [sent-161, score-0.804]
51 Indeed, let us compare this rate to the classical rate in which there is no privacy. [sent-162, score-0.172]
52 By inequality (8), for suitably large sample sizes n, the effect of providing differential privacy at a level α is a reduction in the effective sample size from n to nα²/d. [sent-166, score-0.862]
53 3.2 Optimal mechanisms: attainability for multinomial estimation An interesting consequence of the lower bound in (8) is the following fact that we now demonstrate: Warner’s classical randomized response mechanism [23] (with minor modification) achieves the optimal convergence rate. [sent-168, score-0.564]
54 There are also other relatively simple estimation strategies that achieve convergence rate d/(nα²); the perturbation approach Dwork et al. [sent-169, score-0.245]
55 [11] propose, where Laplace(α) noise is added to each coordinate of a multinomial sample, is one such strategy. [sent-170, score-0.123]
56 Nonetheless, the ease of use and explainability of randomized response, coupled with our optimality results, provide support for randomized response as a preferred method for private estimation of population probabilities. [sent-171, score-0.698]
57 We now prove that randomized response attains the optimal rate of convergence. [sent-172, score-0.18]
58 There is a bijection between multinomial samples x ∈ {1, . . . , d} and the standard basis vectors e1 , [sent-173, score-0.152]
59 . . . , ed ∈ Rd , so we abuse notation and represent samples x in either form when designing estimation strategies. [sent-179, score-0.13]
60 In randomized response, we construct the private vector Z ∈ {0, 1}d from a multinomial observation x ∈ {e1 , . . . , ed } by randomizing each coordinate independently: Zj = xj with probability e^{α/2}/(1 + e^{α/2}) and Zj = 1 − xj otherwise. [sent-180, score-0.558]
61 The corresponding estimate is θ̂part := ((e^{α/2} + 1)/(e^{α/2} − 1)) ( (1/n) Σ_{i=1}^n Zi − 𝟙/(1 + e^{α/2}) ), where 𝟙 denotes the all-ones vector. (11) The projection of θ̂part onto the probability simplex can be done in time linear in the dimension d of the problem [3], so the estimator (11) is efficiently computable. [sent-188, score-0.142]
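A sketch of this mechanism and of the estimator (11) follows; the simplex projection below uses a standard sort-based routine rather than the specific algorithm of [3], and all function names are illustrative.

```python
import numpy as np

def privatize(X_onehot, alpha, rng):
    """Randomized response for multinomial data: each coordinate of the
    one-hot sample is kept with probability e^{alpha/2} / (1 + e^{alpha/2})
    and flipped otherwise (two coordinates change between distinct inputs,
    so the likelihood ratio is at most e^alpha)."""
    keep = rng.random(X_onehot.shape) < np.exp(alpha / 2) / (1 + np.exp(alpha / 2))
    return np.where(keep, X_onehot, 1 - X_onehot)

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def estimate(Z, alpha):
    """Debiased average of the private vectors (cf. (11)), then projected."""
    e = np.exp(alpha / 2)
    theta_part = (e + 1) / (e - 1) * (Z.mean(axis=0) - 1.0 / (1 + e))
    return project_simplex(theta_part)

rng = np.random.default_rng(0)
d, n, alpha = 10, 200_000, 1.0
theta = rng.dirichlet(np.ones(d))
X = rng.multinomial(1, theta, size=n)            # one-hot multinomial samples
Z = privatize(X, alpha, rng)
print(np.abs(estimate(Z, alpha) - theta).sum())  # small ell_1 error
```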
62 One can then verify that E[ ‖θ̂ − θ‖₂² ] ≤ min{ 2, E[ ‖θ̂part − θ‖₂² ] } ≲ d/(nα²), and a similar argument shows that randomized response is minimax optimal for the ℓ1 -loss as well. [sent-190, score-0.353]
63 4 Density Estimation under Local Privacy In this section, we turn to studying a nonparametric statistical problem in which the effects of local differential privacy turn out to be somewhat more severe. [sent-191, score-0.933]
64 We show that for the problem of density estimation, instead of just multiplicative loss in the effective sample size as in the previous section, imposing local differential privacy leads to a different convergence rate. [sent-192, score-0.992]
65 In more detail, we consider estimation of probability densities f : R → R+ with ∫ f(x) dx = 1 and f ≥ 0, defined on the real line, focusing on a standard family of densities of varying smoothness. [sent-193, score-0.328]
66 Roughly, we consider densities that have bounded βth derivative, and we study density estimation using the squared L2 norm ‖f‖₂² := ∫ f²(x) dx as our metric; in formal terms, we impose these constraints in terms of Sobolev classes. [sent-197, score-0.312]
67 For a given orthonormal basis {ϕj } of L2 ([0, 1]), smoothness parameter β > 1/2 and radius C, the function class Fβ [C] is given by Fβ [C] := { f ∈ L2 ([0, 1]) | f = Σ_{j=1}^∞ θj ϕj such that Σ_{j=1}^∞ j^{2β} θj² ≤ C² }. [sent-203, score-0.14]
68 If we choose the trigonometric basis as our orthonormal basis, then membership in the class Fβ [C] corresponds to certain smoothness constraints on the derivatives of f . [sent-204, score-0.186]
69 More precisely, for j ∈ N, consider the orthonormal basis for L2 ([0, 1]) of trigonometric functions: ϕ0 (t) = 1, ϕ2j (t) = √2 cos(2πjt), ϕ2j+1 (t) = √2 sin(2πjt). (12) [sent-205, score-0.145]
70 It is well known [26, 25, 22] that the minimax risk for nonprivate estimation of densities in the class Fβ scales as Mn( Fβ , ‖·‖₂² , ∞ ) ≍ n^(−2β/(2β+1)). (13) [sent-211, score-0.478]
71 Our main result is to demonstrate that the classical rate (13) is no longer attainable when we require α-local differential privacy. [sent-212, score-0.296]
72 In Sections 4.2 and 4.3, we show how to achieve the (new) optimal rate using histogram and orthogonal series estimators. [sent-215, score-0.144]
73 4.1 Lower bounds on density estimation We begin by giving our main lower bound on the minimax rate of estimation of densities when the samples are kept differentially private, providing the proof in the longer paper [9]. [sent-217, score-0.907]
74 Consider the class of densities Fβ defined using the trigonometric basis (12). [sent-219, score-0.185]
75 For some α ∈ [0, 1], suppose Zi are α-locally private (1) for the samples Xi ∈ [0, 1]. [sent-220, score-0.4]
76 Then the α-private minimax risk satisfies Mn( Fβ , ‖·‖₂² , α ) ≳ (α²n)^(−2β/(2β+2)). (14) In comparison with the classical minimax rate (13), the lower bound (14) involves a different polynomial exponent: privacy reduces the exponent from 2β/(2β + 1) to 2β/(2β + 2). [sent-222, score-1.078]
77 For example, for Lipschitz densities we have β = 1, and the rate degrades from n−2/3 to n−1/2 . [sent-223, score-0.145]
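A small numeric sketch of this gap, using the stated rates with constants ignored: for β = 1 and α = 1 it compares n^(−2β/(2β+1)) with (α²n)^(−2β/(2β+2)) and the sample size the private procedure would need to reach the same error.

```python
beta, alpha, n = 1.0, 1.0, 1_000_000
nonprivate = n ** (-2 * beta / (2 * beta + 1))            # n^(-2/3)  ~ 1e-4
private = (alpha**2 * n) ** (-2 * beta / (2 * beta + 2))  # n^(-1/2)  ~ 1e-3
# Sample size for the private rate to match the non-private error:
n_needed = nonprivate ** (-(2 * beta + 2) / (2 * beta)) / alpha**2
print(nonprivate, private, n_needed)                      # ~1e-4, ~1e-3, ~1e8
```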
78 Interestingly, no estimator based on Laplace (or exponential) perturbation of the samples Xi themselves can attain the rate of convergence (14). [sent-224, score-0.302]
79 Since the Laplace distribution’s characteristic function has tails decaying as t−2 , no estimator based on perturbing the samples directly can attain a rate of convergence better than n−2β/(2β+5) . [sent-226, score-0.286]
80 If the lower bound (14) is attainable, we must then study privacy mechanisms that are not simply based on direct perturbation of the samples {Xi }n . [sent-227, score-0.833]
81 4.2 Achievability by histogram estimators We now turn to the mean-squared errors achieved by specific practical schemes, beginning with the special case of Lipschitz density functions (β = 1), for which it suffices to consider a private version of a classical histogram estimate. [sent-229, score-0.699]
82 Any histogram estimate of the density based on these k bins can be specified by a vector θ ∈ k∆k , where we recall ∆k ⊂ Rk is the probability simplex. [sent-234, score-0.154]
83 Any such vector defines a density estimate via the sum fθ := Σ_{j=1}^k θj 1_{Xj} , where 1_E denotes the characteristic (indicator) function of the set E. [sent-235, score-0.119]
84 Let us now describe a mechanism that guarantees α-local differential privacy. [sent-236, score-0.176]
85 The variables {Zi} (i = 1, . . . , n) so defined are α-locally differentially private for {Xi}. [sent-247, score-0.445]
86 Using these private variables, we then form the density estimate f̂ := f_{θ̂} = Σ_{j=1}^k θ̂j 1_{Xj} based on θ̂ := Πk( (k/n) Σ_{i=1}^n Zi ), (16) where Πk denotes the Euclidean projection operator onto the set k∆k . [sent-248, score-0.488]
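The sentences defining the private variables Zi are missing from this extraction; the sketch below uses one standard choice (releasing the one-hot bin indicator of each sample plus independent Laplace noise of scale 2/α per bin; the indicator has ℓ1-sensitivity 2, so each Zi is α-locally differentially private), followed by the averaging and projection step (16). It is an illustrative reconstruction under these assumptions, not necessarily the exact mechanism of the paper.

```python
import numpy as np

def project_scaled_simplex(v, scale):
    """Euclidean projection onto scale * (probability simplex)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - scale
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def private_histogram_density(X, k, alpha, rng):
    """Locally private histogram density estimate on [0, 1] with k equal bins.
    Each sample releases Z_i = (one-hot bin indicator) + Laplace(scale 2/alpha)
    noise per bin; the estimate is theta_hat = Proj_{k*simplex}((k/n) sum_i Z_i),
    and the density value on bin j is theta_hat[j] (cf. (16))."""
    bins = np.minimum((X * k).astype(int), k - 1)
    E = np.eye(k)[bins]                                   # one-hot indicators, shape (n, k)
    Z = E + rng.laplace(scale=2.0 / alpha, size=E.shape)
    return project_scaled_simplex(k * Z.mean(axis=0), scale=k)

rng = np.random.default_rng(0)
X = rng.beta(2, 5, size=100_000)                          # a smooth density on [0, 1]
theta_hat = private_histogram_density(X, k=20, alpha=1.0, rng=rng)
print(theta_hat.sum() / 20)                               # the estimate integrates to 1
```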
87 For any fixed α > 0, the first term in the bound (17) dominates, and the O((α²n)^(−1/2)) rate matches the minimax lower bound (14) in the case β = 1: the privatized histogram estimator is minimax-optimal for Lipschitz densities. [sent-253, score-0.516]
88 This result provides the private analog of the classical result that histogram estimators are minimax-optimal (in the non-private setting) for Lipschitz densities. [sent-254, score-0.545]
89 4.3 Achievability by orthogonal projection estimators For higher degrees of smoothness (β > 1), histogram estimators no longer achieve optimal rates in the classical setting [20]. [sent-256, score-0.365]
90 Accordingly, we turn to estimators based on orthogonal series and show that even under local privacy, they achieve the lower bound (14) for all orders of smoothness β ≥ 1. [sent-257, score-0.22]
91 By inspection, Z is α-differentially private for any initial vector in the box [−B0 , B0 ]k , and moreover, the samples (19) are efficiently computable (for example by rejection sampling). [sent-282, score-0.4]
92 Propositions 2 and 3 make clear that the minimax lower bound (14) is sharp, as claimed. [sent-291, score-0.292]
93 Our construction, while related to earlier work [6], is different in that it is locally private and requires a different noise strategy to obtain both α-local privacy and the optimal convergence rate. [sent-294, score-1.113]
94 Lei [19] considers private M -estimators based on first performing a histogram density estimate, then using this to construct a second estimator; his estimator is not locally private, and the resulting M -estimators have sub-optimal convergence rates. [sent-295, score-0.677]
95 Finally, we remark that density estimators that are based on orthogonal series and Laplace perturbation are sub-optimal: they can achieve (at best) rates of (nα²)^(−2β/(2β+3)) , which is polynomially worse than the sharp result provided by Proposition 3. [sent-296, score-0.3]
96 5 Discussion We have linked minimax analysis from statistical decision theory with differential privacy, bringing some of their respective foundational principles into close contact. [sent-298, score-0.421]
97 In this paper particularly, we showed how to apply our divergence bounds to obtain sharp bounds on the convergence rate for certain nonparametric problems in addition to standard finite-dimensional settings. [sent-299, score-0.309]
98 By providing sharp convergence rates for many standard statistical inference procedures under local differential privacy, we have developed and explored some tools that may be used to better understand privacy-preserving statistical inference and estimation procedures. [sent-300, score-0.675]
99 We have identified a fundamental continuum along which privacy may be traded for utility in the form of accurate statistical estimates, providing a way to adjust statistical procedures to meet the privacy or utility needs of the statistician and the population being sampled. [sent-301, score-1.668]
100 Distributed private data analysis: Simultaneously solving how and what. [sent-322, score-0.371]
wordName wordTfidf (topN-words)
[('privacy', 0.666), ('private', 0.371), ('minimax', 0.225), ('warner', 0.177), ('differential', 0.147), ('multinomial', 0.123), ('zn', 0.107), ('estimation', 0.101), ('disclosure', 0.098), ('zi', 0.094), ('densities', 0.093), ('density', 0.091), ('estimator', 0.076), ('laplace', 0.075), ('differentially', 0.074), ('sobolev', 0.072), ('sharp', 0.07), ('channel', 0.069), ('xi', 0.069), ('classical', 0.068), ('randomized', 0.064), ('response', 0.064), ('histogram', 0.063), ('bn', 0.061), ('risk', 0.059), ('statistician', 0.059), ('mn', 0.056), ('dkl', 0.054), ('bounds', 0.054), ('attain', 0.053), ('orthonormal', 0.053), ('rate', 0.052), ('rates', 0.052), ('cryptography', 0.051), ('statistical', 0.049), ('providing', 0.049), ('dwork', 0.048), ('nissim', 0.048), ('convergence', 0.048), ('basis', 0.046), ('trigonometric', 0.046), ('dp', 0.046), ('perturbation', 0.044), ('achievability', 0.044), ('mievski', 0.044), ('proposition', 0.044), ('estimators', 0.043), ('smoothness', 0.041), ('xn', 0.04), ('local', 0.04), ('simplex', 0.04), ('cu', 0.039), ('carroll', 0.039), ('respondents', 0.039), ('chaudhuri', 0.038), ('sup', 0.038), ('treatment', 0.038), ('tools', 0.038), ('ck', 0.037), ('dx', 0.037), ('tv', 0.037), ('lipschitz', 0.036), ('fano', 0.036), ('url', 0.034), ('population', 0.034), ('lower', 0.034), ('bound', 0.033), ('survey', 0.033), ('utility', 0.032), ('cam', 0.032), ('duchi', 0.032), ('inf', 0.032), ('procedures', 0.032), ('packing', 0.031), ('nonparametric', 0.031), ('qi', 0.03), ('ev', 0.03), ('wasserman', 0.03), ('jt', 0.03), ('orthogonal', 0.029), ('mechanism', 0.029), ('mcsherry', 0.029), ('army', 0.029), ('attainable', 0.029), ('samples', 0.029), ('characteristic', 0.028), ('locally', 0.028), ('inferential', 0.028), ('ek', 0.028), ('formal', 0.027), ('xj', 0.027), ('channels', 0.027), ('qn', 0.027), ('mechanisms', 0.027), ('answer', 0.027), ('testing', 0.027), ('association', 0.026), ('projection', 0.026), ('symposium', 0.026), ('american', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
Author: John Duchi, Martin J. Wainwright, Michael Jordan
Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1
2 0.59100688 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning
Author: Kamalika Chaudhuri, Staal A. Vinterbo
Abstract: Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and datamining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, or, determining the parameter value, such as a bin size in a histogram, or a regularization parameter, that is suitable for a particular application. In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that apply when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure is independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine-learning – training a regularized linear classifier and building a histogram density estimator that result in end-toend differentially private solutions for these problems. 1
3 0.42583185 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
Author: Abhradeep Guha Thakurta, Adam Smith
Abstract: We give differentially private algorithms for a large class of online learning algorithms, in both the full information and bandit settings. Our algorithms aim to minimize a convex loss function which is a sum of smaller convex loss terms, one for each data point. To design our algorithms, we modify the popular mirror descent approach, or rather a variant called follow the approximate leader. The technique leads to the first nonprivate algorithms for private online learning in the bandit setting. In the full information setting, our algorithms improve over the regret bounds of previous work (due to Dwork, Naor, Pitassi and Rothblum (2010) and Jain, Kothari and Thakurta (2012)). In many cases, our algorithms (in both settings) match the dependence on the input length, T , of the optimal nonprivate regret bounds up to logarithmic factors in T . Our algorithms require logarithmic space and update time. 1
4 0.29226398 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
Author: Ziteng Wang, Kai Fan, Jiaqi Zhang, Liwei Wang
Abstract: We study differentially private mechanisms for answering smooth queries on databases consisting of data points in Rd . A K-smooth query is specified by a function whose partial derivatives up to order K are all bounded. We develop an -differentially private mechanism which for the class of K-smooth queries has K accuracy O(n− 2d+K / ). The mechanism first outputs a summary of the database. To obtain an answer of a query, the user runs a public evaluation algorithm which contains no information of the database. Outputting the summary runs in time d O(n1+ 2d+K ), and the evaluation algorithm for answering a query runs in time d+2+ 2d K ˜ O(n 2d+K ). Our mechanism is based on L∞ -approximation of (transformed) smooth functions by low degree even trigonometric polynomials with small and efficiently computable coefficients. 1
5 0.20104286 25 nips-2013-Adaptive Anonymity via $b$-Matching
Author: Krzysztof M. Choromanski, Tony Jebara, Kui Tang
Abstract: The adaptive anonymity problem is formalized where each individual shares their data along with an integer value to indicate their personal level of desired privacy. This problem leads to a generalization of k-anonymity to the b-matching setting. Novel algorithms and theory are provided to implement this type of anonymity. The relaxation achieves better utility, admits theoretical privacy guarantees that are as strong, and, most importantly, accommodates a variable level of anonymity for each individual. Empirical results confirm improved utility on benchmark and social data-sets.
7 0.11526166 309 nips-2013-Statistical Active Learning Algorithms
8 0.10453013 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
9 0.10086819 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
10 0.07860516 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
11 0.077588014 63 nips-2013-Cluster Trees on Manifolds
12 0.072850116 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
13 0.067936443 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
14 0.06426207 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
15 0.063937292 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
16 0.063675851 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
17 0.062600642 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
18 0.061973829 109 nips-2013-Estimating LASSO Risk and Noise Level
19 0.060806192 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
20 0.057549186 171 nips-2013-Learning with Noisy Labels
topicId topicWeight
[(0, 0.187), (1, 0.023), (2, 0.146), (3, -0.101), (4, 0.039), (5, 0.103), (6, -0.007), (7, -0.01), (8, 0.05), (9, 0.47), (10, 0.256), (11, -0.062), (12, -0.254), (13, 0.354), (14, -0.061), (15, 0.151), (16, 0.158), (17, -0.107), (18, -0.017), (19, -0.033), (20, 0.158), (21, -0.043), (22, 0.071), (23, -0.037), (24, -0.002), (25, 0.024), (26, 0.111), (27, -0.015), (28, 0.037), (29, 0.054), (30, -0.042), (31, -0.001), (32, -0.01), (33, 0.017), (34, 0.027), (35, -0.007), (36, -0.072), (37, -0.045), (38, -0.009), (39, 0.001), (40, -0.004), (41, -0.011), (42, 0.038), (43, 0.016), (44, 0.004), (45, 0.027), (46, 0.024), (47, -0.023), (48, -0.011), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.92796624 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
Author: John Duchi, Martin J. Wainwright, Michael Jordan
Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1
2 0.92670029 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning
Author: Kamalika Chaudhuri, Staal A. Vinterbo
Abstract: Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and datamining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, or, determining the parameter value, such as a bin size in a histogram, or a regularization parameter, that is suitable for a particular application. In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that apply when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure is independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine-learning – training a regularized linear classifier and building a histogram density estimator that result in end-toend differentially private solutions for these problems. 1
3 0.7638675 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
Author: Ziteng Wang, Kai Fan, Jiaqi Zhang, Liwei Wang
Abstract: We study differentially private mechanisms for answering smooth queries on databases consisting of data points in Rd . A K-smooth query is specified by a function whose partial derivatives up to order K are all bounded. We develop an -differentially private mechanism which for the class of K-smooth queries has K accuracy O(n− 2d+K / ). The mechanism first outputs a summary of the database. To obtain an answer of a query, the user runs a public evaluation algorithm which contains no information of the database. Outputting the summary runs in time d O(n1+ 2d+K ), and the evaluation algorithm for answering a query runs in time d+2+ 2d K ˜ O(n 2d+K ). Our mechanism is based on L∞ -approximation of (transformed) smooth functions by low degree even trigonometric polynomials with small and efficiently computable coefficients. 1
4 0.59046054 25 nips-2013-Adaptive Anonymity via $b$-Matching
Author: Krzysztof M. Choromanski, Tony Jebara, Kui Tang
Abstract: The adaptive anonymity problem is formalized where each individual shares their data along with an integer value to indicate their personal level of desired privacy. This problem leads to a generalization of k-anonymity to the b-matching setting. Novel algorithms and theory are provided to implement this type of anonymity. The relaxation achieves better utility, admits theoretical privacy guarantees that are as strong, and, most importantly, accommodates a variable level of anonymity for each individual. Empirical results confirm improved utility on benchmark and social data-sets.
5 0.58313179 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
Author: Abhradeep Guha Thakurta, Adam Smith
Abstract: We give differentially private algorithms for a large class of online learning algorithms, in both the full information and bandit settings. Our algorithms aim to minimize a convex loss function which is a sum of smaller convex loss terms, one for each data point. To design our algorithms, we modify the popular mirror descent approach, or rather a variant called follow the approximate leader. The technique leads to the first nonprivate algorithms for private online learning in the bandit setting. In the full information setting, our algorithms improve over the regret bounds of previous work (due to Dwork, Naor, Pitassi and Rothblum (2010) and Jain, Kothari and Thakurta (2012)). In many cases, our algorithms (in both settings) match the dependence on the input length, T , of the optimal nonprivate regret bounds up to logarithmic factors in T . Our algorithms require logarithmic space and update time. 1
7 0.33153811 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
8 0.30471197 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
9 0.29071426 63 nips-2013-Cluster Trees on Manifolds
10 0.28813902 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
11 0.27946609 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
12 0.2665014 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
13 0.25769627 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
14 0.25404921 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
15 0.25302455 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
16 0.24924852 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
17 0.23905177 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
18 0.23496512 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
19 0.22885247 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
20 0.22678053 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
topicId topicWeight
[(2, 0.041), (16, 0.04), (33, 0.161), (34, 0.101), (41, 0.028), (49, 0.03), (56, 0.138), (67, 0.192), (70, 0.025), (76, 0.01), (85, 0.034), (89, 0.059), (93, 0.028), (95, 0.032)]
simIndex simValue paperId paperTitle
1 0.91590989 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
Author: Bogdan Savchynskyy, Jörg Hendrik Kappes, Paul Swoboda, Christoph Schnörr
Abstract: We consider energy minimization for undirected graphical models, also known as the MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a significant progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are often defined on sparse graphs and convex relaxation methods, such as linear programming relaxations then provide good approximations to integral solutions. We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve much larger problems. We demonstrate the efficacy of our approach on a computer vision energy minimization benchmark. 1
2 0.88961911 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. 
In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. 
We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. 
For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 
3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. 
Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 
Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of ∇_v κ_3 and ∇_v κ_4, which can be used as practical choices of ∇_v G on real data.
1: function GI-ICA(v, y)
2:   repeat
3:     v ← ∇_v G(v^T y)
4:     v ← v / ||v||_2
5:   until convergence
6:   return v
7: end function

Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. Ã recovers A up to column order and scaling. R^T W is the demixing matrix for the observed random vector x.
function GaussianRobustICA(G, x)
  W = FindQuasiOrthogonalizationMatrix(x)
  y = W x
  R_columns = ∅
  for i = 1 to d do
    Draw v from S^{d−1} ∩ span(R_columns)^⊥ uniformly at random.
    R_columns = R_columns ∪ {GI-ICA(v, y)}
  end for
  Construct a matrix R using the elements of R_columns as columns.
  s̃ = R^T y
  Ã = (R^T W)^{−1}
  return Ã, s̃
end function

By convergence up to sign, we include the possibility that v oscillates between R_h and −R_h on alternating steps. This can occur if G(s̃_i) < 0 and r is odd. Due to space limitations, the proof is omitted.

Recovering all Independent Components. As a corollary to Theorem 4.2 we get:

Corollary 4.3. Suppose R_1, R_2, ..., R_k are known for some k < d. Suppose there exists i > k such that G(s_i) ≠ 0. If v is drawn uniformly at random from S^{d−1} ∩ span(R_1, ..., R_k)^⊥, where S^{d−1} denotes the unit sphere in R^d, then Algorithm 2 with input v converges to a new column of R almost surely.

Since the indexing of R is arbitrary, Corollary 4.3 yields a solution to noisy ICA, given in Algorithm 3. In practice (though not required by the theory), it may be better to enforce orthogonality between the columns of R by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G.

5 Time Complexity Analysis and Estimation of Cumulants

Implementing Algorithms 1 and 2 requires the estimation of functions from data. We limit our discussion to estimation of the third and fourth cumulants, as lower-order cumulants are more statistically stable to estimate than higher-order cumulants. κ_3 is useful in Algorithm 2 for non-symmetric distributions. However, since κ_3(s_i) = 0 whenever s_i has a symmetric distribution, it is plausible that κ_3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ_4 for G. Alternatively, one can fall back to κ_4 from κ_3 when κ_3 is detected to be near 0.

Denote by z^{(1)}, z^{(2)}, ..., z^{(N)} the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by k_r(z^{(i)}) the k-statistic sample estimate of κ_r(z). Letting m_r(z^{(i)}) := (1/N) Σ_{i=1}^N (z^{(i)} − z̄)^r denote the r-th sample central moment, then

k_3(z^{(i)}) := N^2 m_3(z^{(i)}) / ((N−1)(N−2)),
k_4(z^{(i)}) := N^2 [(N+1) m_4(z^{(i)}) − 3(N−1) m_2(z^{(i)})^2] / ((N−1)(N−2)(N−3))

give the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following lemma shows how to obtain unbiased estimates:

Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z^{(i)} be an iid sample of z. Let α ∈ N^d be a multi-index. Then ∂_u^α k_r(u · z^{(i)}) is an unbiased estimate of ∂_u^α κ_r(u · z).
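As a concrete illustration of the k-statistics defined above, a small numpy helper (a sketch; the function name is ours) computes k_3 and k_4 from a one-dimensional sample:

```python
import numpy as np

def k_statistics(z):
    """Unbiased k-statistic estimates of kappa_3(z) and kappa_4(z) from a 1-D sample."""
    z = np.asarray(z, dtype=float)
    N = z.size
    zc = z - z.mean()                      # sample-centered observations
    m2, m3, m4 = (zc ** 2).mean(), (zc ** 3).mean(), (zc ** 4).mean()
    k3 = N ** 2 * m3 / ((N - 1) * (N - 2))
    k4 = N ** 2 * ((N + 1) * m4 - 3 * (N - 1) * m2 ** 2) / ((N - 1) * (N - 2) * (N - 3))
    return k3, k4
```

For a Gaussian sample both estimates should be close to zero, while, for example, an exponential sample should give a clearly positive k_3.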
If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are:

∇_u k_3(u · y) = (3N / ((N−1)(N−2))) Σ_{i=1}^N (u · y^{(i)})^2 y^{(i)}    (2)

∇_u k_4(u · y) = (N^2 / ((N−1)(N−2)(N−3))) [ (4(N+1)/N) Σ_{i=1}^N (u · y^{(i)})^3 y^{(i)} − (12(N−1)/N^2) ( Σ_{i=1}^N (u · y^{(i)})^2 ) ( Σ_{i=1}^N (u · y^{(i)}) y^{(i)} ) ]    (3)

H_u k_4(u · x) = (12N^2 / ((N−1)(N−2)(N−3))) [ ((N+1)/N) Σ_{i=1}^N (u · x^{(i)})^2 (x x^T)^{(i)} − ((N−1)/N^2) ( Σ_{i=1}^N (u · x^{(i)})^2 ) ( Σ_{i=1}^N (x x^T)^{(i)} ) − ((2N−2)/N^2) ( Σ_{i=1}^N (u · x^{(i)}) x^{(i)} ) ( Σ_{i=1}^N (u · x^{(i)}) x^{(i)} )^T ]    (4)

Using (4) to estimate H_u κ_4(u^T x) from data when implementing Algorithm 1, the resulting quasi-orthogonalization algorithm runs in O(N d^3) time. Using (2) or (3) to estimate ∇_v G(v^T y) (with G chosen to be κ_3 or κ_4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d^2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved.

6 Simulation Results

In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA, the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In the five-distribution comparison of Figure 1, each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the t-distribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ_3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10) and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari index introduced in [1]. Let B̂ denote the approximate demixing matrix returned by an ICA algorithm, and let M = B̂A. Then, the Amari index is given by

E := Σ_{i=1}^n ( Σ_{j=1}^n |m_ij| / max_k |m_ik| − 1 ) + Σ_{j=1}^n ( Σ_{i=1}^n |m_ij| / max_k |m_kj| − 1 ).

The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseless data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth-order methods (GI-κ_4, JADE, and κ_4-FastICA) perform comparably.
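The Amari index is straightforward to compute; a minimal numpy sketch follows (the function name is ours, and it uses the unnormalized double-sum form reconstructed above):

```python
import numpy as np

def amari_index(B_hat, A):
    """Amari index of M = B_hat @ A; lower is better, 0 exactly when M is a scaled permutation."""
    M = np.abs(B_hat @ A)
    row_term = (M.sum(axis=1) / M.max(axis=1) - 1.0).sum()
    col_term = (M.sum(axis=0) / M.max(axis=0) - 1.0).sum()
    return row_term + col_term
```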
[Figure 1 appears here: six panels plotting Amari index against number of samples for the five-distribution data, at d = 5 and d = 10, under noiseless, 25%, and 50% noise magnitudes, comparing GI-κ_4 (white), GI-κ_4 (quasi-orthogonal), κ_4-FastICA, log cosh-FastICA, and JADE.]

Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10.

The difference between GI-κ_4 and κ_4-FastICA is not statistically significant over 50 runs with 100 000 samples. We note that GI-κ_4 under whitening and κ_4-FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ_4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. Error bars for our algorithms are provided in all cases, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ_4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ_4 outperforms all other algorithms (given sufficient samples), including log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GI-κ_4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary for robustness to Gaussian noise. Run times were reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ_4 (including the orthogonalization step) had an average running time² of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average numbers of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small.
Number of data pts                    500      1000    5000    10000   50000   100000
whitening+GI-κ4: mean num steps       11.76    5.92    4.99    4.59    4.35    4.16
quasi-orth.+GI-κ4: mean num steps     213.92   65.95   4.48    4.36    4.06    4.08

Acknowledgments
This work was supported by NSF grant IIS 1117707.

² Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM.

References
[1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, pages 757–763, 1996.
[2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.
[3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP, volume 30: COLT, pages 270–287, 2013.
[4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Artificial Neural Networks (ICANN), 1:509–514, 1999.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993.
[7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http://perso.telecom-paristech.fr/~cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013].
[8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.
[9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419–3430, 2011.
[10] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen. Matlab FastICA v 2.5. http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013].
[11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013.
[12] A. Hyvärinen. Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998.
[13] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[14] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. Van Nostrand, 1962.
[16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008.
[17] L. Mattner. What are cumulants? Documenta Mathematica, 4:601–622, 1999.
[18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009.
[19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. http://sourceforge.net/projects/giica/, 2013. [Online].
[20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005.
[21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
[22] V. Zarzoso and P. Comon. How fast is FastICA? EUSIPCO, 2006.
3 0.88419044 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
same-paper 4 0.85253775 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
Author: John Duchi, Martin J. Wainwright, Michael Jordan
Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1
5 0.79176128 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
6 0.78272045 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
8 0.77906585 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
9 0.77896047 355 nips-2013-Which Space Partitioning Tree to Use for Search?
10 0.7784822 171 nips-2013-Learning with Noisy Labels
11 0.77818912 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning
12 0.77761364 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.77638704 318 nips-2013-Structured Learning via Logistic Regression
14 0.7761904 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
15 0.7759847 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
16 0.77463192 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
17 0.77460235 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
18 0.77425241 249 nips-2013-Polar Operators for Structured Sparse Estimation
19 0.77370656 149 nips-2013-Latent Structured Active Learning
20 0.77339399 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks