jmlr jmlr2011 jmlr2011-22 knowledge-graph by maker-knowledge-mining

22 jmlr-2011-Differentially Private Empirical Risk Minimization


Source: pdf

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 These algorithms are private under the ε-differential privacy definition due to Dwork et al. [sent-8, score-0.68]

2 We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. [sent-15, score-0.515]

3 Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. [sent-18, score-0.7]

4 Thus, there is a great need for designing machine learning algorithms that also preserve the privacy of individuals in the data sets on which they train and operate. [sent-33, score-0.482]

5 For our privacy measure, we use a definition due to Dwork et al. [sent-41, score-0.458]

6 Their ε-differential privacy model is a strong, cryptographically-motivated definition of privacy that has recently received a significant amount of research attention for its robustness to known attacks, such as those involving side information (Ganta et al. [sent-43, score-0.916]

7 Algorithms satisfying ε-differential privacy are randomized; the output is a random variable whose distribution is conditioned on the data set. [sent-45, score-0.509]

8 A statistical procedure satisfies ε-differential privacy if changing a single data point does not shift the output distribution by too much. [sent-46, score-0.509]

9 An important aspect of our work is that we develop methods for end-to-end privacy; each step in the learning process can cause additional risk of privacy violation, and we provide algorithms with quantifiable privacy guarantees for training as well as parameter tuning. [sent-50, score-0.952]

10 In the context of privacy, we must guarantee the privacy of the holdout data as well. [sent-61, score-0.473]

11 We exploit results from the theory of differential privacy to develop a privacy-preserving parameter tuning algorithm, and demonstrate its use in practice. [sent-62, score-0.515]

12 Together with our training algorithms, this parameter tuning algorithm guarantees privacy to all data used in the learning process. [sent-63, score-0.5]

13 Guaranteeing privacy incurs a cost in performance; because the algorithms must cause some uncertainty in the output, they increase the loss of the output predictor. [sent-64, score-0.531]

14 Because the ε-differential privacy model requires robustness against all data sets, we make no assumptions on the underlying data for the purposes of making privacy guarantees. [sent-65, score-0.916]

15 However, to prove the impact of privacy constraints on the generalization error, we assume the data is i.i.d. [sent-66, score-0.473]

16 In addition to privacy guarantees, we also provide generalization bounds for this algorithm. [sent-78, score-0.473]

17 We also demonstrate the impact of end-to-end privacy on generalization error. [sent-87, score-0.473]

18 (2009) show that releasing R2 -values computed on high-dimensional genetic data can lead to privacy breaches by an adversary who is armed with a small amount of auxiliary information. [sent-95, score-0.519]

19 , 2006), spanning several communities, that uses privacy models other than differential privacy. [sent-98, score-0.495]

20 , 2008) considers the problem of privacy-preserving SVM classification when separate agents have to share private data, and provides a solution that uses random kernels, but does not provide any formal privacy guarantee. [sent-102, score-0.68]

21 An alternative line of privacy work is in the secure multiparty computation setting due to Yao (1982), where the sensitive data is split across multiple hostile databases, and the goal is to compute a function on the union of these databases. [sent-103, score-0.472]

22 Differential privacy, the formal privacy definition used in our paper, was proposed by the seminal work of Dwork et al. [sent-107, score-0.458]

23 (2006b), and has been used since in numerous works on privacy (Chaudhuri and Mishra, 2006; McSherry and Talwar, 2007; Nissim et al. [sent-108, score-0.458]

24 Unlike many other privacy definitions, such as those mentioned above, differential privacy has been shown to be resistant to composition attacks (attacks involving side-information) (Ganta et al. [sent-112, score-0.979]

25 Some follow-up work on differential privacy includes work on differentially-private combinatorial optimization, due to Gupta et al. [sent-114, score-0.495]

26 In addition, Dwork and Lei (2009) explore a connection between differential privacy and robust statistics, and provide an algorithm for privacy-preserving regression using ideas from robust statistics. [sent-129, score-0.495]

27 In general, output perturbation methods alter the output of the function computed on the database, before releasing it; in particular the sensitivity method makes an algorithm differentially private by adding noise to its output. [sent-134, score-0.632]
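
To make the sensitivity-method idea concrete, here is a minimal sketch (not the paper's exact Algorithm 1): train a regularized ERM classifier, then add a random vector whose density is proportional to exp(−(nΛεp/2)·||b||). The helper train_erm, the choice of logistic loss, and the specific noise scale are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def train_erm(X, y, lam):
    """Hypothetical helper: L2-regularized logistic regression trained by ERM."""
    n, d = X.shape
    def objective(f):
        margins = y * (X @ f)
        return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * f @ f
    return minimize(objective, np.zeros(d), method="L-BFGS-B").x

def output_perturbation(X, y, lam, eps_p, rng=None):
    """Sketch of the sensitivity method: train, then add noise to the output.

    Assumes the noise b has density proportional to exp(-beta * ||b||) with
    beta = n * lam * eps_p / 2; such a b can be drawn as a Gamma-distributed
    norm times a uniformly random direction.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    f_star = train_erm(X, y, lam)
    beta = n * lam * eps_p / 2.0
    norm = rng.gamma(shape=d, scale=1.0 / beta)   # ||b|| ~ Gamma(d, 1/beta)
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)        # uniform direction on the sphere
    return f_star + norm * direction
```

For example, output_perturbation(X, y, lam=1e-2, eps_p=0.1) returns a noisy weight vector; a smaller eps_p (stronger privacy) yields larger noise and hence higher loss.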

28 In the classification setting, the noise protects the privacy of the training data, but increases the prediction error of the classifier. [sent-135, score-0.495]

29 In contrast, our approach takes into account how the value of the regularization parameter might change due to privacy constraints. [sent-141, score-0.458]

30 The conditions under which we can prove results on privacy and generalization error depend on analytic properties of the loss and regularizer. [sent-176, score-0.495]

31 Strong convexity plays a role in guaranteeing our privacy and generalization requirements. [sent-181, score-0.487]

32 For our privacy results to hold we will also require that the regularizer N(·) and loss function ℓ(·, ·) be differentiable functions of f. [sent-182, score-0.557]

33 In some cases we can prove privacy guarantees for approximations to these non-differentiable functions. [sent-184, score-0.458]

34 We are interested in producing a classifier in a manner that preserves the privacy of individual entries of the data set D that is used in training the classifier. [sent-186, score-0.498]

35 The notion of privacy we use is the ε-differential privacy model, developed by Dwork et al. [sent-187, score-0.916]

36 (2006b) (see also Dwork (2006)), which defines a notion of privacy for a randomized algorithm A (D ). [sent-188, score-0.458]

37 The algorithm A provides differential privacy if for any set S , the likelihood that A (D ) ∈ S is close to the likelihood A (D ′ ) ∈ S , (where the likelihood is over the randomness in the algorithm). [sent-191, score-0.495]

38 The following definition of differential privacy is due to Dwork et al. [sent-193, score-0.495]

39 Definition 3 An algorithm A provides εp-differential privacy if for any two data sets D and D′ that differ in a single entry, and for any set S, exp(−εp) P(A(D′) ∈ S) ≤ P(A(D) ∈ S) ≤ exp(εp) P(A(D′) ∈ S), (3) where A(D) (resp. [sent-206, score-0.471]

40 From this definition, it is clear that the A (D ) that outputs the minimizer of the ERM objective (1) does not provide ε p -differential privacy for any ε p . [sent-210, score-0.525]

41 Note that the privacy parameter here does not depend on the sensitivity of the classification algorithm. [sent-283, score-0.55]

42 Compute fpriv = argmin_f ( Jpriv(f, D) + (∆/2)||f||² ). [sent-289, score-0.721]

43 The algorithm requires a certain slack, log(1 + 2c/(nΛ) + c²/(n²Λ²)), in the privacy parameter. [sent-290, score-0.458]
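
The objective-perturbation recipe sketched in the two lines above can be illustrated as follows. This is a hedged sketch, not the verbatim Algorithm 2: the slack formula, the fallback regularization ∆, and the noise density proportional to exp(−(ε′p/2)||b||) are assumptions reconstructed from the surrounding text.

```python
import numpy as np
from scipy.optimize import minimize

def objective_perturbation(X, y, lam, eps_p, c=0.25, rng=None):
    """Sketch of objective perturbation for logistic loss (|l''| <= c = 1/4).

    Assumed recipe: slack = log(1 + 2c/(n*lam) + c^2/(n^2*lam^2)); if the
    remaining budget eps_p' = eps_p - slack is non-positive, add extra
    regularization Delta and halve the budget; noise b has density
    proportional to exp(-(eps_p'/2) * ||b||).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    slack = np.log(1.0 + 2.0 * c / (n * lam) + c ** 2 / (n ** 2 * lam ** 2))
    eps_prime = eps_p - slack
    delta = 0.0
    if eps_prime <= 0:
        delta = c / (n * (np.exp(eps_p / 4.0) - 1.0)) - lam   # assumed fallback
        eps_prime = eps_p / 2.0
    norm = rng.gamma(shape=d, scale=2.0 / eps_prime)          # ||b|| ~ Gamma(d, 2/eps_p')
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    b = norm * direction
    def objective(f):
        margins = y * (X @ f)
        return (np.mean(np.logaddexp(0.0, -margins))
                + 0.5 * lam * f @ f
                + (b @ f) / n                                  # perturbation term
                + 0.5 * delta * f @ f)                         # extra regularization
    return minimize(objective, np.zeros(d), method="L-BFGS-B").x
```

The design point is that the noise enters the objective rather than the output, so less noise is needed for the same privacy level when the loss is smooth.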

44 We observe that given any fixed fpriv and a fixed data set D , there always exists a b such that Algorithm 2 outputs fpriv on input D . [sent-344, score-1.419]

45 Because ℓ is differentiable and convex, and N(·) is differentiable, we can take the gradient of the objective function and set it to 0 at fpriv . [sent-345, score-0.789]

46 , (xn , yn ), there is a bijection between b and fpriv . [sent-351, score-0.728]

47 The relation (6) shows that two different b values cannot result in the same fpriv . [sent-352, score-0.702]

48 Furthermore, since the objective is strictly convex, for a fixed b and D , there is a unique fpriv ; therefore the map from b to fpriv is injective. [sent-353, score-1.442]

49 The relation (6) also shows that for any fpriv there exists a b for which fpriv is the minimizer, so the map from b to fpriv is surjective. [sent-354, score-2.106]

50 To show ε p -differential privacy, we need to compute the ratio g(fpriv |D )/g(fpriv |D ′ ) of the densities of fpriv under the two data sets D and D ′ . [sent-355, score-0.702]

51 We note that the Jacobian is defined for all fpriv because N(·) and ℓ are globally doubly differentiable. [sent-361, score-0.744]

52 Moreover, we define matrices A and E as follows: A = nΛ∇²N(fpriv) + ∑_{i=1}^{n} yi² ℓ′′(yi fprivᵀxi) xi xiᵀ + n∆ Id, and E = −yn² ℓ′′(yn fprivᵀxn) xn xnᵀ + (y′n)² ℓ′′(y′n fprivᵀx′n) x′n (x′n)ᵀ. [sent-369, score-0.779]
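
Because the line above is reconstructed from a garbled extraction, the same two matrices are restated here in display form; fpriv, ∆, Id, and ℓ are as in the surrounding sentences, and (x′n, y′n) denotes the entry in which the two data sets differ.

```latex
A = n\Lambda\, \nabla^2 N(f_{\mathrm{priv}})
    + \sum_{i=1}^{n} y_i^2\, \ell''\!\left(y_i f_{\mathrm{priv}}^{T} x_i\right) x_i x_i^{T}
    + n\Delta I_d,
\qquad
E = -\,y_n^2\, \ell''\!\left(y_n f_{\mathrm{priv}}^{T} x_n\right) x_n x_n^{T}
    + (y_n')^2\, \ell''\!\left(y_n' f_{\mathrm{priv}}^{T} x_n'\right) x_n' (x_n')^{T}.
```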

53 Let G denote the map from fpriv to b in (6) under B = D , and H denote the map under B = D ′ . [sent-422, score-0.702]

54 Corollary 13 Let fpriv be the output of Algorithm 2 with ℓ = ℓHuber, c = 1/(2h), and N(f) = (1/2)||f||². [sent-424, score-0.753]
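
Corollary 13 uses a Huber-smoothed hinge loss with smoothing parameter h, whose second derivative is bounded by 1/(2h). The following sketch shows one standard Huberization of the hinge loss; the exact piecewise form used in the paper may differ in constants.

```python
import numpy as np

def huber_hinge(z, h=0.5):
    """Huber-smoothed hinge loss of the margin z = y * f(x) (a sketch).

    Piecewise: 0 for z > 1 + h; quadratic (1 + h - z)^2 / (4h) for |1 - z| <= h;
    linear 1 - z for z < 1 - h.  The quadratic piece makes the loss doubly
    differentiable with second derivative bounded by 1/(2h).
    """
    z = np.asarray(z, dtype=float)
    return np.where(z > 1.0 + h, 0.0,
           np.where(z < 1.0 - h, 1.0 - z, (1.0 + h - z) ** 2 / (4.0 * h)))
```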

55 For any set S of possible values of fpriv, and any pair of data sets D, D′ which differ in the private value of one person (xn, yn), exp(−εp) P(S | B = D′) ≤ P(S | B = D) ≤ exp(εp) P(S | B = D′). [sent-425, score-0.95]

56 If S \ C has positive measure, we can calculate the ratio of the probabilities for fpriv for which the loss is twice-differentiable. [sent-441, score-0.724]

57 For such fpriv the Jacobian is also defined, and we can use a method similar to Theorem 9 to prove the result. [sent-442, score-0.702]

58 Remark: Because the privacy proof for Algorithm 1 does not require these analytic properties, we can also use the Huber loss in Algorithm 1 to get an εp-differentially private approximation to the SVM. [sent-443, score-0.702]

59 It also serves as a reference against which we can compare the additional burden on the sample complexity imposed by the privacy constraints. [sent-464, score-0.458]

60 Proof Let frtr = argmin_f J̄(f), f∗ = argmin_f J(f, D), and fpriv denote the output of Algorithm 1. [sent-478, score-0.478]

61 Therefore, from Lemma 16, with probability 1 − δ over the privacy mechanism, J(fpriv, D) − J(f∗, D) ≤ 8d² log²(d/δ)(c + Λ) / (Λ²n²εp²). [sent-482, score-0.458]

62 Combining the preceding two statements, with probability 1 − 2δ over the noise in the privacy mechanism and the data distribution, the second term in the right-hand side of (9) is bounded as J̄(fpriv) − J̄(frtr) ≤ 16d² log²(d/δ)(c + Λ) / (Λ²n²εp²) + O(log(1/δ)/(nΛ)). [sent-486, score-0.486]

63 Given training data D , let f∗ be a classifier that minimizes J(f, D ) and let fpriv be the classifier output by Algorithm 1. [sent-491, score-0.775]

64 As 0 ≤ t ≤ 1, we can combine (12), (13), and (14) to obtain ||∇J(tf∗ + (1 − t)fpriv, D)|| ≤ ||Λ(∇N(f∗) − ∇N(tf∗ + (1 − t)fpriv)) + (1/n)∑_i yi (ℓ′(yi (f∗)ᵀxi) − ℓ′(yi (tf∗ + (1 − t)fpriv)ᵀxi)) xi|| ≤ (1 − t)||fpriv − f∗||·Λη + (1/n)·n·c·(1 − t)||fpriv − f∗|| ≤ ||fpriv − f∗||(Λη + c). (15) [sent-505, score-1.459]

65 From the definition of Algorithm 1, fpriv − f∗ = b, where b is the noise vector. [sent-506, score-0.717]

66 according to P, and if n > C max( ||f0||² log(1/δ)/εg², c||f0||²/(εg εp), d log(d/δ)||f0||/(εg εp) ), then the output fpriv of Algorithm 2 satisfies P( L(fpriv) ≤ L∗ + εg ) ≥ 1 − 2δ. [sent-532, score-0.753]

67 Proof Let frtr = argmin_f J̄(f), f∗ = argmin_f J(f, D), and fpriv denote the output of Algorithm 1. [sent-533, score-0.979]

68 Therefore, we can apply Lemma 19 to conclude that with probability at least 1 − δ over the privacy mechanism, J(fpriv, D) − J(f∗, D) ≤ 4d² log²(d/δ) / (Λn²εp²). [sent-538, score-0.458]

69 Let f∗ = argmin J(f, D ), and let fpriv be the classifier output by Algorithm 2. [sent-548, score-0.772]

70 Proof By the assumption ε′p > 0, the classifier fpriv minimizes the objective function J(f, D ) + 1 bT f, n and therefore 1 J(fpriv , D ) ≤ J(f∗ , D ) + bT (f∗ − fpriv ). [sent-550, score-1.442]

71 We can therefore apply Lemma 7 with G(f) = J(f, D) and g(f) = (1/n)bᵀf to obtain the bound ||f∗ − fpriv|| ≤ (1/Λ)||∇((1/n)bᵀf)|| ≤ ||b||/(nΛ). [sent-553, score-0.702]
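
For readers reconstructing the step above, here is a short sketch of the strong-convexity argument that yields the bound, written from the surrounding definitions rather than copied from the paper (it assumes J(·, D) is Λ-strongly convex, f∗ minimizes J(·, D), and fpriv minimizes J(·, D) + g(·) with g(f) = (1/n)bᵀf).

```latex
\begin{align*}
J(f_{\mathrm{priv}}, D) &\ge J(f^*, D) + \tfrac{\Lambda}{2}\,\|f_{\mathrm{priv}} - f^*\|^2, \\
J(f^*, D) + g(f^*) &\ge J(f_{\mathrm{priv}}, D) + g(f_{\mathrm{priv}})
                        + \tfrac{\Lambda}{2}\,\|f_{\mathrm{priv}} - f^*\|^2.
\end{align*}
% Adding the two inequalities:
\[
\Lambda\,\|f_{\mathrm{priv}} - f^*\|^2
  \le g(f^*) - g(f_{\mathrm{priv}})
  = \tfrac{1}{n}\, b^{T}(f^* - f_{\mathrm{priv}})
  \le \tfrac{\|b\|}{n}\,\|f^* - f_{\mathrm{priv}}\|,
\qquad\text{hence}\qquad
\|f^* - f_{\mathrm{priv}}\| \le \frac{\|b\|}{n\Lambda}.
\]
```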

72 There exists a C1 such that if n > C1 max( ||f0||² log(1/δ)/εg², d log(d/δ)||f0||/(εg εp), d log(d/δ)||f0||²/(εg^(3/2) εp) ), then the output fpriv of Algorithm 1 satisfies P( L(fpriv) ≤ L∗ + εg ) ≥ 1 − δ. [sent-568, score-0.753]

73 There exists a C2 such that if n > C2 max( ||f0||² log(1/δ)/εg², ||f0||²/(εg εp), d log(d/δ)||f0||/(εg εp) ), then the output fpriv of Algorithm 2 with c = 1/4, satisfies P( L(fpriv) ≤ L∗ + εg ) ≥ 1 − δ. [sent-570, score-0.753]

74 There exists a C1 such that if n > C1 max( ||f0||² log(1/δ)/εg², d log(d/δ)||f0||/(εg εp), d log(d/δ)||f0||²/(h^(1/2) εg^(3/2) εp) ), then the output fpriv of Algorithm 1 satisfies P( L(fpriv) ≤ L∗ + εg ) ≥ 1 − δ. [sent-581, score-0.753]

75 There exists a C2 such that if n > C2 max( ||f0||² log(1/δ)/εg², ||f0||²/(h εg εp), d log(d/δ)||f0||/(εg εp) ), then the output fpriv of Algorithm 2 with c = 1/4, satisfies P( L(fpriv) ≤ L∗ + εg ) ≥ 1 − δ. [sent-583, score-0.753]

76 A crucial difficulty in terms of privacy is that this directly releases the private values x(i) of some individuals in the training set. [sent-593, score-0.746]

77 Algorithm 3 Private ERM for nonlinear kernels. Inputs: Data {(xi, yi) : i ∈ [n]}, positive definite kernel function k(·, ·), sampling function K(θ), parameters εp, Λ, D. Outputs: Predictor fpriv and pre-filter {θj : j ∈ [D]}. [sent-620, score-0.729]

78 Because the workhorse of Algorithm 3 is a differentially-private version of ERM for linear classifiers (either Algorithm 1 or Algorithm 2), and the points {θj : j ∈ [D]} are independent of the data, the privacy guarantees for Algorithm 3 follow trivially from Theorems 6 and 9. [sent-628, score-0.458]
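
A minimal sketch of the reduction behind Algorithm 3, assuming an RBF kernel and the random Fourier feature map of Rahimi and Recht; the feature scaling and the name private_linear_trainer (any εp-differentially-private linear ERM routine, such as the objective-perturbation sketch earlier) are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng=None):
    """Random Fourier features approximating the RBF kernel exp(-gamma * ||x - x'||^2).

    Draws theta_j ~ N(0, 2*gamma*I) and a uniform phase; the sqrt(2/D) scaling
    makes inner products of the features approximate the kernel.  The pre-filter
    (theta, phase) is independent of the data.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    theta = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=D)
    V = np.sqrt(2.0 / D) * np.cos(X @ theta.T + phase)
    return V, (theta, phase)

def private_kernel_erm(X, y, D, gamma, lam, eps_p, private_linear_trainer):
    """Sketch of Algorithm 3: map to random features, then run any private linear ERM."""
    V, prefilter = random_fourier_features(X, D, gamma)
    f_priv = private_linear_trainer(V, y, lam, eps_p)
    return f_priv, prefilter
```

Because the pre-filter is drawn before the data are touched, releasing it costs no privacy; only the linear training step consumes the εp budget.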

79 That is, given an f0 with some properties, we will choose regularization parameter Λ, dimension D, and number of samples n so that the predictor fpriv has expected loss close to that of f0 . [sent-638, score-0.742]

80 The coefficients satisfy an ∞-norm bound of C/D, so (22) holds. Now, given such an fp, we must show that fpriv will have true risk close to that of fp as long as there are enough data points. [sent-653, score-0.716]

81 Then we have L(fpriv) − L(fp) ≤ (J̄(fpriv) − J̄(frtr)) + (Λ/2)||fp||² − (Λ/2)||fpriv||². [sent-658, score-0.726]

82 Plugging (26) into (23), discarding the negative term involving fpriv, and setting Λ = εg/||fp||² gives L(fpriv) − L(fp) ≤ 8||fp||² D² log²(D/δ)/(n²εp²εg) + O(||fp||² log(1/δ)/(nεg)). [sent-664, score-0.786]

83 Then by the definition of frtr, L(frtr) + (εg/2)·||frtr||²H/||f0||²H ≤ L(f0) + εg. [sent-689, score-0.564]

84 However, even though the output of an algorithm preserves εp-differential privacy for a fixed Λ (as is the case with Algorithms 1 and 2), choosing a Λ based on empirical performance on a validation set may violate εp-differential privacy guarantees. [sent-701, score-0.985]

85 Since the value of Λ does not depend on the values in the private data set, this procedure will still preserve the privacy of individuals in the private data. [sent-706, score-0.926]

86 The last step is needed to guarantee ε p -differential privacy for individuals in the validation set. [sent-711, score-0.482]

87 Set fpriv = fi with probability qi = exp(−εp zi/2). [sent-729, score-0.72]

88 D1, . . . , Dm+1 are disjoint; moreover, the randomness in the privacy mechanisms is independent. [sent-750, score-0.458]
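
A hedged sketch of the tuning procedure described above: each candidate Λ is trained privately on its own disjoint chunk, scored on a held-out chunk, and the winner is chosen with weights exp(−εp·zi/2), so the validation data also receive εp-differential privacy. The normalization of the weights and the equal-size split are assumptions.

```python
import numpy as np

def private_tuning(X, y, lambdas, eps_p, train_private, rng=None):
    """Sketch of privacy-preserving parameter tuning via an exponential-mechanism pick.

    Splits the data into m + 1 disjoint chunks: one training chunk per candidate
    lambda plus one validation chunk.  `train_private` is whichever
    eps_p-differentially-private ERM routine is being tuned.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    m = len(lambdas)
    chunks = np.array_split(rng.permutation(X.shape[0]), m + 1)
    X_val, y_val = X[chunks[-1]], y[chunks[-1]]

    candidates, errors = [], []
    for lam, idx in zip(lambdas, chunks[:m]):
        f_i = train_private(X[idx], y[idx], lam, eps_p)
        candidates.append(f_i)
        errors.append(np.sum(np.sign(X_val @ f_i) != y_val))   # z_i = validation mistakes

    weights = np.exp(-eps_p * np.asarray(errors, dtype=float) / 2.0)
    probs = weights / weights.sum()
    return candidates[rng.choice(m, p=probs)]                   # randomized winner selection
```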

89 We first demonstrate how the accuracy of the classification algorithms varies with εp, the privacy requirement. [sent-819, score-0.458]

90 For our first set of experiments, we study the tradeoff between the privacy requirement on the classifier, and its classification accuracy, when the classifier is trained on data of a fixed size. [sent-840, score-0.474]

91 The privacy requirement is quantified by the value of ε p ; increasing ε p implies a higher change in the belief of the adversary when one entry in D changes, and thus lower privacy. [sent-841, score-0.497]

92 Objective perturbation outperforms the sensitivity method, and objective perturbation for support vector machines achieves lower classification error than objective perturbation for logistic regression. [sent-1003, score-0.524]

93 In the low privacy regime, logistic regression under objective perturbation is better than support vector machines. [sent-1070, score-0.67]

94 In contrast, in the high privacy regime (low ε p ), support vector machines with objective perturbation outperform logistic regression. [sent-1071, score-0.67]

95 We consider privacy in the ε p -differential privacy model of Dwork et al. [sent-1139, score-0.916]

96 In general, for classification, the error rate increases as the privacy requirements are made more stringent. [sent-1147, score-0.458]

97 Our experiments, as well as theoretical results, indicate that objective perturbation usually outperforms the sensitivity methods at managing the tradeoff between privacy and learning performance. [sent-1149, score-0.741]

98 Our current method is based on a reduction to the linear case, using the algorithm of Rahimi and Recht (2008b); however, this method can be statistically inefficient and may require a lot of training data, particularly when coupled with our privacy mechanism. [sent-1173, score-0.48]

99 We believe the present study is a starting point for further study of the differential privacy model in this relatively new subfield of machine learning. [sent-1180, score-0.495]

100 Demanding high privacy requires sacrificing utility, which in the context of classification and prediction is excess loss or regret. [sent-1183, score-0.48]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fpriv', 0.702), ('privacy', 0.458), ('private', 0.222), ('frtr', 0.188), ('erm', 0.173), ('perturbation', 0.137), ('dwork', 0.106), ('arwate', 0.099), ('haudhuri', 0.099), ('ifferentially', 0.099), ('onteleoni', 0.099), ('rivate', 0.099), ('sensitivity', 0.092), ('huber', 0.083), ('lr', 0.063), ('mcsherry', 0.054), ('output', 0.051), ('er', 0.051), ('priv', 0.049), ('differentiable', 0.049), ('det', 0.046), ('chaudhuri', 0.045), ('ft', 0.043), ('classi', 0.042), ('doubly', 0.042), ('rahimi', 0.042), ('tf', 0.042), ('fm', 0.041), ('misclassification', 0.04), ('monteleoni', 0.04), ('objective', 0.038), ('logistic', 0.037), ('differential', 0.037), ('regularized', 0.036), ('releasing', 0.035), ('dm', 0.034), ('vol', 0.034), ('adult', 0.033), ('svm', 0.033), ('recht', 0.03), ('kasiviswanathan', 0.03), ('rtr', 0.03), ('surf', 0.03), ('talwar', 0.03), ('differentially', 0.029), ('regularizer', 0.028), ('yi', 0.027), ('attacks', 0.026), ('yn', 0.026), ('adversary', 0.026), ('pb', 0.026), ('jacobian', 0.025), ('ganta', 0.025), ('nissim', 0.025), ('zmin', 0.025), ('individuals', 0.024), ('fp', 0.024), ('loss', 0.022), ('training', 0.022), ('convex', 0.02), ('argminf', 0.02), ('machanavajjhala', 0.02), ('releases', 0.02), ('tuning', 0.02), ('argmin', 0.019), ('sridharan', 0.019), ('preserves', 0.018), ('zi', 0.018), ('log', 0.018), ('predictor', 0.018), ('srebro', 0.017), ('ers', 0.017), ('tradeoff', 0.016), ('corollary', 0.015), ('holdout', 0.015), ('noise', 0.015), ('outputs', 0.015), ('generalization', 0.015), ('backstrom', 0.015), ('barak', 0.015), ('gehrke', 0.015), ('kifer', 0.015), ('ligett', 0.015), ('narayanan', 0.015), ('privacypreserving', 0.015), ('yft', 0.015), ('xi', 0.014), ('xn', 0.014), ('risk', 0.014), ('convexity', 0.014), ('dk', 0.014), ('sensitive', 0.014), ('minimizer', 0.014), ('symposium', 0.014), ('entry', 0.013), ('mechanism', 0.013), ('non', 0.013), ('theorem', 0.013), ('lemma', 0.013), ('norm', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression

2 0.033693604 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.031556167 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

4 0.028568555 14 jmlr-2011-Better Algorithms for Benign Bandits

Author: Elad Hazan, Satyen Kale

Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning

5 0.026784159 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

6 0.025802033 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

7 0.024985053 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

8 0.023301221 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.022806523 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

10 0.021042807 105 jmlr-2011-lp-Norm Multiple Kernel Learning

11 0.02102877 12 jmlr-2011-Bayesian Co-Training

12 0.021002736 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

13 0.020888565 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

14 0.019487562 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

15 0.019148607 41 jmlr-2011-Improved Moves for Truncated Convex Models

16 0.018416826 6 jmlr-2011-A Simpler Approach to Matrix Completion

17 0.01828889 35 jmlr-2011-Forest Density Estimation

18 0.018201616 58 jmlr-2011-Learning from Partial Labels

19 0.017826866 28 jmlr-2011-Double Updating Online Learning

20 0.017513791 66 jmlr-2011-Multiple Kernel Learning Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.096), (1, 0.02), (2, 0.003), (3, -0.013), (4, 0.037), (5, -0.013), (6, 0.002), (7, -0.012), (8, -0.058), (9, -0.037), (10, -0.075), (11, -0.009), (12, 0.035), (13, -0.011), (14, -0.027), (15, 0.091), (16, -0.055), (17, -0.033), (18, -0.075), (19, 0.026), (20, 0.017), (21, 0.0), (22, 0.011), (23, 0.017), (24, -0.0), (25, -0.006), (26, -0.042), (27, 0.072), (28, 0.029), (29, 0.003), (30, -0.076), (31, 0.141), (32, -0.131), (33, -0.019), (34, -0.054), (35, -0.039), (36, -0.2), (37, 0.032), (38, -0.149), (39, 0.478), (40, 0.309), (41, 0.0), (42, -0.156), (43, 0.15), (44, -0.102), (45, 0.462), (46, 0.143), (47, 0.297), (48, -0.176), (49, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93329674 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression

2 0.21991698 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

3 0.19230735 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

4 0.17709833 6 jmlr-2011-A Simpler Approach to Matrix Completion

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

5 0.16590941 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation

Author: Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan

Abstract: We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art machine translation system, MOSES, it achieves improvement in translation results. Keywords: statistical machine translation (SMT), phrase reordering, lexicalized reordering (LR), maximum entropy (ME), support vector machine (SVM), maximum margin regression (MMR) , max-margin structure learning (MMS)

6 0.15005764 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

7 0.14070587 12 jmlr-2011-Bayesian Co-Training

8 0.12974983 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets

9 0.12915236 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

10 0.12308846 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

11 0.11758871 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

12 0.11380684 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

13 0.11377963 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

14 0.10226267 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

15 0.10141017 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

16 0.097273409 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

17 0.0964313 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

18 0.096401848 35 jmlr-2011-Forest Density Estimation

19 0.095760785 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

20 0.095644698 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.032), (9, 0.025), (10, 0.023), (24, 0.022), (31, 0.072), (32, 0.023), (41, 0.018), (60, 0.012), (65, 0.016), (73, 0.026), (78, 0.581), (90, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99113953 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

2 0.98837894 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding

3 0.97997797 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

same-paper 4 0.94976252 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression

5 0.85928875 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression

6 0.79892647 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

7 0.7949608 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.78005594 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.76971179 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

10 0.76519352 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

11 0.74443609 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.73760062 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

13 0.7162177 36 jmlr-2011-Generalized TD Learning

14 0.71431124 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

15 0.7065987 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

16 0.7052843 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

17 0.7047745 104 jmlr-2011-X-Armed Bandits

18 0.7007156 91 jmlr-2011-The Sample Complexity of Dictionary Learning

19 0.69669098 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

20 0.69287449 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning