171 nips-2013-Learning with Noisy Labels

Source: pdf

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

Abstract: In this paper, we theoretically study the problem of binary classiﬁcation in the presence of random classiﬁcation noise — the learner, instead of seeing the true labels, sees labels that have independently been ﬂipped with some small probability. Moreover, random label noise is class-conditional — the ﬂip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisﬁes a simple symmetry condition, we show that the method leads to an efﬁcient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classiﬁcation with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence — methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we theoretically study the problem of binary classification in the presence of random classification noise: the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. [sent-7, score-0.476]

2 Moreover, random label noise is class-conditional — the ﬂip probability depends on the class. [sent-8, score-0.336]

3 We provide two approaches to suitably modify any given surrogate loss function. [sent-9, score-0.373]

4 First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. [sent-10, score-0.846]

5 If the loss function satisﬁes a simple symmetry condition, we show that the method leads to an efﬁcient algorithm for empirical minimization. [sent-11, score-0.283]

6 Second, by leveraging a reduction of risk minimization under noisy labels to classiﬁcation with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. [sent-12, score-1.209]

7 On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets. [sent-14, score-0.69]

8 1 Introduction Designing supervised learning algorithms that can learn from data sets with noisy labels is a problem of great practical importance. [sent-15, score-0.387]

9 Here, by noisy labels, we refer to the setting where an adversary has deliberately corrupted the labels [Biggio et al. [sent-16, score-0.536]

10 Given the importance of learning from such noisy labels, a great deal of practical work has been done on the problem (see, for instance, the survey article by Nettleton et al. [sent-18, score-0.314]

11 Soon after the introduction of the noise-free PAC model, Angluin and Laird [1988] proposed the random classiﬁcation noise (RCN) model where each label is ﬂipped independently with some probability ρ ∈ [0, 1/2). [sent-21, score-0.336]

12 Similarly, in the online mistake bound model, the parameter that characterizes learnability without noise (the Littlestone dimension) continues to characterize learnability even in the presence of random label noise [Ben-David et al. [sent-24, score-0.81]

13 Learning with convex losses has been addressed only under limiting assumptions like separability or uniform noise rates [Manwani and Sastry, 2013]. [sent-27, score-0.529]

14 In this paper, we consider risk minimization in the presence of class-conditional random label noise (abbreviated CCN). [sent-28, score-0.718]

15 The learning algorithm sees samples drawn from a noisy version Dρ of D — where the noise rates depend on the class label. [sent-30, score-0.574]

16 To this end, we develop two methods for suitably modifying any given surrogate loss function ℓ, and show that minimizing the sample average of the modified proxy loss function ℓ̃ leads to provable risk bounds where the risk is calculated using the original loss ℓ on the clean distribution. [sent-32, score-1.555]

17 In our ﬁrst approach, the modiﬁed or proxy loss is an unbiased estimate of the loss function. [sent-33, score-0.654]

18 The idea of using unbiased estimators is well-known in stochastic optimization [Nemirovski et al. [sent-34, score-0.265]

19 , 2009], and regret bounds can be obtained for learning with noisy labels in an online learning setting (See Appendix B). [sent-35, score-0.428]

20 Nonetheless, we bring out some important aspects of using unbiased estimators of loss functions for empirical risk minimization under CCN. [sent-36, score-0.762]

21 In particular, we give a simple symmetry condition on the loss (enjoyed, for instance, by the Huber, logistic, and squared losses) to ensure that the proxy loss is also convex. [sent-37, score-0.56]

22 Hinge loss does not satisfy the symmetry condition, and thus leads to a non-convex problem. [sent-38, score-0.246]

23 We nonetheless provide a convex surrogate, leveraging the fact that the non-convex hinge problem is “close” to a convex problem (Theorem 6). [sent-39, score-0.326]

24 Our second approach is based on the fundamental observation that the minimizer of the risk (i. [sent-40, score-0.351]

25 probability of misclassiﬁcation) under the noisy distribution differs from that of the clean distribution only in where it thresholds η(x) = P (Y = 1|x) to decide the label. [sent-42, score-0.437]

26 In order to correct for the threshold, we then propose a simple weighted loss function, where the weights are label-dependent, as the proxy loss function. [sent-43, score-0.577]

27 Our analysis builds on the notion of consistency of weighted loss functions studied by Scott [2012]. [sent-44, score-0.263]

28 To the best of our knowledge, we are the ﬁrst to provide guarantees for risk minimization under random label noise in the general setting of convex surrogates, without any assumptions on the true distribution. [sent-48, score-0.761]

29 We provide two different approaches to suitably modifying any given surrogate loss function, that surprisingly lead to very similar risk bounds (Theorems 3 and 11). [sent-50, score-0.603]

30 Experiments on benchmark datasets show that the methods are robust even at high noise rates. [sent-57, score-0.298]

31 In Section 4, we give our second and third main results for certain weighted loss functions. [sent-61, score-0.263]

32 1 Related Work Starting from the work of Bylander [1994], many noise tolerant versions of the perceptron algorithm have been developed. [sent-64, score-0.299]

33 A Bayesian approach to the problem of noisy labels is taken by Graepel and Herbrich [2000] and Lawrence and Schölkopf [2001]. [sent-70, score-0.387]

34 As AdaBoost is very sensitive to label noise, random label noise has also been considered in the context of boosting. [sent-71, score-0.468]

35 Long and Servedio [2010] prove that any method based on a convex potential is inherently ill-suited to random label noise. [sent-72, score-0.244]

36 Stempfel and Ralaivola [2009] proposed the minimization of an unbiased proxy for the case of the hinge loss. [sent-74, score-0.441]

37 However the hinge loss leads to a non-convex problem. [sent-75, score-0.301]

38 [2011] focus on the online learning algorithms where they only need unbiased estimates of the gradient of the loss to provide guarantees for learning with noisy data. [sent-79, score-0.609]

39 However, they consider a much harder noise model where instances as well as labels are noisy. [sent-80, score-0.363]

40 Because of the harder noise model, they necessarily require multiple noisy copies per clean example and the unbiased estimation schemes also become fairly complicated. [sent-81, score-0.782]

41 In contrast, we show that unbiased estimation is always possible in the more benign random classiﬁcation noise setting. [sent-83, score-0.345]

42 Manwani and Sastry [2013] consider whether empirical risk minimization of the loss itself on the noisy data is a good idea when the goal is to obtain small risk under the clean distribution. [sent-84, score-1.216]

43 Therefore, if empirical risk minimization over noisy samples has to work, we necessarily have to change the loss used to calculate the empirical risk. [sent-86, score-0.814]

44 However, they approach the problem from a different set of assumptions — the noise rates are not known, and the true distribution satisﬁes a certain “mutual irreducibility” property. [sent-89, score-0.302]

45 After injecting random classification noise (independently for each i) into these samples, corrupted samples (X1, Ỹ1), . [sent-95, score-0.301]

46 The class-conditional random noise model (CCN, for short) is given by: P(Ỹ = −1 | Y = +1) = ρ+1, P(Ỹ = +1 | Y = −1) = ρ−1, and ρ+1 + ρ−1 < 1. The corrupted samples are what the learning algorithm sees. [sent-99, score-0.301]
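The CCN model in sentence 46 can be simulated directly. The following is a minimal sketch, not code from the paper: the function name `inject_ccn`, the seed, and the example rates 0.3 and 0.1 are illustrative assumptions.

```python
import random

def inject_ccn(labels, rho_pos, rho_neg, seed=0):
    """Flip each label in {-1, +1} independently with a class-dependent rate.

    rho_pos is the assumed flip rate for true label +1,
    rho_neg the assumed flip rate for true label -1.
    """
    if rho_pos < 0 or rho_neg < 0 or rho_pos + rho_neg >= 1:
        raise ValueError("need rho_pos, rho_neg >= 0 and rho_pos + rho_neg < 1")
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        rho = rho_pos if y == 1 else rho_neg
        noisy.append(-y if rng.random() < rho else y)
    return noisy

# Hypothetical balanced label vector, used only to check the empirical flip rates.
clean = [1] * 5000 + [-1] * 5000
noisy = inject_ccn(clean, rho_pos=0.3, rho_neg=0.1)
flip_pos = sum(1 for y, z in zip(clean, noisy) if y == 1 and z == -1) / 5000
flip_neg = sum(1 for y, z in zip(clean, noisy) if y == -1 and z == 1) / 5000
print(flip_pos, flip_neg)
```

The two printed fractions should land close to the requested rates, which is a quick sanity check before training on the corrupted labels.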

47 We will assume that the noise rates ρ+1 and ρ−1 are known to the learner. [sent-100, score-0.302]

48 We denote by R∗ the corresponding Bayes risk under the clean distribution D, i. [sent-110, score-0.439]

49 Let ℓ̃(t, ỹ) denote a suitably modified ℓ for use with noisy labels (obtained using methods in Sections 3 and 4). [sent-114, score-0.447]

50 Typically, ℓ is a convex function that is calibrated with respect to an underlying loss function such as the 0-1 loss. [sent-121, score-0.363]

51 The interpretation is that we can control the excess 0-1 risk by controlling the excess ℓ-risk. [sent-124, score-0.354]

52 3 Method of Unbiased Estimators Let F : X → R be a ﬁxed class of real-valued decision functions, over which the empirical risk is minimized. [sent-128, score-0.302]

53 The method of unbiased estimators uses the noise rates to construct an unbiased estimator ℓ̃(t, ỹ) for the loss ℓ(t, y). [sent-129, score-0.855]

54 However, in the experiments we will tune the noise rate parameters through cross-validation. [sent-130, score-0.24]

55 The following key lemma tells us how to construct unbiased estimators of the loss from noisy labels. [sent-131, score-0.705]
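The lemma's formula is not quoted in this summary. The sketch below assumes the form ℓ̃(t, y) = [(1 − ρ−y) ℓ(t, y) − ρy ℓ(t, −y)] / (1 − ρ+1 − ρ−1), as recalled from the paper; the function names and the sample noise rates are hypothetical. The loop at the end checks the defining property numerically: averaging the proxy over the noisy label recovers the clean loss.

```python
import math

def logistic_loss(t, y):
    """Logistic loss log(1 + exp(-y t)) for a score t and a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * t))

def unbiased_loss(loss, t, y_noisy, rho_pos, rho_neg):
    """Proxy loss evaluated on a noisy label (assumed form, see the text above)."""
    rho_same = rho_pos if y_noisy == 1 else rho_neg    # flip rate of class y_noisy
    rho_other = rho_neg if y_noisy == 1 else rho_pos   # flip rate of class -y_noisy
    numerator = (1 - rho_other) * loss(t, y_noisy) - rho_same * loss(t, -y_noisy)
    return numerator / (1 - rho_pos - rho_neg)

def expected_proxy(loss, t, y_true, rho_pos, rho_neg):
    """Exact expectation of the proxy loss over the noisy label, given the true label."""
    rho = rho_pos if y_true == 1 else rho_neg
    kept = unbiased_loss(loss, t, y_true, rho_pos, rho_neg)
    flipped = unbiased_loss(loss, t, -y_true, rho_pos, rho_neg)
    return (1 - rho) * kept + rho * flipped

for t in (-2.0, 0.0, 1.5):
    for y in (-1, 1):
        gap = expected_proxy(logistic_loss, t, y, 0.3, 0.1) - logistic_loss(t, y)
        print(t, y, abs(gap) < 1e-12)
```

Any other loss with the signature `loss(t, y)` can be passed in place of `logistic_loss`; the unbiasedness check does not depend on the choice of loss.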

56 We can try to learn a good predictor in the presence of label noise by minimizing the sample average: f̂ ← argmin_{f ∈ F} R̂ℓ̃(f). [sent-137, score-0.461]

57 By unbiasedness of ℓ̃ (Lemma 1), we know that, for any fixed f ∈ F, the above sample average converges to Rℓ,D(f) even though the former is computed using noisy labels whereas the latter depends on the true labels. [sent-138, score-0.387]

58 The main idea in the proof is to use the contraction principle for Rademacher complexity to get rid of the dependence on the proxy loss ℓ̃. [sent-140, score-0.314]

59 max |R̂ℓ̃(f) − Rℓ̃,Dρ(f)| ≤ 2Lρ R(F) + The above lemma immediately leads to a performance bound for f̂ with respect to the clean distribution D. [sent-146, score-0.274]

60 This is despite the fact that the method of unbiased estimators computes the empirical minimizer f̂ on a sample from the noisy distribution. [sent-153, score-0.599]

61 1 Convex losses and their estimators Note that the loss ℓ̃ may not be convex even if we start with a convex ℓ. [sent-157, score-0.61]

62 An example is provided by the familiar hinge loss ℓhin(t, y) = [1 − yt]+. [sent-158, score-0.301]
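The loss of convexity for the hinge loss can be seen with a midpoint test. This is a rough sketch under an assumption: the proxy is taken to have the form [(1 − ρ−y) ℓ(t, y) − ρy ℓ(t, −y)] / (1 − ρ+1 − ρ−1), which this summary does not quote. The helper names, the noise rates, and the grid are illustrative choices.

```python
import math

def hinge(t, y):
    return max(0.0, 1.0 - y * t)

def logistic(t, y):
    return math.log1p(math.exp(-y * t))

def proxy(loss, t, y, rho_pos, rho_neg):
    """Assumed unbiased-estimator form of the proxy loss (see the text above)."""
    rho_same = rho_pos if y == 1 else rho_neg
    rho_other = rho_neg if y == 1 else rho_pos
    return ((1 - rho_other) * loss(t, y) - rho_same * loss(t, -y)) / (1 - rho_pos - rho_neg)

def violates_midpoint_convexity(f, a, b, tol=1e-12):
    """True if f at the midpoint exceeds the average of f(a) and f(b)."""
    return f((a + b) / 2) > (f(a) + f(b)) / 2 + tol

rho_pos, rho_neg = 0.3, 0.1
proxy_hinge = lambda t: proxy(hinge, t, 1, rho_pos, rho_neg)
proxy_logistic = lambda t: proxy(logistic, t, 1, rho_pos, rho_neg)

# The subtracted hinge term has a kink at t = -1, so test an interval around it.
hinge_nonconvex = violates_midpoint_convexity(proxy_hinge, -2.0, 0.0)

# Scan a coarse grid of interval endpoints for the logistic proxy.
grid = [i / 4 for i in range(-20, 21)]
logistic_nonconvex = any(
    violates_midpoint_convexity(proxy_logistic, a, b) for a in grid for b in grid
)
print(hinge_nonconvex, logistic_nonconvex)
```

A grid scan cannot prove convexity; it only fails to find a violation for the logistic proxy, while a single interval is enough to exhibit one for the hinge proxy.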

63 Suppose ℓ(t, y) is convex and twice differentiable almost everywhere in t (for every y) and also satisfies the symmetry property ℓ′′(t, y) = ℓ′′(t, −y) for all t. Then ℓ̃(t, y) is also convex in t. [sent-162, score-0.271]

64 Then, we only need the expected loss to be convex and therefore ℓhin does not present a problem. [sent-167, score-0.311]

65 Recall that the (Fenchel) biconjugate F⋆⋆ is the largest convex function that minorizes F. [sent-174, score-0.239]

66 Let F : W → R be a non-convex function defined on function class W such that it is ε-close to a convex function G : W → R: ∀w ∈ W, |F(w) − G(w)| ≤ ε. Then any minimizer of F⋆⋆ is a 2ε-approximate (global) minimizer of F. [sent-176, score-0.354]

67 Let W = {w : ‖w‖2 ≤ W2}, let ‖Xi‖2 ≤ X2 almost surely, and let ŵapprox be any (exact) minimizer of the convex problem min_{w∈W} F⋆⋆(w), where F⋆⋆(w) is the (Fenchel) biconjugate of the function F(w) = Rℓ(w). [sent-180, score-0.446]

68 First, the Bayes classifier for noisy distribution, denoted f̃∗, for the case ρ+1 ≠ ρ−1, simply uses a threshold different from 1/2. Second, f̃∗ is the minimizer of a “label-dependent 0-1 loss” on the noisy distribution. [sent-186, score-0.61]

69 The Bayes classifier under the noisy distribution, f̃∗ = argmin_f E_{(X,Ỹ)∼Dρ}[1{sign(f(X)) ≠ Ỹ}], is given by f̃∗(x) = sign(η̃(x) − 1/2) = sign(η(x) − (1/2 − ρ−1)/(1 − ρ+1 − ρ−1)). [sent-192, score-0.373]

70 Interestingly, this “noisy” Bayes classifier can also be obtained as the minimizer of a weighted 0-1 loss, which, as we will show, allows us to “correct” for the threshold under the noisy distribution. [sent-193, score-0.446]
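The threshold shift in sentence 69 follows from writing the noisy posterior as η̃(x) = (1 − ρ+1) η(x) + ρ−1 (1 − η(x)), which is an assumption implied by the CCN model rather than a formula quoted here. A small sketch (hypothetical function names and rates) checks that thresholding η̃ at 1/2 agrees with thresholding η at (1/2 − ρ−1)/(1 − ρ+1 − ρ−1):

```python
def noisy_posterior(eta, rho_pos, rho_neg):
    """P(noisy label = +1 | x) given the clean posterior eta = P(Y = +1 | x)."""
    return (1 - rho_pos) * eta + rho_neg * (1 - eta)

def clean_scale_threshold(rho_pos, rho_neg):
    """Threshold on eta that is equivalent to thresholding the noisy posterior at 1/2."""
    return (0.5 - rho_neg) / (1 - rho_pos - rho_neg)

rho_pos, rho_neg = 0.3, 0.1
threshold = clean_scale_threshold(rho_pos, rho_neg)

# Compare both decision rules on a grid of clean posterior values.
etas = [i / 100 for i in range(101)]
agree = all(
    (noisy_posterior(eta, rho_pos, rho_neg) > 0.5) == (eta > threshold)
    for eta in etas
)
print(round(threshold, 4), agree)
```

With equal noise rates the threshold reduces to 1/2, which matches the remark that the shift only matters when ρ+1 and ρ−1 differ.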

71 We can write the 0-1 loss as a label-dependent loss as follows: 1{sign(f(X)) ≠ Y} = 1{Y=1} 1{f(X)≤0} + 1{Y=−1} 1{f(X)>0}. We realize that the classical 0-1 loss is unweighted. [sent-195, score-0.597]

72 the α-weighted 0-1 loss under noisy distribution Dρ: Rα,Dρ(f) = E_{(X,Ỹ)∼Dρ}[Uα(f(X), Ỹ)]. [sent-212, score-0.427]

73 At this juncture, we are interested in the following question: Does there exist an α ∈ (0, 1) such that the minimizer of Uα-risk under noisy distribution Dρ has the same sign as that of the Bayes optimal f∗? [sent-213, score-0.438]

74 We now present our second main result in the following theorem that makes a stronger statement — the Uα -risk under noisy distribution Dρ is linearly related to the 0-1 risk under the clean distribution D. [sent-214, score-0.7]

75 The α∗-weighted Bayes optimal classifier under noisy distribution coincides with that of 0-1 loss under clean distribution: argmin_f Rα∗,Dρ(f) = argmin_f RD(f) = sign(η(x) − 1/2). [sent-219, score-0.748]

76 1 Proposed Proxy Surrogate Losses Consider any surrogate loss function ℓ; and the following decomposition: ℓ(t, y) = 1{y=1} ℓ1 (t) + 1{y=−1} ℓ−1 (t) where ℓ1 and ℓ−1 are partial losses of ℓ. [sent-221, score-0.428]

77 Analogous to the 0-1 loss case, we can deﬁne α-weighted loss function (Eqn. [sent-222, score-0.398]

78 Can we hope to minimize an α-weighted ℓ-risk with respect to noisy distribution Dρ and yet bound the excess 0-1 risk with respect to the clean distribution D? [sent-224, score-0.729]

79 Consider the empirical risk minimization problem with noisy labels: f̂α = argmin_{f∈F} (1/n) Σ_{i=1}^{n} ℓα(f(Xi), Ỹi). [sent-227, score-0.578]

80 Define ℓα as an α-weighted margin loss function of the form: ℓα(t, y) = (1 − α) 1{y=1} ℓ(t) + α 1{y=−1} ℓ(−t) (1), where ℓ : R → [0, ∞) is a convex loss function with Lipschitz constant L such that it is classification-calibrated (i. [sent-228, score-0.618]
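Equation (1) can be sketched with the logistic base loss ℓ(t) = log(1 + e^(−t)). The weight α∗ = (1 − ρ+1 + ρ−1)/2 used below is not stated in this summary; it is derived here by requiring that the noisy posterior equal α exactly when the clean posterior is 1/2, so treat it as an assumption. All function names and the noise rates are hypothetical.

```python
import math

def weighted_logistic_loss(t, y, alpha):
    """alpha-weighted margin loss of Eq. (1) with the logistic base loss."""
    base = lambda z: math.log1p(math.exp(-z))
    return (1 - alpha) * base(t) if y == 1 else alpha * base(-t)

def alpha_star(rho_pos, rho_neg):
    """Derived weight at which thresholding the noisy posterior matches eta > 1/2."""
    return (1 - rho_pos + rho_neg) / 2

def noisy_posterior(eta, rho_pos, rho_neg):
    return (1 - rho_pos) * eta + rho_neg * (1 - eta)

def conditional_risk(t, eta_noisy, alpha):
    """Expected weighted loss at one point x, under the noisy posterior."""
    return (eta_noisy * weighted_logistic_loss(t, 1, alpha)
            + (1 - eta_noisy) * weighted_logistic_loss(t, -1, alpha))

def pointwise_minimizer(eta_noisy, alpha):
    """Closed-form minimizer of the conditional risk for the logistic base loss."""
    return math.log((1 - alpha) * eta_noisy / (alpha * (1 - eta_noisy)))

rho_pos, rho_neg = 0.3, 0.1
alpha = alpha_star(rho_pos, rho_neg)
for eta in (0.1, 0.4, 0.6, 0.9):
    t_opt = pointwise_minimizer(noisy_posterior(eta, rho_pos, rho_neg), alpha)
    print(eta, t_opt > 0)
```

The sign of the pointwise minimizer should follow the clean rule η(x) > 1/2 even though only the noisy posterior enters the risk, which is the sense in which label-dependent weights correct for the shifted threshold.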

81 Aside from bounding excess 0-1 risk under the clean distribution, the importance of the above theorem lies in the fact that it prescribes an efficient algorithm for empirical minimization with noisy labels: ℓα is convex if ℓ is convex. [sent-232, score-0.994]

82 Thus for any surrogate loss function including ℓhin, f̂α∗ can be efficiently computed using the method of label-dependent costs. [sent-233, score-0.313]

83 5 Experiments We show the robustness of the proposed algorithms to increasing rates of label noise on synthetic and real-world datasets. [sent-242, score-0.475]

84 For given noise rates ρ+1 and ρ−1, labels of the training data are flipped accordingly and average accuracy over 3 train-test splits is computed. [sent-247, score-0.506]

85 For evaluation, we choose a representative algorithm based on each of the two proposed methods: ℓ̃log for the method of unbiased estimators and the widely-used C-SVM [Liu et al. [sent-248, score-0.265]

86 The results for higher noise rates are impressive as observed from Figures 2(d) and 2(e). [sent-257, score-0.302]

87 In particular, the Random Projection classiﬁer [Stempfel and Ralaivola, 2007] that learns a kernel perceptron in the presence of noisy labels achieves about 84% accuracy at ρ+1 = ρ−1 = 0. [sent-259, score-0.593]

88 Plots (b) and (c) show training data corrupted with noise rates (ρ+1 = ρ−1 = ρ) 0. [sent-270, score-0.399]

89 Plots (b) and (c) show training data corrupted with noise rates (ρ+1 = ρ−1 = ρ) 0. [sent-280, score-0.399]

90 Even when 40% of the labels are corrupted (ρ+1 = ρ−1 = 0. [sent-289, score-0.256]

91 To account for randomness in the ﬂips to simulate a given noise rate, we repeat each experiment 3 times — independent corruptions of the data set for same setting of ρ+1 and ρ−1 , and present the mean accuracy over the trials. [sent-295, score-0.249]

92 [Crammer and Lee, 2010]) (project and exact variants), and perceptron algorithm with margin (PAM) which was shown to be robust to label noise by Khardon and Wachman [2007]. [sent-416, score-0.428]

93 Overall, we observe that the proposed methods are competitive and are able to tolerate moderate to high amounts of label noise in the data. [sent-432, score-0.388]

94 Finally, in domains where noise rates are approximately known, our methods can beneﬁt from the knowledge of noise rates. [sent-433, score-0.506]

95 Our analysis shows that the methods are fairly robust to misspeciﬁcation of noise rates (See Appendix C for results). [sent-434, score-0.339]

96 6 Conclusions and Future Work We addressed the problem of risk minimization in the presence of random classiﬁcation noise, and obtained general results in the setting using the methods of unbiased estimators and weighted loss functions. [sent-435, score-0.858]

97 The proposed algorithms are easy to implement and the classiﬁcation performance is impressive even at high noise rates and competitive with state-of-the-art methods on benchmark data. [sent-437, score-0.411]

98 We could consider harder noise models such as label noise depending on the example, and “nasty label noise” where labels to ﬂip are chosen adversarially. [sent-439, score-0.831]

99 Estimating a kernel Fisher discriminant in the presence of label noise. [sent-530, score-0.238]

100 Learning kernel perceptrons on noisy data using random projections. [sent-590, score-0.265]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stempfel', 0.287), ('risk', 0.23), ('noisy', 0.228), ('clean', 0.209), ('noise', 0.204), ('loss', 0.199), ('crammer', 0.198), ('nherd', 0.172), ('labels', 0.159), ('ralaivola', 0.152), ('classi', 0.145), ('hin', 0.144), ('unbiased', 0.141), ('label', 0.132), ('biconjugate', 0.127), ('minimizer', 0.121), ('losses', 0.115), ('proxy', 0.115), ('surrogate', 0.114), ('scott', 0.113), ('convex', 0.112), ('hinge', 0.102), ('rates', 0.098), ('corrupted', 0.097), ('sign', 0.089), ('ipped', 0.088), ('banana', 0.086), ('khardon', 0.086), ('manwani', 0.086), ('wapprox', 0.086), ('minimization', 0.083), ('bayes', 0.081), ('elkan', 0.076), ('estimators', 0.072), ('presence', 0.069), ('shai', 0.066), ('koby', 0.066), ('lemma', 0.065), ('rd', 0.064), ('weighted', 0.064), ('dredze', 0.063), ('excess', 0.062), ('rademacher', 0.062), ('suitably', 0.06), ('cation', 0.059), ('iid', 0.058), ('nondecreasing', 0.058), ('ccn', 0.057), ('eli', 0.057), ('nettleton', 0.057), ('noto', 0.057), ('wachman', 0.057), ('benchmark', 0.057), ('argmin', 0.056), ('perceptron', 0.055), ('learnability', 0.054), ('uci', 0.053), ('calibrated', 0.052), ('nemirovski', 0.052), ('et', 0.052), ('competitive', 0.052), ('thyroid', 0.051), ('pam', 0.051), ('angluin', 0.051), ('aslam', 0.051), ('biggio', 0.051), ('rcn', 0.051), ('sastry', 0.051), ('svm', 0.05), ('symmetry', 0.047), ('clayton', 0.047), ('graepel', 0.047), ('unlabeled', 0.047), ('accuracy', 0.045), ('er', 0.045), ('fenchel', 0.044), ('sees', 0.044), ('yt', 0.042), ('bx', 0.042), ('online', 0.041), ('synthetic', 0.041), ('tolerant', 0.04), ('huber', 0.038), ('robust', 0.037), ('empirical', 0.037), ('kernel', 0.037), ('tune', 0.036), ('lee', 0.036), ('lipschitz', 0.035), ('decision', 0.035), ('breast', 0.035), ('article', 0.034), ('width', 0.033), ('threshold', 0.033), ('philip', 0.033), ('positives', 0.033), ('costs', 0.033), ('theorem', 0.033), ('convexity', 0.033), ('log', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 171 nips-2013-Learning with Noisy Labels

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

2 0.1258965 230 nips-2013-Online Learning with Costly Features and Labels

Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari

Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem.

3 0.12420163 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari

Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement.

4 0.12043054 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

Author: Moustapha M. Cisse, Nicolas Usunier, Thierry Artières, Patrick Gallinari

Abstract: This paper presents an approach to multilabel classification (MLC) with a large number of labels. Our approach is a reduction to binary classification in which label sets are represented by low dimensional binary vectors. This representation follows the principle of Bloom filters, a space-efficient data structure originally designed for approximate membership testing. We show that a naive application of Bloom filters in MLC is not robust to individual binary classifiers’ errors. We then present an approach that exploits a specific feature of real-world datasets when the number of labels is large: many labels (almost) never appear together. Our approach is provably robust, has sublinear training and inference complexity with respect to the number of labels, and compares favorably to state-of-the-art algorithms on two large scale multilabel datasets.

5 0.11906201 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M-estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.

6 0.11406983 75 nips-2013-Convex Two-Layer Modeling

7 0.1131428 109 nips-2013-Estimating LASSO Risk and Noise Level

8 0.11167423 309 nips-2013-Statistical Active Learning Algorithms

9 0.1085894 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

10 0.10698813 158 nips-2013-Learning Multiple Models via Regularized Weighting

11 0.10440312 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

12 0.1018066 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

13 0.10127161 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields

14 0.09472049 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

15 0.094232805 265 nips-2013-Reconciling "priors" & "priors" without prejudice?

16 0.091435276 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

17 0.090308003 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

18 0.090070486 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

19 0.085899748 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

20 0.085398801 318 nips-2013-Structured Learning via Logistic Regression

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.236), (1, 0.046), (2, 0.085), (3, -0.039), (4, 0.084), (5, 0.017), (6, -0.085), (7, 0.031), (8, -0.07), (9, 0.072), (10, 0.024), (11, -0.02), (12, -0.016), (13, -0.031), (14, 0.037), (15, -0.153), (16, -0.01), (17, 0.1), (18, 0.075), (19, -0.039), (20, -0.102), (21, 0.061), (22, 0.094), (23, 0.02), (24, -0.033), (25, 0.055), (26, -0.036), (27, -0.018), (28, 0.075), (29, -0.074), (30, -0.063), (31, 0.076), (32, -0.016), (33, -0.065), (34, -0.027), (35, -0.121), (36, 0.032), (37, -0.119), (38, -0.002), (39, -0.031), (40, -0.012), (41, -0.058), (42, 0.022), (43, 0.075), (44, -0.03), (45, 0.068), (46, 0.003), (47, 0.045), (48, -0.044), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97207344 171 nips-2013-Learning with Noisy Labels

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

2 0.80038583 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

Author: Harikrishna Narasimhan, Shivani Agarwal

Abstract: We investigate the relationship between three fundamental problems in machine learning: binary classification, bipartite ranking, and binary class probability estimation (CPE). It is known that a good binary CPE model can be used to obtain a good binary classification model (by thresholding at 0.5), and also to obtain a good bipartite ranking model (by using the CPE model directly as a ranking model); it is also known that a binary classification model does not necessarily yield a CPE model. However, not much is known about other directions. Formally, these relationships involve regret transfer bounds. In this paper, we introduce the notion of weak regret transfer bounds, where the mapping needed to transform a model from one problem to another depends on the underlying probability distribution (and in practice, must be estimated from data). We then show that, in this weaker sense, a good bipartite ranking model can be used to construct a good classification model (by thresholding at a suitable point), and more surprisingly, also to construct a good binary CPE model (by calibrating the scores of the ranking model).

3 0.78171748 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

Author: Moustapha M. Cisse, Nicolas Usunier, Thierry Artières, Patrick Gallinari

4 0.77748448 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari

5 0.77353388 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′th order bottom sample margin.

6 0.68532509 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

7 0.64843231 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

8 0.6319952 76 nips-2013-Correlated random features for fast semi-supervised learning

9 0.60060495 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields

10 0.59928578 244 nips-2013-Parametric Task Learning

11 0.58975452 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

12 0.58800179 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

13 0.58681458 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

14 0.5832274 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

15 0.58239144 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

16 0.57898456 230 nips-2013-Online Learning with Costly Features and Labels

17 0.57713932 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors

18 0.5751791 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

19 0.56763035 225 nips-2013-One-shot learning and big data with n=2

20 0.55692077 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.332), (16, 0.014), (33, 0.156), (34, 0.088), (49, 0.022), (56, 0.122), (70, 0.03), (85, 0.039), (89, 0.073), (93, 0.038), (95, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85545057 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game.

2 0.81215489 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge

Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow

Abstract: Shannon's entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneously-recorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allows for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.

3 0.8014527 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Author: Liam C. MacDermed, Charles Isbell

Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based value iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms.

4 0.78805065 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

Author: Alfredo Kalaitzis, Ricardo Silva

Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size.

1 Contribution

There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17].
A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF $F(y_1, y_2)$ with marginals $F_1(y_1)$ and $F_2(y_2)$. This CDF can then be rewritten as $F(F_1^{-1}(F_1(y_1)), F_2^{-1}(F_2(y_2)))$. The function $C(\cdot, \cdot)$ given by $F(F_1^{-1}(\cdot), F_2^{-1}(\cdot))$ is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals $F_i(\cdot)$ with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16].

In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common trick goes as follows: transform variables by defining $a_i \equiv \hat{F}_i(y_i)$ for an estimate of $F_i(\cdot)$ and then fit a copula density $c(\cdot, \ldots, \cdot)$ to the resulting $a_i$ [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space.
Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula.

In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section.

2 Gaussian copulas and the extended rank likelihood

It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C):

1. Sample Z from a zero mean Gaussian with covariance matrix C
2. For each $Z_j$, set $U_j = \Phi(z_j)$, where $\Phi(\cdot)$ is the CDF of the standard Gaussian

It is clear that each $U_j$ follows a uniform distribution in [0, 1]. To obtain a model for variables $\{y_1, y_2, \ldots, y_p\}$ with marginal distributions $F_j(\cdot)$ and copula GC(C), one can add the deterministic step $y_j = F_j^{-1}(u_j)$. Now, given n samples of observed data $Y \equiv \{y_1^{(1)}, \ldots, y_p^{(1)}, y_1^{(2)}, \ldots, y_p^{(n)}\}$, one is interested in inferring C via a Bayesian approach and the posterior distribution

$$p(C, \theta_F \mid Y) \propto p_{GC}(Y \mid C, \theta_F)\, \pi(C, \theta_F)$$

where $\pi(\cdot)$ is a prior distribution, $\theta_F$ are marginal parameters for each $F_j(\cdot)$, which in general might need to be marginalized since they will be unknown, and $p_{GC}(\cdot)$ is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by $\theta_F$.

Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal $F_j(\cdot)$ has jumps. Instead, the reverse mapping only enforces the constraints where $y_j^{(i_1)} < y_j^{(i_2)}$ implies $z_j^{(i_1)} < z_j^{(i_2)}$. Based on this observation, [7] considers the event $Z \in D(y)$, where D(y) is the set of values of Z in $\mathbb{R}^{n \times p}$ obeying those constraints, that is

$$D(y) \equiv \left\{ Z \in \mathbb{R}^{n \times p} : \max\left\{ z_j^{(k)} \ \text{s.t.}\ y_j^{(k)} < y_j^{(i)} \right\} < z_j^{(i)} < \min\left\{ z_j^{(k)} \ \text{s.t.}\ y_j^{(i)} < y_j^{(k)} \right\} \right\}.$$

Since $\{Y = y\} \Rightarrow Z(y) \in D(y)$, we have

$$p_{GC}(Y \mid C, \theta_F) = p_{GC}(Z \in D(y), Y \mid C, \theta_F) = p_N(Z \in D(y) \mid C) \times p_{GC}(Y \mid Z \in D(y), C, \theta_F), \qquad (1)$$

the first factor of the last line being that of a zero-mean Gaussian density function marginalized over D(y).

The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing

$$p(C, Z \mid Y) \propto I(Z \in D(y))\, p_N(Z \mid C)\, \pi(C), \qquad (2)$$

the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13].
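The constraint set D(y) can be made concrete with a small sketch. The Python below is an illustration only, not code from the paper: the helper names `rank_bounds` and `in_constraint_set` are hypothetical, and the sketch handles a single column j given as plain lists of observed values `y` and latent values `z`.

```python
import math

def rank_bounds(y, z, i):
    """Open interval (lower, upper) that z[i] must lie in for one column.

    lower is the largest latent value among points with a smaller observed
    level, upper is the smallest latent value among points with a larger one.
    """
    lower = max((z[k] for k in range(len(y)) if y[k] < y[i]), default=-math.inf)
    upper = min((z[k] for k in range(len(y)) if y[k] > y[i]), default=math.inf)
    return lower, upper

def in_constraint_set(y, z):
    """True if the latent column z respects every strict ordering implied by y."""
    for i in range(len(y)):
        lower, upper = rank_bounds(y, z, i)
        if not (lower < z[i] < upper):
            return False
    return True
```

Points that share an observed level place no constraint on each other, which is why ties in `y` are skipped by both comprehensions.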
The extended rank likelihood has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of $\{F_i(\cdot)\}$ – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and $\theta_F$. Therefore, even if $\theta_F$ is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for $\{F_i(\cdot)\}$. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5.

An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling Z is straightforward if for each variable j and data point i we sample $Z_j^{(i)}$ conditioned on all other variables. The corresponding distribution is a univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference.

3 Exact HMC for truncated Gaussian distributions

Hoff's algorithm modifies the positions of all $Z_j^{(i)}$ associated with a particular discrete value of $Y_j$, conditioned on the remaining points.
As the number of data points increases, the spread of the hard boundaries on $Z_j^{(i)}$, given by data points of $Z_j$ associated with other levels of $Y_j$, increases. This reduces the space in which variables $Z_j^{(i)}$ can move at a time.

To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables $Z_j^{(i)}$, $i = 1 \ldots n$, conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds.

In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from $\log p(x) = -\frac{1}{2} x^\top M x + r^\top x + \text{const}$, where $x, r \in \mathbb{R}^n$ and M is positive definite, with linear constraints of the form $f_j^\top x \le g_j$, where $f_j \in \mathbb{R}^n$, $j = 1 \ldots m$, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form $x_{i_1} \le x_{i_2}$. We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from $O(n^2)$ (in the number of observations) to $O(n)$ in practice.

3.1 Hamiltonian Monte Carlo for the Gaussian distribution

Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space bring the sampler to potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case.
The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is:

$$H = \frac{1}{2} x^\top M x - r^\top x + \frac{1}{2} s^\top M^{-1} s, \qquad (3)$$

where M is also known in the present context as the mass matrix, $r = M\mu$ and s is the velocity. Both x and s are Gaussian distributed so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved.

In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations:

$$\dot{x} = \nabla_s H = M^{-1} s, \qquad \dot{s} = -\nabla_x H = -M x + r. \qquad (4)$$

These are solved exactly by $x(t) = \mu + a \sin(t) + b \cos(t)$, where a and b can be identified at initial conditions (t = 0):

$$a = \dot{x}(0) = M^{-1} s, \qquad b = x(0) - \mu. \qquad (5)$$

Therefore, the exact HMC algorithm can be summarised as follows:

• Initialise the allowed travel time T and some initial position $x_0$.
• Repeat for HMC samples $k = 1 \ldots N$
  1. Sample $s_k \sim N(0, M)$
  2. Use $s_k$ and $x_k$ to update a and b and store the new position at the end of the trajectory $x_{k+1} = x(T)$ as an HMC sample.

It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution $N(\mu, M^{-1})$ [17].

3.2 Sampling with linear constraints

Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form $Fx \le g$. Here, $F \in \mathbb{R}^{m \times n}$ is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron $\{x : Fx \le g\}$. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls.
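As a rough illustration of a collision-time computation, the sketch below solves $f^\top x(t) = g$ for the sinusoidal trajectory $x(t) = \mu + a \sin(t) + b \cos(t)$ by rewriting $f^\top x(t) - f^\top \mu$ as a single cosine with amplitude u and phase φ. This derivation and the function names `wall_hit_time` and `first_wall` are assumptions made for illustration; they are not the paper's own equations, and the sketch assumes the particle starts strictly inside the polyhedron.

```python
import math

def wall_hit_time(a, b, mu, f, g, eps=1e-12):
    """Smallest t in (0, 2*pi) with f.x(t) = g, or None if the wall is never reached.

    Uses f.x(t) = f.mu + u*cos(t - phi) with u = hypot(f.a, f.b), phi = atan2(f.a, f.b).
    """
    fa = sum(fi * ai for fi, ai in zip(f, a))
    fb = sum(fi * bi for fi, bi in zip(f, b))
    c = g - sum(fi * mi for fi, mi in zip(f, mu))
    u = math.hypot(fa, fb)
    if u <= abs(c):  # amplitude too small: no crossing (tangency is ignored)
        return None
    phi = math.atan2(fa, fb)
    delta = math.acos(c / u)
    roots = [(phi + sign * delta) % (2 * math.pi) for sign in (1.0, -1.0)]
    roots = [t for t in roots if t > eps]
    return min(roots) if roots else None

def first_wall(a, b, mu, F, g):
    """Scan all walls (rows of F) and return (time, wall index) of the earliest hit."""
    best = None
    for h, (f_h, g_h) in enumerate(zip(F, g)):
        t = wall_hit_time(a, b, mu, f_h, g_h)
        if t is not None and (best is None or t < best[0]):
            best = (t, h)
    return best
```

The loop in `first_wall` touches every row of F once, so each search costs time linear in the number of walls m.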
The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples in 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, then position and velocity at impact time are computed and the velocity is reflected about the boundary normal (equivalently, the velocity is transformed with a Householder reflection matrix about the bounding hyperplane).

Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2. Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color.

The constrained HMC sampler is summarized as follows:

• Initialise the allowed travel time T and some initial position $x_0$.
• Repeat for HMC samples $k = 1 \ldots N$
  1. Sample $s_k \sim N(0, M)$
  2. Use $s_k$ and $x_k$ to update a and b.
  3. Reset remaining travel time $T_{\text{left}} \leftarrow T$. Until no travel time is left or no walls can be reached (no solutions exist), do:
     (a) Compute impact times with all walls and pick the smallest one, $t_h$ (if a solution exists).
     (b) Compute $v(t_h)$ and reflect it about the hyperplane $f_h$. This is the updated velocity after impact. The updated position is $x(t_h)$.
     (c) $T_{\text{left}} \leftarrow T_{\text{left}} - t_h$
  4. Store the new position at the end of the trajectory $x_{k+1}$ as an HMC sample.

In general, all walls are candidates for impact, so the runtime of the sampler is linear in m, the number of constraints.
This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T. Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally so the mixing of the chain can suffer. Whether a given travel time counts as "large" or "small" is relative to the dimensionality, the number of constraints and the structure of the constraints.

Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is $O(n^2)$. Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to datasets with a small sample size n. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to $O(n)$.

4 HMC for the Gaussian Copula extended rank likelihood model

Given some discrete data $Y \in \mathbb{R}^{n \times p}$, the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff's sampling algorithm proceeds by alternating between sampling the continuous latent representation $Z_j^{(i)}$ of each $Y_j^{(i)}$, for $i = 1 \ldots n$, $j = 1 \ldots p$, and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix $Z \in \mathbb{R}^{n \times p}$, which is then renormalized as a correlation matrix.

From here on, we use matrix notation for the samples, as opposed to the random variables – with $Z_{i,j}$ replacing $Z_j^{(i)}$, $Z_{:,j}$ being a column of Z, and $Z_{:,\setminus j}$ being the submatrix of Z without the j-th column.

In a similar vein to Hoff's sampling algorithm, we replace the successive sampling of each $Z_{i,j}$ conditioned on $Z_{i,\setminus j}$ (a conditional univariate truncated Gaussian) with the simultaneous sampling of $Z_{:,j}$ conditioned on $Z_{:,\setminus j}$.
This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T. Larger travel times potentially allow for larger moves in the sample space, but this comes at a cost as explained in the sequel.

4.1 The Hough envelope algorithm

The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event $Z \in D(y)$ of the extended rank likelihood (see Section 2). In other words, for any column $Z_{:,j}$, all entries are organised into a partition $L^{(j)}$ of $|L^{(j)}|$ levels, the number of unique values observed for the discrete or ordinal variable $Y^{(j)}$. Thereby, for any two adjacent levels $l_k, l_{k+1} \in L^{(j)}$ and any pair $i_1 \in l_k$, $i_2 \in l_{k+1}$, it must be true that $Z_{i_1,j} < Z_{i_2,j}$. Equivalently, a constraint f exists where $f_{i_1} = 1$, $f_{i_2} = -1$ and $g = 0$. It is easy to see that $O(n^2)$ of such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in $\{Z_{:,j}\}$, in practically linear time.

Recall in HMC (section 3.2) that the trajectory of the particle, x(t), is decomposed as

$$x_i(t) = a_i \sin(t) + b_i \cos(t) + \mu_i, \qquad (6)$$

and there are n such functions, grouped into a partition of levels as described above. The Hough envelope is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generality to any number of levels or dimensions. Assume we represent trajectories for points in level $l_k$ with blue curves, and points in $l_{k+1}$ with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves.
The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary.

Figure 2: The trajectories $x_j(t)$ of each component are sinusoid functions, plotted against time t. The right-most green dot signifies the wall and the time $t_h$ of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being $x_{\text{blue}_2} = x_{\text{red}_2}$.

1. First we find the largest component $\text{blue}_{\max}$ of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component.
2. To find the next largest component, compute the roots of $x_{\text{blue}_{\max}}(t) - x_i(t) = 0$ for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time.
3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found.

In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume a fixed upper bound h on the number of blue crossings before an inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1.

A note on the name: the Hough envelope is inspired by the fact that each $x_i(t)$ is the sinusoid representation, in angle-distance space, of all lines that pass through the point $(a_i, b_i)$ in a-b space, a representation known in image processing as the Hough transform [3].
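The three search steps above can be sketched for one pair of adjacent levels as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each curve is a hypothetical tuple `(a, b, mu)` as in equation (6), the state at t = 0 is valid (all red above all blue), exact ties and tangencies are ignored, and the names `next_root` and `earliest_interlevel_hit` are invented for this example.

```python
import math

def next_root(A, B, C, t0, eps=1e-9):
    """Smallest t in (t0, t0 + 2*pi) with A*sin(t) + B*cos(t) + C = 0, else None."""
    u = math.hypot(A, B)
    if u <= abs(C):  # the two sinusoids never cross
        return None
    phi = math.atan2(A, B)
    delta = math.acos(-C / u)
    best = None
    for sign in (1.0, -1.0):
        # shift the root into the window [t0, t0 + 2*pi)
        t = (phi + sign * delta - t0) % (2 * math.pi) + t0
        if t > t0 + eps and (best is None or t < best):
            best = t
    return best

def earliest_interlevel_hit(blue, red, T):
    """Return (time, blue index, red index) of the first blue-red meeting in (0, T], or None."""
    def x(curve, t):
        a, b, mu = curve
        return a * math.sin(t) + b * math.cos(t) + mu

    t = 0.0
    top = max(range(len(blue)), key=lambda i: x(blue[i], 0.0))  # step 1
    while True:
        best = None  # (time, is_red, index) of the next curve to meet the running maximum
        for is_red, level in ((False, blue), (True, red)):
            for i, c in enumerate(level):
                if not is_red and i == top:
                    continue
                p = blue[top]
                r = next_root(p[0] - c[0], p[1] - c[1], p[2] - c[2], t)  # step 2
                if r is not None and r <= T and (best is None or r < best[0]):
                    best = (r, is_red, i)
        if best is None:
            return None  # no bounce within the travel time
        t, is_red, i = best
        if is_red:
            return (t, top, i)  # step 3: a red curve met the highest blue curve
        top = i  # another blue curve overtakes and becomes the running maximum
```

Each pass of the `while` loop scans all curves once, so the total cost is O(n) times the number of blue overtakes before the first inter-level meeting.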
Algorithm 1 HMC for GCERL

# Notation: $TMN(\mu, C, F)$ is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0.
# $IW(df, V_0)$ is an inverse-Wishart prior with degrees of freedom df and scale matrix $V_0$.
Input: $Y \in \mathbb{R}^{n \times p}$, allocated travel time T, a starting Z and variable covariance $V \in \mathbb{R}^{p \times p}$, $df = p + 2$, $V_0 = df\, I_p$ and chain size N.
Generate constraints $F^{(j)}$ from $Y_{:,j}$, for $j = 1 \ldots p$.
for samples $k = 1 \ldots N$ do
    # Resample Z as follows:
    for variables $j = 1 \ldots p$ do
        Compute parameters: $\sigma_j^2 = V_{jj} - V_{j,\setminus j} V_{\setminus j,\setminus j}^{-1} V_{\setminus j,j}$, $\mu_j = Z_{:,\setminus j} V_{\setminus j,\setminus j}^{-1} V_{\setminus j,j}$.
        Get one sample $Z_{:,j} \sim TMN(\mu_j, \sigma_j^2 I, F^{(j)})$ efficiently by using the Hough Envelope algorithm, see section 4.1.
    end for
    Resample $V \sim IW(df + n, V_0 + Z^\top Z)$.
    Compute correlation matrix C, s.t. $C_{i,j} = V_{i,j} / \sqrt{V_{i,i} V_{j,j}}$ and store sample, $C^{(k)} \leftarrow C$.
end for

5 An application on the Bayesian Gaussian copula factor model

In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: "How do the two schemes compare when we exploit the full advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?" We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead.

Experimental setup

In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff's model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity.
We show that unlike the HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models, even for small samples. Worse, PX fails in ways one cannot easily detect.

Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski's [17, sec. 3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such "interweaving" schemes has been addressed in [19].

Results

We perform simulations of 10000 iterations, n = 1000 observations (rows of Y), travel time π/2 for HMC with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel Core i7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p).

Figure   p (vars)   k (latent factors)   M (ordinal levels)   elapsed (mins): HMC   elapsed (mins): PX
3(a)     20         5                    2                    115                   8
3(b)     10         3                    2                    80                    6
3(c)     10         3                    5                    203                   16

Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm's outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of a MCMC scheme (so the following plots look like a time-series of 10000 timepoints), see Figures 3(a), 3(b) and 3(c).

Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b) but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000).

Main message

HMC reaches a better mode faster (iterations/CPU-time). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal's slice sampler paper [14]. If there was no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal's statement opening Section 8 of his paper: "a wrong answer is obtained without any obvious indication that something is amiss".
The concentration on the posterior mode of PX in these simulations is misleading. PX might seem a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this out.

6 Conclusion

Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method here developed is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but the method itself and its behaviour might provide some new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models.

Acknowledgements

The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1.

References

[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011.
[3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.
[4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013.
[5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012.
[6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.
[8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973.
[9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997.
[10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007.
[11] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010.
[13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013.
[14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.
[15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010.
[16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.
[17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012.
[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] Y. Yu and X. L. Meng. To center or not to center: That is not the question - An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.

same-paper 5 0.76683164 171 nips-2013-Learning with Noisy Labels

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

6 0.73256361 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

7 0.70363665 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

8 0.68868029 66 nips-2013-Computing the Stationary Distribution Locally

9 0.68248701 325 nips-2013-The Pareto Regret Frontier

10 0.65551496 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

11 0.64851302 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

12 0.64140326 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

13 0.63427526 327 nips-2013-The Randomized Dependence Coefficient

14 0.62446707 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

15 0.62136722 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

16 0.62086499 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

17 0.60873312 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

18 0.60810602 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

19 0.60734123 173 nips-2013-Least Informative Dimensions

20 0.60343629 25 nips-2013-Adaptive Anonymity via $b$-Matching