jmlr jmlr2011 jmlr2011-100 knowledge-graph by maker-knowledge-mining

100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels


Source: pdf

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. [sent-9, score-0.258]

2 Traditionally, these risk functions are computed based on a labeled data set. [sent-10, score-0.254]

3 We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. [sent-11, score-0.238]

4 We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. [sent-12, score-0.235]

5 In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. [sent-13, score-0.224]

6 Introduction Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. [sent-15, score-0.258]

7 This is done by minimizing the risk, or expected loss, R(θ) = Ep(X,Y) L(Y, fθ(X)). [sent-18, score-0.154]

8 Since the risk R(θ) depends on the unknown distribution p, it is usually replaced during training with its empirical counterpart Rn(θ) = (1/n) ∑i=1,…,n L(Y(i), fθ(X(i))) (5). [sent-20, score-0.198]

9 This is based on a labeled training set (X(1), Y(1)), …, (X(n), Y(n)) ∼ p drawn iid (6), leading to the estimator θ̂n = arg minθ Rn(θ). [sent-23, score-0.342]
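As context for sentences 7-9 (the supervised baseline), here is a minimal sketch of margin-based empirical risk minimization with the log-loss; it is illustrative only, and the function name empirical_logloss_risk and the synthetic data are assumptions, not material from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def empirical_logloss_risk(theta, X, y):
    """Rn(theta) = (1/n) sum_i log(1 + exp(-y_i * theta^T x_i)), with y_i in {-1, +1}."""
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins))

# Supervised baseline: theta_n = arg min_theta Rn(theta), fit on labeled data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
theta_star = rng.normal(size=5)
y = np.where(X @ theta_star + 0.3 * rng.normal(size=500) >= 0, 1.0, -1.0)

theta_hat = minimize(empirical_logloss_risk, np.zeros(5), args=(X, y), method="L-BFGS-B").x
```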

10 In this paper we construct an estimator for R(θ) using only unlabeled data, that is, using X(1), …, X(n). [sent-26, score-0.209]

11 We then observe that the limit distributions of (8) may be estimated from unlabeled data (7) and that these distributions may be used to measure margin-based losses such as (2)-(4). [sent-32, score-0.166]

12 We examine two novel unsupervised applications: (i) estimating margin-based losses in transfer learning and (ii) training margin-based classifiers. [sent-33, score-0.309]

13 Our empirical evaluation shows the effectiveness of the proposed framework in risk estimation and classifier training without any labeled data. [sent-35, score-0.298]

14 Label scarcity is a well-known problem which has led to the emergence of semisupervised learning: learning using a few labeled examples and many unlabeled ones. [sent-37, score-0.274]

15 Specifically, we construct an estimator for R(θ) defined in (1) using the unlabeled data (7), which we denote R̂n(θ ; X(1), …, X(n)). [sent-41, score-0.209]

16 The loglikelihood (10) does not use labeled data (it marginalizes over the label y(i) ). [sent-81, score-0.157]

17 The estimation problem (11) is equivalent to the problem of maximum likelihood for means and variances of a Gaussian mixture model where the label marginals are assumed to be known. [sent-86, score-0.187]
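Sentence 17 reduces unsupervised risk estimation to maximum likelihood for a two-component Gaussian mixture over the projected values fθ(X), with the label marginals held fixed at their known values. A hedged sketch of that fit via EM follows; the initialization, iteration count, and the helper name fit_mixture_known_marginals are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def fit_mixture_known_marginals(z, p_pos, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture with KNOWN, fixed mixing weights.

    z     : projected values f_theta(X) for the unlabeled points, shape (n,)
    p_pos : the known label marginal p(Y = +1); p(Y = -1) = 1 - p_pos
    Returns (mu_pos, mu_neg, sigma_pos, sigma_neg). Component 0 is (arbitrarily)
    tied to y = +1 here; in the paper the tie is resolved by p(Y = 1) != p(Y = -1).
    """
    mu = np.array([np.quantile(z, 0.75), np.quantile(z, 0.25)])   # crude initialization
    sigma = np.array([z.std(), z.std()])
    w = np.array([p_pos, 1.0 - p_pos])                            # fixed, never re-estimated

    for _ in range(n_iter):
        # E-step: posterior responsibilities under the fixed mixing weights
        dens = np.vstack([w[k] * norm.pdf(z, mu[k], sigma[k]) for k in range(2)]) + 1e-300
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: update only the means and standard deviations
        for k in range(2):
            nk = resp[k].sum()
            mu[k] = (resp[k] * z).sum() / nk
            sigma[k] = np.sqrt((resp[k] * (z - mu[k]) ** 2).sum() / nk + 1e-12)
    return mu[0], mu[1], sigma[0], sigma[1]
```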

18 The estimator R̂n (12) is consistent in the limit of infinite unlabeled data: P(limn→∞ R̂n(θ) = R(θ)) = 1. [sent-89, score-0.247]

19 The two risk estimators R̂n(θ) (12) and Rn(θ) (5) approximate the expected loss R(θ). [sent-91, score-0.154]

20 Under suitable conditions arg minθ R̂n(θ) converges to the expected risk minimizer: P(limn→∞ arg minθ∈Θ R̂n(θ) = arg minθ∈Θ R(θ)) = 1. [sent-94, score-0.367]

21 This far-reaching conclusion implies that in cases where arg minθ R(θ) is the Bayes classifier (as is the case with exponential loss, log loss, and hinge loss) we can retrieve the optimal classifier without a single labeled data point. [sent-95, score-0.171]

22 One way to verify this is empirically, as we show in Figures 1-3 which contrast the histogram with a fitted normal pdf for text, digit images, and face images data. [sent-98, score-0.36]

23 , 2004, MNIST digit images, and face images, Pham et al. [sent-109, score-0.177]

24 The twelve panels show that even in moderate dimensionality (RCV1: 1000 top words, MNIST digits: 784 pixels, face images: 400 pixels) the assumption that fθ (X)|Y is normal holds often for randomly drawn θ. [sent-112, score-0.291]

25 [Figure residue: running header and axis ticks; panel labels include face images, RCV1 text data, Fisher's LDA, and logistic regression.] [sent-113, score-0.199]

26 , 2004, MNIST digit images, and face images, Pham et al. [sent-117, score-0.177]

27 The twelve panels show that even in moderate dimensionality (RCV1: 1000 top words, MNIST digits: 784 pixels, face images: 400 pixels) the assumption that fθ (X)|Y is normal holds well for fitted θ values (except perhaps for L1 regularization in the last row which promotes sparse θ). [sent-120, score-0.291]

28 In practice one needs to verify the normality assumption empirically, which is simple to do by comparing the empirical histogram of fθ (X) with that of a fitted mixture of Gaussians. [sent-180, score-0.153]
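A quick way to carry out the check suggested in sentence 28 is to compare the empirical histogram of fθ(X) against the density of a fitted two-Gaussian mixture. The sketch below assumes mixture parameters obtained, for example, from the EM sketch above; the summary statistic and function names are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(alpha, p_pos, mu_pos, mu_neg, s_pos, s_neg):
    """Density of the fitted two-component mixture of f_theta(X)."""
    return p_pos * norm.pdf(alpha, mu_pos, s_pos) + (1 - p_pos) * norm.pdf(alpha, mu_neg, s_neg)

def normality_check(z, p_pos, mu_pos, mu_neg, s_pos, s_neg, bins=40):
    """Compare the histogram of projected values z with the fitted mixture density."""
    hist, edges = np.histogram(z, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    fitted = mixture_pdf(centers, p_pos, mu_pos, mu_neg, s_pos, s_neg)
    # crude summary: average absolute gap between the histogram and the fitted density
    return np.mean(np.abs(hist - fitted))
```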

29 2 Unsupervised Consistency of R̂n(θ): We start by proving identifiability of the maximum likelihood estimator (MLE) for a mixture of two Gaussians with known ordering of mixture proportions. [sent-185, score-0.249]

30 Proposition 8 Assuming known label marginals with p(Y = 1) ≠ p(Y = −1), the Gaussian mixture family pµ,σ(x) = p(y = 1) N(x ; µ1, σ1²) + p(y = −1) N(x ; µ−1, σ−1²) is identifiable. [sent-189, score-0.187]

31 Proof It can be shown that the family of Gaussian mixture models with no a priori information about the label marginals is identifiable up to a permutation of the labels y (Teicher, 1963). [sent-190, score-0.230]

32 Proposition 10 Under the assumptions of Proposition 8 and assuming the loss L is given by one of (2)-(4) with a normal fθ(X)|Y ∼ N(µy, σy²), the plug-in risk estimate is R̂n(θ) = ∑y∈{−1,+1} p(y) ∫ pµ(n),σ(n)(fθ(X) = α | y) L(y, α) dα. [sent-202, score-0.221]

33 Proof The plug-in risk estimate R̂n in (14) is a continuous function (when L is given by (2), (3), or (4)) of µ̂1(n), µ̂−1(n), σ̂1(n), σ̂−1(n) (note that µ̂y and σ̂y are functions of θ), which we denote R̂n(θ) = h(µ̂1(n), µ̂−1(n), σ̂1(n), σ̂−1(n)). [sent-204, score-0.154]
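Proposition 10 plugs the fitted means and variances into an integral over the loss. A numerical sketch of that plug-in computation for the log-loss is given below; the quadrature grid and the function name plugin_logloss_risk are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def plugin_logloss_risk(p_pos, mu_pos, mu_neg, s_pos, s_neg, grid_width=8.0, n_grid=2001):
    """Plug-in estimate: sum over y of p(y) * integral of N(alpha; mu_y, s_y^2) * L(y, alpha) d alpha,
    with the log-loss L(y, alpha) = log(1 + exp(-y * alpha)), computed on a quadrature grid."""
    risk = 0.0
    for y, p_y, mu_y, s_y in [(+1, p_pos, mu_pos, s_pos), (-1, 1 - p_pos, mu_neg, s_neg)]:
        alpha = np.linspace(mu_y - grid_width * s_y, mu_y + grid_width * s_y, n_grid)
        loss = np.logaddexp(0.0, -y * alpha)
        risk += p_y * np.trapz(norm.pdf(alpha, mu_y, s_y) * loss, alpha)
    return risk
```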

34 This surprising result indicates that in some cases it is possible to retrieve the expected risk minimizer (and therefore the Bayes classifier in the case of the hinge loss, log-loss and exp-loss) using only unlabeled data. [sent-210, score-0.282]

35 Uniform convergence of the risk estimator R̂n(θ) follows. [sent-213, score-0.235]

36 We do so by computing the asymptotic variance of the estimator, which equals the inverse Fisher information, √n(η̂n^mle − η0) ⇝ N(0, I⁻¹(ηtrue)), and analyzing its dependency on the model parameters. [sent-240, score-0.311]
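Sentence 36 invokes standard MLE asymptotics. A self-contained Monte Carlo illustration of √n(η̂n − η0) ⇝ N(0, I⁻¹(η0)) for the simpler single-Gaussian case, where I⁻¹(µ, σ) = diag(σ², σ²/2), is sketched below; it is a generic check and not part of the paper.

```python
import numpy as np

def mle_gaussian(z):
    """MLE of (mu, sigma) for a single Gaussian sample (np.std uses the 1/n normalization)."""
    return z.mean(), z.std()

rng = np.random.default_rng(1)
mu0, sigma0, n, reps = 1.0, 2.0, 2000, 5000

# Draw many samples and record sqrt(n) * (eta_hat - eta_0).
errs = np.array([
    np.sqrt(n) * (np.array(mle_gaussian(rng.normal(mu0, sigma0, n))) - np.array([mu0, sigma0]))
    for _ in range(reps)
])

# The empirical covariance should approach the inverse Fisher information diag(sigma0^2, sigma0^2 / 2).
print(np.cov(errs.T))                          # roughly [[4, 0], [0, 2]] for sigma0 = 2
print(np.diag([sigma0 ** 2, sigma0 ** 2 / 2]))
```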

37 In some cases it is more instructive to consider the asymptotic variance of the risk estimator R̂n(θ) rather than that of the parameter estimate for η = (µ, σ). [sent-244, score-0.276]

38 5 Multiclass Classification Thus far, we have considered unsupervised risk estimation in binary classification. [sent-251, score-0.339]

39 In this case the margin vector is associated with the multiclass classifier Ŷ = arg maxk=1,…,K fθk(X). [sent-253, score-0.162]

40 We thus have K Gaussian mixture models, each one with K mixture components. [sent-264, score-0.168]

41 The estimated parameters are plugged in as before into the multiclass risk R(θ) = Ep(fθ(X),Y) L(Y, fθ(X)). [sent-265, score-0.199]

42 Figure 4: Left panel: asymptotic accuracy (inverse of trace of asymptotic variance) of R̂n(θ) for logloss as a function of the imbalance of the class marginal p(Y). [sent-301, score-0.322]

43 Furthermore, if the loss L is a continuous function of these parameters (as is the case for (15)-(16)), the risk estimator R̂n(θ) is consistent as well. [sent-312, score-0.235]

44 The transfer learning assumption that labeled data exists for the training domain but not for the test domain motivates the use of our unsupervised risk estimation. [sent-316, score-0.611]

45 Figure 5: The relative accuracy of R̂n (measured by |R̂n(θ) − Rn(θ)|/Rn(θ)) as a function of n, classifier accuracy (acc) and the label marginal p(Y) (left: logloss, right: hinge-loss). [sent-367, score-0.186]

46 This controlled setting allows us to examine the accuracy of the risk estimator as a function of n, p(Y ), and the classifier accuracy. [sent-372, score-0.273]

47 Table 1 shows the accuracy of logloss estimation for a real world transfer learning experiment based on the 20-newsgroup data. [sent-380, score-0.267]

48 Table 1: Error in estimating logloss for logistic regression classifiers trained on one 20-newsgroup classification task and tested on another. [sent-413, score-0.317]

49 The fifth and sixth columns indicate the train set size and the label marginal distribution. [sent-419, score-0.165]

50 We trained a logistic regression classifier on the training set and estimated its risk on a test set drawn from a different distribution. [sent-423, score-0.414]

51 Our unsupervised risk estimator was quite effective in estimating the risk with relative accuracy greater than 96% and absolute error less than 0. [sent-424, score-0.612]

52 Application 2: Unsupervised Learning of Classifiers. Our second application is a very ambitious one: training classifiers using unlabeled data by minimizing the unsupervised risk estimate θ̂n = arg minθ R̂n(θ). [sent-427, score-0.582]

53 We evaluate the performance of the learned classifier θ̂n based on three quantities: (i) the unsupervised risk estimate R̂n(θ̂n), (ii) the supervised risk estimate Rn(θ̂n), and (iii) its classification error rate. [sent-428, score-0.339]

54 We also compare the performance of θ̂n = arg minθ R̂n(θ) with that of its supervised analog arg minθ Rn(θ). [sent-429, score-0.360]

55 Algorithm fragment: input X(1), …, X(n) ∈ Rd, p(Y), and grid-size τ; initialize θi ∼ Uniform(−2, 2) for all i; repeat: for i = 1 to d, construct a τ-point grid in the range [θi − 4τ, θi + 4τ] and compute the risk estimate (14) with all dimensions of θ(t) fixed except [θ(t)]i, which is evaluated at each grid point. [sent-454, score-0.218]
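Sentence 55 is a fragment of the paper's coordinate-wise grid-search training procedure. A hedged reconstruction is sketched below; the sweep count, grid size, and the callable risk_of_theta (assumed to fit the two-Gaussian mixture to fθ(X) = Xθ and return the plug-in risk, for example via the sketches above) fill in details the fragment does not specify.

```python
import numpy as np

def grid_search_unsupervised(X, p_pos, risk_of_theta, n_sweeps=20, grid_size=9, tau=0.5, seed=0):
    """Coordinate-wise grid search for theta minimizing an unsupervised risk estimate.

    risk_of_theta(theta, X, p_pos) is assumed to return the plug-in risk estimate
    computed from the unlabeled data X and the known marginal p(Y = +1) = p_pos.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.uniform(-2.0, 2.0, size=d)        # initialize theta_i ~ Uniform(-2, 2)
    for _ in range(n_sweeps):
        for i in range(d):
            # grid of candidate values for coordinate i, all other coordinates fixed
            grid = np.linspace(theta[i] - 4 * tau, theta[i] + 4 * tau, grid_size)
            risks = []
            for g in grid:
                cand = theta.copy()
                cand[i] = g
                risks.append(risk_of_theta(cand, X, p_pos))
            theta[i] = grid[int(np.argmin(risks))]
    return theta
```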

56 Our classification task was to distinguish images of the digit 1 (positive) from the digit 2 (negative), resulting in 14,867 samples and p(Y = 1) = 0. [sent-461, score-0.282]

57 Figures 6-7 indicate that minimizing the unsupervised logloss estimate is quite effective in learning an accurate classifier without labels. [sent-464, score-0.334]

58 Both the unsupervised and supervised risk estimates R̂n(θ̂n), Rn(θ̂n) decay nicely when computed over the train set as well as the test set. [sent-465, score-0.550]

59 For comparison purposes, supervised logistic regression with the same n achieved an only slightly better test set error rate: 0. [sent-467, score-0.280]

60 In another experiment we examined the proposed approach on several different data sets and compared the classification performance with a supervised baseline (logistic regression) and Gaussian mixture modeling (GMM) clustering with known label proportions in the original data space (Table 2). [sent-472, score-0.27]

61 Figure 6: Performance of the unsupervised logistic regression classifier θ̂n computed using Algorithm 1 (left) and Algorithm 2 (right) on the RCV1 data set. [sent-533, score-0.353]

62 The top two rows show the decay of the two risk estimates R̂n(θ̂n), Rn(θ̂n) as a function of the algorithm iterations. [sent-534, score-0.235]

63 The bottom row displays the decay of the test set error rate of θ̂n as a function of the algorithm iterations. [sent-536, score-0.246]

64 For comparison, the test error rate for supervised logistic regression with the same n is 0. [sent-539, score-0.28]

65 [Figure legend and axis residue: train-set curves of R̂n(θ) and Rn(θ).] [sent-542, score-0.220]

66 [Figure legend and axis residue: test-set curves of R̂n(θ) and Rn(θ).] [sent-550, score-0.192]

67 Figure 7: Performance of the unsupervised logistic regression classifier θ̂n computed using Algorithm 1 (left) and Algorithm 2 (right) on the MNIST data set. [sent-572, score-0.353]

68 The top two rows show the decay of the two risk estimates R̂n(θ̂n), Rn(θ̂n) as a function of the algorithm iterations. [sent-573, score-0.235]

69 The risk estimates of θ̂n were computed using the train set (top) and the test set (middle). [sent-574, score-0.257]

70 For comparison, the test error rate for supervised logistic regression with the same n is 0. [sent-578, score-0.28]

71 [Table 2 residue, running header removed] Data sets: RCV1, Mnist, 20 news group, USPS, Umist, Arcene, Isolet, Dexter, Secom, Pham faces, CMU pie face, Madelon; Dimensions column entries: top 504 words, 784, top 750 words, 256, 400 PCA components, 1000 PCA components, 617, top-700 words, 591, 400, 1024, 500; the Supervised log-reg column is truncated. [sent-580, score-0.180]

72 Table 2: Comparison (test set error rate) between supervised logistic regression, unsupervised logistic regression, and Gaussian mixture modeling in the original data space. [sent-616, score-0.420]

73 The unsupervised classifier performs better than the GMM clustering on the original space and compares well with its supervised counterpart on most data sets. [sent-617, score-0.249]

74 We note that our unsupervised approach achieves test set errors comparable to the supervised logistic regression in several data sets. [sent-623, score-0.465]

75 The poor performance of the unsupervised technique on the Dexter data set is due to the fact that the data contains many irrelevant features. [sent-624, score-0.185]

76 Small deviations of the assumed p(Y ) from the true p(Y ) result in a small degradation of logloss estimation quality and testing set error rate. [sent-635, score-0.188]

77 Figure 8: Performance of unsupervised classifier training on RCV1 data (top class vs. [sent-654, score-0.229]

78 The performance of the estimated classifier (in terms of training set empirical logloss Rn (5) and test error rate measured using held-out labels) decreases with the deviation between the assumed and true p(Y = 1) (true p(Y = 1) = 0. [sent-656, score-0.28]

79 2 Effect of Regularization and Dimensionality Reduction In Figure 9 we examine the effect of regularization on the performance of the unsupervised classifier. [sent-660, score-0.185]

80 In Figure 10 we examine the effect of reducing the data dimensionality via PCA prior to training the unsupervised classifier. [sent-670, score-0.268]

81 For the original dimensionality of 256, or a slightly lower dimensionality, the classification performance of the unsupervised classifier is comparable to that of the supervised one. [sent-672, score-0.263]

82 The supervised classifier also degrades in performance as fewer dimensions are used, but not as quickly as the unsupervised classifier. [sent-675, score-0.313]

83 Related Work Semi-supervised approaches: Semisupervised learning is closely related to our work in that unsupervised classification may be viewed as a limiting case. [sent-677, score-0.185]

84 One of the first attempts at studying the sample complexity of classification with unlabeled and labeled data was by Castelli and Cover (1995). [sent-678, score-0.228]

85 They consider a setting when data is generated by mixture distributions and show that with infinite unlabeled data, the probability of error decays exponentially faster in the labeled data to the Bayes risk. [sent-679, score-0.312]

86 Figure 9: Test set error rate versus regularization parameter (L2 on the left panel and L1 on the right panel) for supervised and unsupervised logistic regression on the RCV1 data set. [sent-715, score-0.519]

87 Figure 10: Test set error rate versus the number of dimensions used (extracted via PCA) for supervised and unsupervised logistic regression on the USPS data set. [sent-725, score-0.536]

88 They also analyze the case when there are only finite labeled and unlabeled data samples, with known class conditional densities but unknown mixing proportions (Castelli and Cover, 1996). [sent-728, score-0.293]

89 In addition, assuming known mixing proportions, we propose a framework for training a classifier with no labeled samples, while approaches above still need labeled samples for classification. [sent-734, score-0.244]

90 Quadrianto et al. (2009) aim to estimate the labels of an unlabeled testing set using known label proportions of several sets of unlabeled observations. [sent-740, score-0.421]

91 This approach attempts to estimate a conditional probabilistic model in an unsupervised way by maximizing mutual information between the empirical input distribution and the label distribution. [sent-746, score-0.242]

92 A key difference is the focus on probabilistic classifiers, in particular logistic regression, whereas our approach is based on empirical risk minimization, which also includes SVM. [sent-747, score-0.322]

93 It is similar in that it estimates the 0/1 risk rather than the margin-based risk. [sent-752, score-0.200]

94 An important distinction between our work and the references above is that our work provides an estimate for the margin-based risk and therefore leads naturally to unsupervised versions of logistic regression and support vector machines. [sent-754, score-0.507]

95 Experimental results show that in practice the accuracy of the unsupervised classifier is of the same order as (though naturally slightly lower than) that of its supervised analog. [sent-756, score-0.287]

96 Remarkably, the theory states that assuming normality of fθ (X) and a known p(Y ) we are able to estimate the risk R(θ) without a single labeled example. [sent-761, score-0.323]

97 That is, the risk estimate converges to the true risk as the number of unlabeled examples increases. [sent-762, score-0.475]

98 Can it be extended to non-classification scenarios such as margin-based regression or margin-based structured prediction? [sent-769, score-0.156]

99 The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-794, score-0.228]

100 Learning from a mixture of labeled and unlabeled examples with parametric side information. [sent-845, score-0.312]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rn', 0.315), ('clt', 0.281), ('balasubramanian', 0.263), ('ebanon', 0.228), ('onmez', 0.228), ('abels', 0.194), ('unsupervised', 0.185), ('argin', 0.16), ('risk', 0.154), ('logloss', 0.149), ('mle', 0.149), ('lassification', 0.149), ('unlabeled', 0.128), ('mnist', 0.117), ('acc', 0.111), ('er', 0.107), ('digit', 0.106), ('zi', 0.106), ('logistic', 0.104), ('classi', 0.102), ('labeled', 0.1), ('ers', 0.095), ('reg', 0.09), ('donmez', 0.088), ('pham', 0.088), ('usl', 0.088), ('mixture', 0.084), ('estimator', 0.081), ('transfer', 0.08), ('face', 0.071), ('arg', 0.071), ('castelli', 0.07), ('clts', 0.07), ('comp', 0.07), ('images', 0.07), ('normality', 0.069), ('normal', 0.067), ('proportions', 0.065), ('supervised', 0.064), ('dimensions', 0.064), ('regression', 0.064), ('proposition', 0.062), ('gomes', 0.06), ('tted', 0.06), ('text', 0.058), ('zd', 0.057), ('label', 0.057), ('versus', 0.055), ('train', 0.055), ('isolet', 0.054), ('arcene', 0.054), ('decaying', 0.053), ('krishnakumar', 0.053), ('overlayed', 0.053), ('teicher', 0.053), ('marginal', 0.053), ('test', 0.048), ('panel', 0.047), ('pixels', 0.047), ('usps', 0.046), ('marginals', 0.046), ('iid', 0.046), ('margin', 0.046), ('regularized', 0.046), ('semisupervised', 0.046), ('gmm', 0.046), ('pdf', 0.046), ('fisher', 0.046), ('multiclass', 0.045), ('lebanon', 0.045), ('quadrianto', 0.045), ('dexter', 0.045), ('decay', 0.044), ('training', 0.044), ('labels', 0.043), ('asymptotic', 0.041), ('lda', 0.041), ('sci', 0.04), ('twelve', 0.04), ('rec', 0.04), ('dependency', 0.04), ('true', 0.039), ('dimensionality', 0.039), ('limit', 0.038), ('accuracy', 0.038), ('subjects', 0.037), ('panels', 0.037), ('top', 0.037), ('consistency', 0.036), ('documents', 0.035), ('berk', 0.035), ('brightness', 0.035), ('expectancy', 0.035), ('ferst', 0.035), ('lindberg', 0.035), ('pie', 0.035), ('pinar', 0.035), ('ratsaby', 0.035), ('rinott', 0.035), ('rvs', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

2 0.11718647 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

3 0.10942667 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

4 0.068998657 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding

5 0.065335698 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

6 0.06305673 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

7 0.062666252 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.061852358 12 jmlr-2011-Bayesian Co-Training

9 0.059682902 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

10 0.059025224 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

11 0.057772331 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

12 0.05337907 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

13 0.051098205 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

14 0.050853677 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

15 0.049633853 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

16 0.048963312 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough

17 0.048235413 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

18 0.046001062 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

19 0.045553029 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

20 0.043260984 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.253), (1, -0.065), (2, -0.018), (3, 0.016), (4, 0.001), (5, -0.09), (6, -0.036), (7, -0.036), (8, -0.171), (9, -0.043), (10, -0.162), (11, -0.059), (12, 0.094), (13, -0.084), (14, -0.114), (15, 0.221), (16, -0.186), (17, -0.163), (18, -0.039), (19, -0.136), (20, 0.043), (21, 0.044), (22, 0.009), (23, 0.072), (24, -0.01), (25, 0.002), (26, -0.102), (27, 0.19), (28, -0.008), (29, 0.126), (30, 0.052), (31, -0.181), (32, -0.114), (33, -0.033), (34, 0.128), (35, -0.017), (36, 0.044), (37, 0.009), (38, -0.022), (39, 0.012), (40, -0.076), (41, -0.021), (42, 0.116), (43, -0.118), (44, 0.063), (45, 0.016), (46, -0.027), (47, 0.013), (48, -0.036), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96843332 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

2 0.74056667 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

3 0.6947419 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

4 0.51780063 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

5 0.41376054 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng

Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier

6 0.39954966 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

7 0.3872045 12 jmlr-2011-Bayesian Co-Training

8 0.37208483 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

9 0.36880797 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

10 0.36225846 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

11 0.36182407 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

12 0.34290045 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

13 0.30067173 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

14 0.29449368 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

15 0.28908262 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

16 0.28325081 22 jmlr-2011-Differentially Private Empirical Risk Minimization

17 0.2819089 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

18 0.28172615 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

19 0.27543157 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

20 0.27268282 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.019), (9, 0.012), (10, 0.016), (24, 0.038), (31, 0.697), (32, 0.015), (41, 0.016), (60, 0.011), (73, 0.022), (78, 0.058), (90, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98702645 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

2 0.96655107 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes

Author: David M. Blei, Peter I. Frazier

Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings. We study its empirical performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data. We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. Keywords: Chinese restaurant processes, Bayesian nonparametrics

3 0.95714754 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls, s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of O(m² n_tot log(n_max)/ε²), where m is the no. of training data points, n_max and n_tot are the maximum no. of kernels in any group and the total no. of kernels respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability

4 0.95643443 95 jmlr-2011-Training SVMs Without Offset

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out, that the new algorithm needs significantly less iterations and also runs substantially faster than standard training algorithms for SVMs with offset. Keywords: support vector machines, decomposition algorithms

5 0.92610222 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture

6 0.74905133 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

7 0.72196376 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

8 0.69415754 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

9 0.69388902 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

10 0.69245893 12 jmlr-2011-Bayesian Co-Training

11 0.68947452 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

12 0.68945271 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

13 0.68369043 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

14 0.67572153 36 jmlr-2011-Generalized TD Learning

15 0.67360741 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

16 0.66364098 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

17 0.66100675 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

18 0.65978903 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

19 0.65826076 16 jmlr-2011-Clustering Algorithms for Chains

20 0.65787768 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling