jmlr jmlr2012 jmlr2012-65 knowledge-graph by maker-knowledge-mining

65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models


Source: pdf

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 …(e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. [sent-15, score-0.452]

2 The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. [sent-16, score-0.357]

3 Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines. [sent-19, score-0.433]

4 An LDA model posits that each document is an admixture of latent topics, where each topic is represented as a unique unigram distribution over the vocabulary. [sent-22, score-0.425]

5 The document-specific admixture proportion vector θ , also known as the topic vector, is modeled as a latent Dirichlet random variable, and can be regarded as a low dimensional representation of the document in a topical space. [sent-25, score-0.6]
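
As a concrete illustration of this generative view, here is a minimal Python sketch. It is not the authors' code; the corpus dimensions (K topics, V vocabulary words, N words per document) and all hyperparameter values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, N = 5, 1000, 80                 # hypothetical: topics, vocabulary size, words per document
    alpha = np.full(K, 0.1)               # Dirichlet prior over topic proportions
    beta = rng.dirichlet(np.full(V, 0.01), size=K)   # each topic: a unigram distribution over words

    theta = rng.dirichlet(alpha)          # document-specific topic vector (latent Dirichlet variable)
    z = rng.choice(K, size=N, p=theta)    # topic assignment for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # draw each word from its assigned topic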

6 Representative attempts include the supervised topic model (sLDA) (Blei and McAuliffe, 2007), which captures a real-valued document rating as a regression response; and multi-class sLDA (Wang et al., 2009). [sent-39, score-0.425]

7 More variants of supervised topic models can be found in a number of applied domains, such as the aspect rating model (Titov and McDonald, 2008) for predicting ratings for each aspect of a hotel and the credit attribution model (Ramage et al., 2009). [sent-43, score-0.391]

8 In computer vision, several supervised topic models have been designed for understanding complex scene images (Sudderth et al., 2005). [sent-45, score-0.334]

9 It is worth pointing out that among existing supervised topic models for incorporating side information, there are two classes of approaches, namely, the downstream supervised topic model (DSTM) and the upstream supervised topic model (USTM). [sent-49, score-1.018]

10 Another distinction between existing supervised topic models is the training criterion, or more precisely, the choice of objective function in the optimization-based learning. [sent-64, score-0.334]

11 To the best of our knowledge, all the existing supervised topic models are trained by optimizing a likelihood-based objective; the highly successful margin-based objectives such as the hinge loss commonly used in discriminative models such as SVMs have never been employed. [sent-70, score-0.395]

12 In this paper, we propose maximum entropy discrimination latent Dirichlet allocation (MedLDA), a supervised topic model leveraging the maximum margin principle for making more effective use of side information during estimation of latent topical representations. [sent-71, score-0.8]

13 It employs a composite objective motivated by a tradeoff between two components—the negative log-likelihood of an underlying topic model, which measures the goodness of fit for the document contents, and a measure of prediction error on the training data. [sent-77, score-0.376]

14 This interplay can yield latent topical representations that are more discriminative and more suitable for supervised prediction tasks, as we demonstrate in the experimental section. [sent-86, score-0.43]

15 Preliminaries. We begin with a brief overview of the fundamentals of topic models, support vector machines, and the maximum entropy discrimination formalism (Jaakkola et al., 1999). [sent-105, score-0.320]

16 Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model that projects a text document into a latent low-dimensional space spanned by a set of automatically learned topical bases. [sent-109, score-0.364]

17 With a slight abuse of notation, we use β_{z_dn} to denote the topic that is selected by the non-zero element of z_dn. [sent-123, score-0.440]

18 To estimate the unknown parameters (α, β) and to infer the posterior distributions of the latent variables {θ_d, z_d}, one maximizes the marginal data likelihood p(W | α, β). [sent-125, score-0.398]

19 As we have stated, the unsupervised LDA described above does not use side information for learning topics and inferring topic vectors θ . [sent-138, score-0.384]

20 In order to consider side information appropriately for discovering more predictive representations, supervised topic models (sLDA) (Blei and McAuliffe, 2007) introduce a response variable Y to LDA for each document, as shown in Figure 1. [sent-139, score-0.414]

21 For regression, where y ∈ R, the generative process of sLDA is similar to LDA, but with an additional step—draw a response variable y | z_d, η, δ² ∼ N(η^⊤ z̄_d, δ²) for each document d, where η is the regression weight vector and δ² is a noise variance parameter. [sent-140, score-0.383]

22 Then, the joint distribution of sLDA is p({θ_d, z_d}, y, W | α, β, η, δ²) = ∏_{d=1}^{D} p(θ_d | α) [∏_{n=1}^{N} p(z_dn | θ_d) p(w_dn | z_dn, β)] p(y_d | η^⊤ z̄_d, δ²).  (2) [sent-141, score-0.542]
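
Continuing the sketch above, the extra step sLDA adds for regression is a single draw per document; the weights and noise variance here are hypothetical, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(1)
    K, N = 5, 80
    theta = rng.dirichlet(np.full(K, 0.1))
    z = rng.choice(K, size=N, p=theta)
    z_bar = np.bincount(z, minlength=K) / N       # empirical topic frequencies \bar{z}_d

    eta = rng.normal(size=K)                      # regression weight vector (hypothetical)
    delta2 = 0.25                                 # noise variance (hypothetical)
    y = rng.normal(eta @ z_bar, np.sqrt(delta2))  # y | z_bar ~ N(eta' z_bar, delta^2)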

23 This is because the non-Gaussian probability distribution in Equation (5) is highly nonlinear in η and z, and its normalization factor can make the topic assignments of different words in the same document strongly coupled. [sent-157, score-0.361]

24 DiscLDA (Lacoste-Julien et al., 2008) is yet another supervised topic model for classification. [sent-161, score-0.312]

25 This progress notwithstanding, to the best of our knowledge, current developments in supervised topic models have been built solely on a likelihood-driven probabilistic inference paradigm. [sent-163, score-0.392]

26 The arguably more powerful max-margin based techniques widely used in learning discriminative models have not been exploited to learn supervised topic models. [sent-164, score-0.373]

27 The main goal of this paper is to systematically investigate how the max-margin principle can be exploited inside a topic model to learn topics that are better at discriminating documents than current likelihood-driven learning achieves, while retaining the semantic interpretability that the latter allows. [sent-165, score-0.415]

28 Below we use document rating prediction as an example to recapitulate the ideas behind support vector regression (SVR) (Smola and Schölkopf, 2003), which we will shortly leverage to build our first instance of a max-margin topic model. [sent-173, score-0.389]
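
The core ingredient of SVR is the ε-insensitive loss, which charges nothing for predictions within an ε-tube around the target and a linear penalty outside it. A minimal sketch (the ratings, predictions, and ε value are made up for illustration):

    import numpy as np

    def eps_insensitive_loss(y_true, y_pred, eps=0.1):
        # zero inside the eps-tube around the target, linear outside it
        return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

    ratings = np.array([3.0, 4.5])                      # hypothetical document ratings
    predictions = np.array([3.05, 3.8])
    print(eps_insensitive_loss(ratings, predictions))   # -> [0.  0.6]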

29 3 Maximum Entropy Discrimination. To unite the principles behind topic models and SVR, namely, Bayesian inference and max-margin learning, we employ a formalism known as maximum entropy discrimination (MED) (Jaakkola et al., 1999). [sent-186, score-0.381]

30 To apply the MED idea to learn a supervised topic model, a major difficulty is the presence of heterogeneous latent variables in the topic models, such as the topic vector θ and topic indicator Z. [sent-196, score-1.159]

31 This contrasts with conventional heuristics that first learn a topic model and then independently train a classifier such as an SVM, using the per-document topic vectors resulting from the first step as inputs. [sent-200, score-0.523]

32 In such a heuristic, the document labels are never able to influence the way the topics are learned, and the per-document topic vectors are often found not to be strongly predictive (Xing et al., …). [sent-201, score-0.445]

33 Regressional MedLDA. We first consider the scenario where the numerical-valued ratings of the documents in the corpus are available, and our goal is to learn a supervised topic model specialized for predicting the ratings of new documents through a regression function. [sent-204, score-0.551]

34 …(for strong predictivity) and the topic model architecture (for topic discovery). [sent-213, score-0.504]

35 For brevity, here we present a regressional MedLDA that uses the supervised sLDA as the underlying topic model. [sent-220, score-0.33]

36 As discussed in Appendix B, the underlying topic model can also be an unsupervised LDA. [sent-222, score-0.318]

37 Let q(η, {θ_d, z_d}) be a variational approximation to the posterior p(η, {θ_d, z_d} | α, β, δ², y, W). [sent-227, score-0.663]

38 Then, an upper bound L^{bs}(q; α, β, δ²) on the negative log-likelihood is L^{bs}(q; α, β, δ²) ≜ −E_q[log p(η, {θ_d, z_d}, y, W | α, β, δ²)] − H(q(η, {θ_d, z_d})) = KL(q(η) ‖ p_0(η)) + E_{q(η)}[L^s]. [sent-228, score-0.634]

39 The margin constraints in P2 are of the same form as those in P0, but in expectation form, because both the topic assignments Z and the parameters η are latent random variables in MedLDAr. [sent-236, score-0.398]

40 Therefore, problem P2 combines maximum margin learning and maximum likelihood estimation (with appropriate regularization), and the two components are coupled by sharing the latent topic assignments Z and the parameters η. [sent-245, score-0.424]

41 The max-margin learning and topic discovery procedures are coupled together via the constraints, which are defined on the expectations of the model parameters η and the latent topical assignments Z. [sent-249, score-0.545]

42 The other Lagrange multipliers, which are not explicitly involved in topic inference and the estimation of q(η), are solved for according to the KKT conditions. [sent-330, score-0.314]

43 Classificational MedLDA. Now, we present the MedLDA classification model, in which the discrete labels of the documents are available and our goal is to learn a supervised topic model specialized for predicting the labels of new documents through a discriminant function. [sent-332, score-0.470]

44 For classification, if the latent topic assignments z ≜ {z_1; · · · ; z_N} of all the words in a document are given, we define the latent linear discriminant function F(y, z, η; w) = η_y^⊤ z̄. [sent-336, score-0.543]

45 However, we cannot directly use the latent function F(y, z, η; w) to make predictions for an observed input w of a document, because the topic assignments z are hidden variables. [sent-342, score-0.476]
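
Prediction can instead be made by taking expectations over the hidden variables under the (approximate) posterior and then maximizing over classes. A sketch of this rule, not the authors' code, with made-up expectations (in practice E_eta would come from q(η) and E_zbar from the variational distribution over topic assignments):

    import numpy as np

    def predict(E_eta, E_zbar):
        # E_eta: (num_classes, K) expected class weight vectors under q(eta)
        # E_zbar: (K,) expected average topic assignment for the document
        scores = E_eta @ E_zbar        # expected discriminant value for every class y
        return int(np.argmax(scores))

    E_eta = np.array([[0.2, -0.1, 0.5],
                      [0.4,  0.3, -0.2]])        # hypothetical: 2 classes, K = 3
    E_zbar = np.array([0.1, 0.3, 0.6])
    print(predict(E_eta, E_zbar))                # -> 0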

46 As discussed earlier, inference under sLDA can be harder and slower because the probability model of the discrete Y in Equation (5) is highly nonlinear in η and Z, both of which are latent variables in our case, and its normalization factor strongly couples the topic assignments of different words in the same document. [sent-353, score-0.409]

47 Therefore, in this paper we focus on the case of using an LDA that models only the likelihood of the document contents W, but not the document labels Y, as the underlying topic model to discover the latent representations Z. [sent-354, score-0.636]

48 Even with this likelihood model, document labels can still influence topic learning and inference because they induce margin constraints pertinent to the topical distributions. [sent-355, score-0.602]

49 The integrated problem of discovering latent topical representations and learning a distribution of classifiers is defined as follows. [sent-364, score-0.330]

50 P3 (MedLDAc): min_{q, q(η), α, β, ξ} L^u(q; α, β) + KL(q(η) ‖ p_0(η)) + (C/D) ∑_{d=1}^{D} ξ_d, s.t. ∀d, ∀y ∈ C: E[η^⊤ Δf_d(y)] ≥ Δℓ_d(y) − ξ_d and ξ_d ≥ 0, where q denotes the variational distribution q({θ_d, z_d}) and Δℓ_d(y) is a non-negative cost function (e.g., …). [sent-366, score-0.356]

51 Variational Algorithm for MedLDAc. As in MedLDAr, we make the fully-factorized mean field assumption that q({θ_d, z_d}) = ∏_{d=1}^{D} q(θ_d | γ_d) ∏_{n=1}^{N} q(z_dn | φ_dn), where γ_d and φ_dn are variational parameters having the same meaning as in MedLDAr. [sent-392, score-0.436]
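
In code, this mean-field family is just two arrays of variational parameters, and the expectations that enter both the likelihood bound and the margin constraints are simple averages of the word-level multinomials. A sketch with hypothetical corpus sizes:

    import numpy as np

    rng = np.random.default_rng(2)
    D, K, N = 3, 5, 80                    # hypothetical: documents, topics, words per document
    gamma = np.ones((D, K))               # Dirichlet parameters of q(theta_d | gamma_d)
    phi = rng.dirichlet(np.ones(K), size=(D, N))  # q(z_dn | phi_dn): one multinomial per word

    # expected average topic assignment per document, E_q[z_bar_d] = mean_n phi_dn;
    # these expectations appear in MedLDA's margin constraints
    E_zbar = phi.mean(axis=1)             # shape (D, K)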

52 The first terms are the same as in standard LDA (Blei et al., 2003), and the last term is due to the max-margin formulation of P3 and reflects our intuition that the discovered latent topical representation is influenced by the margin constraints. [sent-409, score-0.316]

53 In fact, the likelihood component of MedLDA can be any other form of generative topic model, such as correlated topic models (Blei and Lafferty, 2005), or latent space Markov random fields, such as exponential family harmoniums (Welling et al., 2004). [sent-429, score-0.643]

54 The same principle can also be applied to upstream latent topic models, which have been widely used in computer vision applications (Sudderth et al., 2005). [sent-433, score-0.391]

55 In this section, we formulate a general framework for applying the max-margin principle to learn discriminative latent topic models when supervising side information is available, and we discuss further insights into developing approximate inference algorithms. [sent-436, score-0.482]

56 Formally, a maximum entropy discrimination topic model (MedTM) consists of two components—an underlying topic model that fits the observed data and a MED max-margin model that performs prediction. [sent-442, score-0.590]

57 Then, p(D | Ψ) is the marginal data likelihood of the corpus D, which may or may not include the supervising side information, depending on the specific form chosen for the underlying topic model. [sent-451, score-0.336]

58 As discussed before, for a general topic model p(D | Ψ) is intractable; therefore, a generic variational method can be employed. [sent-452, score-0.337]

59 Then, L^t(q(H | ϒ); Ψ, ϒ) is the variational bound on the data likelihood associated with the underlying topic model. [sent-456, score-0.381]

60 For instance, when the underlying topic model is the supervised sLDA, L^t reduces to L^s, as we discussed in Equation (7). [sent-457, score-0.330]

61 When the underlying topic model is unsupervised LDA, the corpus D only contains document contents, and p(H, D |Ψ, ϒ) = p(H, D |Ψ). [sent-458, score-0.419]

62 Based on recent developments in learning latent topic models, two commonly used approaches can be applied to obtain an approximate solution to P5 (MedTM), namely, Markov chain Monte Carlo (MCMC) (Griffiths and Steyvers, 2004) and variational methods (Blei et al., 2003). [sent-473, score-0.428]

63 Likelihood-based structured-prediction latent topic models have been developed for different scenarios, such as image annotation (He and Zemel, 2008) and statistical machine translation (Zhao and Xing, 2007). [sent-483, score-0.389]

64 Experiments. In this section, we provide a qualitative as well as quantitative evaluation of MedLDA on topic estimation, document classification and regression. [sent-486, score-0.334]

65 To visually illustrate the discriminative power of the latent representations, that is, the topic proportion vectors θ of the documents, we show and compare the per-class distribution over topics for each model on the right side of Figure 3. [sent-530, score-0.466]

66 This distribution is computed by averaging the expected topic vector of the documents in each class. [sent-531, score-0.331]
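
A sketch of that computation (the expected topic vectors and labels below are made up; in practice E_theta would come from the inferred variational parameters):

    import numpy as np

    def per_class_topic_distribution(E_theta, labels, num_classes):
        # average the expected topic vectors of the documents belonging to each class
        return np.stack([E_theta[labels == c].mean(axis=0) for c in range(num_classes)])

    E_theta = np.array([[0.7, 0.2, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.1, 0.2, 0.7]])   # hypothetical expected topic proportions, (D, K)
    labels = np.array([0, 0, 1])
    print(per_class_topic_distribution(E_theta, labels, 2))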

67 Figure 4: The average entropy of θ over documents for different topic models on the 20 Newsgroups data. [sent-538, score-0.383]

68 We can see that their per-class average distributions over topics are very different, which suggests that the topical representations learned by MedLDAc have good discriminative power. [sent-539, score-0.338]

69 We compute the entropy of the inferred topic proportion for each document and take the average over the corpus. [sent-557, score-0.364]
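
A sketch of this measurement (the topic proportions below are made up); a lower average entropy corresponds to sparser, more peaked document representations:

    import numpy as np

    def average_topic_entropy(E_theta, eps=1e-12):
        # entropy of each document's topic proportions, averaged over the corpus
        H = -np.sum(E_theta * np.log(E_theta + eps), axis=1)
        return H.mean()

    E_theta = np.array([[0.90, 0.05, 0.05],
                        [0.34, 0.33, 0.33]])    # hypothetical topic proportions
    print(average_topic_entropy(E_theta))       # peaked rows lower the average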

70 Figure 4 shows the average entropy of the different models on testing documents as the number of topics varies. [sent-577, score-0.400]

71 We can see that all the supervised topic models discover more predictive topical representations for classification, and the discriminative max-margin MedLDAc and DiscLDA perform comparably, slightly better than the standard multi-class SVM (about 0.…). [sent-695, score-0.638]

72 We compare MedLDAr with unsupervised LDA, supervised sLDA, MedLDAr_p—a MedLDA regression model which uses unsupervised LDA as the underlying topic model (please see Appendix B for details)—and the linear SVR that uses the empirical word frequencies as input features. [sent-733, score-0.461]

73 We can see that the supervised MedLDA and sLDA achieve better results than unsupervised LDA, which ignores the supervised responses when discovering topical representations, and than the linear SVR regression model. [sent-743, score-0.366]

74 Indeed, when the number of topics is small, the latent representation of sLDA alone does not result in a highly separable problem; thus, the integration of max-margin training helps in discovering a more discriminative latent representation using the same number of topics. [sent-747, score-0.328]

75 Those terms couple the max-margin estimation and latent topic discovery more tightly. [sent-773, score-0.343]

76 Also, the rich features in reviews can be exploited to discover interesting latent structures with a conditional topic model (Zhu and Xing, 2010). [sent-776, score-0.384]

77 The HTMM is more robust, but its performance is worse than that of the supervised topic models. [sent-790, score-0.312]

78 …(2011), depending on the data and the problem, max-margin supervised topic models can outperform SVM models, or they are comparable if no gains in predictive performance are obtained. [sent-842, score-0.361]

79 A combination (e.g., concatenation, with appropriate re-scaling of the different features) of the discovered latent topical representations and the original input features could potentially improve performance, as demonstrated in Wang and Mori (2011) for image classification. [sent-849, score-0.329]

80 …1 / (1 + exp{−η^⊤ θ_d}). In other words, the class labels are solely influenced by the latent topic representations. [sent-864, score-0.343]
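
A sketch of this label-generating step (all values hypothetical):

    import numpy as np

    rng = np.random.default_rng(3)
    K = 5
    eta = rng.normal(size=K)                  # hypothetical label weights
    theta_d = rng.dirichlet(np.full(K, 0.5))  # a document's topic proportions
    p_label = 1.0 / (1.0 + np.exp(-eta @ theta_d))   # logistic of the topic representation
    y_d = rng.binomial(1, p_label)            # the class label depends only on theta_d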

81 In all three settings, we can see that a naïve combination of the latent topic representations and the input word counts could improve performance in some cases, or at least produce performance comparable to the better of MedLDAc and SVM. [sent-898, score-0.419]

82 Therefore, the posterior inference is slower than that of unsupervised LDA and of MedLDAc, which uses unsupervised LDA as the underlying topic model. [sent-952, score-0.441]

83 …(e.g., a softmax function), the posterior distributions over different topic assignment variables (in the same document) are strongly correlated. [sent-963, score-0.327]

84 Therefore, the inference is (about 10 times) slower than that for unsupervised LDA and for MedLDAc, which takes an unsupervised LDA as the underlying topic model. [sent-964, score-0.405]

85 The SVM classifiers built on the raw input word count features are generally much faster than all the topic models. [sent-982, score-0.327]

86 This is reasonable because SVM classifiers do not spend time on inferring the latent topic representations. [sent-984, score-0.343]

87 However, DiscLDA is an upstream model, for which prediction requires running inference multiple times to find the category-dependent latent topical representations. [sent-992, score-0.359]

88 Therefore, in principle, the testing time of an upstream topic model is about |C| times that of its downstream counterpart, where C is the finite set of categories. [sent-993, score-0.329]

89 Conclusions and Discussions. We have presented maximum entropy discrimination LDA (MedLDA), a supervised topic model that uses the discriminative max-margin principle to estimate model parameters, such as the topic distributions underlying a corpus, and to infer the latent topical vectors of documents. [sent-997, score-0.973]

90 MedLDA integrates the max-margin principle into the process of topic learning and inference by optimizing a single objective function with a set of expected margin constraints. [sent-998, score-0.337]

91 The objective function is a tradeoff between the goodness of fit of an underlying topic model and the prediction accuracy of the resultant topic vectors on a max-margin classifier. [sent-999, score-0.565]

92 We also present a general formulation of learning maximum entropy discrimination topic models, which allows any form of likelihood-based topic model to be discriminatively trained. [sent-1001, score-0.620]

93 MedLDA represents the first step towards integrating the max-margin principle into supervised topic models, and under the general MedTM framework presented in Section 4, several improvements and extensions are on the horizon. [sent-1004, score-0.330]

94 …we have presented the MedLDA regression model that uses supervised sLDA (Blei and McAuliffe, 2007) to discover the latent topic assignments Z and document-level topical representations θ. [sent-1034, score-0.668]

95 A naïve approach to supervised tasks (e.g., regression) is a two-stage procedure: 1) using unsupervised LDA to discover the latent topical representations of documents; and 2) feeding the low-dimensional topical representations into a regression model (e.g., …). [sent-1041, score-0.593]

96 The interplay between topic discovery and supervised prediction will result in more discriminative latent topical representations, as in MedLDAr. [sent-1048, score-0.641]

97 When the underlying topic model is unsupervised LDA, the likelihood is p(W | α, β), the same as in MedLDAc. [sent-1049, score-0.344]

98 Specifically, we assume that q({θ_d, z_d}) = ∏_{d=1}^{D} q(θ_d | γ_d) ∏_{n=1}^{N} q(z_dn | φ_dn), where the variational parameters γ and φ have the same meaning as in MedLDAr. [sent-1058, score-0.396]

99 Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. [sent-1221, score-0.312]

100 MedLDA: Maximum margin supervised topic models for regression and classification. [sent-1331, score-0.362]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('medldac', 0.415), ('lda', 0.386), ('medlda', 0.342), ('slda', 0.295), ('zd', 0.271), ('topic', 0.252), ('medldar', 0.244), ('topical', 0.175), ('yd', 0.168), ('svr', 0.122), ('blei', 0.117), ('disclda', 0.115), ('zdn', 0.094), ('latent', 0.091), ('hmed', 0.09), ('upervised', 0.086), ('variational', 0.085), ('topics', 0.084), ('document', 0.082), ('argin', 0.08), ('documents', 0.079), ('opic', 0.073), ('svm', 0.066), ('aximum', 0.066), ('supervised', 0.06), ('med', 0.051), ('zhu', 0.048), ('dirichlet', 0.048), ('unsupervised', 0.048), ('bs', 0.046), ('hu', 0.045), ('eq', 0.043), ('mcauliffe', 0.043), ('wdn', 0.043), ('odels', 0.043), ('representations', 0.041), ('dn', 0.04), ('inference', 0.039), ('discriminative', 0.039), ('medtm', 0.038), ('discrimination', 0.038), ('eric', 0.037), ('posterior', 0.036), ('ing', 0.035), ('word', 0.035), ('rating', 0.031), ('slack', 0.031), ('response', 0.03), ('upstream', 0.03), ('entropy', 0.03), ('downstream', 0.03), ('margin', 0.028), ('assignments', 0.027), ('predictive', 0.027), ('likelihood', 0.026), ('hotel', 0.026), ('tsinghua', 0.026), ('wd', 0.026), ('jun', 0.025), ('prediction', 0.024), ('fd', 0.024), ('classi', 0.023), ('discovering', 0.023), ('lagrange', 0.023), ('models', 0.022), ('discover', 0.022), ('svmstruct', 0.022), ('discovered', 0.022), ('supervising', 0.021), ('mail', 0.021), ('collapsed', 0.021), ('newsgroups', 0.021), ('kl', 0.021), ('multipliers', 0.021), ('raw', 0.021), ('xing', 0.02), ('multi', 0.02), ('dual', 0.02), ('category', 0.019), ('reviews', 0.019), ('corpus', 0.019), ('built', 0.019), ('resultant', 0.019), ('principle', 0.018), ('joachims', 0.018), ('underlying', 0.018), ('primal', 0.018), ('david', 0.017), ('optimum', 0.017), ('nips', 0.017), ('htmm', 0.017), ('proceddings', 0.017), ('sudderth', 0.017), ('allocation', 0.017), ('grif', 0.017), ('testing', 0.017), ('ths', 0.016), ('writes', 0.016), ('text', 0.016), ('wang', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

2 0.24635428 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. TMBP toolbox is implemented by MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source codes are freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

3 0.091684684 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

4 0.053800628 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

Author: Kian Ming A. Chai

Abstract: Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide data-independent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation

5 0.042957053 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun

Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs' generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and η-prior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification

6 0.042662285 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

7 0.041865967 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

8 0.039848045 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

9 0.039642692 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

10 0.038738079 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

11 0.037024096 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

12 0.036437966 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

13 0.035516616 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

14 0.030315069 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

15 0.029350471 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

16 0.028509738 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

17 0.02782904 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

18 0.026130361 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

19 0.024269536 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models

20 0.023456959 90 jmlr-2012-Pattern for Python


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.125), (1, 0.036), (2, 0.172), (3, -0.029), (4, 0.059), (5, 0.032), (6, 0.145), (7, 0.025), (8, -0.202), (9, 0.052), (10, 0.115), (11, 0.185), (12, 0.139), (13, 0.034), (14, 0.357), (15, -0.227), (16, -0.245), (17, -0.319), (18, 0.221), (19, -0.046), (20, 0.029), (21, -0.041), (22, 0.014), (23, 0.003), (24, 0.075), (25, 0.061), (26, 0.016), (27, 0.014), (28, 0.011), (29, -0.028), (30, 0.009), (31, -0.055), (32, -0.062), (33, -0.058), (34, 0.03), (35, 0.02), (36, -0.023), (37, -0.007), (38, -0.058), (39, 0.0), (40, -0.032), (41, -0.042), (42, 0.106), (43, -0.003), (44, 0.001), (45, -0.05), (46, -0.006), (47, 0.025), (48, -0.024), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95106125 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

2 0.91412205 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. TMBP toolbox is implemented by MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source codes are freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

3 0.33558026 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

4 0.22305626 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

5 0.20775661 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio

Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning

6 0.18815559 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

7 0.18551631 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

8 0.1698489 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

9 0.16773905 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

10 0.14942421 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

11 0.14697321 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

12 0.14577191 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

13 0.14022961 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

14 0.12161193 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

15 0.11366877 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models

16 0.11205189 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

17 0.10899553 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models

18 0.10753128 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

19 0.10572148 101 jmlr-2012-SVDFeature: A Toolkit for Feature-based Collaborative Filtering

20 0.10549203 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.011), (7, 0.022), (14, 0.017), (20, 0.343), (21, 0.027), (26, 0.044), (27, 0.019), (29, 0.031), (35, 0.015), (49, 0.026), (56, 0.018), (57, 0.02), (64, 0.018), (69, 0.019), (75, 0.062), (77, 0.036), (79, 0.01), (81, 0.036), (92, 0.047), (96, 0.111)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71925026 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

2 0.60594964 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir

Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓp norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓp MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓp MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓp MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines

3 0.39636028 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt LORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. LORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning

4 0.39498377 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel

Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. BoostMetric is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

5 0.38795054 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio

Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning

6 0.38565296 54 jmlr-2012-Large-scale Linear Support Vector Regression

7 0.38439316 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

8 0.38169336 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

9 0.38145977 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

10 0.38096374 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

11 0.38051587 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

12 0.37825921 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

13 0.3782185 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

14 0.37742335 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

15 0.37628105 106 jmlr-2012-Sign Language Recognition using Sub-Units

16 0.37524611 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

17 0.37291324 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

18 0.36996913 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

19 0.36868712 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

20 0.36638969 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection