nips nips2008 nips2008-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
Reference: text
sentIndex sentText sentNum sentScore
1 Department of EECS and Statistics, UC Berkeley, Berkeley, CA 94720. Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. [sent-4, score-0.569]
2 These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. [sent-5, score-0.081]
3 In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. [sent-6, score-0.323]
4 Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. [sent-7, score-0.467]
5 By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. [sent-9, score-0.787]
6 We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics. [sent-10, score-0.599]
7 1 Introduction Dimensionality reduction is a common and often necessary step in most machine learning applications and high-dimensional data analyses. [sent-11, score-0.071]
8 A recent trend in dimensionality reduction is to focus on probabilistic models. [sent-13, score-0.151]
9 These models, which include generative topographic mapping, factor analysis, independent component analysis and probabilistic latent semantic analysis (pLSA), are generally specified in terms of an underlying independence assumption or low-rank assumption. [sent-14, score-0.096]
10 Topic models treat each data point (e.g., a document) as a collection of draws from a mixture model in which each mixture component is known as a topic [3]. [sent-18, score-0.333]
11 The mixing proportions across topics are document-specific, and the posterior distribution across these mixing proportions provides a reduced representation of the document. [sent-19, score-0.38]
12 The dimensionality reduction methods that we have discussed thus far are entirely unsupervised. [sent-21, score-0.151]
13 Another branch of research, known as sufficient dimension reduction (SDR), aims at making use of supervisory data in dimension reduction [4, 7]. [sent-22, score-0.173]
14 For example, we may have class labels or regression responses at our disposal. [sent-23, score-0.085]
15 Having reduced dimensionality in this way, one may wish to subsequently build a classifier or regressor in the reduced representation. [sent-25, score-0.128]
16 But there are other goals for the dimension reduction as well, including visualization, domain understanding, and domain transfer. [sent-26, score-0.071]
17 In particular, we wish to incorporate side information such as class labels into LDA, while retaining its favorable unsupervised dimensionality reduction abilities. [sent-30, score-0.335]
18 The goal is to develop parameter estimation procedures that yield LDA topics that characterize the corpus and maximally exploit the predictive power of the side information. [sent-31, score-0.325]
19 As LDA is a parametric generative model, its parameters are typically estimated with maximum likelihood estimation or Bayesian posterior inference. [sent-32, score-0.081]
20 In this paper, we use a discriminative learning criterion—conditional likelihood—to train a variant of the LDA model. [sent-34, score-0.088]
21 Moreover, we augment the LDA parameterization by introducing class-label-dependent auxiliary parameters that can be tuned by the discriminative criterion. [sent-35, score-0.127]
22 By retaining the original LDA parameters and introducing these auxiliary parameters, we are able to retain the advantages of the likelihood-based training procedure and provide additional freedom for tracking the side information. [sent-36, score-0.115]
23 In Section 4, we report empirical results on applying DiscLDA to model text documents. [sent-40, score-0.062]
24 2 Model We start by reviewing the LDA model [3] for topic modeling. [sent-42, score-0.267]
25 1 LDA The LDA model is a generative process where each document in the text corpus is modeled as a set of draws from a mixture distribution over a set of hidden topics. [sent-46, score-0.288]
26 A topic is modeled as a probability distribution over words. [sent-47, score-0.24]
27 Let the vector wd be the bag-of-words representation of document d. [sent-48, score-0.193]
28 In both the maximum likelihood and Bayesian framework it is necessary to integrate over θ d to obtain the marginal likelihood, and this is accomplished either using variational inference or Gibbs sampling [3, 8]. [sent-56, score-0.067]
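To make the integration step concrete, here is a minimal collapsed Gibbs sampler for standard LDA. This is only an illustrative sketch, not the samplers of [3, 8]; the function name, hyperparameter values, and data layout (a list of word-id lists) are our own assumptions.

```python
# Minimal collapsed Gibbs sampler for LDA (illustrative sketch only).
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids in {0,...,V-1}. Returns (theta, phi) estimates."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):  # initialize counts from the random assignment
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | rest) is proportional to (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```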
29 2 DiscLDA In our setting, each document is additionally associated with a categorical variable or class label yd ∈ {1, 2, . . . , C}. [sent-58, score-0.305]
30 Specifically, for each class label y, we introduce a linear transformation T^y : ℜ^K → ℜ^L, which transforms the K-dimensional Dirichlet variable θd into an L-dimensional vector of topic proportions T^y θd. (Figure 1: LDA model; graphical-model nodes α, π, yd, θd, zdn, wdn, β, Φ, T.) [sent-67, score-1.273]
31 To generate a word wdn , we draw its topic zdn from T yd θ d . [sent-71, score-0.755]
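The generative process just described can be sketched directly; this is our own code, not the authors' implementation, and the names, shapes (T^y of size L x K, Phi of size V x L with column-stochastic entries), and the rng convention are assumptions.

```python
# Illustrative sketch of the DiscLDA generative process for a single document.
import numpy as np

def generate_document(pi, alpha, T, Phi, n_words, rng=None):
    """pi: class prior (length C); alpha: Dirichlet parameter (length K);
    T: list of C arrays of shape (L, K) with column-stochastic columns;
    Phi: array of shape (V, L) whose columns are word distributions."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = rng.choice(len(pi), p=pi)                 # class label y_d ~ pi
    theta = rng.dirichlet(alpha)                  # theta_d ~ Dir(alpha), K-dimensional
    u = T[y] @ theta                              # transformed proportions T^y theta_d (L-dim)
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(len(u), p=u)               # z_dn ~ Mult(T^y theta_d)
        w = rng.choice(Phi.shape[0], p=Phi[:, z]) # w_dn ~ Mult(phi_z)
        topics.append(z); words.append(w)
    return y, theta, topics, words
```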
32 Note that these points cannot be placed arbitrarily, as all documents, whether or not they have the same class labels, share the parameter Φ ∈ ℜ^{V×L}. [sent-74, score-0.085]
33 The graphical model in Figure 2 shows the new generative process. [sent-75, score-0.065]
34 Compared to standard LDA, we have added the nodes for the variable yd (and its prior distribution π), the transformation matrices T y and the corresponding edges. [sent-76, score-0.223]
35 An alternative to DiscLDA would be a model in which there are class-dependent topic parameters φ^y_k which determine the conditional distribution of the words: w_dn | z_dn, y_d, Φ ∼ Multi(φ^{y_d}_{z_dn}). [sent-77, score-0.768]
36 The problem with this approach is that the posterior p(y|w, Φ) is a highly non-convex function of Φ, which makes its optimization very challenging given the high dimensionality of the parameter space in typical applications. [sent-78, score-0.08]
37 Our approach circumvents this difficulty by learning a low-dimensional transformation of the φk ’s in a discriminative manner instead. [sent-79, score-0.194]
38 Indeed, transforming the topic mixture vector θ is actually equivalent to transforming the Φ matrix. [sent-80, score-0.273]
39 To see this, note that by marginalizing out the hidden topic vector z, we get the following distribution for the word wdn given θ: w_dn | y_d, θ_d, T ∼ Mult(Φ T^{y_d} θ_d). [sent-81, score-0.794]
40 By the associativity of the matrix product, we see that we obtain an equivalent probabilistic model by applying the linear transformation to Φ instead, in effect defining the class-dependent topic parameters as follows: φ^y_k = Σ_l φ_l T^y_{lk}. [sent-82, score-0.405]
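A small numerical check of this associativity argument, included as a sketch; the dimensions (V = 6, L = 4, K = 3) and column-stochastic random matrices are arbitrary choices of ours.

```python
# Transforming theta by T^y and then applying Phi gives the same word distribution
# as applying the transformed matrix Phi T^y to theta directly.
import numpy as np

rng = np.random.default_rng(0)
V, L, K = 6, 4, 3
Phi = rng.random((V, L)); Phi /= Phi.sum(axis=0, keepdims=True)   # word | topic distributions
T_y = rng.random((L, K)); T_y /= T_y.sum(axis=0, keepdims=True)   # column-stochastic T^y
theta = rng.dirichlet(np.ones(K))

p1 = Phi @ (T_y @ theta)   # transform theta, then map to words
Phi_y = Phi @ T_y          # class-dependent topics phi^y_k = sum_l phi_l T^y_lk
p2 = Phi_y @ theta         # equivalent model with transformed Phi
assert np.allclose(p1, p2)
```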
41 k l Another motivation for our approach is that it gives the model the ability to distinguish topics which are shared across different classes versus topics which are class-specific. [sent-83, score-0.516]
42 In this case, the last K topics are shared by both classes, whereas the first two groups of K topics are exclusive to one class or the other. [sent-85, score-0.531]
43 Note that we can give a generative interpretation to the transformation by augmenting the model with a hidden topic vector variable u, as shown in Fig. [sent-87, score-0.411]
44 By including a Dirichlet prior on the T parameters, the DiscLDA model can be related to the author-topic model [10], if we restrict to the special case in which there is only one author per document. [sent-90, score-0.087]
45 In the author-topic model, the bag-of-words representation of a document is augmented by a list of the authors of the document. [sent-91, score-0.146]
46 To generate a word in a document, one first picks at random the author associated with this document. [sent-92, score-0.08]
47 Given the author (y in our notation), a topic is chosen according to corpus-wide author-specific topic-mixture proportions (which is a column vector T y in our notation). [sent-93, score-0.315]
48 The word is then generated from the corresponding topic distribution as usual. [sent-94, score-0.287]
49 According to this analogy, we see that our model not only enables us to predict the author of a document (assuming a small set of possible authors), but also captures the content of documents (using θ) as well as the corpus-wide class properties (using T). [sent-95, score-0.376]
50 The focus of the author-topic model was to model the interests of authors, not the content of documents, explaining why there was no need to add document-specific topic-mixture proportions. [sent-96, score-0.099]
51 Because we want to predict the class for a specific document, it is crucial that we also model the content of a document. [sent-97, score-0.114]
52 Recently, there has been growing interest in topic modeling with supervised information. [sent-98, score-0.279]
53 Blei and McAuliffe [2] proposed a supervised LDA model where the empirical topic vector z (sampled from θ) is used as a covariate for a regression on y (see also [6]). [sent-99, score-0.306]
54 Mimno and McCallum [9] proposed a Dirichlet-multinomial regression which can handle various types of side information, including the case in which this side information is an indicator variable of the class (y) (see footnote 1). [sent-100, score-0.134]
55 Our work differs from theirs, however, in that we train the transformation parameter by maximum conditional likelihood instead of a generative criterion. [sent-101, score-0.22]
56 3 Inference and learning Given a corpus of documents and their labels, we estimate the parameters {T^y} by maximizing the conditional likelihood Σ_d log p(y_d | w_d; {T^y}, Φ) while holding Φ fixed. [sent-102, score-0.281]
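As a sketch of how this objective and the resulting classifier fit together, the conditional likelihood can be written in terms of per-class marginal document likelihoods p(w_d | y; T^y, Φ). The helper `log_marginal_likelihood` below is a hypothetical placeholder (in practice it must itself be approximated, e.g., by sampling), and the function names are our own.

```python
import numpy as np
from scipy.special import logsumexp

def conditional_log_likelihood(docs, labels, T, Phi, log_prior, log_marginal_likelihood):
    """Sum_d log p(y_d | w_d; {T^y}, Phi).
    log_marginal_likelihood(doc, T_y, Phi) is a user-supplied (approximate) estimator
    of log p(w_d | y; T^y, Phi); it is a hypothetical placeholder here."""
    total = 0.0
    for doc, y in zip(docs, labels):
        # unnormalized log posterior over classes for this document
        scores = np.array([log_prior[c] + log_marginal_likelihood(doc, T[c], Phi)
                           for c in range(len(T))])
        total += scores[y] - logsumexp(scores)
    return total
```

The same per-class scores also give the MAP prediction arg max_y p(y|w) that is used as a classifier later in the experiments.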
57 To estimate the parameters Φ, we hold the transformation matrices fixed and maximize the posterior of the model, in much the same way as in standard LDA models. [sent-103, score-0.106]
58 We maximize the conditional likelihood objective with respect to T by using gradient ascent, for a fixed Φ. [sent-108, score-0.076]
59 1 Dimensionality reduction We can obtain a supervised dimensionality reduction method by using the average transformed topic vector as the reduced representation of a test document. [sent-114, score-0.588]
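A sketch of how this reduced representation might be assembled from per-class quantities; `posterior_over_classes` and `expected_theta` are hypothetical stand-ins for the approximate inference routines of Section 3, not functions from the paper.

```python
import numpy as np

def reduced_representation(doc, T, Phi, posterior_over_classes, expected_theta):
    """Average transformed topic vector E[T^y theta | w], with y marginalized out.
    posterior_over_classes(doc) approximates p(y | w);
    expected_theta(doc, T_y, Phi) approximates E[theta | w, y].
    Both are hypothetical placeholders (e.g., backed by Gibbs sampling)."""
    p_y = posterior_over_classes(doc)        # length-C vector summing to 1
    r = np.zeros(T[0].shape[0])              # L-dimensional representation
    for c, T_c in enumerate(T):
        r += p_y[c] * (T_c @ expected_theta(doc, T_c, Phi))
    return r
```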
60 The first term on the right-hand side of this equation can ... (Footnote 1: In this case, their model is actually the same as Model 1 in [5] with an additional prior on the class-dependent parameters for the Dirichlet distribution on the topics.) [sent-116, score-0.073]
61 Figure 5: t-SNE 2D embedding of the E [θ|Φ, w, T ] representation of Newsgroups documents, after fitting to the standard unsupervised LDA model. [sent-120, score-0.141]
62 Our experiments aimed to demonstrate the benefits of discriminative training of LDA for discovering a compact latent representation that contains both predictive and shared components across different types of data. [sent-124, score-0.315]
63 1 Text modeling The 20 Newsgroups dataset contains postings to Usenet newsgroups. [sent-127, score-0.078]
64 The postings are organized by content into 20 related categories and are therefore well suited for topic modeling. [sent-128, score-0.363]
65 We fit the dataset to both a standard 110-topic LDA model and a DiscLDA model with restricted forms of the transformation matrices {T^y}, y = 1, . . . , 20. [sent-130, score-0.16]
66 Specifically, the transformation matrix T^y for class label y is fixed and given by the blocked matrix in equation (1), composed of zero blocks and two identity blocks. [sent-131, score-0.245]
67 In the first block column, at block row y, T^y contains an identity matrix of dimensionality K0 × K0. [sent-146, score-0.168]
68 The last block of T^y, in the second block column, is another identity matrix of dimensionality K1 × K1. [sent-147, score-0.112]
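The restricted transformation just described can be written down directly; the sketch below is our own code, with C classes, K0 class-specific topics per class, and K1 shared topics assumed (the exact values of K0 and K1 are not stated in this listing), and classes indexed from 0.

```python
import numpy as np

def restricted_T(y, C, K0, K1):
    """Blocked T^y mapping a (K0 + K1)-dim theta to a (C*K0 + K1)-dim u:
    the first K0 coordinates of theta land in block row y (class-specific topics),
    the last K1 coordinates land in the final, shared block. y is 0-indexed."""
    L, K = C * K0 + K1, K0 + K1
    T = np.zeros((L, K))
    T[y * K0:(y + 1) * K0, :K0] = np.eye(K0)   # identity at block row y, first block column
    T[C * K0:, K0:] = np.eye(K1)               # identity in the last (shared) block
    return T
```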
69 Intuitively, the shared components should use all class labels to model common latent structures, while nonoverlapping components should model specific characteristics of data from each class. [sent-149, score-0.308]
70 We then estimated a new representation for test documents by taking the conditional expectation of T y θ with y marginalized out as explained in Section 3. [sent-187, score-0.182]
71 To obtain an embedding, we first tried standard multidimensional scaling (MDS), using the symmetrical KL divergence between pairs of θtr topic vectors as a dissimilarity metric, but the results were hard to visualize. [sent-190, score-0.24]
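A sketch of the MDS step with a symmetrized KL dissimilarity; scikit-learn's MDS is used here only as an illustrative stand-in for the procedure described above, and the function names are our own.

```python
import numpy as np
from sklearn.manifold import MDS

def sym_kl_matrix(thetas, eps=1e-12):
    """Pairwise symmetric KL divergence between rows of a (D x L) matrix of topic proportions."""
    P = np.clip(thetas, eps, None)
    P = P / P.sum(axis=1, keepdims=True)
    logP = np.log(P)
    # KL(p_i || p_j) = sum_k p_ik * (log p_ik - log p_jk)
    kl = (P * logP).sum(axis=1)[:, None] - P @ logP.T
    return kl + kl.T                            # symmetrize: KL(i||j) + KL(j||i)

def embed_2d(thetas, seed=0):
    D = sym_kl_matrix(thetas)
    return MDS(n_components=2, dissimilarity="precomputed",
               random_state=seed).fit_transform(D)
```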
72 A more interpretable embedding was obtained using a modified version of the t-SNE stochastic neighbor embedding method presented by van der Maaten and Hinton [11]. [sent-191, score-0.11]
73 Figure 4 shows a scatter plot of the 2D embedding of the topic representation of the 20 Newsgroups test documents, where the colors of the dots, each corresponding to a document, encode class labels. [sent-193, score-0.315]
74 Clearly, the documents are well separated in this space. [sent-194, score-0.116]
75 It is also instructive to examine in detail the topic structures of the fitted DiscLDA model. [sent-198, score-0.24]
76 Given the specific setup of our transformation matrix T , each component of the topic vector u is either associated with a class label or shared across all class labels. [sent-199, score-0.606]
77 For each component, we can compute the most probable words associated with it from the word-topic distribution Φ. [sent-200, score-0.068]
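Reading off these words is a simple argsort over the columns of Φ; the sketch below (our own code, with a `vocab` list assumed) mirrors how a table of top words per topic could be produced.

```python
import numpy as np

def top_words(Phi, vocab, n=10):
    """Return the n most probable words for each topic, i.e., each column of the (V x L) matrix Phi."""
    order = np.argsort(-Phi, axis=0)[:n]   # row indices of the n largest entries in each column
    return [[vocab[i] for i in order[:, k]] for k in range(Phi.shape[1])]
```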
78 In Table 1, we list these words, grouped under each class label and a special "shared" bucket. [sent-201, score-0.157]
79 We see that the words are highly indicative of their associated class labels. [sent-202, score-0.083]
80 Additionally, the words in the "shared" category are "neutral," neither positively nor negatively suggesting the class labels under which they are likely to appear. (Table 2: Binary classification error rates for two newsgroups. LDA+SVM: 20%; DiscLDA+SVM: 17%; DiscLDA alone: 17%.) [sent-203, score-0.237]
81 2 Document classification It is also of interest to consider the classification problem more directly and ask whether the features delivered by DiscLDA are more useful for classification than those delivered by LDA. [sent-207, score-0.068]
82 Of course, we can also use DiscLDA as a classification method per se, by marginalizing over the latent variables and computing the probability of the label y given the words in a test document. [sent-208, score-0.132]
83 Specifically, we constructed multiclass linear SVM classifiers using the expected topic proportion vectors from the unsupervised LDA and DiscLDA models as features, as described in Section 3. [sent-213, score-0.293]
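A minimal sketch of this comparison, assuming feature matrices for the two models have already been produced by the respective inference procedures; scikit-learn's linear SVM is used here only for illustration.

```python
# Multiclass linear SVM trained on expected topic-proportion features.
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def svm_error_rate(X_train, y_train, X_test, y_test, C=1.0):
    """Fit a linear SVM on topic-vector features and return the test error rate."""
    clf = LinearSVC(C=C).fit(X_train, y_train)
    return 1.0 - accuracy_score(y_test, clf.predict(X_test))

# Usage sketch: svm_error_rate(X_lda_train, y_train, X_lda_test, y_test) versus
# svm_error_rate(X_disclda_train, y_train, X_disclda_test, y_test).
```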
84 Using the topic vectors from standard LDA the error rate of classification was 25%. [sent-216, score-0.24]
85 When the topic vectors from the DiscLDA model were used we obtained an error rate of 20%. [sent-217, score-0.267]
86 We also computed the MAP estimate of the class label y* = arg max_y p(y|w) from DiscLDA and used this estimate directly as a classifier. [sent-219, score-0.075]
87 In a second experiment, we considered the fully adaptive setting in which the transformation matrix T y is learned in a discriminative fashion as described in Section 3. [sent-221, score-0.252]
88 We initialized the matrix T to a smoothed block diagonal matrix having a pattern similar to (1), with 20 shared topics and 20 class-dependent topics per class. [sent-222, score-0.577]
89 This was followed by the discriminative learning process in which we iteratively ran batch gradient updates (in the log domain, so that T remained normalized) using Monte Carlo EM with a constant step size for 10 epochs. [sent-224, score-0.088]
90 This discriminative learning process was repeated until there was no improvement on a validation data set. [sent-226, score-0.088]
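One possible way to implement the log-domain update mentioned above is sketched below; the exact parameterization used by the authors is not specified in this listing, the gradient estimate (e.g., from Monte Carlo EM) is taken as given, and all names are our own.

```python
# One log-domain ascent step on T, keeping each column of every T^y normalized
# so that columns remain points on the simplex.
import numpy as np

def log_domain_step(T, grad_T, step_size):
    """T, grad_T: lists of (L x K) arrays, one per class. Returns the updated, renormalized T."""
    new_T = []
    for T_y, g in zip(T, grad_T):
        logT = np.log(np.clip(T_y, 1e-12, None)) + step_size * g   # ascent step applied to log T
        T_y_new = np.exp(logT - logT.max(axis=0, keepdims=True))   # stabilize before normalizing
        new_T.append(T_y_new / T_y_new.sum(axis=0, keepdims=True)) # each column sums to 1
    return new_T
```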
91 In this experiment, we considered the binary classification problem of distinguishing postings of the newsgroup alt.atheism from postings of a second, religion-related newsgroup. [sent-228, score-0.125]
92 Table 2 summarizes the results of our experiment, where we have used topic vectors from unsupervised LDA and DiscLDA as input features to binary linear SVM classifiers. [sent-232, score-0.293]
93 We also computed the prediction of the label of a document directly with DiscLDA. [sent-233, score-0.146]
94 As shown in the table, the DiscLDA model clearly generates topic vectors with better predictive power than unsupervised LDA. [sent-234, score-0.368]
95 In Table 3 we present the ten most probable words for a subset of topics learned using the discriminative DiscLDA approach. [sent-235, score-0.344]
96 In particular, although we started with 20 shared topics, the learned T had only 12 shared topics. [sent-237, score-0.344]
97 We have grouped the topics in Table 3 according to whether they were class-specific or shared, uncovering an interesting latent structure which appears more discriminating than the topics presented in Table 1. [sent-238, score-0.436]
98 5 Discussion We have presented DiscLDA, a variation on LDA in which the LDA parametrization is augmented to include a transformation matrix and in which this matrix is learned via a conditional likelihood criterion. [sent-239, score-0.272]
99 The approach aims not only to find low-dimensional representations of documents, but also to make use of discriminative side information (labels) in forming these representations. [sent-244, score-0.134]
100 Given the high dimensionality of such models, it may be intractable to train all of the parameters via a discriminative criterion such as conditional likelihood. [sent-247, score-0.201]
wordName wordTfidf (topN-words)
[('disclda', 0.645), ('lda', 0.331), ('topic', 0.24), ('wdn', 0.195), ('topics', 0.189), ('zdn', 0.156), ('yd', 0.117), ('documents', 0.116), ('document', 0.113), ('shared', 0.111), ('newsgroups', 0.111), ('transformation', 0.106), ('god', 0.098), ('discriminative', 0.088), ('dimensionality', 0.08), ('bible', 0.078), ('jesus', 0.078), ('postings', 0.078), ('reduction', 0.071), ('dirichlet', 0.066), ('christ', 0.059), ('religion', 0.059), ('latent', 0.058), ('embedding', 0.055), ('unsupervised', 0.053), ('word', 0.047), ('classi', 0.047), ('wd', 0.047), ('multi', 0.047), ('newsgroup', 0.047), ('side', 0.046), ('government', 0.046), ('gibbs', 0.045), ('content', 0.045), ('labels', 0.043), ('likelihood', 0.043), ('class', 0.042), ('corpus', 0.042), ('proportions', 0.042), ('words', 0.041), ('atheism', 0.039), ('atheists', 0.039), ('christians', 0.039), ('dos', 0.039), ('morality', 0.039), ('scsi', 0.039), ('supervised', 0.039), ('ca', 0.039), ('auxiliary', 0.039), ('generative', 0.038), ('ik', 0.036), ('text', 0.035), ('plsa', 0.034), ('maaten', 0.034), ('church', 0.034), ('delivered', 0.034), ('fda', 0.034), ('sdr', 0.034), ('season', 0.034), ('conditional', 0.033), ('author', 0.033), ('drive', 0.033), ('representation', 0.033), ('label', 0.033), ('mixture', 0.033), ('le', 0.033), ('matrix', 0.032), ('men', 0.031), ('mimno', 0.031), ('bucket', 0.031), ('mail', 0.031), ('supervisory', 0.031), ('transformed', 0.03), ('retain', 0.03), ('moral', 0.029), ('team', 0.029), ('svm', 0.028), ('win', 0.028), ('players', 0.028), ('berkeley', 0.027), ('blei', 0.027), ('popular', 0.027), ('model', 0.027), ('card', 0.026), ('christian', 0.026), ('thing', 0.026), ('dir', 0.026), ('learned', 0.026), ('experiment', 0.026), ('graphics', 0.025), ('berg', 0.025), ('predictive', 0.025), ('mixing', 0.025), ('reduced', 0.024), ('block', 0.024), ('health', 0.024), ('accomplished', 0.024), ('mb', 0.024), ('people', 0.023), ('power', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
2 0.22301522 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.17890713 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
4 0.16581002 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
5 0.15024893 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
6 0.14937286 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
7 0.14267674 28 nips-2008-Asynchronous Distributed Learning of Topic Models
8 0.13728862 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
9 0.12008278 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
10 0.11855578 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
11 0.098778695 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
12 0.088609137 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
13 0.08690919 62 nips-2008-Differentiable Sparse Coding
14 0.082076497 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
15 0.075695619 194 nips-2008-Regularized Learning with Networks of Features
16 0.073330082 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
17 0.071938202 226 nips-2008-Supervised Dictionary Learning
18 0.067069978 113 nips-2008-Kernelized Sorting
19 0.063835032 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
20 0.061110899 61 nips-2008-Diffeomorphic Dimensionality Reduction
topicId topicWeight
[(0, -0.194), (1, -0.144), (2, 0.081), (3, -0.168), (4, -0.056), (5, -0.012), (6, 0.115), (7, 0.184), (8, -0.253), (9, 0.025), (10, -0.01), (11, -0.088), (12, -0.062), (13, 0.099), (14, -0.029), (15, -0.019), (16, 0.098), (17, -0.011), (18, -0.088), (19, -0.082), (20, -0.015), (21, -0.051), (22, -0.004), (23, 0.057), (24, 0.067), (25, -0.01), (26, -0.043), (27, 0.078), (28, 0.007), (29, -0.076), (30, 0.032), (31, 0.055), (32, 0.108), (33, -0.06), (34, -0.047), (35, 0.011), (36, -0.054), (37, 0.067), (38, 0.015), (39, -0.014), (40, 0.044), (41, -0.116), (42, 0.073), (43, -0.041), (44, 0.038), (45, -0.027), (46, 0.055), (47, -0.019), (48, 0.024), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.94148612 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
2 0.87420368 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
3 0.84602547 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
4 0.83755219 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
5 0.759161 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
6 0.7241317 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
7 0.6036973 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
8 0.54877073 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
9 0.44591936 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
10 0.4251461 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
11 0.41591865 4 nips-2008-A Scalable Hierarchical Distributed Language Model
12 0.40014797 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
13 0.39148822 134 nips-2008-Mixed Membership Stochastic Blockmodels
14 0.37636021 61 nips-2008-Diffeomorphic Dimensionality Reduction
15 0.36114758 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
16 0.35849139 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
17 0.3542673 31 nips-2008-Bayesian Exponential Family PCA
18 0.35388044 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
19 0.3214795 126 nips-2008-Localized Sliced Inverse Regression
20 0.28566846 216 nips-2008-Sparse probabilistic projections
topicId topicWeight
[(6, 0.055), (7, 0.091), (8, 0.242), (12, 0.041), (28, 0.141), (57, 0.09), (59, 0.012), (63, 0.031), (71, 0.016), (77, 0.047), (78, 0.018), (83, 0.098)]
simIndex simValue paperId paperTitle
same-paper 1 0.79759222 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
2 0.72067785 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
Author: Yuhong Guo
Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1
3 0.66446543 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
4 0.6640799 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.66006869 95 nips-2008-Grouping Contours Via a Related Image
Author: Praveen Srinivasan, Liming Wang, Jianbo Shi
Abstract: Contours have been established in the biological and computer vision literature as a compact yet descriptive representation of object shape. While individual contours provide structure, they lack the large spatial support of region segments (which lack internal structure). We present a method for further grouping of contours in an image using their relationship to the contours of a second, related image. Stereo, motion, and similarity all provide cues that can aid this task; contours that have similar transformations relating them to their matching contours in the second image likely belong to a single group. To find matches for contours, we rely only on shape, which applies directly to all three modalities without modification, in contrast to the specialized approaches developed for each independently. Visually salient contours are extracted in each image, along with a set of candidate transformations for aligning subsets of them. For each transformation, groups of contours with matching shape across the two images are identified to provide a context for evaluating matches of individual contour points across the images. The resulting contexts of contours are used to perform a final grouping on contours in the original image while simultaneously finding matches in the related image, again by shape matching. We demonstrate grouping results on image pairs consisting of stereo, motion, and similar images. Our method also produces qualitatively better results against a baseline method that does not use the inferred contexts. 1
6 0.65781122 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
7 0.65687674 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
8 0.65681362 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.65303135 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
10 0.65266454 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
11 0.65223837 62 nips-2008-Differentiable Sparse Coding
12 0.65174901 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
13 0.65114701 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
14 0.64852071 200 nips-2008-Robust Kernel Principal Component Analysis
15 0.64812934 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
16 0.64753664 32 nips-2008-Bayesian Kernel Shaping for Learning Control
17 0.64740139 248 nips-2008-Using matrices to model symbolic relationship
18 0.64375043 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
19 0.64216036 66 nips-2008-Dynamic visual attention: searching for coding length increments
20 0.64210975 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation