nips nips2008 nips2008-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. [sent-5, score-0.204]
2 We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. [sent-7, score-0.32]
3 In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. [sent-8, score-0.341]
4 The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. [sent-9, score-0.72]
5 Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. [sent-11, score-0.365]
6 Similarly, in the context of document categorization it seems more severe to misclassify a medical journal on heart attack as a publication on athlete’s foot than on Coronary artery disease. [sent-15, score-0.479]
7 Although the scope of the proposed method is by no means limited to text data and topic hierarchies, for improved clarity we will restrict ourselves to terminology from document categorization throughout this paper. [sent-16, score-0.651]
8 The most common approach to document categorization is to reduce the problem to a “flat” classification problem [13]. [sent-17, score-0.333]
9 However, it is often the case that the topics are not just discrete classes, but are nodes in a complex taxonomy with rich inter-topic relationships. [sent-18, score-0.395]
10 For example, web pages can be categorized into the Yahoo! web taxonomy or medical journals can be categorized into the Medical Subject Headings (MeSH) taxonomy. [sent-20, score-0.442]
11 Moving beyond flat classification to settings that utilize these hierarchical representations of topics has significantly pushed the state of the art [4, 15]. [sent-21, score-0.136]
12 Additional information about inter-topic relationships can for example be leveraged through cost-sensitive decision boundaries or knowledge sharing between documents from closely related classes. [sent-22, score-0.15]
13 In reality, however, the topic taxonomy is a crude approximation of topic relations, created by an editor with knowledge of the true underlying semantic space of topics. [sent-23, score-1.083]
14 In this paper we propose a method that moves beyond hierarchical representations and aims to re-discover the continuous latent semantic space underlying the topic taxonomy. [sent-24, score-0.607]
15 Instead of regarding document categorization as classification, we will think of it as a regression problem where new documents are mapped into this latent semantic topic space. [sent-25, score-1.015]
16 Very different from approaches like LSI or LDA [1, 7], our algorithm is entirely supervised and explicitly embeds the topic taxonomy and the documents into a single latent semantic space with “semantically meaningful” Euclidean distances. [sent-26, score-1.009]
17 Figure 1: A schematic layout of our taxem method (for Taxonomy Embedding), showing the topic taxonomy T, the high dimensional input space X with original inputs xi, the mappings W and P, and the low dimensional semantic space F with class prototypes pα (e.g. pneumonia, stroke, arthritis, heart attack) and embedded inputs Wxi. [sent-27, score-1.585]
18 The classes are embedded as prototypes inside the semantic space. [sent-28, score-0.666]
19 The input documents are mapped into the same space, placed closest to their topic prototypes. [sent-29, score-0.512]
20 In this paper we derive a method to embed the taxonomy of topics into a latent semantic space in the form of topic prototypes. [sent-30, score-0.942]
21 A new document can be classified by first mapping it into this space and then assigning the label of the closest prototype. [sent-31, score-0.32]
22 A key contribution of our paper is the derivation of a convex problem that learns the regressor for the documents and the placement of the prototypes in a single optimization. [sent-32, score-0.681]
23 In particular, it places the topic prototypes such that for each document the prototype of the correct topic is much closer than any other prototype by a large margin. [sent-33, score-1.848]
24 Our paper is structured as follows: In section 2 we introduce necessary notation and a first version of the algorithm based on a two-step approach of first embedding the hierarchical taxonomy into a semantic space and then regressing the input documents close to their respective topic prototypes. [sent-35, score-1.09]
25 In section 3 we extend our model to a single optimization that learns both steps in one convex optimization with large margin constraints. [sent-36, score-0.258]
26 We evaluate our method in section 4 and demonstrate state-of-the-art results on eight different document categorization tasks from the OHSUMED medical journal data set. [sent-37, score-0.411]
27 In addition, the documents are accompanied by single topic labels y1, ..., yn ∈ {1, ..., c} that lie in some taxonomy T with c total topics. [sent-41, score-0.736]
28 This taxonomy T gives rise to some cost matrix C ∈ Rc×c, where Cαβ ≥ 0 defines the cost of misclassifying an element of topic α as β, and Cαα = 0. [sent-42, score-0.774]
29 Technically, we only require knowledge of the cost matrix C, which could also be obtained from side-information independent of a topic taxonomy. [sent-43, score-0.385]
30 However, we would like to point out that a common way to infer a cost matrix from a taxonomy is to set Cαβ to the length of the shortest path between nodes α and β, but other approaches have also been studied [3]. [sent-45, score-0.423]
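To make the shortest-path construction concrete, here is a minimal sketch (not from the paper) that builds C for a small hypothetical taxonomy by breadth-first search over the undirected tree; the function name and the toy edge list are illustrative assumptions.

```python
# A minimal sketch (not the authors' code): derive the cost matrix C from a
# topic taxonomy by setting C[a, b] to the shortest-path length between nodes.
from collections import deque
import numpy as np

def cost_matrix_from_taxonomy(num_topics, edges):
    """edges: list of (parent, child) pairs over topic ids 0..num_topics-1."""
    adj = [[] for _ in range(num_topics)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)                      # treat the taxonomy as undirected
    C = np.zeros((num_topics, num_topics))
    for start in range(num_topics):
        dist = [-1] * num_topics
        dist[start] = 0
        queue = deque([start])
        while queue:                          # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        C[start] = dist
    return C                                  # C[a, a] == 0, C[a, b] = path length

# toy taxonomy: root 0 with children 1 and 2; node 2 has children 3 and 4
C = cost_matrix_from_taxonomy(5, [(0, 1), (0, 2), (2, 3), (2, 4)])
```

With unit edge weights this yields integer costs with zero diagonal, matching the definition of the cost matrix above.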
31 Throughout this paper we denote document indices as i, j ∈ {1, ..., n} and topic indices as α, β ∈ {1, ..., c}. [sent-46, score-0.491]
32 We would like to create a low dimensional semantic feature space F in which we represent each topic α as a topic prototype pα ∈ F and each document xi ∈ X as a low dimensional vector zi ∈ F. [sent-53, score-1.676]
33 Our goal is to discover a representation of the data where distances reflect true underlying dissimilarities and proximity to prototypes indicates topic membership. [sent-54, score-0.774]
34 In other words, documents on the same or related topics should be close to the respective topic prototypes, while documents on highly different topics should be well separated. [sent-55, score-0.741]
35 As an essential part of our method is to embed the classes that are typically found in a taxonomy, we refer to our algorithm as taxem (short for “taxonomy embedding”). [sent-57, score-0.252]
36 Embedding topic prototypes. The first step of our algorithm is to embed the document taxonomy into a Euclidean vector space. [sent-58, score-1.29]
37 More precisely, we want to derive prototypes p1, ..., pc ∈ F based on the cost matrix C, where pα is the prototype that represents topic α. [sent-62, score-0.695]
38 We collect them in a matrix P = [p1, ..., pc] ∈ Rc×c whose columns consist of the topic prototypes. [sent-66, score-0.312]
39 There are many ways to derive the prototypes from the cost matrix C. [sent-67, score-0.556]
40 Better results can be expected when the prototypes of similar topics are closer than those of dissimilar topics. [sent-71, score-0.565]
41 We use the cost matrix C as an estimate of dissimilarity and aim to place the prototypes such that the squared distance ‖pα − pβ‖² reflects the cost specified in Cαβ. [sent-72, score-0.622]
42 This amounts to minimizing Σα,β=1..c (‖pα − pβ‖² − Cαβ)² (1). If the cost matrix C defines squared Euclidean distances, the minimizer of (1) can be computed exactly with metric multidimensional scaling (mds), yielding the prototype matrix Pmds. [sent-74, score-0.127]
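The mds placement can be sketched in a few lines of numpy under the assumption that C holds squared pairwise distances; double-centering recovers a Gram matrix whose factorization gives the prototype coordinates. The helper name mds_prototypes is illustrative, not the authors' code.

```python
# A sketch of placing prototypes by classical multidimensional scaling (mds),
# assuming the cost matrix C contains squared pairwise distances.
import numpy as np

def mds_prototypes(C, dim=None):
    c = C.shape[0]
    J = np.eye(c) - np.ones((c, c)) / c        # centering matrix
    B = -0.5 * J @ C @ J                       # Gram matrix of the prototypes
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]            # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    if dim is None:
        dim = c
    evals = np.clip(evals[:dim], 0.0, None)    # drop tiny negative eigenvalues
    # columns of P_mds are the prototype vectors p_alpha
    P_mds = np.sqrt(evals)[:, None] * evecs[:, :dim].T
    return P_mds
```

When C is exactly a squared Euclidean distance matrix, the columns pα of the returned matrix satisfy ‖pα − pβ‖² = Cαβ.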
43 Both prototype embeddings PI and Pmds are still independent of the input data {xi}. [sent-81, score-0.458]
44 Before we can derive a more sophisticated method to place the prototypes with large margin constraints on the document vectors, we will briefly describe the mapping W : X → F of the input documents into the low dimensional feature space F. [sent-82, score-1.133]
45 We need to find an appropriate mapping W : X → F that maps each input xi with label yi as close as possible to its topic prototype pyi. [sent-84, score-0.837]
46 We can find such a linear transformation zi = Wxi by setting W = argminW Σi ‖pyi − Wxi‖² + λ‖W‖²F. [sent-85, score-0.156]
47 Inference. Given an input vector xt, we first map it into F and estimate its label as the topic with the closest prototype: ŷt = argminα ‖pα − Wxt‖². [sent-94, score-0.622]
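As a hedged illustration of this two-step pipeline (regress documents onto their topic prototypes, then classify by the nearest prototype), the numpy sketch below uses the closed-form ridge solution for W; the function names and data layouts are assumptions for this example, not the authors' implementation.

```python
# A sketch: fit the linear mapping W by ridge regression so that W x_i lands
# near the prototype of its topic, then classify new inputs by nearest prototype.
import numpy as np

def fit_mapping(X, y, P, lam=1.0):
    """X: (d, n) documents as columns; y: (n,) topic ids; P: (r, c) prototypes."""
    d = X.shape[0]
    T = P[:, y]                                   # target prototype for each doc
    # closed form of argmin_W sum_i ||p_{y_i} - W x_i||^2 + lam ||W||_F^2
    W = T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))
    return W

def predict(W, P, X_test):
    Z = W @ X_test                                # map documents into F
    # squared distances to every prototype, shape (c, n_test)
    d2 = (P ** 2).sum(0)[:, None] - 2 * P.T @ Z + (Z ** 2).sum(0)[None, :]
    return d2.argmin(axis=0)                      # label of the closest prototype
```

Here P could be either of the data-independent embeddings PI or Pmds; the next section replaces this two-step procedure with a single joint optimization.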
48 Figure 2: The schematic layout of the large-margin embedding of the taxonomy and the documents, showing the topic taxonomy T, the high dimensional input space (inputs xi ∈ Rd), the pre-mapping A into Rc, and the low dimensional semantic space with class prototypes pα = Peα, embedded inputs zi, and the large margin separating the correct prototype pyi from the others. [sent-96, score-1.978]
49 As a first step, we represent topic α as the vector eα and document xi as x̄i = Axi. [sent-97, score-0.647]
50 We then learn the matrix P whose columns are the prototypes pα = Peα and which defines the final transformation of the documents zi = Px̄i. [sent-98, score-0.691]
51 This final transformation is learned such that the correct prototype pyi is closer to zi than any other prototype pα by a large margin. [sent-99, score-0.768]
52 However, learning the prototypes independently of the data xi is far from optimal for reducing the loss in (5). [sent-101, score-0.59]
53 In this section we will create a joint optimization problem that places the prototypes P and learns the mapping W while minimizing an upper bound on (5). [sent-102, score-0.611]
54 We want to map the input documents closest to their prototypes and at the same time place the prototypes where the documents of the respective topic are mapped to. [sent-104, score-1.602]
55 The mapping A in (6) is entirely independent of P and can be pre-computed before the prototypes have been positioned. [sent-109, score-0.482]
56 We can then rewrite both the topic prototypes pα and the low dimensional documents zi as vectors within the range of P: pα = Peα and zi = Px̄i. (7) Optimization. Ideally we would like to learn P to minimize (5) directly. [sent-113, score-1.119]
57 The loss for a specific document xi is zero if its corresponding vector zi is closer to the correct prototype pyi than to any other prototype pα. [sent-116, score-1.104]
58 For better generalization it would be preferable if prototype pyi were in fact much closer by a large margin. [sent-117, score-0.432]
59 We can go even further and demand that prototypes that would incur a larger misclassification loss should be further separated than those with a small cost. [sent-118, score-0.512]
60 We can express this condition as a set of “soft” inequality constraints, in terms of squared distances: ∀i, α ≠ yi: ‖P(eyi − x̄i)‖² + Cyiα ≤ ‖P(eα − x̄i)‖² + ξiα, (8) where the slack variable ξiα ≥ 0 absorbs the amount of violation of prototype pα into the margin of xi. [sent-120, score-0.649]
61 Given this formulation, we create an upper bound on the loss function (5): Theorem 1. Given a prototype matrix P, the training error (5) is bounded above by (1/n) Σi,α ξiα. [sent-121, score-0.401]
62 Proof: First, note that we can rewrite the assignment of the closest prototype (4) as ŷi = argminα ‖P(eα − x̄i)‖². [sent-122, score-0.458]
63 It follows that ‖P(eyi − x̄i)‖² − ‖P(eŷi − x̄i)‖² ≥ 0 for all i (with equality when ŷi = yi). [sent-123, score-0.246]
64 We therefore obtain: ξiŷi = ‖P(eyi − x̄i)‖² + Cyiŷi − ‖P(eŷi − x̄i)‖² ≥ Cyiŷi. [sent-124, score-0.246]
65 Minimizing the slack variables subject to the constraints in eq. (8) allows us to create an optimization problem that minimizes an upper bound on the average loss in eq. [sent-127, score-0.134]
66 (5) with maximum-margin constraints: Minimize over P: Σi,α ξiα subject to: (1) ‖P(eyi − x̄i)‖² + Cyiα ≤ ‖P(eα − x̄i)‖² + ξiα, (2) ξiα ≥ 0. (11) Note that if we have a very large number of classes, it might be beneficial to choose P ∈ Rr×c with r < c. [sent-128, score-0.156]
67 We can make (11) invariant to rotation by defining Q = P⊤P and rewriting all distances in terms of Q: ‖P(eα − x̄i)‖² = (eα − x̄i)⊤Q(eα − x̄i) = ‖eα − x̄i‖²Q. [sent-134, score-0.341]
68 We can now solve (11) in terms of Q instead of P with the large-margin constraints ∀i, α ≠ yi: ‖eyi − x̄i‖²Q + Cyiα ≤ ‖eα − x̄i‖²Q + ξiα. [sent-139, score-0.39]
69 Even though the training data might differ from the test data, we know that the taxonomy does not change. [sent-142, score-0.299]
70 If for all i we have x̄i = eyi, then Pmds satisfies all constraints (8) as equalities with zero slack. [sent-145, score-0.267]
71 The final convex optimization of taxem with regularized objective becomes: Minimize over Q: (1 − µ) Σi,α ξiα + µ ‖Q − C̄‖²F subject to: (1) ‖eyi − x̄i‖²Q + Cyiα ≤ ‖eα − x̄i‖²Q + ξiα, (2) ξiα ≥ 0, (3) Q ⪰ 0. (14) The constant µ ∈ [0, 1] regulates the impact of the regularization term. [sent-150, score-0.63]
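For small problems, the convex program (14) can be sketched with cvxpy as below; Cbar stands for the regularization target C̄ (an assumption here: the Gram matrix of the mds prototypes), Xbar holds the pre-mapped documents x̄i as columns, and all names are illustrative rather than the authors' implementation. Since the number of constraints grows as O(cn), an off-the-shelf solver is only practical for modest c and n.

```python
# A small-scale sketch of the convex taxem optimization (14) using cvxpy.
import numpy as np
import cvxpy as cp

def solve_taxem(Xbar, y, C, Cbar, mu=0.5):
    """Xbar: (c, n) columns x̄_i = A x_i; y: (n,) labels in {0..c-1};
    C: (c, c) cost matrix; Cbar: (c, c) regularization target for Q."""
    c, n = Xbar.shape
    Q = cp.Variable((c, c), PSD=True)             # constraint (3): Q is PSD
    xi = cp.Variable((n, c), nonneg=True)         # constraint (2): slacks >= 0
    E = np.eye(c)                                 # columns are the vectors e_alpha
    cons = []
    for i in range(n):
        d_y = E[:, y[i]] - Xbar[:, i]
        for a in range(c):
            if a == y[i]:
                continue
            d_a = E[:, a] - Xbar[:, i]
            # constraint (1): ||e_y - x̄||_Q^2 + C_{y,a} <= ||e_a - x̄||_Q^2 + ξ
            cons.append(cp.quad_form(d_y, Q) + C[y[i], a]
                        <= cp.quad_form(d_a, Q) + xi[i, a])
    obj = cp.Minimize((1 - mu) * cp.sum(xi) + mu * cp.sum_squares(Q - Cbar))
    cp.Problem(obj, cons).solve(solver=cp.SCS)
    return Q.value
```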
72 Once the optimal solution Q∗ is found, one can obtain the positions of the prototypes with a simple SVD or Cholesky decomposition Q∗ = P⊤P, and consequently also the mapping W from W = PA. [sent-153, score-0.527]
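A brief follow-up sketch of this recovery step, using an eigendecomposition (which also handles a rank-deficient Q∗, unlike a plain Cholesky factorization); Q_star and A are assumed to come from the preceding steps.

```python
# Recover P with Q* = P^T P via an eigendecomposition, then the mapping W = P A.
import numpy as np

def recover_P_and_W(Q_star, A):
    """Q_star: (c, c) optimal PSD matrix; A: (c, d) pre-mapping of the documents."""
    evals, evecs = np.linalg.eigh(Q_star)        # Q* = V diag(lam) V^T
    evals = np.clip(evals, 0.0, None)            # guard against tiny negatives
    P = np.sqrt(evals)[:, None] * evecs.T        # so that P.T @ P == Q*
    W = P @ A                                    # final document mapping z = W x
    return P, W
```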
73 4 Results. We evaluated our algorithm taxem on several classification problems derived from categorizing publications in the public OHSUMED medical journal data base into the Medical Subject Headings (MeSH) taxonomy. [sent-154, score-0.288]
74 Setup and data set description. We used the OHSUMED 87 corpus [9], which consists of abstracts and titles of medical publications. [sent-155, score-0.125]
75 We removed all topic categories that did not appear in the MeSH taxonomy (due to out-dated topic names). [sent-160, score-0.909]
76 For each problem, we created a 70%/30% random split into training and test samples, ensuring, however, that each topic had at least one document in the training corpus. [sent-166, score-0.491]
77 Table 2: The cost-sensitive test error results on various OHSUMED classification data sets. [sent-238, score-0.141]
78 The taxem algorithm obtains the lowest overall loss and the lowest individual loss on each data set except B. [sent-242, score-0.432]
79 We compared taxem against four commonly used algorithms for document categorization: 1. [sent-243, score-0.414]
80 the Cai and Hoffmann SVM formulation with a cost-sensitive hierarchical loss function (SVM tax) [4]. [sent-248, score-0.196]
81 All SVM classifiers were trained with regularization constant C = 1 (which worked best on problem B; this value is also commonly used in text classification when the documents have unit length). [sent-249, score-0.217]
82 Further, we also evaluated the difference between our large margin formulation (taxem) and the results with the simplex (PI -taxem) and mds (Pmds -taxem) prototypes. [sent-250, score-0.201]
83 Up to statistical significance, taxem obtains the lowest loss on all data sets and the lowest overall loss. [sent-254, score-0.378]
84 Ignoring statistical significance, taxem has the lowest loss on all data sets except B. [sent-255, score-0.305]
85 5 Related Work In recent years, several algorithms for document categorization have been proposed. [sent-260, score-0.333]
86 Several authors proposed adaptations of support vector machines that incorporate the topic taxonomy through cost-sensitive loss re-weighting and classification at multiple nodes in the hierarchy [4, 8, 11]. [sent-261, score-0.671]
87 Our method differs from all these methods in that it learns a low dimensional semantic representation of the documents and classifies by finding the nearest prototype. [sent-263, score-0.52]
88 Although their algorithm also reduces the dimensionality with a linear projection, their low dimensional space is obtained through supervised clustering on the document data. [sent-265, score-0.355]
89 In contrast, the semantic space obtained with taxem is obtained through a convex optimization with maximum margin constraints. [sent-266, score-0.592]
90 Further, the low dimensional representation of our method is explicitly constructed to give rise to meaningful Euclidean distances. [sent-267, score-0.146]
91 The optimization with large-margin constraints was partially inspired by recent work on large margin distance metric learning for nearest neighbor classification [16]. [sent-268, score-0.239]
92 However, our formulation is a much more light-weight optimization problem with O(cn) constraints instead of O(n²) as in [16]. [sent-269, score-0.13]
93 6 Conclusion. In this paper, we have presented a novel framework for classification with inter-class relationships based on taxonomy embedding and supervised dimensionality reduction. [sent-271, score-0.375]
94 We derived a single convex optimization problem that learns an embedding of the topic taxonomy as well as a linear mapping from the document space to the resulting low dimensional semantic space. [sent-272, score-1.358]
95 As future work we are planning to extend our algorithm to the more general setting of document categorization with multiple topic memberships and multi-modal topic distributions. [sent-273, score-0.907]
96 Further, we are keen to explore the implications of our proposed conversion of discrete topic taxonomies into continuous semantic spaces. [sent-274, score-0.468]
97 A natural step is to consider the document matching problem (e.g. [sent-276, score-0.204]
98 of web pages and advertisements) in the semantic space: a fast nearest neighbor search can be performed in a joint low dimensional space without having to resort to classification altogether. [sent-278, score-0.422]
99 Although this paper is presented in the context of document categorization, it is important to emphasize that our method is by no means limited to text data or class hierarchies. [sent-279, score-0.235]
100 Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval & categorization, 2000. [sent-343, score-0.273]
wordName wordTfidf (topN-words)
[('prototypes', 0.458), ('taxonomy', 0.299), ('topic', 0.287), ('prototype', 0.285), ('pmds', 0.21), ('taxem', 0.21), ('document', 0.204), ('cyi', 0.189), ('semantic', 0.181), ('documents', 0.15), ('eyi', 0.141), ('ohsumed', 0.141), ('categorization', 0.129), ('pyi', 0.105), ('rc', 0.105), ('dimensional', 0.087), ('margin', 0.085), ('mesh', 0.084), ('mcsvm', 0.084), ('svm', 0.079), ('medical', 0.078), ('xi', 0.078), ('embedding', 0.076), ('classi', 0.067), ('cost', 0.066), ('topics', 0.065), ('karypis', 0.063), ('wxi', 0.063), ('headings', 0.055), ('cai', 0.055), ('loss', 0.054), ('zi', 0.051), ('optimization', 0.05), ('closest', 0.05), ('constraints', 0.048), ('abstracts', 0.047), ('cance', 0.046), ('yi', 0.045), ('mds', 0.045), ('hierarchical', 0.044), ('pi', 0.043), ('embed', 0.042), ('retrieval', 0.042), ('hoffmann', 0.042), ('pxi', 0.042), ('closer', 0.042), ('lowest', 0.041), ('latent', 0.039), ('simplex', 0.039), ('layout', 0.039), ('cation', 0.038), ('mapping', 0.037), ('euclidean', 0.037), ('convex', 0.037), ('misclassify', 0.037), ('populated', 0.037), ('tax', 0.037), ('kilian', 0.037), ('categories', 0.036), ('regularization', 0.036), ('learns', 0.036), ('xx', 0.036), ('yahoo', 0.036), ('low', 0.035), ('web', 0.034), ('dumais', 0.034), ('misclassi', 0.033), ('formulation', 0.032), ('obtains', 0.032), ('matrix', 0.032), ('weinberger', 0.031), ('categorized', 0.031), ('attack', 0.031), ('rr', 0.031), ('nearest', 0.031), ('nodes', 0.031), ('text', 0.031), ('acm', 0.031), ('create', 0.03), ('schematic', 0.03), ('wordnet', 0.03), ('axi', 0.03), ('highlighted', 0.03), ('han', 0.03), ('distances', 0.029), ('space', 0.029), ('pe', 0.028), ('embedded', 0.027), ('indexing', 0.027), ('crammer', 0.027), ('beyond', 0.027), ('shortest', 0.026), ('df', 0.026), ('mapped', 0.025), ('neighbor', 0.025), ('pc', 0.025), ('entirely', 0.024), ('respective', 0.024), ('category', 0.024), ('rise', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
2 0.22346869 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.1881658 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
4 0.16581002 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
5 0.13856837 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
6 0.13017423 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.12684599 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
8 0.11125603 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
9 0.10801467 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
10 0.1056402 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.093482576 194 nips-2008-Regularized Learning with Networks of Features
12 0.090806313 113 nips-2008-Kernelized Sorting
13 0.088905632 196 nips-2008-Relative Margin Machines
14 0.083026022 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
15 0.075656638 62 nips-2008-Differentiable Sparse Coding
16 0.071807459 239 nips-2008-Tighter Bounds for Structured Estimation
17 0.069453321 78 nips-2008-Exact Convex Confidence-Weighted Learning
18 0.06835866 61 nips-2008-Diffeomorphic Dimensionality Reduction
19 0.066817448 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
20 0.064127132 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
topicId topicWeight
[(0, -0.211), (1, -0.16), (2, -0.014), (3, -0.133), (4, -0.091), (5, 0.004), (6, 0.146), (7, 0.099), (8, -0.224), (9, 0.014), (10, 0.015), (11, -0.053), (12, -0.115), (13, 0.118), (14, -0.0), (15, 0.071), (16, 0.064), (17, -0.014), (18, -0.021), (19, -0.031), (20, -0.025), (21, 0.009), (22, -0.03), (23, 0.097), (24, 0.073), (25, -0.011), (26, -0.041), (27, 0.006), (28, -0.012), (29, -0.106), (30, 0.034), (31, 0.016), (32, 0.101), (33, 0.003), (34, -0.029), (35, -0.01), (36, -0.073), (37, 0.03), (38, -0.02), (39, -0.074), (40, -0.046), (41, -0.03), (42, 0.119), (43, -0.028), (44, 0.0), (45, 0.041), (46, -0.031), (47, -0.077), (48, -0.002), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.93723875 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
2 0.82120818 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
3 0.77324873 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
4 0.72948027 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
5 0.71526349 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
6 0.62525952 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
7 0.61089361 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
8 0.52935636 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
9 0.52441144 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
10 0.45790723 196 nips-2008-Relative Margin Machines
11 0.45630768 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
12 0.4523114 4 nips-2008-A Scalable Hierarchical Distributed Language Model
13 0.41755432 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
14 0.40430516 226 nips-2008-Supervised Dictionary Learning
15 0.39837646 185 nips-2008-Privacy-preserving logistic regression
16 0.3888751 78 nips-2008-Exact Convex Confidence-Weighted Learning
17 0.3845377 194 nips-2008-Regularized Learning with Networks of Features
18 0.38250834 61 nips-2008-Diffeomorphic Dimensionality Reduction
19 0.37976867 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
20 0.37922603 113 nips-2008-Kernelized Sorting
topicId topicWeight
[(6, 0.065), (7, 0.109), (12, 0.033), (28, 0.153), (57, 0.063), (59, 0.015), (63, 0.033), (71, 0.027), (74, 0.267), (77, 0.058), (83, 0.084)]
simIndex simValue paperId paperTitle
1 0.8088271 248 nips-2008-Using matrices to model symbolic relationship
Author: Ilya Sutskever, Geoffrey E. Hinton
Abstract: We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3, +3) ∈ plus and (+3, −3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations. 1
2 0.80671865 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters
Author: Matt Jones, Sachiko Kinoshita, Michael C. Mozer
Abstract: In most cognitive and motor tasks, speed-accuracy tradeoffs are observed: Individuals can respond slowly and accurately, or quickly yet be prone to errors. Control mechanisms governing the initiation of behavioral responses are sensitive not only to task instructions and the stimulus being processed, but also to the recent stimulus history. When stimuli can be characterized on an easy-hard dimension (e.g., word frequency in a naming task), items preceded by easy trials are responded to more quickly, and with more errors, than items preceded by hard trials. We propose a rationally motivated mathematical model of this sequential adaptation of control, based on a diffusion model of the decision process in which difficulty corresponds to the drift rate for the correct response. The model assumes that responding is based on the posterior distribution over which response is correct, conditioned on the accumulated evidence. We derive this posterior as a function of the drift rate, and show that higher estimates of the drift rate lead to (normatively) faster responding. Trial-by-trial tracking of difficulty thus leads to sequential effects in speed and accuracy. Simulations show the model explains a variety of phenomena in human speeded decision making. We argue this passive statistical mechanism provides a more elegant and parsimonious account than extant theories based on elaborate control structures. 1
same-paper 3 0.79721379 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
4 0.64378387 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.64199793 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
6 0.63909817 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.63853681 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
8 0.6378895 62 nips-2008-Differentiable Sparse Coding
9 0.63741177 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
10 0.63406128 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
11 0.63153183 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
12 0.63042372 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
13 0.63027573 143 nips-2008-Multi-label Multiple Kernel Learning
14 0.6288563 99 nips-2008-High-dimensional support union recovery in multivariate regression
16 0.62655199 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
17 0.62654239 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
18 0.62634629 202 nips-2008-Robust Regression and Lasso
19 0.62532467 196 nips-2008-Relative Margin Machines
20 0.62523293 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning