nips nips2008 nips2008-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. [sent-5, score-0.204]
2 We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. [sent-7, score-0.32]
3 In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. [sent-8, score-0.341]
4 The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. [sent-9, score-0.72]
5 Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. [sent-11, score-0.365]
6 Similarly, in the context of document categorization it seems more severe to misclassify a medical journal on heart attack as a publication on athlete’s foot than on Coronary artery disease. [sent-15, score-0.479]
7 Although the scope of the proposed method is by no means limited to text data and topic hierarchies, for improved clarity we will restrict ourselves to terminology from document categorization throughout this paper. [sent-16, score-0.651]
8 The most common approach to document categorization is to reduce the problem to a “flat” classification problem [13]. [sent-17, score-0.333]
9 However, it is often the case that the topics are not just discrete classes, but are nodes in a complex taxonomy with rich inter-topic relationships. [sent-18, score-0.395]
10 For example, web pages can be categorized into the Yahoo! web taxonomy or medical journals can be categorized into the Medical Subject Headings (MeSH) taxonomy. [sent-20, score-0.442]
11 Moving beyond flat classification to settings that utilize these hierarchical representations of topics has significantly pushed the state of the art [4, 15]. [sent-21, score-0.136]
12 Additional information about inter-topic relationships can for example be leveraged through cost-sensitive decision boundaries or knowledge sharing between documents from closely related classes. [sent-22, score-0.15]
13 In reality, however, the topic taxonomy is a crude approximation of topic relations, created by an editor with knowledge of the true underlying semantic space of topics. [sent-23, score-1.083]
14 In this paper we propose a method that moves beyond hierarchical representations and aims to re-discover the continuous latent semantic space underlying the topic taxonomy. [sent-24, score-0.607]
15 Instead of regarding document categorization as classification, we will think of it as a regression problem where new documents are mapped into this latent semantic topic space. [sent-25, score-1.015]
16 Very different from approaches like LSI or LDA [1, 7], our algorithm is entirely supervised and explicitly embeds the topic taxonomy and the documents into a single latent semantic space with “semantically meaningful” Euclidean distances. [sent-26, score-1.009]
17 Figure 1: A schematic layout of our taxem method (for Taxonomy Embedding), showing the topic taxonomy T, the high dimensional input space X with original inputs xi, the mappings W and P, and the low dimensional semantic space F with class prototypes pα (e.g. pneumonia, stroke, arthritis, heart attack) and embedded inputs Wxi. [sent-27, score-1.585]
18 The classes are embedded as prototypes inside the semantic space. [sent-28, score-0.666]
19 The input documents are mapped into the same space, placed closest to their topic prototypes. [sent-29, score-0.512]
20 In this paper we derive a method to embed the taxonomy of topics into a latent semantic space in the form of topic prototypes. [sent-30, score-0.942]
21 A new document can be classified by first mapping it into this space and then assigning the label of the closest prototype. [sent-31, score-0.32]
22 A key contribution of our paper is the derivation of a convex problem that learns the regressor for the documents and the placement of the prototypes in a single optimization. [sent-32, score-0.681]
23 In particular, it places the topic prototypes such that for each document the prototype of the correct topic is much closer than any other prototype by a large margin. [sent-33, score-1.848]
24 Our paper is structured as follows: In section 2 we introduce necessary notation and a first version of the algorithm based on a two-step approach of first embedding the hierarchical taxonomy into a semantic space and then regressing the input documents close to their respective topic prototypes. [sent-35, score-1.09]
25 In section 3 we extend our model to a single optimization that learns both steps in one convex optimization with large margin constraints. [sent-36, score-0.258]
26 We evaluate our method in section 4 and demonstrate state-of-the-art results on eight different document categorization tasks from the OHSUMED medical journal data set. [sent-37, score-0.411]
27 In addition, the documents are accompanied by single topic labels y1, ..., yn ∈ {1, ..., c} that lie in some taxonomy T with c total topics. [sent-41, score-0.736]
28 This taxonomy T gives rise to some cost matrix C ∈ Rc×c, where Cαβ ≥ 0 defines the cost of misclassifying an element of topic α as β, and Cαα = 0. [sent-42, score-0.774]
29 Technically, we only require knowledge of the cost matrix C, which could also be obtained from side-information independent of a topic taxonomy. [sent-43, score-0.385]
30 However, we would like to point out that a common way to infer a cost matrix from a taxonomy is to set Cαβ to the length of the shortest path between nodes α and β, but other approaches have also been studied [3]. [sent-45, score-0.423]
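To make the shortest-path construction concrete, here is a minimal sketch (not from the paper) that builds C for a small hypothetical taxonomy by breadth-first search over the undirected tree; the function name and the toy edge list are illustrative assumptions.

```python
# A minimal sketch (not the authors' code): derive the cost matrix C from a
# topic taxonomy by setting C[a, b] to the shortest-path length between nodes.
from collections import deque
import numpy as np

def cost_matrix_from_taxonomy(num_topics, edges):
    """edges: list of (parent, child) pairs over topic ids 0..num_topics-1."""
    adj = [[] for _ in range(num_topics)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)                      # treat the taxonomy as undirected
    C = np.zeros((num_topics, num_topics))
    for start in range(num_topics):
        dist = [-1] * num_topics
        dist[start] = 0
        queue = deque([start])
        while queue:                          # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        C[start] = dist
    return C                                  # C[a, a] == 0, C[a, b] = path length

# toy taxonomy: root 0 with children 1 and 2; node 2 has children 3 and 4
C = cost_matrix_from_taxonomy(5, [(0, 1), (0, 2), (2, 3), (2, 4)])
```

With unit edge weights this yields integer costs with zero diagonal, matching the definition of the cost matrix above.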
31 Throughout this paper we denote document indices as i, j ∈ {1, ..., n} and topic indices as α, β ∈ {1, ..., c}. [sent-46, score-0.491]
32 We would like to create a low dimensional semantic feature space F in which we represent each topic α as a topic prototype pα ∈ F and each document xi ∈ X as a low dimensional vector zi ∈ F. [sent-53, score-1.676]
33 Our goal is to discover a representation of the data where distances reflect true underlying dissimilarities and proximity to prototypes indicates topic membership. [sent-54, score-0.774]
34 In other words, documents on the same or related topics should be close to the respective topic prototypes, while documents on highly different topics should be well separated. [sent-55, score-0.741]
35 As an essential part of our method is to embed the classes that are typically found in a taxonomy, we refer to our algorithm as taxem (short for “taxonomy embedding”). [sent-57, score-0.252]
36 Embedding topic prototypes. The first step of our algorithm is to embed the document taxonomy into a Euclidean vector space. [sent-58, score-1.29]
37 More precisely, we want to derive prototypes p1, ..., pc ∈ F based on the cost matrix C, where pα is the prototype that represents topic α. [sent-62, score-0.695]
38 We collect them in a matrix P = [p1, ..., pc] ∈ Rc×c whose columns consist of the topic prototypes. [sent-66, score-0.312]
39 There are many ways to derive the prototypes from the cost matrix C. [sent-67, score-0.556]
40 Better results can be expected when the prototypes of similar topics are closer than those of dissimilar topics. [sent-71, score-0.565]
41 We use the cost matrix C as an estimate of dissimilarity and aim to place the prototypes such that the squared distance ‖pα − pβ‖² reflects the cost specified in Cαβ. [sent-72, score-0.622]
42 This amounts to minimizing Σα,β=1..c (‖pα − pβ‖² − Cαβ)² (1). If the cost matrix C defines squared Euclidean distances, the minimizer of (1) can be computed exactly with metric multidimensional scaling (mds), yielding the prototype matrix Pmds. [sent-74, score-0.127]
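The mds placement can be sketched in a few lines of numpy under the assumption that C holds squared pairwise distances; double-centering recovers a Gram matrix whose factorization gives the prototype coordinates. The helper name mds_prototypes is illustrative, not the authors' code.

```python
# A sketch of placing prototypes by classical multidimensional scaling (mds),
# assuming the cost matrix C contains squared pairwise distances.
import numpy as np

def mds_prototypes(C, dim=None):
    c = C.shape[0]
    J = np.eye(c) - np.ones((c, c)) / c        # centering matrix
    B = -0.5 * J @ C @ J                       # Gram matrix of the prototypes
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]            # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    if dim is None:
        dim = c
    evals = np.clip(evals[:dim], 0.0, None)    # drop tiny negative eigenvalues
    # columns of P_mds are the prototype vectors p_alpha
    P_mds = np.sqrt(evals)[:, None] * evecs[:, :dim].T
    return P_mds
```

When C is exactly a squared Euclidean distance matrix, the columns pα of the returned matrix satisfy ‖pα − pβ‖² = Cαβ.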
43 Both prototype embeddings PI and Pmds are still independent of the input data {xi}. [sent-81, score-0.458]
44 Before we can derive a more sophisticated method to place the prototypes with large margin constraints on the document vectors, we will briefly describe the mapping W : X → F of the input documents into the low dimensional feature space F. [sent-82, score-1.133]
45 We need to find an appropriate mapping W : X → F that maps each input xi with label yi as close as possible to its topic prototype pyi. [sent-84, score-0.837]
46 We can find such a linear transformation zi = Wxi by setting W = argminW Σi ‖pyi − Wxi‖² + λ‖W‖²F. [sent-85, score-0.156]
47 Inference. Given an input vector xt, we first map it into F and estimate its label as the topic with the closest prototype: ŷt = argminα ‖pα − Wxt‖². [sent-94, score-0.622]
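As a hedged illustration of this two-step pipeline (regress documents onto their topic prototypes, then classify by the nearest prototype), the numpy sketch below uses the closed-form ridge solution for W; the function names and data layouts are assumptions for this example, not the authors' implementation.

```python
# A sketch: fit the linear mapping W by ridge regression so that W x_i lands
# near the prototype of its topic, then classify new inputs by nearest prototype.
import numpy as np

def fit_mapping(X, y, P, lam=1.0):
    """X: (d, n) documents as columns; y: (n,) topic ids; P: (r, c) prototypes."""
    d = X.shape[0]
    T = P[:, y]                                   # target prototype for each doc
    # closed form of argmin_W sum_i ||p_{y_i} - W x_i||^2 + lam ||W||_F^2
    W = T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))
    return W

def predict(W, P, X_test):
    Z = W @ X_test                                # map documents into F
    # squared distances to every prototype, shape (c, n_test)
    d2 = (P ** 2).sum(0)[:, None] - 2 * P.T @ Z + (Z ** 2).sum(0)[None, :]
    return d2.argmin(axis=0)                      # label of the closest prototype
```

Here P could be either of the data-independent embeddings PI or Pmds; the next section replaces this two-step procedure with a single joint optimization.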
48 Figure 2: The schematic layout of the large-margin embedding of the taxonomy and the documents, showing the topic taxonomy T, the high dimensional input space (inputs xi ∈ Rd), the pre-mapping A into Rc, and the low dimensional semantic space with class prototypes pα = Peα, embedded inputs zi, and the large margin separating the correct prototype pyi from the others. [sent-96, score-1.978]
49 As a first step, we represent topic α as the vector eα and document xi as x̄i = Axi. [sent-97, score-0.647]
50 We then learn the matrix P whose columns are the prototypes pα = Peα and which defines the final transformation of the documents zi = Px̄i. [sent-98, score-0.691]
51 This final transformation is learned such that the correct prototype pyi is closer to zi than any other prototype pα by a large margin. [sent-99, score-0.768]
52 However, learning the prototypes independently of the data xi is far from optimal for reducing the loss in (5). [sent-101, score-0.59]
53 In this section we will create a joint optimization problem that places the prototypes P and learns the mapping W while minimizing an upper bound on (5). [sent-102, score-0.611]
54 We want to map the input documents closest to their prototypes and at the same time place the prototypes where the documents of the respective topic are mapped to. [sent-104, score-1.602]
55 The mapping A in (6) is entirely independent of P and can be pre-computed before the prototypes have been positioned. [sent-109, score-0.482]
56 We can then rewrite both the topic prototypes pα and the low dimensional documents zi as vectors within the range of P: pα = Peα and zi = Px̄i. (7) Optimization. Ideally we would like to learn P to minimize (5) directly. [sent-113, score-1.119]
57 The loss for a specific document xi is zero if its corresponding vector zi is closer to the correct prototype pyi than to any other prototype pα. [sent-116, score-1.104]
58 For better generalization it would be preferable if prototype pyi were in fact much closer by a large margin. [sent-117, score-0.432]
59 We can go even further and demand that prototypes that would incur a larger misclassification loss should be further separated than those with a small cost. [sent-118, score-0.512]
60 We can express this condition as a set of “soft” inequality constraints, in terms of squared distances: ∀i, α ≠ yi: ‖P(eyi − x̄i)‖² + Cyiα ≤ ‖P(eα − x̄i)‖² + ξiα, (8) where the slack variable ξiα ≥ 0 absorbs the amount of violation of prototype pα into the margin of xi. [sent-120, score-0.649]
61 Given this formulation, we create an upper bound on the loss function (5): Theorem 1. Given a prototype matrix P, the training error (5) is bounded above by (1/n) Σi,α ξiα. [sent-121, score-0.401]
62 Proof: First, note that we can rewrite the assignment of the closest prototype (4) as ŷi = argminα ‖P(eα − x̄i)‖². [sent-122, score-0.458]
63 It follows that ‖P(eyi − x̄i)‖² − ‖P(eŷi − x̄i)‖² ≥ 0 for all i (with equality when ŷi = yi). [sent-123, score-0.246]
64 We therefore obtain: ξiŷi = ‖P(eyi − x̄i)‖² + Cyiŷi − ‖P(eŷi − x̄i)‖² ≥ Cyiŷi. [sent-124, score-0.246]
65 Minimizing the slack variables subject to the constraints in eq. (8) allows us to create an optimization problem that minimizes an upper bound on the average loss in eq. [sent-127, score-0.134]
66 (5) with maximum-margin constraints: Minimize over P: Σi,α ξiα subject to: (1) ‖P(eyi − x̄i)‖² + Cyiα ≤ ‖P(eα − x̄i)‖² + ξiα, (2) ξiα ≥ 0. (11) Note that if we have a very large number of classes, it might be beneficial to choose P ∈ Rr×c with r < c. [sent-128, score-0.156]
67 We can make (11) invariant to rotation by defining Q = P⊤P and rewriting all distances in terms of Q: ‖P(eα − x̄i)‖² = (eα − x̄i)⊤Q(eα − x̄i) = ‖eα − x̄i‖²Q. [sent-134, score-0.341]
68 We can now solve (11) in terms of Q instead of P with the large-margin constraints ∀i, α ≠ yi: ‖eyi − x̄i‖²Q + Cyiα ≤ ‖eα − x̄i‖²Q + ξiα. [sent-139, score-0.39]
69 Even though the training data might differ from the test data, we know that the taxonomy does not change. [sent-142, score-0.299]
70 If for all i we have x̄i = eyi, then Pmds satisfies all constraints (8) as equalities with zero slack. [sent-145, score-0.267]
71 The final convex optimization of taxem with regularized objective becomes: Minimize over Q: (1 − µ) Σi,α ξiα + µ ‖Q − C̄‖²F subject to: (1) ‖eyi − x̄i‖²Q + Cyiα ≤ ‖eα − x̄i‖²Q + ξiα, (2) ξiα ≥ 0, (3) Q ⪰ 0. (14) The constant µ ∈ [0, 1] regulates the impact of the regularization term. [sent-150, score-0.63]
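For small problems, the convex program (14) can be sketched with cvxpy as below; Cbar stands for the regularization target C̄ (an assumption here: the Gram matrix of the mds prototypes), Xbar holds the pre-mapped documents x̄i as columns, and all names are illustrative rather than the authors' implementation. Since the number of constraints grows as O(cn), an off-the-shelf solver is only practical for modest c and n.

```python
# A small-scale sketch of the convex taxem optimization (14) using cvxpy.
import numpy as np
import cvxpy as cp

def solve_taxem(Xbar, y, C, Cbar, mu=0.5):
    """Xbar: (c, n) columns x̄_i = A x_i; y: (n,) labels in {0..c-1};
    C: (c, c) cost matrix; Cbar: (c, c) regularization target for Q."""
    c, n = Xbar.shape
    Q = cp.Variable((c, c), PSD=True)             # constraint (3): Q is PSD
    xi = cp.Variable((n, c), nonneg=True)         # constraint (2): slacks >= 0
    E = np.eye(c)                                 # columns are the vectors e_alpha
    cons = []
    for i in range(n):
        d_y = E[:, y[i]] - Xbar[:, i]
        for a in range(c):
            if a == y[i]:
                continue
            d_a = E[:, a] - Xbar[:, i]
            # constraint (1): ||e_y - x̄||_Q^2 + C_{y,a} <= ||e_a - x̄||_Q^2 + ξ
            cons.append(cp.quad_form(d_y, Q) + C[y[i], a]
                        <= cp.quad_form(d_a, Q) + xi[i, a])
    obj = cp.Minimize((1 - mu) * cp.sum(xi) + mu * cp.sum_squares(Q - Cbar))
    cp.Problem(obj, cons).solve(solver=cp.SCS)
    return Q.value
```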
72 Once the optimal solution Q∗ is found, one can obtain the positions of the prototypes with a simple SVD or Cholesky decomposition Q∗ = P⊤P, and consequently also the mapping W from W = PA. [sent-153, score-0.527]
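A brief follow-up sketch of this recovery step, using an eigendecomposition (which also handles a rank-deficient Q∗, unlike a plain Cholesky factorization); Q_star and A are assumed to come from the preceding steps.

```python
# Recover P with Q* = P^T P via an eigendecomposition, then the mapping W = P A.
import numpy as np

def recover_P_and_W(Q_star, A):
    """Q_star: (c, c) optimal PSD matrix; A: (c, d) pre-mapping of the documents."""
    evals, evecs = np.linalg.eigh(Q_star)        # Q* = V diag(lam) V^T
    evals = np.clip(evals, 0.0, None)            # guard against tiny negatives
    P = np.sqrt(evals)[:, None] * evecs.T        # so that P.T @ P == Q*
    W = P @ A                                    # final document mapping z = W x
    return P, W
```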
73 4 Results. We evaluated our algorithm taxem on several classification problems derived from categorizing publications in the public OHSUMED medical journal data base into the Medical Subject Headings (MeSH) taxonomy. [sent-154, score-0.288]
74 Setup and data set description. We used the OHSUMED 87 corpus [9], which consists of abstracts and titles of medical publications. [sent-155, score-0.125]
75 We removed all topic categories that did not appear in the MeSH taxonomy (due to out-dated topic names). [sent-160, score-0.909]
76 For each problem, we created a 70%/30% random split into training and test samples, ensuring, however, that each topic had at least one document in the training corpus. [sent-166, score-0.491]
77 Table 2: The cost-sensitive test error results on various OHSUMED classification data sets. [sent-238, score-0.141]
78 The taxem algorithm obtains the lowest overall loss and the lowest individual loss on each data set except B. [sent-242, score-0.432]
79 We compared taxem against four commonly used algorithms for document categorization: 1. [sent-243, score-0.414]
80 the Cai and Hoffmann SVM formulation with a cost-sensitive hierarchical loss function (SVM tax) [4]. [sent-248, score-0.196]
81 All SVM classifiers were trained with regularization constant C = 1 (which worked best on problem B; this value is also commonly used in text classification when the documents have unit length). [sent-249, score-0.217]
82 Further, we also evaluated the difference between our large margin formulation (taxem) and the results with the simplex (PI -taxem) and mds (Pmds -taxem) prototypes. [sent-250, score-0.201]
83 Up to statistical significance, taxem obtains the lowest loss on all data sets and the lowest overall loss. [sent-254, score-0.378]
84 Ignoring statistical significance, taxem has the lowest loss on all data sets except B. [sent-255, score-0.305]
85 5 Related Work In recent years, several algorithms for document categorization have been proposed. [sent-260, score-0.333]
86 Several authors proposed adaptations of support vector machines that incorporate the topic taxonomy through cost-sensitive loss re-weighting and classification at multiple nodes in the hierarchy [4, 8, 11]. [sent-261, score-0.671]
87 Our method differs from all these methods in that it learns a low dimensional semantic representation of the documents and classifies by finding the nearest prototype. [sent-263, score-0.52]
88 Although their algorithm also reduces the dimensionality with a linear projection, their low dimensional space is obtained through supervised clustering on the document data. [sent-265, score-0.355]
89 In contrast, the semantic space obtained with taxem is obtained through a convex optimization with maximum margin constraints. [sent-266, score-0.592]
90 Further, the low dimensional representation of our method is explicitly constructed to give rise to meaningful Euclidean distances. [sent-267, score-0.146]
91 The optimization with large-margin constraints was partially inspired by recent work on large margin distance metric learning for nearest neighbor classification [16]. [sent-268, score-0.239]
92 However, our formulation is a much more light-weight optimization problem with O(cn) constraints instead of O(n²) as in [16]. [sent-269, score-0.13]
93 6 Conclusion. In this paper, we have presented a novel framework for classification with inter-class relationships based on taxonomy embedding and supervised dimensionality reduction. [sent-271, score-0.375]
94 We derived a single convex optimization problem that learns an embedding of the topic taxonomy as well as a linear mapping from the document space to the resulting low dimensional semantic space. [sent-272, score-1.358]
95 As future work we are planning to extend our algorithm to the more general setting of document categorization with multiple topic memberships and multi-modal topic distributions. [sent-273, score-0.907]
96 Further, we are keen to explore the implications of our proposed conversion of discrete topic taxonomies into continuous semantic spaces. [sent-274, score-0.468]
97 A natural step is to consider the document matching problem (e.g. [sent-276, score-0.204]
98 of web pages and advertisements) in the semantic space: a fast nearest neighbor search can be performed in a joint low dimensional space without having to resort to classification altogether. [sent-278, score-0.422]
99 Although this paper is presented in the context of document categorization, it is important to emphasize that our method is by no means limited to text data or class hierarchies. [sent-279, score-0.235]
100 Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval & categorization, 2000. [sent-343, score-0.273]
wordName wordTfidf (topN-words)
[('prototypes', 0.458), ('taxonomy', 0.299), ('topic', 0.287), ('prototype', 0.285), ('pmds', 0.21), ('taxem', 0.21), ('document', 0.204), ('cyi', 0.189), ('semantic', 0.181), ('documents', 0.15), ('eyi', 0.141), ('ohsumed', 0.141), ('categorization', 0.129), ('pyi', 0.105), ('rc', 0.105), ('dimensional', 0.087), ('margin', 0.085), ('mesh', 0.084), ('mcsvm', 0.084), ('svm', 0.079), ('medical', 0.078), ('xi', 0.078), ('embedding', 0.076), ('classi', 0.067), ('cost', 0.066), ('topics', 0.065), ('karypis', 0.063), ('wxi', 0.063), ('headings', 0.055), ('cai', 0.055), ('loss', 0.054), ('zi', 0.051), ('optimization', 0.05), ('closest', 0.05), ('constraints', 0.048), ('abstracts', 0.047), ('cance', 0.046), ('yi', 0.045), ('mds', 0.045), ('hierarchical', 0.044), ('pi', 0.043), ('embed', 0.042), ('retrieval', 0.042), ('hoffmann', 0.042), ('pxi', 0.042), ('closer', 0.042), ('lowest', 0.041), ('latent', 0.039), ('simplex', 0.039), ('layout', 0.039), ('cation', 0.038), ('mapping', 0.037), ('euclidean', 0.037), ('convex', 0.037), ('misclassify', 0.037), ('populated', 0.037), ('tax', 0.037), ('kilian', 0.037), ('categories', 0.036), ('regularization', 0.036), ('learns', 0.036), ('xx', 0.036), ('yahoo', 0.036), ('low', 0.035), ('web', 0.034), ('dumais', 0.034), ('misclassi', 0.033), ('formulation', 0.032), ('obtains', 0.032), ('matrix', 0.032), ('weinberger', 0.031), ('categorized', 0.031), ('attack', 0.031), ('rr', 0.031), ('nearest', 0.031), ('nodes', 0.031), ('text', 0.031), ('acm', 0.031), ('create', 0.03), ('schematic', 0.03), ('wordnet', 0.03), ('axi', 0.03), ('highlighted', 0.03), ('han', 0.03), ('distances', 0.029), ('space', 0.029), ('pe', 0.028), ('embedded', 0.027), ('indexing', 0.027), ('crammer', 0.027), ('beyond', 0.027), ('shortest', 0.026), ('df', 0.026), ('mapped', 0.025), ('neighbor', 0.025), ('pc', 0.025), ('entirely', 0.024), ('respective', 0.024), ('category', 0.024), ('rise', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
2 0.22346869 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.1881658 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
4 0.16581002 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
5 0.13856837 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
6 0.13017423 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.12684599 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
8 0.11125603 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
9 0.10801467 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
10 0.1056402 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.093482576 194 nips-2008-Regularized Learning with Networks of Features
12 0.090806313 113 nips-2008-Kernelized Sorting
13 0.088905632 196 nips-2008-Relative Margin Machines
14 0.083026022 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
15 0.075656638 62 nips-2008-Differentiable Sparse Coding
16 0.071807459 239 nips-2008-Tighter Bounds for Structured Estimation
17 0.069453321 78 nips-2008-Exact Convex Confidence-Weighted Learning
18 0.06835866 61 nips-2008-Diffeomorphic Dimensionality Reduction
19 0.066817448 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
20 0.064127132 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
topicId topicWeight
[(0, -0.211), (1, -0.16), (2, -0.014), (3, -0.133), (4, -0.091), (5, 0.004), (6, 0.146), (7, 0.099), (8, -0.224), (9, 0.014), (10, 0.015), (11, -0.053), (12, -0.115), (13, 0.118), (14, -0.0), (15, 0.071), (16, 0.064), (17, -0.014), (18, -0.021), (19, -0.031), (20, -0.025), (21, 0.009), (22, -0.03), (23, 0.097), (24, 0.073), (25, -0.011), (26, -0.041), (27, 0.006), (28, -0.012), (29, -0.106), (30, 0.034), (31, 0.016), (32, 0.101), (33, 0.003), (34, -0.029), (35, -0.01), (36, -0.073), (37, 0.03), (38, -0.02), (39, -0.074), (40, -0.046), (41, -0.03), (42, 0.119), (43, -0.028), (44, 0.0), (45, 0.041), (46, -0.031), (47, -0.077), (48, -0.002), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.93723875 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
2 0.82120818 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
3 0.77324873 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
4 0.72948027 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
5 0.71526349 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
6 0.62525952 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
7 0.61089361 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
8 0.52935636 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
9 0.52441144 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
10 0.45790723 196 nips-2008-Relative Margin Machines
11 0.45630768 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
12 0.4523114 4 nips-2008-A Scalable Hierarchical Distributed Language Model
13 0.41755432 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
14 0.40430516 226 nips-2008-Supervised Dictionary Learning
15 0.39837646 185 nips-2008-Privacy-preserving logistic regression
16 0.3888751 78 nips-2008-Exact Convex Confidence-Weighted Learning
17 0.3845377 194 nips-2008-Regularized Learning with Networks of Features
18 0.38250834 61 nips-2008-Diffeomorphic Dimensionality Reduction
19 0.37976867 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
20 0.37922603 113 nips-2008-Kernelized Sorting
topicId topicWeight
[(6, 0.065), (7, 0.109), (12, 0.033), (28, 0.153), (57, 0.063), (59, 0.015), (63, 0.033), (71, 0.027), (74, 0.267), (77, 0.058), (83, 0.084)]
simIndex simValue paperId paperTitle
1 0.8088271 248 nips-2008-Using matrices to model symbolic relationship
Author: Ilya Sutskever, Geoffrey E. Hinton
Abstract: We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3, +3) ∈ plus and (+3, −3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations. 1
2 0.80671865 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters
Author: Matt Jones, Sachiko Kinoshita, Michael C. Mozer
Abstract: In most cognitive and motor tasks, speed-accuracy tradeoffs are observed: Individuals can respond slowly and accurately, or quickly yet be prone to errors. Control mechanisms governing the initiation of behavioral responses are sensitive not only to task instructions and the stimulus being processed, but also to the recent stimulus history. When stimuli can be characterized on an easy-hard dimension (e.g., word frequency in a naming task), items preceded by easy trials are responded to more quickly, and with more errors, than items preceded by hard trials. We propose a rationally motivated mathematical model of this sequential adaptation of control, based on a diffusion model of the decision process in which difficulty corresponds to the drift rate for the correct response. The model assumes that responding is based on the posterior distribution over which response is correct, conditioned on the accumulated evidence. We derive this posterior as a function of the drift rate, and show that higher estimates of the drift rate lead to (normatively) faster responding. Trial-by-trial tracking of difficulty thus leads to sequential effects in speed and accuracy. Simulations show the model explains a variety of phenomena in human speeded decision making. We argue this passive statistical mechanism provides a more elegant and parsimonious account than extant theories based on elaborate control structures. 1
same-paper 3 0.79721379 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
4 0.64378387 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.64199793 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
6 0.63909817 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.63853681 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
8 0.6378895 62 nips-2008-Differentiable Sparse Coding
9 0.63741177 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
10 0.63406128 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
11 0.63153183 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
12 0.63042372 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
13 0.63027573 143 nips-2008-Multi-label Multiple Kernel Learning
14 0.6288563 99 nips-2008-High-dimensional support union recovery in multivariate regression
16 0.62655199 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
17 0.62654239 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
18 0.62634629 202 nips-2008-Robust Regression and Lasso
19 0.62532467 196 nips-2008-Relative Margin Machines
20 0.62523293 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning