nips nips2011 nips2011-116 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. [sent-3, score-0.326]
2 Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. [sent-4, score-0.386]
3 Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. [sent-5, score-0.204]
4 We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. [sent-6, score-0.449]
5 We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not. [sent-7, score-0.411]
6 Examples include but are not limited to webpages and curated hierarchical directories of the same [1], product descriptions and catalogs, and patient records and diagnosis codes assigned to them for bookkeeping and insurance purposes. [sent-10, score-0.482]
7 In this work we show how to combine these two sources of information using a single model that allows one to categorize new text documents automatically, suggest labels that might be inaccurate, compute improved similarities between documents for information retrieval purposes, and more. [sent-11, score-0.246]
8 There are several challenges entailed in incorporating a hierarchy of labels into the model. [sent-15, score-0.21]
9 Among them, given a large set of potential labels (often thousands), each instance has only a small number of labels associated with it. [sent-16, score-0.22]
10 Furthermore, there are no naturally occurring negative labelings in the data, and the absence of a label cannot always be interpreted as a negative labeling. [sent-17, score-0.272]
11 Our approach learns topic models of the underlying data and labeling strategies in a joint model, while leveraging the hierarchical structure of the labels. [sent-19, score-0.273]
12 For the sake of simplicity, we focus on “is-a” hierarchies, but the model can be applied to other structured label spaces. [sent-20, score-0.196]
13 We extend supervised latent Dirichlet allocation (sLDA) [6] to take advantage of hierarchical supervision. [sent-21, score-0.209]
14 We hypothesize that the context of labels within the hierarchy provides valuable information about labeling. [sent-23, score-0.21]
15 We demonstrate our model on large, real-world datasets in the clinical and web retail domains. [sent-24, score-0.234]
16 Our results show that a joint, hierarchical model outperforms both a classifier with unstructured labels and a disjoint model in which the topic model and the hierarchical classifier are inferred independently of each other. [sent-26, score-0.493]
17 Section 2 introduces hierarchically supervised LDA (HSLDA), while Section 3 details a sampling approach to inference in HSLDA. [sent-28, score-0.182]
18 Section 4 reviews related work, and Section 5 shows results from applying HSLDA to health care and web retail data. [sent-29, score-0.189]
19 Each label l ∈ L, except the root, has a parent pa(l) ∈ L also in the set of labels. [sent-42, score-0.21]
20 For expository purposes, we will assume that this label set has hard “is-a” parent-child constraints (explained later), although this assumption can be relaxed at the cost of more computationally complex inference. [sent-43, score-0.202]
21 Such a label hierarchy forms a multiply rooted tree. [sent-44, score-0.272]
22 Each document has a variable yl,d ∈ {−1, 1} for every label, indicating whether or not the label is applied to document d. [sent-46, score-0.711]
23 In most cases yl,d will be unobserved; in some cases we will be able to fix its value because of constraints on the label hierarchy, and in the relatively small remainder its value will be observed. [sent-47, score-0.202]
24 The constraints imposed by an is-a label hierarchy are that if the lth label is applied to document d, i.e. yl,d = 1, [sent-49, score-0.657]
25 then all labels in the label hierarchy from l up to the root are also applied to document d, i.e. ypa(l),d = 1, and so on up to the root. [sent-51, score-0.204]
26 Conversely, if a label l is marked as not applying to a document, then no descendant of that label may be applied to the same document. [sent-57, score-0.558]
27 We assume that at least one label is applied to every document. [sent-58, score-0.196]
28 This is illustrated in Figure 1 where the root label is always applied but only some of the descendant labelings are observed as having been applied (diagonal hashing indicates that potentially some of the plated variables are observed). [sent-59, score-0.3]
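As a concrete illustration of how these is-a constraints can be enforced in code, the following minimal sketch (our own helper names and data structures; the paper does not provide an implementation) propagates a positive label up to the root and a negative label down through all descendants:

```python
# Minimal sketch of is-a constraint propagation over a label tree.
# `parent` maps each label to its parent (None for the root); `children`
# maps each label to a list of its children. These structures are
# assumptions for illustration, not taken from the paper.

def propagate_positive(y, parent, label):
    # If `label` is applied (y[label] = 1), every ancestor up to the
    # root must also be applied.
    while label is not None:
        y[label] = 1
        label = parent[label]

def propagate_negative(y, children, label):
    # If `label` is marked as not applying (y[label] = -1), no
    # descendant of it may be applied.
    stack = [label]
    while stack:
        l = stack.pop()
        y[l] = -1
        stack.extend(children.get(l, []))
```

For example, with parent = {'root': None, 'A': 'root', 'A.1': 'A'}, applying 'A.1' forces 'A' and 'root' to be positive as well.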
29 In HSLDA, documents are modeled using the LDA mixed-membership mixture model with global topic estimation. [sent-60, score-0.208]
30 Label responses are generated using a conditional hierarchy of probit regressors. [sent-61, score-0.206]
31 For each label l ∈ L, draw a label application coefficient ηl | µ, σ ∼ NK(µ1K, σIK). [sent-70, score-0.344]
32 Draw the global topic proportions β | α ∼ DirK(α1K). [sent-71, score-0.164]
33 z̄d = [z̄1, . . . , z̄K] is the empirical topic distribution for document d, in which each entry is the percentage of the words in that document that come from topic k, z̄k = Nd^{-1} Σ_{n=1}^{Nd} I(zn,d = k). [sent-84, score-0.631]
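For concreteness, the empirical topic distribution can be computed from the per-word topic assignments as in the following NumPy sketch (variable and function names are ours):

```python
import numpy as np

def empirical_topic_distribution(z_d, K):
    # z_d: length-N_d array of topic assignments z_{n,d} in {0, ..., K-1}.
    # Returns the K-vector whose k-th entry is the fraction of the words
    # in document d assigned to topic k.
    counts = np.bincount(z_d, minlength=K)
    return counts / len(z_d)
```

For instance, empirical_topic_distribution(np.array([0, 2, 2, 1]), K=3) returns [0.25, 0.25, 0.5].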
34 Here, each document is labeled generatively using a hierarchy of conditionally dependent probit regressors [14]. [sent-86, score-0.389]
35 For every label l ∈ L, both the empirical topic distribution for document d and whether or not its parent label was applied, i.e. [sent-87, score-0.705]
36 I(ypa(l),d = 1), are used to determine whether or not label l is to be applied to document d as well. [sent-89, score-0.355]
37 Note that label l can be applied to document d (i.e. yl,d = 1) only if its parent label pa(l) is also applied (these expressions are specific to is-a constraints but can be modified to accommodate different constraints). [sent-90, score-0.619]
38 The net effect of this is that label predictors deeper in the label hierarchy are able to focus on finding specific, conditional labeling features. [sent-92, score-0.515]
39 We believe this to be a significant source of the empirical label prediction improvement we observe experimentally. [sent-93, score-0.204]
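A minimal sketch of this conditional cascade of probit regressors, written under the is-a constraints above (function and variable names are our own, and the exact form should be checked against the paper's equations rather than taken as the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def generate_labels(z_bar_d, eta, children, root):
    # z_bar_d: K-vector empirical topic distribution for document d.
    # eta: dict mapping each label to its K-vector of regression coefficients.
    # children: dict mapping each label to a list of child labels.
    y = {root: 1}  # the root label is always applied
    def visit(l):
        for c in children.get(l, []):
            if y[l] == 1:
                # probit link on eta_c' z_bar_d, conditioned on the parent being applied
                p = norm.cdf(eta[c] @ z_bar_d)
                y[c] = 1 if np.random.rand() < p else -1
            else:
                y[c] = -1  # is-a constraint: parent not applied, so neither is the child
            visit(c)
    visit(root)
    return y
```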
40 In particular, choosing probit-style auxiliary variable distributions for the al,d ’s yields conditional posterior distributions for both the auxiliary variables (3) and the regression coefficients (2) which are analytic. [sent-96, score-0.369]
41 In the common case where no negative labels are observed (like the example applications we consider in Section 5), the model must be explicitly biased towards generating data that has negative labels in order to keep it from learning to assign all labels to all documents. [sent-98, score-0.394]
42 For large negative values of µ, all labels become a priori more likely to be negative (yl,d = −1). [sent-103, score-0.174]
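To get a feel for the scale of this bias (a back-of-the-envelope calculation, not a result stated in the paper): because the entries of z̄d sum to one and ηl has prior mean µ1K, the prior expectation of ηl · z̄d is µ, so a priori a label with a positive parent is applied with probability roughly Φ(µ); for example, µ = −1 gives about 0.16 and µ = −3 about 0.001.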
43 We explore the ability of µ to bias out-of-sample label prediction performance in Section 5. [sent-104, score-0.204]
44 Let a be the set of all auxiliary variables, w the set of all words, η the set of all regression coefficients, and z\zn,d the set z with element zn,d removed. [sent-107, score-0.169]
45 Here Ld is the set of labels which are observed for document d. [sent-112, score-0.269]
46 The simplicity of this conditional distribution follows from the choice of probit regression [4]; the specific form of the update is a standard result from Bayesian normal linear regression [14]. [sent-118, score-0.232]
47 It is also a standard probit regression result that the conditional posterior distribution of al,d is a truncated normal distribution [4]. [sent-119, score-0.203]
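A minimal sketch of these two standard updates (a truncated-normal draw for al,d and a conjugate Gaussian draw for ηl), assuming unit auxiliary-variable variance and treating σ as the prior variance of the coefficients; the notation is ours and has not been verified against the paper's exact equations (2) and (3):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_auxiliary(mean, y):
    # a_{l,d} | . is normal with the given mean and unit variance, truncated
    # to (0, inf) when y_{l,d} = 1 and to (-inf, 0) when y_{l,d} = -1.
    if y == 1:
        lo, hi = -mean, np.inf
    else:
        lo, hi = -np.inf, -mean
    return mean + truncnorm.rvs(lo, hi)  # truncation bounds in standard-normal units

def sample_coefficients(Z_bar, a_l, mu, sigma):
    # eta_l | . from Bayesian normal linear regression: Z_bar is (D, K) with one
    # row z_bar_d per document where label l's regression is active, and a_l
    # holds the corresponding auxiliary variables.
    K = Z_bar.shape[1]
    precision = Z_bar.T @ Z_bar + np.eye(K) / sigma
    cov = np.linalg.inv(precision)
    mean = cov @ (Z_bar.T @ a_l + (mu / sigma) * np.ones(K))
    return np.random.multivariate_normal(mean, cov)
```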
48 HSLDA employs a hierarchical Dirichlet prior over topic assignments. [sent-121, score-0.268]
49 4 Related Work In this work we extend supervised latent Dirichlet allocation (sLDA) [6] to take advantage of hierarchical supervision. [sent-133, score-0.209]
50 sLDA is latent Dirichlet allocation (LDA) [7] augmented with per document “supervision,” often taking the form of a single numerical or categorical label. [sent-134, score-0.23]
51 It has been demonstrated that the signal provided by such supervision can result in better, task-specific document models and can also lead to good label prediction for out-of-sample data [6]. [sent-135, score-0.397]
52 None of these models, however, leverage dependency structure in the label space. [sent-141, score-0.172]
53 In other work, researchers have classified documents into a hierarchy (a closely related task) with naive Bayes classifiers and support vector machines. [sent-142, score-0.168]
54 Most of this work has been demonstrated on relatively small datasets with small label spaces, and has focused on single-label classification without a document model such as LDA [21, 11, 17, 8]. [sent-143, score-0.412]
55 5 Experiments. We applied HSLDA to data from two domains: predicting medical diagnosis codes from hospital discharge summaries and predicting product categories from Amazon.com. [sent-144, score-0.669]
56 5.1 Discharge Summaries and ICD-9 Codes. Discharge summaries are authored by clinicians to summarize patient hospitalization courses. [sent-149, score-0.219]
57 The summaries typically contain a record of patient complaints, findings and diagnoses, along with treatment and hospital course. [sent-150, score-0.246]
58 For each hospitalization, trained medical coders review the information in the discharge summary and assign a series of diagnosis codes. [sent-151, score-0.289]
59 The ICD-9 codes are organized in a rooted-tree structure, with each edge representing an is-a relationship between parent and child, such that the parent diagnosis subsumes the child diagnosis. [sent-153, score-0.25]
60 It consists of 6,000 discharge summaries and their associated ICD-9 codes (7,298 distinct codes overall), representing all the discharges from the hospital in 2009. [sent-161, score-0.529]
61 We split our dataset into 5,000 discharge summaries for training and 1,000 for testing. [sent-168, score-0.292]
62 The text of the discharge summaries was tokenized with NLTK. [sent-169, score-0.292]
63 A fixed vocabulary was formed by taking the top 10,000 tokens with highest document frequency (exclusive of names, places and other identifying numbers). [sent-170, score-0.187]
64 Amazon.com, an online retail store, organizes its catalog of products in a multiply-rooted hierarchy and provides textual product descriptions for most products. [sent-175, score-0.355]
65 The product descriptions are shorter than the discharge summaries. [sent-185, score-0.413]
66 The comparison models included sLDA with independent regressors (hierarchical constraints on labels ignored) and HSLDA fit by first performing LDA then fitting tree-conditional regressions. [sent-200, score-0.199]
67 These models were chosen to highlight several aspects of HSLDA including performance in the absence of hierarchical constraints, the effect of the combined inference, and regression performance attributable solely to the hierarchical constraints. [sent-201, score-0.257]
68 The distinguishing factor between HSLDA and sLDA is the additional structure imposed on the label space, a distinction that we hypothesized would result in a difference in predictive performance. [sent-203, score-0.213]
69 The second comparison model is HSLDA fit by performing LDA first followed by performing inference over the hierarchically constrained label space. [sent-205, score-0.313]
70 This comparison model examines not the structuring of the label space, but the benefit of combined inference over both the documents and the label space. [sent-208, score-0.446]
71 In the setting where there are no negative labels, a Gaussian prior over the regression parameters with a negative mean implements a prior belief that missing labels are likely to be negative. [sent-211, score-0.299]
72 Figure 2: ROC curves for out-of-sample label prediction varying µ, the prior mean of the regression parameters. [sent-245, score-0.298]
73 In both figures, solid is HSLDA, dashed are independent regressors + sLDA (hierarchical constraints on labels ignored), and dotted is HSLDA fit by running LDA first then running tree-conditional regressions. [sent-246, score-0.199]
74 Figure 3: ROC curve for out-of-sample ICD-9 code prediction, varying the auxiliary variable threshold. [sent-262, score-0.187]
75 To make the comparison as fair as possible among models, ancestors of observed nodes in the label hierarchy were ignored, observed nodes were considered positive, and descendants of observed nodes were considered negative. [sent-266, score-0.34]
76 Since the sLDA model does not enforce the hierarchical constraints, we establish a more equal footing by considering only the observed labels as being positive, despite the fact that, following the hierarchical constraints, ancestors must also be positive. [sent-268, score-0.328]
77 Such a gold standard will likely inflate the number of false positives because the labels applied to any particular document are usually not as complete as they could be. [sent-269, score-0.473]
78 ICD-9 codes, for instance, lack sensitivity and their use as a gold standard could lead to correctly positive predictions being labeled as false positives [5]. [sent-270, score-0.29]
79 However, given that the label space is often large (as in our examples), it is reasonable to assume that erroneous false positives should not skew results significantly. [sent-271, score-0.291]
80 Using these expectations ẑ, we performed Gibbs sampling over the hierarchy to acquire predictive samples for the documents in the test set. [sent-275, score-0.209]
81 The true positive rate was calculated as the average expected labeling for gold standard positive labels. [sent-276, score-0.216]
82 The false positive rate was calculated as the average expected labeling for gold standard negative labels. [sent-277, score-0.284]
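A sketch of how a single ROC point can be computed from these expected labelings and a gold standard (function and variable names are ours):

```python
import numpy as np

def roc_point(expected_labeling, gold):
    # expected_labeling: dict (doc, label) -> expected label application,
    #   averaged over predictive Gibbs samples.
    # gold: dict (doc, label) -> +1 / -1 gold-standard assignment.
    # Returns (false positive rate, true positive rate).
    pos = [expected_labeling[k] for k, v in gold.items() if v == 1]
    neg = [expected_labeling[k] for k, v in gold.items() if v == -1]
    tpr = float(np.mean(pos))  # average expected labeling for gold positives
    fpr = float(np.mean(neg))  # average expected labeling for gold negatives
    return fpr, tpr
```

Sweeping µ (or the auxiliary variable threshold, as described below) and recomputing this point traces out curves like those in Figures 2 and 3.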
83 As sensitivity and specificity can always be traded off, we examined sensitivity for a range of values for two different parameters – the prior means for the regression coefficients and the threshold for the auxiliary variables. [sent-278, score-0.36]
84 The prior mean, in combination with the auxiliary variable threshold, encodes the strength of the prior belief that unobserved labels are likely to be negative. [sent-281, score-0.331]
85 Effectively, the prior mean applies negative pressure to the predictions and the auxiliary variable threshold determines the cutoff. [sent-282, score-0.222]
86 In contrast, to evaluate predictive performance as a function of the auxiliary variable threshold, a single model was fit for each model type and prediction was evaluated based on predictive samples drawn subject to different auxiliary variable thresholds. [sent-285, score-0.376]
87 These methods are significantly different since the prior mean is varied prior to inference, and the auxiliary variable threshold is varied following inference. [sent-286, score-0.221]
88 These results indicate that the full HSLDA model predicts more of the correct labels at the cost of an increase in the number of false positives relative to the comparison models. [sent-297, score-0.229]
89 Figure 3 shows the predictive performance of HSLDA relative to the two comparison models on the clinical dataset as a function of the auxiliary variable threshold. [sent-309, score-0.273]
90 For low values of the auxiliary variable threshold, the models predict labels in a more sensitive and less specific manner, creating the points in the upper right corner of the ROC curve. [sent-310, score-0.289]
91 As the auxiliary variable threshold is increased, the models predict in a less sensitive and more specific manner, creating the points in the lower left hand corner of the ROC curve. [sent-311, score-0.207]
92 One way to see it is as a family of topic models that improve on the topic modeling performance of LDA via the inclusion of observed supervision. [sent-314, score-0.28]
93 A large diversity of problems can be expressed as label prediction problems for bag-of-word data. [sent-316, score-0.204]
94 Hierarchical probit regression makes for tractable Markov chain Monte Carlo HSLDA inference, a benefit that should extend to other sLDA models should probit regression be used for response variable prediction there too. [sent-321, score-0.325]
95 Extensions to this work include unbounded topic cardinality variants and relaxations to different kinds of label structure. [sent-325, score-0.312]
96 Utilizing different kinds of label structure is possible within this framework, but requires relaxing some of the simplifications we made in this paper for expositional purposes. [sent-327, score-0.172]
97 Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. [sent-381, score-0.237]
98 Large scale diagnostic code classification for medical patient records. [sent-466, score-0.166]
99 Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. [sent-480, score-0.24]
100 Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. [sent-488, score-0.181]
wordName wordTfidf (topN-words)
[('hslda', 0.643), ('slda', 0.345), ('lda', 0.217), ('label', 0.172), ('document', 0.159), ('discharge', 0.157), ('ypa', 0.143), ('topic', 0.14), ('summaries', 0.135), ('labels', 0.11), ('retail', 0.107), ('hierarchically', 0.107), ('auxiliary', 0.106), ('clinical', 0.101), ('hierarchy', 0.1), ('hierarchical', 0.097), ('codes', 0.087), ('false', 0.08), ('descriptions', 0.075), ('probit', 0.071), ('documents', 0.068), ('sensitivity', 0.066), ('dirichlet', 0.065), ('medical', 0.065), ('diagnosis', 0.064), ('hospital', 0.063), ('regression', 0.063), ('gold', 0.061), ('regressors', 0.059), ('zd', 0.058), ('roc', 0.056), ('dev', 0.054), ('unstructured', 0.049), ('patient', 0.048), ('draw', 0.047), ('std', 0.047), ('dirk', 0.047), ('product', 0.046), ('city', 0.046), ('positive', 0.044), ('allocation', 0.044), ('topics', 0.042), ('supervised', 0.041), ('predictive', 0.041), ('assignment', 0.041), ('positives', 0.039), ('parent', 0.038), ('labeling', 0.036), ('catalogs', 0.036), ('coders', 0.036), ('directories', 0.036), ('hospitalization', 0.036), ('pneumonia', 0.036), ('conditional', 0.035), ('zt', 0.035), ('inference', 0.034), ('supervision', 0.034), ('posterior', 0.034), ('zk', 0.033), ('prediction', 0.032), ('coef', 0.032), ('negative', 0.032), ('rate', 0.031), ('prior', 0.031), ('disclda', 0.031), ('descendant', 0.031), ('diagnoses', 0.031), ('health', 0.03), ('constraints', 0.03), ('insurance', 0.029), ('diagnostic', 0.029), ('classi', 0.029), ('threshold', 0.028), ('categories', 0.028), ('vocabulary', 0.028), ('automatic', 0.028), ('cients', 0.028), ('latent', 0.027), ('coding', 0.027), ('products', 0.027), ('web', 0.026), ('ignored', 0.026), ('ld', 0.026), ('issn', 0.026), ('care', 0.026), ('root', 0.025), ('separately', 0.025), ('variable', 0.025), ('blei', 0.025), ('omitting', 0.025), ('code', 0.024), ('gibbs', 0.024), ('applied', 0.024), ('language', 0.024), ('sensitive', 0.024), ('corner', 0.024), ('labelings', 0.024), ('proportions', 0.024), ('ancestors', 0.024), ('child', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
2 0.22003412 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
Author: David Sontag, Dan Roy
Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
3 0.15170211 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
Author: Dae I. Kim, Erik B. Sudderth
Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
4 0.11876939 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
Author: David Newman, Edwin V. Bonilla, Wray Buntine
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data. 1
5 0.092049271 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
Author: Xianxing Zhang, Lawrence Carin, David B. Dunson
Abstract: The nested Chinese restaurant process is extended to design a nonparametric topic-model tree for representation of human choices. Each tree path corresponds to a type of person, and each node (topic) has a corresponding probability vector over items that may be selected. The observed data are assumed to have associated temporal covariates (corresponding to the time at which choices are made), and we wish to impose that with increasing time it is more probable that topics deeper in the tree are utilized. This structure is imposed by developing a new “change point
6 0.082663126 156 nips-2011-Learning to Learn with Compound HD Models
7 0.082606077 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
9 0.074762627 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
10 0.073916927 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
11 0.072305381 104 nips-2011-Generalized Beta Mixtures of Gaussians
12 0.063072912 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
13 0.06200996 258 nips-2011-Sparse Bayesian Multi-Task Learning
14 0.060904302 221 nips-2011-Priors over Recurrent Continuous Time Processes
15 0.060861647 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
16 0.055630762 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
17 0.053979546 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
18 0.053971212 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
19 0.053414602 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
20 0.051635955 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
topicId topicWeight
[(0, 0.152), (1, 0.068), (2, -0.038), (3, 0.027), (4, -0.02), (5, -0.265), (6, 0.103), (7, 0.036), (8, -0.114), (9, 0.052), (10, 0.046), (11, 0.076), (12, 0.022), (13, 0.01), (14, 0.007), (15, 0.019), (16, -0.005), (17, -0.018), (18, 0.036), (19, 0.01), (20, 0.006), (21, -0.022), (22, 0.06), (23, -0.01), (24, 0.001), (25, 0.024), (26, -0.027), (27, -0.045), (28, -0.043), (29, -0.035), (30, 0.027), (31, -0.051), (32, 0.033), (33, 0.002), (34, -0.034), (35, -0.056), (36, -0.031), (37, 0.053), (38, 0.012), (39, 0.001), (40, 0.022), (41, 0.008), (42, 0.01), (43, 0.038), (44, -0.049), (45, -0.001), (46, 0.07), (47, -0.035), (48, 0.044), (49, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.92834759 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
2 0.83835042 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
Author: Dae I. Kim, Erik B. Sudderth
Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
3 0.83586597 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
Author: David Sontag, Dan Roy
Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
4 0.78085101 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
Author: David Newman, Edwin V. Bonilla, Wray Buntine
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data. 1
5 0.70632416 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
Author: Liang Xiong, Barnabás Póczos, Jeff G. Schneider
Abstract: An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies. We evaluate the effectiveness of FGM on both synthetic and real data sets including images and turbulence data, and show that it is superior to existing approaches in detecting group anomalies. 1
6 0.68556476 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
7 0.63889354 14 nips-2011-A concave regularization technique for sparse mixture models
8 0.50291419 156 nips-2011-Learning to Learn with Compound HD Models
9 0.49902573 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
11 0.45216683 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
12 0.44720355 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
13 0.44071341 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
14 0.43941087 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
15 0.41458753 33 nips-2011-An Exact Algorithm for F-Measure Maximization
16 0.40165654 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
17 0.39947844 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
18 0.39921042 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
19 0.39598989 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
20 0.39443979 277 nips-2011-Submodular Multi-Label Learning
topicId topicWeight
[(0, 0.031), (4, 0.027), (20, 0.033), (21, 0.051), (26, 0.012), (31, 0.081), (33, 0.064), (43, 0.065), (45, 0.099), (57, 0.055), (65, 0.018), (74, 0.069), (76, 0.232), (83, 0.049), (99, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.76783407 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
2 0.76735783 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n)/(ε^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.
3 0.69741523 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
Author: Jacquelyn A. Shelton, Abdul S. Sheikh, Pietro Berkes, Joerg Bornschein, Joerg Luecke
Abstract: An increasing number of experimental studies indicate that perception encodes a posterior probability distribution over possible causes of sensory stimuli, which is used to act close to optimally in the environment. One outstanding difficulty with this hypothesis is that the exact posterior will in general be too complex to be represented directly, and thus neurons will have to represent an approximation of this distribution. Two influential proposals of efficient posterior representation by neural populations are: 1) neural activity represents samples of the underlying distribution, or 2) they represent a parametric representation of a variational approximation of the posterior. We show that these approaches can be combined for an inference scheme that retains the advantages of both: it is able to represent multiple modes and arbitrary correlations, a feature of sampling methods, and it reduces the represented space to regions of high probability mass, a strength of variational approximations. Neurally, the combined method can be interpreted as a feed-forward preselection of the relevant state space, followed by a neural dynamics implementation of Markov Chain Monte Carlo (MCMC) to approximate the posterior over the relevant states. We demonstrate the effectiveness and efficiency of this approach on a sparse coding model. In numerical experiments on artificial data and image patches, we compare the performance of the algorithms to that of exact EM, variational state space selection alone, MCMC alone, and the combined select and sample approach. The select and sample approach integrates the advantages of the sampling and variational approximations, and forms a robust, neurally plausible, and very efficient model of processing and learning in cortical networks. For sparse coding we show applications easily exceeding a thousand observed and a thousand hidden dimensions. 1
4 0.61056066 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei
Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.
5 0.60856789 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
Author: Pablo M. Olmos, Luis Salamanca, Juan Fuentes, Fernando Pérez-Cruz
Abstract: We show an application of a tree structure for approximate inference in graphical models using the expectation propagation algorithm. These approximations are typically used over graphs with short-range cycles. We demonstrate that these approximations also help in sparse graphs with long-range loops, such as the ones used in coding theory to approach channel capacity. For asymptotically large sparse graphs, the expectation propagation algorithm together with the tree structure yields a completely disconnected approximation to the graphical model, but for finite-length practical sparse graphs, the tree structure approximation to the code graph provides accurate estimates for the marginal of each variable. Furthermore, we propose a new method for constructing the tree structure on the fly that might be more amenable for sparse graphs with general factors.
6 0.6024009 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
7 0.60056353 227 nips-2011-Pylon Model for Semantic Segmentation
8 0.60002923 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
9 0.59480053 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
10 0.59340811 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
11 0.59322762 276 nips-2011-Structured sparse coding via lateral inhibition
12 0.59317994 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
13 0.59316766 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
14 0.5929684 156 nips-2011-Learning to Learn with Compound HD Models
15 0.59053797 219 nips-2011-Predicting response time and error rates in visual search
16 0.58947003 66 nips-2011-Crowdclustering
17 0.5890637 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
18 0.58827442 258 nips-2011-Sparse Bayesian Multi-Task Learning
19 0.58782899 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
20 0.58615059 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models