nips nips2010 nips2010-286 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. [sent-4, score-0.591]
2 We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. [sent-5, score-0.542]
3 Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. [sent-6, score-0.418]
4 1 Introduction Latent Dirichlet Allocation [4] assigns topics to documents and generates topic distributions over words given a collection of texts. [sent-7, score-0.905]
5 The inability to deal with word features makes LDA fall short on several aspects. [sent-10, score-0.256]
6 The most obvious one is perhaps that the topics estimated for infrequently occurring words are usually unreliable. [sent-11, score-0.34]
7 Ideally, for example, we would like the topics associated with synonyms to have a prior tendency of being similar, so that in case one of the words is rare but the other is common, the topic estimates for the rare one can be improved. [sent-12, score-0.777]
8 Similarly, we would like to be able to leverage dictionaries in order to boost topic cohesion across languages, a problem that has been researched but is far from being fully solved, especially for non-aligned corpora [6]. [sent-15, score-0.534]
9 A possible solution, which we propose in this paper, is to treat word information as features rather than as explicit constraints and to adjust a smoothing prior over topic distributions for words such that correlation is emphasised. [sent-17, score-0.869]
10 In the parlance of LDA we do not pick a globally constant β smoother over the word multinomials but rather we adjust it according to word similarity. [sent-18, score-0.561]
11 In this way we are capable of learning the prior probability of how words are distributed over various topics based on how similar they are, e.g. [sent-19, score-0.343]
12 in the context of dictionaries, synonym collections, thesauri, edit distances, or distributional word similarity features. [sent-21, score-0.35]
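To make this concrete, the following sketch (not from the paper; the helper names and the 0.8 threshold are our own choices) derives pairwise similarity weights φuv from edit distance, one of the feature sources listed above.

```python
# Sketch: derive similarity weights phi_uv from edit distance, one of the
# feature sources mentioned above. Not the authors' code.
from difflib import SequenceMatcher

def edit_similarity(u: str, v: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, u, v).ratio()

def similarity_weights(vocab, threshold=0.8):
    """Hypothetical helper: phi[(u, v)] > 0 only for sufficiently similar words."""
    phi = {}
    for i, u in enumerate(vocab):
        for v in vocab[i + 1:]:
            s = edit_similarity(u, v)
            if s >= threshold:
                phi[(u, v)] = s
    return phi

print(similarity_weights(["politics", "politician", "economy"]))
```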
13 Instead, we use a hybrid approach where we perform smooth optimisation over the word smoothing coefficients, while retaining a collapsed Gibbs sampler to assign topics for a fixed choice of smoothing coefficients. [sent-23, score-0.741]
14 We present experimental results on multi-language topic synchronisation which clearly evidence the ability of the model to incorporate dictionary information successfully. [sent-25, score-0.435]
15 Using several different measures of topic alignment, we consistently observe that the proposed model improves substantially on standard LDA, which is unable to leverage this type of information. [sent-26, score-0.389]
16 [Figure 1/2 residue: plate diagrams with plates m = 1..M, n = 1..Nm, v = 1..V and variables α, θm, zmn, wmn, ψkv, βkv, φv; Figure 2 caption fragment: 'Our Extension: Assume we observe side information φv'.] Related work [sent-29, score-0.526]
17 Previous work on multilingual topic models requires parallelism at either the sentence level ([20]) or the document level ([9], [15]). [sent-36, score-0.551]
18 More recent work [13] relaxes that, but still requires that a significant fraction (at least 25%) of the documents are paired up. [sent-37, score-0.198]
19 Multilingual topic alignment without parallelism was recently proposed by [6]. [sent-38, score-0.441]
20 Their model requires a list of matched word pairs m (where each pair has one word in each language) and corresponding matching priors π that encode the prior knowledge on how likely the match is to occur. [sent-39, score-0.556]
21 The topics are defined as distributions over word pairs, while the unmatched words come from a unigram distribution specific to each language. [sent-40, score-0.576]
22 Although their model could in principle be extended to more than two languages, their experimental section focused on the bilingual case. [sent-41, score-0.145]
23 One of the key differences between [6] and our method is that we do not hardcode word information, but we use it only as a prior – this way our method becomes less sensitive to errors in the word features. [sent-42, score-0.508]
24 Furthermore, our model automatically extends to multiple languages without any modification, aligning topics even for language pairs for which we have no information, as we show in the experimental section for the Portuguese/French pair. [sent-43, score-0.407]
25 It assumes that θm ∼ Dir(α) (1a), zmn ∼ Mult(θm) (1b), ψk ∼ Dir(β) (1c) and wmn ∼ Mult(ψ_{zmn}) (1d); our extension replaces (1c) by ψk ∼ Dir(βk | φ, y) (2a) together with β ∼ Logistic(y; φ) (2b). [sent-46, score-0.502]
26 Nonparametric extensions in terms of the number of topics can be obtained using Dirichlet process models [2] regarding the generation of topics. [sent-47, score-0.204]
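A minimal generative sketch of (1a)-(1d) with the word-feature-dependent smoother of (2a)-(2b); this is illustrative Python, not the authors' implementation, and it sets β = exp(ykv + yv) deterministically (as in the smoother of Section 2.2) instead of sampling it from the logistic prior.

```python
# Illustrative sketch of the generative process (1a)-(1d); not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(M, Nm, K, V, alpha, beta):
    """alpha: (K,) Dirichlet prior over topics; beta: (K, V) word smoothers."""
    psi = np.array([rng.dirichlet(beta[k]) for k in range(K)])   # (2a): psi_k ~ Dir(beta_k)
    docs = []
    for m in range(M):
        theta_m = rng.dirichlet(alpha)                            # (1a)
        z = rng.choice(K, size=Nm, p=theta_m)                     # (1b)
        w = np.array([rng.choice(V, p=psi[k]) for k in z])        # (1d): w_mn ~ Mult(psi_{z_mn})
        docs.append(w)
    return docs, psi

# beta depends on the smoothing coefficients y; here beta_kv = exp(y_kv + y_v),
# skipping the logistic prior of (2b) for simplicity.
y_kv = rng.normal(scale=0.1, size=(3, 50))
y_v = rng.normal(scale=0.1, size=50)
beta = np.exp(y_kv + y_v)
docs, psi = generate_corpus(M=5, Nm=20, K=3, V=50, alpha=np.full(3, 0.5), beta=beta)
```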
27 Instead of treating the smoother β as a constant for all words, we attempt to infer its values for different words and topics respectively. [sent-49, score-0.418]
28 The above dependency allows us to incorporate features of words as side information. [sent-52, score-0.149]
29 If the features of two words (e.g. ’politics’ and ’politician’) are very similar then it is plausible to assume that their topic distributions should also be quite similar. [sent-55, score-0.396]
30 Conjugate distribution p(θm |α): This is a Dirichlet distribution with parameters α, where αk denotes the smoother for topic k. [sent-62, score-0.436]
31 Word distribution p(wmn |zmn , ψ): We assume that given a topic zmn the word wmn is drawn from a multinomial distribution ψwmn ,zmn . [sent-64, score-1.109]
32 Collapsed distribution p(w|z, β): Integrating out ψk for all topics k yields p(w|z, β) = ∏_{k=1}^{K} [ Γ(β̄_k) / Γ(n^{K}_k + β̄_k) ] ∏_{v=1}^{V} [ Γ(n^{KV}_{kv} + β_{kv}) / Γ(β_{kv}) ], where β̄_k = Σ_{v=1}^{V} β_{kv}. [sent-70, score-0.204]
33 2.2 Priors In order to better control the capacity of our model, we impose a prior on naturally related words, e.g. [sent-71, score-0.430]
34 For this purpose we design a similarity graph G(V, E) with words represented as vertices V and similarity edge weights φuv between vertices u, v ∈ V whenever u is related to v. [sent-74, score-0.265]
35 In particular, the magnitude of φuv can denote the similarity between words u and v. [sent-75, score-0.165]
36 In the following we denote by ykv the topic dependent smoothing coefficients for a given word v and topic k. [sent-76, score-1.216]
37 We impose the smoother log β_{kv} = y_{kv} + y_v and the prior log p(β) = −(1/(2λ²)) [ Σ_{v,v′,k} φ_{v,v′} (y_{kv} − y_{kv′})² + Σ_v y_v² ], where log p(β) is given up to an additive constant and y_v allows for multiplicative topic-unspecific corrections. [sent-77, score-0.643]
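A sketch of the smoother and the graph prior as reconstructed above; the variable names (y_kv, y_v, edges, lam) and the edge-list representation are ours, not the paper's code.

```python
# Sketch of the smoother log beta_kv = y_kv + y_v and -log p(beta).
import numpy as np

def beta_from_y(y_kv, y_v):
    """beta_kv = exp(y_kv + y_v); y_kv has shape (K, V), y_v has shape (V,)."""
    return np.exp(y_kv + y_v[None, :])

def neg_log_prior(y_kv, y_v, edges, lam):
    """-log p(beta) up to an additive constant.
    edges: list of (v, v_prime, phi_vv) similarity-weighted word pairs."""
    total = np.sum(y_v ** 2)
    for v, vp, phi in edges:
        total += phi * np.sum((y_kv[:, v] - y_kv[:, vp]) ** 2)
    return total / (2.0 * lam ** 2)
```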
38 A similar model was used by [3] to capture temporal dependence between topic models computed at different time instances, e.g. [sent-78, score-0.369]
39 when dealing with topic drift over several years in a scientific journal. [sent-80, score-0.369]
40 There the vertices are words at a given time and the edges are between smoothers instantiated at subsequent years. [sent-81, score-0.15]
41 3 Inference In analogy to the collapsed sampler of [8] we also represent the model in a collapsed fashion. [sent-82, score-0.225]
42 3.1 Document Likelihood The likelihood contains two terms: a word-dependent term which can be computed on the fly while resampling data, and a model-dependent term involving the topic counts and the word-topic counts, which can be computed by one pass through the aggregate tables. [sent-85, score-0.423]
43 For the count variables n^{KM}, n^{KV}, n^{K} and n^{M} we denote by the subscript ‘−’ their values after the word wmn and associated topic zmn have been removed from the statistics. [sent-93, score-1.134]
44 Standard calculations yield the following topic probability for resampling: p(zmn = k | rest) ∝ (n^{KM}_{km,−} + α_k) (β_{k,wmn} + n^{KV}_{k,wmn,−}) / (n^{K}_{k,−} + β̄_k) (6). In the appendix we detail how to adapt the sampler of [19] to obtain faster sampling. [sent-94, score-0.438]
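A sketch of one collapsed Gibbs sweep implementing Eq. (6); the count-array layout and names are our own, not the paper's code.

```python
# Sketch of one collapsed Gibbs sweep using Eq. (6).
import numpy as np

def gibbs_sweep(docs, z, n_km, n_kv, n_k, alpha, beta, rng):
    """docs[m][n]: word id, z[m][n]: its current topic; counts are updated in place.
    n_km: (K, M) topic-document counts, n_kv: (K, V) topic-word counts, n_k: (K,)."""
    beta_bar = beta.sum(axis=1)                       # \bar{beta}_k
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[m][n]
            n_km[k_old, m] -= 1; n_kv[k_old, w] -= 1; n_k[k_old] -= 1
            p = (n_km[:, m] + alpha) * (beta[:, w] + n_kv[:, w]) / (n_k + beta_bar)
            k_new = rng.choice(len(p), p=p / p.sum())
            z[m][n] = k_new
            n_km[k_new, m] += 1; n_kv[k_new, w] += 1; n_k[k_new] += 1
    return z
```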
45 The data-dependent contribution to the negative log-likelihood is L_β = Σ_{k=1}^{K} [ log Γ(β̄_k + n^{K}_k) − log Γ(β̄_k) ] + Σ_{k=1}^{K} Σ_{v: n^{KV}_{kv} ≠ 0} [ log Γ(β_{kv}) − log Γ(β_{kv} + n^{KV}_{kv}) ], with gradients given by the appropriate derivatives of the Γ function. [sent-98, score-0.796]
46 After choosing edges φuv according to these matching words, we obtain an optimisation problem directly in terms of the variables ykv and yv . [sent-101, score-0.306]
47 Denote by N (v) the neighbours for word v in G(V, E), and Υ(x) := ∂x log Γ(x) the Digamma function. [sent-102, score-0.238]
48 We have ∂_{y_{kv}} [L_β − log p(β)] = (1/λ²) Σ_{v′ ∈ N(v)} φ_{v,v′} [y_{kv} − y_{kv′}] + β_{kv} [ Υ(β̄_k + n^{K}_k) − Υ(β̄_k) + (Υ(β_{kv}) − Υ(β_{kv} + n^{KV}_{kv})) 1{n^{KV}_{kv} > 0} ]. The gradient with respect to y_v is analogous. [sent-103, score-0.578]
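A sketch of L_β and its gradient with respect to y_kv under the reconstruction above, using scipy's gammaln and digamma; the array layout and names are ours.

```python
# Sketch of L_beta and its y_kv gradient as reconstructed above.
import numpy as np
from scipy.special import gammaln, digamma

def L_beta(beta, n_kv, n_k):
    """Data-dependent part of the negative log-likelihood."""
    beta_bar = beta.sum(axis=1)
    val = np.sum(gammaln(beta_bar + n_k) - gammaln(beta_bar))
    nz = n_kv > 0
    val += np.sum(gammaln(beta[nz]) - gammaln(beta[nz] + n_kv[nz]))
    return val

def grad_y_kv(y_kv, beta, n_kv, n_k, edges, lam):
    """Gradient of L_beta - log p(beta) with respect to y_kv."""
    beta_bar = beta.sum(axis=1)
    g = beta * (digamma(beta_bar + n_k) - digamma(beta_bar))[:, None]
    nz = n_kv > 0
    g[nz] += beta[nz] * (digamma(beta[nz]) - digamma(beta[nz] + n_kv[nz]))
    for v, vp, phi in edges:                      # graph term from -log p(beta)
        g[:, v] += phi * (y_kv[:, v] - y_kv[:, vp]) / lam ** 2
        g[:, vp] += phi * (y_kv[:, vp] - y_kv[:, v]) / lam ** 2
    return g
```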
49 Here zi denotes the topic of word i, and z¬i the topics of all words in the corpus except for i. [sent-108, score-0.971]
50 4 Experiments To demonstrate the usefulness of our model we applied it to a multi-lingual document collection, where we can show a substantial improvement over the standard LDA model on the coordination between topics of different languages. [sent-109, score-0.296]
51 4.1 Dataset Since our goal is to compare topic distributions on different languages we used a parallel corpus [11] with the proceedings of the European Parliament in 11 languages. [sent-111, score-0.583]
52 We randomly sampled 1000 documents from each language, removed infrequent and frequent words and kept only the documents with at least 20 words. [sent-115, score-0.528]
53 Finally, we removed all documents that lost their corresponding translations in this process. [sent-116, score-0.264]
54 cz/dict/, augmented with translations from Google Translate for the most frequent words in our dataset. [sent-125, score-0.148]
55 As described earlier, each word corresponds to a vertex, with an edge whenever two words match in the dictionary. [sent-126, score-0.345]
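A sketch of how such dictionary edges could be built; the dictionary format and the unit edge weight are our assumptions, not the paper's.

```python
# Sketch: build the word graph from a translation dictionary.
def dictionary_edges(vocab_index, translations):
    """vocab_index: word -> column id in beta; translations: iterable of
    (source_word, target_word) pairs, e.g. ('democracy', 'democracia')."""
    edges = []
    for u, v in translations:
        if u in vocab_index and v in vocab_index:
            edges.append((vocab_index[u], vocab_index[v], 1.0))  # phi_uv = 1 for a match
    return edges

edges = dictionary_edges({"democracy": 0, "democracia": 1, "parlement": 2},
                         [("democracy", "democracia")])
```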
56 In our model β = exp(ykv + yv ), so we want to keep both ykv and yv reasonably low to avoid numerical problems, as a large value of either would lead to overflows. [sent-127, score-0.324]
57 We did the same for the standard LDA model, where to learn an asymmetric beta we simply removed ykv to get β = exp(yv ). [sent-129, score-0.205]
58 4 Methodology In our experiments we used all the English documents and a subset of the French and Portuguese ones – this is what we have in a real application, when we try to learn a topic model from web pages: the number of pages in English is far greater than in any other language. [sent-131, score-0.567]
59 First, we run the standard LDA model with all documents mixed together – this is one of our baselines, which we call STD1. [sent-133, score-0.198]
60 We need to start with only one language so that an initial topic-word distribution is built; once that is done the priors are learned and can be used to guide the topic-word distributions in other languages. [sent-140, score-0.141]
61 Finally, as a control experiment we run the standard LDA model in this same setting: first English documents, then all languages mixed. [sent-141, score-0.112]
62 In all experiments we run the Gibbs sampler for a total of 3000 iterations, with the number of topics fixed to 20, and keep the last sample. [sent-143, score-0.273]
63 After a burn-in of 500 iterations, the optimisation over the word smoothing coefficients is done every 100 iterations, using an off-the-shelf L-BFGS [12] optimizer. [sent-144, score-0.33]
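A sketch of this training schedule, interleaving Gibbs sweeps with L-BFGS updates of the word smoothing coefficients (scipy's L-BFGS-B stands in for the off-the-shelf optimizer of [12]); it reuses the helper sketches above, and the `initialise` routine is a hypothetical placeholder.

```python
# Sketch of the training schedule: 3000 sweeps, 500 burn-in, optimise y every 100.
# Uses beta_from_y, gibbs_sweep, L_beta, neg_log_prior, grad_y_kv from earlier sketches.
import numpy as np
from scipy.optimize import minimize

def train(docs, K, V, alpha, edges, lam, n_iters=3000, burn_in=500, every=100, seed=0):
    rng = np.random.default_rng(seed)
    y_kv, y_v = np.zeros((K, V)), np.zeros(V)
    z, n_km, n_kv, n_k = initialise(docs, K, V, rng)   # hypothetical helper: random z and counts
    for it in range(n_iters):
        beta = beta_from_y(y_kv, y_v)
        z = gibbs_sweep(docs, z, n_km, n_kv, n_k, alpha, beta, rng)
        if it >= burn_in and (it - burn_in) % every == 0:
            def objective(flat):
                yk = flat.reshape(K, V)
                return (L_beta(beta_from_y(yk, y_v), n_kv, n_k)
                        + neg_log_prior(yk, y_v, edges, lam))
            def gradient(flat):
                yk = flat.reshape(K, V)
                return grad_y_kv(yk, beta_from_y(yk, y_v), n_kv, n_k, edges, lam).ravel()
            res = minimize(objective, y_kv.ravel(), jac=gradient, method="L-BFGS-B")
            y_kv = res.x.reshape(K, V)
    return z, beta_from_y(y_kv, y_v)
```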
64 5 Evaluation Evaluation of topic models is an open problem – recent work [7] suggests that popular measures based on held-out likelihood, such as perplexity, do not capture whether topics are coherent or not. [sent-148, score-0.573]
65 our goal – to synchronize topics across different languages – and there’s no reason to believe that likelihood measures would assess that: a model where topics are synchronized across languages is not necessarily more likely than a model that is not synchronized. [sent-152, score-0.668]
66 1. Mean number of agreements in top 5 topics: (1/|L1|) Σ_{d1 ∈ L1, d2 = F(d1)} agreements(d1, d2), where agreements(d1, d2) is the cardinality of the intersection of the 5 most likely topics of d1 and d2. [sent-155, score-0.367]
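A sketch of this agreement measure, together with the mean Hellinger distance used in Figures 3-5; the assumption that paired documents appear in the same row order (i.e. the pairing F is the identity) is ours.

```python
# Sketch of the top-5 agreement and mean Hellinger distance measures.
import numpy as np

def mean_top5_agreements(theta_a, theta_b):
    """theta_a, theta_b: (num_docs, K) topic proportions for paired documents."""
    tops_a = np.argsort(-theta_a, axis=1)[:, :5]
    tops_b = np.argsort(-theta_b, axis=1)[:, :5]
    agree = [len(set(a) & set(b)) for a, b in zip(tops_a, tops_b)]
    return float(np.mean(agree))

def mean_hellinger(theta_a, theta_b):
    h = np.sqrt(0.5 * np.sum((np.sqrt(theta_a) - np.sqrt(theta_b)) ** 2, axis=1))
    return float(np.mean(h))
```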
67 In Figure 6 we plot the word smoothing prior for the English word democracy and its French and Portuguese translations, démocratie and democracia, for both the standard LDA model (STD1) and our model (DC), with 20% of the French and Portuguese documents used in training. [sent-162, score-0.897]
68 In STD1 we don’t have topic-specific priors (hence the horizontal line) and the word democracy has a much higher prior, because it happens more often in the dataset (since we have all English documents and only 20% of the French and Portuguese ones). [sent-163, score-0.541]
69 To emphasize that we do not need a parallel corpus we ran a second experiment where we selected the same number of documents of each language, but ensuring that for each document its corresponding translations are not in the dataset, and trained our model (DC) with 100 topics. [sent-168, score-0.385]
70 In this case, however, we cannot compute the distance metrics as before, since we have no information on the actual topic distributions of the documents. [sent-170, score-0.45]
71 This is shown in Table 1, for some selected topics, where the synchronization amongst the different languages is clear. [sent-172, score-0.112]
74 [Figure 3 residue; panels plot mean l2-distance, mean Hellinger distance and % of agreements on first topic against % of French documents for STD1, STD2 and DC.] Figure 3: Comparison of topic distributions in English and French documents. [sent-184, score-0.792]
77 [Figure 4 residue; same panels against % of Portuguese documents.] Figure 4: Comparison of topic distributions in English and Portuguese documents. [sent-198, score-0.792]
80 [Figure 5 residue; same panels against % of Portuguese/French documents.] Figure 5: Comparison of topic distributions in Portuguese and French documents. [sent-209, score-0.594]
81 [Figure 6 residue: prior value plotted against topic index for the two words.] Figure 6: Word smoothing prior for two words in the standard LDA and in our model. [sent-218, score-0.937]
82 Table 1: Top 10 words for some of the learned topics (from top to bottom, respectively, topics 8, 17, 20, 32, 49). [sent-223, score-0.515]
83 (e.g., information is a word in both French and English). [sent-226, score-0.238]
84 2 with edges only between words which exceed a level of proximity. [sent-231, score-0.129]
85 Lexical Similarity: For interpolation between words one could use a distribution over substrings of a word as the feature map. [sent-232, score-0.345]
86 Such lexical similarity makes the sampler less sensitive to issues such as stemming: after all, two words which reduce to the same stem will also have a high lexical similarity score, hence the estimated βkv will yield very similar topic assignments. [sent-234, score-0.845]
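A sketch of one possible substring feature map: each word is mapped to its character trigram counts and compared by cosine similarity; the choice of trigrams and cosine is ours, one possible realization of the idea above.

```python
# Sketch: substring (character trigram) feature map for lexical similarity.
from collections import Counter
import math

def trigrams(word):
    padded = f"#{word}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def lexical_similarity(u, v):
    cu, cv = trigrams(u), trigrams(v)
    dot = sum(cu[g] * cv[g] for g in cu)
    norm = math.sqrt(sum(x * x for x in cu.values())) * math.sqrt(sum(x * x for x in cv.values()))
    return dot / norm if norm else 0.0

print(lexical_similarity("democracy", "democracia"))  # shared stem -> high score
```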
87 This can be achieved by adding edges between a word and all of its synonyms. [sent-236, score-0.26]
88 Since in our framework we only use this information to shape a prior, errors in the synonym list and multiple meanings of a word will not prove fatal. [sent-237, score-0.319]
89 2 Multiple Languages Lexical Similarity: Similar considerations apply for inter-lingual topic models. [sent-239, score-0.369]
90 It is reasonable to assume that lexical similarity generally points to similarity in meaning. [sent-240, score-0.208]
91 Using such features should allow one to synchronise topics even in the absence of dictionaries. [sent-241, score-0.255]
92 However, it is important that similarities are not hardcoded but only imposed as a prior on the topic distribution (e.g. [sent-242, score-0.401]
93 6 Discussion In this paper we described a simple yet general formalism for incorporating word features into LDA, which among other things allows us to synchronise topics across different languages. [sent-245, score-0.511]
94 We performed a number of experiments in the multiple-language setting, in which the goal was to show that our model is able to incorporate dictionary information in order to improve topic alignment across different languages. [sent-246, score-0.463]
95 We also showed that the algorithm is quite effective even in the absence of documents that are explicitly denoted as being aligned (see Table 1). [sent-248, score-0.198]
96 This sets it apart from [13], which requires that a significant fraction (at least 25%) of documents are paired up. [sent-249, score-0.198]
97 For instance, noun / verb disambiguation or named entity recognition are all useful in determining the meaning of words and therefore it is quite likely that they will also aid in obtaining an improved topical mixture model. [sent-252, score-0.133]
98 Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. [sent-255, score-0.369]
99 Lexical triggers and latent semantic analysis for crosslingual language model adaptation. [sent-300, score-0.114]
100 Efficient methods for topic model inference on streaming document collections. [sent-354, score-0.44]
wordName wordTfidf (topN-words)
[('kv', 0.398), ('topic', 0.369), ('zmn', 0.287), ('word', 0.238), ('wmn', 0.215), ('topics', 0.204), ('documents', 0.198), ('nkv', 0.18), ('ykv', 0.18), ('french', 0.173), ('lda', 0.172), ('agreements', 0.163), ('portuguese', 0.148), ('dc', 0.136), ('nkm', 0.131), ('english', 0.113), ('languages', 0.112), ('words', 0.107), ('lexical', 0.092), ('language', 0.091), ('nm', 0.083), ('democracy', 0.082), ('multilingual', 0.082), ('collapsed', 0.078), ('yv', 0.072), ('document', 0.071), ('sampler', 0.069), ('km', 0.068), ('smoother', 0.067), ('democracia', 0.065), ('synonyms', 0.065), ('smoothing', 0.06), ('similarity', 0.058), ('hellinger', 0.057), ('dirichlet', 0.056), ('corpus', 0.053), ('david', 0.05), ('cohesion', 0.049), ('mocratie', 0.049), ('thesauri', 0.049), ('nk', 0.049), ('alignment', 0.043), ('translations', 0.041), ('mimno', 0.037), ('dictionaries', 0.036), ('uv', 0.035), ('dictionary', 0.033), ('bilingual', 0.033), ('elections', 0.033), ('politician', 0.033), ('synchronisation', 0.033), ('synchronise', 0.033), ('synonym', 0.033), ('prior', 0.032), ('resampling', 0.032), ('optimisation', 0.032), ('dir', 0.031), ('gibbs', 0.029), ('metrics', 0.029), ('argmaxk', 0.029), ('parallelism', 0.029), ('infrequently', 0.029), ('politics', 0.029), ('distributions', 0.027), ('topical', 0.026), ('blei', 0.026), ('distance', 0.025), ('removed', 0.025), ('list', 0.025), ('xiaojin', 0.025), ('allocation', 0.024), ('side', 0.024), ('grammar', 0.023), ('meanings', 0.023), ('nicta', 0.023), ('latent', 0.023), ('priors', 0.023), ('cients', 0.023), ('coef', 0.022), ('edges', 0.022), ('parallel', 0.022), ('pass', 0.022), ('coordination', 0.021), ('distributional', 0.021), ('corpora', 0.021), ('editors', 0.021), ('entirely', 0.021), ('andrew', 0.021), ('boost', 0.021), ('german', 0.021), ('vertices', 0.021), ('leverage', 0.02), ('zm', 0.02), ('yahoo', 0.019), ('occurred', 0.019), ('across', 0.018), ('features', 0.018), ('australian', 0.018), ('logistic', 0.018), ('adjust', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
2 0.28145298 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
3 0.20540059 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
Author: Issei Sato, Kenichi Kurihara, Hiroshi Nakagawa
Abstract: We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.
4 0.20046912 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
5 0.19535185 150 nips-2010-Learning concept graphs from text with stick-breaking priors
Author: America Chambers, Padhraic Smyth, Mark Steyvers
Abstract: We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents—a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs. 1
6 0.1932335 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
7 0.1669261 285 nips-2010-Why are some word orders more common than others? A uniform information density account
8 0.15789838 264 nips-2010-Synergies in learning words and their referents
9 0.10857416 177 nips-2010-Multitask Learning without Label Correspondences
10 0.10685648 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
11 0.1009852 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
12 0.086259753 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
13 0.080531381 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
14 0.076691292 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
15 0.068224587 287 nips-2010-Worst-Case Linear Discriminant Analysis
16 0.049070798 228 nips-2010-Reverse Multi-Label Learning
17 0.046271861 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
18 0.046096839 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
19 0.045500215 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
20 0.044438828 283 nips-2010-Variational Inference over Combinatorial Spaces
topicId topicWeight
[(0, 0.133), (1, 0.055), (2, -0.002), (3, -0.022), (4, -0.359), (5, 0.134), (6, 0.247), (7, 0.008), (8, -0.069), (9, 0.008), (10, 0.205), (11, 0.105), (12, -0.02), (13, 0.111), (14, -0.043), (15, 0.051), (16, 0.018), (17, 0.011), (18, -0.112), (19, -0.035), (20, 0.082), (21, -0.08), (22, 0.006), (23, 0.101), (24, 0.123), (25, -0.084), (26, 0.07), (27, -0.027), (28, 0.094), (29, 0.034), (30, 0.055), (31, -0.051), (32, -0.081), (33, 0.059), (34, -0.012), (35, -0.035), (36, 0.033), (37, 0.052), (38, -0.002), (39, -0.019), (40, 0.037), (41, 0.023), (42, -0.021), (43, 0.044), (44, -0.013), (45, 0.028), (46, 0.038), (47, -0.008), (48, -0.005), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.97685933 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
2 0.79770881 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
Author: Issei Sato, Kenichi Kurihara, Hiroshi Nakagawa
Abstract: We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.
3 0.71419567 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
4 0.67725378 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
Author: Eric Wang, Dehong Liu, Jorge Silva, Lawrence Carin, David B. Dunson
Abstract: We consider problems for which one has incomplete binary matrices that evolve with time (e.g., the votes of legislators on particular legislation, with each year characterized by a different such matrix). An objective of such analysis is to infer structure and inter-relationships underlying the matrices, here defined by latent features associated with each axis of the matrix. In addition, it is assumed that documents are available for the entities associated with at least one of the matrix axes. By jointly analyzing the matrices and documents, one may be used to inform the other within the analysis, and the model offers the opportunity to predict matrix values (e.g., votes) based only on an associated document (e.g., legislation). The research presented here merges two areas of machine-learning that have previously been investigated separately: incomplete-matrix analysis and topic modeling. The analysis is performed from a Bayesian perspective, with efficient inference constituted via Gibbs sampling. The framework is demonstrated by considering all voting data and available documents (legislation) during the 220-year lifetime of the United States Senate and House of Representatives. 1
5 0.66763186 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
6 0.62824506 285 nips-2010-Why are some word orders more common than others? A uniform information density account
7 0.62310821 150 nips-2010-Learning concept graphs from text with stick-breaking priors
8 0.49989772 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
9 0.4808417 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
10 0.33467278 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
11 0.33109593 125 nips-2010-Inference and communication in the game of Password
12 0.32170197 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
13 0.32167739 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
14 0.31294295 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
15 0.29719558 177 nips-2010-Multitask Learning without Label Correspondences
16 0.29241052 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
17 0.28938147 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
18 0.26617467 287 nips-2010-Worst-Case Linear Discriminant Analysis
19 0.26347077 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
20 0.24464101 120 nips-2010-Improvements to the Sequence Memoizer
topicId topicWeight
[(9, 0.013), (13, 0.036), (17, 0.044), (27, 0.151), (30, 0.098), (35, 0.014), (43, 0.272), (45, 0.143), (50, 0.044), (52, 0.023), (60, 0.011), (77, 0.028), (78, 0.015), (90, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.77363628 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
2 0.70755124 20 nips-2010-A unified model of short-range and long-range motion perception
Author: Shuang Wu, Xuming He, Hongjing Lu, Alan L. Yuille
Abstract: The human vision system is able to effortlessly perceive both short-range and long-range motion patterns in complex dynamic scenes. Previous work has assumed that two different mechanisms are involved in processing these two types of motion. In this paper, we propose a hierarchical model as a unified framework for modeling both short-range and long-range motion perception. Our model consists of two key components: a data likelihood that proposes multiple motion hypotheses using nonlinear matching, and a hierarchical prior that imposes slowness and spatial smoothness constraints on the motion field at multiple scales. We tested our model on two types of stimuli, random dot kinematograms and multiple-aperture stimuli, both commonly used in human vision research. We demonstrate that the hierarchical model adequately accounts for human performance in psychophysical experiments.
3 0.63784438 98 nips-2010-Functional form of motion priors in human motion perception
Author: Hongjing Lu, Tungyou Lin, Alan Lee, Luminita Vese, Alan L. Yuille
Abstract: It has been speculated that the human motion system combines noisy measurements with prior expectations in an optimal, or rational, manner. The basic goal of our work is to discover experimentally which prior distribution is used. More specifically, we seek to infer the functional form of the motion prior from the performance of human subjects on motion estimation tasks. We restricted ourselves to priors which combine three terms for motion slowness, first-order smoothness, and second-order smoothness. We focused on two functional forms for prior distributions: L2-norm and L1-norm regularization corresponding to the Gaussian and Laplace distributions respectively. In our first experimental session we estimate the weights of the three terms for each functional form to maximize the fit to human performance. We then measured human performance for motion tasks and found that we obtained better fit for the L1-norm (Laplace) than for the L2-norm (Gaussian). We note that the L1-norm is also a better fit to the statistics of motion in natural environments. In addition, we found large weights for the second-order smoothness term, indicating the importance of high-order smoothness compared to slowness and lower-order smoothness. To validate our results further, we used the best fit models using the L1-norm to predict human performance in a second session with different experimental setups. Our results showed excellent agreement between human performance and model prediction – ranging from 3% to 8% for five human subjects over ten experimental conditions – and give further support that the human visual system uses an L1-norm (Laplace) prior.
4 0.62610817 39 nips-2010-Bayesian Action-Graph Games
Author: Albert X. Jiang, Kevin Leyton-brown
Abstract: Games of incomplete information, or Bayesian games, are an important gametheoretic model and have many applications in economics. We propose Bayesian action-graph games (BAGGs), a novel graphical representation for Bayesian games. BAGGs can represent arbitrary Bayesian games, and furthermore can compactly express Bayesian games exhibiting commonly encountered types of structure including symmetry, action- and type-specific utility independence, and probabilistic independence of type distributions. We provide an algorithm for computing expected utility in BAGGs, and discuss conditions under which the algorithm runs in polynomial time. Bayes-Nash equilibria of BAGGs can be computed by adapting existing algorithms for complete-information normal form games and leveraging our expected utility algorithm. We show both theoretically and empirically that our approaches improve significantly on the state of the art. 1
5 0.61968642 161 nips-2010-Linear readout from a neural population with partial correlation data
Author: Adrien Wohrer, Ranulfo Romo, Christian K. Machens
Abstract: How much information does a neural population convey about a stimulus? Answers to this question are known to strongly depend on the correlation of response variability in neural populations. These noise correlations, however, are essentially immeasurable as the number of parameters in a noise correlation matrix grows quadratically with population size. Here, we suggest to bypass this problem by imposing a parametric model on a noise correlation matrix. Our basic assumption is that noise correlations arise due to common inputs between neurons. On average, noise correlations will therefore reflect signal correlations, which can be measured in neural populations. We suggest an explicit parametric dependency between signal and noise correlations. We show how this dependency can be used to ”fill the gaps” in noise correlations matrices using an iterative application of the Wishart distribution over positive definitive matrices. We apply our method to data from the primary somatosensory cortex of monkeys performing a two-alternativeforced choice task. We compare the discrimination thresholds read out from the population of recorded neurons with the discrimination threshold of the monkey and show that our method predicts different results than simpler, average schemes of noise correlations. 1
6 0.61501348 81 nips-2010-Evaluating neuronal codes for inference using Fisher information
7 0.614959 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
8 0.61397791 268 nips-2010-The Neural Costs of Optimal Control
9 0.61212438 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
10 0.61055231 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
11 0.61046422 128 nips-2010-Infinite Relational Modeling of Functional Connectivity in Resting State fMRI
12 0.60887569 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
13 0.60695183 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
14 0.60530239 194 nips-2010-Online Learning for Latent Dirichlet Allocation
15 0.592776 119 nips-2010-Implicit encoding of prior probabilities in optimal neural populations
16 0.59201449 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
17 0.59072083 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
18 0.58962274 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
19 0.58904874 19 nips-2010-A rational decision making framework for inhibitory control
20 0.58658415 123 nips-2010-Individualized ROI Optimization via Maximization of Group-wise Consistency of Structural and Functional Profiles