nips nips2007 nips2007-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. [sent-3, score-0.544]
2 We propose two distributed inference schemes that are motivated from different perspectives. [sent-4, score-0.168]
3 The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. [sent-5, score-0.862]
4 The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. [sent-6, score-0.202]
5 Using five real-world text corpora we show that distributed learning works very well for LDA models, i. [sent-7, score-0.164]
6 , perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. [sent-9, score-0.43]
7 Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. [sent-10, score-0.479]
8 For example, a text corpus with 1 million documents, each containing 1000 words on average, will require approximately 12 Gbytes of memory to store the words, which is beyond the main memory capacity for most single processor machines. [sent-14, score-0.566]
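As a quick sanity check on this figure, a back-of-the-envelope sketch, assuming each stored token costs roughly three 4-byte integers (word id, document id, topic assignment); the per-token cost is our assumption, not a figure from the paper.

```python
# Rough memory estimate for storing a tokenized 1M-document corpus.
# Assumed (not from the paper): ~3 x 4-byte ints per token
# (word id, document id, topic assignment).
num_docs = 1_000_000
words_per_doc = 1_000
bytes_per_token = 3 * 4

total_tokens = num_docs * words_per_doc        # 1e9 tokens
total_gb = total_tokens * bytes_per_token / 1e9
print(f"~{total_gb:.0f} GB")                   # ~12 GB
```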
9 Thus, algorithms that make multiple passes over a corpus of this size (as many clustering and classification algorithms do) will have run times measured in days. [sent-16, score-0.094]
10 An obvious approach for addressing these time and memory issues is to distribute the learning algorithm over multiple processors [1, 2, 3]. [sent-17, score-0.464]
11 However, the computation problem remains non-trivial for a fairly large class of learning algorithms, namely how to combine local processing on each of the processors to arrive at a useful global solution. [sent-19, score-0.446]
12 In this general context we investigate distributed learning algorithms for the LDA model [4]. [sent-20, score-0.169]
13 LDA models are arguably among the most successful recent learning algorithms for analyzing count data such as text. [sent-21, score-0.087]
14 However, they can take days to learn for large corpora, and thus, distributed learning would be particularly useful for this type of model. [sent-22, score-0.145]
15 2 Latent Dirichlet Allocation Before introducing our distributed algorithms for LDA, we briefly review the standard LDA model. [sent-25, score-0.169]
16 LDA models each of D documents as a mixture over K latent topics, each topic being a multinomial distribution over a W-word vocabulary. [sent-26, score-0.167]
17 For the i-th word in document j, a topic assignment z_ij is drawn, with topic k chosen with probability θ_kj, and the word x_ij is then drawn from that topic, taking on value w with probability φ_wk. [sent-28, score-0.298]
18 Finally, a Dirichlet prior with parameter β is placed on the topics φ. [sent-29, score-0.158]
19 Given the observed words x, the task of Bayesian inference is to compute the posterior distribution over the latent topic indices z, the mixing proportions θ, and the topics φ. [sent-31, score-0.418]
20 An efficient procedure is to use collapsed Gibbs sampling [5], where θ and φ are marginalized out, and only the latent topic assignments z are sampled. [sent-32, score-0.099]
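The sampler draws each z_ij from its conditional given all other assignments; with the counts defined above, the standard collapsed conditional is p(z_ij = k | z^¬ij, x) ∝ (N^¬ij_wk + β)(N^¬ij_kj + α) / (N^¬ij_k + Wβ). A minimal single-processor sweep is sketched below in Python/NumPy; the variable and function names are ours, not the paper's code, and this is only an illustrative sketch of the standard update.

```python
import numpy as np

def gibbs_sweep(words, docs, z, Nwk, Nkj, Nk, alpha, beta, rng):
    """One collapsed Gibbs sweep over all word tokens (single processor).

    words[i], docs[i]: word id and document id of token i
    z[i]: current topic assignment of token i
    Nwk[w, k], Nkj[k, j], Nk[k]: word-topic, topic-document, and topic counts
    """
    W, K = Nwk.shape
    for i in range(len(words)):
        w, j, k_old = words[i], docs[i], z[i]
        # remove token i from the count arrays
        Nwk[w, k_old] -= 1; Nkj[k_old, j] -= 1; Nk[k_old] -= 1
        # standard collapsed conditional p(z_i = k | rest)
        p = (Nwk[w, :] + beta) * (Nkj[:, j] + alpha) / (Nk + W * beta)
        k_new = rng.choice(K, p=p / p.sum())
        # add token i back under its new topic
        z[i] = k_new
        Nwk[w, k_new] += 1; Nkj[k_new, j] += 1; Nk[k_new] += 1
    return z
```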
21 3 Distributed Inference Algorithms for LDA We now present two versions of LDA where the data and the parameters are distributed over distinct processors. [sent-37, score-0.145]
22 We distribute the D documents over the P processors, with approximately D/P documents on each processor. [sent-38, score-0.139]
23 We partition the data (words from the documents) into per-processor subsets x_p and the corresponding topic assignments into subsets z_p, where x_p and z_p exist only on processor p. [sent-39, score-0.52]
24 Document-specific topic counts are likewise distributed; however, every processor maintains its own copy of the word-topic counts and the topic counts. [sent-40, score-0.55]
25 3.1 Approximate Distributed Inference In our Approximate Distributed LDA model (AD-LDA), we simply implement LDA on each processor, and simultaneous Gibbs sampling is performed independently on each of the processors, as if each processor were the only one. [sent-43, score-0.455]
26 In particular, the topic counts used in the sampling update sum to N, the total number of words across all processors, as opposed to just the number of words on processor p. [sent-49, score-0.465]
27 After processor p has reassigned its topic assignments z_p, it holds modified local copies of the document-topic, word-topic, and topic counts. [sent-50, score-0.438]
28 To merge back to a single set of counts, after a number of Gibbs sampling steps (e.g., after a full pass through the local data), the per-processor counts are combined in a global update. [sent-51, score-0.086]
29 The word-topic counts are merged with a global update of the form N_wk ← N_wk + Σ_p (N_wk|p − N_wk); note that this global update correctly reflects the topic assignments (i.e., the merged counts are exactly those implied by the union of all local assignments z). [sent-55, score-0.235]
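A minimal sketch of one AD-LDA iteration built on the gibbs_sweep above: each processor sweeps its own shard against a private copy of the global word-topic counts, and the reduce step adds the per-processor count changes back into the global arrays. This is our illustrative reading of the update above (run sequentially here; a real implementation would run the per-shard sweeps in parallel), with shard and variable names that are assumptions, not the paper's code.

```python
import numpy as np

def adlda_iteration(shards, Nwk, Nk, alpha, beta, rng):
    """One AD-LDA iteration: independent local sweeps, then a global merge.

    shards: list of dicts with per-processor 'words', 'docs', 'z', 'Nkj'
    Nwk, Nk: global word-topic and topic counts (identical on every
             processor at the start of the iteration)
    """
    deltas = []
    for shard in shards:                        # would run in parallel
        Nwk_p, Nk_p = Nwk.copy(), Nk.copy()     # private copies of global counts
        gibbs_sweep(shard["words"], shard["docs"], shard["z"],
                    Nwk_p, shard["Nkj"], Nk_p, alpha, beta, rng)
        deltas.append(Nwk_p - Nwk)              # local change in word-topic counts
    Nwk += sum(deltas)                          # N_wk <- N_wk + sum_p (N_wk|p - N_wk)
    Nk[:] = Nwk.sum(axis=0)                     # topic counts follow from merged N_wk
    return Nwk, Nk
```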
30 We can consider this algorithm to be an approximation to the single-processor Gibbs sampler in the following sense: at the start of each iteration, all of the processors have the same set of counts. [sent-58, score-0.455]
31 However, as each processor starts sampling, the global count matrix is changing in a way that is unknown to each processor. [sent-59, score-0.477]
32 Thus, in Equation 3, the sampling is not being done according to the true current global count (or true posterior distribution), but to an approximation. [sent-60, score-0.167]
33 We have experimented with “repairing” reversibility of the sampler by adding a phase which re-traces the Gibbs moves starting at the (global) end-state, but we found that, due to the curse-of-dimensionality, virtually all steps ended up being rejected. [sent-61, score-0.097]
34 This parent has children which represent the topic distributions on the various processors. [sent-65, score-0.112]
35 The model that lives on each processor is simply an LDA model. [sent-67, score-0.385]
36 This model is different from the two other topic hierarchies we found in the literature, namely 1) the deeper version of the hierarchical Dirichlet process mentioned in [6] and 2) Pachinko allocation [7]. [sent-70, score-0.217]
37 These types of hierarchies do not suit our need to facilitate parallel computation. [sent-72, score-0.1]
38 This view clarifies the procedure we have adopted for testing: First we sample assignment variables for the first half of the test document (analogous to folding-in). [sent-93, score-0.1]
39 Given these samples we compute the likelihood of the test document under the model for each processor. [sent-94, score-0.083]
40 Assuming equal prior weights for each processor we then compute responsibilities, which are given by the likelihoods, normalized over processors. [sent-95, score-0.385]
41 The probability of the remainder of the test document is then given by the responsibility-weighted average over the processors. [sent-96, score-0.083]
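A sketch of this test procedure, assuming per-processor topic matrices phi_p and a simple fold-in estimate of the document mixture (the paper folds in by sampling assignment variables; the EM-style fold_in here, and all names, are our own simplifying assumptions). Responsibilities are the fold-in likelihoods normalized over processors under equal prior weights, and the second half of the document is scored by the responsibility-weighted average.

```python
import numpy as np

def fold_in(word_ids, phi, alpha, iters=50):
    """Estimate a document's topic mixture from its fold-in words.
    (Simple fixed-point estimate; the paper folds in via Gibbs sampling.)"""
    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)
    for _ in range(iters):
        r = theta[:, None] * phi[:, word_ids]        # topic responsibilities, K x n
        r /= r.sum(axis=0, keepdims=True)
        theta = r.sum(axis=1) + alpha
        theta /= theta.sum()
    return theta

def score_test_doc(first_half, second_half, phis, alpha):
    """Responsibility-weighted log probability of the second half of a test doc.
    phis[p][k, w] = p(word w | topic k) on processor p."""
    thetas = [fold_in(first_half, phi, alpha) for phi in phis]
    # likelihood of the fold-in half under each processor's model
    log_like = np.array([np.log(theta @ phi[:, first_half]).sum()
                         for theta, phi in zip(thetas, phis)])
    resp = np.exp(log_like - log_like.max())         # equal prior weights
    resp /= resp.sum()
    # responsibility-weighted average predictive probability per word
    log_prob = 0.0
    for w in second_half:
        pw = sum(r * (theta @ phi[:, w])
                 for r, theta, phi in zip(resp, thetas, phis))
        log_prob += np.log(pw)
    return log_prob
```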
42 4 Experiments The two distributed algorithms are initialized by first randomly assigning topics to z, then from this assignment counting topics in documents and words in topics for each processor. [sent-97, score-0.525]
43 Recall for AD-LDA that the count arrays are the same on every processor (initially, and after every global update). [sent-98, score-0.477]
44 Multiple processors were simulated in software (by separating data, running sequentially through each processor, and simulating the global update step), except for the speedup experiments which were run on a 16-processor computer. [sent-100, score-0.577]
45 [Figure 2, left panel: values plotted against Iteration (0 to 30), with the topic mode indicated in the legend.] [sent-131, score-0.31]
46 (Center) Projection of topics onto simplex, showing convergence to mode. [sent-140, score-0.158]
47 The center panel of Figure 2 plots the same run, in the 2-d planar simplex corresponding to the 3-word topic distribution. [sent-144, score-0.19]
48 This panel shows the paths in parameter space of each model, taking a few small steps near the starting point (top right corner), moving down to the true solution (bottom left), and then sampling near the posterior mode for the rest of the iterations. [sent-145, score-0.177]
49 We observed that after the initial few iterations, the individual processor steps and the merge step each resulted in a move closer to the mode. [sent-147, score-0.428]
50 , due to repeated label mismatching of the topics across processors. [sent-151, score-0.158]
51 On a single processor, one can view Gibbs sampling during burn-in as a stochastic algorithm to move up the likelihood surface. [sent-154, score-0.089]
52 With multiple processors, each processor computes an upward direction in its own subspace, keeping all other directions fixed. [sent-155, score-0.404]
53 We conjecture AD-LDA works reliably because saddle points are 1) unstable and 2) rare due to the fact that the posterior appears often to be highly peaked for LDA models and high-dimensional count data sets. [sent-158, score-0.094]
54 For every test document, half the words (chosen at random) are put in a fold-in part, and the remaining words are put in a test part. [sent-160, score-0.179]
55 The document mix is learned using the fold-in part, and log probability is computed using this mix and words from the test part, ensuring that the test words are never seen before being used. [sent-161, score-0.24]
56 For AD-LDA, the perplexity computation exactly follows that of LDA, since a single set of topic counts is saved when a sample is taken. [sent-162, score-0.486]
57 In contrast, all copies of the topics φ are required to compute perplexity for HD-LDA, as described in the previous section. [sent-163, score-0.285]
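The LDA/AD-LDA perplexity computation sketched below follows the fold-in description above, assuming a single saved topic matrix phi and reusing the illustrative fold_in helper from the earlier sketch; for HD-LDA the responsibility-weighted scoring sketched earlier would be used instead. Names and the exact mixture estimate are our assumptions.

```python
import numpy as np

def test_perplexity(test_docs, phi, alpha, rng):
    """Fold-in perplexity: learn each test document's mixture on half of its
    words, then score the held-out half (never seen during fold-in)."""
    total_log_prob, total_tokens = 0.0, 0
    for words in test_docs:                          # words: array of word ids
        words = rng.permutation(words)
        half = len(words) // 2
        fold_half, held_out = words[:half], words[half:]
        theta = fold_in(fold_half, phi, alpha)       # document mix from fold-in part
        p = theta @ phi[:, held_out]                 # probability of each held-out word
        total_log_prob += np.log(p).sum()
        total_tokens += len(held_out)
    return float(np.exp(-total_log_prob / total_tokens))
```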
58 We compared LDA (Gibbs sampling on a single processor) and our two distributed algorithms, AD-LDA and HD-LDA, using three data sets: KOS (from dailykos. [sent-165, score-0.211]
59 Using the three data sets and the three models we computed test set perplexities. Table 1: Size parameters for the three data sets used in perplexity and speedup experiments. KOS: Dtrain 3,000; W 6,906; N 410,000; Dtest 430. NIPS: Dtrain 1,500; W 12,419; N 1,900,000; Dtest 184. NYTIMES: Dtrain 300,000; W 102,660; N 100,000,000; Dtest 34,658. [sent-174, score-0.45]
60 We computed perplexities for our distributed models over a range of numbers of topics and numbers of processors. [sent-175, score-0.39]
61 Figure 3 clearly shows that, for a fixed number of topics, the perplexity results are essentially the same whether we use single-processor LDA or either of the two algorithms with data distributed across multiple processors (either 10 or 100). [sent-178, score-0.834]
62 The figure shows the test set perplexity for KOS (left) and NIPS (right), versus the number of processors P. [sent-179, score-0.598]
63 Perplexity is computed for LDA (circles) and for our distributed models, AD-LDA (crosses) and HD-LDA (squares). [sent-180, score-0.163]
64 Though not shown, perplexities for AD-LDA remained approximately constant as the number of processors was further increased, up to P = 1000 for KOS and P = 500 for NIPS, demonstrating effective distributed learning with only 3 documents on each processor. [sent-181, score-0.688]
65 , topics mutually exclusively distributed over processors)—page limitations preclude a full description of all these results in this paper. [sent-184, score-0.303]
66 (Right) Test perplexity versus number of topics (x-axis from 0 to 700 topics). [sent-186, score-0.299]
67 To properly determine the utility of the distributed algorithms, it is necessary to check whether the parallelized samplers are systematically converging more slowly than single processor sampling. [sent-187, score-0.593]
68 In fact our experiments consistently showed (somewhat surprisingly) that the convergence rate for the distributed algorithms is just as rapid as for the single processor case. [sent-202, score-0.576]
69 During burn-in, up to iteration 200, the distributed models are actually converging slightly faster than single-processor LDA (test perplexity versus iteration number of the Gibbs sampler, NIPS). [sent-204, score-1.036]
70 Also note that 1 iteration of AD-LDA (or HD-LDA) on a parallel computer takes a fraction of the wall-clock time of 1 iteration of LDA. [sent-205, score-0.134]
71 We also investigated whether the results were sensitive to the number of topics used in the models, e. [sent-206, score-0.158]
72 , perhaps the distributed algorithms’ performance diverges when the number of topics becomes very large. [sent-208, score-0.303]
73 Figure 4 (right) shows the test set perplexity computed on the NIPS data set using samples, as a function of the number of topics, for the different algorithms and a fixed number of processors (not shown here are the results for the KOS data set which were quite similar). [sent-209, score-0.722]
74 The perplexities of the different algorithms closely track each other as the number of topics varies. [sent-210, score-0.111]
75 Sometimes the distributed algorithms produce slightly lower perplexities than those of single processor LDA. [sent-211, score-0.663]
76 This lower perplexity may be due to: for AD-LDA, parameters constantly splitting and merging producing an internal averaging effect; and for HD-LDA, test perplexity being computed using copies of saved parameters. [sent-212, score-0.665]
77 Test perplexity computed by averaging 100 separate LDA models was 2117, versus the P=100 test perplexity of 1575 for AD-LDA and HD-LDA. [sent-214, score-0.643]
78 This shows that simple averaging of results from separate processors does not perform nearly as well as the distributed coordinated learning. [sent-215, score-0.571]
79 Our distributed algorithms also perform well under other performance metrics. [sent-216, score-0.169]
80 We performed precision/recall calculations using TREC’s AP and FR collections and measured performance using the well-known mean average precision (MAP) metric used in IR research. [sent-217, score-0.103]
81 All three LDA models have significantly higher precision than TF-IDF on the AP and FR collections (significance was computed using a t-test at the 0.05 level). [sent-219, score-0.121]
82 AD-LDA’s memory requirement scales well as collections grow, because while the number of words N and documents D can get arbitrarily large (which can be offset by increasing P), the vocabulary size W asymptotes. [sent-223, score-0.144]
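A rough per-processor memory sketch behind this scaling argument, under assumed constants (4-byte integers, two integers per stored token): the token storage shrinks as N/P, while the W x K count array is fixed by the vocabulary and the number of topics. The function, its defaults, and the example value of K are illustrative, not taken from the paper.

```python
def adlda_memory_gb(N, W, K, P, bytes_per_int=4, ints_per_token=2):
    """Rough per-processor memory for AD-LDA: N/P local tokens
    (word id + topic assignment) plus a full W x K word-topic count array."""
    token_bytes = (N / P) * ints_per_token * bytes_per_int
    count_bytes = W * K * bytes_per_int
    return (token_bytes + count_bytes) / 1e9

# NYTIMES-scale example: 100M tokens, ~103K vocabulary, 500 topics, 16 processors
print(adlda_memory_gb(N=1e8, W=102_660, K=500, P=16))   # roughly 0.26 GB
```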
83 Using our large NYTIMES data set, we performed speedup experiments on a 16-processor SMP shared memory computer using 1, 2, 4, 8 and 16 processors (since we did not have access to a distributed memory computer). [sent-228, score-0.724]
84 The speedup results, shown in Figure 5 (right), show reasonable parallel efficiency, with a speedup of roughly 8x when using 16 processors. [sent-231, score-0.278]
85 This speedup reduces our NYTIMES 10-day run (880 sec/iteration on 1 processor) to the order of 1 day (105 sec/iteration on 16 processors). [sent-232, score-0.131]
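The speedup and parallel efficiency implied by these per-iteration times can be checked directly; the rough 8x figure quoted above is this ratio:

```python
t1, t16, P = 880.0, 105.0, 16       # sec/iteration from the text
speedup = t1 / t16                  # ~8.4x on 16 processors
efficiency = speedup / P            # ~0.52 parallel efficiency
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.2f}")
```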
86 Note, however, that while the implementation on an SMP machine captures some distributed effects (e. [sent-233, score-0.145]
87 , parallel updates of expected sufficient statistics for mixture models [2, 1]. [sent-239, score-0.117]
88 In the statistical literature, the idea of running multiple MCMC chains in parallel is one approach to parallelization (e. [sent-240, score-0.097]
89 , the method of parallel tempering), but requires that each processor store a copy of the full data set. [sent-242, score-0.463]
90 Since MCMC is inherently sequential, parallel sampling using distributed subsets of the data will not in general yield a proper MCMC sampler except in special cases [10]. [sent-243, score-0.323]
91 Mimno and McCallum [11] recently proposed the DCM-LDA model, where processor-specific sets of topics are learned independently on each processor for local subsets of data, without any communication between processors, followed by a global clustering of the topics from the different processors. [sent-244, score-0.768]
92 While this method is highly scalable, it does not lead to a single global set of topics that represent individual documents, nor is it defined by a generative process. [sent-245, score-0.227]
93 We proposed two different approaches to distributing MCMC sampling across different processors for an LDA model. [sent-246, score-0.472]
94 With AD-LDA we sample from an approximation to the posterior density by allowing different processors to concurrently sample latent topic assignments on their local subsets of the data. [sent-247, score-0.599]
95 With HD-LDA we adapt the underlying LDA model to map to the distributed computational infrastructure. [sent-249, score-0.145]
96 On each processor, both algorithms burn in and converge at the same rate as LDA, yielding significant speedups in practice. [sent-253, score-0.385]
97 The space and time complexity of both models make them scalable to run on enormous problems, for example, collections with billions to trillions of words. [sent-254, score-0.157]
98 , using asynchronous local communication (as opposed to the environment of synchronous global communications covered in this paper) and more complex schemes that allow data to adaptively move from one processor to another. [sent-257, score-0.475]
99 The distributed scheme of AD-LDA can also be used to parallelize other machine learning algorithms. [sent-258, score-0.145]
100 Using the same principles, we have implemented distributed versions of NMF and PLSA, and initial results suggest that these distributed algorithms also work well in practice. [sent-259, score-0.314]
wordName wordTfidf (topN-words)
[('lda', 0.587), ('processors', 0.399), ('processor', 0.385), ('perplexity', 0.266), ('topics', 0.158), ('distributed', 0.145), ('ad', 0.128), ('topic', 0.112), ('gibbs', 0.107), ('speedup', 0.1), ('hd', 0.087), ('kos', 0.087), ('eb', 0.087), ('perplexities', 0.087), ('collections', 0.085), ('parallel', 0.078), ('ge', 0.075), ('dirichlet', 0.072), ('nytimes', 0.067), ('xd', 0.067), ('documents', 0.057), ('sampler', 0.056), ('counts', 0.053), ('document', 0.05), ('global', 0.047), ('count', 0.045), ('sampling', 0.044), ('mode', 0.043), ('memory', 0.04), ('words', 0.04), ('wf', 0.04), ('panel', 0.039), ('corpus', 0.039), ('word', 0.037), ('fr', 0.037), ('nips', 0.037), ('ap', 0.035), ('allocation', 0.034), ('latent', 0.034), ('adlda', 0.033), ('saved', 0.033), ('smp', 0.033), ('versus', 0.033), ('test', 0.033), ('hierarchical', 0.031), ('posterior', 0.031), ('run', 0.031), ('mcmc', 0.03), ('jp', 0.029), ('burn', 0.029), ('distributing', 0.029), ('pachinko', 0.029), ('iteration', 0.028), ('averaging', 0.027), ('implement', 0.026), ('crosses', 0.025), ('distribute', 0.025), ('algorithms', 0.024), ('digamma', 0.023), ('inference', 0.023), ('scalable', 0.023), ('assignments', 0.023), ('move', 0.023), ('converging', 0.022), ('hierarchies', 0.022), ('mix', 0.022), ('hundred', 0.022), ('single', 0.022), ('mixture', 0.021), ('moves', 0.021), ('collapsed', 0.021), ('merging', 0.021), ('communication', 0.02), ('indices', 0.02), ('digital', 0.02), ('merge', 0.02), ('simplex', 0.02), ('left', 0.02), ('starting', 0.02), ('iterations', 0.019), ('corpora', 0.019), ('vocabulary', 0.019), ('indistinguishable', 0.019), ('careful', 0.019), ('google', 0.019), ('center', 0.019), ('directions', 0.019), ('chains', 0.019), ('copies', 0.019), ('systematically', 0.019), ('precision', 0.018), ('deeper', 0.018), ('models', 0.018), ('ciency', 0.018), ('carlo', 0.018), ('monte', 0.018), ('operation', 0.018), ('mccallum', 0.017), ('assignment', 0.017), ('qualitatively', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
2 0.28658086 183 nips-2007-Spatial Latent Dirichlet Allocation
Author: Xiaogang Wang, Eric Grimson
Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA. 1
3 0.24663165 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
4 0.16923836 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
5 0.16861032 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
6 0.13820478 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
7 0.13337462 197 nips-2007-The Infinite Markov Model
8 0.1181394 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
9 0.10999414 129 nips-2007-Mining Internet-Scale Software Repositories
10 0.10949049 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
11 0.071451597 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
12 0.069989435 70 nips-2007-Discriminative K-means for Clustering
13 0.060113899 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
14 0.04488983 84 nips-2007-Expectation Maximization and Posterior Constraints
15 0.043552175 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
16 0.041495636 145 nips-2007-On Sparsity and Overcompleteness in Image Models
17 0.03753712 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
18 0.03702737 125 nips-2007-Markov Chain Monte Carlo with People
19 0.03583464 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
20 0.034301054 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
topicId topicWeight
[(0, -0.155), (1, 0.075), (2, -0.065), (3, -0.365), (4, 0.108), (5, -0.069), (6, 0.08), (7, -0.156), (8, -0.04), (9, 0.151), (10, -0.009), (11, 0.005), (12, -0.055), (13, 0.161), (14, 0.059), (15, 0.09), (16, 0.177), (17, -0.13), (18, 0.002), (19, -0.038), (20, 0.02), (21, -0.012), (22, -0.065), (23, 0.044), (24, 0.059), (25, -0.034), (26, -0.046), (27, 0.044), (28, 0.043), (29, -0.042), (30, 0.034), (31, -0.061), (32, -0.008), (33, 0.021), (34, -0.037), (35, -0.026), (36, 0.06), (37, -0.043), (38, -0.033), (39, 0.009), (40, 0.042), (41, -0.003), (42, 0.049), (43, 0.016), (44, 0.065), (45, 0.003), (46, 0.032), (47, 0.046), (48, -0.027), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.96118772 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
2 0.86165512 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
3 0.76423305 183 nips-2007-Spatial Latent Dirichlet Allocation
Author: Xiaogang Wang, Eric Grimson
Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA. 1
4 0.71390653 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
5 0.61571914 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
6 0.54765844 129 nips-2007-Mining Internet-Scale Software Repositories
7 0.47672254 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
8 0.40586993 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
9 0.338249 197 nips-2007-The Infinite Markov Model
10 0.31437281 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
11 0.29711097 70 nips-2007-Discriminative K-means for Clustering
12 0.22639757 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
13 0.22287337 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
14 0.21061622 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
15 0.19585165 9 nips-2007-A Probabilistic Approach to Language Change
16 0.19551557 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
17 0.19428606 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
18 0.19098945 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
19 0.18588817 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
20 0.18405376 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
topicId topicWeight
[(5, 0.038), (13, 0.038), (16, 0.026), (18, 0.022), (21, 0.069), (31, 0.01), (34, 0.02), (35, 0.025), (45, 0.194), (47, 0.082), (49, 0.034), (83, 0.1), (85, 0.04), (87, 0.134), (90, 0.069)]
simIndex simValue paperId paperTitle
1 0.90622973 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black
Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy!” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined using them to learn a language from evidence may not be trivial [1].) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain. 
1 Os r, b Is Ws Figure 1: Graphical model describing the generation of words (Ws ) from an intention (Is ) and lexicon ( ), and intention from the objects present in a situation (Os ). The plate indicates multiple copies of the model for different situation/utterance pairs (s). Dotted portions indicate additions to include the generation of social cues Ss from intentions. Ss ∀s 1 The Model Behind each linguistic utterance is a meaning that the speaker intends to communicate. Our model operates by attempting to infer this intended meaning (which we call the intent) on the basis of the utterance itself and observations of the physical and social context. For the purpose of modeling early word learning—which consists primarily of learning words for simple object categories—in our model, we assume that intents are simply groups of objects. To state the model formally, we assume the non-linguistic situation consists of a set Os of objects and that utterances are unordered sets of words Ws 1 . The lexicon is a (many-to-many) map from words to objects, which captures the meaning of those words. (Syntax enters our model only obliquely by different treatment of words depending on whether they are in the lexicon or not—that is, whether they are common nouns or other types of words.) In this setting the speaker’s intention will be captured by a set of objects in the situation to which she intends to refer: Is ⊆ Os . This setup is indicated in the graphical model of Fig. 1. Different situation-utterance pairs Ws , Os are independent given the lexicon , giving: P (Ws |Is , ) · P (Is |Os ). P (W| , O) = s (1) Is We further simplify by assuming that P (Is |Os ) ∝ 1 (which could be refined by adding a more detailed model of the communicative intentions a person is likely to form in different situations). We will assume that words in the utterance are generated independently given the intention and the lexicon and that the length of the utterance is observed. Each word is then generated from the intention set and lexicon by first choosing whether the word is a referential word or a non-referential word (from a binomial distribution of weight γ), then, for referential words, choosing which object in the intent it refers to (uniformly). This process gives: P (Ws |Is , ) = (1 − γ)PNR (w| ) + γ w∈Ws x∈Is 1 PR (w|x, ) . |Is | The probability of word w referring to object x is PR (w|x, ) ∝ δx∈ w occurring as a non-referring word is PNR (w| ) ∝ 1 if (w) = ∅, κ otherwise. (w) , (2) and the probability of word (3) (this probability is a distribution over all words in the vocabulary, not just those in lexicon ). The constant κ is a penalty for using a word in the lexicon as a non-referring word—this penalty indirectly enforces a light-weight difference between two different groups of words (parts-of-speech): words that refer and words that do not refer. Because the generative structure of this model exposes the role of speaker’s intentions, it is straightforward to add non-linguistic social cues. We assume that social cues such as pointing are generated 1 Note that, since we ignore word order, the distribution of words in a sentence should be exchangeable given the lexicon and situation. This implies, by de Finetti’s theorem, that they are independent conditioned on a latent state—we assume that the latent state giving rise to words is the intention of the speaker. 2 from the speaker’s intent independently of the linguistic aspects (as shown in the dotted arrows of Fig. 1). With the addition of social cues Ss , Eq. 
1 becomes: P (Ws |Is , ) · P (Ss |Is ) · P (Is |Os ). P (W| , O) = s (4) Is We assume that the social cues are a set Si (x) of independent binary (cue present or not) feature values for each object x ∈ Os , which are generated through a noisy-or process: P (Si (x)=1|Is , ri , bi ) = 1 − (1 − bi )(1 − ri )δx∈Is . (5) Here ri is the relevance of cue i, while bi is its base rate. For the model without social cues the posterior probability of a lexicon given a set of situated utterances is: P ( |W, O) ∝ P (W| , O)P ( ). (6) And for the model with social cues the joint posterior over lexicon and cue parameters is: P ( , r, b|W, O) ∝ P (W| , r, b, O)P ( )P (r, b). (7) We take the prior probability of a lexicon to be exponential in its size: P ( ) ∝ e−α| | , and the prior probability of social cue parameters to be uniform. Given the model above and the corpus described below, we found the best lexicon (or lexicon and cue parameters) according to Eq. 6 and 7 by MAP inference using stochastic search2 . 2 Previous work While cross-situational word-learning has been widely discussed in the empirical literature, e.g., [2], there have been relatively few attempts to model this process computationally. Siskind [3] created an ambitious model which used deductive rules to make hypotheses about propositional word meanings their use across situations. This model achieved surprising success in learning word meanings in artificial corpora, but was extremely complex and relied on the availability of fully coded representations of the meaning of each sentence, making it difficult to extend to empirical corpus data. More recently, Yu and Ballard [4] have used a machine translation model (similar to IBM Translation Model I) to learn word-object association probabilities. In their study, they used a pre-existing corpus of mother-infant interactions and coded the objects present during each utterance (an example from this corpus—illustrated with our own coding scheme—is shown in Fig. 2). They applied their translation model to estimate the probability of an object given a word, creating a table of associations between words and objects. Using this table, they extracted a lexicon (a group of word-object mappings) which was relatively accurate in its guesses about the names of objects that were being talked about. They further extended their model to incorporate prosodic emphasis on words (a useful cue which we will not discuss here) and joint attention on objects. Joint attention was coded by hand, isolating a subset of objects which were attended to by both mother and infant. Their results reflected a sizable increase in recall with the use of social cues. 3 Materials and Assessment Methods To test the performance of our model on natural data, we used the Rollins section of the CHILDES corpus[5]. For comparison with the model by Yu and Ballard [4], we chose the files me03 and di06, each of which consisted of approximately ten minutes of interaction between a mother and a preverbal infant playing with objects found in a box of toys. Because we were not able to obtain the exact corpus Yu and Ballard used, we recoded the objects in the videos and added a coding of social cues co-occurring with each utterance. We annotated each utterance with the set of objects visible to the infant and with a social coding scheme (for an illustrated example, see Figure 2). Our social code included seven features: infants eyes, infants hands, infants mouth, infant touching, mothers hands, mothers eyes, mother touching. 
For each utterance, this coding created an object by social feature matrix. 2 In order to speed convergence we used a simulated tempering scheme with three temperature chains and a range of data-driven proposals. 3 Figure 2: A still frame from our corpus showing the coding of objects and social cues. We coded all mid-sized objects visible to the infant as well as social information including what both mother and infant were touching and looking at. We evaluated all models based on their coverage of a gold-standard lexicon, computing precision (how many of the word-object mappings in a lexicon were correct relative to the gold-standard), recall (how many of the total correct mappings were found), and their geometric mean, F-score. However, the gold-standard lexicon for word-learning is not obvious. For instance, should it include the mapping between the plural “pigs” or the sound “oink” and the object PIG? Should a goldstandard lexicon include word-object pairings that are correct but were not present in the learning situation? In the results we report, we included those pairings which would be useful for a child to learn (e.g., “oink” → PIG) but not including those pairings which were not observed to co-occur in the corpus (however, modifying these decisions did not affect the qualitative pattern of results). 4 Results For the purpose of comparison, we give scores for several other models on the same corpus. We implemented a range of simple associative models based on co-occurrence frequency, conditional probability (both word given object and object given word), and point-wise mutual information. In each of these models, we computed the relevant statistic across the entire corpus and then created a lexicon by including all word-object pairings for which the association statistic met a threshold value. We additionally implemented a translation model (based on Yu and Ballard [4]). Because Yu and Ballard did not include details on how they evaluated their model, we scored it in the same way as the other associative models, by creating an association matrix based on the scores P (O|W ) (as given in equation (3) in their paper) and then creating a lexicon based on a threshold value. In order to simulate this type of threshold value for our model, we searched for the MAP lexicon over a range of parameters α in our prior (the larger the prior value, the less probable a larger lexicon, thus this manipulation served to create more or less selective lexicons) . Base model. In Figure 3, we plot the precision and the recall for lexicons across a range of prior parameter values for our model and the full range of threshold values for the translation model and two of the simple association models (since results for the conditional probability models were very similar but slightly inferior to the performance of mutual information, we did not include them). For our model, we averaged performance at each threshold value across three runs of 5000 search iterations each. Our model performed better than any of the other models on a number of dimensions (best lexicon shown in Table 1), both achieving the highest F-score and showing a better tradeoff between precision and recall at sub-optimal threshold values. The translation model also performed well, increasing precision as the threshold of association was raised. 
Surprisingly, standard cooccurrence statistics proved to be relatively ineffective at extracting high-scoring lexicons: at any given threshold value, these models included a very large number of incorrect pairs. Table 1: The best lexicon found by the Bayesian model (α=11, γ=0.2, κ=0.01). baby → book hand → hand bigbird → bird hat → hat on → ring bird → rattle meow → kitty ring → ring 4 birdie → duck moocow → cow sheep → sheep book → book oink → pig 1 Co!occurrence frequency Mutual information Translation model Bayesian model 0.9 0.8 0.7 recall 0.6 0.5 0.4 0.3 F=0.54 F=0.44 F=0.21 F=0.12 0.2 0.1 0 0 0.2 0.4 0.6 precision 0.8 1 Figure 3: Comparison of models on corpus data: we plot model precision vs. recall across a range of threshold values for each model (see text). Unlike standard ROC curves for classification tasks, the precision and recall of a lexicon depends on the entire lexicon, and irregularities in the curves reflect the small size of the lexicons). One additional virtue of our model over other associative models is its ability to determine which objects the speaker intended to refer to. In Table 2, we give some examples of situations in which the model correctly inferred the objects that the speaker was talking about. Social model. While the addition of social cues did not increase corpus performance above that found in the base model, the lexicons which were found by the social model did have several properties that were not present in the base model. First, the model effectively and quickly converged on the social cues that we found subjectively important in viewing the corpus videos. The two cues which were consistently found relevant across the model were (1) the target of the infant’s gaze and (2) the caregiver’s hand. These data are especially interesting in light of the speculation that infants initially believe their own point of gaze is a good cue to reference, and must learn over the second year that the true cue is the caregiver’s point of gaze, not their own [6]. Second, while the social model did not outperform the base model on the full corpus (where many words were paired with their referents several times), on a smaller corpus (taking every other utterance), the social cue model did slightly outperform a model without social cues (max F-score=0.43 vs. 0.37). Third, the addition of social cues allowed the model to infer the intent of a speaker even in the absence of a word being used. In the right-hand column of Table 2, we give an example of a situation in which the caregiver simply says ”see that?” but from the direction of the infant’s eyes and the location of her hand, the model correctly infers that she is talking about the COW, not either of the other possible referents. This kind of inference might lead the way in allowing infants to learn words like pronouns, which serve pick out an unambiguous focus of attention (one that is so obvious based on social and contextual cues that it does not need to be named). Finally, in the next section we show that the addition of social cues to the model allows correct performance in experimental tests of social generalization which only children older than 18 months can pass, suggesting perhaps that the social model is closer to the strategy used by more mature word learners. Table 2: Intentions inferred by the Bayesian model after having learned a lexicon from the corpus. (IE=Infant’s eyes, CH=Caregiver’s hands). 
Words Objects Social Cues Inferred intention “look at the moocow” COW GIRL BEAR “see the bear by the rattle?” BEAR RATTLE COW COW BEAR RATTLE 5 “see that?” BEAR RATTLE COW IE & CH→COW COW situation: !7.3, corpus: !631.1, total: !638.4
same-paper 2 0.78956914 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
3 0.77917331 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon
Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
4 0.73484874 59 nips-2007-Continuous Time Particle Filtering for fMRI
Author: Lawrence Murray, Amos J. Storkey
Abstract: We construct a biologically motivated stochastic differential model of the neural and hemodynamic activity underlying the observed Blood Oxygen Level Dependent (BOLD) signal in Functional Magnetic Resonance Imaging (fMRI). The model poses a difficult parameter estimation problem, both theoretically due to the nonlinearity and divergence of the differential system, and computationally due to its time and space complexity. We adapt a particle filter and smoother to the task, and discuss some of the practical approaches used to tackle the difficulties, including use of sparse matrices and parallelisation. Results demonstrate the tractability of the approach in its application to an effective connectivity study. 1
5 0.7266975 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
Author: Leonid Sigal, Alexandru Balan, Michael J. Black
Abstract: Estimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a conditional mixture of kernel regressors. This predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height, weight, etc.) in both calibrated and uncalibrated camera environments. 1
6 0.70185632 129 nips-2007-Mining Internet-Scale Software Repositories
7 0.69061226 183 nips-2007-Spatial Latent Dirichlet Allocation
8 0.67965627 189 nips-2007-Supervised Topic Models
9 0.63882214 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
10 0.63810605 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
11 0.63535154 47 nips-2007-Collapsed Variational Inference for HDP
12 0.6326111 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
13 0.629875 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
14 0.62559587 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
15 0.61773592 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
16 0.61473632 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression
17 0.61391574 56 nips-2007-Configuration Estimates Improve Pedestrian Finding
18 0.61387193 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
19 0.61185914 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
20 0.60970092 180 nips-2007-Sparse Feature Learning for Deep Belief Networks