nips nips2009 nips2009-153 knowledge-graph by maker-knowledge-mining

153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model


Source: pdf

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract. We propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. [sent-5, score-1.16]

2 In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e. not content-related. [sent-6, score-1.378]

3 The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. [sent-9, score-0.723]

4 The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. [sent-10, score-1.277]

5 We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images. [sent-11, score-0.461]

6 1 Introduction Recently there has been great interest in social annotations, also called collaborative tagging or folksonomy, created by users freely annotating objects such as web pages [7], photographs [9], blog posts [23], videos [26], music [19], and scientific papers [5]. [sent-12, score-0.272]

7 Delicious [7], which is a social bookmarking service, and Flickr [9], which is an online photo sharing service, are two representative social annotation services, and they have succeeded in collecting huge numbers of annotations. [sent-13, score-0.677]

8 Since users can attach annotations freely in social annotation services, the annotations include those that do not describe the semantics of the content, and are, therefore, not content-related [10]. [sent-14, score-1.803]

9 For example, annotations such as ’nikon’ or ’canon’ in a social photo service often represent the name of the manufacturer of the camera with which the photographs were taken, or annotations such as ’2008’ or ’november’ indicate when they were taken. [sent-15, score-1.622]

10 Other examples of content-unrelated annotations include those designed to remind the annotator such as ’toread’, those identifying qualities such as ’great’, and those identifying ownership. [sent-16, score-0.689]

11 Content-unrelated annotations can often constitute noise if used for training samples in machine learning tasks, such as automatic text classification and image recognition. [sent-17, score-0.689]

12 We can improve classifier performance if we can employ huge amounts of social annotation data from which the content-unrelated annotations have been filtered out. [sent-19, score-1.114]

13 Content-unrelated annotations may also constitute noise in information retrieval. [sent-20, score-0.689]

14 In this paper, we propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated data. [sent-22, score-0.909]

15 A number of methods for automatic annotation have been proposed [1, 2, 8, 16, 17]. [sent-23, score-0.337]

16 The extraction of content-related annotations can improve performance of machine learning and information retrieval tasks. [sent-25, score-0.723]

17 The proposed model is a generative model for content and annotations. [sent-27, score-0.231]

18 We assume that each annotation is associated with a latent variable that indicates whether it is related to the content or not, and the annotation originates either from the topics that generated the content or from a content-unrelated general distribution depending on the latent variable. [sent-29, score-1.199]

19 Intuitively speaking, this approach considers an annotation to be content-related when it is almost always attached to objects in a specific topic. [sent-31, score-0.357]

20 As regards real social annotation data, the annotations are not explicitly labeled as content related/unrelated. [sent-32, score-1.309]

21 The proposed model is an unsupervised model, and so can extract content-related annotations without content relevance labels. [sent-33, score-0.984]

22 A topic model is a hierarchical probabilistic model, in which a document is modeled as a mixture of topics, and where a topic is modeled as a probability distribution over words. [sent-35, score-0.451]
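
In symbols, and as a standard formulation rather than a quotation from this paper, the word distribution of a document under such a topic model is the mixture

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\, \phi_{kw},
\qquad \theta_d \sim \mathrm{Dirichlet}(\alpha), \quad \phi_k \sim \mathrm{Dirichlet}(\beta),
```

so a document's words are explained jointly by its topic proportions θ_d and the per-topic word distributions φ_k.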

23 The proposed method is an extension of the correspondence latent Dirichlet allocation (Corr-LDA) [2], which is a generative topic model for content and annotations. [sent-37, score-0.253]

24 Since Corr-LDA assumes that all annotations are related to the content, it cannot be used for separating content-related annotations from content-unrelated ones. [sent-38, score-1.378]

25 A topic model with a background distribution [4] assumes that words are generated either from a topic-specific distribution or from a corpus-wide background distribution. [sent-39, score-0.228]

26 In the rest of this paper, we assume that the given data are annotated document data, in which the content of each document is represented by words appearing in the document, and each document has both content-related and content-unrelated annotations. [sent-41, score-0.692]

27 These include annotated image data, where each image is represented with visual words [6], and annotated movie data, where each movie is represented by user ratings. [sent-43, score-0.261]

28 Proposed method. Suppose that we have a set of D documents, and each document consists of a pair of words and annotations (w_d, t_d), where w_d = {w_{dn}}_{n=1}^{N_d} is the set of words in a document that represents the content, and t_d = {t_{dm}}_{m=1}^{M_d} is the set of assigned annotations, or tags. [sent-44, score-1.17]
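
As a hedged illustration of this notation only (not code from the paper), one convenient in-memory layout stores each document as a pair of index lists; all names and values below are hypothetical.

```python
# Hypothetical layout for D annotated documents (illustrative, not from the paper):
# words[d] holds the word indices w_dn, n = 1..N_d (the content of document d)
# tags[d]  holds the annotation indices t_dm, m = 1..M_d (the tags of document d)
words = [
    [3, 17, 17, 42],       # document 0: N_0 = 4 content words
    [5, 8, 42, 42, 101],   # document 1: N_1 = 5 content words
]
tags = [
    [0, 7],                # document 0: M_0 = 2 annotations
    [7, 7, 12],            # document 1: M_1 = 3 annotations
]
D = len(words)             # number of documents
```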

29 Figure 1: Graphical model representation of the proposed topic model with content relevance (variables α, θ, λ, w, c, η, z, t, r, φ, ψ, β, γ over plates N, M, D, K, K+1). [sent-46, score-0.394]

30 The proposed topic model first generates the content, and then generates the annotations. [sent-47, score-0.255]

31 The generative process for the content is the same as basic topic models, such as latent Dirichlet allocation (LDA) [3]. [sent-48, score-0.412]

32 Each document has topic proportions θd that are sampled from a Dirichlet distribution. [sent-49, score-0.288]

33 For each of the N_d words in the document, a topic z_{dn} is chosen from the topic proportions, and then word w_{dn} is generated from a topic-specific multinomial distribution φ_{z_{dn}}. [sent-50, score-0.64]

34 In the generative process for annotations, each annotation is assessed as to whether it is related to the content or not. [sent-51, score-0.496]

35 In particular, each annotation is associated with a latent variable r_{dm} with value r_{dm} = 0 if annotation t_{dm} is not related to the content; r_{dm} = 1 otherwise. [sent-52, score-1.111]

36 If the annotation is not related to the content (r_{dm} = 0), annotation t_{dm} is sampled from the general, topic-unrelated multinomial distribution ψ_0. [sent-53, score-0.868]

37 If the annotation is related to the content (r_{dm} = 1), annotation t_{dm} is sampled from the topic-specific multinomial distribution ψ_{c_{dm}}, where c_{dm} is the topic for the annotation. [sent-54, score-1.158]

38 Topic c_{dm} is sampled uniformly at random from the topics z_d = {z_{dn}}_{n=1}^{N_d} that have previously generated the content. [sent-55, score-0.253]

39 This means that topic c_{dm} is generated from a multinomial distribution in which P(c_{dm} = k) = N_{kd}/N_d, where N_{kd} is the number of words that are assigned to topic k in the dth document. [sent-56, score-0.69]

40 For each topic k = 1, ..., K: (a) draw word probability φ_k ∼ Dirichlet(β); (b) draw annotation probability ψ_k ∼ Dirichlet(γ). [sent-60, score-0.505]

41 For each document d = 1, ..., D: (a) draw topic proportions θ_d ∼ Dirichlet(α); (b) for each word n = 1, ..., N_d: i. draw topic z_{dn} ∼ Multinomial(θ_d). [sent-61, score-0.329]

42 ii. Draw word w_{dn} ∼ Multinomial(φ_{z_{dn}}); (c) for each annotation m = 1, ..., M_d: [sent-63, score-0.406]

43 i. Draw topic c_{dm} ∼ Multinomial({N_{kd}/N_d}_{k=1}^{K}). [sent-64, score-0.29]

44 ii. Draw relevance r_{dm} ∼ Bernoulli(λ). [sent-65, score-0.308]

45 iii. Draw annotation t_{dm} ∼ Multinomial(ψ_0) if r_{dm} = 0, and t_{dm} ∼ Multinomial(ψ_{c_{dm}}) otherwise, where α, β and γ are Dirichlet distribution parameters and η is a beta distribution parameter. [sent-66, score-0.39]
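
To make the generative story above concrete, here is a minimal simulation sketch. It is an assumption-labeled illustration, not the authors' code: the sizes K, V, T and the hyperparameter values are hypothetical, and psi[0] plays the role of the content-unrelated tag distribution ψ_0.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 20, 1000, 500                 # hypothetical: topics, word vocab, tag vocab
alpha, beta, gamma, eta = 0.1, 0.01, 0.01, 1.0

phi = rng.dirichlet(np.full(V, beta), size=K)       # phi_k: word distribution of topic k
psi = rng.dirichlet(np.full(T, gamma), size=K + 1)  # psi[0]: unrelated tags; psi[1..K]: per topic
lam = rng.beta(eta, eta)                            # relevance probability lambda

def generate_document(N_d=50, M_d=5):
    theta = rng.dirichlet(np.full(K, alpha))        # topic proportions theta_d
    z = rng.choice(K, size=N_d, p=theta)            # z_dn: topic of each content word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_dn ~ Multinomial(phi_{z_dn})
    t, r = [], []
    for _ in range(M_d):
        c = rng.choice(z)                 # c_dm: uniform over the words' topics, P(c=k) = N_kd/N_d
        rel = rng.random() < lam          # r_dm ~ Bernoulli(lambda)
        dist = psi[c + 1] if rel else psi[0]   # topic-specific tags if related, psi_0 otherwise
        t.append(rng.choice(T, p=dist))
        r.append(int(rel))
    return w, np.array(t), np.array(r)

w_d, t_d, r_d = generate_document()
```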

46 As with Corr-LDA, the proposed model first generates the content and then generates the annotations by modeling the conditional distribution of latent topics for annotations given the topics for the content. [sent-68, score-1.917]

47 Therefore, it achieves a comprehensive fit of the joint distribution of content and annotations and finds superior conditional distributions of annotations given content [2]. [sent-69, score-1.768]

48 Similarly, the second term is given as follows: P(W|Z, β) = (Γ(βW)/Γ(β)^W)^K ∏_k [ ∏_w Γ(N_{kw} + β) ] / Γ(N_k + βW), where N_{kw} is the number of times word w has been assigned to topic k, and N_k = Σ_w N_{kw}. [sent-73, score-0.204]

49 M_{kt} is the number of times annotation t has been identified as content-unrelated if k = 0, or as content-related topic k if k ≠ 0, and M_k = Σ_t M_{kt}. [sent-75, score-0.464]

50 The fifth term is given as follows: P(C|Z) = ∏_d ∏_k (N_{kd}/N_d)^{M_{kd}}, where M_{kd} is the number of annotations that are assigned to topic k in the dth document. [sent-78, score-0.969]

51 The inference of the latent topics Z given content W and annotations T can be efficiently computed using collapsed Gibbs sampling [11]. [sent-79, score-1.037]
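
As a rough illustration of what one collapsed Gibbs update looks like, the sketch below performs a standard LDA-style sweep over the word-topic assignments z, given count arrays maintained by the caller. It deliberately omits the extra factor that couples z to the annotation topics c through P(C|Z), so it is a simplification for exposition, not a reimplementation of the paper's sampler.

```python
import numpy as np

def gibbs_sweep_words(words, z, n_kw, n_k, n_kd, alpha, beta, rng):
    """One simplified collapsed Gibbs sweep over word-topic assignments z[d][n].

    Uses the standard LDA conditional
        p(z_dn = k | rest)  proportional to  (n_kd + alpha) * (n_kw + beta) / (n_k + beta * V);
    the additional term from annotations assigned via c_dm is omitted here.
    """
    K, V = n_kw.shape
    for d, doc in enumerate(words):
        for n, w in enumerate(doc):
            k_old = z[d][n]                        # remove the current assignment from the counts
            n_kw[k_old, w] -= 1; n_k[k_old] -= 1; n_kd[d, k_old] -= 1
            p = (n_kd[d] + alpha) * (n_kw[:, w] + beta) / (n_k + beta * V)
            k_new = rng.choice(K, p=p / p.sum())   # resample the topic of word n
            z[d][n] = k_new
            n_kw[k_new, w] += 1; n_k[k_new] += 1; n_kd[d, k_new] += 1
    return z
```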

52 Experiments: synthetic content-unrelated annotations. We evaluated the proposed method quantitatively by using labeled text data from the 20 Newsgroups corpus [18] and adding synthetic content-unrelated annotations. [sent-85, score-0.725]

53 Specifically, in the 20News1 data, the unique number of content-unrelated annotations was set at ten, and the number of content-unrelated annotations per document was set at {1, · · · , 10}. [sent-89, score-1.554]

54 In the 20News2 data, the unique number of content-unrelated annotations was set at {1, · · · , 10}, and the number of content-unrelated annotations per document was set at one. [sent-90, score-1.554]
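
The paper does not spell out exactly how the synthetic tags are generated, so the sketch below simply assumes they are drawn uniformly from a small pool of dedicated content-unrelated tag ids and appended to each document's tag list; all names are hypothetical.

```python
import numpy as np

def add_unrelated_tags(tags, tag_vocab_size, n_unrelated_vocab, n_per_doc, rng):
    """Append synthetic content-unrelated annotations to every document (illustrative only).

    Assumption: the unrelated tags get fresh ids after the real tag vocabulary and are
    drawn uniformly from a pool of size n_unrelated_vocab (e.g. 10 for 20News1,
    1..10 for 20News2), with n_per_doc of them added per document.
    """
    unrelated_ids = np.arange(tag_vocab_size, tag_vocab_size + n_unrelated_vocab)
    noisy = []
    for doc_tags in tags:
        extra = rng.choice(unrelated_ids, size=n_per_doc)
        noisy.append(list(doc_tags) + [int(t) for t in extra])
    return noisy
```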

55 Corr-LDA [2] is a topic model for words and annotations that does not take the relevance to content into consideration. [sent-99, score-1.176]

56 For the proposed method and Corr-LDA, we set the number of latent topics, K, to 20, and estimated latent topics and parameters by using collapsed Gibbs sampling and the fixed-point iteration method, respectively. [sent-100, score-0.243]

57 We evaluated the predictive performance of each method using the perplexity of held-out content-related annotations given the content. [sent-101, score-0.856]
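
Given predictive probabilities p(t_dm | w_d) for the held-out annotations from any of the methods, the perplexity itself is the usual exponentiated negative mean log-likelihood; a minimal sketch (the helper name and input format are assumptions):

```python
import numpy as np

def annotation_perplexity(log_probs):
    """Perplexity of held-out annotations.

    log_probs: iterable of log p(t_dm | content) values, one per held-out annotation.
    Returns exp( -(1/M) * sum_m log p(t_m | content) ).
    """
    log_probs = np.asarray(list(log_probs), dtype=float)
    return float(np.exp(-log_probs.mean()))
```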

58 Note that no content-unrelated annotations were attached to the test samples. [sent-104, score-0.745]

59 In all cases, when content-unrelated annotations were included, the proposed method achieved the lowest perplexity, indicating that it can appropriately predict content-related annotations. [sent-106, score-0.725]

60 Although the perplexity achieved by MaxEnt was slightly lower than that of the proposed method without content-unrelated annotations, the performance of MaxEnt deteriorated greatly when even one content-unrelated annotation was attached. [sent-107, score-0.466]

61 Since MaxEnt is a supervised classifier, it considers all attached annotations to be content-related even if they are not. [sent-108, score-0.745]

62 Therefore, its perplexity is significantly high when there are fewer content-related annotations per document than content-unrelated ones, as with the 20News1 data. [sent-109, score-1.695]

63 In contrast, since the proposed method considers the relevance to the content for each annotation, it always offered low perplexity even if the number of content-unrelated annotations was increased. [sent-110, score-1.113]

64 The perplexity achieved by Corr-LDA was high because it does not consider the relevance to the content as in MaxEnt. [sent-111, score-0.388]

65 We considered extraction as a binary classification problem, in which each annotation is classified as either content-related or content-unrelated. [sent-113, score-0.339]

66 We compared the proposed method to a baseline method in which the annotations are considered to be content-related if any of the words in the annotations appear in the document. [sent-115, score-1.479]

67 We assume that the baseline method knows that content-unrelated annotations do not appear in any document. [sent-118, score-0.689]

68 Note that this baseline method does not support image data, because words in the annotations never appear in the content. [sent-120, score-0.754]
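
A minimal sketch of this string-match baseline and of the F-measure used to score the extraction, assuming annotations and document words are available as plain token strings (function names are hypothetical):

```python
def baseline_related(annotation, doc_words):
    """Baseline rule: content-related iff any token of the annotation appears in the document."""
    return any(tok in doc_words for tok in annotation.split())

def f_measure(pred, gold):
    """F1 over binary content-related labels (True = related)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: tags of one document checked against its word set
doc_words = {"ruby", "rails", "tutorial", "deployment"}
preds = [baseline_related(t, doc_words) for t in ["ruby", "rails", "toread", "2008"]]
# preds -> [True, True, False, False]
```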

69 The F-measures achieved by the baseline method were low because annotations might be related to the content even if the annotations did not appear in the document. [sent-127, score-1.573]

70 On the other hand, the proposed method considers that annotations are related to the content when the topic, or latent semantics, of the content and the topic of the annotations are similar even if the annotations did not appear in the document. [sent-128, score-2.71]

71 [Figure 2 plot residue: curves labeled 'Proposed' and 'Baseline' versus the number of content-unrelated annotations per document; panels include 20News2.] [sent-137, score-2.442]

72 Figure 2 (c) shows the content-related annotation ratios estimated with the proposed method by the following equation: λ̂ = (M - M_0 + η) / (M + 2η). [sent-146, score-0.374]
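
Written out, and assuming λ carries the symmetric Beta(η, η) prior implied by the generative process, this estimator is the posterior mean of λ given the sampled relevance indicators:

```latex
\hat{\lambda} \;=\; \frac{M - M_0 + \eta}{M + 2\eta},
\qquad M = \sum_{d=1}^{D} M_d, \quad
M_0 = \#\{(d,m) : r_{dm} = 0\},
```

where M is the total number of annotations and M_0 is the number currently assigned to the content-unrelated distribution.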

73 2 Social annotations We analyzed the following three sets of real social annotation data taken from two social bookmarking services and a photo sharing service, namely Hatena, Delicious, and Flickr. [sent-149, score-1.379]

74 From the Hatena data, we used web pages and their annotations in Hatena::Bookmark [12], which is a social bookmarking service in Japan, that were collected using a similar method to that used in [25, 27]. [sent-150, score-0.937]

75 We omitted stop-words, as well as words and annotations that occurred in fewer than ten documents. [sent-155, score-0.816]

76 We omitted documents with fewer than ten unique words and also omitted those without annotations. [sent-156, score-0.259]

77 The numbers of documents, unique words, and unique annotations were 39,132, 8,885, and 43,667, respectively. [sent-157, score-0.816]

78 From the Delicious data, we used web pages and their annotations [7] that were collected using the same method used for the Hatena data. [sent-158, score-0.733]

79 The numbers of documents, unique words, and unique annotations were 65,528, 30,274, and 21,454, respectively. [sent-159, score-0.816]

80 From the Flickr data, we used photographs and their annotations from Flickr [9] that were collected in November 2008 using the same method used for the Hatena data. [sent-160, score-0.716]

81 We omitted annotations that were attached to fewer than ten images. [sent-162, score-0.807]

82 The numbers of images, unique visual words, and unique annotations were 12,711, 200, and 2,197, respectively. [sent-163, score-0.816]

83 Figure 3 (a)(b)(c) shows the average perplexities over ten experiments and their standard deviation for held-out annotations in the three real social annotation data sets with different numbers of topics. [sent-165, score-1.248]

84 Figure 3 (d) shows the result with the Patent data as an example of data without content-unrelated annotations. [sent-166, score-0.258]

85 Figure 4: Examples of content-related annotations in the Delicious data extracted by the proposed method. [sent-172, score-0.725]

86 Each row shows annotations attached to a document; content-unrelated annotations are shaded. [sent-173, score-1.434]

87 On the other hand, with the real social annotation data, the proposed method achieved much lower perplexities than Corr-LDA. [sent-176, score-0.537]

88 This result implies that it is important to consider relevance to the content when analyzing noisy social annotation data. [sent-177, score-0.684]

89 The perplexity of Corr-LDA with social annotation data gets worse as the number of topics increases because Corr-LDA overfits noisy content-unrelated annotations. [sent-178, score-0.653]

90 The upper half of each table in Table 2 shows probable content-unrelated annotations in the leftmost column, and probable annotations for some topics, which were estimated with the proposed method using 50 topics. [sent-179, score-1.496]

91 The lower half in (a) and (b) shows probable words in the content for each topic. [sent-180, score-0.301]

92 For content-unrelated annotations, words that seemed to be irrelevant to the content were extracted, such as ’toread’, ’later’, ’*’, ’? [sent-182, score-0.26]

93 Each topic has characteristic annotations and words, for example, Topic1 in the Hatena data is about programming, Topic2 is about games, and Topic3 is about economics. [sent-184, score-0.852]

94 Table 2: The ten most probable content-unrelated annotations (leftmost column), and the ten most probable annotations for some topics (other columns), estimated with the proposed method using 50 topics. [sent-186, score-1.661]

95 Table 2 (a), Hatena: the content-unrelated column includes 'toread', 'web', 'later', 'great', 'document', 'troll', '*', and '?'. [sent-189, score-0.308]

96 We have confirmed experimentally that the proposed method can extract content-related annotations appropriately, and can be used for analyzing social annotation data. [sent-191, score-1.15]

97 Since the proposed method is, theoretically, applicable to various kinds of annotation data, we will confirm this in additional experiments. [sent-193, score-0.337]

98 Modeling general and specific aspects of documents with a probabilistic topic model. [sent-225, score-0.215]

99 Probabilistic latent semantic visualization: topic model for visualizing documents. [sent-285, score-0.217]

100 Automatic image annotation and retrieval using cross-media relevance models. [sent-292, score-0.399]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('annotations', 0.689), ('annotation', 0.301), ('content', 0.195), ('topic', 0.163), ('perplexity', 0.129), ('cdm', 0.127), ('hatena', 0.127), ('document', 0.125), ('social', 0.124), ('rdm', 0.122), ('nkd', 0.122), ('dth', 0.117), ('topics', 0.099), ('tdm', 0.089), ('zdn', 0.089), ('maxent', 0.086), ('delicious', 0.076), ('perplexities', 0.076), ('ruby', 0.076), ('toread', 0.076), ('food', 0.071), ('words', 0.065), ('relevance', 0.064), ('corrlda', 0.064), ('nikon', 0.064), ('wdn', 0.064), ('unrelated', 0.063), ('photo', 0.058), ('annotated', 0.057), ('attached', 0.056), ('multinomial', 0.055), ('latent', 0.054), ('documents', 0.052), ('iphone', 0.051), ('unique', 0.051), ('dirichlet', 0.05), ('patent', 0.048), ('flickr', 0.045), ('linux', 0.045), ('imported', 0.045), ('bookmarking', 0.045), ('web', 0.044), ('blog', 0.043), ('word', 0.041), ('probable', 0.041), ('movie', 0.041), ('md', 0.041), ('wd', 0.041), ('mk', 0.04), ('services', 0.038), ('contentrelated', 0.038), ('cooking', 0.038), ('mkd', 0.038), ('rails', 0.038), ('urls', 0.038), ('japan', 0.037), ('ratios', 0.037), ('draw', 0.037), ('proposed', 0.036), ('recipes', 0.036), ('server', 0.036), ('service', 0.035), ('music', 0.034), ('retrieval', 0.034), ('animation', 0.033), ('nkw', 0.033), ('photography', 0.033), ('education', 0.033), ('sigir', 0.033), ('ten', 0.033), ('nth', 0.032), ('eat', 0.031), ('td', 0.03), ('omitted', 0.029), ('mth', 0.029), ('economy', 0.029), ('money', 0.029), ('apple', 0.029), ('mac', 0.029), ('generates', 0.028), ('zd', 0.027), ('photographs', 0.027), ('development', 0.026), ('acm', 0.026), ('nance', 0.026), ('security', 0.026), ('video', 0.025), ('bookmarked', 0.025), ('gmail', 0.025), ('interview', 0.025), ('ipc', 0.025), ('ipod', 0.025), ('opensource', 0.025), ('php', 0.025), ('shopping', 0.025), ('ssd', 0.025), ('tips', 0.025), ('ubuntu', 0.025), ('webdev', 0.025), ('numbers', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

2 0.20246781 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

Author: Lan Du, Lu Ren, Lawrence Carin, David B. Dunson

Abstract: A non-parametric Bayesian model is proposed for processing multiple images. The analysis employs image features and, when present, the words associated with accompanying annotations. The model clusters the images into classes, and each image is segmented into a set of objects, also allowing the opportunity to assign a word to each object (localized labeling). Each object is assumed to be represented as a heterogeneous mix of components, with this realized via mixture models linking image features to object types. The number of image classes, number of object types, and the characteristics of the object-feature mixture models are inferred nonparametrically. To constitute spatially contiguous objects, a new logistic stick-breaking process is developed. Inference is performed efficiently via variational Bayesian analysis, with example results presented on two image databases.

3 0.16637768 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

4 0.12221934 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

5 0.11819041 96 nips-2009-Filtering Abstract Senses From Image Search Results

Author: Kate Saenko, Trevor Darrell

Abstract: We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to find images on the web that are associated with the object name and train a visual classifier from the search result. As words are generally polysemous, this approach can lead to relatively noisy models if many examples due to outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classifier to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difficult to distinguish whether an image cluster relates to a desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model does not require any human supervision, and takes as input only the name of an object category. We show results of retrieving concrete-sense images in two available multimodal, multi-sense databases, as well as experiment with object classifiers trained on concrete-sense images returned by our method for a set of ten common office objects. 1

6 0.11638097 204 nips-2009-Replicated Softmax: an Undirected Topic Model

7 0.09241382 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

8 0.076598093 190 nips-2009-Polynomial Semantic Indexing

9 0.066391669 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

10 0.058955651 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

11 0.057121057 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

12 0.055627409 226 nips-2009-Spatial Normalized Gamma Processes

13 0.051207773 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

14 0.051025186 260 nips-2009-Zero-shot Learning with Semantic Output Codes

15 0.049956948 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

16 0.046027765 236 nips-2009-Structured output regression for detection with partial truncation

17 0.043140247 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

18 0.042908415 87 nips-2009-Exponential Family Graph Matching and Ranking

19 0.042905323 90 nips-2009-Factor Modeling for Advertisement Targeting

20 0.042846549 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.122), (1, -0.092), (2, -0.108), (3, -0.147), (4, 0.06), (5, -0.097), (6, -0.077), (7, 0.004), (8, 0.019), (9, 0.219), (10, -0.038), (11, -0.012), (12, 0.009), (13, 0.071), (14, -0.001), (15, -0.003), (16, -0.143), (17, -0.063), (18, -0.035), (19, 0.034), (20, 0.055), (21, 0.024), (22, 0.003), (23, -0.042), (24, -0.036), (25, 0.025), (26, 0.008), (27, -0.046), (28, 0.043), (29, 0.012), (30, -0.016), (31, 0.035), (32, -0.005), (33, -0.03), (34, 0.018), (35, 0.034), (36, -0.085), (37, 0.02), (38, 0.068), (39, 0.08), (40, 0.069), (41, -0.01), (42, -0.025), (43, 0.118), (44, -0.075), (45, -0.052), (46, 0.036), (47, -0.057), (48, 0.013), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96737629 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

2 0.7756893 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

3 0.76695496 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

4 0.70910931 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

5 0.62986457 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

Author: Shuang-hong Yang, Hongyuan Zha, Bao-gang Hu

Abstract: We propose Dirichlet-Bernoulli Alignment (DBA), a generative model for corpora in which each pattern (e.g., a document) contains a set of instances (e.g., paragraphs in the document) and belongs to multiple classes. By casting predefined classes as latent Dirichlet variables (i.e., instance level labels), and modeling the multi-label of each pattern as Bernoulli variables conditioned on the weighted empirical average of topic assignments, DBA automatically aligns the latent topics discovered from data to human-defined classes. DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation in web search queries respectively.

6 0.61212677 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

7 0.59366763 96 nips-2009-Filtering Abstract Senses From Image Search Results

8 0.49336922 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

9 0.48504725 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

10 0.4578059 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

11 0.4515416 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

12 0.44157529 226 nips-2009-Spatial Normalized Gamma Processes

13 0.41676989 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

14 0.37515113 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

15 0.35758242 90 nips-2009-Factor Modeling for Advertisement Targeting

16 0.35590678 190 nips-2009-Polynomial Semantic Indexing

17 0.3498345 51 nips-2009-Clustering sequence sets for motif discovery

18 0.34599862 233 nips-2009-Streaming Pointwise Mutual Information

19 0.31902677 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

20 0.28863409 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.03), (25, 0.05), (27, 0.016), (35, 0.044), (36, 0.063), (39, 0.05), (58, 0.058), (71, 0.091), (81, 0.016), (84, 0.378), (86, 0.075), (91, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80878991 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting content-related annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

2 0.63299334 74 nips-2009-Efficient Bregman Range Search

Author: Lawrence Cayton

Abstract: We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences. 1

3 0.63091868 90 nips-2009-Factor Modeling for Advertisement Targeting

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. Specifically, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo’s vast datasets, and show that our approach substantially outperforms the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP exceeds 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce; the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

4 0.52642715 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

5 0.39733577 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

6 0.39549851 205 nips-2009-Rethinking LDA: Why Priors Matter

7 0.39367494 56 nips-2009-Conditional Neural Fields

8 0.39283356 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

9 0.39249039 260 nips-2009-Zero-shot Learning with Semantic Output Codes

10 0.39233017 96 nips-2009-Filtering Abstract Senses From Image Search Results

11 0.38935977 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

12 0.38804761 226 nips-2009-Spatial Normalized Gamma Processes

13 0.38733172 154 nips-2009-Modeling the spacing effect in sequential category learning

14 0.38642746 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

15 0.38615346 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

16 0.38540679 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

17 0.38534027 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

18 0.38533551 112 nips-2009-Human Rademacher Complexity

19 0.38476908 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

20 0.38282126 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference