nips nips2012 nips2012-12 knowledge-graph by maker-knowledge-mining

12 nips-2012-A Neural Autoregressive Topic Model


Source: pdf

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A Neural Autoregressive Topic Model Stanislas Lauly, Département d'informatique, Université de Sherbrooke, stanislas. [sent-1, score-0.078]

2 ca Hugo Larochelle, Département d'informatique, Université de Sherbrooke, hugo. [sent-3, score-0.078]

3 ca Abstract We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. [sent-5, score-0.272]

4 This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. [sent-6, score-0.725]

5 Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. [sent-7, score-0.505]

6 This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. [sent-8, score-0.408]
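
To make the tree-based output concrete, here is a minimal sketch (not the authors' code) of how a binary-tree softmax scores one word as a product of left/right logistic decisions along the root-to-leaf path, so the cost is O(log V) rather than O(V); the tree layout, the parameter names (U, b), and the word_path/word_code encoding are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_softmax_log_prob(h, word_path, word_code, U, b):
    """Log-probability of one word under a binary-tree (hierarchical) softmax.

    h         : (H,) hidden representation of the context.
    word_path : indices of the internal tree nodes on the root-to-leaf path.
    word_code : 0/1 left-right decisions taken at each of those nodes.
    U, b      : per-node logistic-regression weights (num_nodes x H) and biases.

    The cost is O(depth) = O(log V) instead of O(V) for a flat softmax.
    """
    logp = 0.0
    for node, bit in zip(word_path, word_code):
        p_left = sigmoid(np.dot(U[node], h) + b[node])
        logp += np.log(p_left if bit == 0 else 1.0 - p_left)
    return logp

# Toy usage with a balanced tree over a vocabulary of 8 words (depth 3).
rng = np.random.default_rng(0)
H, num_nodes = 5, 7
U, b, h = rng.normal(size=(num_nodes, H)), np.zeros(num_nodes), rng.normal(size=H)
print(tree_softmax_log_prob(h, word_path=[0, 1, 3], word_code=[0, 1, 0], U=U, b=b))
```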

7 The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. [sent-9, score-0.23]

8 Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. [sent-10, score-0.517]

9 [1 Introduction] In order to leverage the large amount of available unlabeled text, a lot of research has been devoted to developing good probabilistic models of documents. [sent-11, score-0.1]

10 Such models are usually embedded with latent variables or topics, whose role is to capture salient statistical patterns in the co-occurrence of words within documents. [sent-12, score-0.157]

11 The most popular model is latent Dirichlet allocation (LDA) [1], a directed graphical model in which each word is a sample from a mixture of global word distributions (shared across documents) and where the mixture weights vary between documents. [sent-13, score-0.717]

12 In this context, the word multinomial distributions (mixture components) correspond to the topics and a document is represented as the parameters (mixture weights) of its associated distribution over topics. [sent-14, score-0.6]

13 Once trained, these topics have been found to extract meaningful groups of semantically related words and the (approximately) inferred topic mixture weights have been shown to form a useful representation for documents. [sent-15, score-0.474]

14 More recently, Salakhutdinov and Hinton [2] proposed an alternative undirected model, the Replicated Softmax which, instead of representing documents as distributions over topics, relies on a binary distributed representation of the documents. [sent-16, score-0.242]

15 The latent variables can then be understood as topic features: they do not correspond to normalized distributions over words, but to unnormalized factors over words. [sent-17, score-0.207]

16 A combination of topic features generates a word distribution by multiplying these factors and renormalizing. [sent-18, score-0.407]
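
A tiny sketch of what "multiplying these factors and renormalizing" means for a binary topic-feature vector h: each active feature contributes an unnormalized per-word factor, and the renormalized product is a softmax over the summed log-factors. The parameter names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def word_distribution(h, log_factors, bias):
    """Renormalized product of topic-feature factors over the vocabulary.

    h           : (K,) binary topic-feature vector.
    log_factors : (K, vocab) log of each feature's unnormalized factor per word.
    bias        : (vocab,) background log-factor.

    Multiplying factors and renormalizing equals a softmax of the summed logs.
    """
    logits = bias + h @ log_factors
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(2)
print(word_distribution(np.array([1, 0, 1]), rng.normal(size=(3, 8)), np.zeros(8)))
```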

17 They show that the Replicated Softmax allows for very efficient inference of a document’s topic feature representation and outperforms LDA both as a generative model of documents and as a method for representing documents in an information retrieval setting. [sent-19, score-0.568]

18 While inference of a document representation is efficient in the Replicated Softmax, one of its disadvantages is that the complexity of its learning update scales linearly with the vocabulary size V , i. [sent-20, score-0.42]

19 the number of different words that are observed in a document. [sent-22, score-0.109]

20 The factor responsible for this … Figure 1: (Left) Illustration of NADE, (Middle) Replicated Softmax, (Right) DocNADE. [sent-23, score-0.019]

21 Colored lines identify the connections that share parameters, and v̂i is a shorthand for the autoregressive conditional p(vi | v<i). [sent-24, score-0.124]
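
As a reading aid for this factorization, below is a small sketch of computing log p(v) = Σ_i log p(vi | v<i) with a sigmoid hidden layer fed by the sum of the embeddings of the already-observed words; it uses a flat softmax output for brevity (the paper instead uses the tree-structured output sketched earlier), and the parameter names (W, V_out, b_out, c) are illustrative, not the authors' implementation.

```python
import numpy as np

def doc_log_likelihood(word_seq, W, V_out, b_out, c):
    """log p(v) = sum_i log p(v_i | v_{<i}) for one document (list of word ids).

    W     : (H, vocab) input word embeddings, accumulated over previous words.
    V_out : (vocab, H) output weights of a flat softmax (for readability only).
    b_out : (vocab,) output biases; c : (H,) hidden bias.
    """
    acc = np.zeros(W.shape[0])                   # running sum of embeddings of v_{<i}
    logp = 0.0
    for w in word_seq:
        h = 1.0 / (1.0 + np.exp(-(c + acc)))     # sigmoid hidden layer
        scores = V_out @ h + b_out
        scores -= scores.max()                   # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())
        logp += log_probs[w]
        acc += W[:, w]                           # condition on the word just seen
    return logp

rng = np.random.default_rng(1)
H, vocab = 4, 10
params = (rng.normal(size=(H, vocab)), rng.normal(size=(vocab, H)),
          np.zeros(vocab), np.zeros(H))
print(doc_log_likelihood([3, 7, 7, 1], *params))
```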

22 One solution is to assume the following generative story: first, a seed document v̄ is sampled from DocNADE and, finally, a random permutation of its words is taken to produce the observed document v. [sent-25, score-0.655]

23 This translates into the following probability distribution: p(v) = Σ_{v̄∈V(v)} p(v|v̄) p(v̄) = Σ_{v̄∈V(v)} (1/|V(v)|) p(v̄) (12), where p(v̄) is modeled by DocNADE and V(v) is the set of all documents v̄ with the same word count vector n(v̄) = n(v). [sent-26, score-0.473]

24 This distribution is a mixture over all possible permutations that could have generated the original document v. [sent-27, score-0.282]

25 Now, we can use the fact that sampling uniformly from V(v) can be done solely on the basis of the word counts of v, by randomly sampling words without replacement from those word counts. [sent-28, score-0.753]
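
A minimal sketch of this sampling step, assuming each document is given as a word-count dictionary; the function name and the count format are illustrative, not the authors' code.

```python
import numpy as np

def permuted_sequence_from_counts(word_counts, rng=None):
    """Draw one word sequence uniformly from V(v): all orderings consistent
    with the given word-count vector, by sampling words without replacement.

    word_counts : dict {word_id: count} for one document.
    """
    rng = rng or np.random.default_rng()
    bag = np.repeat(list(word_counts.keys()), list(word_counts.values()))
    rng.shuffle(bag)                 # uniform random permutation of the bag
    return bag.tolist()

# e.g. counts {2: 2, 5: 1, 9: 3} -> one of the 6!/(2!*1!*3!) = 60 distinct orderings
print(permuted_sequence_from_counts({2: 2, 5: 1, 9: 3}, np.random.default_rng(0)))
```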

26 Therefore, we can train DocNADE on those generated word sequences, as if they were the original documents from which the word counts were extracted. [sent-29, score-0.795]

27 This approach of training DocNADE can be understood as learning a model that is good at predicting which new words should be inserted in a document at any position, while maintaining its general semantics. [sent-31, score-0.386]

28 The model is therefore learning not to insert “intruder” words, i. [sent-32, score-0.026]

29 After training, a document’s learned representation h(v) should contain valuable information to identify intruder words for this document. [sent-35, score-0.307]

30 It’s interesting to note that the detection of such intruder words has been used previously as a task in user studies to evaluate the quality of the topics learned by LDA, though at the level of single topics and not whole documents [8]. [sent-36, score-0.654]

31 [5 Related Work] We mentioned that the Replicated Softmax models the distribution over words as a product of topic-dependent factors. [sent-37, score-0.109]

32 The Sparse Additive Generative Model (SAGE) [9] is also based on topic-dependent factors, as well as a background factor. [sent-38, score-0.02]

33 The distribution of a word is the renormalized product of its topic factor and the background factor. [sent-39, score-0.411]

34 Unfortunately, much like the Replicated Softmax, training in SAGE scales linearly with the vocabulary size, instead of logarithmically as in DocNADE. [sent-40, score-0.211]

35 Recent work has also been able to improve the complexity of RBM training on word observations. [sent-41, score-0.322]

36 However, for the specific case of the Replicated Softmax, the proposed method does not allow removing the linear dependence of the complexity on V [10]. [sent-42, score-0.019]

37 There has been fairly little work on using neural networks to learn generative topic models of documents. [sent-43, score-0.213]

38 [12] have trained neural network autoencoders on documents in their binary bag of words representation, but such neural networks are not generative models of documents. [sent-46, score-0.448]

39 One potential advantage of having a proper generative model under which p(v) can be computed exactly is that it becomes possible to do Bayesian learning of the parameters, even on a large scale, using recent online Bayesian inference approaches [13, 14]. [sent-47, score-0.105]

40 The first compares the performance of DocNADE as a generative model, while the second evaluates whether DocNADE's hidden layer can be used as a meaningful representation for documents. [sent-49, score-0.304]

41 Following Salakhutdinov and Hinton [2], we use a hidden layer size of H = 50 in all experiments. [sent-50, score-0.084]

42 A validation set is always set aside to perform model selection of other hyper-parameters, such as the learning rate and the number of learning passes over the training set (based on early stopping). [sent-51, score-0.093]

43 We also tested the use of a hidden layer hyperbolic tangent nonlinearity tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x)) instead of the sigmoid and always used the best option based on the validation set performance. [sent-52, score-0.264]

44 We end this section with a qualitative inspection of the implicit word representation and topic-features learned by DocNADE. [sent-53, score-0.366]

45 Table 1: Test perplexity per word for LDA with 50 and 200 latent topics, Replicated Softmax with 50 topics, and DocNADE with 50 topics. [sent-57, score-0.597]
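
For reference, here is a small sketch of one common way to compute per-word test perplexity from document log-likelihoods, exp(−(1/N) Σ_t (1/|v_t|) log p(v_t)); this averaging convention is an assumption in this sketch, not a quotation of the paper's exact formula.

```python
import numpy as np

def perplexity_per_word(log_probs, doc_lengths):
    """exp(-(1/N) * sum_t (1/|v_t|) * log p(v_t)): per-word test perplexity
    under one common averaging convention (assumed here).

    log_probs   : array of document log-likelihoods log p(v_t) (natural log).
    doc_lengths : array of document lengths |v_t| in words.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(log_probs / doc_lengths)))

# e.g. two documents of 100 and 250 words
print(perplexity_per_word([-520.0, -1400.0], [100, 250]))
```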

46 The results for LDA and Replicated Softmax were taken from Salakhutdinov and Hinton [2]. [sent-58, score-0.021]

47 [1 Generative Model Evaluation] We first evaluated DocNADE's performance as a generative model of documents. [sent-60, score-0.105]

48 The vocabulary size was 2000 for 20 Newsgroups and 10 000 for RCV1-v2. [sent-62, score-0.074]

49 We used the version of DocNADE that trains from document word counts. [sent-63, score-0.48]

50 To approximate the corresponding distribution p(v) of Equation 12, we sample a single permuted word sequence v from the word counts. [sent-64, score-0.613]

51 This might seem like a crude approximation, but, as we’ll see, the value of p(v) tends not to vary a lot across different random permutations of the words. [sent-65, score-0.144]

52 Instead of minimizing the average document negative log-likelihood −(1/N) Σ_t log p(v_t), we also considered minimizing a version normalized by each document's size, −(1/N) Σ_t (1/|v_t|) log p(v_t), though the difference in performance between both ended up not being large. [sent-66, score-0.247]

53-54 For 20 Newsgroups, the model with the best perplexity on the validation set used a learning rate of 0.001, sigmoid hidden activation, and optimized the average document negative log-likelihood (non-normalized). [sent-67, score-0.198] [sent-68, score-0.339]

55 For RCV1-v2, a learning rate of 0.1, with sigmoid hidden activation and optimization of the objective normalized by each document's size, performed best. [sent-70, score-0.162]

56 A comparison is made with LDA using 50 or 200 topics and the Replicated Softmax with 50 topics. [sent-72, score-0.12]

57 The results for LDA and Replicated Softmax were taken from Salakhutdinov and Hinton [2]. [sent-73, score-0.021]

58 We see that DocNADE achieves lower perplexity than both models. [sent-74, score-0.167]

59 On RCV1-v2, DocNADE reaches a perplexity that is almost half that of LDA with 50 topics. [sent-75, score-0.167]

60 We also provide the standard deviation of the perplexity obtained by repeating the test-set perplexity calculation 100 times, using different permuted word sequences v. [sent-76, score-0.708]
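
A sketch of how such a spread could be estimated, assuming a scoring function score_fn(word_seq) -> log p(v) for the trained model; both the scorer and the word-count format are placeholders, not the paper's implementation.

```python
import numpy as np

def perplexity_spread(score_fn, docs_as_counts, n_repeats=100, seed=0):
    """Mean and standard deviation of test perplexity over repeated random
    permutations of each document's words. `score_fn(word_seq) -> log p(v)`
    is a placeholder for the trained model's document log-likelihood.
    """
    rng = np.random.default_rng(seed)
    perplexities = []
    for _ in range(n_repeats):
        per_word_nll = []
        for counts in docs_as_counts:
            bag = np.repeat(list(counts.keys()), list(counts.values()))
            rng.shuffle(bag)                       # fresh permutation each repeat
            per_word_nll.append(-score_fn(bag.tolist()) / len(bag))
        perplexities.append(np.exp(np.mean(per_word_nll)))
    return float(np.mean(perplexities)), float(np.std(perplexities))

# Toy usage with a fake scorer that assigns each word log-probability log(1/10).
fake_score = lambda seq: len(seq) * np.log(0.1)
print(perplexity_spread(fake_score, [{2: 2, 5: 1}, {1: 4}], n_repeats=5))
```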

61 We see that it is fairly small, which confirms that the value of p(v) does not vary a lot across different permutations. [sent-77, score-0.11]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('docnade', 0.586), ('replicated', 0.381), ('softmax', 0.299), ('word', 0.282), ('document', 0.198), ('documents', 0.17), ('perplexity', 0.167), ('lda', 0.165), ('newsgroups', 0.142), ('topics', 0.12), ('intruder', 0.117), ('words', 0.109), ('generative', 0.105), ('vt', 0.09), ('hinton', 0.083), ('salakhutdinov', 0.083), ('topic', 0.079), ('partement', 0.078), ('vocabulary', 0.074), ('sherbrooke', 0.069), ('informatique', 0.064), ('sigmoid', 0.062), ('counts', 0.061), ('sage', 0.06), ('logarithmically', 0.057), ('larochelle', 0.054), ('autoregressive', 0.052), ('meaningful', 0.051), ('lot', 0.05), ('permuted', 0.049), ('mixture', 0.047), ('hidden', 0.045), ('representation', 0.044), ('layer', 0.039), ('universit', 0.038), ('permutations', 0.037), ('hugo', 0.034), ('activation', 0.034), ('vi', 0.033), ('understood', 0.032), ('dauphin', 0.032), ('linearly', 0.032), ('vary', 0.031), ('validation', 0.031), ('unlabeled', 0.03), ('glorot', 0.03), ('hyperbolic', 0.03), ('renormalized', 0.03), ('fairly', 0.029), ('ended', 0.028), ('reuters', 0.028), ('undirected', 0.028), ('latent', 0.028), ('conditionals', 0.027), ('story', 0.027), ('factors', 0.027), ('scales', 0.027), ('insert', 0.026), ('crude', 0.026), ('disadvantages', 0.026), ('inserted', 0.026), ('inspiration', 0.025), ('rounded', 0.025), ('exp', 0.025), ('autoencoders', 0.025), ('seed', 0.024), ('semantically', 0.024), ('tanh', 0.024), ('rbm', 0.023), ('sequences', 0.022), ('aside', 0.022), ('inspection', 0.022), ('translates', 0.021), ('repeating', 0.021), ('normalized', 0.021), ('training', 0.021), ('taken', 0.021), ('text', 0.021), ('bag', 0.021), ('rms', 0.021), ('tangent', 0.02), ('murray', 0.02), ('shorthand', 0.02), ('background', 0.02), ('colored', 0.02), ('evaluates', 0.02), ('unnormalized', 0.02), ('salient', 0.02), ('devoted', 0.02), ('evaluation', 0.019), ('passes', 0.019), ('multiplying', 0.019), ('identify', 0.019), ('complexity', 0.019), ('nonlinearity', 0.019), ('replacement', 0.019), ('responsible', 0.019), ('option', 0.018), ('trained', 0.018), ('learned', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

2 0.28205562 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

3 0.14934739 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

4 0.1387569 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

Author: James T. Kwok, Ryan P. Adams

Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1

5 0.12609282 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

Author: Nitish Srivastava, Ruslan Salakhutdinov

Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1

6 0.11974318 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

7 0.1189958 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.11071597 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

9 0.10437694 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

10 0.089093372 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

11 0.087943226 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

12 0.080327667 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

13 0.076561831 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

14 0.064041443 126 nips-2012-FastEx: Hash Clustering with Exponential Families

15 0.063700236 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

16 0.061476897 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.060142551 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

18 0.055184115 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

19 0.05478818 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

20 0.054145992 65 nips-2012-Cardinality Restricted Boltzmann Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.062), (2, -0.074), (3, 0.008), (4, -0.163), (5, -0.042), (6, -0.014), (7, -0.001), (8, 0.086), (9, -0.042), (10, 0.22), (11, 0.228), (12, 0.019), (13, 0.059), (14, 0.021), (15, 0.061), (16, 0.087), (17, 0.082), (18, 0.069), (19, 0.033), (20, 0.026), (21, 0.017), (22, -0.034), (23, -0.052), (24, 0.073), (25, -0.003), (26, 0.041), (27, 0.07), (28, 0.105), (29, -0.102), (30, 0.104), (31, -0.058), (32, -0.025), (33, 0.053), (34, 0.012), (35, -0.012), (36, -0.093), (37, 0.059), (38, 0.005), (39, -0.006), (40, 0.036), (41, 0.034), (42, 0.023), (43, 0.076), (44, -0.059), (45, 0.019), (46, -0.039), (47, -0.051), (48, -0.042), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96545935 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

2 0.86819887 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

3 0.86382145 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.70753223 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

5 0.66443843 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

Author: Xianxing Zhang, Lawrence Carin

Abstract: A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring interpretable latent binary features for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. Advantages of the proposed model are demonstrated for prediction of votes on a new piece of legislation, based only on the observed text of legislation. The coupling of the text and legislation is also shown to yield insight into the properties of the matrix decomposition for roll-call data. 1

6 0.65169746 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

7 0.64538586 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

8 0.60669571 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

9 0.54264867 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

10 0.52179343 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.51875675 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

12 0.45779443 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

13 0.45238629 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

14 0.40875483 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

15 0.40374362 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

16 0.38551247 65 nips-2012-Cardinality Restricted Boltzmann Machines

17 0.38265339 192 nips-2012-Learning the Dependency Structure of Latent Factors

18 0.37899876 22 nips-2012-A latent factor model for highly multi-relational data

19 0.37698019 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

20 0.36584723 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.176), (21, 0.018), (38, 0.062), (42, 0.011), (53, 0.304), (54, 0.021), (55, 0.035), (74, 0.058), (76, 0.099), (80, 0.067), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77035183 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

2 0.73099053 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V |3/2 ) by querying O(|V |3/2 ) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V | + (|V |/k)3/2 edge labels. The running time of this algorithm is at most of order |E| + |V | log |V |. 1

3 0.72525823 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lov´ sz extension of a submodular function, which we a call the Lov´ sz-Bregman divergence, is a continuous extension of a submodular a Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lov´ sz Bregman divergence is natural in clustering scenarios where a ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1

4 0.70670617 359 nips-2012-Variational Inference for Crowdsourcing

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

5 0.64348286 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

6 0.59124893 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

7 0.58989257 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

8 0.57130361 233 nips-2012-Multiresolution Gaussian Processes

9 0.56729621 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

10 0.5668813 192 nips-2012-Learning the Dependency Structure of Latent Factors

11 0.55449349 30 nips-2012-Accuracy at the Top

12 0.54738861 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

13 0.54444396 282 nips-2012-Proximal Newton-type methods for convex optimization

14 0.5282203 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

15 0.51623106 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

16 0.49846816 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

17 0.49591303 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

18 0.48366362 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

19 0.47772476 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.47537673 274 nips-2012-Priors for Diversity in Generative Latent Variable Models