nips nips2008 nips2008-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. [sent-5, score-0.444]
2 Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. [sent-7, score-0.765]
3 However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. [sent-8, score-0.829]
4 We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. [sent-9, score-0.955]
5 1 Introduction Statistical language modelling is concerned with building probabilistic models of word sequences. [sent-11, score-0.822]
6 The vast majority of statistical language models are based on the Markov assumption, which states that the distribution of a word depends only on some fixed number of words that immediately precede it. [sent-13, score-1.009]
7 While this assumption is clearly false, it is very convenient because it reduces the problem of modelling the probability distribution of word sequences of arbitrary length to the problem of modelling the distribution on the next word given some fixed number of preceding words, called the context. [sent-14, score-1.077]
8 We will denote this distribution by P (wn |w1:n−1 ), where wn is the next word and w1:n−1 is the context (w1 , . . . , wn−1 ). [sent-15, score-0.723]
9 n-gram language models are the most popular statistical language models due to their simplicity and surprisingly good performance. [sent-19, score-0.476]
10 Class-based n-gram models [3] aim to address the data sparsity problem by clustering words and/or contexts into classes based on their usage patterns and then using this class information to improve generalization. [sent-26, score-0.474]
11 While it can improve n-gram performance, this approach introduces a very rigid kind of similarity, since each word typically belongs to exactly one class. [sent-27, score-0.496]
12 An alternative and much more flexible approach to counteracting the data sparsity problem is to represent each word using a real-valued feature vector that captures its properties, so that words used in similar contexts will have similar feature vectors. [sent-28, score-1.095]
13 Then the conditional probability of the next word can be modelled as a smooth function of the feature vectors of the context words and the next word. [sent-29, score-1.02]
14 Most models based on this approach use a feed-forward neural network to map the feature vectors of the context words to the distribution for the next word (e.g., the NPLM). [sent-32, score-1.04]
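To make the shape of this approach concrete, the following is a minimal sketch (not the authors' implementation) of a feed-forward pass that maps the feature vectors of the context words to a distribution over the next word; the dimensions and parameter names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, n_ctx = 1000, 50, 100, 4   # vocab size, feature dim, hidden units, context size (assumed)

R = rng.normal(scale=0.1, size=(V, D))             # word feature vectors
W_in = rng.normal(scale=0.1, size=(n_ctx * D, H))  # context-to-hidden weights
W_out = rng.normal(scale=0.1, size=(H, V))         # hidden-to-output weights
b = np.zeros(V)                                    # output biases

def nplm_next_word_distribution(context_ids):
    """Map the feature vectors of the context words to P(next word)."""
    x = R[context_ids].reshape(-1)   # concatenate the context feature vectors
    h = np.tanh(x @ W_in)            # hidden nonlinearity
    scores = h @ W_out + b           # one score per vocabulary word
    scores -= scores.max()           # numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = nplm_next_word_distribution([3, 17, 42, 7])
print(p.shape, p.sum())              # (1000,) 1.0
```

Note that the final normalization touches every word in the vocabulary, which is the source of the cost discussed next.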
15 2 The hierarchical neural network language model The main drawback of the NPLM and other similar models is that they are very slow to train and test [10]. [sent-36, score-0.429]
16 Since computing the exact gradient in such models requires repeatedly computing the probability of the next word given its context and updating the model parameters to increase that probability, training time is also linear in the vocabulary size. [sent-38, score-0.929]
17 It achieves this reduction by replacing the unstructured vocabulary of the NPLM by a binary tree that represents a hierarchical clustering of words in the vocabulary. [sent-43, score-0.839]
18 Each word corresponds to a leaf in the tree and can be uniquely specified by the path from the root to that leaf. [sent-44, score-0.793]
19 If N is the number of words in the vocabulary and the tree is balanced, any word can be specified by a sequence of O(log N ) binary decisions indicating which of the two children of the current node is to be visited next. [sent-45, score-1.266]
20 As a result, a distribution over words in the vocabulary can be specified by providing the probability of visiting the left child at each of the nodes. [sent-48, score-0.348]
21 In the hierarchical NPLM, these local probabilities are computed by giving a version of the NPLM the feature vectors for the context words as well as a feature vector for the current node as inputs. [sent-49, score-0.805]
22 The probability of the next word is then given by the probability of making a sequence of binary decisions that corresponds to the path to that word. [sent-50, score-0.704]
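As a minimal illustration of this decoding scheme (the numbers are hypothetical, not tied to any trained model), a word's probability is the product of the per-node decision probabilities along its root-to-leaf path:

```python
def word_probability(path, left_probs):
    """path: list of 0/1 decisions (0 = left, 1 = right) from the root to the word's leaf.
    left_probs: probability of visiting the left child at each node on that path."""
    p = 1.0
    for decision, p_left in zip(path, left_probs):
        p *= p_left if decision == 0 else (1.0 - p_left)
    return p

# e.g. a word reached by going left, right, left with these local probabilities:
print(word_probability([0, 1, 0], [0.7, 0.4, 0.9]))  # 0.7 * 0.6 * 0.9 = 0.378
```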
23 The main limitation of this work was the procedure used to construct the tree of words for the model. [sent-53, score-0.429]
24 The tree was obtained by starting with the WordNet IS-A taxonomy and converting it into a binary tree through a combination of manual and data-driven processing. [sent-54, score-0.577]
25 We will also explore the performance benefits of using trees where each word can occur more than once. [sent-56, score-0.607]
26 3 The log-bilinear model We will use the log-bilinear language model (LBL) [9] as the foundation of our hierarchical model because of its excellent performance and simplicity. [sent-57, score-0.42]
27 Like virtually all neural language models, the LBL model represents each word with a real-valued feature vector. [sent-58, score-0.856]
28 We will denote the feature vector for word w by rw and refer to the matrix containing all these feature vectors as R. [sent-59, score-0.871]
29 To predict the next word wn given the context w1:n−1 , the model computes the predicted feature vector r̂ for the next word by linearly combining the context word feature vectors: r̂ = ∑_{i=1}^{n−1} Ci rwi , (1) where Ci is the weight matrix associated with context position i. [sent-60, score-2.3]
30 Then the similarity between the predicted feature vector and the feature vector for each word in the vocabulary is computed using the inner product. [sent-61, score-0.98]
31 The distribution over the next word is then obtained by exponentiating and normalizing these similarity scores: P (wn = w|w1:n−1 ) = exp(r̂ᵀ rw + bw ) / ∑j exp(r̂ᵀ rj + bj ). (2) Here bw is the bias for word w, which is used to capture the context-independent word frequency. [sent-63, score-1.023]
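Equations (1) and (2) can be written out directly; the following is a small numpy sketch of the non-hierarchical LBL prediction step, with randomly initialized parameters standing in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, n_ctx = 1000, 100, 5                     # vocab size, feature dim, context size
R = rng.normal(scale=0.1, size=(V, D))         # word feature vectors (rows are r_w)
C = rng.normal(scale=0.1, size=(n_ctx, D, D))  # one context weight matrix per position
b = np.zeros(V)                                # per-word biases b_w

def lbl_next_word_distribution(context_ids):
    # Equation (1): predicted feature vector is a linear combination of context word vectors.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # Equation (2): softmax over inner products with every word's feature vector plus its bias.
    scores = R @ r_hat + b
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

p = lbl_next_word_distribution([5, 9, 2, 71, 33])
print(p.argmax(), p.sum())
```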
32 The inputs to the network are the feature vectors for the context words, while the matrix of weights from the hidden layer to the output layer is simply the feature vector matrix R. [sent-65, score-0.516]
33 The vector of activities of the hidden units corresponds to the predicted feature vector for the next word. [sent-66, score-0.33]
34 4 The hierarchical log-bilinear model Our hierarchical language model is based on the hierarchical model from [10]. [sent-69, score-0.686]
35 The distinguishing features of our model are the use of the log-bilinear language model for computing the probabilities at each node and the ability to handle multiple occurrences of each word in the tree. [sent-70, score-0.894]
36 Note that the idea of using multiple word occurrences in a tree was proposed in [10], but it was not implemented. [sent-71, score-0.771]
37 The first component of the hierarchical log-bilinear model (HLBL) is a binary tree with words at its leaves. [sent-72, score-0.645]
38 For now, we will assume that each word in the vocabulary is at exactly one leaf. [sent-73, score-0.628]
39 Then each word can be uniquely specified by a path from the root of the tree to the leaf node the word is at. [sent-74, score-1.364]
40 This allows each word to be represented by a binary string which we will call a code. [sent-77, score-0.594]
41 In the HLBL model, just like in its non-hierarchical counterpart, context words are represented using real-valued feature vectors. [sent-79, score-0.373]
42 Each of the non-leaf nodes in the tree also has a feature vector associated with it that is used for discriminating the words in the left subtree from the words in the right subtree of the node. [sent-80, score-0.864]
43 Unlike the context words, the words being predicted are represented using their binary codes that are determined by the word tree. [sent-81, score-0.988]
44 In the HLBL model, the probability of the next word being w is the probability of making the sequence of binary decisions specified by the word’s code, given the context. [sent-83, score-0.666]
45 The definition of P (wn = w|w1:n−1 ) can be extended to multiple codes per word by including a summation over all codes for w as follows: P (wn = w|w1:n−1 ) = ∑_{d∈D(w)} ∏_i P (di |qi , w1:n−1 ), (5) where D(w) is the set of codes corresponding to word w. [sent-88, score-1.413]
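The sketch below shows one way equation (5) could be evaluated. It assumes (the extracted text does not spell this out) that each decision probability P(di | qi, w1:n−1) is a logistic function of the inner product between the predicted feature vector r̂ and the node feature vector qi plus a node bias; the data layout is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hlbl_word_probability(codes, node_vectors, node_biases, r_hat):
    """codes: the set D(w) of binary codes for the word (each a list of 0/1 decisions).
    node_vectors / node_biases: per-code lists of feature vectors q_i and biases for the
    non-leaf nodes on that code's path.  r_hat: predicted feature vector from the context."""
    total = 0.0
    for code, qs, bs in zip(codes, node_vectors, node_biases):
        p_code = 1.0
        for d, q, b in zip(code, qs, bs):
            p_left = sigmoid(r_hat @ q + b)   # assumed form of P(d_i | q_i, context)
            p_code *= p_left if d == 0 else (1.0 - p_left)
        total += p_code                        # sum over all codes for the word
    return total

# toy usage with a word that has two codes in the tree
rng = np.random.default_rng(2)
D = 100
r_hat = rng.normal(size=D)
codes = [[0, 1, 0], [1, 1]]
qs = [[rng.normal(size=D) for _ in c] for c in codes]
bs = [[0.0] * len(c) for c in codes]
print(hlbl_word_probability(codes, qs, bs, r_hat))
```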
46 Allowing multiple codes per word can lead to better prediction of words that have multiple senses or multiple usage patterns. [sent-89, score-0.965]
47 Using multiple codes per word also makes it easy to combine several separate word hierarchies into a single one to reflect the fact that no single hierarchy can express all the relationships between words. [sent-90, score-0.911]
48 Using the LBL model instead of the NPLM for computing the local probabilities allows us to avoid computing the nonlinearities in the hidden layer, which makes our hierarchical model faster at making predictions than the hierarchical NPLM. [sent-91, score-0.504]
49 More importantly, the hierarchical NPLM needs to compute the hidden activities once for each of the O(log N ) decisions, while the HLBL model computes the predicted feature vector just once per prediction. [sent-92, score-0.466]
50 However, the time complexity of computing the probability for a single binary decision in an LBL model is still quadratic in the feature vector dimensionality D, which might make the use of high-dimensional feature vectors too computationally expensive. [sent-93, score-0.529]
51 Note that for a context of size 1, this restriction (replacing the full context weight matrices Ci with elementwise context weight vectors ci ) does not reduce the representational power of the model because the context weight matrix C1 can be absorbed into the word feature vectors. [sent-95, score-0.802]
52 5 Hierarchical clustering of words The first step in training a hierarchical language model is constructing a binary tree of words for the model to use. [sent-99, score-1.151]
53 After preprocessing the taxonomy by hand to ensure that each node had only one parent, data-driven hierarchical binary clustering was performed on the children of the nodes in the taxonomy that had more than two children, resulting in a binary tree. [sent-102, score-0.568]
54 Hierarchical binary clustering of words based on their usage statistics is a natural choice for generating binary trees of words automatically. [sent-105, score-0.736]
55 However, we will mention two existing algorithms that might be suitable for producing binary word hierarchies. [sent-108, score-0.565]
56 The algorithm from [8] produces exactly the kind of binary trees we need, except that its time complexity is cubic in the vocabulary size. [sent-110, score-0.356]
57 We also considered the distributional clustering algorithm [11] but decided not to use it because of the difficulties involved in using contexts of more than one word for clustering. [sent-111, score-0.725]
58 Since we would like to cluster words for easy prediction of the next word based on its context, it is natural to describe each word in terms of the contexts that can precede it. [sent-113, score-1.435]
59 For example, for a single-word context, one such description is the distribution of words that precede the word of interest in the training data. [sent-114, score-0.982]
60 (Footnote: thus the feature vector for the next word can now be computed as r̂ = ∑_{i=1}^{n−1} ci ◦ rwi , where ci is a vector of context weights for position i and ◦ denotes the elementwise product of two vectors.) [sent-116, score-0.811]
61 The problem becomes apparent when we consider using larger contexts: the number of contexts that can potentially precede a word grows exponentially in the context size. [sent-117, score-0.797]
62 We avoid these difficulties by operating on low-dimensional real-valued word representations in our tree-building procedure. [sent-120, score-0.496]
63 Since we need to train a model to obtain word feature vectors, we perform the following bootstrapping procedure: we generate a random binary tree of words, train an HLBL model based on it, and use the distributed representations it learns to represent words when building the word tree. [sent-121, score-1.742]
64 Since each word is represented by a distribution over contexts it appears in, we need a way of compressing such a collection of contexts down to a low-dimensional vector. [sent-122, score-0.754]
65 Then, we condense the distribution of contexts that precede a given word into a feature vector by computing the expectation of the predicted representation w.r.t. that distribution. [sent-125, score-0.949]
66 Thus, for the purposes of clustering, each word is represented by its average predicted feature vector. [sent-129, score-0.739]
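A sketch of this clustering representation, assuming a trained model's context weight matrices C and word feature matrix R (as in equation (1)) are available; the corpus encoding is hypothetical. Each word's vector is the average of the predicted feature vectors r̂ computed from the contexts preceding its occurrences in the training data.

```python
import numpy as np

def average_predicted_vectors(corpus_ids, C, R, n_ctx):
    """corpus_ids: training data as a list of word indices.
    Returns a (V, D) array whose rows are each word's average predicted feature vector."""
    V, D = R.shape
    sums = np.zeros((V, D))
    counts = np.zeros(V)
    for t in range(n_ctx, len(corpus_ids)):
        context = corpus_ids[t - n_ctx:t]
        r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))  # predicted vector at position t
        w_next = corpus_ids[t]
        sums[w_next] += r_hat
        counts[w_next] += 1
    counts[counts == 0] = 1            # avoid division by zero for unseen words
    return sums / counts[:, None]

# tiny synthetic example
rng = np.random.default_rng(3)
V, D, n_ctx = 20, 8, 3
R = rng.normal(size=(V, D))
C = rng.normal(size=(n_ctx, D, D))
corpus = rng.integers(0, V, size=500).tolist()
vectors = average_predicted_vectors(corpus, C, R, n_ctx)
print(vectors.shape)                   # (20, 8)
```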
67 One appealing property of this algorithm is that the running time of each iteration is linear in the vocabulary size, which is a consequence of representing words using feature vectors of fixed dimensionality. [sent-137, score-0.503]
68 In our experiments, the algorithm took only a few minutes to build a hierarchy for a vocabulary of nearly 18000 words based on 100-dimensional feature vectors. [sent-138, score-0.465]
69 The goal of an algorithm for generating trees for hierarchical language models is to produce trees that are well-supported by the data and are reasonably well-balanced so that the resulting models generalize well and are fast to train and test. [sent-139, score-0.72]
70 Our simplest rule aims to produce a balanced tree at any cost. [sent-144, score-0.349]
71 It achieves that by assigning the word to the component with the higher responsibility for the word. [sent-147, score-0.525]
72 This rule is designed to produce multiple codes for words that are difficult to cluster. [sent-150, score-0.368]
73 It starts with a random permutation of the words and recursively builds the left subtree based on the first half of the words and the right subtree based on the second half of the words. [sent-153, score-0.46]
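The following sketch illustrates the recursive splitting idea under stated assumptions: a two-component Gaussian mixture is fit to the word feature vectors at each node (the extracted text mentions component responsibilities but not the exact mixture form), each word goes to the component with the higher responsibility, and, as one reading of the ADAPTIVE(ε) rule, a word whose responsibilities differ by less than ε is placed in both subtrees and so receives multiple codes. scikit-learn's GaussianMixture stands in for whatever EM implementation the authors used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_tree(word_ids, features, epsilon=0.1, seed=0):
    """Recursively split words into a binary tree using two-component mixture responsibilities.
    Returns nested (left, right) tuples with lists of word ids at the leaves."""
    word_ids = np.asarray(word_ids)
    if len(word_ids) <= 2:
        return list(word_ids)
    gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=seed)
    resp = gmm.fit(features).predict_proba(features)       # responsibilities, shape (n_words, 2)
    higher_left = resp[:, 0] >= resp[:, 1]
    ambiguous = np.abs(resp[:, 0] - resp[:, 1]) < epsilon   # hard-to-cluster words go to both sides
    left = higher_left | ambiguous
    right = ~higher_left | ambiguous
    if left.all() or right.all():                           # degenerate split: stop recursing
        return list(word_ids)
    return (build_tree(word_ids[left], features[left], epsilon, seed),
            build_tree(word_ids[right], features[right], epsilon, seed))

# toy example on random 100-dimensional word feature vectors
rng = np.random.default_rng(4)
feats = rng.normal(size=(50, 100))
tree = build_tree(np.arange(50), feats)
```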
74 The mean code length is the sum of lengths of codes associated with a word, averaged over the distribution of the words in the training data. [sent-157, score-0.378]
75 The run-time complexity of the hierarchical model is linear in the mean code length of the tree used. [sent-158, score-0.46]
76 The mean number of codes per word refers to the number of codes per word averaged over the training data distribution. [sent-159, score-1.346]
77 Since each non-leaf node in a tree has its own feature vector, the number of free parameters associated with the tree is linear in this quantity. [sent-160, score-0.653]
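Both statistics can be read off directly from the tree's code assignments and the training-data word counts; a small sketch with hypothetical toy inputs:

```python
def tree_statistics(codes_per_word, word_counts):
    """codes_per_word: dict mapping word -> list of binary code strings from the tree.
    word_counts: dict mapping word -> number of occurrences in the training data."""
    total = sum(word_counts.values())
    mean_code_length = sum(
        word_counts[w] * sum(len(c) for c in codes_per_word[w]) for w in word_counts
    ) / total
    mean_codes_per_word = sum(
        word_counts[w] * len(codes_per_word[w]) for w in word_counts
    ) / total
    return mean_code_length, mean_codes_per_word

codes = {'the': ['010'], 'bank': ['0110', '110']}   # 'bank' has two codes in this toy tree
counts = {'the': 90, 'bank': 10}
print(tree_statistics(codes, counts))               # (3.4, 1.1)
```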
78 [Table 1, final row] Number of non-leaf nodes: 17963, 17963, 17963, 22995, 30296, 61014, 121980. Table 2: The effect of the feature dimensionality and the word tree used on the test set perplexity of the model. [sent-179, score-1.061]
79 [Table 2 column headers] Feature dimensionality | Perplexity using a random tree | Perplexity using a non-random tree | Reduction in perplexity. [sent-180, score-0.65]
80 The dataset consists of a 14 million word training set, a 1 million word validation set, and a 1 million word test set. [sent-193, score-1.706]
81 Except where stated otherwise, the models used in the experiments had 100-dimensional feature vectors and a context size of 5. [sent-196, score-0.327]
82 We started by training a model that used a tree generated by the RANDOM algorithm (tree T1 in Table 1). [sent-199, score-0.338]
83 The feature vectors learned by this model were used to build a tree using the BALANCED algorithm (tree T2). [sent-200, score-0.458]
84 We then trained models of various feature vector dimensionality on each of these trees to see whether a highly expressive model can compensate for using a poorly constructed tree. [sent-201, score-0.389]
85 Though the gap in performance can be reduced by increasing the dimensionality of feature vectors, using a non-random tree drastically improves performance even for the model with 100-dimensional feature vectors. [sent-204, score-0.537]
86 Since increasing the feature dimensionality beyond 100 did not result in a substantial reduction in perplexity, we used 100-dimensional feature vectors for all of our models in the following experiments. [sent-209, score-0.429]
87 4) algorithm twice and creating a tree with a root node that had the two generated trees as its subtrees. [sent-236, score-0.445]
88 Note that trees generated using ADAPTIVE(ε) with ε > 0 result in models with more parameters due to the greater number of tree nodes and thus tree-node feature vectors, as compared to trees generated using methods that produce one code/leaf per word. [sent-240, score-0.438]
89 As expected, building word trees adaptively improves model performance. [sent-243, score-0.681]
90 However, using a 2× overcomplete tree generated using the same algorithm results in a model that outperforms both the n-gram models and the LBL model, and using a 4× overcomplete tree leads to a further reduction in perplexity. [sent-247, score-0.666]
91 Creating hierarchies in which every word occurred more than once was essential to getting the models to perform better. [sent-253, score-0.576]
92 An inspection of trees generated by our adaptive algorithm showed that the words with the largest numbers of codes (i. [sent-254, score-0.498]
93 the word that were replicated the most) were not the words with multiple distinct senses. [sent-256, score-0.698]
94 The failure to use multiple codes for words with several very different senses is probably a consequence of summarizing the distribution over contexts with a single mean feature vector when clustering words. [sent-258, score-0.702]
95 The “sense multimodality” of context distributions would be better captured by using a small set of feature vectors found by clustering the contexts. [sent-259, score-0.341]
96 Finally, since our tree-building algorithm is based on the feature vectors learned by the model, it is possible to periodically interrupt training of such a model to rebuild the word tree based on the feature vectors provided by the model being trained. [sent-260, score-1.49]
97 This modified training procedure might produce better models by allowing the word hierarchy to adapt to the probabilistic component of the model and vice versa. [sent-261, score-0.738]
98 The biases for the tree nodes were initialized so that the distribution produced by the model with all the non-bias parameters set to zero matched the base rates of the words in the training set. [sent-264, score-0.507]
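A sketch of one way this initialization could be done, assuming the logistic-decision form used in the earlier sketch and the nested-tuple tree layout from the tree-building sketch; with all other parameters at zero, each node then sends probability mass left in proportion to the training-set frequency of the words in its left subtree, so leaf probabilities match the words' base rates.

```python
import math

def subtree_count(tree, word_counts):
    """Total training-set count of all words under this (sub)tree."""
    if isinstance(tree, tuple):                       # internal node: (left, right)
        return subtree_count(tree[0], word_counts) + subtree_count(tree[1], word_counts)
    return sum(word_counts.get(w, 0) for w in tree)   # leaf: list of words

def init_node_biases(tree, word_counts, biases=None):
    """Return internal-node biases in pre-order (root first), chosen so that sigmoid(bias)
    equals the fraction of the subtree's training-set mass that falls in its left subtree."""
    if biases is None:
        biases = []
    if not isinstance(tree, tuple):
        return biases
    left, right = tree
    n_left = subtree_count(left, word_counts)
    n_right = subtree_count(right, word_counts)
    p_left = (n_left + 1e-12) / (n_left + n_right + 2e-12)   # smoothed to avoid log(0)
    biases.append(math.log(p_left / (1.0 - p_left)))          # logit of the left-branch rate
    init_node_biases(left, word_counts, biases)
    init_node_biases(right, word_counts, biases)
    return biases

counts = {'the': 90, 'cat': 6, 'sat': 4}
toy_tree = (['the'], (['cat'], ['sat']))
print(init_node_biases(toy_tree, counts))   # [logit(0.9), logit(0.6)]
```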
99 Improving statistical language model performance with automatically generated word hierarchies. [sent-306, score-0.711]
100 Connectionist language modeling for large vocabulary continuous speech recognition. [sent-324, score-0.333]
wordName wordTfidf (topN-words)
[('word', 0.496), ('hlbl', 0.388), ('lbl', 0.33), ('tree', 0.229), ('nplm', 0.194), ('language', 0.179), ('words', 0.178), ('perplexity', 0.16), ('hierarchical', 0.133), ('vocabulary', 0.132), ('contexts', 0.129), ('feature', 0.12), ('codes', 0.12), ('wn', 0.113), ('trees', 0.111), ('precede', 0.097), ('adaptive', 0.089), ('context', 0.075), ('node', 0.075), ('balanced', 0.074), ('clustering', 0.073), ('vectors', 0.073), ('binary', 0.069), ('decisions', 0.062), ('models', 0.059), ('nonhierarchical', 0.058), ('responsibilities', 0.058), ('datapoint', 0.055), ('spite', 0.055), ('subtree', 0.052), ('yoshua', 0.051), ('million', 0.051), ('taxonomy', 0.05), ('predicted', 0.05), ('expert', 0.049), ('qi', 0.047), ('overcomplete', 0.044), ('linguistics', 0.042), ('wordnet', 0.041), ('training', 0.04), ('code', 0.04), ('next', 0.039), ('mnih', 0.039), ('morin', 0.039), ('nplms', 0.039), ('rwi', 0.039), ('vocabularies', 0.039), ('child', 0.038), ('ci', 0.038), ('path', 0.038), ('building', 0.038), ('per', 0.037), ('model', 0.036), ('di', 0.036), ('hierarchy', 0.035), ('usage', 0.035), ('perplexities', 0.034), ('layer', 0.033), ('bengio', 0.033), ('started', 0.033), ('smoothing', 0.033), ('dimensionality', 0.032), ('vector', 0.031), ('bw', 0.031), ('rw', 0.031), ('hidden', 0.031), ('root', 0.03), ('responsibility', 0.029), ('string', 0.029), ('culties', 0.028), ('activities', 0.028), ('distributional', 0.027), ('epoch', 0.027), ('senses', 0.027), ('probabilistic', 0.027), ('computing', 0.026), ('faster', 0.025), ('dataset', 0.025), ('nonlinearities', 0.025), ('virtually', 0.025), ('visits', 0.025), ('reduction', 0.025), ('outperform', 0.025), ('children', 0.025), ('multiple', 0.024), ('nodes', 0.024), ('generating', 0.023), ('rule', 0.023), ('modelling', 0.023), ('visit', 0.023), ('produce', 0.023), ('speech', 0.022), ('complexity', 0.022), ('procedure', 0.022), ('scores', 0.022), ('cubic', 0.022), ('occurrences', 0.022), ('train', 0.022), ('hierarchies', 0.021), ('sparsity', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
2 0.15064733 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.11986013 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
Author: Kate Saenko, Trevor Darrell
Abstract: Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionarybased approach outperforms baseline methods. 1
4 0.11293589 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
5 0.10901784 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
Author: Shay B. Cohen, Kevin Gimpel, Noah A. Smith
Abstract: We explore a new Bayesian model for probabilistic grammars, a family of distributions over discrete structures that includes hidden Markov models and probabilistic context-free grammars. Our model extends the correlated topic model framework to probabilistic grammars, exploiting the logistic normal distribution as a prior over the grammar parameters. We derive a variational EM algorithm for that model, and then experiment with the task of unsupervised grammar induction for natural language dependency parsing. We show that our model achieves superior results over previous models that use different priors. 1
6 0.10528078 194 nips-2008-Regularized Learning with Networks of Features
7 0.099414743 117 nips-2008-Learning Taxonomies by Dependence Maximization
8 0.091093428 84 nips-2008-Fast Prediction on a Tree
9 0.089459293 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
10 0.086995415 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters
11 0.081224576 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
12 0.080728345 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
13 0.078312241 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
14 0.066634223 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
15 0.063835032 28 nips-2008-Asynchronous Distributed Learning of Topic Models
16 0.063079543 73 nips-2008-Estimating Robust Query Models with Convex Optimization
17 0.062194873 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
18 0.062179215 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
19 0.06068765 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
20 0.059393536 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
topicId topicWeight
[(0, -0.188), (1, -0.084), (2, 0.066), (3, -0.106), (4, -0.012), (5, -0.096), (6, 0.109), (7, 0.104), (8, -0.084), (9, -0.076), (10, -0.083), (11, 0.029), (12, -0.108), (13, 0.099), (14, -0.037), (15, 0.071), (16, -0.043), (17, -0.076), (18, 0.012), (19, -0.026), (20, 0.005), (21, 0.082), (22, 0.022), (23, -0.001), (24, -0.075), (25, 0.093), (26, 0.067), (27, -0.057), (28, -0.107), (29, 0.022), (30, 0.072), (31, 0.093), (32, 0.004), (33, 0.055), (34, 0.052), (35, -0.042), (36, 0.102), (37, 0.022), (38, -0.025), (39, 0.114), (40, -0.067), (41, -0.004), (42, 0.013), (43, 0.108), (44, -0.007), (45, -0.067), (46, 0.01), (47, -0.006), (48, -0.008), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.95913225 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
2 0.68626976 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters
Author: Roger P. Levy, Florencia Reali, Thomas L. Griffiths
Abstract: Language comprehension in humans is significantly constrained by memory, yet rapid, highly incremental, and capable of utilizing a wide range of contextual information to resolve ambiguity and form expectations about future input. In contrast, most of the leading psycholinguistic models and fielded algorithms for natural language parsing are non-incremental, have run time superlinear in input length, and/or enforce structural locality constraints on probabilistic dependencies between events. We present a new limited-memory model of sentence comprehension which involves an adaptation of the particle filter, a sequential Monte Carlo method, to the problem of incremental parsing. We show that this model can reproduce classic results in online sentence comprehension, and that it naturally provides the first rational account of an outstanding problem in psycholinguistics, in which the preferred alternative in a syntactic ambiguity seems to grow more attractive over time even in the absence of strong disambiguating information. 1
3 0.6388337 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
4 0.59667754 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
Author: Yik-cheung Tam, Tanja Schultz
Abstract: We present a correlated bigram LSA approach for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing which handles fractional counts. We address the scalability issue to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results on the Mandarin RT04 test set show that applying unigram and bigram LSA together yields 6%–8% relative perplexity reduction and 2.5% relative character error rate reduction which is statistically significant compared to applying only unigram LSA. On the large-scale evaluation on Arabic, 3% relative word error rate reduction is achieved which is also statistically significant. 1
5 0.58913654 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
Author: Shay B. Cohen, Kevin Gimpel, Noah A. Smith
Abstract: We explore a new Bayesian model for probabilistic grammars, a family of distributions over discrete structures that includes hidden Markov models and probabilistic context-free grammars. Our model extends the correlated topic model framework to probabilistic grammars, exploiting the logistic normal distribution as a prior over the grammar parameters. We derive a variational EM algorithm for that model, and then experiment with the task of unsupervised grammar induction for natural language dependency parsing. We show that our model achieves superior results over previous models that use different priors. 1
6 0.51248235 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
7 0.49515316 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
8 0.4880603 28 nips-2008-Asynchronous Distributed Learning of Topic Models
9 0.47971225 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
10 0.46899536 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
11 0.46585363 117 nips-2008-Learning Taxonomies by Dependence Maximization
12 0.4650754 29 nips-2008-Automatic online tuning for fast Gaussian summation
13 0.46209645 35 nips-2008-Bayesian Synchronous Grammar Induction
14 0.4603295 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
15 0.43986925 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
16 0.42011884 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
17 0.41421807 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
18 0.41275609 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
19 0.39324346 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
20 0.39212075 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
topicId topicWeight
[(6, 0.061), (7, 0.093), (12, 0.083), (28, 0.195), (32, 0.037), (57, 0.104), (59, 0.016), (63, 0.034), (67, 0.151), (71, 0.014), (77, 0.038), (78, 0.03), (83, 0.046)]
simIndex simValue paperId paperTitle
1 0.91203535 9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets
Author: Gerald Quon, Yee W. Teh, Esther Chan, Timothy Hughes, Michael Brudno, Quaid D. Morris
Abstract: We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present Brownian Factor Phylogenetic Analysis, a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset profiling diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects. 1
same-paper 2 0.8862738 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
3 0.84018373 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
Author: Kate Saenko, Trevor Darrell
Abstract: Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionarybased approach outperforms baseline methods. 1
4 0.83538592 66 nips-2008-Dynamic visual attention: searching for coding length increments
Author: Xiaodi Hou, Liqing Zhang
Abstract: A visual attention system should respond placidly when common stimuli are presented, while at the same time keep alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limit amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return. 1
5 0.83480239 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
6 0.82995474 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
7 0.82812059 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
8 0.82597339 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
9 0.82596409 200 nips-2008-Robust Kernel Principal Component Analysis
10 0.8254056 62 nips-2008-Differentiable Sparse Coding
11 0.82540399 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
12 0.82529753 138 nips-2008-Modeling human function learning with Gaussian processes
13 0.82325083 219 nips-2008-Spectral Hashing
14 0.82079601 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
15 0.8201744 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
16 0.81990147 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
17 0.8188867 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
18 0.81745517 231 nips-2008-Temporal Dynamics of Cognitive Control
19 0.81538892 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
20 0.81516725 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC