nips nips2000 nips2000-6 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
Reference: text
sentIndex sentText sentNum sentScore
1 A goal of statistical language modeling is to learn the joint probability function of sequences of words. [sent-3, score-0.281]
2 In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. [sent-5, score-0.54]
3 a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. [sent-7, score-0.517]
4 Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. [sent-8, score-1.206]
5 We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model. [sent-9, score-0.658]
6 1 Introduction A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. [sent-10, score-0.185]
7 It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). [sent-11, score-0.447]
8 For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary $V$ of size 100,000, there are potentially $100{,}000^{10} - 1 = 10^{50} - 1$ free parameters. [sent-12, score-0.646]
9 A statistical model of language can be represented by the conditional probability of the next word given all the previous ones in the sequence, since $P(w_1^T) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1})$, where $w_t$ is the t-th word, and writing subsequence $w_i^j = (w_i, w_{i+1}, \ldots, w_j)$. [sent-13, score-0.685]
10 When building statistical models of natural language, one reduces the difficulty by taking advantage of word order, and the fact that temporally closer words in the word sequence are statistically more dependent. [sent-17, score-1.316]
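The parameter count in the sentence above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name is hypothetical and not from the paper.

```python
# Free parameters of a full joint probability table over n_words positions,
# each taking one of vocab_size values: one entry per possible sequence,
# minus one because the probabilities must sum to 1.
def joint_table_free_parameters(vocab_size: int, n_words: int) -> int:
    return vocab_size ** n_words - 1


count = joint_table_free_parameters(100_000, 10)
print(count == 10 ** 50 - 1)
```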
11 Thus, n-gram models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e. [sent-18, score-0.15]
12 Only those combinations of successive words that actually occur in the training corpus (or that occur frequently enough) are considered. [sent-21, score-0.522]
13 What happens when a new combination of n words appears that was not seen in the training corpus? [sent-22, score-0.407]
14 A simple answer is to look at the probability predicted using smaller context size, as done in back-off trigram models [7] or in smoothed (or interpolated) trigram models [6]. [sent-23, score-0.98]
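A minimal sketch of an interpolated trigram may make the "smaller context size" idea concrete. It assumes fixed mixture weights and simple count ratios. The paper's benchmark instead learns weights conditioned on context frequency, so the function names and the weights below are illustrative assumptions only.

```python
from collections import Counter


def train_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated_prob(w, u, v, counts, vocab_size, weights=(0.1, 0.2, 0.3, 0.4)):
    """Estimate P(w | u, v) as a fixed-weight mix of four estimators.

    The zero-gram (uniform) term keeps the result non-zero even when the
    trigram (u, v, w) never occurred in the training tokens.
    """
    uni, bi, tri = counts
    total = sum(uni.values())
    p0 = 1.0 / vocab_size                                   # zero-gram
    p1 = uni[w] / total if total else 0.0                   # unigram
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0             # bigram
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0  # trigram
    a0, a1, a2, a3 = weights
    return a0 * p0 + a1 * p1 + a2 * p2 + a3 * p3


tokens = "the cat sat on the mat".split()
counts = train_counts(tokens)
# "on the cat" was never seen, yet it still receives some probability mass.
print(interpolated_prob("cat", "on", "the", counts, vocab_size=5) > 0)
```

Note that the count ratios are only approximately normalized when the context word is the last token of the corpus; a production model would handle sentence boundaries explicitly.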
15 words seen in the training corpus to new sequences of words? [sent-27, score-0.595]
16 , the probability for a long sequence of words is obtained by "gluing" very short pieces of length 1, 2 or 3 words that have been seen frequently enough in the training data. [sent-30, score-0.954]
17 Obviously there is much more information in the sequence that precedes the word to predict than just the identity of the previous couple of words. [sent-31, score-0.482]
18 There are at least two obvious flaws in this approach (which however has turned out to be very difficult to beat): first it is not taking into account contexts farther than 1 or 2 words, second it is not taking account of the "similarity" between words. [sent-32, score-0.111]
19 For example, having seen the sentence "The cat is walking in the bedroom" in the training corpus should help us generalize to make the sentence "A dog was running in a room" almost as likely, simply because "dog" and "cat" (resp. [sent-33, score-0.495]
20 associate with each word in the vocabulary a distributed "feature vector" (a real-valued vector in $\mathbb{R}^m$), thereby creating a notion of similarity between words, 2. [sent-40, score-0.641]
21 express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and 3. [sent-41, score-1.043]
22 learn simultaneously the word feature vectors and the parameters of that function. [sent-42, score-0.505]
23 The feature vector represents different aspects of a word: each word is associated with a point in a vector space. [sent-43, score-0.565]
24 The probability function is expressed as a product of conditional probabilities of the next word given the previous ones, (e.g. [sent-47, score-0.63]
25 The feature vectors associated with each word are learned, but they can be initialized using prior knowledge. [sent-53, score-0.543]
26 2 Relation to Previous Work The idea of using neural networks to model high-dimensional discrete distributions has already been found useful in [3] where the joint probability of Zl . [sent-59, score-0.187]
27 ) is a function represented by part of a neural network, and it yields parameters for expressing the distribution of Zi. [sent-66, score-0.136]
28 The idea of using neural networks for language modeling is not new either, e.g. [sent-70, score-0.205]
29 In contrast, here we push this idea to a large scale, and concentrate on learning a statistical model of the distribution of word sequences, rather than learning the role of words in a sentence. [sent-73, score-0.846]
30 The proposed approach is also related to previous proposals of character-based text compression using neural networks [11]. [sent-74, score-0.144]
31 Learning a clustering of words [10, 1] is also a way to discover similarities between words. [sent-75, score-0.37]
32 a distributed feature vector, to indirectly represent similarity between words. [sent-78, score-0.164]
33 An important difference is that here we look for a representation for words that is helpful in representing compactly the probability distribution of word sequences from natural language text. [sent-80, score-1.036]
34 $w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set. [sent-85, score-0.456]
35 By the product of these conditional probabilities, one obtains a model of the joint probability of any sequence of words. [sent-91, score-0.205]
36 It represents the "distributed feature vector" associated with each word in the vocabulary. [sent-99, score-0.505]
37 We have considered two alternative formulations: (a) The direct architecture: a function $g$ maps a sequence of feature vectors for words in context $(C(w_{t-n}), \ldots, C(w_{t-1}))$ to a probability distribution over words in $V$. [sent-103, score-1.048]
38 We used the "softmax" in the output layer of a neural net: $P(w_t = i \mid w_1^{t-1}) = e^{h_i} / \sum_j e^{h_j}$, where $h_i$ is the neural network output score for word $i$. [sent-106, score-0.639]
39 (b) The cycling architecture: a function $h$ maps a sequence of feature vectors $(C(w_{t-n}), \ldots, C(w_{t-1}), C(i))$ (i.e. [sent-107, score-0.199]
40 including the context words and a candidate next word $i$) to a scalar $h_i$, and again using a softmax, $P(w_t = i \mid w_1^{t-1}) = e^{h_i} / \sum_j e^{h_j}$. [sent-109, score-0.963]
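The softmax in the sentence above can be sketched in a few lines. Subtracting the maximum score before exponentiating is a standard numerical-stability step that is assumed here; it is not described in the source.

```python
import math


def softmax(scores):
    """Map raw scores h_i to probabilities e^{h_i} / sum_j e^{h_j}."""
    top = max(scores)  # shifting by the max avoids overflow and leaves the ratios unchanged
    exps = [math.exp(h - top) for h in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(abs(sum(probs) - 1.0) < 1e-12)
```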
41 a neural net), each time putting in input the feature vector C(i) for a candidate next word i. [sent-113, score-0.615]
42 The function f is a composition of these two mappings (C and g), with C being shared across all the words in the context. [sent-114, score-0.37]
43 The parameters of the mapping $C$ are simply the feature vectors themselves (represented by a $|V| \times m$ matrix $C$ whose row $i$ is the feature vector $C(i)$ for word $i$). [sent-116, score-0.607]
44 The function $g$ may be implemented by a feed-forward or recurrent neural network or another parameterized function, with parameters $\theta$. [sent-117, score-0.146]
45 Training is achieved by looking for $(\theta, C)$ that maximize the training corpus penalized log-likelihood: $L = \frac{1}{T} \sum_t \log p_w$. [sent-119, score-0.152]
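As a rough illustration of the direct architecture, the sketch below looks up one row of $C$ per context word, concatenates the rows, and passes them through a single tanh hidden layer followed by a softmax. The hidden-layer form, the parameter names (`W_hidden`, `W_out`) and the initialization range are assumptions made for the example, not details confirmed by the source, and biases and direct input-to-output connections are omitted.

```python
import math
import random


def make_model(vocab_size, m, n_context, n_hidden, seed=0):
    """Randomly initialise C (|V| x m) and two weight matrices (hypothetical names)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": matrix(vocab_size, m),                   # row i is the feature vector C(i)
        "W_hidden": matrix(n_hidden, n_context * m),  # concatenated features -> hidden units
        "W_out": matrix(vocab_size, n_hidden),        # hidden units -> one score per word
    }


def dot(row, vec):
    return sum(a * b for a, b in zip(row, vec))


def next_word_distribution(model, context):
    """Return P(w_t = i | context) for every word index i."""
    x = [value for word in context for value in model["C"][word]]
    hidden = [math.tanh(dot(row, x)) for row in model["W_hidden"]]
    scores = [dot(row, hidden) for row in model["W_out"]]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


model = make_model(vocab_size=10, m=3, n_context=2, n_hidden=4)
probs = next_word_distribution(model, context=[1, 7])
print(len(probs), abs(sum(probs) - 1.0) < 1e-9)
```

Because $C$ is shared across all context positions, the same row is reused wherever a word appears in the context.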
46 The main idea is to focus the effort of the neural network on a "short list" of words that have the highest probability. [sent-125, score-0.53]
47 The idea of the speed-up trick is the following: instead of computing the actual probability of the next word, the neural network is used to compute the relative probability of the next word within that short list. [sent-127, score-0.83]
48 The choice of the short list depends on the current context (the previous n words). [sent-128, score-0.271]
49 We have used our smoothed trigram model to pre-compute a short list containing the most probable next words associated to the previous two words. [sent-129, score-1.108]
50 The conditional probabilities $P(w_t = i \mid h_t)$ are thus computed as follows, denoting with $h_t$ the history (context) before $w_t$. [sent-130, score-0.199]
51 and $L_t$ the short list of words for the prediction of $w_t$. [sent-131, score-0.552]
52 $P_{\text{trigram}}(i \mid h_t)$, with $P_{\text{trigram}}(i \mid h_t)$ standing for the next-word probabilities computed by the smoothed trigram. [sent-133, score-0.264]
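One way to read the short-list trick is sketched below. The extracted text does not show the full formula, so the combination rule is an assumption: words inside the short list receive the network's relative probability scaled by the trigram mass of the short list, and all other words fall back to the trigram. The function and argument names are hypothetical.

```python
import math


def short_list_probability(word, short_list, nn_scores, p_trigram):
    """Probability of `word` given the history, using a short list.

    short_list: set of candidate words chosen for this context.
    nn_scores:  raw network scores for the short-list words only.
    p_trigram:  smoothed trigram probabilities for every word.
    """
    if word not in short_list:
        return p_trigram[word]  # outside the short list: trigram only
    # Trigram probability mass assigned to the whole short list.
    list_mass = sum(p_trigram[j] for j in short_list)
    # Softmax normalised over the short list only (relative probability).
    top = max(nn_scores[j] for j in short_list)
    norm = sum(math.exp(nn_scores[j] - top) for j in short_list)
    relative = math.exp(nn_scores[word] - top) / norm
    return relative * list_mass


p_trigram = {"a": 0.5, "b": 0.3, "c": 0.2}
short_list = {"a", "b"}
nn_scores = {"a": 0.0, "b": 0.0}
total = sum(short_list_probability(w, short_list, nn_scores, p_trigram) for w in p_trigram)
print(abs(total - 1.0) < 1e-12)
```

Under this reading the result stays a proper distribution over the vocabulary, while the network only has to score the words in the short list.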
53 To speed up application of the trained model, one can pre-compute in a hash table the output of the neural network, at least for the most frequent input contexts. [sent-136, score-0.141]
54 In that case, the neural network will only be rarely called upon, and the average computation time will be very small. [sent-137, score-0.117]
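The hash-table idea can be sketched as a dictionary filled in advance for the most frequent contexts, with a fall-back call to the network for everything else. The names `build_cache` and `cached_predict`, and the stand-in `predict` function, are hypothetical.

```python
def build_cache(predict, frequent_contexts):
    """Pre-compute the model output for each frequent context."""
    return {tuple(c): predict(tuple(c)) for c in frequent_contexts}


def cached_predict(cache, predict, context):
    """Return the cached output if available, else call the model."""
    key = tuple(context)
    if key in cache:
        return cache[key]
    return predict(key)


calls = []


def predict(context):
    # Stand-in for an expensive neural network evaluation.
    calls.append(context)
    return len(context)


cache = build_cache(predict, [("of", "the"), ("in", "the")])
cached_predict(cache, predict, ["of", "the"])   # served from the table
cached_predict(cache, predict, ["a", "rare"])   # falls back to the model
print(len(calls))
```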
55 Note that in a speech recognition system, one needs only compute the relative probabilities of the acoustically ambiguous words in each context, also reducing drastically the amount of computations. [sent-138, score-0.458]
56 To speed up training using stochastic gradient descent, we have found it useful to break the corpus in paragraphs and to randomly permute them. [sent-142, score-0.192]
57 In this way, some of the non-stationarity in the word stream is eliminated, yielding faster convergence. [sent-143, score-0.467]
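Permuting paragraphs before stochastic gradient descent can be sketched as follows. The fixed seed and the function name are assumptions added so the example is reproducible.

```python
import random


def shuffled_paragraphs(paragraphs, seed=0):
    """Return the paragraphs in a random order, leaving the input untouched.

    Word order inside each paragraph is preserved, so local context
    windows remain valid training examples.
    """
    rng = random.Random(seed)
    order = list(paragraphs)
    rng.shuffle(order)
    return order


corpus = [["first", "paragraph"], ["second", "one"], ["third", "one"]]
mixed = shuffled_paragraphs(corpus)
print(sorted(mixed) == sorted(corpus))
```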
58 2 million examples), we have found early stopping and weight decay useful to avoid over-fitting. [sent-146, score-0.137]
59 We have found improved performance by combining the probability predictions of the neural network with those of the smoothed trigram, with weights that were conditional on the frequency of the context (same procedure used to combine trigram, bigram, and unigram in the smoothed trigram). [sent-150, score-0.809]
60 These context features are formed by counting the frequency of occurrence of each word in each one of the most frequent contexts (word sequences) in the corpus. [sent-155, score-0.764]
61 The idea is that "similar" words should occur with similar frequency in the same contexts. [sent-156, score-0.456]
62 We used about 9000 most frequent contexts, and compressed these to 30 features with the SVD. [sent-157, score-0.116]
63 For an out-of-vocabulary word Wt we need to come up with a feature vector in order to predict the words that follow, or predict its probability (that is only possible with the cycling architecture). [sent-159, score-1.03]
64 We used as feature vector the weighted average feature vector of all the words in the short list, with the weights being the relative probabilities of those words: $E[C(w_t) \mid h_t] = \sum_i C(i) P(w_t = i \mid h_t)$. [sent-160, score-0.713]
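The weighted average for an out-of-vocabulary word can be sketched as below. Renormalizing the probabilities over the short list is how "relative probabilities" is interpreted here; the function name and the toy vectors are hypothetical.

```python
def expected_feature_vector(C, probs):
    """Average the feature vectors C[word], weighted by relative probability.

    C:     mapping from word to its feature vector (list of floats).
    probs: P(w_t = word | h_t) for the words in the short list.
    """
    total = sum(probs.values())
    dim = len(next(iter(C.values())))
    result = [0.0] * dim
    for word, p in probs.items():
        weight = p / total  # relative probability within the short list
        for k, value in enumerate(C[word]):
            result[k] += weight * value
    return result


C = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
vector = expected_feature_vector(C, {"cat": 0.3, "dog": 0.1})
print(vector)
```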
65 The Brown corpus is a stream of 1,181,041 words (from a large variety of English texts and books). [sent-162, score-0.553]
66 The first 800,000 words were used for training, the following 200,000 for validation (model selection, weight decay, early stopping) and the remaining 181,041 for testing. [sent-163, score-0.426]
67 The number of different words is 47,578 (including punctuation, distinguishing between upper and lower case, and including the syntactical marks used to separate texts and paragraphs). [sent-164, score-0.404]
68 Rare words with frequency $\le 3$ were merged into a single token, reducing the vocabulary size to $|V| = 16{,}383$. [sent-165, score-0.49]
69 The Hansard corpus (Canadian parliament proceedings, French version) is a stream of about 34 million words, of which 32 million (set A) were used for training, 1. [sent-166, score-0.289]
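Merging rare words into a single token can be sketched with a frequency count. The token string `<rare>` and the function name are placeholders chosen for the example, not the symbols used by the authors.

```python
from collections import Counter


def merge_rare_words(tokens, max_rare_count=3, rare_token="<rare>"):
    """Replace every word occurring at most max_rare_count times by rare_token."""
    counts = Counter(tokens)
    return [t if counts[t] > max_rare_count else rare_token for t in tokens]


tokens = ["a"] * 4 + ["b"] * 3 + ["c"]
merged = merge_rare_words(tokens)
print(sorted(set(merged)))
```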
70 The benchmark against which the neural network was compared is an interpolated or smoothed trigram model [6]. [sent-170, score-0.689]
71 Let $q_t = l(\mathrm{freq}(w_{t-1}, w_{t-2}))$ represent the discretized frequency of occurrence of the context $(w_{t-1}, w_{t-2})$ (we used $l(x) = \lceil -\log((1 + x)/T) \rceil$ where $x$ is the frequency of occurrence of the context and $T$ is the size of the training corpus). [sent-171, score-0.369]
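The frequency discretization can be sketched as follows, assuming the bracket in the extracted formula denotes a ceiling and that the logarithm is natural; neither is confirmed by the extracted text.

```python
import math


def discretized_frequency(x, T):
    """Bucket index ceil(-log((1 + x) / T)) for a context seen x times.

    Frequent contexts map to small indices, rare ones to large indices.
    """
    return math.ceil(-math.log((1 + x) / T))


T = 800_000  # size of the Brown training set quoted in the text
print(discretized_frequency(0, T), discretized_frequency(1000, T))
```

Each bucket can then carry its own set of mixture weights, so contexts seen about equally often share the same weights.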
72 A conditional mixture of the trigram, bigram, unigram and zero-gram was learned on the validation set, with mixture weights conditional on discretized frequency. [sent-172, score-0.312]
73 Below are measures of test set perplexity (geometric average of $1/P(w_t \mid w_1^{t-1})$) for different models $P$. [sent-173, score-0.279]
74 Apparent convergence of the stochastic gradient descent procedure was obtained after around 10 epochs for Hansard and after about 50 epochs for Brown, with a learning rate gradually decreased from approximately $10^{-3}$ to $10^{-5}$. [sent-174, score-0.115]
75 Weight decay of $10^{-4}$ or $10^{-5}$ was used in all the experiments (based on a few experiments compared on the validation set). [sent-175, score-0.19]
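Perplexity as a geometric average of inverse probabilities can be computed in log space, as in this small sketch. The function name is illustrative.

```python
import math


def perplexity(probabilities):
    """Geometric mean of 1/p over the probabilities a model assigned to each test word."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# Inverse probabilities 2 and 8 have geometric mean 4.
print(round(perplexity([0.5, 0.125]), 6))
```

A model that assigned uniform probability over a vocabulary of size $|V|$ would have perplexity $|V|$, which gives a scale for reading the reported figures.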
76 The main result is that the neural network performs much better than the smoothed trigram. [sent-176, score-0.323]
77 On Brown the best neural network system, according to validation perplexity (among different architectures tried, see below) yielded a perplexity of 258, while the smoothed trigram yields a perplexity of 348, which is about 35% worse. [sent-177, score-1.667]
78 This is obtained using a network with the direct architecture mixed with the trigram (conditional mixture), with 30 word features initialized with the SVD method, 40 hidden units, and n = 5 words of context. [sent-178, score-1.435]
79 This is obtained with a network with the direct architecture, 100 randomly initialized word features, 120 hidden units, and n = 8 words of context. [sent-183, score-0.897]
80 Experiments with the cycling architecture on Brown, with 30 word features, and 30 hidden units, varying the number of context words: n = 1 (like the bigram) yields a test perplexity of 302, n = 3 yields 291, n = 5 yields 281, n = 8 yields 279 (N. [sent-185, score-1.319]
81 Experiments with the direct architecture on Brown (with direct input to output connections), with 30 word features, 5 words of context, varying the number of hidden units: 0 yields a test perplexity of 275, 10 yields 267, 20 yields 266, 40 yields 265, 80 yields 265. [sent-189, score-1.711]
82 Experiments with the direct architecture on Brown (40 hidden units, 5 words of context), in which the word features initialized with the SVD method are kept fixed during training yield a test perplexity of 345. [sent-191, score-1.364]
83 8 whereas if the word features are trained jointly with the rest of the parameters, the perplexity is 265. [sent-192, score-0.813]
84 Experiments on Brown with both architectures reveal that the SVD initialization of the word features does not bring much improvement with respect to random initialization: it speeds up initial convergence (saving about 2 epochs), and yields a perplexity improvement of less than 0. [sent-194, score-0.972]
85 The direct architecture was found about 2% better than the cycling architecture. [sent-197, score-0.221]
86 Conditional mixture helps but even without it the neural net is better. [sent-198, score-0.143]
87 On Brown, the best neural net without the mixture yields a test perplexity of 265, the smoothed trigram yields 348, and their conditional mixture yields 258 (i. [sent-199, score-1.316]
88 On Hansard the improvement is less: a neural network yielding 46. [sent-202, score-0.117]
89 2 million words), and a large one (34 million words) have shown that the proposed approach yields much better perplexity than a state-of-the-art method, the smoothed trigram, with differences on the order of 20% to 35%. [sent-207, score-0.81]
90 Note that if we had a separate feature vector for each "context" (short sequence of words), the model would have much more capacity (which could grow like that of n-grams) but it would not naturally generalize between the many different ways a word can be used. [sent-209, score-0.648]
91 A more reasonable alternative would be to explore language units other than words (e.g. [sent-210, score-0.531]
92 some short word sequences, or alternatively some sub-word morphemic units). [sent-212, score-0.514]
93 An important priority of future research should be to evaluate and improve the speeding-up tricks proposed here, and find ways to increase capacity without increasing training time too much (to deal with corpora with hundreds of millions of words). [sent-214, score-0.31]
94 A simple idea to take advantage of temporal structure and extend the size of the input window to include possibly a whole paragraph, without increasing too much the number of parameters, is to use a time-delay and possibly recurrent neural network. [sent-215, score-0.121]
95 In such a multi-layered network the computation that has been performed for small groups of consecutive words does not need to be redone when the network input window is shifted. [sent-216, score-0.506]
96 Looking at the word features learned by the model should help understand it and improve it. [sent-223, score-0.497]
97 Finally, future research should establish how useful the proposed approach will be in applications to speech recognition, language translation, and information retrieval. [sent-224, score-0.195]
98 Taking on the curse of dimensionality in joint distributions using neural networks. [sent-235, score-0.169]
99 Estimation of probabilities from sparse data for the language model component of a speech recognizer. [sent-276, score-0.201]
100 Natural language processing with modular neural networks and distributed lexicon. [sent-282, score-0.217]
wordName wordTfidf (topN-words)
[('word', 0.433), ('words', 0.37), ('trigram', 0.319), ('perplexity', 0.279), ('wt', 0.223), ('smoothed', 0.206), ('hansard', 0.12), ('ptrigram', 0.12), ('corpus', 0.115), ('brown', 0.114), ('language', 0.113), ('corpora', 0.103), ('list', 0.101), ('million', 0.093), ('sentence', 0.093), ('architecture', 0.092), ('context', 0.089), ('yields', 0.087), ('vocabulary', 0.086), ('short', 0.081), ('ilht', 0.08), ('cycling', 0.078), ('sequences', 0.073), ('feature', 0.072), ('curse', 0.072), ('network', 0.068), ('initialization', 0.064), ('features', 0.064), ('softmax', 0.062), ('conditional', 0.061), ('bigram', 0.06), ('ltlht', 0.06), ('probabilities', 0.058), ('validation', 0.056), ('distributed', 0.055), ('proposed', 0.052), ('frequent', 0.052), ('svd', 0.051), ('direct', 0.051), ('neural', 0.049), ('contexts', 0.049), ('sequence', 0.049), ('joint', 0.048), ('units', 0.048), ('probability', 0.047), ('net', 0.047), ('mixture', 0.047), ('interpolated', 0.047), ('ivi', 0.047), ('millions', 0.047), ('room', 0.047), ('experiments', 0.045), ('architectures', 0.045), ('zl', 0.045), ('decay', 0.044), ('idea', 0.043), ('cat', 0.043), ('epochs', 0.043), ('text', 0.043), ('frequency', 0.043), ('bengio', 0.041), ('zn', 0.041), ('droom', 0.04), ('ehi', 0.04), ('fot', 0.04), ('hash', 0.04), ('paragraphs', 0.04), ('pnn', 0.04), ('token', 0.04), ('tricks', 0.04), ('unigram', 0.04), ('weapons', 0.04), ('ng', 0.038), ('initialized', 0.038), ('similarity', 0.037), ('training', 0.037), ('jointly', 0.037), ('og', 0.037), ('semantic', 0.035), ('merged', 0.034), ('distributional', 0.034), ('fight', 0.034), ('ilwt', 0.034), ('montreal', 0.034), ('recherche', 0.034), ('texts', 0.034), ('yoshua', 0.034), ('occurrence', 0.034), ('stream', 0.034), ('generalize', 0.033), ('taking', 0.031), ('capacity', 0.031), ('next', 0.031), ('vector', 0.03), ('speech', 0.03), ('descent', 0.029), ('recurrent', 0.029), ('ht', 0.029), ('indexing', 0.029), ('wants', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 6 nips-2000-A Neural Probabilistic Language Model
Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
2 0.17301437 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
Author: Predrag Neskovic, Philip C. Davis, Leon N. Cooper
Abstract: In this work, we introduce an Interactive Parts (IP) model as an alternative to Hidden Markov Models (HMMs). We tested both models on a database of on-line cursive script. We show that implementations of HMMs and the IP model, in which all letters are assumed to have the same average width, give comparable results. However, in contrast to HMMs, the IP model can handle duration modeling without an increase in computational complexity.
3 0.15241855 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results.
4 0.11866689 141 nips-2000-Universality and Individuality in a Neural Code
Author: Elad Schneidman, Naama Brenner, Naftali Tishby, Robert R. de Ruyter van Steveninck, William Bialek
Abstract: The problem of neural coding is to understand how sequences of action potentials (spikes) are related to sensory stimuli, motor outputs, or (ultimately) thoughts and intentions. One clear question is whether the same coding rules are used by different neurons, or by corresponding neurons in different individuals. We present a quantitative formulation of this problem using ideas from information theory, and apply this approach to the analysis of experiments in the fly visual system. We find significant individual differences in the structure of the code, particularly in the way that temporal patterns of spikes are used to convey information beyond that available from variations in spike rate. On the other hand, all the flies in our ensemble exhibit a high coding efficiency, so that every spike carries the same amount of information in all the individuals. Thus the neural code has a quantifiable mixture of individuality and universality.
5 0.10200752 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
6 0.084823802 131 nips-2000-The Early Word Catches the Weights
7 0.071810193 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition
8 0.070267715 51 nips-2000-Factored Semi-Tied Covariance Matrices
9 0.066958331 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
10 0.066120997 138 nips-2000-The Use of Classifiers in Sequential Inference
11 0.065207765 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition
12 0.062798776 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
13 0.059047196 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
14 0.058371071 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
15 0.057454426 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
16 0.055643316 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
17 0.055332463 96 nips-2000-One Microphone Source Separation
18 0.05484831 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
19 0.053696081 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
20 0.052503787 105 nips-2000-Programmable Reinforcement Learning Agents
topicId topicWeight
[(0, 0.205), (1, -0.043), (2, 0.034), (3, 0.012), (4, -0.058), (5, 0.023), (6, -0.062), (7, 0.02), (8, 0.007), (9, 0.184), (10, 0.116), (11, 0.001), (12, 0.187), (13, 0.122), (14, -0.032), (15, 0.174), (16, 0.113), (17, -0.062), (18, 0.093), (19, 0.099), (20, 0.326), (21, -0.106), (22, 0.001), (23, -0.018), (24, 0.139), (25, -0.259), (26, 0.004), (27, 0.066), (28, -0.021), (29, 0.071), (30, 0.015), (31, -0.168), (32, 0.15), (33, -0.017), (34, -0.106), (35, 0.059), (36, -0.049), (37, 0.04), (38, 0.067), (39, 0.102), (40, -0.038), (41, 0.142), (42, -0.017), (43, 0.009), (44, -0.073), (45, -0.075), (46, -0.001), (47, -0.006), (48, -0.049), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.97158325 6 nips-2000-A Neural Probabilistic Language Model
Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
2 0.71011537 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
Author: Predrag Neskovic, Philip C. Davis, Leon N. Cooper
Abstract: In this work, we introduce an Interactive Parts (IP) model as an alternative to Hidden Markov Models (HMMs). We tested both models on a database of on-line cursive script. We show that implementations of HMMs and the IP model, in which all letters are assumed to have the same average width, give comparable results. However, in contrast to HMMs, the IP model can handle duration modeling without an increase in computational complexity.
3 0.58585125 131 nips-2000-The Early Word Catches the Weights
Author: Mark A. Smith, Garrison W. Cottrell, Karen L. Anderson
Abstract: The strong correlation between the frequency of words and their naming latency has been well documented. However, as early as 1973, the Age of Acquisition (AoA) of a word was alleged to be the actual variable of interest, but these studies seem to have been ignored in most of the literature. Recently, there has been a resurgence of interest in AoA. While some studies have shown that frequency has no effect when AoA is controlled for, more recent studies have found independent contributions of frequency and AoA. Connectionist models have repeatedly shown strong effects of frequency, but little attention has been paid to whether they can also show AoA effects. Indeed, several researchers have explicitly claimed that they cannot show AoA effects. In this work, we explore these claims using a simple feed forward neural network. We find a significant contribution of AoA to naming latency, as well as conditions under which frequency provides an independent contribution. 1 Background Naming latency is the time between the presentation of a picture or written word and the beginning of the correct utterance of that word. It is undisputed that there are significant differences in the naming latency of many words, even when controlling word length, syllabic complexity, and other structural variants. The cause of differences in naming latency has been the subject of numerous studies. Earlier studies found that the frequency with which a word appears in spoken English is the best determinant of its naming latency (Oldfield & Wingfield, 1965). More recent psychological studies, however, show that the age at which a word is learned, or its Age of Acquisition (AoA), may be a better predictor of naming latency. Further, in many multiple regression analyses, frequency is not found to be significant when AoA is controlled for (Brown & Watson, 1987; Carroll & White, 1973; Morrison et al. 1992; Morrison & Ellis, 1995). 
These studies show that frequency and AoA are highly correlated (typically r =-.6) explaining the confound of older studies on frequency. However, still more recent studies question this finding and find that both AoA and frequency are significant and contribute independently to naming latency (Ellis & Morrison, 1998; Gerhand & Barry, 1998,1999). Much like their psychological counterparts, connectionist networks also show very strong frequency effects. However, the ability of a connectionist network to show AoA effects has been doubted (Gerhand & Barry, 1998; Morrison & Ellis, 1995). Most of these claims are based on the well known fact that connectionist networks exhibit
4 0.41114464 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
5 0.41046 141 nips-2000-Universality and Individuality in a Neural Code
Author: Elad Schneidman, Naama Brenner, Naftali Tishby, Robert R. de Ruyter van Steveninck, William Bialek
Abstract: The problem of neural coding is to understand how sequences of action potentials (spikes) are related to sensory stimuli, motor outputs, or (ultimately) thoughts and intentions. One clear question is whether the same coding rules are used by different neurons, or by corresponding neurons in different individuals. We present a quantitative formulation of this problem using ideas from information theory, and apply this approach to the analysis of experiments in the fly visual system. We find significant individual differences in the structure of the code, particularly in the way that temporal patterns of spikes are used to convey information beyond that available from variations in spike rate. On the other hand, all the flies in our ensemble exhibit a high coding efficiency, so that every spike carries the same amount of information in all the individuals. Thus the neural code has a quantifiable mixture of individuality and universality.
6 0.40228721 130 nips-2000-Text Classification using String Kernels
7 0.31609064 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
8 0.31502768 138 nips-2000-The Use of Classifiers in Sequential Inference
9 0.30704635 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition
10 0.29114518 22 nips-2000-Algorithms for Non-negative Matrix Factorization
11 0.28015009 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
12 0.27685603 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
13 0.27442014 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
14 0.26664934 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
15 0.25415233 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition
16 0.25287223 51 nips-2000-Factored Semi-Tied Covariance Matrices
17 0.22657155 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
18 0.21232373 92 nips-2000-Occam's Razor
19 0.1995178 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
20 0.19771823 105 nips-2000-Programmable Reinforcement Learning Agents
topicId topicWeight
[(4, 0.016), (10, 0.028), (17, 0.088), (26, 0.377), (32, 0.016), (33, 0.05), (54, 0.013), (55, 0.035), (62, 0.063), (65, 0.026), (67, 0.066), (75, 0.012), (76, 0.02), (79, 0.014), (81, 0.048), (90, 0.034), (97, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.85494119 6 nips-2000-A Neural Probabilistic Language Model
Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
2 0.79242647 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks
Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi
Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small-scale networks shows that the proposed approximations are quite competitive.
3 0.40361235 80 nips-2000-Learning Switching Linear Models of Human Motion
Author: Vladimir Pavlovic, James M. Rehg, John MacCormick
Abstract: The human figure exhibits complex and rich dynamic behavior that is both nonlinear and time-varying. Effective models of human dynamics can be learned from motion capture data using switching linear dynamic system (SLDS) models. We present results for human motion synthesis, classification, and visual tracking using learned SLDS models. Since exact inference in SLDS is intractable, we present three approximate inference algorithms and compare their performance. In particular, a new variational inference algorithm is obtained by casting the SLDS model as a Dynamic Bayesian Network. Classification experiments show the superiority of SLDS over conventional HMMs for our problem domain.
4 0.38078094 51 nips-2000-Factored Semi-Tied Covariance Matrices
Author: Mark J. F. Gales
Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.
5 0.3765448 138 nips-2000-The Use of Classifiers in Sequential Inference
Author: Vasin Punyakanok, Dan Roth
Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.
6 0.36447829 94 nips-2000-On Reversing Jensen's Inequality
7 0.36107352 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
8 0.35683951 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
9 0.35648865 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
10 0.35146421 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
11 0.34999719 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
12 0.34922481 146 nips-2000-What Can a Single Neuron Compute?
13 0.3449268 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
14 0.34410203 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
15 0.34009132 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
16 0.33995506 79 nips-2000-Learning Segmentation by Random Walks
17 0.33986005 74 nips-2000-Kernel Expansions with Unlabeled Examples
18 0.33936459 49 nips-2000-Explaining Away in Weight Space
19 0.3370463 13 nips-2000-A Tighter Bound for Graphical Models
20 0.33671826 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition