nips nips2004 nips2004-200 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peng Xu, Frederick Jelinek
Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. [sent-2, score-0.731]
2 The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. [sent-3, score-0.278]
3 RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. [sent-4, score-0.069]
4 Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. [sent-5, score-0.369]
5 Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. [sent-6, score-0.367]
6 RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. [sent-7, score-0.359]
7 We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. [sent-8, score-0.634]
8 1 Introduction In many systems dealing with speech or natural language, such as Automatic Speech Recognition and Statistical Machine Translation, a language model is a crucial component for searching in the often prohibitively large hypothesis space. [sent-9, score-0.436]
9 Most state-of-the-art systems use n-gram language models, which are simple and effective most of the time. [sent-10, score-0.329]
10 Many smoothing techniques that improve language model probability estimation have been proposed and studied in the n-gram literature [1]. [sent-11, score-0.431]
11 There has so far been work in exploring Decision Tree (DT) language models [2, 3], which attempt to cluster similar histories together to achieve better probability estimation. [sent-12, score-0.406]
12 However, the results were negative [3]: decision tree language models failed to improve upon the baseline n-gram models with the same order n. [sent-13, score-0.432]
13 Random Forest (RF) language models, which are generalizations of DT language models, have been recently applied to word n-grams [4]. [sent-14, score-0.851]
14 DT growing is randomized to construct RFs efficiently. [sent-15, score-0.058]
15 Once constructed, the RFs function as a randomized history clustering, which helps in dealing with the data sparseness problem. [sent-16, score-0.158]
16 In general, the weakness of some trees can be compensated for by other trees. [sent-17, score-0.092]
17 The collective contribution of all DTs in an RF n-gram model results in significant improvements in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system. [sent-18, score-0.48]
18 Recent efforts have studied various ways of using information from a longer history span than that usually captured by normal n-gram language models, as well as ways of using syntactic information that is not available to the word-based n-gram models [5, 6, 7]. [sent-20, score-0.534]
19 All these language models are based on stochastic parsing techniques that build up parse trees for the input word sequence and condition the generation of words on syntactic and lexical information available in the parse trees. [sent-21, score-0.914]
20 Since these language models capture useful hierarchical characteristics of language, they can improve PPL and WER significantly for various tasks. [sent-22, score-0.329]
21 However, due to the n-gram nature of the components of the syntactic language models, the data sparseness problem can be severe. [sent-23, score-0.516]
22 In order to reduce the data sparseness problem for using rich syntactic information in the context, we study the use of RFs in the structured language model (SLM) [5]. [sent-24, score-0.535]
23 Our results show that although the components of the SLM have high order n-grams, our RF approach can still achieve better performance, reducing both the perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system compared to a Kneser-Ney smoothing baseline. [sent-25, score-0.611]
24 2 Basic Language Modeling The purpose of a language model is to estimate the probability of a word string. [sent-26, score-0.522]
25 However, in any practical natural language system of even moderate vocabulary size, it is clear that the number of probabilities to be estimated and stored is prohibitively large. [sent-39, score-0.453]
26 , wi−1 for a word wi are usually grouped into equivalence classes. [sent-43, score-0.578]
27 The most widely used language models, n-gram language models, use the identities of the last n − 1 words as equivalence classes. [sent-44, score-0.723]
28 In an n-gram model, we then have P (W )=P (w1 )× N i=2 i−1 P (wi |wi−n+1 ), (2) i−1 where we have used wi−n+1 to denote the word sequence wi−n+1 , . [sent-45, score-0.193]
29 For a vocabulary of size |V | = 104 , there are |V |3 = 1012 trigram probabilities to be estimated. [sent-55, score-0.163]
30 For any training data of a manageable size, many of the probabilities will be zero if the ML estimate is used. [sent-56, score-0.075]
31 In order to solve this problem, many smoothing techniques have been studied (see [1] and the references therein). [sent-57, score-0.102]
32 Smoothing adjusts the ML estimates to produce more accurate probabilities and to assign nonzero probabilities to any word string. [sent-58, score-0.293]
33 Details about various smoothing techniques will not be presented in this paper, but we will outline a particular way of smoothing, namely interpolated Kneser-Ney smoothing [8], for later reference. [sent-59, score-0.289]
34 To ensure that the probabilities sum to one, we have D i−1 λ(wi−n+1 )= wi :C(wi )>0 i−n+1 i−1 C(w ) i−n+1 1 . [sent-62, score-0.411]
35 The lower order probabilities in interpolated Kneser-Ney smoothing can be estimated as (assuming ML estimation): i−1 PKN (wi |wi−n+2 )= 1 wi−n+1 :C(wi )>0 i−n+1 . [sent-63, score-0.237]
36 2 Language Model Evalution A commonly used task-independent quality measure for a given language model is related to the cross-entropy of the underlying model and is referred to as perplexity (PPL): P P L=exp(−1/N N i=1 i−1 log [P (wi |w1 )]), (6) where w1 , . [sent-66, score-0.401]
37 For different tasks, there are different task-dependent quality measures of language models. [sent-70, score-0.329]
38 For example, in an automatic speech recognition system, the performance is usually measured by word error rate (WER). [sent-71, score-0.334]
39 3 The Structured Language Model (SLM) The SLM uses rich syntactic information beyond regular word n-grams to improve language model quality. [sent-72, score-0.642]
40 The model assigns a probability P (W, T ) to every sentence W and every possible binary parse T . [sent-74, score-0.16]
41 The terminals of T are the words of W with POS tags, and the nodes of T are annotated with phrase headwords and non-terminal labels. [sent-75, score-0.074]
42 Figure 1: A word-parse k-prefix a sentence of length n words to which we have prepended the sentence beginning marker and appended the sentence end marker so that w0 = and wn+1 = . [sent-97, score-0.263]
43 wk be the word k-prefix of the sentence — the words from the beginning of the sentence up to the current position k — and Wk Tk the word-parse k-prefix. [sent-101, score-0.788]
44 The exposed heads at a given position k in the input sentence are a function of the word-parse k-prefix [5]. [sent-105, score-0.132]
45 The joint probability P (W, T ) of a word sequence W and a complete parse T comes from contributions of three components: WORD-PREDICTOR, TAGGER and CONSTRUCTOR. [sent-106, score-0.279]
46 Each of the three components can be seen as an n-gram model and can be estimated independently because of the product form of the joint probability. [sent-109, score-0.055]
47 word), i (7) (8) (9) where pk is the ith CONSTRUCTOR action after the k th word and POS tag have been i predicted. [sent-124, score-0.307]
48 Since the number of parses for a given word prefix Wk grows exponentially with k, |{Tk }| ∼ O(2k ), the state space of our model is huge even for relatively short sentences. [sent-125, score-0.221]
49 Thus we must use a search strategy that prunes the allowable parse set. [sent-126, score-0.086]
50 Each model component —WORD-PREDICTOR, TAGGER, CONSTRUCTOR— is estimated independently from a set of parsed sentences after undergoing headword percolation and binarization (see details in [5]). [sent-129, score-0.087]
51 1 Using Random Forests in the Structured Language Model Random Forest n-gram Modeling A Random Forest (RF) n-gram model is a collection of randomly constructed decision tree (DT) n-gram models. [sent-131, score-0.097]
52 Unlike RFs in classification and regression tasks [9, 10, 11], RFs are used in language modeling to deal with the data sparsenes problem [4]. [sent-132, score-0.359]
53 Figure 2 shows the algorithm DT-Grow and Node-Split used for generating random DT language models. [sent-134, score-0.353]
54 We define a position in the history as the distance between a word in the history and the predicted word. [sent-135, score-0.397]
55 If Node-Split(p) is successful, eliminate p from Φ and put the two children of p in Φ Foreach internal node p in the tree Randomly select a subset of positions I in the history Foreach position i in I 1. [sent-138, score-0.24]
56 LH ← normalized likelihood of heldp out data associated with p, using training data statistics in p (a) Move each element from L to R if the move results in positive gain in training data likelihood 2. [sent-142, score-0.114]
57 LH ← normalized likelihood of heldP out data associated with all leaves in P, using training data statistics in the corresponding leaves (b) Move each element from R to L if the move results in positive gain in training data likelihood 4. [sent-144, score-0.081]
58 Since our splitting criterion is to maximize the log-likelihood of the training data, each split uses only statistics (from training data) associated with the node under consideration. [sent-146, score-0.107]
59 Given a position i in the history, β(v) is defined to be the set of histories belonging to the node p, such that they all have word v at position i. [sent-148, score-0.395]
60 It is clear that for every position i in the history, the union ∪v β(v) is all histories in the node p. [sent-149, score-0.168]
61 In DT-Grow, after a DT is fully grown, we use some heldout data to prune it. [sent-150, score-0.065]
62 This pruning is similar to the pruning strategy used in CART [13]. [sent-152, score-0.076]
63 Once we get the DTs, we only use the leaf nodes as equivalence classes of histories. [sent-153, score-0.088]
64 If a new history is encountered, it is very likely that we will not be able to place it at a leaf i−1 node in the DT. [sent-154, score-0.182]
65 The randomized version of the DT growing algorithm can be run many times and finally we will get a collection of randomly grown DTs: a Random Forest (RF). [sent-156, score-0.195]
66 Since each DT is a smoothed language model, we simply aggregate all DTs in our RF to get the RF language model. [sent-157, score-0.682]
67 In the n-gram case, the RF language model probabilities can be computed as: i−1 1 PRF (wi |wi−n+1 )= M i−1 ΦDTj (wi−n+1 ) M j=1 (12) i−1 PDTj (wi |ΦDTj (wi−n+1 )) i−1 wi−n+1 i−1 wi−n+1 where maps the history to a leaf node in DTj . [sent-162, score-0.561]
68 If can not be mapped to a leaf node in some DT, we back-off to the lower order KN probability as mentioned earlier. [sent-163, score-0.097]
69 The advantage of the RF approach over the KN smoothing lies in the fact that different DTs have different weaknesses and strengths for word prediction. [sent-166, score-0.295]
70 As the number of trees grows, the weakness of some trees can be compensated for by some other trees. [sent-167, score-0.151]
71 The only difference is that we will have different n-gram orders and different items in the history for each model. [sent-171, score-0.11]
72 The SLM uses a synchronous multi-stack search algorithm to dynamically construct stacks and compute the language model probabilities as in Equation 10. [sent-174, score-0.441]
73 Supa a pose we have M randomly grown DTs, DT1 , . [sent-178, score-0.113]
74 We calculate the joint probability P (W, T ) for the j th DT triple according to: Pj (W,T )= n+1 k=1 [PDT P j Nk i=1 (wk |Wk−1 Tk−1 )·PDT T (tk |Wk−1 Tk−1 ,wk )· j PDT C (pk |Wk−1 Tk−1 ,wk ,tk ,pk . [sent-186, score-0.061]
75 Finally, after the SLM is run M times, the RF language model probability is an average of the probabilities above: 1 PRF (wk+1 |Wk )= M M j=1 Pj (wk+1 |Wk ). [sent-191, score-0.379]
76 (15) The triple {DTjP , DTjT , DTjC } can be considered as a single DT in which the root node has three children corresponding to the three root nodes of DTjP , DTjT and DTjC . [sent-192, score-0.172]
77 The root node of this DT asks the question: Which model component does the history belong to? [sent-193, score-0.169]
78 We used section 00-20 for training our models, section 21-22 as heldout data for pruning the DTs, and section 23-24 to test our models. [sent-199, score-0.128]
79 Before carrying out our experiments, we normalized the text in the following ways: numbers in arabic form were replaced by a single token “N”, punctuation was removed, all words were mapped to lower case, extra information in the parse trees was ignored, and, finally, traces were ignored. [sent-200, score-0.219]
80 The word vocabulary contains 10k words including a special token for unknown words. [sent-201, score-0.341]
81 We trained an RF for each component and each RF contained 100 randomly grown DTs. [sent-204, score-0.113]
82 We also interpolated the SLM with the KN-trigram to get further improvements. [sent-208, score-0.109]
83 5% from RF-trigram to KN-trigram), the RF-SLM achieved greater improvement by using syntactic information. [sent-214, score-0.12]
84 0 - 130 Table 1: PPL comparison between KNSLM and RF-SLM, interpolated with KN-trigram 125 120 20 40 60 Number of DTs 80 100 Figure 3: PPL convergence 5. [sent-227, score-0.085]
85 2 Word Error Rate by N -best Re-scoring To test our RF modeling approach in the context of speech recognition, we evaluated the models in the WSJ DARPA’93 HUB1 test setup. [sent-228, score-0.137]
86 The 20k word open vocabulary and baseline 3-gram model are the standard ones provided by NIST and LDC — see [5] for details. [sent-230, score-0.301]
87 For the KN-SLM and RFSLM, we used 20M words automatically parsed, binarized and enriched with headwords and NT/POS tag information. [sent-233, score-0.126]
88 6 Table 2: N-best rescoring WER results For purpose of comparison, we interpolated all models with the KN-trigram built from 40M words at different level of interpolation weights α (on KN-trigram). [sent-252, score-0.156]
89 6 Conclusions Based on the idea of Random Forests in classification and regression, we developed algorithms for constructing and using Random Forests in language modeling. [sent-257, score-0.329]
90 The independently constructed Random Forests can be considered as a more general single Random Forest, which ensures the convergence of the probabilities as the number of Decision Trees grows. [sent-259, score-0.076]
91 The results on a large vocabulary speech recognition system show that we can achieve significant reduction in both perplexity and word error rate, compared to a baseline using Kneser-Ney smoothing. [sent-260, score-0.514]
92 Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Tech. [sent-262, score-0.431]
93 Mercer, “A tree-based statistical language model for natural language speech recognition,” in IEEE Transactions on Acoustics, Speech and Signal Processing, July 1989, vol. [sent-269, score-0.765]
94 [3] Gerasimos Potamianos and Frederick Jelinek, “A study of n-gram and decision tree letter language modeling methods,” Speech Communication, vol. [sent-272, score-0.428]
95 [4] Peng Xu and Frederick Jelinek, “Random forests in language modeling,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004. [sent-275, score-0.431]
96 [5] Ciprian Chelba and Frederick Jelinek, “Structured language modeling,” Computer Speech and Language, vol. [sent-276, score-0.329]
97 [6] Eugene Charniak, “Immediate-head parsing for language models,” in Proceedings of the 39th Annual Meeting and 10th Conference of the European Chapter of ACL, Toulouse, France, July 2001, pp. [sent-280, score-0.329]
98 [8] Reinhard Kneser and Hermann Ney, “Improved backing-off for m-gram language modeling,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, vol. [sent-285, score-0.329]
99 Geman, “Shape quantization and recognition with randomized trees,” Neural Computation, , no. [sent-290, score-0.069]
100 Ney, “Algorithms for bigram and trigram word clustering,” Speech Communication, vol. [sent-306, score-0.232]
wordName wordTfidf (topN-words)
[('wk', 0.372), ('wi', 0.361), ('language', 0.329), ('rfs', 0.276), ('slm', 0.276), ('dts', 0.198), ('tk', 0.194), ('word', 0.193), ('rf', 0.173), ('ppl', 0.163), ('dt', 0.139), ('syntactic', 0.12), ('wer', 0.114), ('speech', 0.107), ('smoothing', 0.102), ('forests', 0.102), ('pkn', 0.098), ('parse', 0.086), ('history', 0.085), ('interpolated', 0.085), ('grown', 0.085), ('histories', 0.077), ('sentence', 0.074), ('vocabulary', 0.074), ('perplexity', 0.072), ('pdt', 0.071), ('jelinek', 0.071), ('constructor', 0.065), ('frederick', 0.065), ('heldout', 0.065), ('tagger', 0.065), ('forest', 0.062), ('trees', 0.059), ('pre', 0.057), ('node', 0.057), ('lh', 0.057), ('pos', 0.057), ('tag', 0.052), ('probabilities', 0.05), ('kn', 0.05), ('dtjc', 0.049), ('dtjp', 0.049), ('dtjt', 0.049), ('structured', 0.048), ('dtj', 0.042), ('words', 0.041), ('pj', 0.041), ('tree', 0.04), ('leaf', 0.04), ('trigram', 0.039), ('pk', 0.038), ('pruning', 0.038), ('sparseness', 0.038), ('triple', 0.037), ('synchronous', 0.036), ('randomized', 0.035), ('position', 0.034), ('baseline', 0.034), ('recognition', 0.034), ('chelba', 0.033), ('compensated', 0.033), ('dtm', 0.033), ('headword', 0.033), ('headwords', 0.033), ('heldp', 0.033), ('prf', 0.033), ('token', 0.033), ('treebank', 0.033), ('upenn', 0.033), ('wsj', 0.033), ('wn', 0.032), ('ml', 0.031), ('gain', 0.031), ('interpolation', 0.03), ('modeling', 0.03), ('decision', 0.029), ('components', 0.029), ('breiman', 0.028), ('parsed', 0.028), ('ney', 0.028), ('parses', 0.028), ('foreach', 0.028), ('randomly', 0.028), ('root', 0.027), ('sk', 0.026), ('independently', 0.026), ('peng', 0.026), ('triples', 0.026), ('stacks', 0.026), ('training', 0.025), ('items', 0.025), ('get', 0.024), ('random', 0.024), ('heads', 0.024), ('equivalence', 0.024), ('children', 0.024), ('july', 0.024), ('th', 0.024), ('equation', 0.024), ('growing', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 200 nips-2004-Using Random Forests in the Structured Language Model
Author: Peng Xu, Frederick Jelinek
Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1
2 0.20377815 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
Author: Tong Zhang
Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined. 1
3 0.15888526 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
4 0.1556243 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
5 0.1110151 205 nips-2004-Who's In the Picture
Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth
Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1
6 0.093381979 173 nips-2004-Spike-timing Dependent Plasticity and Mutual Information Maximization for a Spiking Neuron Model
7 0.090892226 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
8 0.088857204 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
9 0.080848806 175 nips-2004-Stable adaptive control with online learning
10 0.071544975 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
11 0.070573874 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
12 0.070062555 112 nips-2004-Maximising Sensitivity in a Spiking Network
13 0.059042938 28 nips-2004-Bayesian inference in spiking neurons
14 0.056950863 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
15 0.054947861 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
16 0.053740837 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
17 0.049859107 100 nips-2004-Learning Preferences for Multiclass Problems
18 0.049170017 137 nips-2004-On the Adaptive Properties of Decision Trees
19 0.048798468 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
20 0.048575163 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
topicId topicWeight
[(0, -0.141), (1, -0.021), (2, 0.019), (3, -0.038), (4, 0.004), (5, 0.05), (6, 0.029), (7, 0.194), (8, 0.057), (9, 0.094), (10, 0.035), (11, -0.037), (12, -0.187), (13, 0.163), (14, -0.149), (15, -0.082), (16, 0.15), (17, -0.002), (18, 0.084), (19, 0.03), (20, -0.047), (21, -0.203), (22, -0.226), (23, -0.141), (24, -0.004), (25, -0.04), (26, 0.251), (27, 0.16), (28, 0.076), (29, 0.07), (30, -0.012), (31, 0.161), (32, -0.082), (33, 0.208), (34, 0.071), (35, 0.029), (36, 0.139), (37, 0.003), (38, 0.041), (39, 0.097), (40, -0.003), (41, 0.018), (42, 0.032), (43, -0.02), (44, -0.044), (45, 0.029), (46, -0.043), (47, -0.059), (48, -0.065), (49, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.97526878 200 nips-2004-Using Random Forests in the Structured Language Model
Author: Peng Xu, Frederick Jelinek
Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1
2 0.63891923 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
Author: Tong Zhang
Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined. 1
3 0.55645508 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
4 0.52168769 205 nips-2004-Who's In the Picture
Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth
Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1
5 0.44203383 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
6 0.44089025 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
7 0.35964677 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
9 0.28289247 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
10 0.28097743 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
11 0.27594951 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
12 0.2651926 175 nips-2004-Stable adaptive control with online learning
13 0.24759924 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
14 0.24679457 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
15 0.23904958 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
16 0.21937339 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
17 0.20553659 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery
18 0.20435554 100 nips-2004-Learning Preferences for Multiclass Problems
19 0.19940959 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
20 0.1922747 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
topicId topicWeight
[(9, 0.043), (13, 0.126), (15, 0.117), (26, 0.042), (31, 0.038), (33, 0.115), (39, 0.01), (50, 0.03), (51, 0.02), (67, 0.322), (87, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.78146881 200 nips-2004-Using Random Forests in the Structured Language Model
Author: Peng Xu, Frederick Jelinek
Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1
2 0.69498193 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
Author: Antonio Torralba, Kevin P. Murphy, William T. Freeman
Abstract: We seek to both detect and segment objects in images. To exploit both local image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which uses Boosting to learn the graph structure and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. 1
3 0.53726023 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan
Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.
4 0.53344154 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.53100282 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
Author: Tong Zhang
Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined. 1
6 0.53039432 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
7 0.5287894 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
8 0.52776045 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
9 0.5272702 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
10 0.52564251 28 nips-2004-Bayesian inference in spiking neurons
11 0.52508038 178 nips-2004-Support Vector Classification with Input Data Uncertainty
12 0.52403325 102 nips-2004-Learning first-order Markov models for control
13 0.5224784 116 nips-2004-Message Errors in Belief Propagation
14 0.52219248 70 nips-2004-Following Curved Regularized Optimization Solution Paths
15 0.52135152 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
16 0.52068704 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.52065623 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
18 0.51999444 163 nips-2004-Semi-parametric Exponential Family PCA
19 0.51968867 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
20 0.51955312 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition