nips nips2004 nips2004-87 knowledge-graph by maker-knowledge-mining

87 nips-2004-Integrating Topics and Syntax


Source: pdf

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Massachusetts Institute of Technology, Cambridge, MA 02139. Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. [sent-9, score-1.068]

2 We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. [sent-10, score-1.217]

3 This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. [sent-11, score-0.352]

4 1 Introduction A word can appear in a sentence for two reasons: because it serves a syntactic function, or because it provides semantic content. [sent-12, score-0.965]

5 Words that play different roles are treated differently in human language processing: function and content words produce different patterns of brain activity [1], and have different developmental trends [2]. [sent-13, score-0.538]

6 So, how might a language learner discover the syntactic and semantic classes of words? [sent-14, score-0.949]

7 In this paper, we explore how statistical learning, with no prior knowledge of either syntax or semantics, can discover the difference between function and content words and simultaneously organize words into syntactic classes and semantic topics. [sent-16, score-1.556]

8 Our approach relies on the different kinds of dependencies between words produced by syntactic and semantic constraints. [sent-17, score-1.148]

9 Syntactic constraints result in relatively short-range dependencies, spanning several words within the limits of a sentence. [sent-18, score-0.217]

10 This model is a generative model for text in which a hidden Markov model (HMM) determines when to emit a word from a topic model. [sent-21, score-0.462]

11 The different capacities of the two components of the model result in a factorization of a sentence into function words, handled by the HMM, and content words, handled by the topic model. [sent-22, score-0.409]

12 Each component divides words into finer groups according to a different criterion: the function words are divided into syntactic classes, and the content words are divided into semantic topics. [sent-23, score-1.579]

13 This model can be used to extract clean syntactic and semantic classes and to identify the role that words play in a document. [sent-24, score-1.161]

14 It is also competitive in quantitative tasks, such as part-of-speech tagging and document classification, with models specialized to detect short- and long-range dependencies respectively. [sent-25, score-0.325]

15 First, we introduce the approach, considering the general question of how syntactic and semantic generative models might be combined, and arguing that a composite model is necessary to capture the different roles that words can play in a document. [sent-27, score-1.528]

16 Finally, we present results illustrating the quality of the recovered syntactic classes and semantic topics. [sent-29, score-0.876]

17 2 Combining syntactic and semantic generative models A probabilistic generative model specifies a simple stochastic procedure by which data might be generated, usually making reference to unobserved random variables that express latent structure. [sent-30, score-0.965]

18 Such an approach is appropriate for modeling language, where words are generated from the latent structure of the speaker’s intentions, and is widely used in statistical natural language processing [5]. [sent-32, score-0.338]

19 Probabilistic models of language are typically developed to capture either short-range or long-range dependencies between words. [sent-33, score-0.218]

20 HMMs and probabilistic context-free grammars [5] generate documents purely based on syntactic relations among unobserved word classes, while “bag-of-words” models like naive Bayes or topic models [6] generate documents based on semantic correlations between words, independent of word order. [sent-34, score-1.424]

21 A generative model for text specifies a probability distribution over words in terms of other probability distributions over words, and different models are thus easily combined. [sent-37, score-0.351]

22 We can produce a model that expresses both the short- and long-range dependencies of words by combining two models that are each sensitive to one kind of dependency. [sent-38, score-0.42]

23 In a mixture of syntactic and semantic models, each word would exhibit either short-range or long-range dependencies, while in a product of models (e. [sent-40, score-0.94]

24 In fact, only a subset of words – the content words – exhibit long-range semantic dependencies, while all words obey short-range syntactic dependencies. [sent-44, score-1.579]

25 This asymmetry can be captured in a composite model, where we replace one of the probability distributions over words used in the syntactic model with the semantic model. [sent-45, score-1.394]

26 This allows the syntactic model to choose when to emit a content word, and the semantic model to choose which word to emit. [sent-46, score-1.152]

27 1 A composite model We will explore a simple composite model, in which the syntactic component is an HMM and the semantic component is a topic model. [sent-48, score-1.665]

28 The graphical model for this composite is shown in Figure 1(a). [sent-49, score-0.38]

29 The model is defined in terms of three sets of variables: a sequence of words w = {w1, . . . , wn}, with each wi being one of W words; [sent-50, score-0.244]

30 a sequence of topic assignments z = {z1, . . . , zn}, with each zi being one of T topics; [sent-53, score-0.274]

31 and a sequence of classes c = {c1, . . . , cn}, with each ci being one of C classes. [sent-56, score-0.187]

32 One class, say ci = 1, is designated the “semantic” class. [sent-60, score-0.259]

33 The zth topic is associated with a distribution over words φ(z). [sent-61, score-0.382]

34 Figure 1: The composite model. [sent-83, score-0.353]

35 Each class c ≠ 1 is associated with a distribution over words φ(c), each document d has a distribution over topics θ(d), and transitions between classes ci−1 and ci follow a distribution π(ci−1). [sent-86, score-0.873]

36 For each word wi in document d: (a) draw zi from θ(d); (b) draw ci from π(ci−1); (c) if ci = 1, draw wi from φ(zi), otherwise draw wi from φ(ci). Figure 1(b) provides an intuitive representation of how phrases are generated by the composite model. [sent-89, score-1.438]
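As a rough illustration of this generative process, here is a minimal Python sketch; the array names (theta_d, pi, phi_topics, phi_classes) and the use of index 0 for the semantic class are illustrative assumptions, not the paper's notation or code.

```python
import numpy as np

rng = np.random.default_rng(0)
SEM = 0  # index of the semantic class in this sketch (the paper labels it class 1)

def generate_document(n_words, theta_d, pi, phi_topics, phi_classes):
    """Sketch of the composite model's generative process.

    theta_d     : (T,)   document-specific distribution over topics
    pi          : (C, C) class transition matrix, row = previous class
    phi_topics  : (T, W) per-topic distributions over words
    phi_classes : (C, W) per-class distributions over words
    """
    words, c_prev = [], SEM  # a dummy previous class starts the chain
    for _ in range(n_words):
        z = rng.choice(len(theta_d), p=theta_d)        # (a) draw z_i from theta^(d)
        c = rng.choice(pi.shape[1], p=pi[c_prev])      # (b) draw c_i from pi^(c_{i-1})
        if c == SEM:                                   # (c) semantic class: the topic emits the word
            w = rng.choice(phi_topics.shape[1], p=phi_topics[z])
        else:                                          #     syntactic class: the class emits the word
            w = rng.choice(phi_classes.shape[1], p=phi_classes[c])
        words.append(w)
        c_prev = c
    return words
```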

37 The third is a topic model, containing three topics. [sent-92, score-0.165]

38 The topics in the semantic class also have probabilities, used to choose a topic when the HMM transitions to the semantic class. [sent-94, score-1.093]

39 Phrases are generated by following a path through the model, choosing a word from the distribution associated with each syntactic class, and a topic followed by a word from the distribution associated with that topic for the semantic class. [sent-95, score-1.381]

40 Sentences with the same syntax but different content would be generated if the topic distribution were different. [sent-96, score-0.411]

41 The generative model thus acts like it is playing a game of Madlibs: the semantic component provides a list of topical words (shown in black) which are slotted into templates generated by the syntactic component (shown in gray). [sent-97, score-1.081]

42 2 Inference The EM algorithm can be applied to the graphical model shown in Figure 1, treating the document distributions θ, the topics and classes φ, and the transition probabilities π as parameters. [sent-99, score-0.367]

43 However, EM produces poor results with topic models, which have many parameters and many local maxima. [sent-100, score-0.165]

44 We will use Markov chain Monte Carlo (MCMC; see [9]) to perform full Bayesian inference in this model, sampling from a posterior distribution over assignments of words to classes and topics. [sent-102, score-0.427]

45 We use Gibbs sampling to iteratively draw a topic assignment zi and a class assignment ci for each word wi in the corpus (see [8, 9]). [sent-104, score-0.925]
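The paper's sampler is collapsed (θ, φ, and π are integrated out under Dirichlet priors); as a simpler hedged sketch, the update below conditions on explicit parameter values instead, which keeps the structure of the two conditional distributions visible. Array names and the boundary-padding convention are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
SEM = 0  # semantic-class index used in this sketch

def resample_word(i, w, z, c, d, theta, pi, phi_topics, phi_classes):
    """One Gibbs update for token i (explicit-parameter version, not the collapsed sampler)."""
    c_prev, c_next = c[i - 1], c[i + 1]   # assumes dummy boundary classes pad the sequence
    # Class assignment: transition in, transition out, and the word's emission likelihood.
    emit = np.where(np.arange(pi.shape[0]) == SEM,
                    phi_topics[z[i], w[i]],      # semantic class emits via the current topic
                    phi_classes[:, w[i]])        # other classes emit from their own distribution
    p_c = pi[c_prev] * pi[:, c_next] * emit
    c[i] = rng.choice(len(p_c), p=p_c / p_c.sum())
    # Topic assignment: the topic only affects the likelihood when the semantic class emitted w_i.
    p_z = theta[d] * (phi_topics[:, w[i]] if c[i] == SEM else 1.0)
    z[i] = rng.choice(len(p_z), p=p_z / p_z.sum())
```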

46 The Brown corpus [10] consists of D = 500 documents and n = 1,137,466 word tokens, with part-of-speech tags for each token. [sent-111, score-0.358]

47 The TASA corpus is an untagged collection of educational materials consisting of D = 37,651 documents and n = 12,190,931 word tokens. [sent-112, score-0.301]

48 On the Brown corpus, we ran samplers for LDA and 1st, 2nd, and 3rd order HMM and composite models, with three chains of 4000 iterations each, taking samples at a lag of 100 iterations after a burn-in of 2000 iterations. [sent-120, score-0.353]
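The schedule just described (4000 iterations, burn-in of 2000, lag of 100) yields 20 retained samples per chain; a schematic of that bookkeeping, with a placeholder gibbs_sweep standing in for one full pass of the sampler, is:

```python
import copy

def gibbs_sweep(state):
    """Placeholder for one full Gibbs pass over all word tokens (e.g., resample_word above)."""
    return state

state = {"z": [], "c": []}       # current topic and class assignments (illustrative)
samples, burn_in, lag, n_iters = [], 2000, 100, 4000
for it in range(1, n_iters + 1):
    state = gibbs_sweep(state)
    if it > burn_in and (it - burn_in) % lag == 0:
        samples.append(copy.deepcopy(state))
assert len(samples) == 20        # (4000 - 2000) / 100 retained samples per chain
```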

49 On Brown+TASA, we ran a single chain for 4000 iterations for LDA and the 3rd order HMM and composite models. [sent-121, score-0.376]

50 1 Syntactic classes and semantic topics The two components of the model are sensitive to different kinds of dependency among words. [sent-124, score-0.703]

51 The HMM is sensitive to short-range dependencies that are constant across documents, and the topic model is sensitive to long-range dependencies that vary across documents. [sent-125, score-0.56]

52 As a consequence, the HMM allocates words that vary across contexts to the semantic class, where they are differentiated into topics. [sent-126, score-0.651]

53 The results of the algorithm, taken from the 4000th iteration of a 3rd order composite model on Brown+TASA, are shown in Figure 2. [sent-127, score-0.38]

54 The model cleanly separates words that play syntactic and semantic roles, in sharp contrast to the results of the LDA model, also shown in the figure, where all words are forced into topics. [sent-128, score-1.269]

55 The syntactic categories include prepositions, pronouns, past-tense verbs, and punctuation. [sent-129, score-0.411]

56 While one state of the HMM, shown in the eighth column of the figure, emits common nouns, the majority of nouns are assigned to the semantic class. [sent-130, score-0.425]

57 The designation of words as syntactic or semantic depends upon the corpus. [sent-131, score-0.984]

58 Each column represents a single topic/class, and words appear in order of probability in that topic/class. [sent-134, score-0.217]

59 Since some classes give almost all probability to only a few words, a list is terminated when the words account for 90% of the probability mass. [sent-135, score-0.326]
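A small hedged sketch of that truncation rule (function and variable names are illustrative):

```python
import numpy as np

def top_words(phi_k, vocab, mass=0.9):
    """Highest-probability words of one topic/class, cut once `mass` of the probability is covered."""
    order = np.argsort(phi_k)[::-1]
    cutoff = np.searchsorted(np.cumsum(phi_k[order]), mass) + 1
    return [vocab[i] for i in order[:cutoff]]
```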

60 We replaced all words appearing in fewer than 3 papers with an asterisk, leading to W = 17,268 types. [sent-139, score-0.244]

61 A selection of topics and classes from the 4000th iteration are shown in Figure 3. [sent-141, score-0.238]

62 Words that might convey semantic information in another setting, such as “model”, “algorithm”, or “network”, form part of the syntax of NIPS: the consistent use of these words across documents leads them to be incorporated into the syntactic component. [sent-142, score-1.167]

63 2 Identifying function and content words Identifying function and content words requires using information about both syntactic class and semantic context. [sent-144, score-1.568]

64 In a machine learning paper, the word “control” might be an innocuous verb, or an important part of the content of a paper. [sent-145, score-0.303]

65 Likewise, “graph” could refer to a figure, or indicate content related to graph theory. [sent-146, score-0.182]

66 Tagging classes might indicate that “control” appears as a verb rather than a noun, but deciding that “graph” refers to a figure requires using information about the content of the rest of the document. [sent-147, score-0.298]

67 The factorization of words between the HMM and LDA components provides a simple means of assessing the role that a given word plays in a document: evaluating the posterior probability of assignment to the LDA component. [sent-148, score-0.414]

68 The results of using this procedure to identify content words in sentences excerpted from NIPS papers are shown in Figure 4. [sent-149, score-0.434]

69 Probabilities were evaluated by averaging over assignments from all 20 samples, and take into account the semantic context of the whole document. [sent-150, score-0.406]
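A hedged sketch of that computation: the probability that a token is a content word is estimated as the fraction of retained samples in which it was assigned to the semantic (LDA) class. The array layout is an assumption.

```python
import numpy as np

def content_probability(class_samples, sem_class=0):
    """P(token assigned to the semantic/LDA component), averaged over MCMC samples.

    class_samples : (n_samples, n_tokens) array of class assignments, one row per retained sample.
    """
    return (np.asarray(class_samples) == sem_class).mean(axis=0)

# Tokens with probability near 1 act as content words in this document; near 0, as function words.
```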

70 As a result of combining short- and long-range dependencies, the model is able to pick out the words in each sentence that concern the content of the document. [sent-151, score-0.461]

71 Figure 4: Function and content words in the NIPS corpus. [sent-162, score-0.378]

72 The boxed word appears as a function word and a content word in one element of each pair of sentences. [sent-164, score-0.587]

73 Asterisked words had low frequency, and were treated as a single word type by the model. [sent-165, score-0.359]

74 Words assigned to syntactic HMM classes produce templates for writing NIPS papers, into which content words can be inserted. [sent-166, score-0.928]

75 The composite model provided the best account of both corpora, as measured by marginal likelihood on Brown and Brown+TASA. [sent-173, score-0.38]

76 (Figure axis labels: 1st, 2nd, and 3rd order models on Brown and Brown+TASA; 1000 most frequent words.) [sent-181, score-0.217]

77 Figure 6: Part-of-speech tagging for HMM, composite, and distributional clustering (DC). [sent-184, score-0.149]

78 Using a higher-order transition matrix for either the HMM or the composite model produced little improvement in marginal likelihood for the Brown corpus, but the 3rd order models performed best on Brown+TASA. [sent-186, score-0.465]

79 4 Part-of-speech tagging Part-of-speech tagging – identifying the syntactic class of a word – is a standard task in computational linguistics. [sent-188, score-0.836]

80 Most unsupervised tagging methods use a lexicon that identifies the possible classes for different words. [sent-189, score-0.217]

81 This simplifies the problem, as most words belong to a single class. [sent-190, score-0.217]

82 However, genuinely unsupervised recovery of parts-of-speech has been used to assess statistical models of language learning, such as distributional clustering [3]. [sent-191, score-0.145]

83 We assessed tagging performance on the Brown corpus, using two tagsets. [sent-192, score-0.144]

84 We evaluated tagging performance using the Adjusted Rand Index [12] to measure the concordance between the tags and the class assignments of the HMM and composite models in the 4000th iteration. [sent-195, score-0.644]
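For concreteness, the Adjusted Rand Index can be computed as below; this uses scikit-learn's adjusted_rand_score on toy labels, not the paper's actual tag data.

```python
from sklearn.metrics import adjusted_rand_score

gold_tags     = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"]  # toy part-of-speech tags
model_classes = [3, 7, 2, 3, 7, 2]                              # toy unsupervised class assignments
print(adjusted_rand_score(gold_tags, model_classes))            # 1.0: perfect match up to relabelling
```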

85 Both models produced class assignments that were strongly concordant with part-of-speech, although the HMM gave a slightly better match to the full tagset, and the composite model gave a closer match to the top-level tags. [sent-198, score-0.621]

86 This is partly because all words that vary strongly in frequency across contexts get assigned to the semantic class in the composite model, so it misses some of the fine-grained distinctions expressed in the full tagset. [sent-199, score-1.079]

87 Both the HMM and the composite model performed better than the distributional clustering method described in [3], which was used to form the 1000 most frequent words in Brown into 19 clusters. [sent-200, score-0.638]

88 Figure 6 compares this clustering with the classes for those words from the HMM and composite models trained on Brown. [sent-201, score-0.71]

89 5 Document classification The 500 documents in the Brown corpus are classified into 15 groups, such as editorial journalism and romance fiction. [sent-203, score-0.159]

90 We assessed the quality of the topics recovered by the LDA and composite models by training a naive Bayes classifier on the topic vectors produced by the two models. [sent-204, score-0.735]
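A hedged sketch of such an evaluation, with random stand-in data in place of the real per-document topic vectors and Brown category labels; it is not the paper's exact protocol.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
topic_vectors = rng.integers(0, 50, size=(500, 20))  # stand-in for counts of tokens per topic
labels = rng.integers(0, 15, size=500)               # stand-in for the 15 Brown document groups

scores = cross_val_score(MultinomialNB(), topic_vectors, labels, cv=10)
print(scores.mean())
```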

91 The 1st, 2nd, and 3rd order composite models gave 0. [sent-212, score-0.431]

92 The slightly lower accuracy of the composite model may result from having fewer data in which to find correlations: it only sees the words allocated to the semantic component, which account for approximately 20% of the words in the corpus. [sent-229, score-1.17]

93 4 Conclusion The composite model we have described captures the interaction between short- and long-range dependencies between words. [sent-230, score-0.494]

94 As a consequence, the posterior distribution over the latent variables in this model picks out syntactic classes and semantic topics and identifies the role that words play in documents. [sent-231, score-1.366]

95 The model is competitive in part-of-speech tagging and classification with models that specialize in short- and long-range dependencies respectively. [sent-232, score-0.28]

96 Clearly, such a model does not do justice to the depth of syntactic or semantic structure, or their interaction. [sent-233, score-0.794]

97 However, it illustrates how a sensitivity to different kinds of statistical dependency might be sufficient for the first stages of language acquisition, discovering the syntactic and semantic building blocks that form the basis for learning more sophisticated representations. [sent-234, score-0.891]

98 The TASA corpus appears courtesy of Tom Landauer and Touchstone Applied Science Associates, and the NIPS corpus was provided by Sam Roweis. [sent-236, score-0.172]

99 Distributional information: A powerful cue for acquiring syntactic categories. [sent-255, score-0.411]

100 Towards better integration of semantic predictors in statistical language modeling. [sent-281, score-0.429]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('syntactic', 0.411), ('semantic', 0.356), ('composite', 0.353), ('hmm', 0.269), ('ci', 0.259), ('words', 0.217), ('brown', 0.209), ('topic', 0.165), ('content', 0.161), ('word', 0.142), ('tasa', 0.142), ('lda', 0.135), ('topics', 0.129), ('dependencies', 0.114), ('classes', 0.109), ('tagging', 0.108), ('corpus', 0.086), ('syntax', 0.085), ('dirichlet', 0.081), ('zi', 0.078), ('language', 0.073), ('documents', 0.073), ('document', 0.072), ('wi', 0.059), ('tags', 0.057), ('nci', 0.057), ('sentence', 0.056), ('rand', 0.052), ('assignments', 0.05), ('nwi', 0.049), ('nwii', 0.049), ('nzi', 0.049), ('latent', 0.048), ('gave', 0.047), ('generative', 0.046), ('roles', 0.046), ('class', 0.045), ('transitions', 0.042), ('distributional', 0.041), ('play', 0.041), ('nouns', 0.039), ('draw', 0.037), ('assessed', 0.036), ('land', 0.034), ('alcohol', 0.033), ('farmers', 0.033), ('lens', 0.033), ('marginal', 0.033), ('blei', 0.033), ('models', 0.031), ('sensitive', 0.031), ('assigned', 0.03), ('distributions', 0.03), ('excluding', 0.03), ('kinds', 0.029), ('adjusted', 0.029), ('nips', 0.029), ('sentences', 0.029), ('story', 0.028), ('landauer', 0.028), ('emit', 0.028), ('farm', 0.028), ('verb', 0.028), ('posterior', 0.028), ('vary', 0.028), ('assignment', 0.027), ('model', 0.027), ('papers', 0.027), ('noun', 0.026), ('markers', 0.026), ('blood', 0.026), ('punctuation', 0.026), ('contexts', 0.025), ('semantics', 0.025), ('across', 0.025), ('asterisk', 0.024), ('water', 0.024), ('phrases', 0.024), ('doubly', 0.024), ('exclude', 0.024), ('game', 0.024), ('body', 0.024), ('drawn', 0.023), ('chain', 0.023), ('identifying', 0.022), ('dependency', 0.022), ('carlo', 0.022), ('monte', 0.022), ('team', 0.022), ('mcmc', 0.022), ('produced', 0.021), ('graph', 0.021), ('gure', 0.021), ('index', 0.021), ('irvine', 0.021), ('grif', 0.021), ('ths', 0.021), ('architecture', 0.02), ('hyperparameter', 0.02), ('government', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

2 0.1556243 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

3 0.14130652 145 nips-2004-Parametric Embedding for Class Visualization

Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1

4 0.12033612 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1

5 0.11325435 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

6 0.10926453 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

7 0.10558408 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

8 0.10450276 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

9 0.10284807 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

10 0.097905397 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

11 0.089073107 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

12 0.073040053 127 nips-2004-Neighbourhood Components Analysis

13 0.066098332 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.064810082 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

15 0.062713325 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

16 0.061678424 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

17 0.060596213 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

18 0.060506098 205 nips-2004-Who's In the Picture

19 0.057811942 44 nips-2004-Conditional Random Fields for Object Recognition

20 0.057741996 54 nips-2004-Distributed Information Regularization on Graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.171), (1, 0.023), (2, -0.026), (3, -0.106), (4, 0.036), (5, 0.142), (6, -0.109), (7, 0.184), (8, 0.154), (9, 0.054), (10, 0.074), (11, -0.024), (12, -0.232), (13, 0.113), (14, 0.083), (15, -0.088), (16, 0.128), (17, -0.041), (18, 0.077), (19, 0.059), (20, -0.053), (21, -0.193), (22, 0.012), (23, -0.11), (24, -0.054), (25, -0.111), (26, -0.03), (27, -0.05), (28, -0.04), (29, 0.079), (30, -0.017), (31, 0.007), (32, 0.016), (33, -0.057), (34, 0.137), (35, 0.012), (36, -0.019), (37, -0.024), (38, -0.125), (39, 0.047), (40, 0.084), (41, 0.02), (42, 0.028), (43, 0.086), (44, -0.0), (45, -0.027), (46, 0.042), (47, 0.041), (48, -0.068), (49, -0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97307241 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

2 0.61794513 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

3 0.59315515 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

Abstract: We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1

4 0.5576334 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

5 0.53918481 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1

6 0.53054506 145 nips-2004-Parametric Embedding for Class Visualization

7 0.52058387 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

8 0.51939118 200 nips-2004-Using Random Forests in the Structured Language Model

9 0.46604103 205 nips-2004-Who's In the Picture

10 0.43318608 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

11 0.43305597 193 nips-2004-Theories of Access Consciousness

12 0.40739301 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

13 0.39898375 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

14 0.34101072 127 nips-2004-Neighbourhood Components Analysis

15 0.33125517 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

16 0.33029872 74 nips-2004-Harmonising Chorales by Probabilistic Inference

17 0.33026811 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

18 0.30715099 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

19 0.30123571 122 nips-2004-Modelling Uncertainty in the Game of Go

20 0.28848076 54 nips-2004-Distributed Information Regularization on Graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.327), (13, 0.054), (15, 0.097), (26, 0.035), (31, 0.061), (33, 0.171), (35, 0.034), (39, 0.02), (50, 0.042), (51, 0.018), (56, 0.011), (86, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81918031 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

2 0.79088748 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

3 0.71518558 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

4 0.66451985 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data.
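
As a hedged illustration of the kind of model this abstract describes, the sketch below draws filter responses from a toy mixture GSM (each response multiplies its Gaussian variable by one of a few shared Rayleigh mixers) and shows a single Gibbs-style resampling of one assignment. The dimensions, the Rayleigh prior settings, and `sigma` are placeholders rather than the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy mixture GSM: each of n filter responses multiplies its own Gaussian
# variable by one of m shared Rayleigh mixer variables, chosen per patch.
n, m, n_patches = 8, 2, 500
assoc = rng.dirichlet(np.ones(m), n)            # hypothetical association probabilities p_ij

patches = np.zeros((n_patches, n))
for t in range(n_patches):
    v = rng.rayleigh(scale=1.0, size=m)          # mixer draws for this patch
    s = np.array([rng.choice(m, p=assoc[i]) for i in range(n)])  # assignments
    g = rng.normal(size=n)                       # Gaussian variables
    patches[t] = v[s] * g                        # l_i = v_{s_i} * g_i

def resample_assignment(l_i, v, p_i, sigma=1.0):
    """One Gibbs-style update: the posterior over mixers is the prior association
    times the Gaussian likelihood of l_i under each candidate scale v_j * sigma."""
    scale = v * sigma
    post = p_i * np.exp(-l_i ** 2 / (2 * scale ** 2)) / scale
    post /= post.sum()
    return rng.choice(m, p=post)

print(resample_assignment(patches[0, 0], rng.rayleigh(size=m), assoc[0]))
```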

5 0.63146132 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1
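
The weighted-median aggregation named in this abstract is easy to sketch directly; the helper below is illustrative only, and the base regressors and weights in the usage line are invented.

```python
import numpy as np

def weighted_median_predict(x, base_regressors, alphas):
    """Illustrative only: combine base regressors by their weighted median,
    the aggregation rule the abstract attributes to median boosting."""
    preds = np.array([h(x) for h in base_regressors])
    order = np.argsort(preds)
    w = np.asarray(alphas, dtype=float)[order]
    cum = np.cumsum(w)
    k = int(np.searchsorted(cum, 0.5 * w.sum()))   # first index reaching half the total weight
    return preds[order][k]

# Toy usage with made-up base regressors and confidence weights.
hs = [lambda x, a=a: a * x for a in (0.8, 1.0, 1.3)]
print(weighted_median_predict(2.0, hs, alphas=[0.2, 0.5, 0.3]))
```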

6 0.58514643 145 nips-2004-Parametric Embedding for Class Visualization

7 0.56819624 200 nips-2004-Using Random Forests in the Structured Language Model

8 0.56397551 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

9 0.56027222 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

10 0.54915011 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

11 0.54307914 69 nips-2004-Fast Rates to Bayes for Kernel Machines

12 0.54274064 77 nips-2004-Hierarchical Clustering of a Mixture Model

13 0.54212284 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

14 0.54036081 44 nips-2004-Conditional Random Fields for Object Recognition

15 0.54032946 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

16 0.54020864 82 nips-2004-Incremental Algorithms for Hierarchical Classification

17 0.5398708 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.53986531 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

19 0.53940278 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

20 0.53888506 166 nips-2004-Semi-supervised Learning via Gaussian Processes