nips nips2004 nips2004-78 knowledge-graph by maker-knowledge-mining

78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling


Source: pdf

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Statistical language models estimate the probability of a word occurring in a given context. [sent-7, score-0.387]

2 The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. [sent-8, score-0.327]

3 In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. [sent-11, score-0.659]

4 The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. [sent-12, score-0.911]

5 While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. [sent-13, score-0.331]

6 For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. [sent-15, score-0.925]

7 We also discuss extensions of our approach to longer multiword contexts. [sent-16, score-0.171]

8 1 Introduction: Statistical language models are essential components of natural language systems for human-computer interaction. [sent-17, score-0.394]

9 These models estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one or more words that have already been observed. [sent-19, score-0.385]

10 The simplest, most studied case is that of n-gram language modeling, where each word is predicted from the preceding n−1 words. [sent-20, score-0.355]

11 The main problem in building these models is that the vast majority of word combinations occur very infrequently, making it difficult to estimate accurate probabilities of words in most contexts. [sent-21, score-0.369]

12 Researchers in statistical language modeling have developed a variety of smoothing techniques to alleviate this problem of data sparseness. [sent-22, score-0.258]

13 One expects the probabilities of rare or unseen events in one context to be related to their probabilities in statistically similar contexts. [sent-25, score-0.159]

14 The aggregate Markov model (AMM) of Saul and Pereira [13] (also discussed by Hofmann and Puzicha [10] as a special case of the aspect model) factors the conditional probability table of a word given its context by a latent variable representing context “classes”. [sent-28, score-0.359]
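
In the notation used later in the paper (w for the context word, w′ for the predicted word, z for the latent class), this factorization can be written as Pr(w′|w) = Σz Pr(z|w) · Pr(w′|z), so the V × V conditional probability table is replaced by the two much smaller tables Pr(z|w) and Pr(w′|z).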

15 However, this latent variable approach is difficult to generalize to multiword contexts, as the size of the conditional probability table for class given context grows exponentially with the context length. [sent-29, score-0.329]

16 The neural probabilistic language model (NPLM) of Bengio et al. [sent-30, score-0.181]

17 The low-dimensional vectors and the parameters of the network are trained simultaneously to minimize the perplexity of the language model. [sent-34, score-0.304]

18 This model has no difficulty encoding multiword contexts, but its training and application are very costly because of the need to compute a separate normalization for the conditional probabilities associated to each context. [sent-35, score-0.248]

19 In this paper, we introduce and evaluate a statistical language model that combines the advantages of the AMM and NPLM. [sent-36, score-0.221]

20 Like the NPLM, it can be used for multiword contexts, and like the AMM it avoids per-context normalization. [sent-37, score-0.171]

21 In our model, contexts are represented as low-dimensional real vectors initialized by unsupervised algorithms for dimensionality reduction [14]. [sent-38, score-0.464]

22 The probabilities of words given contexts are represented by a hierarchical mixture of experts (HME) [12], where each expert is a multinomial distribution over predicted words. [sent-39, score-0.641]

23 Proper initialization of the distributed representations is crucial; in particular, we find that initializations from the results of linear and nonlinear dimensionality reduction algorithms lead to better models (with significantly lower test perplexities) than random initialization. [sent-41, score-0.496]

24 We present results on a large-scale bigram modeling task, showing that our model also leads to significant improvements over comparable AMMs. [sent-43, score-0.243]

25 2 Distributed representations of words: Natural language has complex, multidimensional semantics. [sent-44, score-0.385]

26 The bottom right sentence is syntactically valid but semantically meaningless. [sent-49, score-0.146]

27 As shown by the table, a two-bit distributed representation of the words “vase” and “window” suffices to express that a vase is both a container and breakable, while a window is breakable but cannot be a container. [sent-50, score-0.363]

28 More generally, we expect low dimensional continuous representations of words to be even more effective at capturing semantic regularities. [sent-51, score-0.237]

29 Distributed representations of words can be derived in several ways. [sent-52, score-0.204]

30 In a given corpus of text, for example, consider the matrix of bigram counts whose element Cij records the number of times that word wj follows word wi . [sent-53, score-0.711]

31 Further, let pij = Cij / Σk Cik denote the conditional frequencies derived from these counts, and let pi denote the V-dimensional frequency vector with elements pij, where V is the vocabulary size. [sent-54, score-0.247]

32 Note that the vectors pi themselves provide a distributed representation of the words wi in the corpus. [sent-55, score-0.385]
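
As a concrete illustration of this construction, the sketch below (our own, not code from the paper) builds the count matrix C and the row-normalized frequency vectors pi with numpy; it assumes the corpus has already been tokenized into lists of integer word ids.

```python
import numpy as np

def bigram_frequency_vectors(sentences, vocab_size):
    """Count matrix C and conditional frequencies pij = Cij / sum_k Cik."""
    C = np.zeros((vocab_size, vocab_size))
    for sent in sentences:
        for wi, wj in zip(sent[:-1], sent[1:]):
            C[wi, wj] += 1.0               # word wj follows word wi
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against contexts that never occur
    P = C / row_sums                       # row i is the V-dimensional frequency vector pi
    return C, P
```

For V = 60000 this dense V × V array is exactly the unwieldy representation the next sentence refers to; in practice one would accumulate the counts in a sparse matrix.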

33 For large vocabularies and training corpora, however, this is an extremely unwieldy representation, tantamount to storing the full matrix of bigram counts. [sent-56, score-0.338]

34 We consider two methods in dimensionality reduction for this problem. [sent-59, score-0.194]

35 1 Linear dimensionality reduction: The simplest form of dimensionality reduction is principal component analysis (PCA). [sent-62, score-0.388]

36 PCA computes a linear projection of the frequency vectors pi into the low dimensional subspace that maximizes their variance. [sent-63, score-0.236]

37 The variance-maximizing subspace of dimensionality d is spanned by the top d eigenvectors of the frequency vector covariance matrix. [sent-64, score-0.148]

38 The effect of PCA can also be understood as a translation and rotation of the frequency vectors pi , followed by a truncation that preserves only their first d elements. [sent-66, score-0.23]
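
A minimal numpy sketch of this procedure (ours; it assumes the frequency vectors fit in memory as the rows of a dense matrix P, and for V = 60000 one would in practice use a truncated SVD rather than form the full covariance):

```python
import numpy as np

def pca_embed(P, d):
    """Project the rows of P onto the d-dimensional subspace of maximal variance."""
    X = P - P.mean(axis=0)                    # translation to zero mean
    cov = X.T @ X / X.shape[0]                # covariance of the frequency vectors
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top_d = eigvecs[:, ::-1][:, :d]           # top d eigenvectors span the subspace (rotation)
    return X @ top_d                          # truncation to the first d coordinates
```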

39 2 Nonlinear dimensionality reduction: Intuitively, we would like to map the vectors pi into a low dimensional space where semantically similar words remain close together and semantically dissimilar words are far apart. [sent-68, score-0.958]

40 This is done by balancing two competing goals: (i) to co-locate semantically similar words, and (ii) to separate semantically dissimilar words. [sent-74, score-0.319]

41 The first goal is achieved by fixing the distances between words with similar frequency vectors to their original values. [sent-75, score-0.292]

42 Thus, we push the words in the vocabulary as far apart as possible subject to the constraint that the distances between semantically similar words do not change. [sent-78, score-0.511]

43 The only freedom in this optimization is the criterion for judging that two words are semantically similar. [sent-79, score-0.297]

44 In practice, we adopt a simple criterion such as k-nearest neighbors in the space of frequency vectors pi and choose k as small as possible so that the resulting neighborhood graph is connected [14]. [sent-80, score-0.203]
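
A small sketch of this neighbor-graph construction (ours, not the authors' code; it assumes scipy is available and that the rows of X are the vectors being embedded):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def smallest_connected_k(X, k_max=20):
    """Smallest k whose symmetrized k-nearest-neighbor graph over the rows of X is connected."""
    n = X.shape[0]
    order = np.argsort(cdist(X, X), axis=1)      # column 0 of each row is the point itself
    for k in range(1, k_max + 1):
        rows = np.repeat(np.arange(n), k)
        cols = order[:, 1:k + 1].ravel()
        A = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
        n_comp, _ = connected_components(A + A.T, directed=False)
        if n_comp == 1:
            return k
    return None                                  # graph still disconnected at k_max
```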

45 The optimization can thus be formulated as the semidefinite programming problem: maximize Σij Dij subject to (i) Dᵀ = D, (ii) −(1/2) HDH ⪰ 0 (positive semidefiniteness), and (iii) Dij = |pi − pj|² for all neighboring vectors pi and pj. [sent-83, score-0.279]

46 Assuming without loss of generality that the vectors xi are centered on the origin, the dot products Gij = xi · xj are related to the pairwise squared distances Dij = |xi − xj|² as stated above. [sent-84, score-0.165]
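
As an illustration, the program can be handed to a generic SDP solver. The sketch below is ours (not the authors' implementation) and assumes the cvxpy package; it uses the equivalent formulation over the centered Gram matrix G, since with Dij = Gii + Gjj − 2Gij and Σij Gij = 0, maximizing Σij Dij is the same as maximizing the trace of G. Like the solver used in the paper, it only scales to a few thousand points.

```python
import numpy as np
import cvxpy as cp

def sde_embed(P, neighbor_pairs, d):
    """Maximum-variance embedding of the rows of P that preserves neighbor distances."""
    n = P.shape[0]
    G = cp.Variable((n, n), PSD=True)                    # Gram matrix of the embedded points
    constraints = [cp.sum(G) == 0]                       # center the embedding on the origin
    for i, j in neighbor_pairs:                          # fix distances between neighbors
        d_sq = float(np.sum((P[i] - P[j]) ** 2))
        constraints.append(G[i, i] + G[j, j] - 2 * G[i, j] == d_sq)
    cp.Problem(cp.Maximize(cp.trace(G)), constraints).solve()
    # Read off coordinates from the top d eigenvectors of the optimal Gram matrix.
    eigvals, eigvecs = np.linalg.eigh(G.value)
    eigvals, eigvecs = eigvals[::-1][:d], eigvecs[:, ::-1][:, :d]
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
```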

47 Figure 1: Eigenvalues from principal component analysis (PCA) and semidefinite embedding (SDE), applied to bigram distributions of the 2000 most frequently occurring words in the corpus. [sent-91, score-0.448]

48 Thus, one can compare the eigenvalue spectra from this method and PCA to ascertain if the variance of the nonlinear embedding is concentrated in fewer dimensions. [sent-98, score-0.148]

49 We refer to this method of nonlinear dimensionality reduction as semidefinite embedding (SDE). [sent-99, score-0.342]

50 The figure shows that the nonlinear embedding by SDE concentrates its variance in many fewer dimensions than the linear embedding by PCA. [sent-102, score-0.267]

51 Figure 2 shows that even the first two dimensions of the nonlinear embedding preserve the neighboring relationships of many words that are semantically similar. [sent-104, score-0.456]

52 Note that semantically meaningful neighborhoods are preserved, despite the massive dimensionality reduction from V = 60000 to d = 2. [sent-107, score-0.34]

53 For the results in this paper—on the corpus described in section 4—we solved the semidefinite program in this section to embed the 2000 most frequent words in the corpus, then used a greedy incremental solver to embed the remaining 58000 words in the vocabulary. [sent-109, score-0.374]

54 Though not the main point of this paper, the nonlinear embedding of V = 60000 words is to our knowledge one of the largest applications of recently developed spectral methods for nonlinear dimensionality reduction [9, 14]. [sent-111, score-0.533]

55 3 Hierarchical mixture of experts: The model we use to compute the probability that word w′ follows word w is known as a hierarchical mixture of experts (HME) [12]. [sent-112, score-0.64]

56 HMEs are fully probabilistic models, making them ideally suited to the task of statistical language modeling. [sent-113, score-0.221]

57 HMEs are tree-structured mixture models in which the mixture components are “experts” that lie at the leaves of the tree. [sent-116, score-0.192]

58 The interior nodes of the tree perform binary logistic regressions on the input vector to the HME, and the mixing weight for a leaf is computed by multiplying the probabilities of each branch (left or right) along the path to that leaf. [sent-117, score-0.348]

59 In our model, the input vector x is a function of the context word w, and the expert at each leaf specifies a multinomial distribution over the predicted word w′. [sent-118, score-0.681]

60 Letting π denote a path through the tree from root to leaf, the HME computes the probability of a word w′ conditioned on a context word w as Pr(w′|w) = Σπ Pr(π|x(w)) · Pr(w′|π). [sent-119, score-0.437]
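
The following sketch (ours; gate bias terms are omitted for brevity) spells out this computation for a complete binary tree whose internal nodes are stored in breadth-first order:

```python
import numpy as np

def hme_predict(x, gate_weights, leaf_probs):
    """Pr(w'|w) for every candidate next word w', given the input vector x = x(w).

    gate_weights : (num_internal_nodes, dim) logistic-regression weights, breadth-first order.
    leaf_probs   : (num_leaves, vocab_size) multinomial experts, one row per leaf.
    """
    num_leaves = leaf_probs.shape[0]
    depth = int(np.log2(num_leaves))
    p_right = 1.0 / (1.0 + np.exp(-gate_weights @ x))   # probability of branching right at each gate
    path_prob = np.ones(num_leaves)
    for leaf in range(num_leaves):
        node = 0
        for bit in format(leaf, "b").zfill(depth):       # bits encode the root-to-leaf path
            path_prob[leaf] *= p_right[node] if bit == "1" else 1.0 - p_right[node]
            node = 2 * node + 1 + int(bit)               # child index in breadth-first order
    return path_prob @ leaf_probs                        # sum over paths of Pr(path|x(w)) Pr(w'|path)
```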

61 The E-step involves computing the posterior probability over paths Pr(π|w, w′) for each observed bigram in the training corpus. [sent-121, score-0.281]
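
Given the prior path probabilities Pr(π|x(w)) from a traversal like the one above, this posterior is just the joint over paths, renormalized. A sketch of ours (path_prior is a hypothetical helper output, not anything from the paper):

```python
import numpy as np

def path_posterior(path_prior, leaf_probs, w_next):
    """Posterior Pr(path | w, w') over leaves for one observed bigram (w, w').

    path_prior : (num_leaves,) prior path probabilities Pr(path | x(w)).
    leaf_probs : (num_leaves, vocab_size) multinomial experts.
    w_next     : integer index of the observed next word w'.
    """
    joint = path_prior * leaf_probs[:, w_next]   # Pr(path | x(w)) * Pr(w' | path)
    return joint / joint.sum()                   # normalize over paths
```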

62 In the M-step, we must maximize the EM auxiliary function with respect to the parameters of the logistic regressions and multinomial leaves as well as the input vectors x(w). [sent-123, score-0.522]

63 The logistic regressions in the tree decouple and can be optimized separately by Newton’s method, while the multinomial leaves have a simple closed-form update. [sent-124, score-0.441]

64 Though the input vectors are shared across all logistic regressions in the tree, we can compute their gradients and hessians in one recursive pass and update them by Newton’s method as well. [sent-125, score-0.296]

65 The EM algorithm for HMEs converges to a local maximum of the log-likelihood, or equivalently, a local minimum of the training perplexity Ptrain = [ ∏ij Pr(wj|wi)^Cij ]^(−1/C) (2), where C = Σij Cij is the total number of observed bigrams in the training corpus. [sent-126, score-0.163]
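
Computed in log space, equation (2) is just the exponentiated average negative log-likelihood per observed bigram; a minimal sketch (ours):

```python
import numpy as np

def training_perplexity(C, log_prob):
    """Equation (2): Ptrain = [ prod_ij Pr(wj|wi)^Cij ]^(-1/C), with C = sum_ij Cij.

    C        : (V, V) matrix of bigram counts.
    log_prob : (V, V) matrix of model log-probabilities log Pr(wj|wi).
    """
    total = C.sum()
    return np.exp(-np.sum(C * log_prob) / total)
```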

66 Words are mapped to input vectors; probabilities of next words are computed by summing over paths through the tree. [sent-128, score-0.204]

67 The mapping from words to input vectors is initialized by dimensionality reduction of bigram counts. [sent-129, score-0.758]

68 Table 1: Test perplexities (Ptest) of HMEs with different input dimensionalities and initializations. [sent-130, score-0.225]

    d \ init   random   PCA   SDE
    4          468      406   385
    8          407      364   361
    12         378      362   360
    16         373      351   355

69 Table 2: Test perplexities (Ptest) of HMEs with different input dimensionalities and numbers of leaves. [sent-131, score-0.225]

    d \ m   8     16    32    64
    4       435   385   350   336
    8       429   361   328   308
    12      426   360   320   298
    16      428   355   317   294

70 section, initialization of the input vectors by PCA or SDE leads to significantly better models than random initialization. [sent-132, score-0.2]

71 We initialized the logistic regressions in the HME to split the input vectors recursively along their dimensions of greatest variance. [sent-133, score-0.416]

72 The multinomial distributions at leaf nodes were initialized by uniform distributions. [sent-134, score-0.289]

73 For an HME with m multinomial leaves and d-dimensional input vectors, the number of parameters scales as O(V d + V m + dm). [sent-135, score-0.267]

74 The resulting model can therefore be much more compact than a full bigram model over V words. [sent-136, score-0.243]
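
As a sanity check on this scaling (the particular d and m below are one configuration from Table 2, and the resulting count matches the 2400-thousand-parameter column of Table 3):

```python
V, d, m = 60000, 8, 32                  # vocabulary size, input dimension, number of leaves
hme_params = V * d + V * m + d * m      # input vectors + multinomial leaves + gate weights
full_bigram = V * V                     # dense conditional probability table
print(hme_params, full_bigram)          # 2,400,256 versus 3,600,000,000
```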

75 Our training set contained 78 million words from a 60,000 word vocabulary. [sent-138, score-0.411]

76 The test set, untruncated, had 13 million words resulting in 2. [sent-142, score-0.199]

77 1 Empirical evaluation: Table 1 reports the test perplexities of several HMEs whose input vectors were initialized in different ways. [sent-145, score-0.35]

78 Here SDE outperformed PCA, most likely because the first few eigenvectors of SDE capture more variance in the bigram counts than those of PCA (see Figure 1). [sent-151, score-0.305]

79 Table 2 reports the test perplexities of several HMEs initialized by SDE, but with varying input dimensionality (d) and numbers of leaves (m). [sent-152, score-0.457]

80 Perplexity decreases with increasing tree depth and input dimensionality, but increasing the dimensionality beyond d = 8 does not appear to give much gain. [sent-153, score-0.184]

81 2 Comparison to a class-based bigram model: We obtained baseline results from an AMM [13] trained on the same corpus. [sent-155, score-0.243]

82 Table 3: Test perplexities of HMEs and AMMs with roughly equal parameter counts. [sent-161, score-0.153]

    parameters (*1000)   960   1440   2400   4320
    Ptest (AMM)          456   414    353    310
    Ptest (HME)          429   361    328    308
    improvement          6%    13%    7%     1%

83 Table 3 compares the test perplexities of several HMEs and AMMs with similar numbers of parameters. [sent-162, score-0.153]

84 The performance is nearly equal for the larger models, which may be explained by the fact that most of the parameters of the larger HMEs come from the multinomial leaves, not from the distributed inputs. [sent-165, score-0.224]

85 3 Comparison to NPLM: The most successful large-scale application of distributed representations to language modeling is the NPLM of Bengio et al. [sent-167, score-0.331]

86 The NPLM uses softmax to compute the probability of a word w given its context, thus requiring a separate normalization for each context. [sent-171, score-0.203]

87 5× larger vocabulary and a larger training corpus than were used for the NPLM, and still complete training our largest models in a matter of hours. [sent-177, score-0.228]

88 Note that the numbers in Table 4 do not include the time to compute the initial distributed representations by PCA (30 minutes) or SDE (3 days), but these computations do not need to be repeated for each trained model. [sent-178, score-0.15]

89 We have not done these experiments yet, but our model extends naturally to multiword contexts, as we explain in the next section. [sent-184, score-0.171]

90 5 Discussion: In this paper, we have presented a statistical language model that exploits hierarchical distributed representations of word contexts. [sent-185, score-0.613]

91 The model shares the advantages of the NPLM [2], but differs in its use of dimensionality reduction for effective parameter initialization and in the significant speedup provided by the HME architecture. [sent-186, score-0.194]

92 We have also demonstrated that our models consistently match or outperform a baseline class-based bigram model. [sent-188, score-0.275]

93 The class-based bigram model is nearly as effective as the HME, but it has the major drawback that there is no straightforward way to extend it to multiword contexts without exploding its parameter count. [sent-189, score-0.528]

94 We can form an input vector for a multiword history (w1 , w2 ) simply by concatenating the vectors x(w1 ) and x(w2 ). [sent-191, score-0.286]

95 Initialization from dimensionality reduction is also straightforward: we can compute the low dimensional representation for each word separately. [sent-193, score-0.401]
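
A sketch of this construction (ours), where X holds one low-dimensional row per word, computed by PCA or SDE as before:

```python
import numpy as np

def multiword_input(X, w1, w2):
    """HME input vector for the two-word history (w1, w2): a 2d-dimensional concatenation."""
    return np.concatenate([X[w1], X[w2]])
```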

96 We are actively pursuing these ideas to train models with hierarchical distributed representations of multiword contexts. [sent-194, score-0.421]

97 An empirical study of smoothing techniques for language modeling. [sent-253, score-0.218]

98 A kernel view of the dimensionality reduction of manifolds. [sent-265, score-0.194]

99 Aggregate and mixed-order Markov models for statistical language processing. [sent-289, score-0.253]

100 A study of smoothing methods for language models applied to information retrieval. [sent-304, score-0.25]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sde', 0.35), ('hmes', 0.307), ('nplm', 0.307), ('hme', 0.286), ('bigram', 0.243), ('language', 0.181), ('word', 0.174), ('multiword', 0.171), ('pca', 0.156), ('multinomial', 0.154), ('perplexities', 0.153), ('semantically', 0.146), ('words', 0.124), ('contexts', 0.114), ('amm', 0.109), ('regressions', 0.109), ('dimensionality', 0.109), ('semide', 0.102), ('dij', 0.092), ('pi', 0.09), ('ptest', 0.088), ('vase', 0.088), ('reduction', 0.085), ('initialized', 0.082), ('embedding', 0.081), ('pr', 0.08), ('representations', 0.08), ('million', 0.075), ('bengio', 0.075), ('vectors', 0.074), ('logistic', 0.072), ('leaves', 0.072), ('distributed', 0.07), ('weinberger', 0.07), ('experts', 0.068), ('hierarchical', 0.068), ('nonlinear', 0.067), ('amms', 0.066), ('cij', 0.065), ('vocabulary', 0.062), ('corpus', 0.058), ('pietra', 0.057), ('vocabularies', 0.057), ('context', 0.055), ('distances', 0.055), ('leaf', 0.053), ('initialization', 0.053), ('perplexity', 0.049), ('table', 0.048), ('pj', 0.044), ('breakable', 0.044), ('ducharme', 0.044), ('hdh', 0.044), ('mixture', 0.044), ('linguistics', 0.042), ('input', 0.041), ('frequent', 0.04), ('statistical', 0.04), ('frequency', 0.039), ('probabilities', 0.039), ('training', 0.038), ('dimensions', 0.038), ('bigrams', 0.038), ('smoothing', 0.037), ('window', 0.037), ('squared', 0.036), ('vincent', 0.035), ('counts', 0.035), ('nite', 0.034), ('tree', 0.034), ('em', 0.033), ('dimensional', 0.033), ('banff', 0.032), ('multilayer', 0.032), ('models', 0.032), ('dimensionalities', 0.031), ('regularities', 0.031), ('expert', 0.03), ('eigenvalues', 0.03), ('twenty', 0.029), ('softmax', 0.029), ('euclidean', 0.029), ('pij', 0.028), ('meeting', 0.028), ('solver', 0.028), ('hofmann', 0.028), ('saul', 0.027), ('translation', 0.027), ('captured', 0.027), ('newton', 0.027), ('dissimilar', 0.027), ('aggregate', 0.027), ('fed', 0.027), ('corpora', 0.027), ('pereira', 0.027), ('outperformed', 0.027), ('optimization', 0.027), ('wi', 0.027), ('events', 0.026), ('pk', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

2 0.15888526 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

3 0.13149762 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1

4 0.12596481 163 nips-2004-Semi-parametric Exponential Family PCA

Author: Sajama Sajama, Alon Orlitsky

Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1

5 0.11325435 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

6 0.10118555 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

7 0.087086141 145 nips-2004-Parametric Embedding for Class Visualization

8 0.084393397 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

9 0.083874255 205 nips-2004-Who's In the Picture

10 0.079015292 125 nips-2004-Multiple Relational Embedding

11 0.075052857 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

12 0.070575878 127 nips-2004-Neighbourhood Components Analysis

13 0.066996619 131 nips-2004-Non-Local Manifold Tangent Learning

14 0.064440407 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

15 0.060586449 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

16 0.058143854 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

17 0.057392698 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

18 0.056828942 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

19 0.056705277 82 nips-2004-Incremental Algorithms for Hierarchical Classification

20 0.056364618 164 nips-2004-Semi-supervised Learning by Entropy Minimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.192), (1, 0.056), (2, -0.021), (3, -0.081), (4, -0.026), (5, 0.07), (6, -0.089), (7, 0.125), (8, 0.132), (9, 0.078), (10, 0.121), (11, -0.119), (12, -0.2), (13, 0.041), (14, -0.027), (15, -0.004), (16, 0.138), (17, -0.017), (18, -0.005), (19, 0.046), (20, -0.043), (21, -0.073), (22, -0.063), (23, -0.17), (24, -0.121), (25, 0.012), (26, 0.007), (27, 0.11), (28, 0.018), (29, 0.129), (30, -0.002), (31, 0.042), (32, 0.041), (33, 0.067), (34, -0.037), (35, 0.089), (36, -0.034), (37, 0.003), (38, 0.086), (39, 0.074), (40, -0.05), (41, -0.103), (42, -0.028), (43, -0.044), (44, 0.001), (45, 0.033), (46, 0.102), (47, -0.078), (48, -0.071), (49, -0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95207965 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

2 0.66823095 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

3 0.62390649 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

4 0.54514682 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1

5 0.52583063 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

Abstract: We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1

6 0.52118284 205 nips-2004-Who's In the Picture

7 0.46673241 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

8 0.45602247 163 nips-2004-Semi-parametric Exponential Family PCA

9 0.41407299 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

10 0.40816557 127 nips-2004-Neighbourhood Components Analysis

11 0.40550962 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

12 0.39096481 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

13 0.38683772 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

14 0.37905172 145 nips-2004-Parametric Embedding for Class Visualization

15 0.37660971 125 nips-2004-Multiple Relational Embedding

16 0.34554565 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

17 0.3300018 193 nips-2004-Theories of Access Consciousness

18 0.30849722 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

19 0.30735171 158 nips-2004-Sampling Methods for Unsupervised Learning

20 0.30721557 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.093), (15, 0.116), (17, 0.013), (26, 0.06), (31, 0.067), (33, 0.135), (35, 0.031), (39, 0.027), (50, 0.04), (51, 0.017), (56, 0.015), (62, 0.289), (87, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83016145 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account

Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky

Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.

same-paper 2 0.76851916 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

3 0.59305876 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

4 0.59195292 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

5 0.59179759 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.58727127 28 nips-2004-Bayesian inference in spiking neurons

7 0.58611798 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

8 0.58399433 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

9 0.58259505 70 nips-2004-Following Curved Regularized Optimization Solution Paths

10 0.58201814 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

11 0.58186167 178 nips-2004-Support Vector Classification with Input Data Uncertainty

12 0.58169484 163 nips-2004-Semi-parametric Exponential Family PCA

13 0.58167374 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

14 0.58120304 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

15 0.58047402 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

16 0.57990199 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

17 0.57881874 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

18 0.57795942 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

19 0.57747334 102 nips-2004-Learning first-order Markov models for control

20 0.57737225 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images