emnlp emnlp2010 emnlp2010-45 knowledge-graph by maker-knowledge-mining

45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors


Source: pdf

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. [sent-9, score-0.286]

2 In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. [sent-10, score-0.431]

3 We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. [sent-11, score-0.437]

4 As expected, experimental results show that performance declines as word error rates increase. [sent-12, score-0.172]

5 One example of this type of analysis is document clustering, in which documents are grouped into clusters by topic. [sent-17, score-0.133]

6 Another type of topic analysis attempts to discover finer-grained topics—labeling individual words in a document as belonging to a particular topic. [sent-18, score-0.218]

7 In addition, researchers are having increasing levels of success in digitizing hand-written manuscripts (Bunke, 2003), though error rates remain much higher than for OCR. [sent-22, score-0.173]

8 Finding good estimates for the parameters of models such as the mixture of multinomials document model (Walker and Ringger, 2008) and the Latent Dirichlet Allocation (LDA) model (Blei et al. [sent-25, score-0.163]

9 It is obvious, therefore, that model quality must suffer, especially since unsupervised methods are typically much more sensitive to noise than supervised methods. [sent-28, score-0.176]

10 Unsupervised models, in contrast, have no grounding in labels to prevent them from confusing patterns that emerge by chance in the noise with the “true” patterns of potential interest. [sent-32, score-0.174]

11 Though we expect model quality to decrease, it is not well understood how sensitive these models are to OCR errors, or how quality deteriorates as the level of OCR noise increases. [sent-34, score-0.207]

12 In this work we show how the performance of unsupervised topic modeling algorithms degrades as character-level noise is introduced. [sent-35, score-0.271]

13 We demonstrate the effect using both artificially corrupted data and an existing real-world OCR corpus. [sent-36, score-0.184]

14 The results are promising, especially in the case of relatively low word error rates (e. [sent-37, score-0.145]

15 Though model quality declines as errors increase, simple feature selection techniques enable the learning of relatively high quality models even as word error rates approach 50%. [sent-40, score-0.309]

16 This result is particularly interesting in that even humans find it difficult to make sense of documents with error rates of that magnitude (Munteanu et al. [sent-41, score-0.186]

17 Because of the difficulties in evaluating topic models, even on clean data, these results should not be interpreted as definitive answers, but they do offer insight into prominent trends. [sent-43, score-0.196]

18 It is our hope that this work will lead to an increase in the usefulness of collections of OCRed texts, as document clustering and topic modeling expose useful patterns to historians and other interested parties. [sent-45, score-0.327]

19 After an overview of related work in Section 2, Section 3 introduces the data used in our experiments, including an explanation of how the synthetic data were created and of some of their properties. [sent-47, score-0.185]

20 Most of this previous work ignores the presence of OCR errors or attempts to remove corrupted tokens with special pre-processing such as stop-word removal and frequency cutoffs. [sent-51, score-0.264]

21 Also, there are at least two instances of using topic modeling to improve the results of an OCR algorithm (Wick et al. [sent-52, score-0.169]

22 Similar evaluations to ours have been conducted to assess the effect of OCR errors on supervised document classification (Taghva et al. [sent-55, score-0.142]

23 3 Data: We conducted experiments on synthetic and real OCR data. [sent-61, score-0.185]

24 In addition, the curator of the collection has created a “gold standard” transcription, from which it is possible to obtain accurate measures of average document word error rates (WER) for each engine, which are: 19. [sent-69, score-0.237]

25 All of the documents in the Eisenhower corpus discuss the fairly narrow topic of troop movements and battle developments taking place at the end of World War II. [sent-74, score-0.213]

26 In an attempt to generalize our results to larger and more diverse data, we also ran experiments using synthetic OCR data. [sent-76, score-0.185]

27 This synthetic data was created by corrupting “clean” datasets, adding character-level noise. [sent-77, score-0.185]

28 The synthetic data was created by building a noise model based on mistakes made by the worst performing OCR engine on the Eisenhower dataset, Tesseract. [sent-78, score-0.373]

29 To construct the noise model, a character-level alignment between the human transcribed Eisenhower documents and the OCR output was first computed. [sent-79, score-0.186]

30 To parameterize the amount of noise being generated, the Md matrix was interpolated with an identity matrix I using a parameter γ so that the final interpolated parameters Mi were calculated with the formula Mi = γMd + (1 − γ)I. [sent-82, score-0.145]
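
To make the noise-model construction concrete, here is a minimal sketch, not the authors' code: it estimates a character confusion matrix Md from aligned transcript/OCR character pairs and interpolates it toward the identity as Mi = γMd + (1 − γ)I. The alphabet, the add-one smoothing, and all names are our assumptions.

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz .,")  # assumed character set
IDX = {c: i for i, c in enumerate(ALPHABET)}

def estimate_confusion(aligned_pairs):
    """aligned_pairs: iterable of (clean_char, ocr_char) from the alignment."""
    counts = np.ones((len(ALPHABET), len(ALPHABET)))  # add-one smoothing (assumed)
    for clean, ocr in aligned_pairs:
        if clean in IDX and ocr in IDX:
            counts[IDX[clean], IDX[ocr]] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row i: P(ocr | clean=i)

def interpolate(Md, gamma):
    """Mi = gamma * Md + (1 - gamma) * I, as in the formula above."""
    return gamma * Md + (1.0 - gamma) * np.eye(Md.shape[0])
```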

31 When γ = 1.0, Mi = Md, and we would expect to see characters corrupted at the same rate as in the output of the OCR engine. [sent-86, score-0.208]

32 Segmentation errors can still occur in the learning stage, however, as the noise model sometimes replaced alphabet characters with punctuation characters, which were treated as delimiters by our tokenizer. [sent-90, score-0.219]

33 Each of these datasets was corrupted at values γ = i ∗ 0. [sent-97, score-0.219]

34 At this point, the word error rate of the corrupted data was near 50% and, since this was approximately the WER observed for the worst OCR engine on the real-world data, we chose to stop there. [sent-99, score-0.275]

35 Here is an example sentence corrupted at two γ values: γ = 0. [sent-101, score-0.184]
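
A hedged sketch of how such corrupted renderings could be sampled, reusing ALPHABET, IDX, and interpolate from the noise-model sketch above: each in-alphabet character is independently replaced by a draw from its row of Mi, mirroring the character-level independence of the model. The helper names and the example sentence are our inventions.

```python
import numpy as np

def corrupt(text, Mi, rng):
    """Sample a corrupted rendering of `text` under the interpolated matrix Mi.
    ALPHABET and IDX are assumed defined as in the sketch above."""
    out = []
    for ch in text:
        if ch in IDX:
            out.append(ALPHABET[rng.choice(len(ALPHABET), p=Mi[IDX[ch]])])
        else:
            out.append(ch)  # characters outside the alphabet pass through
    return "".join(out)

# e.g.: corrupt("troop movements at dawn", interpolate(Md, 0.5),
#               np.random.default_rng(0))
```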

36 For an example of how noise and pre-processing techniques affect these counts see Section 4. [sent-109, score-0.145]

37 It is interesting to note that the word error rates produced by the noise model appear to be significantly higher than first expected. [sent-111, score-0.29]

38 First, the vocabulary of the Eisenhower dataset does not match well with that of any of the source datasets from which the synthetic data were generated. [sent-114, score-0.245]

39 This means that the word and character distributions are different and so the error rates will be as well. [sent-115, score-0.195]

40 This is because most sources of noise do not affect document images uniformly. [sent-117, score-0.237]

41 Furthermore, because content-bearing words tend to be relatively rare, language models are poorer for them than for frequent function words, meaning that the words most correlated with semantics are also the most likely to be corrupted by an OCR engine. [sent-122, score-0.207]

42 Since “c” has a high rate of confusion with “e”, we would expect at least some instances of “the” to be corrupted to “thc” by the error model. [sent-125, score-0.275]

43 So, the noise model converts “the” to “thc” roughly 0. [sent-128, score-0.145]
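
The elided probability aside, the arithmetic behind such a conversion follows directly from character-level independence; the numbers below are invented purely for illustration.

```python
# Hypothetical illustration (all numbers invented): under character-level
# independence, P("the" -> "thc") is the product of per-character probabilities.
p_t_t, p_h_h, p_e_c = 0.95, 0.95, 0.01   # assumed entries of the Mi matrix
p_the_to_thc = p_t_t * p_h_h * p_e_c
print(f"P(the -> thc) = {p_the_to_thc:.4%}")  # 0.9025% with these made-up values
```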

44 Another interesting property of the noise introduced by actual OCR engines and our synthetic noise model is the way in which this noise affects word distributions. [sent-131, score-0.659]

45 This is very important, since word occurrence and co-occurrence counts are the basis for model inference in both clustering and topic modeling. [sent-132, score-0.181]

46 As mentioned previously, one common way of lessening the impact of OCR noise when training topic models over OCRed data is to apply a frequency cutoff filter to cull words that occur fewer than a certain number of times. [sent-133, score-0.361]
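
A minimal sketch of such a frequency cutoff filter (our code, not the paper's): it culls every word type occurring fewer than `cutoff` times in the corpus, matching the strict "less than" criterion noted below for Figures 1 and 2.

```python
from collections import Counter

def frequency_cutoff(docs, cutoff=5):
    """docs: list of token lists. Returns filtered docs and the culled vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc)
    culled = {w for w, c in counts.items() if c < cutoff}  # strictly fewer than cutoff
    kept_docs = [[tok for tok in doc if tok not in culled] for doc in docs]
    return kept_docs, culled
```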

47 Figures 1 and 2 show the number of word types that are culled from the synthetic 20 Newsgroups OCR data and the Eisenhower OCR data, respectively, at various levels of noise. [sent-134, score-0.266]

48 Note that the cutoff filters use a strict “less than” criterion. [Figure 1: The number of word types culled with frequency cutoff filters applied to the 20 Newsgroups data with various levels of errors introduced.] [sent-135, score-0.357]

49 Also, these series are additive, as the words culled with a frequency cutoff of 2 are a subset of those culled with a frequency cutoff of j > 2. [sent-137, score-0.342]

50 In both cases, it is apparent that by far the largest impact that noise has is in the creation of singletons. [sent-138, score-0.145]

51 This means that it is unlikely that enough evidence will be available to associate, through similar contexts, the original word and its corrupted forms. [sent-140, score-0.184]

52 4 Experimental Results: We ran experiments on both the real and synthetic OCR data. [sent-143, score-0.185]

53 In this section we explain our experiments. [Figure 2: The number of word types culled with frequency cutoff filters applied to the transcript and three OCR engine outputs for the Eisenhower data.] [sent-144, score-0.268]

54 4.1 Methodology: For the synthetic OCR datasets, we ran clustering experiments using EM on a mixture of multinomials (cf. Walker and Ringger, 2008). [sent-147, score-0.311]
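
For reference, here is a compact EM sketch for a mixture of multinomials over bag-of-words count vectors; it is our own illustration under standard assumptions (uniform initial priors, Dirichlet-random word distributions, add-alpha smoothing), not the implementation used in the paper.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_multinomials_em(X, K, iters=50, seed=0, alpha=0.01):
    """X: (D, V) array of per-document word counts; K: number of clusters."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    log_pi = np.log(np.full(K, 1.0 / K))        # uniform cluster priors
    phi = rng.dirichlet(np.ones(V), size=K)     # (K, V) per-cluster word distributions
    for _ in range(iters):
        # E-step: responsibilities (the multinomial coefficient cancels across k)
        log_r = log_pi[None, :] + X @ np.log(phi).T         # (D, K)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: re-estimate priors and smoothed word distributions
        log_pi = np.log(r.sum(axis=0) / D + 1e-12)
        counts = r.T @ X + alpha                            # (K, V)
        phi = counts / counts.sum(axis=1, keepdims=True)
    return np.exp(log_pi), phi, r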

55 For both the synthetic and non-synthetic data we also trained LDA topic models (Blei et al. [sent-158, score-0.311]

56 The number of topics used for each dataset was adjusted a priori according to the number of documents it contained. [sent-165, score-0.163]

57 In addition to running experiments on the “raw” synthetic data, we also applied simple unsupervised feature selectors before training in order to evaluate the effectiveness of such measures in mitigating problems caused by OCR errors. [sent-167, score-0.225]

58 For the topic modeling (LDA) experiments three feature selectors were used. [sent-168, score-0.166]

59 The first method employed was a simple term frequency cutoff filter (TFCF), with a cutoff of 5 as in (Wang and McCallum, 2006). [sent-169, score-0.18]

60 The next method employed was Term Contribution (TC), a feature selection algorithm developed for document clustering (Liu et al. [sent-170, score-0.172]
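
A hedged sketch of Term Contribution as we read it (the citation above is truncated; this is our reconstruction, not Liu et al.'s code): a term's score is the sum, over all pairs of distinct documents, of the product of its tf-idf weights, which has a closed form per vocabulary column.

```python
import numpy as np

def term_contribution(W):
    """W: (D, V) tf-idf matrix. Returns one score per term (column)."""
    col_sum = W.sum(axis=0)
    col_sq = (W ** 2).sum(axis=0)
    return col_sum ** 2 - col_sq  # = sum over pairs i != j of w_i * w_j

def select_top_terms(W, frac=0.1):
    """Keep the top `frac` of terms by contribution (fraction is our assumption)."""
    scores = term_contribution(W)
    k = max(1, int(frac * W.shape[1]))
    return np.argsort(scores)[::-1][:k]
```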

61 This does not mean that all documents contain only one word after feature selection, as the top word in one document may occur in many other documents, even if it is not the top word in those documents. [sent-179, score-0.133]

62 Because all of these procedures alter the number of words and tokens in the final data, log-likelihood measured on a held-out set cannot be used to accurately compare the quality of topic models trained on pre-processed data, as the held-out data will contain many unknown words. [sent-188, score-0.21]

63 We use an alternative method for evaluating the topic models, discussed in (Griffiths et al. [sent-191, score-0.126]

64 Since the synthetic data is derived from datasets that have topical document labels, we are able to use the output from LDA in a classification problem with the word vectors for each document being replaced by the assigned topic vectors. [sent-193, score-0.557]

65 A naive Bayes learner is trained on a portion of the topic vectors, labeled with the original document label, and then the classification accuracy on a held-out portion of the data is computed. [sent-195, score-0.218]
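
A minimal sketch of this evaluation protocol using scikit-learn stand-ins (the component choices are our assumptions; the paper's own tooling differs): infer per-document topic vectors, then score a naive Bayes classifier on them with ten-fold cross-validation against the original labels.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def topic_vector_accuracy(X_counts, labels, n_topics=100, folds=10, seed=0):
    """X_counts: (D, V) term counts; labels: original document topic labels."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(X_counts)   # (D, K) per-document topic vectors
    return cross_val_score(MultinomialNB(), theta, labels, cv=folds).mean()
```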

66 4.2 Empirical Analysis: Both the mixture of multinomials document model and LDA appear to be fairly resilient to character-level noise. [sent-202, score-0.186]

67 Figures 4 and 5 show the results of the document clustering experiments with and without feature selection, respectively. [sent-203, score-0.147]

68 Memory issues prevented the collection of results for the highest error rates on the Enron and Reuters data without feature selection. [sent-204, score-0.145]

69 Once feature selection occurs, however, performance remains much more stable as error rates increase. [sent-210, score-0.17]

70 Unfortunately, it was not possible to compare the performance of the pre-processing methods on this dataset, due to a lack of document topic labels and the deficiencies of log-likelihood mentioned previously. [sent-214, score-0.247]

71 Figure 6(b) shows the results of the LDA topicmodeling experiments on the three “raw” synthetic datasets. [sent-215, score-0.185]

72 Figures 7(a) through 7(c) show the results of evaluating the various proposed pre-processing procedures in the context of topic modeling. [sent-218, score-0.149]

73 These results show that topic quality on both the raw and pre-processed noisy data degrades at a rate relative to the amount of errors in the data. [sent-227, score-0.241]

74 That is, the difference in performance between two relatively low word error rates (e. [sent-228, score-0.145]

75 5% and 7% on the Reuters data) is small, whereas the differences between two high error rates (e. [sent-230, score-0.145]

76 While pre-processing does improve model quality, in the case of LDA this improvement amounts to a nearly constant boost; at high error rates quality is improved the same amount as at low error rates. [sent-233, score-0.224]

77 In order to provide a more thorough discussion of the relative quality of the topic models induced on the OCR data versus those induced on clean data, we sampled the results of several of the runs of the LDA algorithm. [sent-240, score-0.227]

78 In Tables 2 and 3 we show the top words for the five topics with the highest topic prior (α in the LDA literature) learned during Gibbs sampling. [sent-241, score-0.192]

79 In general, there appears to be a surprisingly good correlation between the topics learned on the clean data and those learned on the corrupted data, given the high level of noise involved. [sent-243, score-0.465]

80 However, the topics trained on the clean data, though all related to financial markets, are fairly distinctive. [sent-245, score-0.21]

81 In addition, it appears as though the first topic (topic 93) is not very coherent at all. [sent-248, score-0.154]

82 This topic is significantly larger, in terms of the number of tokens assigned to it, than the other topics shown in either table. [sent-249, score-0.222]

83 For example, even though there are no instances of “ts” as a distinct token in the clean Reuters data, it is in the list of the top 19 words for topic 93. [sent-254, score-0.267]

84 It is also the case that, for most topics learned on the corrupted data, the most probable words for those topics tend to be shorter, on average, than for topics learned on clean data. [sent-256, score-0.452]

85 We believe this is due to the fact that the processes used to add noise to the data (both real OCR engines and our synthetic noise model) are more likely to corrupt long words, especially in the case of the synthetic data which was created using a character-level noise model. [sent-257, score-0.871]

86 [Figure 7: Average ten-fold cross-validation accuracy for the LDA pre-processing experiments on the synthetic OCR data; panels (a) 20 Newsgroups, (b) Reuters, (c) Enron.] [sent-264, score-0.185]

87 As a result, given that a word recognition error has occurred in true OCR output, it is more likely to be an error that lies at an edit distance greater than one from the true word, or else it would have been corrected internally. [sent-267, score-0.15]

88 In all cases, the corrupted versions of a given word are very rare, occurring usually only once or twice in the noisy output, making them useless features for informing a model. [sent-273, score-0.24]

89 5 Conclusions and Future Work: The primary outcome of these experiments is an understanding of when clustering and LDA topic models can be expected to function well on noisy OCR data. [sent-274, score-0.215]

90 Our results imply that clustering methods should perform almost as well on OCR data as they do on clean data, provided that a reasonable feature selection algorithm is employed. [sent-275, score-0.15]

91 The LDA topic model degraded less gracefully in performance [interrupting table caption: prior values found using MALLET for one run of LDA on the Reuters data corrupted with the data-derived noise model to a WER of 45%] [sent-276, score-0.455]

92 with the addition of character-level errors to its input, with higher error rates impacting model quality in a way that was apparent empirically in the log-likelihood and ten-fold cross-validation metrics as well as through human inspection of the produced topics. [sent-277, score-0.276]

93 We found it to be the case that even in data with high word error rates, corrupted words often share many characters in common with their uncorrupted form. [sent-279, score-0.283]

94 This suggests an approach in which word similarities are used to cluster the unique corrupted versions of a word in order to increase the evidence available to the topic model during training time and improve model quality. [sent-280, score-0.332]
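
A hedged sketch of that future-work idea: map each rare, likely corrupted form onto a similar frequent vocabulary word before training, so its evidence pools with the uncorrupted form. The greedy strategy, the `difflib` similarity measure, and both thresholds are our assumptions.

```python
from difflib import SequenceMatcher

def canonicalize(vocab_counts, min_count=5, sim_threshold=0.8):
    """vocab_counts: dict word -> corpus frequency. Returns a mapping from rare
    (likely corrupted) forms to a sufficiently similar frequent form."""
    frequent = [w for w, c in vocab_counts.items() if c >= min_count]
    mapping = {}
    for word, count in vocab_counts.items():
        if count >= min_count:
            continue  # only rare forms are candidates for merging
        best = max(frequent, default=None,
                   key=lambda f: SequenceMatcher(None, word, f).ratio())
        if best and SequenceMatcher(None, word, best).ratio() >= sim_threshold:
            mapping[word] = best
    return mapping
```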

95 How much noise is too much: A study in automatic text classification. [sent-294, score-0.145]

96 A survey of retrieval strategies for OCR text collections. [sent-303, score-0.781]

97 Optical character recognition errors and their effects on natural language processing. [sent-381, score-0.13]

98 The effect of speech recognition accuracy rates on the usefulness and usability of webcast archives. [sent-408, score-0.127]

99 Evaluating text categorization in the presence of OCR errors. [sent-441, score-0.781]

100 Context-sensitive error correction: Using topic models to improve OCR. [sent-485, score-0.174]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ocr', 0.781), ('synthetic', 0.185), ('corrupted', 0.184), ('eisenhower', 0.175), ('lda', 0.151), ('noise', 0.145), ('topic', 0.126), ('rates', 0.097), ('enron', 0.094), ('reuters', 0.093), ('document', 0.092), ('cutoff', 0.09), ('wer', 0.086), ('culled', 0.081), ('thc', 0.081), ('newsgroups', 0.081), ('clean', 0.07), ('ocred', 0.067), ('topics', 0.066), ('ringger', 0.058), ('clustering', 0.055), ('taghva', 0.054), ('walker', 0.054), ('character', 0.05), ('errors', 0.05), ('error', 0.048), ('lund', 0.046), ('optical', 0.046), ('instances', 0.043), ('engine', 0.043), ('multinomials', 0.042), ('documents', 0.041), ('abbyy', 0.04), ('communiqu', 0.04), ('rfp', 0.04), ('selectors', 0.04), ('tesseract', 0.04), ('tnpd', 0.04), ('engines', 0.039), ('mallet', 0.038), ('yesterday', 0.036), ('wallach', 0.036), ('datasets', 0.035), ('corruption', 0.035), ('corruptions', 0.035), ('digitized', 0.035), ('blei', 0.034), ('noisy', 0.034), ('rand', 0.031), ('transcript', 0.031), ('quality', 0.031), ('adjusted', 0.031), ('tokens', 0.03), ('recognition', 0.03), ('labels', 0.029), ('mixture', 0.029), ('though', 0.028), ('topical', 0.027), ('collections', 0.027), ('trends', 0.027), ('berry', 0.027), ('borsack', 0.027), ('corrupt', 0.027), ('declines', 0.027), ('experiences', 0.027), ('historians', 0.027), ('icdar', 0.027), ('jcdl', 0.027), ('kazem', 0.027), ('newmann', 0.027), ('nuance', 0.027), ('omnipage', 0.027), ('uncorrupted', 0.027), ('failure', 0.027), ('mimno', 0.027), ('dataset', 0.025), ('selection', 0.025), ('characters', 0.024), ('edit', 0.024), ('david', 0.023), ('brigham', 0.023), ('byu', 0.023), ('agarwal', 0.023), ('allen', 0.023), ('libraries', 0.023), ('battle', 0.023), ('distort', 0.023), ('beitzel', 0.023), ('farooq', 0.023), ('financial', 0.023), ('hubert', 0.023), ('meil', 0.023), ('filters', 0.023), ('procedures', 0.023), ('fairly', 0.023), ('md', 0.023), ('semantics', 0.023), ('latent', 0.023), ('versions', 0.022), ('digital', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

2 0.10568001 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical level. In this paper we address the problem of modeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hastings inference algorithm for a semi-supervised extension with decent results.

3 0.099817567 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

4 0.094784841 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

5 0.090178981 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

6 0.078311734 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

7 0.074103229 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

8 0.065660179 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

9 0.065546528 84 emnlp-2010-NLP on Spoken Documents Without ASR

10 0.062333375 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

11 0.061704963 77 emnlp-2010-Measuring Distributional Similarity in Context

12 0.05094919 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

13 0.049371284 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

14 0.045131527 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

15 0.043648578 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

16 0.042282004 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

17 0.041372541 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

18 0.037739836 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

19 0.03672605 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

20 0.03619615 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.143), (1, 0.121), (2, -0.158), (3, -0.053), (4, 0.072), (5, -0.014), (6, -0.064), (7, -0.06), (8, -0.016), (9, -0.086), (10, 0.023), (11, 0.004), (12, -0.109), (13, 0.104), (14, -0.002), (15, 0.018), (16, -0.005), (17, 0.041), (18, 0.107), (19, -0.094), (20, 0.047), (21, 0.008), (22, -0.041), (23, 0.059), (24, 0.194), (25, 0.052), (26, 0.075), (27, -0.013), (28, 0.116), (29, -0.001), (30, 0.039), (31, -0.15), (32, -0.007), (33, -0.002), (34, 0.094), (35, 0.01), (36, -0.038), (37, -0.07), (38, -0.102), (39, -0.036), (40, -0.091), (41, 0.105), (42, 0.018), (43, 0.226), (44, 0.128), (45, 0.017), (46, 0.081), (47, 0.011), (48, -0.141), (49, 0.111)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9233613 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

2 0.62794 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

Author: Zhiyuan Liu ; Wenyi Huang ; Yabin Zheng ; Maosong Sun

Abstract: Existing graph-based ranking methods for keyphrase extraction compute a single importance score for each word via a single random walk. Motivated by the fact that both documents and words can be represented by a mixture of semantic topics, we propose to decompose traditional random walk into multiple random walks specific to various topics. We thus build a Topical PageRank (TPR) on word graph to measure word importance with respect to different topics. After that, given the topic distribution of the document, we further calculate the ranking scores of words and extract the top ranked ones as keyphrases. Experimental results show that TPR outperforms state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.

3 0.55197704 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical level. In this paper we address the problem of modeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hastings inference algorithm for a semi-supervised extension with decent results.

4 0.53073329 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

5 0.48549259 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

6 0.43368861 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

7 0.42164105 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

8 0.3442578 84 emnlp-2010-NLP on Spoken Documents Without ASR

9 0.32565492 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

10 0.31420735 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

11 0.27538276 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

12 0.27507299 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

13 0.26690435 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

14 0.25116622 77 emnlp-2010-Measuring Distributional Similarity in Context

15 0.24888 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

16 0.24277623 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

17 0.23758361 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

18 0.22512795 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

19 0.21555761 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

20 0.21055529 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.03), (10, 0.011), (12, 0.04), (29, 0.101), (30, 0.047), (32, 0.019), (52, 0.019), (56, 0.075), (62, 0.012), (66, 0.107), (72, 0.067), (76, 0.021), (78, 0.314), (79, 0.015), (87, 0.018), (89, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7076931 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

2 0.48993492 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

3 0.4874481 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

4 0.48670685 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan

Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semisupervised active learning strategy.

5 0.48656595 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

Author: Kristian Woodsend ; Yansong Feng ; Mirella Lapata

Abstract: The task of selecting information and rendering it appropriately appears in multiple contexts in summarization. In this paper we present a model that simultaneously optimizes selection and rendering preferences. The model operates over a phrase-based representation of the source document which we obtain by merging PCFG parse trees and dependency graphs. Selection preferences for individual phrases are learned discriminatively, while a quasi-synchronous grammar (Smith and Eisner, 2006) captures rendering preferences such as paraphrases and compressions. Based on an integer linear programming formulation, the model learns to generate summaries that satisfy both types of preferences, while ensuring that length, topic coverage and grammar constraints are met. Experiments on headline and image caption generation show that our method obtains state-of-the-art performance using essentially the same model for both tasks without any major modifications.

6 0.48560455 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

7 0.48488164 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

8 0.4844299 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

9 0.48427805 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

10 0.48348138 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

11 0.48099354 84 emnlp-2010-NLP on Spoken Documents Without ASR

12 0.48049006 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

13 0.48046309 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

14 0.47985005 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

15 0.47947797 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

16 0.47937468 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

17 0.47850889 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

18 0.4782449 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

19 0.47791734 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

20 0.47761938 86 emnlp-2010-Non-Isomorphic Forest Pair Translation