62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input


Source: pdf

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 However, an IE engine serving real applications yields many false alarms due to less-well-formed input. [sent-10, score-0.19]

2 The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. [sent-12, score-0.513]

3 We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. [sent-14, score-0.257]

4 For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. [sent-16, score-0.395]

5 In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. [sent-17, score-0.202]

6 These gains come with virtually no loss in accuracy on clean English text. [sent-19, score-0.367]

7 1 Introduction Information-extraction (IE) research is typically performed on clean text in a predetermined language. [sent-20, score-0.409]

8 Broadcast monitoring demands that IE handle as input not only clean text, but also the transcripts output by speech recognizers. [sent-26, score-0.457]

9 Multilingual applications, and the imperfection of translation technology, require IE to contend with non-target-language text input (Pitrelli et al. [sent-28, score-0.159]

10 Naive users at times input to IE other material which deviates from clean text, such as a PDF file that “looks” like plain text. [sent-31, score-0.645]

11 Search applications require IE to deal with databases which not only possess clean text but at times exhibit other complications like markup codes particular to narrow, application-specific data-format standards, for example, the excerpt from a financial-transactions data set shown in Figure 1. [sent-33, score-0.444]

12 These issues greatly complicate the IE problem, particularly considering that adapting IE to such formats is hampered by the existence of a multitude of such “standards” and by lack of sufficient annotated data in each one. [sent-59, score-0.129]

13 A typical state-of-the-art statistical IE engine will happily process such “noisy” inputs, and will typically provide garbage-in/garbage-out performance, embarrassingly reporting spurious “information” no human would ever mistake. [sent-60, score-0.099]

14 This information can include accurate speech-recognition output, names which are recognizable even in wrong-language material, and clean target-language passages interleaved with the markup. [sent-62, score-0.657]

15 Such explicit classification would be impractical in the presence of the interleaving and the unconstrained data formats from unpredetermined sources. [sent-67, score-0.17]

16 We begin our robustness work by addressing an important and basic IE task: mention detection (MD). [sent-68, score-0.378]

17 A mention also has a specific class which describes the type of entity it refers to. [sent-77, score-0.229]

18 For instance, consider the following sentence: Julia Gillard, prime minister of Australia, declared she will enhance the country’s economy. [sent-78, score-0.113]

19 Here we see three mentions of one person entity: Julia Gillard, prime minister, and she; these mentions are of type named, nominal, and pronominal, respectively. [sent-79, score-0.615]

20 Australia and country are mentions of type named and nominal, respectively, of a single geopolitical entity. [sent-80, score-0.33]

21 Therefore, the goal of this study is to investigate schemes to make a language-specific MD engine robust to the types of interspersed non-target material described above. [sent-86, score-0.335]

22 In these initial experiments, we work with English as the target language, though we aim to make our approach to robustness as target-language-independent as possible. [sent-87, score-0.167]

23 However, we process mixed-language material including real-world data with its own peculiar mark-up, text conventions including abbreviations, and mix of languages, with the goal of English MD. [sent-89, score-0.278]

24 First, non-target-character-set passages (here, non-Latin-alphabet) are identified and marked for non-processing. [sent-91, score-0.145]

25 Then, following word-tokenization, we apply a language classifier to a sliding variable-length set of windows in order to generate features for each word indicative of how much the text around that word resembles good English, primarily in comparison to other Latin-alphabet languages. [sent-92, score-0.28]

26 Section 4 introduces enhancements to the system to achieve robustness. [sent-102, score-0.106]

27 2 Previous work on mention detection The MD task has close ties to named-entity recognition, which has been the focus of much recent research (Bikel et al. [sent-104, score-0.252]

28 Usually, in computational-linguistics literature, a named entity represents an instance of a location, a person, or an organization, and the named-entity-recognition task consists of identifying each individual occurrence of names of such an entity appearing in the text. [sent-109, score-0.299]

29 Effort to handle noisy data is still limited, especially for scenarios in which the system at decoding time does not have prior knowledge of the input data source. [sent-113, score-0.215]

30 , 2005) assume that the input data is text from e-mails, and define special features to enhance the detection of named entities. [sent-118, score-0.273]

31 , 2000) assume that the input data is the output of a speech or optical character recognition system, and hence extract new features for better named-entity recognition. [sent-121, score-0.159]

32 eliminate the noisy text from the document before performing data mining (Yi et al. [sent-124, score-0.234]

33 Hence, they do not try to process noisy data; instead, they remove it. [sent-126, score-0.118]

34 Also we do not want to eliminate the noisy data, but rather attempt to detect the appropriate mentions, if any, that appear in that portion of the data. [sent-128, score-0.192]

35 1 Mention detection: standard features The features used by our mention detection systems can be divided into the following categories: 1. [sent-140, score-0.252]

36 Dictionaries contain single names such as John or Boston, and also phrases such as Barack Obama, New York City, or The United States. [sent-154, score-0.11]

37 The use of this framework to build MD systems for clean English text has given very competitive results at ACE evaluations (Florian et al. [sent-156, score-0.409]

38 4 Enhancements for robustness As stated above, our goal is to skip spans of characters which do not lend themselves to target-language MD, while minimizing impact on MD for target-language text, with English as the initial target language for our experiments. [sent-160, score-0.219]

39 It is important to note that this is not merely a document-classification problem; this non-target data is often interspersed with valid input text. [sent-162, score-0.179]

40 , , Therefore, to minimize needless loss of processable material, a robustness algorithm ideally does a sliding analysis, in which, character-by-character or word-by-word, material may be deemed to be suitable to process. [sent-172, score-0.348]

41 Furthermore, a variety of strategies will be needed to contend with the diverse nature of non-target material and the patterns in which it will appear among valid input. [sent-173, score-0.236]

42 detection of standard file formats, SGML, and associated detagging, such as 2. [sent-175, score-0.144]

43 primarily gazetteer-based) features, to catch isolated obvious English/English-compatible names embedded in otherwise-foreign text. [sent-183, score-0.124]

44 1 Detection and detagging for standard file formats Some types of mark-up are well-known standards, such as SGML (Warmer and van Egmond, 1989). [sent-187, score-0.178]

45 2 Character-set segmentation Some entity mentions may be recognizable in a non-target language which shares the target language’s character set, for example, a person’s name recognizable by English speakers in an otherwise-not-understandable Spanish sentence. [sent-192, score-0.567]

46 However, non-target character sets, such as Arabic and Chinese when processing English, represent pure noise for an IE system. [sent-193, score-0.141]

47 Therefore, deterministic character-set segmentation is applied, to mark non-target-character-set passages for non-processing by the remainder of the system, or, in a multilingual system, to be diverted to a subsystem suited to process that character set. [sent-194, score-0.249]

48 Characters which can be ambiguous with regard to character set, such as some punctuation marks, are attached to target-character-set passages when possible, but are not considered to break non-target-character-set passages surrounding them on both sides. [sent-195, score-0.353]
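
The segmentation rule in sentences 47 and 48 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `char_class` and `segment` are hypothetical, Latin letters are assumed to be the target character set, and every non-letter (digits, punctuation, whitespace) is assumed to be "ambiguous". An ambiguous run stays inside a non-target passage only when non-target characters surround it on both sides; otherwise it is attached to target material.

```python
import unicodedata


def char_class(ch):
    """Classify one character (hypothetical helper, Latin = target)."""
    if ch.isalpha():
        # Unicode names of Latin letters start with "LATIN".
        is_latin = unicodedata.name(ch, "").startswith("LATIN")
        return "target" if is_latin else "other"
    # Assumption: digits, punctuation and whitespace are ambiguous.
    return "ambiguous"


def segment(text):
    """Split text into (label, passage) spans with labels target/other."""
    classes = [char_class(c) for c in text]
    n = len(text)
    i = 0
    while i < n:
        if classes[i] != "ambiguous":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "ambiguous":
            j += 1
        left = classes[i - 1] if i > 0 else None
        right = classes[j] if j < n else None
        # Ambiguous characters do not break a non-target passage that
        # surrounds them on both sides; otherwise they go to the target.
        label = "other" if left == right == "other" else "target"
        classes[i:j] = [label] * (j - i)
        i = j
    spans = []
    for ch, label in zip(text, classes):
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + ch)
        else:
            spans.append((label, ch))
    return spans
```

Spans labeled "other" would then be marked for non-processing, or routed to a subsystem for that character set in a multilingual setup.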

49 4 Robust mention detection After preprocessing steps presented earlier, we detect mentions using a cascaded approach that combines several MD classifiers. [sent-202, score-0.501]

50 We select a classifier based on a sentence-level determination of the material’s fit to the target language. [sent-205, score-0.166]

51 First, we build an n-gram language model on clean target-language training text. [sent-206, score-0.367]

52 A sentence with a PP lower than a threshold θ1 is considered “clean” and hence the “clean” baseline MD model described in Section 3 is used to detect mentions of this sentence. [sent-212, score-0.249]
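
The sentence-level selection described in sentences 50 to 52 can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: a word-unigram model with add-one smoothing stands in for the n-gram language model, and the second threshold `theta2`, which separates the mixed and gazetteer-based models, is an assumption, since the extracted sentences state only the `theta1` rule for the clean model. All function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words in clean target-language training text."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, model):
    """Per-word perplexity under an add-one-smoothed unigram model."""
    counts, total = model
    vocab = len(counts) + 1  # one extra slot for unseen words
    words = sentence.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))


def select_model(pp, theta1, theta2):
    """Pick an MD model from sentence perplexity (theta2 is assumed)."""
    if pp < theta1:
        return "clean"
    if pp < theta2:
        return "mixed"
    return "gazetteer"
```

A sentence that resembles the training text receives a lower perplexity than a foreign-language sentence, so it is routed to the clean model; in practice the thresholds would be tuned on held-out data.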

53 The clean MD model has access to standard features described in Section 3. [sent-213, score-0.367]

54 The gazetteer-based MD model has access only to gazetteer information and does not look to lexical context during decoding, reflecting the likelihood that in this poor material, words surrounding any recognizable mention are foreign and therefore unusable. [sent-216, score-0.472]

55 This set contains a mix of clean English and Latin-alphabet-but-non-English text that is not used for training and evaluation. [sent-219, score-0.409]

56 We will show in the experiments section how this combination strategy is effective not only in maintaining good performance on a clean English text but also in improving performance on non-English data when compared to other source-specific MD models. [sent-222, score-0.448]

57 5 Mixed mention detection model The mixed MD model is designed to process “sentences” mixing English with non-English, whether foreign-language or non-language material. [sent-224, score-0.412]

58 1 Language-identification features We apply an n-gram-based language classifier (Prager, 1999) to variable-length sliding windows as follows. [sent-228, score-0.188]

59 For each word, we run 1- through 6-preceding-word windows through the classifier, and 1- through 6-word windows beginning with the word, for a total of 12 windows, yielding for each window a result like: 0 . [sent-229, score-0.112]
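
The twelve windows per word described in sentence 59 can be generated as in this minimal sketch. The function name `sliding_windows` is hypothetical, and two details are assumptions: the preceding-word windows exclude the current word, and windows are truncated at sentence boundaries.

```python
def sliding_windows(words, i, max_len=6):
    """Return the 12 windows around words[i]: 6 preceding, 6 starting at i."""
    # Windows of the 1..max_len words preceding position i
    # (assumed not to include the word itself).
    before = [words[max(0, i - k):i] for k in range(1, max_len + 1)]
    # Windows of 1..max_len words beginning with the word at position i.
    after = [words[i:i + k] for k in range(1, max_len + 1)]
    return before + after
```

Each window would then be passed to the language classifier, and the twelve results used as features for the word at position `i`.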

60 We bin these and use them as input to a maximum-entropy classifier (separate from the MD classifier) which outputs “English” or “Non-English”, and a confidence score. [sent-244, score-0.139]

61 The language-identification classifier and the maximum-entropy “how-English” classifier are each trained on text data separate from each other and from the training and test sets for MD. [sent-246, score-0.21]

62 These features are part of the augmentation of the mixed MD model relative to the clean MD model. [sent-250, score-0.527]

63 These data average approximately 21 annotated mentions per 100 words. [sent-254, score-0.249]

64 They are annotated using the same annotation conventions as “English”, and from the perspective of English; that is, only mentions which would be clear to an English speaker are labeled, such as Barack Obama in the Spanish example in Section 4. [sent-257, score-0.249]

65 For this reason, these data average only approximately 5 mentions per 100 words. [sent-258, score-0.249]

66 Figure 1 shows example passages from this database, anonymized while preserving the character of the content. [sent-260, score-0.208]

67 In short, good English is interspersed with non-language content, foreign-language text, and rough English like data-entry errors and haphazard abbreviations. [sent-266, score-0.206]

68 The eight in the right-most column are not further distinguished by mention type, while the remaining 36 are further classified as named, nominal or pronominal, for a total of 36 × 3 + 8 = 116 mention labels. [sent-284, score-0.366]

69 Table 2: Performance of clean, mixed, and gazetteer-based mention detection systems as well as their combination. [sent-294, score-0.252]

70 Not surprisingly, the baseline system, intended for clean data, performs poorly on noisy data. [sent-298, score-0.485]

71 However, because the mixed classifier, and more so the gazetteer classifier, are oriented to noisy data, on clean data they suffer in performance by 2. [sent-300, score-0.911]

72 5 F-measure point of this loss, while also actually performing better on the noisy data sets than the two classifiers specifically targeted toward them, as can be seen in Table 2. [sent-303, score-0.118]

73 Including them would make our system look comparatively stronger, as they would have only spurious mentions and so generate false alarms but no correct mentions in the baseline system, while our system deterministically removes them. [sent-312, score-0.797]

74 As mentioned above, we view MD robustness primarily as an effort to eliminate, relative to a baseline system, large volumes of spurious “mentions” detected in non-target input content, while minimiz3http://www. [sent-313, score-0.293]

75 be/conll2002/ner/ (a) DET plot for clean (baseline), mixed, gazetteer, and combination MD systems on the Latin-alphabet-non-English text. [sent-317, score-0.486]

76 The clean system (upper curve) performs far worse than the other three systems designed to provide robustness; these systems in turn perform nearly indistinguishably. [sent-318, score-0.409]

77 (b) DET plot for clean (baseline), mixed, gazetteer, and combination MD systems on the Transactions data set. [sent-319, score-0.486]

78 The clean system (upper/longer curve) reaches far higher false-alarm rates, while never approaching the lower miss rates achievable by any of the other three systems, which in turn perform comparably to each other. [sent-320, score-0.521]

79 Figure 2: DET plots for Latin-alphabet-non-English and Transactions data sets ing disruption of detection in target input. [sent-321, score-0.198]

80 A secondary goal is recall in the event of occasional valid mentions in such non-target material. [sent-322, score-0.249]

81 Thus, as input material degrades, precision increases in importance relative to recall. [sent-323, score-0.229]

82 , 1997) analysis, in which we plot a curve of miss rate on valid mentions vs. [sent-325, score-0.471]

83 Each mention is treated equally in this analysis, so frequently-recurring entity/mention types weigh on the results accordingly. [sent-328, score-0.157]
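
The DET analysis in sentences 82 and 83 can be sketched as a threshold sweep over scored mention candidates. This is illustrative only: the function name `det_points` and the input layout are hypothetical, and normalizing false alarms by the number of input words is an assumption, because the extracted sentence is truncated before it states the false-alarm axis. Each mention counts equally, as the text describes.

```python
def det_points(scored, n_reference, n_words):
    """Compute (threshold, miss_rate, false_alarm_rate) operating points.

    scored      -- list of (confidence, is_valid) pairs, one per candidate
    n_reference -- number of valid mentions in the reference annotation
    n_words     -- number of input words (assumed false-alarm normalizer)
    """
    points = []
    for threshold in sorted({score for score, _ in scored}):
        kept = [ok for score, ok in scored if score >= threshold]
        hits = sum(kept)                 # valid mentions still output
        false_alarms = len(kept) - hits  # spurious mentions still output
        points.append(
            (threshold, 1 - hits / n_reference, false_alarms / n_words)
        )
    return points
```

Plotting miss rate against false-alarm rate over these points gives one curve per system, which is how the clean, mixed, gazetteer, and combination systems are compared in Figure 2.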

84 Figure 2a shows a DET plot for the clean, mixed, gazetteer, and combination systems on the “Latin” data set, while Figure 2b shows the analogous plot for the “Transactions” data set. [sent-329, score-0.199]

85 6 (nearly the best achievable by the clean system), we find that the robustness-oriented systems eliminate 97% of the false alarms of the clean baseline system, as the plot shows false-alarm rates near 0. [sent-332, score-1.088]

86 In making a system more oriented toward robustness in the face of non-target inputs, it is important to quantify the effect of these systems being less-oriented toward clean, target-language text. [sent-337, score-0.286]

87 Figure 3 shows the analogous DET plot for the English test set, showing that achieving robustness through the combination system comes at a small cost to accuracy on the text the original system is trained to process. [sent-338, score-0.371]

88 be processed in a garbage-in-garbage-out fashion merely because the system was designed only to handle clean text in one language. [sent-341, score-0.451]

89 Thus we have embarked on information-extraction-robustness work, to improve performance on imperfect inputs while minimizing disruption of processing of clean text. [sent-342, score-0.485]

90 Rather than relying on explicit recognition of genre of source data, the experimental system merely does its own assessment of how much each sentence-sized chunk matches the target language, an important feature in the case of unknown text sources. [sent-345, score-0.166]

91 Chief among directions for further work is to continue to improve performance on noisy data, and to strengthen our findings via larger data sets. [sent-346, score-0.118]

92 Further work should also explore the degree to which the approach to achieving robustness must vary according to the target language. [sent-348, score-0.126]

93 Finally, robustness work should be expanded to other information-extraction tasks. [sent-349, score-0.126]

94 Exploiting diverse knowledge sources via maximum entropy in named entity recognition. [sent-374, score-0.153]

95 A statistical model for multilingual entity detection and tracking. [sent-411, score-0.208]

96 Factorizing complex models: A case study in mention detection. [sent-419, score-0.157]

97 The DET curve in assessment of detection task performance. [sent-441, score-0.172]

98 Named entity extraction from noisy input: speech and OCR. [sent-451, score-0.19]

99 Extracting personal names from email: Applying named entity recognition to informal text. [sent-461, score-0.268]

100 Eliminating noisy information in web pages for data mining. [sent-543, score-0.118]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clean', 0.367), ('md', 0.342), ('mentions', 0.249), ('gazetteer', 0.207), ('ie', 0.19), ('material', 0.174), ('mixed', 0.16), ('mention', 0.157), ('passages', 0.145), ('robustness', 0.126), ('interspersed', 0.124), ('det', 0.124), ('noisy', 0.118), ('florian', 0.117), ('alarms', 0.106), ('detection', 0.095), ('zitouni', 0.088), ('formats', 0.088), ('classifier', 0.084), ('named', 0.081), ('plot', 0.08), ('transaction', 0.08), ('curve', 0.077), ('names', 0.074), ('eliminate', 0.074), ('entity', 0.072), ('recognizable', 0.071), ('barack', 0.071), ('transactions', 0.07), ('standards', 0.069), ('miss', 0.065), ('ace', 0.065), ('enhancements', 0.064), ('latin', 0.064), ('character', 0.063), ('spurious', 0.062), ('contend', 0.062), ('disruption', 0.062), ('peculiar', 0.062), ('purchase', 0.062), ('oriented', 0.059), ('english', 0.058), ('windows', 0.056), ('inputs', 0.056), ('input', 0.055), ('minkov', 0.053), ('pitrelli', 0.053), ('sgml', 0.053), ('obama', 0.052), ('nominal', 0.052), ('characters', 0.052), ('primarily', 0.05), ('file', 0.049), ('sliding', 0.048), ('false', 0.047), ('pp', 0.047), ('rates', 0.047), ('tjong', 0.044), ('text', 0.042), ('system', 0.042), ('borthwick', 0.041), ('caldrade', 0.041), ('detagged', 0.041), ('detagging', 0.041), ('gazetteerbased', 0.041), ('haphazard', 0.041), ('huf', 0.041), ('interleaving', 0.041), ('ltd', 0.041), ('mini', 0.041), ('multitude', 0.041), ('nonlanguage', 0.041), ('nontarget', 0.041), ('peculiarities', 0.041), ('rueckgabe', 0.041), ('unpredetermined', 0.041), ('warmer', 0.041), ('zimmerman', 0.041), ('infeasible', 0.041), ('pronominal', 0.041), ('determination', 0.041), ('multilingual', 0.041), ('recognition', 0.041), ('target', 0.041), ('person', 0.04), ('miller', 0.04), ('combination', 0.039), ('engine', 0.037), ('noise', 0.037), ('reflecting', 0.037), ('token', 0.036), ('st', 0.036), ('totaling', 0.035), ('alphabetic', 0.035), ('benajiba', 0.035), ('ittycheriah', 0.035), ('financial', 0.035), ('monitoring', 0.035), ('tokenization', 0.035), ('databases', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

2 0.36451462 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich-language mention detection system via a parallel corpus.

3 0.19755429 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sieve-based approaches could be applied to other NLP tasks.

4 0.10359541 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.

5 0.090430811 20 emnlp-2010-Automatic Detection and Classification of Social Events

Author: Apoorv Agarwal ; Owen Rambow

Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.

6 0.086728498 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

7 0.080959789 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

8 0.078311734 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

9 0.077708215 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

10 0.058194641 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

11 0.057408821 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

12 0.057191566 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

13 0.055162482 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

14 0.054301403 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

15 0.053998176 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

16 0.053222552 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

17 0.051369678 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

18 0.050397255 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

19 0.050304398 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

20 0.049954996 63 emnlp-2010-Improving Translation via Targeted Paraphrasing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.221), (1, 0.095), (2, -0.072), (3, 0.194), (4, -0.171), (5, -0.353), (6, 0.018), (7, 0.076), (8, -0.081), (9, -0.354), (10, 0.105), (11, 0.143), (12, -0.009), (13, 0.174), (14, -0.024), (15, -0.021), (16, 0.024), (17, -0.085), (18, 0.136), (19, 0.018), (20, -0.152), (21, 0.022), (22, -0.044), (23, -0.034), (24, 0.113), (25, 0.023), (26, -0.045), (27, -0.092), (28, 0.085), (29, -0.015), (30, -0.046), (31, 0.078), (32, -0.095), (33, 0.076), (34, -0.048), (35, -0.014), (36, -0.069), (37, -0.063), (38, 0.007), (39, -0.009), (40, -0.036), (41, -0.003), (42, 0.036), (43, 0.088), (44, -0.01), (45, 0.086), (46, 0.013), (47, -0.006), (48, 0.011), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95930141 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

2 0.90266472 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich-language mention detection system via a parallel corpus.

3 0.70054549 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sieve-based approaches could be applied to other NLP tasks.

4 0.35781032 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.

5 0.32824224 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

Author: Danish Contractor ; Govind Kothari ; Tanveer Faruquie ; L V Subramaniam ; Sumit Negi

Abstract: Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a service that allows a user to query a Frequently-Asked-Questions (FAQ) database built in a local language (Hindi) using noisy SMS English queries. The inherent noise in the SMS queries, along with the language mismatch, makes this a challenging problem. We handle these two problems by formulating the query similarity over FAQ questions as a combinatorial search problem where the search space consists of combinations of dictionary variations of the noisy query and its top-N translations. We demonstrate the effectiveness of our approach on a real-life dataset.

6 0.32823479 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

7 0.24318855 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

8 0.23085044 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

9 0.22887483 20 emnlp-2010-Automatic Detection and Classification of Social Events

10 0.22545682 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

11 0.22173585 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

12 0.21456328 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

13 0.21149682 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

14 0.21044649 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

15 0.19994804 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

16 0.1992875 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

17 0.19915664 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

18 0.19617155 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

19 0.18985386 84 emnlp-2010-NLP on Spoken Documents Without ASR

20 0.18952259 109 emnlp-2010-Translingual Document Representations from Discriminative Projections


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.386), (10, 0.015), (12, 0.041), (29, 0.067), (30, 0.021), (32, 0.018), (43, 0.039), (52, 0.029), (56, 0.064), (66, 0.1), (72, 0.062), (76, 0.026), (77, 0.012), (79, 0.012), (87, 0.02), (89, 0.021)]
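
The topic vector above is sparse: each pair is a topic id and its weight, and topics not listed are presumably near zero. As a minimal sketch of how two such vectors could be compared, the Python below parses the list and computes cosine similarity. The page does not say which measure produced the simValue column, so cosine similarity is an assumption here, and the names parse_topics, cosine and other_paper are hypothetical.

```python
import ast
import math


def parse_topics(text):
    """Parse a '[(topicId, topicWeight), ...]' string into a dict."""
    return dict(ast.literal_eval(text))


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {id: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Topic vector copied from the listing above.
this_paper = parse_topics(
    "[(3, 0.386), (10, 0.015), (12, 0.041), (29, 0.067), (30, 0.021), "
    "(32, 0.018), (43, 0.039), (52, 0.029), (56, 0.064), (66, 0.1), "
    "(72, 0.062), (76, 0.026), (77, 0.012), (79, 0.012), (87, 0.02), "
    "(89, 0.021)]"
)

# Hypothetical vector for another paper, made up for illustration only.
other_paper = {3: 0.30, 12: 0.05, 66: 0.12, 91: 0.08}

similarity = cosine(this_paper, other_paper)
```

Storing only the non-zero topics keeps the dot product proportional to the number of listed topics rather than to the total number of topics in the model.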

similar papers list:

simIndex simValue paperId paperTitle

1 0.91418272 15 emnlp-2010-A Unified Framework for Scope Learning via Simplified Shallow Semantic Parsing

Author: Qiaoming Zhu ; Junhui Li ; Hongling Wang ; Guodong Zhou

Abstract: This paper approaches the scope learning problem via simplified shallow semantic parsing. This is done by regarding the cue as the predicate and mapping its scope into several constituents as the arguments of the cue. Evaluation on the BioScope corpus shows that the structural information plays a critical role in capturing the relationship between a cue and its dominated arguments. It also shows that our parsing approach significantly outperforms the state-of-the-art chunking ones. Although our parsing approach is only evaluated on negation and speculation scope learning here, it is portable to other kinds of scope learning.

same-paper 2 0.77023137 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.

3 0.73686701 61 emnlp-2010-Improving Gender Classification of Blog Authors

Author: Arjun Mukherjee ; Bing Liu

Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams, for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.

4 0.46000794 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich-language mention detection system via a parallel corpus.

5 0.41445425 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

6 0.41379571 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

7 0.4132998 20 emnlp-2010-Automatic Detection and Classification of Social Events

8 0.41137305 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

9 0.40804654 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

10 0.40792227 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

11 0.40450266 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

12 0.39770031 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

13 0.39744416 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

14 0.39687353 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

15 0.39087194 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

16 0.39035463 84 emnlp-2010-NLP on Spoken Documents Without ASR

17 0.39030081 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

18 0.38790745 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

19 0.3872062 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

20 0.38461927 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text