acl acl2010 acl2010-32 knowledge-graph by maker-knowledge-mining

32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data


Source: pdf

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. [sent-6, score-0.027]

2 In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. [sent-7, score-0.292]

3 We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. [sent-8, score-0.518]

4 The feature space covers lexical, morphological, and syntactic features. [sent-9, score-0.129]

5 The proposed approach yields an improvement of up to 1.64 F-measure (absolute). [sent-10, score-0.043]

6 The class-set used to tag NEs may vary according to user needs. [sent-17, score-0.029]

7 According to (Nadeau and Sekine, 2007), optimization of the feature set is the key component in enhancing the performance of a global NER system. [sent-19, score-0.074]

8 In this paper we investigate the possibility of building a high performance Arabic NER system by using a large space of available feature sets that go beyond the explored shallow feature sets used to date in the literature for Arabic NER. [sent-20, score-0.148]

9 Realizing that the gold data available for NER is quite limited in size, especially given the diverse genres in the set, we devise a method to bootstrap additional instances for the new features of interest from noisily NER-tagged Arabic data. [sent-29, score-0.387]

10 BASE employs Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) as Machine Learning (ML) approaches. [sent-32, score-0.04]

11 BASE uses lexical, syntactic and morphological features extracted using highly accurate automatic Arabic POS-taggers. [sent-33, score-0.229]

12 BASE employs a multi-classifier approach where each classifier tags an NE class separately. [sent-34, score-0.117]

13 The feature selection is performed by using an incremental approach selecting the top n features (the features are ranked according to their individual impact) at each iteration and keeping the set that yields the best results. [sent-35, score-0.287]
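The incremental selection just described can be sketched as a greedy forward loop; `evaluate` is a stand-in for training and scoring the NER model on a feature set, and the exact stopping rule below is our assumption, since the summary does not spell out the procedure.

```python
def incremental_feature_selection(ranked_features, evaluate, n=1):
    """Greedy forward selection over features ranked by individual impact.

    ranked_features: feature names, best-first (ranked by individual impact).
    evaluate: callable mapping a feature list to a score (e.g. F-measure).
    Adds the top-n remaining features per iteration and keeps the set
    that yields the best result.
    """
    selected, best_score = [], float("-inf")
    remaining = list(ranked_features)
    while remaining:
        candidate = selected + remaining[:n]  # try adding the top-n remaining
        score = evaluate(candidate)
        if score > best_score:
            selected, best_score = candidate, score
            remaining = remaining[n:]
        else:
            break  # adding more features no longer helps; keep the best set
    return selected, best_score
```

In practice `evaluate` would retrain the SVM/CRF classifiers and score on held-out data; any cheap proxy score works for illustrating the loop.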

14 The following is the feature set used in (Benajiba et al. [sent-37, score-0.074]

15 Gazetteers: automatically harvested Person NE class (PER), Geopolitical Entity NE class (GPE), and Organization NE class (ORG) lexica; 4. [sent-45, score-0.098]

16 POS-tag and Base Phrase Chunk (BPC): automatically tagged using AMIRA (Diab et al. [sent-46, score-0.102]

17 , 2007) which yields F-measures for both tasks in the high 90’s; 5. [sent-47, score-0.043]

18 Morphological features: automatically tagged using the Morphological Analysis and Disambiguation for Arabic (MADA) tool to extract information about gender, number, person, definiteness and aspect; 6. [sent-48, score-0.102]

19 Capitalization: derived as a side effect from running MADA. [sent-51, score-0.074]

20 MADA chooses a specific morphological analysis given the context of a given word. [sent-52, score-0.129]

21 This is part of the morphological information available in the underlying lexicon that MADA exploits. [sent-53, score-0.089]

22 As part of the information present, the underlying lexicon has an English gloss associated with each entry. [sent-54, score-0.036]

23 More often than not, if the word is a NE in Arabic then the gloss will also be a NE in English and hence capitalized. [sent-55, score-0.074]
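The capitalization cue described above reduces to a one-line check on the English gloss; the helper below is purely illustrative and not part of MADA, which only supplies the gloss string.

```python
def capitalization_feature(gloss):
    """Binary feature: is the English gloss of the Arabic word capitalized?

    Since NEs tend to have capitalized English glosses, this serves as a
    proxy capitalization feature for Arabic, which has no capitalization.
    """
    return bool(gloss) and gloss[0].isupper()

capitalization_feature("Obama")  # likely an NE: gloss is capitalized
capitalization_feature("ktAb")   # common noun ("book"): not capitalized
```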

24 We devise an extended Arabic NER system (EXTENDED) that uses the same architecture as BASE but employs additional features to those in BASE. [sent-56, score-0.196]

25 We specifically investigate the space of the surrounding context for the NEs. [sent-58, score-0.04]

26 We explore generalizations over the kinds of words that occur with NEs and the syntactic relations NEs engage in. [sent-59, score-0.055]

27 State-of-the-art Arabic syntactic parsing for the most common genre (with the most training data) of Arabic data, newswire, is in the low 80%s. [sent-61, score-0.115]

28 Hence, we acknowledge that some of the derived syntactic features will be noisy. [sent-62, score-0.14]

29 The size of the manually annotated gold data typically used for training Arabic NER systems poses a significant challenge for robustly exploring deeper syntactic and lexical features. [sent-64, score-0.212]

30 Accordingly, we bootstrap more NE tagged data via projection over Arabic-English parallel data. [sent-65, score-0.379]

31 The role of this data is simply to give us more instances of the newly defined features (namely the syntagmatic features) in the EXTENDED system as well as more instances for the Gazetteers and Context features defined in BASE. [sent-66, score-0.409]

32 It is worth noting that we do not use the bootstrapped NE tagged data directly as training data with the gold data. [sent-67, score-0.181]

33 1 Syntagmatic Features For deriving our deeper linguistic features, we parse the Arabic sentences that contain an NE. [sent-69, score-0.072]

34 For each of the NEs, we extract a number of features described as follows: - Syntactic head-word (SHW): The idea here is to look for a broader relevant context. [sent-70, score-0.085]

35 Whereas the lexical n-gram context feature used in BASE, and hence here for EXTENDED, considers the linearly adjacent neighboring words of an NE, SHW uses a parse tree to look at farther, yet related, words. [sent-71, score-0.298]

36 For instance, in the Arabic phrase “SrH Ams An bArAk AwbAmA ytrAs”, which means “declared yesterday that Barack Obama governs . (Figure 1: example for the head word and syntactic environment feature.) [sent-72, score-0.277]

37 According to the phrase structure parse, the first parent sub-tree headword of the NE “bArAk AwbAmA” is the verb ‘ytrAs’ (governs), the second one is ‘An’ (that) and the third one is the verb ‘SrH’ (declared). [sent-79, score-0.067]

38 This example illustrates that the word “Ams” is ignored for this feature set since it is not a syntactic head. [sent-80, score-0.129]

39 - Syntactic Environment (SE): This follows in the same spirit as SHW, but expands the idea in that it looks at the parent non-terminal instead of the parent head word, hence it is not a lexicalized feature. [sent-82, score-0.238]

40 The goal is to use a more abstract representation of the context in which an NE appears. [sent-83, score-0.04]

41 For instance, for the same example presented in Figure 1, the first, second, and third nonterminal parents of the NE “bArAk AwbAmA” are ‘S’, ‘SBAR’ and ‘VP’, respectively. [sent-84, score-0.04]
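The SE feature is just a short walk up the parse tree collecting ancestor labels. The sketch below hand-brackets the Figure 1 example (the bracketing is our assumption, not the parser's actual output); the SHW feature is analogous but additionally needs head-percolation rules to pick each ancestor's head word.

```python
def ancestor_labels(tree, target, k=3, path=None):
    """SE feature: labels of the first k non-terminal ancestors of `target`.

    A tree is a list [label, child, child, ...]; leaves are plain strings.
    Returns the ancestor labels nearest-first, or None if target not found.
    """
    path = path or []
    if tree is target:
        return [node[0] for node in reversed(path)][:k]
    if isinstance(tree, str):
        return None
    for child in tree[1:]:
        found = ancestor_labels(child, target, k, path + [tree])
        if found is not None:
            return found
    return None

# Hand-bracketed version of the Figure 1 example.
ne = ["NP", "bArAk", "AwbAmA"]
tree = ["S", ["VP", ["VBD", "SrH"], ["NP", "Ams"],
              ["SBAR", ["IN", "An"],
               ["S", ne, ["VP", ["VBZ", "ytrAs"]]]]]]
print(ancestor_labels(tree, ne))  # → ['S', 'SBAR', 'VP']
```

This reproduces the labels given in the text: the first, second, and third non-terminal parents of the NE are ‘S’, ‘SBAR’ and ‘VP’.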

42 2 Bootstrapping Noisy Arabic NER Data Extracting the syntagmatic features from the training data yields a relatively small number of instances. [sent-88, score-0.367]

43 The new Arabic NER tagged data is derived via projection exploiting parallel Arabic English data. [sent-90, score-0.31]

44 The process depends on the availability of two key components: a large Arabic English parallel corpus that is sentence and word aligned, and a robust high performing English NER system. [sent-91, score-0.14]

45 We project the automatically assigned NER tags from the English side to the Arabic side of the parallel corpus. [sent-98, score-0.39]
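The projection step can be sketched as a token-level tag transfer over the word alignment; the pair-list alignment format and the choice to leave unaligned Arabic tokens as 'O' are our assumptions about the setup.

```python
def project_ner_tags(en_tags, alignment, ar_len):
    """Project token-level NER tags from English to Arabic via word alignment.

    en_tags: tags for the English tokens (e.g. from an English NER system).
    alignment: list of (english_index, arabic_index) token pairs.
    ar_len: number of Arabic tokens in the sentence.
    Unaligned Arabic tokens keep the outside tag 'O'.
    """
    ar_tags = ["O"] * ar_len
    for en_i, ar_i in alignment:
        if en_tags[en_i] != "O":
            ar_tags[ar_i] = en_tags[en_i]
    return ar_tags

# "declared yesterday that Barack Obama governs" → "SrH Ams An bArAk AwbAmA ytrAs"
en = ["O", "O", "O", "B-PER", "I-PER", "O"]
print(project_ner_tags(en, [(i, i) for i in range(6)], 6))
# → ['O', 'O', 'O', 'B-PER', 'I-PER', 'O']
```

With hand-aligned parallel data, as the paper notes, this projection is direct; with automatic alignments one would additionally have to handle alignment errors and one-to-many links.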

46 In our case, we have access to a large manually aligned parallel corpus, therefore the NER projection is direct. [sent-99, score-0.208]

47 However, the English side of the parallel corpus is not NER tagged, hence we use an off-the-shelf competitive robust automatic English NER system which has a published performance of 92% (Zitouni and Florian, 2009). [sent-100, score-0.252]

48 The result of these two processes is a large Arabic NER, albeit noisy, tagged data set. [sent-101, score-0.102]

49 As mentioned earlier, this data is used only to derive additional training instances for the syntagmatic features and for the context and gazetteer features. [sent-102, score-0.416]

50 Given this additional source of data, we changed the lexical features extracted from the BASE to the EXTENDED. [sent-103, score-0.126]

51 We added two other lexical features: CBG and NGC, described as follows: - Class Based Gazetteers (CBG): This feature focuses on the surface form of the NEs. [sent-104, score-0.115]

52 We group the NEs encountered on the Arabic side of the parallel corpus by class as they are found in different dictionaries. [sent-105, score-0.263]

53 The difference between this feature and that in BASE is that the Gazetteers are not restricted to Wikipedia sources. [sent-106, score-0.074]

54 - N-gram context (NGC): Here we disregard the surface form of the NE, instead we focus on its lexical context. [sent-107, score-0.081]

55 Similar to the CBG feature, these −/+n context lists are also separated by NE class. [sent-109, score-0.035]

56 It is worth highlighting that the NGC feature is different from the Context feature in BASE in that the window size is different: +/−3 for EXTENDED versus +/−1 for BASE. [sent-110, score-0.148]
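Building the NGC lists amounts to collecting the tokens within −/+n positions of each NE, keyed by its class; the input representation below (token spans with class labels) is assumed for illustration.

```python
from collections import defaultdict

def ngram_context_lists(sentences, n=3):
    """NGC feature lists: words within -/+n tokens of each NE, by NE class.

    sentences: iterable of (tokens, ne_spans) pairs, where ne_spans is a
    list of (start, end, ne_class) half-open token spans.
    Returns a dict mapping each NE class to its set of context words.
    """
    contexts = defaultdict(set)
    for tokens, spans in sentences:
        for start, end, ne_class in spans:
            left = tokens[max(0, start - n):start]   # up to n words before
            right = tokens[end:end + n]              # up to n words after
            contexts[ne_class].update(left + right)
    return contexts

sents = [(["the", "president", "Barack", "Obama", "spoke", "today"],
          [(2, 4, "PER")])]
ngram_context_lists(sents, n=1)  # contexts["PER"] == {"president", "spoke"}
```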

57 ACE 2005 includes a different genre of Weblogs (WL). [sent-116, score-0.06]

58 Therefore, we did not do the full feature extraction for the other features described in BASE for this data. [sent-118, score-0.159]

59 2 Parallel Data Most of the hand-aligned Arabic-English parallel data used in our experiments is from the Language Data Consortium (LDC). [sent-129, score-0.14]

60 Another set of the parallel data is annotated in-house by professional annotators. [sent-131, score-0.14]

61 The corpus has texts of five different genres, namely: newswire, news groups, broadcast news, broadcast conversation and weblogs corresponding to the data genres in the ACE gold data. [sent-132, score-0.252]

62 The Arabic side of the parallel corpus contains 941,282 tokens. [sent-133, score-0.214]

63 After projecting the NE tags from the English side to the Arabic side of the parallel corpus, we obtain a total of 57,290 Arabic NE instances. [sent-134, score-0.323]

64 Table 1: Number of NEs per class in the Arabic side of the parallel corpus. [sent-136, score-0.301]

65 3 Individual Feature Impact Across the board, all the features yield improved performance. [sent-137, score-0.085]

66 The highest obtained result is observed where the first non-terminal parent is used as a feature, a Syntactic Environment (SE) feature, yielding an improvement of up to 4 points over the baseline. [sent-138, score-0.095]

67 taking the first parent versus adding neighboring non-terminal parents. [sent-141, score-0.098]

68 We note that even though we observe an overall increase in performance, considering both the {first, second} and the {first, second, and third} non-terminal parents decreases performance by 0. [sent-142, score-0.066]

69 5 F-measure points, respectively, compared to considering the first parent information alone. [sent-144, score-0.067]

70 The head word features, SHW, show a higher positive impact than the lexical context feature, NGC. [sent-145, score-0.173]

71 Finally, the impact of the Gazetteer feature, CBG, is comparable to the improvement obtained from the lexical context feature. [sent-146, score-0.139]

72 It shows for each data set and each genre the F-measure obtained using the best feature set and ML approach. [sent-149, score-0.134]

73 It shows results for both the dev and test data using the optimal number of features selected. All the LDC data are publicly available. [sent-150, score-0.085]

74 Table 2: Final results obtained with selected features contrasted against all features combined, and all the features except the syntagmatic ones (All-Synt. [sent-184, score-0.546]

75 ) contrasted against the system including the semantic features. [sent-185, score-0.052]

76 The FreqBaseline results assign a test token the most frequent tag observed for it in the gold training data; if a test token is not observed in the training data, it is assigned the overall most frequent tag, which is the O tag. [sent-188, score-0.102]
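The FreqBaseline just described can be sketched directly; ties in tag frequency are broken arbitrarily here, which the summary does not specify.

```python
from collections import Counter, defaultdict

def freq_baseline(train_pairs, test_tokens, default_tag="O"):
    """FreqBaseline: tag each test token with its most frequent tag in the
    gold training data; unseen tokens get the outside tag 'O'."""
    counts = defaultdict(Counter)
    for token, tag in train_pairs:
        counts[token][tag] += 1
    return [counts[tok].most_common(1)[0][0] if tok in counts else default_tag
            for tok in test_tokens]

train = [("Obama", "B-PER"), ("Obama", "B-PER"), ("Obama", "O"), ("the", "O")]
print(freq_baseline(train, ["Obama", "the", "Paris"]))
# → ['B-PER', 'O', 'O']
```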

77 4 Results Discussion Individual feature impact results show that the syntagmatic features are helpful for most of the data sets. [sent-189, score-0.456]

78 The improvement varies significantly from one data set to another because it highly depends on the number of NEs which the model has not been able to capture using the contextual, lexical, syntactic and morphological features. [sent-191, score-0.144]

79 Impact of the features extracted from the parallel corpus per class: The syntagmatic features have varied in their influence on the different NE classes. [sent-192, score-0.587]

80 Generally, the LOC and PER classes benefitted more from the head word features (SHW) than the other classes. [sent-193, score-0.034]

81 On the other hand for the syntactic environment feature (SE), the PER class seemed not to benefit much from the presence of this feature. [sent-194, score-0.232]

82 Consequently, the features which use a more global context (the syntactic environment, SE, and head word, SHW, features) helped obtain better results than the ones obtained using local context, namely CBG and NGC. [sent-196, score-0.199]

83 5 Related Work Projecting explicit linguistic tags from another language via parallel corpora has been widely used in the NLP tasks and has proved to contribute significantly to achieving better performance. [sent-197, score-0.14]

84 Different research works report positive results when using this technique to enhance WSD (Diab and Resnik, 2002; Ng et al. [sent-198, score-0.029]

85 In the latter two works, they augment training data from parallel data for training supervised systems. [sent-200, score-0.14]

86 In (Diab, 2004), the author uses projections from English into Arabic to bootstrap a sense tagging system for Arabic as well as a seed Arabic WordNet through projection. [sent-201, score-0.097]

87 Finally, in Mention Detection (MD), a task which includes NER and adds the identification and classification of nominal and pronominal mentions, (Zitouni and Florian, 2008) show the impact of using an MT system to enhance the performance of an Arabic MD model. [sent-206, score-0.087]

88 6F when the baseline system uses lexical features only. [sent-208, score-0.041]

89 6 Conclusion and Future Directions In this paper we investigate the possibility of building a high performance Arabic NER system by using lexical, syntactic and morphological features and augmenting the model with deeper lexical features and more syntagmatic features. [sent-210, score-0.666]

90 These extra features are extracted from noisy data obtained via projection from an Arabic-English parallel corpus. [sent-211, score-0.347]

91 The best improvement (1.64 points of F-measure) is obtained for the ACE 2004 BN genre. [sent-214, score-0.028]

92 Mention detection crossing the language barrier. EMNLP’08, Honolulu, Hawaii. [sent-228, score-0.04]

93 On the parameter space of generative lexicalized statistical parsing models. [sent-247, score-0.032]

94 Can one language bootstrap the other: A case study of event extraction. [sent-254, score-0.069]

95 An unsupervised method for word sense tagging using parallel corpora. [sent-263, score-0.168]

96 Exploiting parallel texts for word sense disambiguation: An empirical study. [sent-304, score-0.14]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('arabic', 0.513), ('ner', 0.388), ('syntagmatic', 0.239), ('nes', 0.206), ('ne', 0.189), ('shw', 0.179), ('ace', 0.169), ('parallel', 0.14), ('cbg', 0.119), ('diab', 0.112), ('tagged', 0.102), ('awbama', 0.089), ('barack', 0.089), ('benajiba', 0.089), ('morphological', 0.089), ('features', 0.085), ('zitouni', 0.085), ('base', 0.084), ('barak', 0.078), ('gpe', 0.078), ('feature', 0.074), ('side', 0.074), ('deeper', 0.072), ('mada', 0.072), ('bootstrap', 0.069), ('projection', 0.068), ('obama', 0.067), ('parent', 0.067), ('wl', 0.066), ('gazetteers', 0.064), ('mona', 0.064), ('genre', 0.06), ('ams', 0.06), ('governs', 0.06), ('nadeau', 0.06), ('ngc', 0.06), ('ytras', 0.06), ('weblogs', 0.058), ('impact', 0.058), ('syntactic', 0.055), ('noisy', 0.054), ('environment', 0.054), ('entity', 0.052), ('contrasted', 0.052), ('declared', 0.052), ('gazetteer', 0.052), ('broadcast', 0.051), ('class', 0.049), ('genres', 0.048), ('srh', 0.048), ('bn', 0.047), ('ml', 0.045), ('gold', 0.044), ('yields', 0.043), ('pennsylvania', 0.043), ('thompson', 0.042), ('habash', 0.042), ('imed', 0.042), ('se', 0.042), ('lexical', 0.041), ('parents', 0.04), ('bikel', 0.04), ('newswire', 0.04), ('employs', 0.04), ('detection', 0.04), ('context', 0.04), ('devise', 0.039), ('hence', 0.038), ('per', 0.038), ('florian', 0.037), ('hwa', 0.037), ('org', 0.037), ('recognition', 0.036), ('gloss', 0.036), ('bootstrapped', 0.035), ('aon', 0.035), ('projecting', 0.035), ('named', 0.034), ('head', 0.034), ('ldc', 0.034), ('bootstrapping', 0.034), ('philadelphia', 0.033), ('nw', 0.033), ('lexicalized', 0.032), ('extended', 0.032), ('neighboring', 0.031), ('english', 0.03), ('md', 0.03), ('enhance', 0.029), ('tag', 0.029), ('accordingly', 0.028), ('resnik', 0.028), ('points', 0.028), ('tagging', 0.028), ('morphology', 0.027), ('rs', 0.026), ('fac', 0.026), ('bgn', 0.026), ('amira', 0.026), ('anngd', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

2 0.26528829 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

Author: Marine Carpuat ; Yuval Marton ; Nizar Habash

Abstract: We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.

3 0.21697867 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

4 0.16215381 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer

Author: Seth Kulick

Abstract: We describe an approach to simultaneous tokenization and part-of-speech tagging that is based on separating the closed and open-class items, and focusing on the likelihood of the possible stems of the openclass words. By encoding some basic linguistic information, the machine learning task is simplified, while achieving stateof-the-art tokenization results and competitive POS results, although with a reduced tag set and some evaluation difficulties.

5 0.1309936 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

Author: Jenny Rose Finkel ; Christopher D. Manning

Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recog- nition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.

6 0.11633526 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

7 0.098082215 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

8 0.096633382 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information

9 0.094987243 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

10 0.092163049 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation

11 0.091081969 133 acl-2010-Hierarchical Search for Word Alignment

12 0.081600793 221 acl-2010-Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

13 0.069445312 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms

14 0.069411837 62 acl-2010-Combining Orthogonal Monolingual and Multilingual Sources of Evidence for All Words WSD

15 0.066921622 169 acl-2010-Learning to Translate with Source and Target Syntax

16 0.064833157 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

17 0.06288147 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

18 0.062632799 152 acl-2010-It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text

19 0.057599522 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

20 0.054063559 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.186), (1, -0.031), (2, -0.005), (3, -0.027), (4, 0.043), (5, 0.066), (6, -0.013), (7, 0.033), (8, 0.044), (9, 0.169), (10, 0.04), (11, 0.067), (12, 0.033), (13, -0.113), (14, -0.038), (15, -0.05), (16, -0.0), (17, 0.121), (18, 0.261), (19, -0.146), (20, -0.083), (21, 0.012), (22, 0.016), (23, 0.159), (24, 0.056), (25, 0.035), (26, -0.172), (27, 0.035), (28, 0.023), (29, -0.011), (30, 0.042), (31, 0.065), (32, -0.122), (33, -0.01), (34, -0.194), (35, 0.142), (36, -0.196), (37, 0.084), (38, 0.129), (39, -0.036), (40, -0.033), (41, 0.143), (42, -0.074), (43, 0.054), (44, 0.007), (45, 0.161), (46, -0.14), (47, 0.003), (48, 0.013), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94490385 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

2 0.67093694 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

3 0.63507116 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

Author: Marine Carpuat ; Yuval Marton ; Nizar Habash

Abstract: We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.

4 0.6223501 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer

Author: Seth Kulick

Abstract: We describe an approach to simultaneous tokenization and part-of-speech tagging that is based on separating the closed and open-class items, and focusing on the likelihood of the possible stems of the openclass words. By encoding some basic linguistic information, the machine learning task is simplified, while achieving stateof-the-art tokenization results and competitive POS results, although with a reduced tag set and some evaluation difficulties.

5 0.48083448 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

Author: Jenny Rose Finkel ; Christopher D. Manning

Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recog- nition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.

6 0.41672462 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

7 0.34806064 28 acl-2010-An Entity-Level Approach to Information Extraction

8 0.34381694 139 acl-2010-Identifying Generic Noun Phrases

9 0.34313232 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

10 0.34148821 221 acl-2010-Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

11 0.32764217 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation

12 0.32468432 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

13 0.30984348 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

14 0.30731246 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms

15 0.29977956 152 acl-2010-It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text

16 0.29646498 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

17 0.29503751 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

18 0.29395771 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

19 0.2813428 62 acl-2010-Combining Orthogonal Monolingual and Multilingual Sources of Evidence for All Words WSD

20 0.27540711 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.032), (25, 0.054), (39, 0.03), (42, 0.023), (54, 0.278), (59, 0.088), (71, 0.014), (73, 0.057), (78, 0.014), (83, 0.151), (84, 0.016), (98, 0.126)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8026405 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

2 0.77417576 198 acl-2010-Predicate Argument Structure Analysis Using Transformation Based Learning

Author: Hirotoshi Taira ; Sanae Fujita ; Masaaki Nagata

Abstract: Maintaining high annotation consistency in large corpora is crucial for statistical learning; however, such work is hard, especially for tasks containing semantic elements. This paper describes predicate argument structure analysis using transformation-based learning. An advantage of transformation-based learning is the readability of learned rules. A disadvantage is that the rule extraction procedure is time-consuming. We present incremental-based, transformation-based learning for semantic processing tasks. As an example, we deal with Japanese predicate argument analysis and show some tendencies of annotators for constructing a corpus with our method.

3 0.76880425 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages

Author: Manoj Kumar Chinnakotla ; Karthik Raman ; Pushpak Bhattacharyya

Abstract: In a previous work of ours, Chinnakotla et al. (2010), we introduced a novel framework for Pseudo-Relevance Feedback (PRF) called MultiPRF. Given a query in one language called Source, we used English as the Assisting Language to improve the performance of PRF for the source language. MultiPRF showed remarkable improvement over plain Model Based Feedback (MBF) uniformly for 4 languages, viz., French, German, Hungarian and Finnish with English as the assisting language. This fact inspired us to study the effect of any source-assistant pair on MultiPRF performance from out of a set of languages with widely different characteristics, viz., Dutch, English, Finnish, French, German and Spanish. Carrying this further, we looked into the effect of using two assisting languages together on PRF. The present paper is a report of these investigations, their results and conclusions drawn therefrom. While performance improvement on MultiPRF is observed whatever the assisting language and whatever the source, observations are mixed when two assisting languages are used simultaneously. Interestingly, the performance improvement is more pronounced when the source and assisting languages are closely related, e.g., French and Spanish.

4 0.74501061 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data

Author: Xu Sun ; Jianfeng Gao ; Daniel Micol ; Chris Quirk

Abstract: This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

5 0.6239928 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

Author: Mark Sammons ; V.G.Vinod Vydiswaran ; Dan Roth

Abstract: We challenge the NLP community to participate in a large-scale, distributed effort to design and build resources for developing and evaluating solutions to new and existing NLP tasks in the context of Recognizing Textual Entailment. We argue that the single global label with which RTE examples are annotated is insufficient to effectively evaluate RTE system performance; to promote research on smaller, related NLP tasks, we believe more detailed annotation and evaluation are needed, and that this effort will benefit not just RTE researchers, but the NLP community as a whole. We use insights from successful RTE systems to propose a model for identifying and annotating textual inference phenomena in textual entailment examples, and we present the results of a pilot annotation study that show this model is feasible and the results immediately useful.

6 0.62306857 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

7 0.62117684 73 acl-2010-Coreference Resolution with Reconcile

8 0.61700732 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

9 0.61496139 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

10 0.61432612 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

11 0.61179435 39 acl-2010-Automatic Generation of Story Highlights

12 0.61091095 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

13 0.61040193 195 acl-2010-Phylogenetic Grammar Induction

14 0.60943514 42 acl-2010-Automatically Generating Annotator Rationales to Improve Sentiment Classification

15 0.60915011 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information

16 0.60861325 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection

17 0.60841942 197 acl-2010-Practical Very Large Scale CRFs

18 0.60805374 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years

19 0.60784519 233 acl-2010-The Same-Head Heuristic for Coreference

20 0.6073612 112 acl-2010-Extracting Social Networks from Literary Fiction