acl acl2013 acl2013-256 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. [sent-3, score-0.322]
2 One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. [sent-4, score-0.352]
3 We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. [sent-7, score-0.285]
4 5% on the new news and microblogs test sets respectively. [sent-12, score-0.166]
5 To train an NER system, some of the following feature types are typically used (Benajiba and Rosso, 2008; Nadeau and Sekine, 2009): - Orthographic features: These features include capitalization, punctuation, existence of digits, etc. [sent-15, score-0.112]
6 One of the most effective orthographic features is capitalization in English, which helps NER to generalize to new text of different genres. [sent-16, score-0.453]
7 However, capitalization is not very useful in some languages such as German, and nonexistent in other languages such as Arabic. [sent-17, score-0.363]
8 Further, even in English social media, capitalization may be inconsistent. [sent-18, score-0.309]
9 - Contextual features: Certain words are indicative of the existence of named entities. [sent-19, score-0.192]
10 For example, the word “said” is often preceded by a named entity of type “person” or “organization”. [sent-20, score-0.245]
11 - Character-level features: These features typically include the leading and trailing letters of words. [sent-23, score-0.265]
12 For example, a word ending with “ing” is typically not a named entity, while a word ending in “berg” is often a named entity. [sent-26, score-0.347]
13 - Part-of-speech (POS) tags and morphological features: POS tags indicate (or counter-indicate) the possible presence of a named entity at word level or at word sequence level. [sent-27, score-0.314]
14 Morphological features can mostly indicate the absence of named entities. [sent-28, score-0.212]
15 However, pronouns are rarely ever attached to named entities. [sent-30, score-0.158]
16 - Gazetteers: This feature checks the presence of a word or a sequence of words in large lists of named entities. [sent-31, score-0.226]
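The feature types listed above map directly onto per-token feature functions for a sequence labeler such as CRF++. The following Python sketch (not from the paper) illustrates how orthographic, character-level, gazetteer, and simple contextual features could be computed for each token; the toy gazetteer, the helper name token_features, and the 1-4 letter affix lengths are illustrative assumptions rather than the paper's exact feature set.

```python
# A toy per-token feature extractor in the spirit of the feature types above
# (orthographic, contextual, character-level, gazetteer). The gazetteer
# contents and affix lengths are illustrative assumptions.

GAZETTEER = {"محمد", "مصر", "القاهرة"}  # toy list of known named entities

def token_features(tokens, i):
    """Return a feature dict for the i-th token of a tokenized sentence."""
    w = tokens[i]
    feats = {
        "word": w,
        "has_digit": any(ch.isdigit() for ch in w),   # orthographic
        "in_gazetteer": w in GAZETTEER,                # gazetteer lookup
    }
    for n in range(1, 5):                              # character-level affixes
        feats[f"prefix{n}"] = w[:n]
        feats[f"suffix{n}"] = w[-n:]
    # contextual features: neighboring words such as "said" can signal a name
    feats["prev_word"] = tokens[i - 1] if i > 0 else "<S>"
    feats["next_word"] = tokens[i + 1] if i + 1 < len(tokens) else "</S>"
    return feats

if __name__ == "__main__":
    sentence = ["قال", "محمد", "مرسي"]
    for i in range(len(sentence)):
        print(token_features(sentence, i))
```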
17 However, Arabic lacks indicative orthographic features that generalize to previously unseen named entities. [sent-36, score-0.379]
18 of the Arabic gazetteers that were used for NER were small (Benajiba and Rosso, 2008), there have been efforts to build larger Arabic gazetteers (Attia et al. [sent-39, score-0.15]
19 Since training and test parts of standard datasets for Arabic NER are drawn from the same genre in relatively close temporal proximity, a named entity recognizer that simply memorizes named entities in the training set generally performs well on such test sets. [sent-41, score-0.645]
20 We illustrate the limited capacity of existing recognizers to generalize to previously unseen named entities using two new test sets that include microblogs as well as news texts that cover local and international politics, economics, health, sports, entertainment, and science. [sent-44, score-0.366]
21 As we will show later, recall is well below 50% for all named entity types on the new test sets. [sent-45, score-0.346]
22 To address this problem, we introduce the use of cross-lingual links between a disadvantaged language, Arabic, and a language with good discriminative features and large resources, English, to improve Arabic NER. [sent-46, score-0.116]
23 We also show how to use transliteration mining to improve NER, even when neither language has a capitalization (or similar) feature. [sent-48, score-0.511]
24 The intuition is that if the translation of a word is in fact a transliteration, then the word is likely a named entity. [sent-49, score-0.203]
25 Cross-lingual links are obtained using Wikipedia cross-language links and a large Machine Translation (MT) phrase table that is true-cased, where word casing is preserved during training. [sent-50, score-0.179]
26 We show the effectiveness of these new features on a standard dataset as well as two new test sets. [sent-51, score-0.17]
27 The contributions of this paper are as follows: - Using cross-lingual links to exploit orthographic features in other languages. [sent-52, score-0.164]
28 - Introducing two new NER test sets for Arabic that include recent news as well as microblogs. [sent-55, score-0.166]
29 The remainder of the paper is organized as follows: Section 2 provides related work; Section 3 describes the baseline system; Section 4 introduces the cross-lingual features and reports on their effectiveness; and Section 5 concludes the paper. [sent-62, score-0.119]
30 (2010) used bilingual text to improve monolingual models including NER models for German, which lacks a good capitalization feature. [sent-71, score-0.352]
31 Further, we are not aware of prior work on using TM (or transliteration in general) as a cross-lingual feature in any annotation task. [sent-86, score-0.213]
32 (2007) used a maximum entropy classifier trained on a feature set that includes the use of gazetteers and a stopword list, appearance of an NE in the training set, leading and trailing word bigrams, and the tag of the previous word. [sent-95, score-0.237]
33 They reported 80%, 37%, and 47% F-measure for locations, organizations, and persons respectively on the ANERCORP dataset that they created and publicly released. [sent-96, score-0.164]
34 They reported 87%, 46%, and 52% F-measure for loca- tions, organizations, and persons respectively. [sent-98, score-0.127]
35 Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure. [sent-101, score-0.152]
36 Using all their suggested features, they reported 90%, 66%, and 73% F-measure for locations, organizations, and persons respectively. [sent-102, score-0.127]
37 They did not report per category F-measure, but they reported overall 81%, 75%, and 78% macro-average F-measure for broadcast news and newswire on the ACE 2003, 2004, and 2005 datasets respectively. [sent-105, score-0.207]
38 Abdul-Hamid and Darwish (2010) used a simplified feature set that relied primarily on character level features, namely leading and trailing letters in a word. [sent-112, score-0.236]
39 They reported an F-measure of 76% and 81% for the ACE2005 and the ANERCorp datasets respectively. [sent-114, score-0.123]
40 (2012) performed NER on a different genre from news, namely Arabic Wikipedia articles, and reported recall values as low as 35. [sent-119, score-0.168]
41 They used self-training and recall-oriented classification to improve recall, typically at the expense of precision. [sent-121, score-0.114]
42 successful features that were reported by Benajiba et al. [sent-128, score-0.115]
43 (2008) and Abdul-Hamid and Darwish (2010), namely the leading and trailing 1, 2, 3, and 4 letters in a word; whether a word appears in the gazetteer that was created by Benajiba et al. [sent-129, score-0.209]
44 As mentioned earlier, the leading and trailing letters in a word may indicate or counter-indicate the presence of named entities. [sent-132, score-0.338]
45 It is noteworthy that 69% of the named entities in the test part were seen during training. [sent-140, score-0.252]
46 The first test set is composed of news snippets from the RSS feed of the Arabic (Egypt) version of news. [sent-142, score-0.197]
47 The RSS feed contains the headline and the first 50-100 words in the news ar- ticles. [sent-146, score-0.146]
48 The set has news from over a dozen different news sources and covers international and local news, politics, financial news, health, sports, entertainment, and technology. [sent-147, score-0.23]
49 The second test set contains 1,423 tweets that were randomly selected from tweets authored between November 23, 2011 and November 27, 2011. [sent-149, score-0.322]
50 We scraped tweets from Twitter using the query “lang:ar” (language=Arabic). [sent-150, score-0.161]
51 It is worth noting that only 27% of the named entities in the NEWS test set were observed in the training set (compared to 69% for ANERCORP). [sent-156, score-0.252]
52 As Table 3 shows for the ANERCORP dataset, using only the tokens as features, where the labeler mainly memorizes previously seen named entities, yields higher results than the baseline results for the NEWS dataset (Table 2 (b)). [sent-157, score-0.266]
53 The results on the TWEETS test set are very poor, with only 24% of the named entities in the test set appearing in the training set. [sent-158, score-0.303]
54 4 Cross-lingual Features We experimented with three different cross-lingual features that used Arabic and English Wikipedia cross-language links and a true-cased phrase table that was generated using Moses (Koehn et al. [sent-171, score-0.171]
55 The snapshot has 348,873 titles including redirects, which are alternative names for articles. [sent-175, score-0.114]
56 8 which includes 6,157,591 entries of Wikipedia titles and their “types”, such as “person”, “plant”, or “device”, where a title can have multiple types. [sent-178, score-0.115]
57 The sentences were drawn from the UN parallel data along with a variety of parallel news data from LDC and the GALE project. [sent-182, score-0.115]
58 1 Cross-lingual Capitalization As we mentioned earlier, Arabic lacks capitalization and Arabic names are often common Arabic words. [sent-185, score-0.386]
59 To capture cross-lingual capitalization, we used the aforementioned true-cased phrase table at word and phrase levels as follows: Input: True-cased phrase table PT, sentence S containing n words w0. [sent-187, score-0.202]
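A minimal word-level sketch of this cross-lingual capitalization feature is shown below, assuming the true-cased phrase table is available as a dictionary from an Arabic word to its English translations with probabilities; the toy entries, the probability weighting, and the rounding granularity are illustrative assumptions rather than the paper's exact procedure.

```python
# Word-level sketch of the cross-lingual capitalization feature: inspect the
# English translations of an Arabic word in a true-cased phrase table and use
# a rounded, probability-weighted capitalization ratio as the feature value
# (None / "null" when the word is absent from the table). The toy phrase
# table, weighting, and rounding are assumptions.

# Arabic word -> list of (true-cased English translation, probability)
PHRASE_TABLE = {
    "مصر": [("Egypt", 0.9), ("egypt", 0.1)],
    "قال": [("said", 0.95), ("Said", 0.05)],
}

def capitalization_feature(arabic_word):
    entries = PHRASE_TABLE.get(arabic_word)
    if not entries:
        return None                                   # feature value "null"
    capitalized = sum(p for e, p in entries if e[:1].isupper())
    total = sum(p for _, p in entries)
    return round(capitalized / total, 1)              # rounded ratio

if __name__ == "__main__":
    for w in ["مصر", "قال", "كتب"]:
        print(w, capitalization_feature(w))
```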
60 Table 4 reports on the results of the baseline system with the capitalization feature on the three datasets. [sent-222, score-0.401]
61 In comparing baseline results in Table 2 and cross-lingual capitalization results in Table 4, recall consistently increased for all datasets, particularly for “persons” and “locations”. [sent-223, score-0.42]
62 Precision dropped overall on the ANERCORP dataset and dropped substantially for the NEWS and TWEETS test sets. [sent-234, score-0.158]
63 2 Transliteration Mining An alternative to capitalization can be transliteration mining. [sent-263, score-0.467]
64 The intuition is that named entities are often transliterated, particularly the names of locations and persons. [sent-264, score-0.315]
65 This feature is helpful if cross-lingual resources do not have capitalization information, or if the “helper” language to be consulted does not have a useful capitalization feature. [sent-265, score-0.645]
66 We performed transliteration mining (aka cognate matching) at word level for each Arabic word against all its possible translations in the phrase table. [sent-266, score-0.284]
67 We used a transliteration miner akin to that of El-Kahki et al. [sent-267, score-0.185]
68 (2011) that was trained using 3,452 parallel Arabic-English transliteration pairs. [sent-268, score-0.158]
69 Again we used a weight similar to the one for cross-lingual capitalization and we rounded the values of the ratio to one significant figure. [sent-274, score-0.309]
70 If a word was not found in the phrase table, the feature value was assigned null. [sent-276, score-0.11]
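The sketch below illustrates the transliteration-mining feature under the same toy phrase-table representation as above. Since the trained miner akin to El-Kahki et al. is not reproduced here, a crude character-correspondence check stands in for it; the toy phrase table and character map are assumptions for illustration only.

```python
# Sketch of the transliteration-mining feature: the probability-weighted
# fraction of a word's phrase-table translations that look like
# transliterations of it, rounded, with None ("null") for out-of-table words.
# The character-correspondence check is only a stand-in for a trained miner.

PHRASE_TABLE = {
    "مرسي": [("Morsi", 0.7), ("Mursi", 0.2), ("anchor", 0.1)],
    "كتب": [("wrote", 0.8), ("books", 0.2)],
}

# very rough Arabic -> Latin letter correspondences, for illustration only
CHAR_MAP = {"م": "m", "ر": "r", "س": "s", "ي": "i", "ك": "k", "ت": "t", "ب": "b"}

def looks_like_transliteration(arabic, english):
    """Do the mapped Arabic letters appear, in order, in the English word?"""
    target = english.lower()
    pos = 0
    for latin in (CHAR_MAP[c] for c in arabic if c in CHAR_MAP):
        pos = target.find(latin, pos)
        if pos < 0:
            return False
        pos += 1
    return True

def transliteration_feature(arabic_word):
    entries = PHRASE_TABLE.get(arabic_word)
    if not entries:
        return None                                   # null if out of table
    hits = sum(p for e, p in entries if looks_like_transliteration(arabic_word, e))
    return round(hits / sum(p for _, p in entries), 1)

if __name__ == "__main__":
    for w in ["مرسي", "كتب", "مصر"]:
        print(w, transliteration_feature(w))
```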
71 Table 5 reports on the results using the baseline system with the transliteration mining feature. [sent-293, score-0.267]
72 Like the capitalization feature, transliteration mining slightly lowered precision (except for the TWEETS test set, where the drop in precision was significant) and increased recall, leading to an overall improvement in F-measure for all test sets. [sent-294, score-0.726]
73 The similarity of results between using transliteration mining and word-level cross-lingual capitalization suggests that perhaps they can serve as surrogates for each other. [sent-299, score-0.511]
74 Since Wikipedia titles may have multiple DBpedia types, we opted to keep the most popular type (by count of how many Wikipedia titles are assigned a particular type) for each title, and we disregarded the rest. [sent-309, score-0.188]
75 For translation, we generated two features using two translation resources, namely the aforementioned phrase table and Arabic-English Wikipedia cross-lingual links. [sent-312, score-0.22]
76 For both features (using the two translation methods), for an Arabic word sequence corresponding to the DBpedia entry, the first word in the sequence was assigned the feature “B-” plus the DBpedia type and subsequent words were assigned the feature “I-” plus the DBpedia type. [sent-318, score-0.264]
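A small sketch of this DBpedia type feature follows, assuming toy translation and DBpedia type lookups; the “I-” continuation prefix follows the standard BIO convention, and the toy translation table, type map, and longest-match strategy are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the DBpedia type feature: an Arabic word sequence is translated
# (via a phrase table or Wikipedia cross-language links), the translation is
# looked up as a DBpedia title, and the span is labeled "B-<type>" on its
# first word and "I-<type>" on the rest. Toy resources, illustrative only.

TRANSLATIONS = {("محمد", "مرسي"): "Mohamed Morsi", ("القاهرة",): "Cairo"}
DBPEDIA_TYPES = {"Mohamed Morsi": "Person", "Cairo": "Place"}

def dbpedia_features(tokens, max_len=3):
    """Assign B-/I-<DBpedia type> feature values to token spans."""
    feats = ["O"] * len(tokens)
    for start in range(len(tokens)):
        for length in range(min(max_len, len(tokens) - start), 0, -1):
            span = tuple(tokens[start:start + length])
            english = TRANSLATIONS.get(span)
            dbtype = DBPEDIA_TYPES.get(english) if english else None
            if dbtype and all(f == "O" for f in feats[start:start + length]):
                feats[start] = "B-" + dbtype
                for k in range(start + 1, start + length):
                    feats[k] = "I-" + dbtype
                break
    return feats

if __name__ == "__main__":
    print(dbpedia_features(["قال", "محمد", "مرسي"]))  # ['O', 'B-Person', 'I-Person']
```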
77 Using the phrase table for translation likely yielded improved coverage over using Wikipedia cross-lingual links. [sent-329, score-0.13]
78 Using DBpedia consistently improved precision and recall for named entity types on all test sets, except for a small drop in precision for locations on the ANERCORP dataset and for locations and persons on the TWEETS test set. [sent-332, score-0.708]
79 For the different test sets, improvements in recall ranged between 4. [sent-333, score-0.129]
80 4 Putting it All Together Table 7 reports on the results of using all aforementioned cross-lingual features together. [sent-349, score-0.123]
81 As the results show, the impact of cross-lingual features on recall was much more pronounced on the NEWS and TWEETS test sets compared to the ANERCORP dataset. [sent-351, score-0.155]
82 Further, the recall values for the ANERCORP dataset in the baseline experiments were much higher than those for the two other test sets. [sent-352, score-0.171]
83 This confirms our suspicion that the reported values in the literature on the standard datasets are unrealistically high due to the similarity between the training and test sets. [sent-353, score-0.143]
84 Hence, these high effectiveness results may not generalize to other test sets. [sent-354, score-0.121]
85 [Table caption: absolute/relative differences compared to baseline] Of the cross-lingual features that we experimented with, the use of DBpedia led to improvements in both precision and recall (except for precision on the TWEETS test set). [sent-409, score-0.336]
86 Other cross-lingual features yielded overall improvements in F-measure, mostly due to gains in recall, typically at the expense of precision. [sent-410, score-0.118]
87 Figure 1: ANERCORP Dataset Results; Figure 2: NEWS Test Set Results. When using all the features together, one notable result is that precision dropped significantly for the TWEETS test set. [sent-417, score-0.177]
88 We examined the output for the TWEETS test set and here are some of the factors that affected precision: - the presence of words that would typically be named entities in news but would generally be regular words in tweets. [sent-418, score-0.398]
89 - the use of dialectal words that may have transliterations or a named entity as the most likely translation into English. [sent-420, score-0.367]
90 However, since the MT system that we used was trained on Modern Standard Arabic, the dialectal word would not appear in training and would typically be translated/transliterated to the name “Che” (as in Che Guevara). [sent-431, score-0.141]
91 - Since tweets are restricted in length, authors frequently use shortened versions of named entities. [sent-432, score-0.319]
92 For example, tweets would mostly have “Morsi” instead of “Mohamed Morsi” and without trigger words such as “Dr. [sent-433, score-0.161]
93 This same problem was present in the NEWS test set, because it was constructed from an RSS feed, and headlines, which are typically compact, had a higher representation in the test collection. [sent-463, score-0.133]
94 We believe that this problem can be overcome by introducing new training data that include tweets (or other social text) and performing domain adaptation. [sent-468, score-0.161]
95 5 Conclusion In this paper, we presented different cross-lingual features that can make use of linguistic properties and knowledge bases of other languages for NER. [sent-471, score-0.122]
96 We used English as the “helper” language and we exploited the English capitalization feature and an English knowledge base, DBpedia. [sent-473, score-0.336]
97 If the helper language did not have capitalization, then transliteration mining could provide some of the benefit of capitalization. [sent-474, score-0.249]
98 We believe that the proposed cross-lingual features can be used to help NER for other languages, particularly languages that lack good features that generalize well. [sent-476, score-0.205]
99 We tested on a new news test set, NEWS, which has recent news articles (the same genre as the standard dataset), and indeed NER effectiveness was much lower. [sent-480, score-0.337]
100 For the new NEWS test set, cross-lingual features led to a small increase in precision (1. [sent-481, score-0.183]
wordName wordTfidf (topN-words)
[('arabic', 0.408), ('benajiba', 0.374), ('capitalization', 0.309), ('anercorp', 0.288), ('ner', 0.234), ('dbpedia', 0.17), ('darwish', 0.17), ('tweets', 0.161), ('transliteration', 0.158), ('named', 0.158), ('rosso', 0.14), ('tk', 0.119), ('wikipedia', 0.119), ('news', 0.115), ('trailing', 0.096), ('entity', 0.087), ('nadeau', 0.085), ('titles', 0.08), ('dialectic', 0.077), ('gazetteers', 0.075), ('persons', 0.066), ('links', 0.062), ('reported', 0.061), ('sekine', 0.061), ('iscaps', 0.058), ('phrase', 0.055), ('features', 0.054), ('pol', 0.054), ('locations', 0.052), ('crf', 0.051), ('test', 0.051), ('recall', 0.05), ('orthographic', 0.048), ('ace', 0.047), ('rss', 0.047), ('helper', 0.047), ('letters', 0.045), ('translation', 0.045), ('mining', 0.044), ('lacks', 0.043), ('entities', 0.043), ('generalize', 0.042), ('capitalized', 0.042), ('led', 0.041), ('bases', 0.041), ('sequence', 0.041), ('mayfield', 0.04), ('leading', 0.039), ('anersys', 0.038), ('farber', 0.038), ('istranpsliteration', 0.038), ('larkey', 0.038), ('memorizes', 0.038), ('morsi', 0.038), ('richman', 0.038), ('shaalan', 0.038), ('tabllpoeevr', 0.038), ('recognition', 0.038), ('stemmed', 0.038), ('mcnamee', 0.038), ('dataset', 0.037), ('aforementioned', 0.037), ('precision', 0.037), ('dropped', 0.035), ('title', 0.035), ('indicative', 0.034), ('tm', 0.034), ('names', 0.034), ('name', 0.033), ('expense', 0.033), ('baseline', 0.033), ('reports', 0.032), ('feed', 0.031), ('datasets', 0.031), ('attia', 0.031), ('typically', 0.031), ('absolute', 0.03), ('improved', 0.03), ('english', 0.03), ('relative', 0.03), ('november', 0.029), ('hermjakob', 0.029), ('pos', 0.029), ('namely', 0.029), ('cross', 0.028), ('morphological', 0.028), ('effectiveness', 0.028), ('assigned', 0.028), ('particularly', 0.028), ('mada', 0.028), ('entertainment', 0.028), ('burkett', 0.028), ('ranged', 0.028), ('genre', 0.028), ('feature', 0.027), ('akin', 0.027), ('transliterated', 0.027), ('udupa', 0.027), ('languages', 0.027), ('translations', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
2 0.16446467 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf
Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
3 0.15745081 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
Author: Amjad Abu-Jbara ; Ben King ; Mona Diab ; Dragomir Radev
Abstract: In this paper, we use Arabic natural language processing techniques to analyze Arabic debates. The goal is to identify how the participants in a discussion split into subgroups with contrasting opinions. The members of each subgroup share the same opinion with respect to the discussion topic and an opposing opinion to the members of other subgroups. We use opinion mining techniques to identify opinion expressions and determine their polarities and their targets. We use opinion predictions to represent the discussion in one of two formal representations: signed attitude network or a space of attitude vectors. We identify opinion subgroups by partitioning the signed network representation or by clustering the vector space representation. We evaluate the system using a data set of labeled discussions and show that it achieves good results.
4 0.15384191 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
Author: Mengqiu Wang ; Wanxiang Che ; Christopher D. Manning
Abstract: Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.
5 0.1501523 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
Author: Mohamed Aly ; Amir Atiya
Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the dataset, and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.
6 0.14318021 240 acl-2013-Microblogs as Parallel Corpora
7 0.1420448 359 acl-2013-Translating Dialectal Arabic to English
8 0.12965997 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
9 0.12559099 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
10 0.11788431 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
11 0.11730623 317 acl-2013-Sentence Level Dialect Identification in Arabic
12 0.11224461 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
13 0.10581379 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
14 0.10337421 139 acl-2013-Entity Linking for Tweets
15 0.099974543 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
16 0.098802477 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
17 0.095118366 255 acl-2013-Name-aware Machine Translation
18 0.092347093 97 acl-2013-Cross-lingual Projections between Languages from Different Families
19 0.090794034 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
20 0.088415608 154 acl-2013-Extracting bilingual terminologies from comparable corpora
topicId topicWeight
[(0, 0.205), (1, 0.03), (2, 0.063), (3, 0.069), (4, 0.131), (5, 0.072), (6, -0.058), (7, 0.057), (8, 0.144), (9, -0.085), (10, -0.053), (11, -0.086), (12, 0.004), (13, 0.016), (14, -0.054), (15, 0.038), (16, -0.024), (17, -0.047), (18, -0.085), (19, 0.061), (20, -0.058), (21, 0.051), (22, 0.063), (23, 0.172), (24, 0.035), (25, -0.001), (26, 0.036), (27, -0.106), (28, 0.075), (29, -0.124), (30, 0.022), (31, -0.062), (32, 0.072), (33, -0.037), (34, -0.044), (35, -0.022), (36, 0.022), (37, 0.1), (38, 0.072), (39, 0.189), (40, -0.007), (41, -0.08), (42, -0.014), (43, -0.009), (44, -0.016), (45, -0.026), (46, -0.013), (47, -0.041), (48, 0.034), (49, -0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.92639208 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
2 0.73706728 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
3 0.72781789 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
4 0.55339444 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
Author: Kyumars Sheykh Esmaili ; Shahin Salavati
Abstract: Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.
5 0.54561162 240 acl-2013-Microblogs as Parallel Corpora
Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso
Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.
6 0.53360605 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
7 0.51792485 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
8 0.50214797 139 acl-2013-Entity Linking for Tweets
10 0.48579347 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
11 0.48239177 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
12 0.48008892 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities
13 0.45974278 138 acl-2013-Enriching Entity Translation Discovery using Selective Temporality
14 0.44931057 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision
15 0.4471032 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
16 0.43145779 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
17 0.42886308 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora
18 0.4252924 255 acl-2013-Name-aware Machine Translation
19 0.42406231 301 acl-2013-Resolving Entity Morphs in Censored Data
20 0.40493539 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
topicId topicWeight
[(0, 0.028), (6, 0.023), (11, 0.039), (24, 0.035), (26, 0.061), (35, 0.062), (42, 0.052), (48, 0.021), (70, 0.034), (88, 0.022), (90, 0.027), (95, 0.505)]
simIndex simValue paperId paperTitle
1 0.99529815 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu
Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue that by preserving the meaning of the translations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent of the translation model paradigm, so any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.
2 0.98974645 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
same-paper 3 0.98942453 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
4 0.98137349 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews
Author: Kang Liu ; Liheng Xu ; Jun Zhao
Abstract: Mining opinion targets is a fundamental and important task for opinion mining from online reviews. To this end, there are usually two kinds of methods: syntax based and alignment based methods. Syntax based methods usually exploited syntactic patterns to extract opinion targets, which were however prone to suffer from parsing errors when dealing with online informal texts. In contrast, alignment based methods used word alignment model to fulfill this task, which could avoid parsing errors without using parsing. However, there is no research focusing on which kind of method is better when given a certain amount of reviews. To fill this gap, this paper empirically studies how the performance of these two kinds of methods vary when changing the size, domain and language of the corpus. We further combine syntactic patterns with alignment model by using a partially supervised framework and investigate whether this combination is useful or not. In our experiments, we verify that our combination is effective on the corpus with small and medium size.
Author: Tsutomu Hirao ; Tomoharu Iwata ; Masaaki Nagata
Abstract: Unsupervised object matching (UOM) is a promising approach to cross-language natural language processing such as bilingual lexicon acquisition, parallel corpus construction, and cross-language text categorization, because it does not require labor-intensive linguistic resources. However, UOM only finds one-to-one correspondences from data sets with the same number of instances in source and target domains, and this prevents us from applying UOM to real-world cross-language natural language processing tasks. To alleviate these limitations, we propose latent semantic matching, which embeds objects in both source and target language domains into a shared latent topic space. We demonstrate the effectiveness of our method on cross-language text categorization. The results show that our method outperforms conventional unsupervised object matching methods.
6 0.96943676 37 acl-2013-Adaptive Parser-Centric Text Normalization
7 0.95320225 66 acl-2013-Beam Search for Solving Substitution Ciphers
8 0.95227832 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
9 0.92747486 289 acl-2013-QuEst - A translation quality estimation framework
10 0.87251794 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
11 0.8638888 135 acl-2013-English-to-Russian MT evaluation campaign
12 0.85823733 255 acl-2013-Name-aware Machine Translation
13 0.85172808 317 acl-2013-Sentence Level Dialect Identification in Arabic
14 0.83647174 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
15 0.8318491 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
16 0.8293184 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
17 0.82233095 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
18 0.82209522 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.81874681 240 acl-2013-Microblogs as Parallel Corpora
20 0.80411923 97 acl-2013-Cross-lingual Projections between Languages from Different Families