acl acl2013 acl2013-359 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. [sent-3, score-0.786]
2 The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. [sent-6, score-0.045]
3 Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points. [sent-10, score-0.062]
4 The dialects may differ in vocabulary, morphology, syntax, and spelling from MSA and most lack spelling conventions. [sent-16, score-0.529]
5 Different dialects often make different lexical choices to express concepts. [sent-17, score-0.225]
6 For example, Egyptian Arabic uses a negation construct similar to the French “ne pas” [sent-39, score-0.065]
7 Although MSA is used in formal writing, dialects are increasingly being used on social media sites. [sent-62, score-0.225]
8 – The use of phonetic transcription to match dialectal pronunciation. [sent-76, score-0.366]
9 – Creative spellings, spelling mistakes, and word elongations are ubiquitous in social texts. [sent-84, score-0.184]
10 The Egyptian dialect has the largest number of speakers and is the most commonly understood dialect in the Arab world. [sent-88, score-0.236]
11 In this work, we focused on translating dialectal Egyptian to English. [sent-89, score-0.429]
12 Unlike previous work, we first narrowed the gap between Egyptian and MSA using character-level transformations and word n-gram models that handle spelling mistakes, phonological variations, and morphological transformations. [sent-92, score-0.363]
13 Later, we applied an adaptation method to incorporate MSA/English parallel data. [sent-93, score-0.062]
14 The contributions of this paper are as follows: We trained an Egyptian/MSA transformation model to make Egyptian look similar to MSA. [sent-94, score-0.045]
15 We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1. [sent-96, score-0.125]
16 We used phrase-table merging (Nakov and Ng, 2009) to utilize MSA/English parallel data with the available in-domain parallel data. [sent-98, score-0.193]
17 This can be done by either translating between the related languages using word-level translation, character level transformations, and language specific rules (Durrani et al. [sent-100, score-0.116]
18 , 2000; Nakov and Tiedemann, 2012), or by concatenating the parallel data for both languages (Nakov and Ng, 2009). [sent-102, score-0.062]
19 These translation methods generally require parallel data, for which hardly any exists between dialects and MSA. [sent-103, score-0.341]
20 Instead of translating between a dialect and MSA, we tried to narrow down the lexical, morphological and phonetic gap between them using a character-level conversion model, which we trained on a small set of parallel dialect/MSA word pairs. [sent-104, score-0.503]
21 In the context of Arabic dialects (footnote 3), most previous work focused on converting dialects to MSA and vice versa to improve the processing of dialects (Sawaf, 2010; Chiang et al. [sent-105, score-0.45]
22 Sawaf (2010) proposed a dialect to MSA normalization that used character-level rules and morphological analysis. [sent-108, score-0.254]
23 Salloum and Habash (2011) also used a rule-based method to generate MSA paraphrases of dialectal out-of-vocabulary (OOV) and low frequency words. [sent-109, score-0.366]
24 Instead of rules, we automatically [footnote 3: Due to space limitations, we restrict discussion to work on dialects only.] [sent-110, score-0.225]
25 Their best Egyptian/English system was trained on dialect/English parallel data. [sent-114, score-0.062]
26 They used two language models built from the English GigaWord corpus and from a large web crawl. [sent-115, score-0.037]
27 Their best system outperformed manually translating Egyptian to MSA then translating using an MSA/English system. [sent-116, score-0.152]
28 In contrast, we showed that training on in-domain dialectal data irrespective of its small size is better than training on large MSA/English data. [sent-117, score-0.416]
29 We also showed that conversion does not imply a straightforward usage of MSA resources and that there is a need for adaptation, which we fulfilled using phrase-table merging (Nakov and Ng, 2009). [sent-119, score-0.193]
30 1 Baseline We constructed baselines that were based on the following training data: - An Egyptian/English parallel corpus consisting of ≈38k sentences, which is part of the LDC2012T09 corpus (Zbib et al. [sent-121, score-0.087]
31 - An MSA/English parallel corpus consisting of 200k sentences from LDC (footnote 4). [sent-127, score-0.062]
32 For language modeling, we used either EGen or the English side of the AR corpus plus the English side of NIST12 training data and English GigaWord v5. [sent-129, score-0.025]
33 We tokenized Egyptian and Arabic according to the ATB tokenization scheme using the MADA+TOKAN morphological analyzer and tokenizer v3. [sent-131, score-0.224]
34 We word-aligned the parallel data using GIZA++ (Och and Ney, 2003), and symmetrized the alignments using the grow-diag-final-and heuristic (Koehn et al. [sent-135, score-0.062]
35 We built five-gram LMs using KenLM with modified Kneser-Ney smoothing (Heafield, 2011). [footnote 4: Arabic News (LDC2004T17), eTIRR (LDC2004E72), and parallel corpora from the GALE program] [sent-139, score-0.099]
36 Table 1: Baseline results using the EG and AR training sets with GW and EGen corpora for LM training. Systems (Train / LM): B1 (AR / GW), B2 (EG / GW), B3 (EG / EGen), B4 (EG / EGenGW). [sent-147, score-0.05]
37 We built several baseline systems as follows: B1 used AR for training a translation model and GW for LM. [sent-149, score-0.116]
38 B2-B4 systems used identical training data, namely EG, with the GW, EGen, or both for B2, B3, and B4 respectively for language modeling. [sent-150, score-0.025]
39 Using EG data for training both the translation and language models was effective. [sent-155, score-0.079]
40 1 Egyptian to EG0 Conversion As mentioned previously, dialects differ from MSA in vocabulary, morphology, and phonology. [sent-159, score-0.253]
41 Dialectal spelling often follows dialectal pronunciation, and dialects lack standard spelling conventions. [sent-160, score-0.867]
42 To address the spelling and morphological differences, we trained a character-level mapping model to generate MSA words from dialectal ones using character transformations. [sent-162, score-0.693]
43 To train the model, we extracted the most frequent words from a dialectal Egyptian corpus, which had 12,527 news comments (containing 327k words) from the AlYoum Al-Sabe news site (Zaidan and Callison-Burch, 2011), and translated them to their equivalent MSA words. [sent-163, score-0.366]
44 We hired a professional translator, who generated one or more translations of the most frequent 5,581 words into MSA. [sent-164, score-0.025]
45 Out of these word pairs, 4,162 involved character-level transformations due to phonological, morphological, or spelling changes. [sent-165, score-0.22]
46 We aligned the translated pairs at character level using GIZA++ and Moses in the manner described in Section 2. [sent-166, score-0.053]
47 We restricted individual source character sequences to be 3 characters at most. [sent-170, score-0.053]
48 We built the lexicon from a set of 234,638 Aljazeera articles that span a 10-year period and contain 254M tokens. [sent-172, score-0.037]
49 Then we used a trigram LM that we built from the aforementioned Aljazeera articles to pick the most likely candidate in context. [sent-175, score-0.037]
50 We simply multiplied the character-level transformation probability with the LM probability giving them equal weight. [sent-176, score-0.045]
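The equal-weight combination described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pick_candidate`, the toy language model, and the toy candidate strings are all hypothetical. Adding log-probabilities is equivalent to multiplying the two probabilities with equal weight.

```python
import math

def pick_candidate(candidates, lm_logprob, context):
    """Pick the candidate maximizing P(transformation) * P_LM(word | context).

    candidates: list of (word, transformation_probability) pairs.
    lm_logprob: callable (word, context) -> natural-log LM probability.
                Any trigram LM could be wrapped this way (assumption).
    context:    tuple of preceding words passed through to the LM.
    """
    best_word, best_score = None, -math.inf
    for word, p_transform in candidates:
        if p_transform <= 0.0:
            continue  # a zero-probability transformation can never win
        # Sum of logs == product of probabilities, both weighted equally.
        score = math.log(p_transform) + lm_logprob(word, context)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy stand-in for a trigram LM; the probabilities are made up.
def toy_lm_logprob(word, context):
    table = {"h*A": math.log(0.2), "hdA": math.log(0.01)}
    return table.get(word, math.log(1e-6))

toy_candidates = [("hdA", 0.6), ("h*A", 0.4)]
chosen = pick_candidate(toy_candidates, toy_lm_logprob, ("w", "qAl"))
```

In this toy case the LM overrides the higher transformation probability, because 0.4 × 0.2 exceeds 0.6 × 0.01.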
51 Since Egyptian has a “ne pas”-like negation construct that involves attaching one letter to the beginning of a verb and another to its end, [sent-177, score-0.065]
52 we handled words that had negation by removing these two letters, then applying our character transformation, and lastly adding the negation article “lA” before the verb. [sent-179, score-0.183]
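The negation handling described above can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the helper name `convert_negated` is hypothetical, and the affixes are assumed to be “m” and “$” in Buckwalter transliteration, which the extracted text does not preserve. The `transform` argument stands in for the character-level conversion model.

```python
def convert_negated(word, transform, prefix="m", suffix="$", particle="lA"):
    """Convert a possibly negated dialectal word.

    prefix/suffix: ASSUMED negation affixes (Buckwalter "m" ... "$").
    transform:     callable standing in for the character-level model.
    particle:      negation particle placed before the converted verb.
    """
    has_affixes = (
        len(word) > len(prefix) + len(suffix)
        and word.startswith(prefix)
        and word.endswith(suffix)
    )
    if has_affixes:
        # Strip both negation letters, convert the stem, re-add negation.
        stem = word[len(prefix):-len(suffix)]
        return f"{particle} {transform(stem)}"
    return transform(word)

# Identity function used as a placeholder for the real conversion model.
identity = lambda w: w
negated = convert_negated("mEndy$", identity)
plain = convert_negated("ktAb", identity)
```

A purely surface-based check like this would also fire on non-negated words that happen to start and end with the same letters, so a real system would need a further filter.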
53 2 Combining AR and EG0 The aforementioned conversion generated a language that is close, but not identical, to MSA. [sent-242, score-0.124]
54 In order to maximize the gain using both parallel corpora, we used the phrase merging technique described in Nakov and Ng (2009) to merge the phrase tables generated from the AR and EG0 corpora. [sent-243, score-0.273]
55 If a phrase occurred in both phrase tables, we adopted one of the following three solutions: [sent-244, score-0.112]
56 - Only added the phrase with its translations and their probabilities from the AR phrase table. [sent-246, score-0.137]
57 - Only added the phrase with its translations and their probabilities from the EG0 phrase table. [sent-248, score-0.137]
58 - Added translations of the phrase from both phrase tables and left the choice to the decoder. [sent-250, score-0.167]
59 We added three additional features to the new phrase table to avail the information about the origin of phrases (as in Nakov and Ng (2009)). [sent-251, score-0.056]
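The three merging options and the origin features can be sketched as follows. This is a simplified illustration, not the Nakov and Ng (2009) implementation: the dictionary layout, the function name `merge_phrase_tables`, the strategy names, and the 0/1 encoding of the three origin features are assumptions made for the example.

```python
# ASSUMED encoding of the three origin features: [from AR, from EG0, from both].
ORIGIN_FEATURES = {
    "ar": [1.0, 0.0, 0.0],
    "eg0": [0.0, 1.0, 0.0],
    "both": [0.0, 0.0, 1.0],
}

def merge_phrase_tables(ar, eg0, strategy="prefer_eg0"):
    """Merge two phrase tables of the form {source: {target: [scores]}}.

    strategy applies only to source phrases present in both tables:
    "prefer_ar", "prefer_eg0", or "both" (keep all, let the decoder choose).
    """
    if strategy not in ("prefer_ar", "prefer_eg0", "both"):
        raise ValueError(f"unknown strategy: {strategy}")
    merged = {}
    for src in ar.keys() | eg0.keys():
        in_ar, in_eg0 = src in ar, src in eg0
        if in_ar and in_eg0:
            if strategy == "prefer_ar":
                sources = [("ar", ar[src])]
            elif strategy == "prefer_eg0":
                sources = [("eg0", eg0[src])]
            else:
                sources = [("ar", ar[src]), ("eg0", eg0[src])]
        elif in_ar:
            sources = [("ar", ar[src])]
        else:
            sources = [("eg0", eg0[src])]
        entries = {}
        for origin, translations in sources:
            for tgt, scores in translations.items():
                if tgt in entries:
                    # Same pair in both tables: keep the first scores
                    # (a simplification) and mark the origin as "both".
                    entries[tgt] = (entries[tgt][0], "both")
                else:
                    entries[tgt] = (scores, origin)
        merged[src] = {
            tgt: list(scores) + ORIGIN_FEATURES[origin]
            for tgt, (scores, origin) in entries.items()
        }
    return merged

toy_ar = {"a": {"x": [0.5]}, "b": {"y": [0.2]}}
toy_eg0 = {"a": {"x": [0.7], "z": [0.3]}, "c": {"w": [0.9]}}
merged_pref = merge_phrase_tables(toy_ar, toy_eg0, "prefer_eg0")
merged_both = merge_phrase_tables(toy_ar, toy_eg0, "both")
```

Phrases that occur in only one table are always copied over; the strategy matters only for the overlap.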
60 3 Evaluation and Discussion We performed the following experiments: - S0 involved translating the EG0 test using AR. [sent-253, score-0.1]
61 - S1 and S2 trained on the EG0 with EGen and both EGen and GW for LM training respectively. [sent-254, score-0.025]
62 We built separate phrase tables from the two corpora and merged them. [sent-257, score-0.123]
63 For SALL, we kept phrases from both phrase tables. [sent-259, score-0.056]
64 Table 2 summarizes results of using EG0 and phrase table merging. [sent-260, score-0.056]
65 S0 was slightly better than B1, but lagged considerably behind training using EG or EG0. [sent-261, score-0.025]
66 S1, which used only EG0 for training showed an improvement of 1. [sent-262, score-0.025]
67 Phrase merging that preferred phrases learnt from EG0 data over AR data performed the best with a BLEU score of 16. [sent-265, score-0.069]
68 Table 2: Summary of results using different combinations of EG0/English and MSA/English training data. We analyzed 100 test sentences that led to the greatest absolute change in BLEU score, whether positive or negative, between training with EG and EG0. [sent-281, score-0.025]
69 Training with EG0 outperformed EG for 63 of the sentences. [sent-317, score-0.026]
70 Conversion improved MT, because it reduced OOVs, enabled MADA+TOKAN to successfully analyze words, and reduced spelling mistakes. [sent-318, score-0.138]
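The OOV reduction mentioned above can be measured with a few lines of code. This is a minimal sketch: the function name `oov_rate` is hypothetical, and counting at the token level (rather than over distinct word types) is an assumption, since the extracted text does not say which was used.

```python
def oov_rate(test_tokens, training_vocab):
    """Percentage of test tokens that never occur in the training vocabulary.

    Token-level counting is an ASSUMPTION; a type-level variant would
    iterate over set(test_tokens) instead.
    """
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in training_vocab)
    return 100.0 * unseen / len(test_tokens)

# Toy data: two of the four test tokens are missing from the vocabulary.
toy_vocab = {"a", "b"}
toy_rate = oov_rate(["a", "c", "b", "c"], toy_vocab)
```

Running the same function on the test set before and after conversion, against the same training vocabulary, gives the kind of before/after comparison reported in the abstract.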
71 For each observed conversion error, we identified its linguistic character, i. [sent-321, score-0.124]
72 We found that in more than half of the cases (≈57%) using morphological information could have improved the conversion. [sent-324, score-0.171]
73 Consider the following example, where (1) is the original EG sentence and its EG/EN translation, and (2) is the converted EG0 sentence and its EG0/EN translation: 1. [sent-325, score-0.085]
74 Hsb rgbth: “because this is according to his desire”. In this case, “rgbtk” [sent-347, score-0.074]
75 This could be avoided, for instance, by running a morphological analyzer on the original and converted word, and making sure their morphological features (in this case, the person of the possessive) correspond. [sent-354, score-0.445]
76 In a similar case, the phrase “mEndy$ AEdA” [sent-355, score-0.056]
77 Here, again, a morphological analyzer could verify the retaining of negation after conversion. [sent-365, score-0.289]
78 Aside from morphological mistakes, conversion often changed words completely. [sent-376, score-0.26]
79 was wrongly converted to “lOnh” (“because it”), resulting in a wrong translation. [sent-381, score-0.085]
80 Perhaps a morphological analyzer, or just a part-of-speech tagger, could enforce (or probabilistically encourage) a match in parts of speech. [sent-382, score-0.174]
81 There is a syntactic challenge in this sentence, since the Egyptian word order in interrogative sentences is normally different from the MSA word order: the interrogative particle appears at the end of the sentence instead of at the beginning. [sent-417, score-0.06]
82 The above analysis suggests that incorporating deeper linguistic information in the conversion procedure could improve translation quality. [sent-419, score-0.178]
83 In particular, using a morphological analyzer seems like a promising possibility. [sent-420, score-0.224]
84 One approach could be to run a morphological analyzer for dialectal Arabic (e. [sent-421, score-0.59]
85 , 2013)) on the original EG sentence and another analyzer for MSA (such as MADA) on the converted EG0 sentence, and then to compare the morphological features. [sent-424, score-0.309]
86 In contrast to previous work, we used an automatic conversion method to map Egyptian close to MSA. [sent-428, score-0.124]
87 The converted Egyptian EG0 had fewer OOV words and spelling mistakes and improved language handling. [sent-429, score-0.276]
88 The MT system built on the adapted parallel data showed an improvement of 1. [sent-430, score-0.099]
89 Using phrase table merging that combined AR and EG0 training data in a way that preferred adapted dialectal data yielded an extra 0. [sent-432, score-0.516]
90 We will make the training data for our conversion system publicly available. [sent-434, score-0.149]
91 For future work, we want to expand our work to other dialects, while utilizing dialectal morphological analysis to improve conversion. [sent-435, score-0.502]
92 Also, we believe that improving English language modeling to match the genre of the translated sentences can have significant positive impact on translation quality. [sent-436, score-0.054]
93 Improved statistical machine translation for resource-poor languages using related resource-rich languages. [sent-477, score-0.054]
94 Combining word-level and character-level models for machine translation between closely-related languages. [sent-481, score-0.054]
95 Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. [sent-490, score-0.136]
96 A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic. [sent-502, score-0.153]
97 The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. [sent-507, score-0.366]
wordName wordTfidf (topN-words)
[('egyptian', 0.403), ('dialectal', 0.366), ('msa', 0.33), ('eg', 0.257), ('dialects', 0.225), ('egen', 0.184), ('arabic', 0.178), ('egengw', 0.138), ('spelling', 0.138), ('morphological', 0.136), ('ar', 0.129), ('nakov', 0.124), ('conversion', 0.124), ('dialect', 0.118), ('bleu', 0.111), ('aj', 0.107), ('gw', 0.093), ('analyzer', 0.088), ('converted', 0.085), ('zbib', 0.081), ('darwish', 0.081), ('lm', 0.079), ('gulf', 0.075), ('oov', 0.071), ('merging', 0.069), ('aljazeera', 0.069), ('habash', 0.066), ('negation', 0.065), ('translating', 0.063), ('parallel', 0.062), ('kareem', 0.061), ('nizar', 0.056), ('phrase', 0.056), ('translation', 0.054), ('character', 0.053), ('mistakes', 0.053), ('mada', 0.05), ('ahna', 0.046), ('altanyp', 0.046), ('ayh', 0.046), ('ayyyh', 0.046), ('elongations', 0.046), ('emlna', 0.046), ('hsb', 0.046), ('rgbth', 0.046), ('wbyhtrmwa', 0.046), ('transformation', 0.045), ('transformations', 0.045), ('jk', 0.045), ('zaidan', 0.045), ('phonological', 0.044), ('nhn', 0.041), ('enemies', 0.041), ('sawaf', 0.041), ('sajjad', 0.041), ('kahki', 0.041), ('levantine', 0.041), ('salloum', 0.041), ('sall', 0.041), ('durrani', 0.038), ('yg', 0.038), ('probabilistically', 0.038), ('tokan', 0.038), ('sar', 0.038), ('involved', 0.037), ('transforming', 0.037), ('mohamed', 0.037), ('built', 0.037), ('owen', 0.036), ('rambow', 0.036), ('thn', 0.035), ('arab', 0.035), ('lol', 0.035), ('colloquial', 0.035), ('ahmed', 0.035), ('pronunciations', 0.034), ('oovs', 0.034), ('iraqi', 0.034), ('hassan', 0.033), ('kenlm', 0.031), ('preslav', 0.031), ('utiyama', 0.031), ('interrogative', 0.03), ('tables', 0.03), ('mt', 0.029), ('qatar', 0.029), ('desire', 0.028), ('och', 0.028), ('differ', 0.028), ('roth', 0.028), ('omar', 0.028), ('lms', 0.027), ('pas', 0.026), ('hw', 0.026), ('outperformed', 0.026), ('ne', 0.026), ('yj', 0.026), ('translations', 0.025), ('training', 0.025), ('ng', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
2 0.44012278 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset, outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
3 0.1420448 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
4 0.13961697 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf
Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
5 0.10482829 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
Author: Majid Razmara ; Maryam Siahbani ; Reza Haffari ; Anoop Sarkar
Abstract: Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data. In this paper, we propose a novel approach to finding translations for oov words. We induce a lexicon by constructing a graph on source language monolingual text and employ a graph propagation technique in order to find translations for all the source language phrases. Our method differs from previous approaches by adopting a graph propagation approach that takes into account not only one-step (from oov directly to a source language phrase that has a translation) but multi-step paraphrases from oov source language words to other source language phrases and eventually to target language translations. Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
6 0.098168522 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
7 0.096896373 240 acl-2013-Microblogs as Parallel Corpora
8 0.076415025 255 acl-2013-Name-aware Machine Translation
9 0.076358743 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
10 0.076325588 24 acl-2013-A Tale about PRO and Monsters
11 0.072319381 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization
12 0.072096728 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages
13 0.071781829 303 acl-2013-Robust multilingual statistical morphological generation models
14 0.071041144 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
15 0.070184231 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
16 0.06987486 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
17 0.069318108 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
18 0.06846793 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
19 0.067984648 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
20 0.064192101 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
topicId topicWeight
[(0, 0.146), (1, -0.055), (2, 0.1), (3, 0.082), (4, 0.016), (5, 0.005), (6, -0.032), (7, 0.036), (8, 0.087), (9, -0.015), (10, -0.069), (11, 0.016), (12, 0.008), (13, 0.04), (14, -0.12), (15, 0.004), (16, -0.057), (17, -0.131), (18, -0.032), (19, 0.117), (20, -0.063), (21, 0.051), (22, 0.107), (23, 0.111), (24, 0.057), (25, -0.023), (26, 0.017), (27, -0.109), (28, 0.141), (29, -0.338), (30, -0.054), (31, -0.074), (32, 0.004), (33, -0.037), (34, 0.078), (35, -0.087), (36, 0.055), (37, 0.083), (38, 0.135), (39, 0.294), (40, 0.068), (41, -0.012), (42, 0.045), (43, -0.049), (44, -0.005), (45, -0.018), (46, -0.047), (47, 0.039), (48, -0.009), (49, 0.061)]
simIndex simValue paperId paperTitle
1 0.92595452 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset, outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
same-paper 2 0.90504837 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
3 0.62821501 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
4 0.60773015 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
Author: Kyumars Sheykh Esmaili ; Shahin Salavati
Abstract: Resource scarcity along with diversity, both in dialect and script, are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.
5 0.58399624 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf
Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
6 0.36520419 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
7 0.35705701 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
8 0.34722003 227 acl-2013-Learning to lemmatise Polish noun phrases
9 0.34304237 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
10 0.33965629 240 acl-2013-Microblogs as Parallel Corpora
11 0.31006104 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
12 0.30217671 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
13 0.30067104 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?
14 0.29993612 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
15 0.29326597 303 acl-2013-Robust multilingual statistical morphological generation models
16 0.28605247 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
17 0.27801543 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
18 0.27004084 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
19 0.26770431 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
20 0.26745766 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
topicId topicWeight
[(0, 0.022), (6, 0.03), (11, 0.027), (15, 0.013), (19, 0.014), (24, 0.03), (26, 0.034), (35, 0.042), (42, 0.047), (48, 0.023), (56, 0.013), (70, 0.024), (88, 0.027), (90, 0.025), (95, 0.555)]
simIndex simValue paperId paperTitle
1 0.98836696 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu
Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire, even though automatic semantic parsing might be expected to fare worse on informal language. We argue that by preserving the meaning of the translations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent of the translation model paradigm, so any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.
same-paper 2 0.97795987 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
3 0.97501266 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
4 0.97041923 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews
Author: Kang Liu ; Liheng Xu ; Jun Zhao
Abstract: Mining opinion targets is a fundamental and important task for opinion mining from online reviews. To this end, there are usually two kinds of methods: syntax based and alignment based methods. Syntax based methods usually exploited syntactic patterns to extract opinion targets, which were however prone to suffer from parsing errors when dealing with online informal texts. In contrast, alignment based methods used a word alignment model to fulfill this task, which could avoid parsing errors without using parsing. However, there is no research focusing on which kind of method is better when given a certain amount of reviews. To fill this gap, this paper empirically studies how the performance of these two kinds of methods varies when changing the size, domain and language of the corpus. We further combine syntactic patterns with the alignment model by using a partially supervised framework and investigate whether this combination is useful or not. In our experiments, we verify that our combination is effective on the corpus with small and medium size.
Author: Tsutomu Hirao ; Tomoharu Iwata ; Masaaki Nagata
Abstract: Unsupervised object matching (UOM) is a promising approach to cross-language natural language processing such as bilingual lexicon acquisition, parallel corpus construction, and cross-language text categorization, because it does not require labor-intensive linguistic resources. However, UOM only finds one-to-one correspondences from data sets with the same number of instances in source and target domains, and this prevents us from applying UOM to real-world cross-language natural language processing tasks. To alleviate these limitations, we propose latent semantic matching, which embeds objects in both source and target language domains into a shared latent topic space. We demonstrate the effectiveness of our method on cross-language text categorization. The results show that our method outperforms conventional unsupervised object matching methods.
6 0.95064926 37 acl-2013-Adaptive Parser-Centric Text Normalization
7 0.93492758 66 acl-2013-Beam Search for Solving Substitution Ciphers
8 0.93171597 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
9 0.90230054 289 acl-2013-QuEst - A translation quality estimation framework
10 0.84115785 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
11 0.83480692 135 acl-2013-English-to-Russian MT evaluation campaign
12 0.82596302 317 acl-2013-Sentence Level Dialect Identification in Arabic
13 0.82457662 255 acl-2013-Name-aware Machine Translation
14 0.80438131 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
15 0.79520398 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
16 0.79376972 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
17 0.79042971 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
18 0.77974463 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.77797318 240 acl-2013-Microblogs as Parallel Corpora
20 0.76448059 97 acl-2013-Cross-lingual Projections between Languages from Different Families