acl acl2013 acl2013-317 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset, outperforming a previously proposed approach achieving 80.9%, and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
Reference: text
sentIndex sentText sentNum sentScore
1 Sentence Level Dialect Identification in Arabic Heba Elfardy Department of Computer Science Columbia University heba@cs. [sent-1, score-0.114]
2 edu Abstract This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. [sent-3, score-0.409]
3 We use token level labels to derive sentence-level features. [sent-4, score-0.062]
4 These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. [sent-5, score-0.201]
5 5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80. [sent-7, score-0.043]
6 9% and reflecting a significant gain over a majority baseline of 5 1. [sent-8, score-0.024]
7 1 Introduction The Arabic language exists in a state of Diglossia (Ferguson, 1959), where the standard form of the language, Modern Standard Arabic (MSA), and the regional dialects (DA) live side-by-side and are closely related. [sent-12, score-0.097]
8 Arabic dialects may be divided into five main groups: Egyptian (including Libyan and Sudanese), Levantine (including Lebanese, Syrian, Palestinian and Jordanian), Gulf, Iraqi and Moroccan (Maghrebi) (Habash, 2010). [sent-14, score-0.072]
9 Even though these dialects did not originally exist in a written form, they are pervasively present in social media text (normally mixed with MSA) nowadays. [sent-15, score-0.072]
10 DA does not have a standard orthography leading to many spelling variations and inconsistencies. [sent-16, score-0.183]
11 LCS in Arabic poses a serious challenge for almost all NLP tasks since MSA and DA Mona Diab Department of Computer Science The George Washington University mtdiab@gwu. [sent-18, score-0.075]
12 Hence the need arises for a robust dialect identification tool as a preprocessing step, on both the word and sentence levels. [sent-21, score-0.398]
13 In this paper, we focus on the problem of dialect identification on the sentence level. [sent-22, score-0.358]
14 We propose a supervised approach for identifying whether a given sentence is prevalently MSA or Egyptian DA (EDA). [sent-23, score-0.046]
15 The token level decisions are then combined with other features to train a generative classifier that tries to predict the class of the given sentence. [sent-26, score-0.11]
16 The presented system outperforms the approach presented by Zaidan and Callison-Burch (2011) on the same dataset using 10-fold cross validation. [sent-27, score-0.074]
17 (2009) present a system that identifies dialectal words in speech and their dialect of origin through the acoustic signals. [sent-30, score-0.523]
18 (2012) present CODA, a Conventional Orthography for Dialectal Arabic that aims to standardize the orthography of all the variants of DA, while Dasigi and Diab (2011) present an unsupervised clustering approach to identify orthographic variants in DA. [sent-34, score-0.287]
19 Zaidan and Callison-Burch (2011) crawl a large dataset of MSA-DA news commentaries. [sent-35, score-0.043]
20 The authors annotate part of the dataset for sentence-level dialectalness on [sent-36, score-0.198]
21 Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem. [sent-38, score-0.025]
22 In Elfardy and Diab (2012a), we present a set of guidelines for token-level identification of dialectalness while in (Elfardy and Diab, 2012b), (Elfardy et al. [sent-39, score-0.241]
23 , 2013) we tackle the problem of token-level dialect identification by casting it as a code-switching problem. [sent-40, score-0.023]
24 3 Approach to Sentence-Level Dialect Identification We present a supervised system that uses a Naive Bayes classifier trained on gold labeled data with sentence level binary decisions of either being MSA or DA. [sent-41, score-0.073]
25 1 Core Features: These features indicate how dialectal (or non-dialectal) a given sentence is. [sent-46, score-0.367]
26 They are further divided into: (a) Token-based features and (b) Perplexity-based features. [sent-47, score-0.024]
27 , 2013) to decide upon the class of each word in the given sentence. [sent-52, score-0.024]
28 We use the token-level class labels to estimate the percentage of EDA words and the percentage of OOVs for each sentence. [sent-54, score-0.12]
29 These percentages are then used as features for the proposed model. [sent-55, score-0.024]
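The derivation of these core features is straightforward; the following is a minimal Python sketch (not the authors' code — the label names and the OOV convention are assumptions for illustration):

```python
# Hypothetical sketch: turning per-token class labels produced by the
# token-level system into the sentence-level percentage features above.
def core_token_features(token_labels):
    """token_labels: list of per-token labels, e.g. 'MSA', 'EDA', or 'OOV'."""
    n = len(token_labels) or 1  # guard against empty sentences
    return {
        "pct_eda": sum(1 for t in token_labels if t == "EDA") / n,
        "pct_oov": sum(1 for t in token_labels if t == "OOV") / n,
    }

# Example: a mostly dialectal sentence with one out-of-vocabulary token
print(core_token_features(["EDA", "EDA", "MSA", "OOV", "EDA"]))
# {'pct_eda': 0.6, 'pct_oov': 0.2}
```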
30 The following variants of the underlying token-level system are built to assess the effect of varying the level of preprocessing applied to the underlying LM on the performance of the overall sentence-level dialect identification process: (1) Surface, (2) Tokenized, (3) CODAfied, and (4) Tokenized-CODA. [sent-56, score-0.569]
31 We use the following sentence to show the different techniques: AJ? [sent-57, score-0.046]
32 Surface LMs: No significant preprocessing is applied apart from the regular initial clean-up of the text, which includes removal of URLs and normalization of speech effects such as reducing all redundant letters in a word to 1We use Buckwalter transliteration http://www. [sent-64, score-0.095]
33 J» ktttyyyr (specifically three repeated letters instead of an unpredictable number of repetitions, to maintain the signal that there is a speech effect which could be a DA indicator). [sent-97, score-0.025]
34 Orthography Normalized (CODAfied) LM: Since DA is not originally a written form of Arabic, no standard orthography exists for it. [sent-108, score-0.184]
35 (2012) attempt to solve this problem by presenting CODA, a conventional orthography for writing DA. [sent-110, score-0.214]
36 While CODA and its automated application via CODAfy solve the spelling-inconsistency problem in DA, special care must be taken when using it for our task since it removes valuable dialectalness cues. [sent-113, score-0.244]
37 (v in Buckwalter (BW) Transliteration) is converted into the letter H? [sent-115, score-0.025]
38 CODA suggests that such cases get mapped to the original MSA phonological variant which might make the dialect identification problem more challenging. [sent-117, score-0.312]
39 On the other hand, CODA solves the sparseness issue by mapping multiple spelling-variants to the same orthographic form leading to a more robust LM. [sent-118, score-0.066]
40 For building the tokenized LM, we maintain clitics and lexemes. [sent-132, score-0.234]
41 Some clitics are unique to MSA while others are unique to EDA, so maintaining them in the LM is helpful, e.g. [sent-133, score-0.066]
42 $ is only used in EDA but it could be seen with an MSA/EDA homograph; maintaining the enclitic in the LM facilitates the identification of the sequence as being EDA. [sent-136, score-0.162]
43 5-grams are used for building the tokenized LMs (as opposed to 3-grams for the surface LMs) ex. [sent-137, score-0.213]
44 Tokenized & Orthography Normalized LMs: (Tokenized-CODA) The data is tokenized as in (3) then orthography normalization is applied to the tokenized data. [sent-147, score-0.521]
45 kvyr Ely +nA In addition to the underlying token-level system, we use the following token-level features: 1. [sent-157, score-0.081]
46 Percentage of words in the sentence that are analyzable by an MSA morphological analyzer. [sent-158, score-0.136]
47 Percentage of words in the sentence that are analyzable by an EDA morphological analyzer. [sent-160, score-0.136]
48 Percentage of words in the sentence that exist in a precompiled EDA lexicon. [sent-162, score-0.071]
49 2 Perplexity-based Features: We run each sentence through each of the MSA and EDA LMs and record the perplexity for each of them. [sent-166, score-0.123]
50 The perplexity of a language model on a given test sentence S(w1, . [sent-167, score-0.077]
51 . . , wn) is defined as: perplexity = 2^(-(1/N) Σ_i log2 p(w_i | h_i)) (1) where N is the number of tokens in the sentence and h_i is the history of token w_i. [sent-169, score-0.181]
52 The perplexity conveys how confused the LM is about the given sentence: the higher the perplexity value, the less likely it is that the given sentence matches the LM. [sent-170, score-0.246]
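Equation (1) is easy to compute once an LM has assigned a probability to each token; the sketch below assumes those per-token probabilities are given and is illustrative only, not tied to any particular LM toolkit:

```python
import math

# Perplexity of an LM over a sentence, per Equation (1), given the
# per-token probabilities p(w_i | h_i) assigned by the LM.
def perplexity(token_probs):
    n = len(token_probs)
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-(1.0 / n) * log_sum)

# A sentence the LM finds likely gets a much lower perplexity
# than a sentence it finds unlikely.
print(perplexity([0.2, 0.1, 0.25]))    # ~5.8
print(perplexity([0.01, 0.02, 0.01]))  # ~79
```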
53 These are the features that do not directly relate to the dialectalness of words in the given sentence but rather estimate how informal the sentence is. They include: • The percentage of punctuation, numbers, special characters, and words written in Roman script. [sent-174, score-0.342]
54 2We repeat this step for each of the preprocessing schemes explained in section 3. [sent-175, score-0.04]
55 1 • The percentage of words having word-lengthening effects. [sent-178, score-0.048]
56 • Whether the sentence has consecutive repeated punctuation or not. [sent-180, score-0.046]
57 (Binary feature, yes/no) • Whether the sentence has an exclamation mark or not. [sent-181, score-0.046]
58 (Binary feature, yes/no) • Whether the sentence has emoticons or not. [sent-182, score-0.046]
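A rough sketch of how such informality cues could be extracted is given below; the exact character classes and the emoticon inventory used by the authors are not specified here, so the patterns are assumptions for illustration:

```python
import re

EMOTICON = re.compile(r"[:;=][-']?[()DPp]")  # assumed, minimal emoticon pattern

def meta_features(sentence):
    tokens = sentence.split()
    n = len(tokens) or 1
    return {
        # tokens made up entirely of punctuation, digits or special characters
        "pct_punct_num_special": sum(1 for t in tokens
                                     if not any(c.isalpha() for c in t)) / n,
        "pct_roman_script": sum(1 for t in tokens
                                if re.search(r"[A-Za-z]", t)) / n,
        # word lengthening: any character repeated three or more times
        "pct_word_lengthening": sum(1 for t in tokens
                                    if re.search(r"(.)\1\1", t)) / n,
        "has_repeated_punct": bool(re.search(r"[!?.,]{2,}", sentence)),
        "has_exclamation": "!" in sentence,
        "has_emoticon": bool(EMOTICON.search(sentence)),
    }
```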
59 , 2009) and the derived features to train a Naive-Bayes classifier. [sent-185, score-0.024]
60 The classifier is trained and cross-validated on the gold-training data for each of our different configurations (Surface, CODAfied, Tokenized & Tokenized-CODA). [sent-186, score-0.076]
61 In the second set, Experiment Set B, we use the whole dataset for training without further splitting. [sent-189, score-0.043]
62 For both sets of experiments, we apply 10-fold cross validation on the training data. [sent-190, score-0.055]
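As a point of reference, the same training-and-evaluation loop can be reproduced with off-the-shelf tools; the paper uses WEKA's Naive Bayes, but an equivalent 10-fold cross-validation setup in scikit-learn (with made-up feature vectors standing in for the real core and meta features) might look like:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Dummy data for illustration: 200 sentences, 10 core/meta features each.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)  # 0 = MSA, 1 = EDA (gold labels)

clf = GaussianNB()
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```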
63 1 Data We use the code-switched EDA-MSA portion of the crowdsource-annotated dataset by Zaidan and Callison-Burch (2011). [sent-193, score-0.043]
64 The dataset consists of user commentaries on Egyptian news articles. [sent-194, score-0.066]
65 In experiment Set B, all data is used to perform the 10-fold cross-validation. [sent-202, score-0.024]
66 Figure 1: Learning curves for the different configurations, obtained by applying 10-fold cross validation on the training set. (a) Experiment Set A (uses 90% of the dataset); (b) Experiment Set B (uses the whole dataset). [sent-203, score-0.155]
67 The first is a majority baseline (Maj-BL) that assigns all the sentences the label of the most frequent class observed in the training data. [sent-206, score-0.048]
68 The second baseline (Token-BL) assumes that the sentence is EDA if more than 45% of its tokens are dialectal; otherwise it assumes it is MSA.3 [sent-207, score-0.367]
69 The third baseline (Ppl-BL) runs each sentence through MSA & EDA LMs and assigns the sentence the class of the LM yielding the lower perplexity value. [sent-208, score-0.217]
70 The last baseline (OZCCB-BL) is the result obtained by Zaidan and Callison-Burch (2011), which uses the same approach as our third baseline, Ppl-BL.4 [sent-209, score-0.024]
71 For Token-BL and Ppl-BL, the performance is calculated for all LM sizes of the four different configurations: Surface, CODAfied, Tokenized, and Tokenized-CODA, and the best performing configuration on the cross-validation set is used as the baseline system. [sent-210, score-0.078]
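The three simple baselines are easy to restate in code; the sketch below assumes per-token labels and per-LM perplexities are available from the components described earlier (names are illustrative, not the authors'):

```python
from collections import Counter

def majority_baseline(train_labels):
    # Maj-BL: always predict the most frequent class seen in training.
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _sentence: majority

def token_baseline(token_labels, threshold=0.45):
    # Token-BL: EDA if more than 45% of the tokens are dialectal.
    n = len(token_labels) or 1
    eda_ratio = sum(1 for t in token_labels if t == "EDA") / n
    return "EDA" if eda_ratio > threshold else "MSA"

def perplexity_baseline(msa_ppl, eda_ppl):
    # Ppl-BL: pick the class of the LM with the lower perplexity.
    return "MSA" if msa_ppl < eda_ppl else "EDA"
```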
72 3 Results & Discussion For each of the different configurations, we build a learning curve by varying the size of the LMs between 2M, 4M, 8M, 16M and 28M tokens. [sent-212, score-0.026]
73 Figures 1a and 1b show the learning curves of the different configurations on the cross-validation set for experiments A & B respectively. [sent-213, score-0.1]
74 In Table 2 we note that both CODA and Tokenized solve the data-sparseness issue, hence they produce better results. 3We experimented with different thresholds (15%, 30%, 45%, 60% and 75%) and the 45% threshold setting yielded the best performance. [sent-214, score-0.025]
75 different configurations of the 8M LM (best-performing LM size) using 10-fold cross validation against the different baselines. [sent-220, score-0.131]
76 However, as mentioned earlier, CODA removes some dialectalness cues so the improvement resulting from using CODA is much less than that from using tokenization. [sent-222, score-0.195]
77 Also, when combining CODA with tokenization, as in the Tokenized-CODA condition, the performance drops since in this case the sparseness issue has already been resolved by tokenization, so adding CODA only removes dialectalness cues. [sent-223, score-0.391]
78 in the data so when performing the tokenization it becomes Q? [sent-229, score-0.095]
79 w+ ktyr which on the contrary is frequent in the data. [sent-233, score-0.052]
80 4This baseline can only be compared to the results of the second set of experiments. [sent-234, score-0.024]
81 Table 3: Performance Accuracies of the best-performing configuration (Tokenized) on the held-out test set against the baselines Maj-BL, Token-BL and Ppl-BL. [sent-237, score-0.062]
82 All configurations outperform all baselines with the Tokenized configuration producing the best results. [sent-244, score-0.138]
83 , 2013) as the size of the MSA & EDA LMs increases, the shared n-grams increase, leading to higher confusability between the classes of tokens in a given sentence. [sent-247, score-0.052]
84 Table 3 presents the results on the held-out dataset compared against three of the baselines, Maj-BL, Token-BL and Ppl-BL. [sent-248, score-0.043]
85 We note that the Tokenized condition, the best performing condition, outperforms all baselines by a significant margin. [sent-249, score-0.056]
86 5 Conclusion We presented a supervised approach for sentence level dialect identification in Arabic. [sent-250, score-0.385]
87 The approach uses features from an underlying system for token-level identification of Egyptian Dialectal Arabic in addition to other core and meta features to decide whether a given sentence is MSA or EDA. [sent-251, score-0.316]
88 We studied the impact of two types of preprocessing techniques (Tokenization and Orthography Normalization) as well as varying the size of the LM on the performance of our approach. [sent-252, score-0.066]
89 The presented approach produced significantly better results than a previous approach in addition to beating the majority baseline and two other strong baselines. [sent-253, score-0.024]
90 Simplified guidelines for the creation of large scale dialectal arabic annotations. [sent-265, score-0.506]
91 Mada+ tokan: A toolkit for arabic tokenization, diacritization, morphological disambiguation, pos tagging, stemming and lemmatization. [sent-286, score-0.247]
92 Dialectal to standard arabic paraphrasing to improve arabic-english statistical machine translation. [sent-306, score-0.209]
93 The arabic online commentary dataset: an annotated dataset of informal arabic with high dialectal content. [sent-311, score-0.781]
wordName wordTfidf (topN-words)
[('msa', 0.37), ('dialectal', 0.297), ('coda', 0.284), ('eda', 0.283), ('elfardy', 0.258), ('dialect', 0.226), ('arabic', 0.209), ('tokenized', 0.167), ('orthography', 0.159), ('lm', 0.156), ('dialectalness', 0.155), ('da', 0.152), ('habash', 0.147), ('kdh', 0.129), ('lms', 0.121), ('heba', 0.114), ('nizar', 0.111), ('codafied', 0.103), ('zaidan', 0.101), ('qk', 0.099), ('egyptian', 0.094), ('identification', 0.086), ('meta', 0.079), ('mona', 0.079), ('elyna', 0.078), ('wktyr', 0.078), ('perplexity', 0.077), ('configurations', 0.076), ('diab', 0.075), ('dialects', 0.072), ('tokenization', 0.071), ('eskander', 0.069), ('owen', 0.053), ('analyzable', 0.052), ('codafy', 0.052), ('confusability', 0.052), ('ely', 0.052), ('enclitic', 0.052), ('ktyr', 0.052), ('kvyr', 0.052), ('tokenbl', 0.052), ('percentage', 0.048), ('surface', 0.046), ('sentence', 0.046), ('bw', 0.046), ('biadsy', 0.046), ('ramy', 0.046), ('salloum', 0.046), ('aj', 0.045), ('dataset', 0.043), ('clitics', 0.042), ('nadi', 0.042), ('preprocessing', 0.04), ('removes', 0.04), ('rambow', 0.04), ('orthographic', 0.039), ('morphological', 0.038), ('mada', 0.038), ('dasigi', 0.038), ('buckwalter', 0.038), ('lcs', 0.036), ('tokenizer', 0.036), ('token', 0.035), ('variants', 0.033), ('baselines', 0.032), ('ram', 0.032), ('cross', 0.031), ('conventional', 0.03), ('configuration', 0.03), ('atlanta', 0.029), ('underlying', 0.029), ('hr', 0.028), ('normalization', 0.028), ('core', 0.028), ('ak', 0.028), ('condition', 0.027), ('level', 0.027), ('sparseness', 0.027), ('weka', 0.027), ('transliteration', 0.027), ('varying', 0.026), ('letter', 0.025), ('solve', 0.025), ('maintain', 0.025), ('exists', 0.025), ('baseline', 0.024), ('experiment', 0.024), ('maintaining', 0.024), ('curves', 0.024), ('validation', 0.024), ('features', 0.024), ('spelling', 0.024), ('performing', 0.024), ('class', 0.024), ('informal', 0.023), ('hi', 0.023), ('codeswitching', 0.023), ('standardize', 0.023), ('lebanese', 0.023), ('commentaries', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
2 0.44012278 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
3 0.14395323 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf
Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
4 0.11730623 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Name Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in Fmeasure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and mi- croblogs test sets respectively.
5 0.11603212 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
Author: Amjad Abu-Jbara ; Ben King ; Mona Diab ; Dragomir Radev
Abstract: In this paper, we use Arabic natural language processing techniques to analyze Arabic debates. The goal is to identify how the participants in a discussion split into subgroups with contrasting opinions. The members of each subgroup share the same opinion with respect to the discussion topic and an opposing opinion to the members of other subgroups. We use opinion mining techniques to identify opinion expressions and determine their polarities and their targets. We use the opinion predictions to represent the discussion in one of two formal representations: a signed attitude network or a space of attitude vectors. We identify opinion subgroups by partitioning the signed network representation or by clustering the vector space representation. We evaluate the system using a data set of labeled discussions and show that it achieves good results.
6 0.10018808 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
7 0.099649981 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
8 0.086625889 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
9 0.074418135 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization
10 0.057903938 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
11 0.048693366 240 acl-2013-Microblogs as Parallel Corpora
12 0.047190472 255 acl-2013-Name-aware Machine Translation
13 0.044141002 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
14 0.041369602 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
15 0.040663887 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
16 0.040535685 257 acl-2013-Natural Language Models for Predicting Programming Comments
17 0.04009673 119 acl-2013-Diathesis alternation approximation for verb clustering
18 0.039376874 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
19 0.037506521 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
20 0.037487064 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
topicId topicWeight
[(0, 0.11), (1, 0.0), (2, 0.047), (3, 0.045), (4, 0.006), (5, -0.006), (6, -0.019), (7, 0.024), (8, 0.061), (9, -0.029), (10, -0.061), (11, 0.006), (12, -0.015), (13, 0.033), (14, -0.127), (15, 0.013), (16, -0.075), (17, -0.094), (18, -0.021), (19, 0.081), (20, -0.032), (21, 0.04), (22, 0.107), (23, 0.115), (24, 0.08), (25, -0.043), (26, 0.025), (27, -0.127), (28, 0.152), (29, -0.354), (30, -0.1), (31, -0.075), (32, 0.018), (33, -0.062), (34, 0.087), (35, -0.099), (36, 0.084), (37, 0.091), (38, 0.11), (39, 0.35), (40, 0.036), (41, -0.014), (42, 0.037), (43, -0.058), (44, -0.016), (45, 0.018), (46, -0.1), (47, 0.034), (48, -0.023), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9431603 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
2 0.81809962 359 acl-2013-Translating Dialectal Arabic to English
Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov
Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
3 0.58418834 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
Author: Kyumars Sheykh Esmaili ; Shahin Salavati
Abstract: Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.
4 0.57133132 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Name Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in Fmeasure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and mi- croblogs test sets respectively.
5 0.46477368 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf
Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
6 0.37300006 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
7 0.35921773 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
8 0.29031482 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
9 0.28700957 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?
10 0.28001729 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling
11 0.26938403 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
12 0.25885198 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information
13 0.2584227 227 acl-2013-Learning to lemmatise Polish noun phrases
14 0.25597531 171 acl-2013-Grammatical Error Correction Using Integer Linear Programming
15 0.25092918 390 acl-2013-Word surprisal predicts N400 amplitude during reading
16 0.24539118 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
17 0.23715734 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
18 0.23580427 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
19 0.22958931 240 acl-2013-Microblogs as Parallel Corpora
20 0.22770931 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
topicId topicWeight
[(0, 0.03), (6, 0.03), (11, 0.039), (19, 0.278), (24, 0.034), (26, 0.053), (35, 0.057), (42, 0.05), (48, 0.036), (64, 0.014), (70, 0.021), (88, 0.03), (90, 0.013), (95, 0.229)]
simIndex simValue paperId paperTitle
same-paper 1 0.83494651 317 acl-2013-Sentence Level Dialect Identification in Arabic
Author: Heba Elfardy ; Mona Diab
Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.
2 0.8095094 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision
Author: Michael Speriosu ; Jason Baldridge
Abstract: Toponym resolvers identify the specific locations referred to by ambiguous placenames in text. Most resolvers are based on heuristics using spatial relationships between multiple toponyms in a document, or metadata such as population. This paper shows that text-driven disambiguation for toponyms is far more effective. We exploit document-level geotags to indirectly generate training instances for text classifiers for toponym resolution, and show that textual cues can be straightforwardly integrated with other commonly used ones. Results are given for both 19th century texts pertaining to the American Civil War and 20th century newswire articles.
3 0.68829918 126 acl-2013-Diverse Keyword Extraction from Conversations
Author: Maryam Habibi ; Andrei Popescu-Belis
Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.
4 0.6858449 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations
Author: Aliaksei Severyn ; Massimo Nicosia ; Alessandro Moschitti
Abstract: Measuring semantic textual similarity (STS) is at the cornerstone of many NLP applications. Different from the majority of approaches, where a large number of pairwise similarity features are used to represent a text pair, our model features the following: (i) it directly encodes input texts into relational syntactic structures; (ii) relies on tree kernels to handle feature engineering automatically; (iii) combines both structural and feature vector representations in a single scoring model, i.e., in Support Vector Regression (SVR); and (iv) delivers significant improvement over the best STS systems.
5 0.67704558 37 acl-2013-Adaptive Parser-Centric Text Normalization
Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li
Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.
6 0.67691976 359 acl-2013-Translating Dialectal Arabic to English
7 0.67659926 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
8 0.67581654 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
9 0.67465138 66 acl-2013-Beam Search for Solving Substitution Ciphers
10 0.67053264 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews
12 0.6683957 289 acl-2013-QuEst - A translation quality estimation framework
13 0.66578937 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
14 0.65119898 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
15 0.6478678 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
16 0.64462614 255 acl-2013-Name-aware Machine Translation
17 0.6374861 135 acl-2013-English-to-Russian MT evaluation campaign
18 0.63695717 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
19 0.63567495 97 acl-2013-Cross-lingual Projections between Languages from Different Families
20 0.63549042 240 acl-2013-Microblogs as Parallel Corpora