acl acl2011 acl2011-289 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Muhammad Abdul-Mageed ; Mona Diab ; Mohammed Korayem
Abstract: Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). In the current study, we report efforts to partially fill this gap. We present a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon. The corpus is a collection of newswire documents annotated on the sentence level. We also describe an automatic SSA tagging system that exploits the annotated data. We investigate the impact of different levels of preprocessing settings on the SSA classification task. We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Indiana University, Bloomington, USA; Columbia University, NYC, USA. [sent-3, score-0.132]
2 edu Abstract Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). [sent-4, score-0.124]
3 We present a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon. [sent-6, score-0.164]
4 The corpus is a collection of newswire documents annotated on the sentence level. [sent-7, score-0.079]
5 We also describe an automatic SSA tagging system that exploits the annotated data. [sent-8, score-0.043]
6 We investigate the impact of different levels of preprocessing settings on the SSA classification task. [sent-9, score-0.192]
7 We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance. [sent-10, score-0.058]
8 Subjectivity and Sentiment Analysis (SSA) is an area that has been witnessing a flurry of novel research. [sent-12, score-0.124]
9 In natural language, subjectivity refers to expression of opinions, evaluations, feelings, and speculations (Banfield, 1982; Wiebe, 1994) and thus incorporates sentiment. [sent-13, score-0.226]
10 The process of subjectivity classification refers to the task of classifying texts into either objective (e. [sent-14, score-0.308]
11 Subjective text is further classified with sentiment or polarity. [sent-19, score-0.16]
12 For sentiment classification, the task refers to identifying whether the subjective text is positive (e. [sent-20, score-0.248]
13 Very few studies have addressed the problem for morphologically rich languages (MRL) such as Arabic and Hebrew. [sent-34, score-0.124]
14 The problem is even more pronounced in some MRL due to the lack in annotated resources for SSA such as labeled corpora, and polarity lexica. [sent-40, score-0.164]
15 In the current paper, we investigate the task of sentence-level SSA on Modern Standard Arabic (MSA) texts from the newswire genre. [sent-41, score-0.036]
16 We run experiments on three different pre-processing settings based on tokenized text from the Penn Arabic Treebank (PATB) (Maamouri et al. [sent-42, score-0.145]
17 Our work shows that explicitly using morphology-based features in our models improves the system’s performance. [sent-44, score-0.087]
18 We also measure the impact of using a wide coverage polarity lexicon and show that using a tailored resource results in significant improvement in classification performance. [sent-45, score-0.249]
19 2 Approach To our knowledge, no SSA annotated MSA data exists. [sent-46, score-0.043]
20 Hence we decided to create our own SSA annotated data. [sent-47, score-0.043]
21 1 Data set and Annotation Corpus: Two college-educated native speakers of Arabic annotated 2855 sentences from Part 1 V 3. [sent-49, score-0.043]
22 The sentences make up the first 400 documents of that part of PATB amounting to a total of 54. [sent-51, score-0.029]
23 Moreover, each of the sentences in our data set is manually labeled by a domain label. [sent-63, score-0.047]
24 The domain labels are from the newswire genre and are adopted from (Abdul-Mageed, 2008). [sent-64, score-0.083]
25 Polarity Lexicon: We manually created a lexicon of 3982 adjectives labeled with one of the following tags {positive, negative, neutral}. [sent-65, score-0.084]
26 The adjectives pertain to the newswire domain. [sent-66, score-0.065]
27 2 Automatic Classification Tokenization scheme and settings: We run experiments on gold-tokenized text from PATB. [sent-68, score-0.035]
28 We adopt the PATB+Al tokenization scheme, where proclitics and enclitics as well as Al are segmented out from the stem words. [sent-69, score-0.473]
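The PATB+Al scheme above can be sketched as a greedy affix splitter. The proclitic and enclitic inventories below are an illustrative subset (not the full PATB specification), and a real system would disambiguate candidate splits with a morphological analyzer rather than string matching alone:

```python
# Sketch of PATB+Al-style tokenization (Buckwalter transliteration).
# The affix inventories are a hypothetical illustrative subset.
PROCLITICS = ("w", "f", "b", "l", "Al")   # conjunctions, prepositions, Al
ENCLITICS = ("hm", "hA", "h")             # a few pronominal clitics

def tokenize_patb_al(word):
    """Split proclitics and enclitics (including Al) off the stem.

    e.g. 'wAlktAb' ("and the book") -> ['w+', 'Al+', 'ktAb'].
    Greedy and dictionary-free, so it over-segments on real text.
    """
    pre, suf = [], []
    changed = True
    while changed:                      # peel proclitics left to right
        changed = False
        for p in PROCLITICS:
            if word.startswith(p) and len(word) > len(p):
                pre.append(p + "+")
                word = word[len(p):]
                changed = True
                break
    for e in ENCLITICS:                 # peel at most one enclitic
        if word.endswith(e) and len(word) > len(e):
            suf.append("+" + e)
            word = word[:-len(e)]
            break
    return pre + [word] + suf
```

In the segmented output, the clitics become separate tokens, so they can later be kept as separate features in the sentence vectors, as the paper describes.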
29 Table 1 illustrates examples of the three configuration schemes, with each underlined. [sent-71, score-0.027]
30 Features: The features we employed are of two main types: Language-independent features and Morphological features. [sent-72, score-0.104]
31 Language-Independent Features: This group of features has been employed in various SSA studies. [sent-73, score-0.052]
32 , 2009), we apply a feature indicating the domain of the document to which a sentence belongs. [sent-75, score-0.085]
33 As mentioned earlier, each sentence has a document domain label manually associated with it. [sent-76, score-0.047]
34 N-GRAM: We run experiments with N-grams ≤ 4 and all possible combinations of them. [sent-81, score-0.062]
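The N-gram settings (orders up to 4 and all combinations of orders, e.g. the 1g+2g condition used later) can be enumerated generically; this is a sketch, not the paper's actual feature-extraction code:

```python
from itertools import combinations

# Generic sketch of the N-gram settings: every n-gram order up to max_n
# and every combination of orders (the "1g", "2g", "1g+2g", ... settings).
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_feature_sets(tokens, max_n=4):
    """Map setting name (e.g. '1g+2g') to its list of n-gram features."""
    orders = range(1, max_n + 1)
    settings = {}
    for r in orders:
        for combo in combinations(orders, r):
            feats = []
            for n in combo:
                feats.extend(ngrams(tokens, n))
            settings["+".join(f"{n}g" for n in combo)] = feats
    return settings
```

With max_n=4 this yields 15 settings (all non-empty subsets of {1, 2, 3, 4}), matching the "N-grams ≤ 4 and all possible combinations" description.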
35 ADJ: For subjectivity classification, we follow Bruce & Wiebe (1999) in adding a binary has adjective feature indicating whether or not any of the adjectives in our manually created polarity lexicon exists in a sentence. [sent-82, score-0.56]
36 For sentiment classification, we apply two features, has POS adjective and has NEG adjective, each a binary feature indicating whether a POS or NEG adjective occurs in a sentence. [sent-83, score-0.349]
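A minimal sketch of these lexicon-based adjective features follows. The lexicon here is a tiny hypothetical stand-in for the paper's 3982-adjective resource (entries in Buckwalter-style transliteration, labels as in the paper):

```python
# Hypothetical stand-in for the paper's manually created polarity lexicon.
POLARITY_LEXICON = {"jmyl": "positive", "qbyH": "negative", "EAdy": "neutral"}

def adjective_features(tokens, lexicon=POLARITY_LEXICON):
    """has_adjective (subjectivity stage), has_POS/has_NEG (sentiment stage)."""
    tags = {lexicon[t] for t in tokens if t in lexicon}
    return {
        "has_adjective": int(bool(tags)),
        "has_POS_adjective": int("positive" in tags),
        "has_NEG_adjective": int("negative" in tags),
    }
```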
37 MSA-Morphological Features: MSA exhibits a very rich morphological system that is templatic and agglutinative, based on both derivational and inflectional features. [sent-84, score-0.157]
38 We explicitly model morphological features of person, state, gender, tense, aspect, and number. [sent-85, score-0.116]
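One plausible encoding of these explicit morphological features is as key=value string features appended to the sentence representation. The attribute names below follow the paper's list; the value tags in the example are assumptions, not the paper's exact inventory:

```python
# Illustrative encoding of the explicit morphological features (person,
# state, gender, tense, aspect, number) as key=value string features.
def morph_features(analysis):
    keys = ("person", "state", "gender", "tense", "aspect", "number")
    return [f"{k}={analysis[k]}" for k in keys if k in analysis]
```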
39 3 Method: Two-stage Classification Process In the current study, we adopt a two-stage classification approach. [sent-89, score-0.082]
40 , Subjectivity), we build a binary classifier to sort out OBJ from SUBJ cases. [sent-92, score-0.033]
41 , Sentiment) we apply binary classification that distinguishes S-POS from S-NEG cases. [sent-95, score-0.115]
42 We disregard the neutral class of S-NEUT for this round of experimentation. [sent-96, score-0.042]
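The two-stage scheme can be sketched as a simple cascade: stage 1 separates OBJ from SUBJ, and stage 2 runs only on SUBJ cases to separate positive from negative (neutral is disregarded, as stated above). Here `subj_clf` and `sent_clf` are placeholders for the trained SVMs; any object exposing a `predict` method works:

```python
# Sketch of the two-stage classification cascade.
def two_stage_classify(features, subj_clf, sent_clf):
    # Stage 1: subjectivity. Objective sentences stop here.
    if subj_clf.predict(features) == "OBJ":
        return "OBJ"
    # Stage 2: sentiment, applied only to subjective sentences.
    return sent_clf.predict(features)  # "S-POS" or "S-NEG"
```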
43 We experimented with various kernels and parameter settings and found that linear kernels yield the best performance. [sent-98, score-0.192]
44 We ran experiments with presence vectors: In each sentence vector, the value of each dimension is binary, either 1 (regardless of how many times a feature occurs) or 0. [sent-99, score-0.102]
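A presence vector in this sense is simply a binary bag-of-features; a minimal sketch:

```python
# Presence vector: each dimension is 1 if the feature occurs in the
# sentence at all (regardless of count) and 0 otherwise.
def presence_vector(features, vocabulary):
    present = set(features)
    return [1 if f in present else 0 for f in vocabulary]
```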
45 Experimental Conditions: We first run experiments using each of the three lemmatization settings Surface, Lemma, Stem using various Ngrams and N-gram combinations and then iteratively add other features. [sent-100, score-0.254]
46 , from the following set {DOMAIN, ADJ, UNIQUE}) are added to the Lemma and Stem+Morph settings. [sent-106, score-0.029]
47 [Table 1: Examples of word lemmatization settings.] In all three settings, clitics that are split off words are kept as separate features in the sentence vectors. [sent-107, score-0.317]
48 3 Results and Evaluation We divide our data into 80% for 5-fold cross- validation and 20% for test. [sent-108, score-0.034]
49 For experiments on the test data, the 80% are used as training data. [sent-109, score-0.029]
50 We have two settings, a development setting (DEV) and a test setting (TEST). [sent-110, score-0.167]
51 In the development setting, we run typical 5-fold cross-validation, where we train on 4 folds, test on the 5th, and then average the results. [sent-111, score-0.162]
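The DEV protocol (train on 4 folds, test on the held-out fold, average) can be sketched generically; `train_and_score` is a placeholder for training the SVM and returning an F-score:

```python
# Sketch of the 5-fold cross-validation protocol used in the DEV setting.
def five_fold_cv(examples, train_and_score, k=5):
    folds = [examples[i::k] for i in range(k)]  # round-robin fold split
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, held_out))
    return sum(scores) / k  # average over the k folds
```

The same trained folds can then be reused in TEST mode by scoring each fold's model on the held-out test set and averaging, as the paper describes.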
52 In the test setting, we only ran with the best configurations yielded from the DEV conditions. [sent-112, score-0.087]
53 In TEST mode, we still train with 4 folds but we test on the test data exclusively, averaging across the different training rounds. [sent-113, score-0.089]
54 It is worth noting that the test data is larger than any given dev data (20% of the overall data set for test, vs. [sent-114, score-0.09]
55 Moreover, for TEST we report only experiments on the Stem+Morph setting and the Stem+Morph+ADJ, Stem+Morph+DOMAIN, and Stem+Morph+UNIQUE settings. [sent-117, score-0.069]
56 Below, we only report the best-performing results across the N-GRAM features and their combinations. [sent-118, score-0.052]
57 1 Subjectivity Among all the lemmatization settings, the Stem was found to perform best with 73. [sent-121, score-0.109]
58 In addition, adding the inflectional morphology features improves classification (and hence the Stem+Morph setting, when run under the same 1g+2g condition as the Stem, is better by 0. [sent-125, score-0.325]
59 As for the language-independent features, we found that whereas the ADJ feature helps neither the Lemma nor the Stem+Morph setting, the DOMAIN feature improves the results slightly with the two settings. [sent-127, score-0.111]
60 In addition, 589 the UNIQUE feature helps classification with the Lemma, but it hurts with the Stem+Morph. [sent-128, score-0.12]
61 Table 2 shows that although performance on the test set drops with all settings on Stem+Morph, results are still at least 10% higher than the baseline. [sent-129, score-0.172]
62 2 Sentiment Similar to the subjectivity results, the Stem setting performs better than the other two lemmatization scheme settings, with 56. [sent-134, score-0.265]
63 These best results for the three lemmatization schemes are all acquired with 1g. [sent-138, score-0.109]
64 Again, adding the morphology-based features helps improve the classification: The Stem+Morph outperforms Stem by about 1. [sent-139, score-0.088]
65 We also found that whereas adding the DOMAIN feature to both the Lemma and the Stem+Morph settings improves the classification slightly, the UNIQUE feature only improves classification with the Stem+Morph. [sent-141, score-0.456]
66 Adding the ADJ feature improves performance significantly: An improvement of 20. [sent-142, score-0.073]
67 As Table 3 shows, performance on test data drops when applying all features except ADJ, the latter helping improve performance by 4. [sent-145, score-0.114]
68 The best results we thus acquire on the 80% training data with 5-fold cross validation is 90. [sent-147, score-0.034]
69 93% F with 1g, and the best performance of the system on the test data is 95. [sent-148, score-0.029]
70 (2003) present an NLP-based system that detects all references to a given subject, and determines sentiment in each of the references. [sent-158, score-0.041]
71 [Table 2: Subjectivity results.] [sent-166, score-0.189]
72 Similar to (2003), Kim & Hovy (2004) present a sentence-level system that, given a topic, detects sentiment towards it. [sent-167, score-0.201]
73 Our approach differs from both (2003) and Kim & Hovy (2004) in that we do not detect sentiment toward specific topics. [sent-168, score-0.16]
74 Also, we make use of N-gram features beyond unigrams and employ elaborate Ngram combinations. [sent-169, score-0.052]
75 Yu & Hatzivassiloglou (2003) build a document- and sentence-level subjectivity classification system using various N-gram-based features and a polarity lexicon. [sent-170, score-0.451]
76 Some of our features are similar to those used by Yu & Hatzivassiloglou, but we exploit additional features. [sent-172, score-0.052]
77 (1999) train a sentence-level probabilistic classifier on data from the WSJ to identify subjectivity in these sentences. [sent-174, score-0.196]
78 They use POS features, lexical features, and a paragraph feature and obtain an average accuracy on subjectivity tagging of 72. [sent-175, score-0.234]
79 Again, our feature set is richer than Wiebe et al. [sent-177, score-0.038]
80 use a root extraction algorithm and do not use morphological features. [sent-184, score-0.064]
81 5 Conclusion In this paper, we build a sentence-level SSA system for MSA, contrasting language-independent features alone vs. [sent-188, score-0.052]
82 combining language-independent and language-specific feature sets, namely morphological features specific to Arabic. [sent-189, score-0.154]
83 We also investigate the level of stemming required for the task. [sent-190, score-0.032]
84 We show that the Stem lemmatization setting outperforms both Surface and Lemma settings for the SSA task. [sent-191, score-0.288]
85 We illustrate empirically that adding language specific features for MRL yields improved performance. [sent-192, score-0.088]
86 Similar to previous studies of SSA for other languages, we show that exploiting a polarity lexicon has the largest impact on performance. [sent-193, score-0.167]
87 Finally, as part of the contribution of this investigation, we present a novel MSA data set annotated for SSA layered on top of the PATB data annotations that will be made available to the community at large, in addition to a large scale polarity lexicon. [sent-194, score-0.191]
88 Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. [sent-200, score-0.082]
89 The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. [sent-242, score-0.458]
90 Statistical parsing of morphologically rich languages (SPMRL): what, how and whither. [sent-255, score-0.093]
91 Development and use of a gold standard data set for subjectivity classifications. [sent-262, score-0.196]
92 Recognizing Contextual Polarity: an exploration of features for phrase-level sentiment analysis. [sent-285, score-0.212]
93 The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. [sent-299, score-0.458]
wordName wordTfidf (topN-words)
[('ssa', 0.483), ('stem', 0.413), ('morph', 0.311), ('subjectivity', 0.196), ('arabic', 0.193), ('mrl', 0.165), ('sentiment', 0.16), ('msa', 0.138), ('patb', 0.134), ('indiana', 0.132), ('lemma', 0.124), ('wiebe', 0.124), ('polarity', 0.121), ('adj', 0.111), ('settings', 0.11), ('lemmatization', 0.109), ('abbasi', 0.087), ('classification', 0.082), ('subj', 0.076), ('setting', 0.069), ('bruce', 0.067), ('bloomington', 0.066), ('mubarak', 0.066), ('witnessing', 0.066), ('morphological', 0.064), ('obj', 0.064), ('inflectional', 0.062), ('dev', 0.061), ('wilson', 0.059), ('surface', 0.059), ('subjective', 0.058), ('stepped', 0.058), ('flurry', 0.058), ('yu', 0.057), ('hatzivassiloglou', 0.056), ('hate', 0.054), ('adjective', 0.052), ('features', 0.052), ('kim', 0.051), ('maamouri', 0.05), ('camera', 0.05), ('unique', 0.049), ('domain', 0.047), ('clitics', 0.046), ('tsarfaty', 0.046), ('neg', 0.046), ('lexicon', 0.046), ('annotated', 0.043), ('neutral', 0.042), ('svmlight', 0.041), ('detects', 0.041), ('kernels', 0.041), ('feature', 0.038), ('adjectives', 0.038), ('newswire', 0.036), ('yi', 0.036), ('adding', 0.036), ('improves', 0.035), ('run', 0.035), ('validation', 0.034), ('binary', 0.033), ('fold', 0.033), ('drops', 0.033), ('stemming', 0.032), ('languages', 0.032), ('folds', 0.031), ('ran', 0.031), ('columbia', 0.031), ('tokenization', 0.031), ('rich', 0.031), ('refers', 0.03), ('morphologically', 0.03), ('enclitics', 0.029), ('mullen', 0.029), ('computional', 0.029), ('amounting', 0.029), ('erences', 0.029), ('rehbein', 0.029), ('nyc', 0.029), ('muhammad', 0.029), ('perfective', 0.029), ('undiacritized', 0.029), ('seddah', 0.029), ('ltoo', 0.029), ('narration', 0.029), ('journalism', 0.029), ('rwe', 0.029), ('penn', 0.029), ('test', 0.029), ('al', 0.029), ('configurations', 0.027), ('modern', 0.027), ('morphology', 0.027), ('configuration', 0.027), ('nemlar', 0.027), ('versley', 0.027), ('layered', 0.027), ('eco', 0.027), ('spmrl', 0.027), ('pertain', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic
Author: Muhammad Abdul-Mageed ; Mona Diab ; Mohammed Korayem
Abstract: Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). In the current study, we report efforts to partially fill this gap. We present a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon. The corpus is a collection of newswire documents annotated on the sentence level. We also describe an automatic SSA tagging system that exploits the annotated data. We investigate the impact of different levels of preprocessing settings on the SSA classification task. We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance.
2 0.19110093 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
Author: Sarah Alkuhlani ; Nizar Habash
Abstract: We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morphosyntactic phenomena.
3 0.18436132 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
Author: Yuval Marton ; Nizar Habash ; Owen Rambow
Abstract: We explore the contribution of morphological features both lexical and inflectional to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and the undiacritzed lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of form-based and functional features, and show that functional gender and number (e.g., “broken plurals”) and the related rationality feature improve over form-based features. It is the first time functional morphological features are used for Arabic NLP. – –
4 0.16310436 204 acl-2011-Learning Word Vectors for Sentiment Analysis
Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts
Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
5 0.1606892 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
Author: Awais Athar
Abstract: Sentiment analysis of citations in scientific papers and articles is a new and interesting problem due to the many linguistic differences between scientific texts and other genres. In this paper, we focus on the problem of automatic identification of positive and negative sentiment polarity in citations to scientific papers. Using a newly constructed annotated citation sentiment corpus, we explore the effectiveness of existing and novel features, including n-grams, specialised science-specific lexical features, dependency relations, sentence splitting and negation features. Our results show that 3-grams and dependencies perform best in this task; they outperform the sentence splitting, science lexicon and negation based features.
6 0.15049993 292 acl-2011-Target-dependent Twitter Sentiment Classification
7 0.14652477 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System
9 0.14498292 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
10 0.13985391 299 acl-2011-The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content
11 0.13959786 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
12 0.13731888 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
13 0.1172285 253 acl-2011-PsychoSentiWordNet
14 0.11205387 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
15 0.109029 105 acl-2011-Dr Sentiment Knows Everything!
16 0.10290558 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models
17 0.10171849 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
18 0.092837267 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
19 0.090279184 162 acl-2011-Identifying the Semantic Orientation of Foreign Words
20 0.088166952 159 acl-2011-Identifying Noun Product Features that Imply Opinions
topicId topicWeight
[(0, 0.172), (1, 0.155), (2, 0.143), (3, -0.094), (4, -0.006), (5, 0.022), (6, 0.126), (7, 0.005), (8, 0.039), (9, 0.082), (10, -0.03), (11, -0.017), (12, -0.225), (13, 0.07), (14, 0.143), (15, -0.123), (16, 0.043), (17, -0.018), (18, -0.118), (19, -0.048), (20, -0.029), (21, 0.032), (22, 0.034), (23, -0.013), (24, -0.04), (25, 0.041), (26, -0.014), (27, 0.019), (28, 0.098), (29, 0.06), (30, 0.004), (31, 0.039), (32, -0.054), (33, 0.028), (34, -0.018), (35, 0.024), (36, 0.007), (37, -0.021), (38, 0.008), (39, -0.026), (40, -0.021), (41, 0.023), (42, -0.015), (43, -0.025), (44, 0.041), (45, 0.008), (46, 0.008), (47, -0.002), (48, -0.009), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.92415035 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic
Author: Muhammad Abdul-Mageed ; Mona Diab ; Mohammed Korayem
Abstract: Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). In the current study, we report efforts to partially fill this gap. We present a newly developed manually annotated corpus ofModern Standard Arabic (MSA) together with a new polarity lexicon.The corpus is a collection of newswire documents annotated on the sentence level. We also describe an automatic SSA tagging system that exploits the annotated data. We investigate the impact of different levels ofpreprocessing settings on the SSA classification task. We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance.
2 0.78448164 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
Author: Sarah Alkuhlani ; Nizar Habash
Abstract: We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morphosyntactic phenomena.
3 0.74436432 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
Author: Yuval Marton ; Nizar Habash ; Owen Rambow
Abstract: We explore the contribution of morphological features both lexical and inflectional to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and the undiacritzed lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of form-based and functional features, and show that functional gender and number (e.g., “broken plurals”) and the related rationality feature improve over form-based features. It is the first time functional morphological features are used for Arabic NLP. – –
4 0.68820959 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
Author: Nizar Habash ; Ryan Roth
Abstract: Arabic handwriting recognition (HR) is a challenging problem due to Arabic’s connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. Our best approach achieves a roughly ∼15% absolute increase in F-score aov reoru ag hsliym ∼pl1e5 b%ut a rbesaoslounteab inlec breaasseeli inne. F -Asc doreetailed error analysis shows that linguistic features, such as lemma (i.e., citation form) models, help improve HR-error detection precisely where we expect them to: semantically incoherent error words.
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
6 0.55562985 51 acl-2011-Automatic Headline Generation using Character Cross-Correlation
7 0.52438349 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System
9 0.51058251 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
10 0.50165641 159 acl-2011-Identifying Noun Product Features that Imply Opinions
11 0.50012243 211 acl-2011-Liars and Saviors in a Sentiment Annotated Corpus of Comments to Political Debates
12 0.49250904 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
13 0.48755044 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
14 0.47740266 204 acl-2011-Learning Word Vectors for Sentiment Analysis
15 0.47357544 194 acl-2011-Language Use: What can it tell us?
16 0.46843147 292 acl-2011-Target-dependent Twitter Sentiment Classification
17 0.4617002 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
18 0.45064494 162 acl-2011-Identifying the Semantic Orientation of Foreign Words
19 0.44509667 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
20 0.43085423 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models
topicId topicWeight
[(5, 0.024), (17, 0.034), (26, 0.028), (29, 0.217), (31, 0.026), (37, 0.125), (39, 0.053), (41, 0.032), (55, 0.074), (59, 0.088), (72, 0.036), (80, 0.033), (91, 0.042), (96, 0.103)]
simIndex simValue paperId paperTitle
same-paper 1 0.77510017 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic
Author: Muhammad Abdul-Mageed ; Mona Diab ; Mohammed Korayem
Abstract: Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). In the current study, we report efforts to partially fill this gap. We present a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon. The corpus is a collection of newswire documents annotated on the sentence level. We also describe an automatic SSA tagging system that exploits the annotated data. We investigate the impact of different levels of preprocessing settings on the SSA classification task. We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance.
Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu
Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18.16% over jazzy spell checker on the two test sets respectively).
3 0.66444349 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
Author: Yuval Marton ; Nizar Habash ; Owen Rambow
Abstract: We explore the contribution of morphological features both lexical and inflectional to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and the undiacritzed lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of form-based and functional features, and show that functional gender and number (e.g., “broken plurals”) and the related rationality feature improve over form-based features. It is the first time functional morphological features are used for Arabic NLP. – –
4 0.64763111 85 acl-2011-Coreference Resolution with World Knowledge
Author: Altaf Rahman ; Vincent Ng
Abstract: While world knowledge has been shown to improve learning-based coreference resolvers, the improvements were typically obtained by incorporating world knowledge into a fairly weak baseline resolver. Hence, it is not clear whether these benefits can carry over to a stronger baseline. Moreover, since there has been no attempt to apply different sources of world knowledge in combination to coreference resolution, it is not clear whether they offer complementary benefits to a resolver. We systematically compare commonly-used and under-investigated sources of world knowledge for coreference resolution by applying them to two learning-based coreference models and evaluating them on documents annotated with two different annotation schemes.
5 0.64353412 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach
Author: Jenny Liu ; Aria Haghighi
Abstract: In this work, we present a novel approach to the generation task of ordering prenominal modifiers. We take a maximum entropy reranking approach to the problem which admits arbitrary features on a permutation of modifiers, exploiting hundreds of thousands of features in total. We compare our error rates to the state-of-the-art and to a strong Google n-gram count baseline. We attain a maximum error reduction of 69.8% and average error reduction across all test sets of 59.1% compared to the state-of-the-art and a maximum error reduction of 68.4% and average error reduction across all test sets of 41.8% compared to our Google n-gram count baseline.
6 0.6417551 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
7 0.64092964 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.63639605 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
9 0.63617313 44 acl-2011-An exponential translation model for target language morphology
10 0.63372159 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
11 0.62639213 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
12 0.62393999 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
13 0.62382746 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
14 0.62378585 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
15 0.62343621 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
16 0.62183809 292 acl-2011-Target-dependent Twitter Sentiment Classification
17 0.62107617 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
18 0.61951667 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
19 0.61858082 334 acl-2011-Which Noun Phrases Denote Which Concepts?
20 0.61728138 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering