acl acl2013 acl2013-327 knowledge-graph by maker-knowledge-mining

327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

Source: pdf

Author: Kyumars Sheykh Esmaili ; Shahin Salavati

Abstract: Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison Kyumars Sheykh Esmaili Nanyang Technological University N4-B2a-02 Singapore kyumars s @ ntu . [sent-1, score-0.048]

2 s g Abstract Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. [sent-3, score-0.105]

3 1 Introduction Despite having 20 to 30 million native speakers (Haig and Matras, 2002; Hassanpour et al. [sent-5, score-0.012]

4 , 2012; Thackston, 2006b; Thackston, 2006a), Kurdish is among the less-resourced languages for which the only linguistic resource available on the Web is raw text (Walther and Sagot, 2010). [sent-6, score-0.012]

5 Apart from the resource-scarcity problem, its diversity –in both dialect and writing systems– is another primary challenge in Kurdish language processing (Gautier, 1998; Gautier, 1996; Esmaili, 2012). [sent-7, score-0.121]

6 , 2012): the Sorani dialect written in an Arabic-based alphabet and the Kurmanji dialect written in a Latin-based alphabet. [sent-9, score-0.135]

7 The features distinguishing these two dialects are phonological, lexical, and morphological. [sent-10, score-0.117]

8 In this paper we report on the first outcomes of a project at the University of Kurdistan (UoK) that aims at addressing these two challenges of Kurdish language processing. [sent-11, score-0.031]

9 htm Shahin Salavati University of Kurdistan Sanandaj Iran shahin . [sent-17, score-0.036]

10 we present some insights into the orthographic, phonological, and morphological differences between Sorani Kurdish and Kurmanji Kurdish. [sent-20, score-0.048]

11 In Section 2, we first briefly introduce the Kurdish language and its two main dialects then underline their differences from a rule-based (a. [sent-22, score-0.144]

12 Next, after presenting the Pewan text corpus in Section 3, we use it to conduct a statistical comparison of the two dialects in Section 4. [sent-26, score-0.117]

13 It is one of the two official languages of Iraq and has a regional status in Iran. [sent-31, score-0.027]

14 Kurdish is a dialect-rich language, sometimes referred to as a dialect continuum (Matras and Akin, 2012; Shahsavari, 2010). [sent-32, score-0.061]

15 In this paper, however, we focus on Sorani and Kurmanji which are the two closely-related and widely-spoken dialects of the Kurdish language. [sent-33, score-0.117]

16 Together, they account for more than 75% of native Kurdish speakers (Walther and Sagot, 2010). [sent-34, score-0.028]

17 As summarized below, these two dialects differ not only in some linguistic aspects, but also in their writing systems. [sent-35, score-0.148]

18 1 Morphological Differences The important morphological differences are (MacKenzie, 1961; Haig and Matras, 2002; Samvelian, 2007): 1. [sent-37, score-0.048]

19 Sorani has largely abandoned this system and uses the pronominal suffixes to take over the functions of the cases, 2. [sent-41, score-0.028]

20 in the past-tense transitive verbs, Kurmanji has the full ergative alignment3 but Sorani, having lost the oblique pronouns, resorts to pronominal enclitics, 3. [sent-42, score-0.033]

21 in Sorani, passive and causative are created via verb morphology; in Kurmanji they can also be formed with the helper verbs hatin (“to come”) and dan (“to give”) respectively, and 4. [sent-43, score-0.044]

22 2 Scriptural Differences Due to geopolitical reasons (Matras and Reershemius, 1991), each of the two dialects has been using its own writing system: while Sorani uses an Arabic-based alphabet, Kurmanji is written in a Latin-based one. [sent-46, score-0.148]

23 2Although there is evidence of gender distinctions weakening in some varieties of Kurmanji (Haig and Matras, 2002). [sent-50, score-0.021]

24 3Recent research suggests that ergativity in Kurmanji is weakening due to either internally-induced change or contact with Turkish (Dixon, 1994; Dorleijn, 1996; Mahalingappa, 2010), perhaps moving towards a full nominative-accusative system. [sent-51, score-0.045]

25 It should be noted that both of these writing systems are phonetic (Gautier, 1998); that is, vowels are explicitly represented and their use is mandatory. [sent-53, score-0.031]

26 At UoK, we followed TREC (TREC, 2013)’s common practice and used news articles to build a text corpus for the Kurdish language. [sent-56, score-0.024]

27 For each agency, we developed a crawler to fetch the articles and extract their textual content. [sent-59, score-0.024]

28 In case of Peyamner, since articles have no language label, we additionally implemented a simple classifier that decides each page’s language 4Although there Kurmanji too. [sent-60, score-0.024]
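The text above does not say how the language classifier works. As a hedged illustration only, the sketch below guesses the dialect from the dominant script, relying on the point made earlier that Sorani is written in an Arabic-based alphabet and Kurmanji in a Latin-based one. The function name guess_dialect_by_script and the Unicode ranges used are assumptions of this sketch, not the authors' method, and the heuristic would mislabel any Kurmanji text written in Arabic script.

```python
def guess_dialect_by_script(text):
    """Hypothetical heuristic: label a page by its dominant script."""
    # Characters in the basic Arabic Unicode block (U+0600 to U+06FF).
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # Letters up to Latin Extended-B (U+024F), which covers ç, ê, î, ş, û.
    latin = sum(1 for ch in text if ch.isalpha() and ord(ch) <= 0x024F)
    if arabic > latin:
        return "sorani"
    if latin > arabic:
        return "kurmanji"
    return "unknown"  # no letters, or an exact tie


# Toy inputs for illustration; they are not taken from the Pewan corpus.
print(guess_dialect_by_script("ji mal jî"))
print(guess_dialect_by_script("کوردی"))
```

A production classifier would more likely combine such a script check with word-level evidence, for example the stopword lists mentioned later in this summary.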

29 Overall, 115,340 Sorani articles and 25,572 Kurmanji articles were collected5 . [sent-66, score-0.048]

30 The articles are dated between 2003 and 2012 and their sizes range from 1KB to 154KB (on average 2. [sent-67, score-0.024]

31 The final Sorani and Kurmanji lists contain 157 and 152 words respectively, and as in other languages, they mainly consist of prepositions. [sent-74, score-0.017]

32 Pewan, as well as the stopword lists can be obtained from (Pewan, 2013). [sent-75, score-0.032]

33 4 Empirical Study In the first part of this section, we first look at the character and word frequencies and try to obtain some insights about the phonological and lexical correlations and discrepancies between Sorani and Kurmanji. [sent-77, score-0.136]

34 In the second part, we investigate two well-known linguistic laws –Heaps’ and Zipf’s. [sent-78, score-0.05]

35 Although these laws have been observed in many of the Indo-European languages (Lü et al. [sent-79, score-0.062]

36 , 2013), their coefficients depend on language (Gelbukh and Sidorov, 2001) and therefore they can be 5The relatively small size of the Kurmanji collection is part of a more general trend. [sent-80, score-0.027]

37 In fact, despite having a larger number of speakers, Kurmanji has far fewer online sources with raw text readily available and even those sources do not strictly follow its writing standards. [sent-81, score-0.031]

38 This is partly a result of decades of severe restrictions on use of Kurdish language in Turkey, where the majority of Kurmanji speakers live (Hassanpour et al. [sent-82, score-0.016]

39 It should also be noted that in practice, knowing the coefficients of these laws is important in, for example, full-text database design, since it allows predicting some properties of the index as a function of the size of the database. [sent-85, score-0.077]

40 1 Character Frequencies In this experiment we measure the character frequencies, as a phonological property of the language. [sent-87, score-0.088]

41 Figure 2 shows the frequency-ranked lists (from left to right, in decreasing order) of characters of both dialects in the Pewan corpus. [sent-88, score-0.169]
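A frequency-ranked character list of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper name ranked_char_frequencies is hypothetical, only alphabetic characters are counted, and upper and lower case are merged. The preprocessing actually applied to Pewan is not specified in this summary.

```python
from collections import Counter


def ranked_char_frequencies(text):
    """Return (character, relative frequency) pairs, most frequent first."""
    # Assumption: keep letters only and case-fold so that 'J' and 'j' merge.
    counts = Counter(ch for ch in text.casefold() if ch.isalpha())
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]


# Toy Latin-script string for illustration; not taken from the Pewan corpus.
for ch, freq in ranked_char_frequencies("ji mal jî"):
    print(ch, round(freq, 3))
```

Running the same function over each dialect's text and comparing the ranks of equivalent characters gives the kind of side-by-side view the figure presents.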

42 Note that for a fairer comparison, we have excluded characters with 1-to-0 and 1-to-2 mappings as well as three characters from the list of 1-to-1 mappings: A, Ê, and Û. [sent-89, score-0.112]

43 Overall, the relative positions of the equivalent characters in these two lists are comparable (Figure 2). [sent-92, score-0.052]

44 However, there are two notable discrepancies which further exhibit the intrinsic phonological differences between Sorani and Kurmanji: use of the character J is far more common in Kurmanji (e. [sent-93, score-0.138]

45 , in prepositions such as ji “from” and jî “too”), same holds for the character V; this is, how- 6Izafe construction is a shared feature of several Western Iranian languages (Samvelian, 2006). [sent-95, score-0.063]

46 It, approximately, corresponds to the English preposition “of” and is added between prepositions, nouns and adjectives in a phrase (Shamsfard, 2011). [sent-96, score-0.016]

47 [Figure residue: panel (a) Standard Representation; axis label: Total Number of Words] [sent-107, score-0.012]

48 ever, due to Sorani’s phonological tendency to use the phoneme W instead of V. [sent-109, score-0.087]

49 3 Heaps’ Law Heaps’s law (Heaps, 1978) is about the growth of distinct words (a. [sent-117, score-0.076]

50 More specifically, the number of distinct words in a text is roughly proportional to a power of its size: log n_i ≈ D + h log i (1) [sent-120, score-0.077]

51 [Table 2: Heaps’ Linear Regression, with columns Language, log n_i, and h for Persian, English, Sorani, and Kurmanji; values lost in extraction] where n_i is the number of distinct words occurring before the running word number i, h is the exponent coefficient (between 0 and 1), and D is a constant. [sent-123, score-0.063]

52 On a logarithmic scale, it is a straight line at an angle of about 45° (Gelbukh and Sidorov, 2001). [sent-124, score-0.051]
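Because Heaps' law is linear in log-log space, the coefficients D and h can be estimated with an ordinary least-squares line fit. The sketch below is illustrative only: the function names heaps_points and fit_heaps are hypothetical, base-10 logarithms are an assumption (the slope h does not depend on the base), and the distinct-word count includes the running word itself so that the first point is well defined.

```python
import math


def heaps_points(tokens, step=1):
    """Return (i, n_i) pairs: running word number and distinct words so far."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)  # assumption: word i itself is included in n_i
        if i % step == 0:
            points.append((i, len(seen)))
    return points


def fit_heaps(points):
    """Least-squares fit of log n_i = D + h log i; returns (D, h)."""
    xs = [math.log10(i) for i, _ in points]
    ys = [math.log10(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    h = sxy / sxx              # slope: the exponent coefficient
    D = mean_y - h * mean_x    # intercept: the constant
    return D, h


# Toy token stream for illustration; a real run would use corpus tokens.
tokens = "a b a c b d a e".split()
D, h = fit_heaps(heaps_points(tokens))
print(f"D = {D:.3f}, h = {h:.3f}")
```

With fitted values in hand, the expected number of distinct words at corpus size i is 10 ** (D + h * math.log10(i)), which is the kind of prediction the text mentions for index sizing.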

53 We carried out an experiment to measure the growth rate of distinct words for both of the Kurdish dialects as well as the Persian and English languages. [sent-125, score-0.162]

54 , 2009) and the English corpus consisted of the Editorial articles of The Guardian newspaper7 (Guardian, 2013). [sent-127, score-0.024]

55 As the curves in Figure 4 and the linear regression coefficients in Table 2 show, the growth rate of distinct words in both Sorani and Kurmanji Kurdish is higher than in Persian and English. [sent-128, score-0.104]

56 This result demonstrates the morphological complexity of the Kurdish language (Samvelian, 2007; Walther, 2011). [sent-129, score-0.021]

57 Another important observation from this experiment is that Sorani has a higher growth rate compared to Kurmanji (h = 0. [sent-131, score-0.028]

58 7Since they are written by native speakers, cover a wide spectrum of topics between 2006 and 2013, and have clean HTML sources. [sent-135, score-0.012]

59 On a logarithmic scale, it is a straight line at an angle of about 45° (Gelbukh and Sidorov, 2001). [sent-143, score-0.051]

60 5 Conclusions and Future Work In this paper we took the first steps towards addressing the two main challenges in Kurdish language processing, namely, resource scarcity and diversity. [sent-147, score-0.046]

61 We presented Pewan, a text corpus for Sorani and Kurmanji, the two principal dialects of the Kurdish language. [sent-148, score-0.117]

62 We also highlighted a range of differences between these two dialects and their writing systems. [sent-149, score-0.175]

63 Some of the discrepancies are due to the existence of a generic preposition ( ) in Sorani, as well as the general tendency in its writing system and style to use prepositions as suffixes. [sent-154, score-0.12]

64 In the future, we plan to first develop stemming algorithms for both Sorani and Kurmanji and then leverage those algorithms to examine the lexical differences between the two dialects. [sent-158, score-0.027]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kurdish', 0.573), ('sorani', 0.538), ('kurmanji', 0.478), ('pewan', 0.155), ('dialects', 0.117), ('heaps', 0.096), ('matras', 0.084), ('zipf', 0.07), ('phonological', 0.068), ('dialect', 0.061), ('gautier', 0.06), ('kurdistan', 0.06), ('persian', 0.054), ('walther', 0.053), ('laws', 0.05), ('esmaili', 0.048), ('haig', 0.048), ('hassanpour', 0.048), ('kyumars', 0.048), ('peyamner', 0.048), ('mappings', 0.042), ('gelbukh', 0.037), ('samvelian', 0.036), ('shahin', 0.036), ('sheykh', 0.036), ('sidorov', 0.036), ('uok', 0.036), ('yaron', 0.036), ('characters', 0.035), ('exponent', 0.032), ('iran', 0.032), ('writing', 0.031), ('prepositions', 0.031), ('law', 0.031), ('iraq', 0.029), ('helper', 0.029), ('growth', 0.028), ('sagot', 0.028), ('alphabets', 0.028), ('differences', 0.027), ('coefficients', 0.027), ('marker', 0.025), ('frequencies', 0.025), ('articles', 0.024), ('absaesded', 0.024), ('aleahmad', 0.024), ('barkhoda', 0.024), ('ergativity', 0.024), ('hamshahri', 0.024), ('icb', 0.024), ('izafe', 0.024), ('languagelog', 0.024), ('pollet', 0.024), ('salavati', 0.024), ('thackston', 0.024), ('discrepancies', 0.023), ('ii', 0.023), ('morphological', 0.021), ('eraldine', 0.021), ('guardian', 0.021), ('logarithmic', 0.021), ('voa', 0.021), ('weakening', 0.021), ('wheeler', 0.021), ('erard', 0.021), ('voice', 0.02), ('character', 0.02), ('tendency', 0.019), ('oblique', 0.018), ('curves', 0.017), ('lists', 0.017), ('distinct', 0.017), ('straight', 0.016), ('challenges', 0.016), ('speakers', 0.016), ('diversity', 0.016), ('preposition', 0.016), ('regression', 0.015), ('addressing', 0.015), ('regional', 0.015), ('pronominal', 0.015), ('hat', 0.015), ('scarcity', 0.015), ('stopword', 0.015), ('morphology', 0.014), ('log', 0.014), ('angle', 0.014), ('coefficient', 0.014), ('harvard', 0.014), ('turkey', 0.014), ('alphabet', 0.013), ('definite', 0.013), ('trec', 0.013), ('suffixes', 0.013), ('america', 0.013), ('primary', 0.013), ('orthographic', 0.012), ('native', 0.012), 
('languages', 0.012), ('sd', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

Author: Kyumars Sheykh Esmaili ; Shahin Salavati

2 0.047929268 359 acl-2013-Translating Dialectal Arabic to English

Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov

Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.

3 0.030803787 317 acl-2013-Sentence Level Dialect Identification in Arabic

Author: Heba Elfardy ; Mona Diab

Abstract: This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The system achieves an accuracy of 85.5% on an Arabic online-commentary dataset outperforming a previously proposed approach achieving 80.9% and reflecting a significant gain over a majority baseline of 51.9% and two strong baseline systems of 78.5% and 80.4%, respectively.

4 0.028444575 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

Author: Burak Kerim Akkuş ; Ruket Cakici

Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.

5 0.024918705 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf

Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.

6 0.022497596 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

7 0.022154341 344 acl-2013-The Effects of Lexical Resource Quality on Preference Violation Detection

8 0.019393545 128 acl-2013-Does Korean defeat phonotactic word segmentation?

9 0.017444 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

10 0.016920289 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

11 0.016451135 41 acl-2013-Aggregated Word Pair Features for Implicit Discourse Relation Disambiguation

12 0.016306549 220 acl-2013-Learning Latent Personas of Film Characters

13 0.015849248 303 acl-2013-Robust multilingual statistical morphological generation models

14 0.014990231 80 acl-2013-Chinese Parsing Exploiting Characters

15 0.014636807 154 acl-2013-Extracting bilingual terminologies from comparable corpora

16 0.014422487 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

17 0.013804227 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

18 0.01359883 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

19 0.013423677 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

20 0.013335418 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.039), (1, 0.003), (2, -0.004), (3, -0.006), (4, -0.001), (5, -0.024), (6, -0.003), (7, 0.001), (8, 0.004), (9, 0.0), (10, -0.025), (11, -0.013), (12, 0.013), (13, -0.01), (14, -0.046), (15, -0.026), (16, -0.01), (17, -0.031), (18, -0.004), (19, 0.022), (20, -0.027), (21, 0.014), (22, 0.033), (23, 0.033), (24, 0.003), (25, 0.026), (26, 0.004), (27, -0.046), (28, 0.037), (29, -0.051), (30, -0.04), (31, -0.018), (32, -0.042), (33, 0.01), (34, 0.017), (35, 0.003), (36, -0.008), (37, 0.015), (38, 0.003), (39, 0.056), (40, -0.024), (41, -0.033), (42, -0.006), (43, 0.016), (44, -0.004), (45, -0.015), (46, -0.021), (47, 0.005), (48, 0.013), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85051858 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

Author: Kyumars Sheykh Esmaili ; Shahin Salavati

2 0.69922954 317 acl-2013-Sentence Level Dialect Identification in Arabic

Author: Heba Elfardy ; Mona Diab

3 0.61966807 359 acl-2013-Translating Dialectal Arabic to English

Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov

4 0.46250612 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

Author: Kareem Darwish

Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.

5 0.437327 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

Author: Burak Kerim Akkuş ; Ruket Cakici

6 0.42925614 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

7 0.41674891 303 acl-2013-Robust multilingual statistical morphological generation models

8 0.41195834 286 acl-2013-Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words

9 0.40548947 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

10 0.4049978 257 acl-2013-Natural Language Models for Predicting Programming Comments

11 0.39281693 227 acl-2013-Learning to lemmatise Polish noun phrases

12 0.38769042 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

13 0.38518837 390 acl-2013-Word surprisal predicts N400 amplitude during reading

14 0.37226063 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

15 0.36867794 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

16 0.35635132 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

17 0.34121814 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling

18 0.33862984 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

19 0.32650456 171 acl-2013-Grammatical Error Correction Using Integer Linear Programming

20 0.32500273 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (6, 0.015), (11, 0.024), (15, 0.014), (24, 0.046), (26, 0.037), (29, 0.012), (35, 0.043), (42, 0.022), (48, 0.024), (70, 0.034), (88, 0.495), (90, 0.015), (95, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9259972 106 acl-2013-Decentralized Entity-Level Modeling for Coreference Resolution

Author: Greg Durrett ; David Hall ; Dan Klein

Abstract: Efficiently incorporating entity-level information is a challenge for coreference resolution systems due to the difficulty of exact inference over partitions. We describe an end-to-end discriminative probabilistic model for coreference that, along with standard pairwise features, enforces structural agreement constraints between specified properties of coreferent mentions. This model can be represented as a factor graph for each document that admits efficient inference via belief propagation. We show that our method can use entity-level information to outperform a basic pairwise system.

2 0.91246027 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

Author: Srinivasan Janarthanam ; Oliver Lemon ; Phil Bartie ; Tiphaine Dalmas ; Anna Dickinson ; Xingkun Liu ; William Mackaness ; Bonnie Webber

Abstract: We present a city navigation and tourist information mobile dialogue app with integrated question-answering (QA) and geographic information system (GIS) modules that helps pedestrian users to navigate in and learn about urban environments. In contrast to existing mobile apps which treat these problems independently, our Android app addresses the problem of navigation and touristic question-answering in an integrated fashion using a shared dialogue context. We evaluated our system in comparison with Samsung S-Voice (which interfaces to Google navigation and Google search) with 17 users and found that users judged our system to be significantly more interesting to interact with and learn from. They also rated our system above Google search (with the Samsung S-Voice interface) for tourist information tasks.

same-paper 3 0.8759535 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

Author: Kyumars Sheykh Esmaili ; Shahin Salavati

4 0.82421601 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

Author: Ryo Nagata ; Edward Whittaker

Abstract: Mother tongue interference is the phenomenon where linguistic systems of a mother tongue are transferred to another language. Although there has been plenty of work on mother tongue interference, very little is known about how strongly it is transferred to another language and about what relation there is across mother tongues. To address these questions, this paper explores and visualizes mother tongue interference preserved in English texts written by Indo-European language speakers. This paper further explores linguistic features that explain why certain relations are preserved in English writing, and which contribute to related tasks such as native language identification.

5 0.7812866 136 acl-2013-Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text

Author: Ryan Georgi ; Fei Xia ; William D. Lewis

Abstract: As most of the world’s languages are under-resourced, projection algorithms offer an enticing way to bootstrap the resources available for one resource-poor language from a resource-rich language by means of parallel text and word alignment. These algorithms, however, make the strong assumption that the language pairs share common structures and that the parse trees will resemble one another. This assumption is useful but often leads to errors in projection. In this paper, we will address this weakness by using trees created from instances of Interlinear Glossed Text (IGT) to discover patterns of divergence between the languages. We will show that this method improves the performance of projection algorithms significantly in some languages by accounting for divergence between languages using only the partial supervision of a few corrected trees.

6 0.75957781 41 acl-2013-Aggregated Word Pair Features for Implicit Discourse Relation Disambiguation

7 0.73625916 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis

8 0.60998988 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD

9 0.55829716 252 acl-2013-Multigraph Clustering for Unsupervised Coreference Resolution

10 0.50896966 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context

11 0.45936027 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

12 0.45735848 130 acl-2013-Domain-Specific Coreference Resolution with Lexicalized Features

13 0.44282088 292 acl-2013-Question Classification Transfer

14 0.43846571 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

15 0.4295826 177 acl-2013-GuiTAR-based Pronominal Anaphora Resolution in Bengali

16 0.4238255 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

17 0.42112643 97 acl-2013-Cross-lingual Projections between Languages from Different Families

18 0.42034706 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

19 0.41836464 382 acl-2013-Variational Inference for Structured NLP Models

20 0.41312465 253 acl-2013-Multilingual Affect Polarity and Valence Prediction in Metaphor-Rich Texts