
26 emnlp-2013-Assembling the Kazakh Language Corpus


Source: pdf

Author: Olzhas Makhambetov ; Aibek Makazhanov ; Zhandos Yessenbayev ; Bakhyt Matkarimov ; Islam Sabyrgaliyev ; Anuar Sharafudinov

Abstract: This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part KLC comprises such parts as: (i) annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; (ii) as well as a sub-corpus with the annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open for contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. [sent-5, score-0.028]

2 KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. [sent-6, score-0.152]

3 KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. [sent-8, score-0.036]

4 KLC is also open for contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials. [sent-9, score-0.126]

5 Kazakh is an agglutinative and highly inflected language which belongs to the Turkic group. [sent-11, score-0.071]

6 It is the official state language of Kazakhstan and a mother tongue for more than 10 million people all around the world. [sent-12, score-0.091]

7 However, up until the early 1990s, for historical reasons dating to the Soviet era, Russian was the predominant language in spoken and written communication in Kazakhstan. [sent-13, score-0.067]

8 This in turn led to the underrepresentation of the Kazakh language in various fields such as science, entertainment, official documentation, etc. [sent-14, score-0.063]

9 For this reason, while assembling the corpus, we had to group categories that are generally presented as separate in other corpora into five stylistic genres. [sent-15, score-0.112]

10 , 2012; Chen, 1996), we included texts as they were available, i. [sent-17, score-0.064]

11 A substantial part of the materials was collected using source-customized web crawlers and donated texts. [sent-22, score-0.131]

12 KLC also contains a manually annotated sub-corpus with morpho-syntactic and structural markups encoded in XML following general notions outlined in CES (Ide, 1998). [sent-23, score-0.095]

13 The annotations have been carried out manually by philology students specializing in morphology and syntax. [sent-25, score-0.035]

14 Trying to make the annotation process as comfortable as possible, we have designed a web-based annotation tool with a user-friendly interface. [sent-26, score-0.198]

15 We took great care over annotation quality: to that end we (i) arranged a validation process, and (ii) equipped the tool with a recommendation system that, as we will show, improves the inter-annotator agreement. [sent-27, score-0.133]
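The agreement check mentioned above can be illustrated with a small sketch. The source does not say which agreement measure was used; Cohen's kappa for two annotators is assumed here purely for illustration, and the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the paper does not name its agreement measure; kappa is
    used here only as a common choice.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical POS labels from two annotators for four word forms.
annotator_1 = ["N", "V", "N", "ADJ"]
annotator_2 = ["N", "V", "V", "ADJ"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Computing the value once with and once without the recommendation system switched on would give the kind of comparison the sentence above refers to.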

16 As a part of KLC we have also compiled the annotated read-speech corpus (RSC), which includes audio recordings of words, phrases, sentences (from all genres), news articles and excerpts from books, that were carefully chosen from the primary part of the corpus. [sent-28, score-0.203]

17 All text materials were read by volunteers who represented different age, gender, region and education backgrounds in a balanced way. [sent-29, score-0.087]

18 Each audio file is accompanied by a label file and a corresponding text transcript. [sent-30, score-0.12]

19 in addition to a word-level segmentation of audio information a portion of our data has lexical, and morpho-syntactic annotations. [sent-33, score-0.036]

20 Section 3 provides detailed information about the primary corpus. [sent-37, score-0.043]

21 Sections 4 and 5 thoroughly describe annotated text and speech sub-corpora respectively. [sent-38, score-0.051]

22 [Page footer: pages 1022–1031, ©2013 Association for Computational Linguistics.] 2 Related Work Since the pioneering corpus of Brown University was completed in 1964 by Francis and Kučera (1979), corpus linguistics has become a thriving research field. [sent-42, score-0.072]

23 All materials were selected on a basis of three independent criteria (medium, domain and time), where each criterion had predefined target proportions. [sent-44, score-0.087]

24 The spoken part (remaining 10%) consists of orthographic transcriptions of unscripted informal conversations and spoken language collected in different contexts. [sent-45, score-0.105]

25 , adopted it as a model for compiling their own corpora. [sent-50, score-0.035]

26 The Russian National Corpus (RNC) has been released by the group of specialists from different organizations led by the Institute of Russian language, Russian Academy of Sciences (Ruscorpora, 2003). [sent-51, score-0.031]

27 The corpus covers primarily a period from the middle of the XVIII to the early XXI centuries. [sent-52, score-0.036]

28 It includes both written texts (fiction, memoirs, science, religious literature and others) and recorded spoken data (public speeches and private conversations). [sent-53, score-0.104]

29 Currently RNC contains over 350 million word forms that are automatically POS-tagged and lemmatized. [sent-54, score-0.028]

30 The corpus also includes semantic tags for words and texts (Apresjan et al. [sent-55, score-0.153]

31 Unfortunately, up until now, not too much work has been accomplished in developing a corpus that will represent Kazakh language. [sent-58, score-0.036]

32 To the best of our knowledge there has been a limited number of attempts to compile one, but resulting corpora are too small in size and scope, or not available to the public. [sent-59, score-0.052]

33 A Kazakh corpus has been initiated by the Committee on Languages of the Ministry of Culture of the Republic of Kazakhstan (CLMCRK, 2009). [sent-60, score-0.036]

34 This corpus is small in size and not annotated, as it remains in its very early stage of development. [sent-61, score-0.036]

35 as a part of larger corpus of Turkic languages (Baisa and Suchomel, 2012). [sent-63, score-0.036]

36 This corpus was compiled using a web crawler that selected texts based on a language model trained on Wikipedia texts. [sent-64, score-0.189]

37 Although the obtained corpus is relatively large in size, the data was not categorized by genres. [sent-65, score-0.036]

38 Also, since the crawler was not source-customized, the corpus may contain some noise in the form of text in Russian or other languages. [sent-66, score-0.088]

39 We also could not find enough information about a Kazakh corpus that has been developed at Xinjiang University and used in their research (Altenbek and Xiao-long, 2010). [sent-67, score-0.036]

40 The absence of an available corpus that will be large enough to represent Kazakh language decelerates many research activities (Mukan, 2012). [sent-68, score-0.036]

41 We believe that building an open Kazakh corpus will have a significant impact and that it will be a very useful tool in the analysis of Kazakh. [sent-69, score-0.08]

42 3 KLC Primary Corpus KLC is one of the first attempts to build a large scale, general purpose corpus that represents the present state of Kazakh language. [sent-70, score-0.064]

43 ); (4) publicistic section contains periodicals and articles from online sources, i. [sent-72, score-0.044]

44 newspapers and magazines published over the last ten years; (5) informal language section includes documents with colloquial Kazakh texts extracted from the popular blog platforms starting from 2009. [sent-74, score-0.11]

45 We have to note that while compiling this corpus we intentionally relaxed the document selection criteria by not restricting the collected data to particular domains, media, and time. [sent-75, score-0.071]

46 This was mainly dictated by the lack of materials, and partially due to the reasons mentioned in the introduction. [sent-76, score-0.027]

47 Our main sources of data were Internet websites as well as digitized forms of books, dissertations and articles from public and personal libraries. [sent-77, score-0.155]

48 For each website we designed a source-specific crawler, thereby increasing the precision of the meta data (e. [sent-78, score-0.057]

49 Additionally, we filtered out documents with a high proportion of Russian text by matching them against a language model trained on pure Russian texts. [sent-82, score-0.143]
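A minimal sketch of such a filter follows. The source only says that a language model trained on pure Russian texts was used; the character-bigram model with add-one smoothing, the class and function names, the tiny training sample and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharBigramLM:
    """Add-one smoothed character bigram model (illustrative stand-in only)."""

    def __init__(self, training_text):
        text = training_text.lower()
        # One extra vocabulary slot reserves probability mass for unseen characters.
        self.vocab_size = len(set(text)) + 1
        self.unigrams = Counter(text[:-1])
        self.bigrams = Counter(zip(text, text[1:]))

    def avg_log_prob(self, text):
        """Average log-probability per character bigram of `text`."""
        text = text.lower()
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return float("-inf")
        total = 0.0
        for first, second in pairs:
            numerator = self.bigrams[(first, second)] + 1
            denominator = self.unigrams[first] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / len(pairs)


def looks_russian(document, russian_lm, threshold=-2.5):
    """Flag a document whose score under the Russian model is high.

    The threshold is hypothetical and would need tuning on held-out data.
    """
    return russian_lm.avg_log_prob(document) > threshold


# Toy training text; a real model would be trained on a large Russian corpus.
RUSSIAN_SAMPLE = "это пример русского текста. русский язык и русский текст."
lm = CharBigramLM(RUSSIAN_SAMPLE)
```

With this toy model, `looks_russian("русский текст", lm)` flags the Russian phrase, while a Kazakh phrase such as "қазақ тілі үшін" scores lower because its letters and letter pairs are mostly unseen in the Russian training text.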

50 It took about 7 months to grow the corpus to its current size. [sent-85, score-0.036]

51 Table 1 provides a general quantitative description of the corpus. [sent-86, score-0.064]

52 We release the data under a license that, in accordance with Kazakhstan's law, allows distribution of some materials in whole (official documents, news articles) and some only in part (literature, scientific texts, analytics) provided that sources are properly cited. [sent-87, score-0.184]

53 This license does not allow printed or electronic publications or similar use of substantial portions of text drawn from the corpus without the permission of its original publisher(s) or copyright holder(s). [sent-88, score-0.035]

54 1 Text Documents Description Each document is stored in plain text format in UTF-8 encoding. [sent-90, score-0.036]

55 Provided that the corresponding information is present in a source, the tag contains both the name of the section of the corpus to which a document belongs and a further categorical sub-division, such as the type of a literary work, e. [sent-92, score-0.16]

56 That is, whenever possible such categories are assigned automatically, e. [sent-95, score-0.027]

57 For sources that lack meta data, such as the digitized books, dissertations and scientific papers, the corresponding categories (informatics, biology, chemistry, etc. [sent-98, score-0.217]

58 2 Writing System of Kazakh language Kazakh adopts different writing systems depending on the regions where it is spoken (Cyrillic alphabet in Kazakhstan, Arabic and Latin graphics in other countries). [sent-101, score-0.08]

59 Recently the government of Kazakhstan has decided to switch the Kazakh alphabet to a Latin script. [sent-102, score-0.04]

60 [Table 2: A quantitative description of the annotated data; surviving row fragment: "9 lemmata, total 42 901".] Indeed, we have already provided a group working on this problem with statistical information about letter distributions in Kazakh texts. [sent-107, score-0.115]

61 This information could also aid in designing various speech corpora as well as a proper Kazakh keyboard layout. [sent-108, score-0.062]

62 It can be stated that the latter was done rather carelessly just as a simple adjustment to a Russian keyboard (Wikipedia, 2012). [sent-109, score-0.038]

63 The current Kazakh Cyrillic alphabet consists of 42 letters, of which 9 are specifically Kazakh letters and the others are adopted from the Russian alphabet. [sent-110, score-0.116]

64 Figure 1 shows the distribution of Kazakh letters in the corpus. [sent-111, score-0.043]

65 It can be seen that there is a small non-zero distribution of pure Russian letters (underlined). [sent-112, score-0.076]
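A letter distribution of this kind can be computed with a few lines of code. The sketch below is not the authors' tooling: the function name is hypothetical, and the alphabet is written out here as the 33 letters of the Russian alphabet plus the 9 specifically Kazakh letters, giving the 42 letters mentioned above.

```python
from collections import Counter

# The 9 specifically Kazakh letters and the 33 letters shared with Russian.
KAZAKH_SPECIFIC = set("әғқңөұүһі")
RUSSIAN_SHARED = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
ALPHABET = KAZAKH_SPECIFIC | RUSSIAN_SHARED


def letter_distribution(text):
    """Return the share of each alphabet letter in `text`, in percent.

    Characters outside the 42-letter alphabet (digits, punctuation,
    Latin letters, whitespace) are ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: 100.0 * n / total for letter, n in counts.items()}


# Tiny usage example on a single word.
distribution = letter_distribution("Қала!")
```

Run over the whole corpus, the same counts would also show how much mass falls on letters that occur only in Russian loanwords.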

66 4 The Annotated Sub-corpus In order to enhance the effectiveness of the corpus as a research tool, we have annotated a portion of the data for syntactic and POS tags, lemmata, and for morpheme types and boundaries. [sent-114, score-0.122]

67 Table 2 provides net amount and the percentages (with respect to the current size of the corpus) of the annotated data in terms of documents, words, unique words, and lemmata. [sent-115, score-0.051]

68 The annotation process has been carried out completely manually. [sent-116, score-0.062]

69 The annotation was performed mainly by the undergraduate students majoring in Kazakh philology. [sent-118, score-0.1]

70 As a quality control measure, two validators (a graduate student majoring in Kazakh philology and one of the authors) were assigned to check a random sample of about 10% of the annotated data. [sent-119, score-0.168]
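Drawing such a validation sample is straightforward; the sketch below shows one way to do it. The function name, the use of document identifiers as the sampling unit and the fixed seed are assumptions, since the source does not describe how the sample was drawn.

```python
import random


def validation_sample(doc_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of annotated documents to validate.

    Assumption: sampling is done per document; the paper only states that
    about 10% of the annotated data was checked.
    """
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []
    sample_size = max(1, round(fraction * len(doc_ids)))
    rng = random.Random(seed)  # fixed seed so validators see the same sample
    return sorted(rng.sample(doc_ids, sample_size))


# Hypothetical identifiers of 200 annotated documents.
annotated_docs = list(range(200))
to_validate = validation_sample(annotated_docs)
```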

71 Our analysis of validated data suggests that the annotation [Figure 1: The distribution of letters across the corpus, in %.] [sent-121, score-0.105]

72 To the best of our knowledge this is the first attempt to annotate Kazakh texts with various linguistic markups. [sent-123, score-0.064]

73 Given this, in the following subsections we would like to describe the tagsets (syntactic and POS), the annotation scheme (the format in which the annotated data is stored and distributed), and the annotation tool itself. [sent-124, score-0.315]

74 At the initial stage of the corpus development we did not plan to build a detailed treebank, leaving this task for the future work. [sent-127, score-0.036]

75 Therefore, our syntactic tagset comprises a compact set of syntactic categories well-defined in a classical grammar. [sent-128, score-0.395]

76 Table 3 contains the tagset description along with the equivalent tags defined in a widely used Penn Treebank (Marcus et al. [sent-129, score-0.327]

77 We do not treat them as a separate syntactic category, [sent-132, score-0.035]

Footnote 1: For ease of presentation we used bracketing instead of listing, i.

Table 4: Linguistic properties considered in the POS tagset design

#   Linguistic property   Code   Cardinality
1   Animacy               A      2
2   Number                N      2
3   Possessiveness        S      10
4   Person                P      8
5   Case                  C      7
6   Negation              G      2
7   Tense                 T      3
8   Mood                  M      4
9   Voice                 V      5

78 for they typically serve as a single syntactic unit (e. [sent-135, score-0.272]

79 ) Instead each syntactic tag has a corresponding binary property that marks the proverbial case. [sent-138, score-0.101]

80 Kazakh is an agglutinative Turkic language, in which word forms are generated by means of the affix inflection. [sent-140, score-0.084]

81 For this reason, we design a positional tagset (Oflazer et al. [sent-150, score-0.269]

82 , 2003; Haji cˇ and Hladk ´a, 1998; Hana and Feldman, 2010), in which the final tags are constructed by the concatenation of the basic tag (often POS of a word form) and the en#TagDescriptionLPsCap. [sent-151, score-0.119]

83 Table 5 provides a detailed description of the designed tagset (not including punctuation) both qualitatively and quantitatively. [sent-158, score-0.304]

84 The table contains a list of tags grouped by the ten major POS (in bold). [sent-159, score-0.053]

85 For each tag we provide a set of LPs it accepts and generative capacities, i. [sent-160, score-0.098]

86 the upper bound on a number of possible tags that can be generated from a given basic tag and the different combinations of the corresponding LPs (see footnote 2). [sent-162, score-0.148]

87 Footnote 2: The multiplication of cardinalities of LPs does not always give the exact number of possible tags, for there are rules that restrict certain combinations of LPs. [sent-163, score-0.067]
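The arithmetic behind this upper bound is a plain product of cardinalities. The sketch below uses the cardinalities from Table 4; the function name is hypothetical, and the result is only the unrestricted bound, before any of the combination rules mentioned above are applied.

```python
import math

# Cardinalities of the linguistic properties, as listed in Table 4.
LP_CARDINALITY = {
    "A": 2, "N": 2, "S": 10, "P": 8, "C": 7,
    "G": 2, "T": 3, "M": 4, "V": 5,
}


def capacity_upper_bound(lp_chain):
    """Product of cardinalities for a chain of property codes, e.g. "NSPC".

    This ignores restrictions on property combinations, so the number of
    valid tags for a real basic tag may be smaller.
    """
    return math.prod(LP_CARDINALITY[code] for code in lp_chain)


# Number, possessiveness, person, case: 2 * 10 * 8 * 7.
nspc_bound = capacity_upper_bound("NSPC")
```

For the NSPC chain the product is 2 × 10 × 8 × 7 = 1120; rules that exclude certain combinations can only lower this figure.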

88 Moreover, some LP combinations may be technically valid but semantically incorrect as they would make no sense, e. [sent-164, score-0.029]

89 Where possible we tried to account for such exceptions, checking the combinations and providing [...] The list of 36 basic tags was compiled following the best practices of Penn tagset design (Marcus et al. [sent-167, score-0.356]

90 Particularly, we broke down the major POS categories in sub-categories, in order to capture semantic distinctions and various usage patterns. [sent-169, score-0.027]

91 For instance, negative (tag #6) and desiderative (tag #7) auxiliary verbs in conjunction with main verbs are used to mark uninflected negation (via no and not) and desiderative mood construction (via usage of to come in the meaning of to want) respectively. [sent-170, score-0.119]

92 Finally, following classical Kazakh grammar, we treat onomatopoeias (tag #34), i. [sent-174, score-0.029]

93 The maximum size of the tagset equals the total generative capacity, or 3844 tags. [sent-177, score-0.237]

94 Footnote 3: Unlike any other part of speech that accepts the NSPC LP chain and must be in the third person (singular or plural) to be in any case other than nominative, personal pronouns can be in any case for any person, thus having a larger capacity. [sent-179, score-0.068]

95 Even the minimal tagset of 36 basic tags can be further reduced to a universal tagset (Petrov et al. [sent-181, score-0.527]

96 2 The Annotation Scheme We have developed an XML-based annotation scheme that follows paradigms of the CES (Ide, 1998) and is convertible into the XCES standard (Ide et al. [sent-186, score-0.087]

97 The main difference with the latter is that in our scheme the raw text and all markup types (i. [sent-188, score-0.077]

98 For the morpho-lexical and syntactic markups we have corresponding tags, i. [sent-192, score-0.079]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kazakh', 0.747), ('tagset', 0.237), ('russian', 0.233), ('klc', 0.22), ('lps', 0.154), ('kazakhstan', 0.11), ('materials', 0.087), ('rnc', 0.066), ('turkic', 0.066), ('tag', 0.066), ('texts', 0.064), ('official', 0.063), ('annotation', 0.062), ('ide', 0.058), ('literary', 0.058), ('dissertations', 0.057), ('went', 0.057), ('tags', 0.053), ('crawler', 0.052), ('markup', 0.052), ('annotated', 0.051), ('books', 0.047), ('documents', 0.046), ('pos', 0.046), ('bnc', 0.046), ('tool', 0.044), ('aibek', 0.044), ('aksan', 0.044), ('baisa', 0.044), ('cyrillic', 0.044), ('desiderative', 0.044), ('digitized', 0.044), ('markups', 0.044), ('publicistic', 0.044), ('rsc', 0.044), ('sourcecustomized', 0.044), ('validators', 0.044), ('agglutinative', 0.044), ('letters', 0.043), ('primary', 0.043), ('file', 0.042), ('affix', 0.04), ('alphabet', 0.04), ('spoken', 0.04), ('assembling', 0.038), ('cardinalities', 0.038), ('ces', 0.038), ('keyboard', 0.038), ('lemmata', 0.038), ('majoring', 0.038), ('description', 0.037), ('compiled', 0.037), ('person', 0.036), ('audio', 0.036), ('stored', 0.036), ('wh', 0.036), ('corpus', 0.036), ('scientific', 0.035), ('syntactic', 0.035), ('genres', 0.035), ('latin', 0.035), ('license', 0.035), ('compiling', 0.035), ('philology', 0.035), ('tagsets', 0.035), ('pure', 0.033), ('equivalents', 0.032), ('accepts', 0.032), ('century', 0.032), ('positional', 0.032), ('xml', 0.032), ('comprises', 0.032), ('chemistry', 0.031), ('feldman', 0.031), ('dative', 0.031), ('mood', 0.031), ('organizations', 0.031), ('designed', 0.03), ('classical', 0.029), ('combinations', 0.029), ('lp', 0.028), ('sbar', 0.028), ('million', 0.028), ('attempts', 0.028), ('categories', 0.027), ('sources', 0.027), ('quantitative', 0.027), ('reasons', 0.027), ('voice', 0.027), ('recommendation', 0.027), ('websites', 0.027), ('inflected', 0.027), ('meta', 0.027), ('ii', 0.025), ('conversations', 0.025), ('scheme', 0.025), ('biology', 0.024), ('corpora', 0.024), ('stylistic', 0.023), 
('np', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 26 emnlp-2013-Assembling the Kazakh Language Corpus

2 0.10205331 186 emnlp-2013-Translating into Morphologically Rich Languages with Synthetic Phrases

Author: Victor Chahuneau ; Eva Schlinger ; Noah A. Smith ; Chris Dyer

Abstract: Translation into morphologically rich languages is an important but recalcitrant problem in MT. We present a simple and effective approach that deals with the problem in two phases. First, a discriminative model is learned to predict inflections of target words from rich source-side annotations. Then, this model is used to create additional sentence-specific word- and phrase-level translations that are added to a standard translation model as “synthetic” phrases. Our approach relies on morphological analysis of the target language, but we show that an unsupervised Bayesian model of morphology can successfully be used in place of a supervised analyzer. We report significant improvements in translation quality when translating from English to Russian, Hebrew and Swahili.

3 0.090071164 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

Author: Thomas Mueller ; Helmut Schmid ; Hinrich Schutze

Abstract: Training higher-order conditional random fields is prohibitive for huge tag sets. We present an approximated conditional random field using coarse-to-fine decoding and early updating. We show that our implementation yields fast and accurate morphological taggers across six languages with different morphological properties and that across languages higher-order models give significant improvements over 1st-order models.

4 0.074762546 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

Abstract: Different demographics, e.g., gender or age, can demonstrate substantial variation in their language use, particularly in informal contexts such as social media. In this paper we focus on learning gender differences in the use of subjective language in English, Spanish, and Russian Twitter data, and explore cross-cultural differences in emoticon and hashtag use for male and female users. We show that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian. Our results show statistically significant relative F-measure improvement over the gender-independent baseline 1.5% and 1% for Russian, 2% and 0.5% for Spanish, and 2.5% and 5% for English for polarity and subjectivity classification.

5 0.069094881 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

6 0.068129376 162 emnlp-2013-Russian Stress Prediction using Maximum Entropy Ranking

7 0.048499547 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

8 0.045952141 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech

9 0.045239698 30 emnlp-2013-Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora

10 0.040151365 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

11 0.037732903 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

12 0.037693929 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

13 0.036945209 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing

14 0.035902984 24 emnlp-2013-Application of Localized Similarity for Web Documents

15 0.034698214 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

16 0.034334805 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

17 0.031781718 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

18 0.031365741 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

19 0.030308651 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

20 0.029467082 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.12), (1, 0.004), (2, -0.023), (3, -0.05), (4, -0.116), (5, -0.056), (6, -0.021), (7, -0.009), (8, 0.028), (9, -0.052), (10, -0.01), (11, 0.031), (12, -0.003), (13, -0.003), (14, 0.034), (15, 0.051), (16, -0.058), (17, -0.02), (18, -0.02), (19, 0.015), (20, -0.04), (21, 0.071), (22, 0.051), (23, -0.02), (24, 0.042), (25, 0.074), (26, 0.09), (27, -0.005), (28, 0.001), (29, -0.003), (30, 0.028), (31, 0.05), (32, -0.022), (33, -0.011), (34, 0.046), (35, -0.08), (36, 0.069), (37, 0.125), (38, 0.067), (39, 0.01), (40, 0.067), (41, -0.031), (42, 0.081), (43, -0.106), (44, 0.071), (45, 0.041), (46, 0.259), (47, -0.06), (48, 0.091), (49, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91841626 26 emnlp-2013-Assembling the Kazakh Language Corpus

2 0.54600477 162 emnlp-2013-Russian Stress Prediction using Maximum Entropy Ranking

Author: Keith Hall ; Richard Sproat

Abstract: We explore a model of stress prediction in Russian using a combination of local contextual features and linguistically-motivated features associated with the word’s stem and suffix. We frame this as a ranking problem, where the objective is to rank the pronunciation with the correct stress above those with incorrect stress. We train our models using a simple Maximum Entropy ranking framework allowing for efficient prediction. An empirical evaluation shows that a model combining the local contextual features and the linguistically-motivated non-local features performs best in identifying both primary and secondary stress.

3 0.51854676 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

Author: Thomas Mueller ; Helmut Schmid ; Hinrich Schutze

Abstract: Training higher-order conditional random fields is prohibitive for huge tag sets. We present an approximated conditional random field using coarse-to-fine decoding and early updating. We show that our implementation yields fast and accurate morphological taggers across six languages with different morphological properties and that across languages higher-order models give significant improvements over 1st-order models.

4 0.50141734 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

Author: Rebecca Dridan

Abstract: A precise syntacto-semantic analysis of English requires a large detailed lexicon with the possibility of treating multiple tokens as a single meaning-bearing unit, a word-with-spaces. However parsing with such a lexicon, as included in the English Resource Grammar, can be very slow. We show that we can apply supertagging techniques over an ambiguous token lattice without resorting to previously used heuristics, a process we call ubertagging. Our model achieves an ubertagging accuracy that can lead to a four to eight fold speed up while improving parser accuracy. 1 Introduction and Motivation Over the last decade or so, supertagging has become a standard method for increasing parser efficiency for heavily lexicalised grammar formalisms such as LTAG (Bangalore and Joshi, 1999), CCG (Clark and Curran, 2007) and HPSG (Matsuzaki et al., 2007). In each of these systems, fine-grained lexical categories, known as supertags, are used to prune the parser search space prior to full syntactic parsing, leading to faster parsing at the risk of removing necessary lexical items. Various methods are used to configure the degree of pruning in order to balance this trade-off. The English Resource Grammar (ERG; Flickinger (2000)) is a large hand-written HPSG-based grammar of English that produces fine-grained syntacto-semantic analyses. Given the high level of lexical ambiguity in its lexicon, parsing with the ERG should therefore also benefit from supertagging, but while various attempts have shown possibilities (Blunsom, 2007; Dridan et al., 2008; Dridan, 2009), supertagging is still not a standard element in the ERG parsing pipeline. There are two main reasons for this. The first is that the ERG lexicon does not assign simple atomic categories to words, but instead builds complex structured signs from information about lemmas and lexical rules, and hence the shape and integration of the supertags is not straightforward.
Bangalore and Joshi (2010) define a supertag as a primitive structure that contains all the information about a lexical item, including argument structure, and where the arguments should be found. Within the ERG, that information is not all contained in the lexicon, but comes from different places. The choice, therefore, of what information may be predicted prior to parsing and how it should be integrated into parsing is an open question. The second reason that supertagging is not standard with ERG processing is one that is rarely considered when processing English, namely ambiguous segmentation. In most mainstream English parsing, the segmentation of parser input into tokens that will become the leaves of the parse tree is considered a fixed, unambiguous process. While recent work (Dridan and Oepen, 2012) has shown that producing even these tokens is not a solved problem, the issue we focus on here is the ambiguous mapping from these tokens to meaning-bearing units that we might call words. Within the ERG lexicon are many multi-token lexical entries that are sometimes referred to as words-with-spaces. These multi-token entries are added to the lexicon where the grammarian finds that the semantics of a fixed expression is non-compositional and has the distributional properties of other single word entries. Some examples include an adverb-like all of a sudden, a preposition-like for example and an adjective-like over and done with. Each of these entries creates a segmentation ambiguity between treating the whole expression as a single unit, or allowing analyses comprising entries triggered by the individual tokens. [Page footer: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1201–1212, Seattle, Washington, USA, 18-21 October 2013. ©2013 Association for Computational Linguistics.]
Previous supertagging research using the ERG has either used the gold standard tokenisation, hence making the task artificially easier, or else tagged the individual tokens, using various heuristics to apply multi-token tags to single tokens. Neither approach has been wholly satisfactory.

In this work we avoid the heuristic approaches and learn a sequential classification model that can simultaneously determine the most likely segmentation and supertag sequences, a process we dub ubertagging. We also experiment with more fine-grained tag sets than have been previously used, and find that it is possible to achieve a level of ubertagging accuracy that can improve both parser speed and accuracy for a precise semantic parser.

2 Previous Work

As stated above, supertagging has become a standard tool for particular parsing paradigms, but the definitions of a supertag, the methods used to learn them, and the way they are used in parsing vary across formalisms. The original supertags were 300 LTAG elementary trees, predicted using a fairly simple trigram tagger that provided a configurable number of tags per token, since the tagger was not accurate enough to make assigning a single tree viable parser input (Bangalore and Joshi, 1999). The C&C CCG parser uses a more complex Maximum Entropy tagger to assign tags from a set of 425 CCG lexical categories (Clark and Curran, 2007). They also found it necessary to supply more than one tag per token, and hence assign all tags that have a probability within a percentage β of the most likely tag for each token. Their standard parser configuration uses a very restrictive β value initially, relaxing it when no parse can be found. Matsuzaki et al. (2007) use a supertagger similar to the C&C tagger alongside a CFG filter to improve the speed of their HPSG parser, feeding sequences of single tags to the parser until a parse is possible.
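The β-based multi-tagging attributed above to Clark and Curran (2007) can be illustrated with a minimal sketch. It assumes that "within β of the most likely tag" means a probability of at least β times the best tag's probability; the function names, the `try_parse` callback and the example probabilities are hypothetical and are not taken from the C&C implementation.

```python
def beta_multitag(tag_probs, beta):
    """Keep every tag whose probability is at least beta times the best one.

    tag_probs maps tag -> probability for a single token (assumed input).
    """
    best = max(tag_probs.values())
    kept = [tag for tag, p in tag_probs.items() if p >= beta * best]
    # Most probable tags first, purely for readability.
    return sorted(kept, key=lambda tag: -tag_probs[tag])


def parse_with_backoff(sentence_probs, betas, try_parse):
    """Try increasingly permissive beta values until try_parse succeeds.

    try_parse is a hypothetical callback standing in for a real parser:
    it receives one candidate tag list per token and returns a parse or None.
    """
    for beta in betas:
        candidates = [beta_multitag(probs, beta) for probs in sentence_probs]
        result = try_parse(candidates)
        if result is not None:
            return beta, result
    return None, None


# Illustrative probabilities only.
lends = {"(S[dcl]\\NP)/NP": 0.60, "N": 0.35, "NP": 0.05}
print(beta_multitag(lends, 0.9))   # restrictive: only the best tag survives
print(beta_multitag(lends, 0.5))   # relaxed: the two most likely tags survive
```

A restrictive β keeps the parser's search space small, and the back-off loop only pays the cost of extra ambiguity for sentences that fail to parse.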
As in the ERG, category and inflectional information are separate in the automatically-extracted ENJU grammar: their supertag set consists of 1361 tags constructed by combining lexical categories and lexical rules. Figure 1 shows examples of supertags from these three tag sets, all describing the simple transitive use of lends.

(a) LTAG: an elementary tree with root S over NP0↓ and VP, where VP dominates V (anchored by lends) and NP1↓
(b) CCG: (S[dcl]\NP)/NP
(c) ENJU HPSG: [NP.nom NP.acc]-singular3rd verb rule
Figure 1: Examples of supertags from LTAG, CCG and ENJU HPSG, for the word lends.

The ALPINO system for parsing Dutch is the closest in spirit to our ERG parsing setup, since it also uses a hand-written HPSG-based grammar, including multi-token entries in its lexicon. Prins and van Noord (2003) use a trigram HMM tagger to calculate the likelihood of up to 2392 supertags, and discard those that are not within τ of the most likely tag. For their multi-token entries, they assign a constructed category to each token, so that instead of assigning preposition to the expression met betrekking tot (“with respect to”), they use (1 preposition), (2 preposition), (3 preposition). Without these constructed categories, they would only have 1365 supertags.

Most previous supertagging attempts with the ERG have used the grammar’s lexical types, which describe the coarse-grained part of speech, and the subcategorisation of a word, but not the inflection. Hence both lends and lent have a possible lexical type v_np*_pp*_to_le, which indicates a verb, with optional noun phrase and prepositional phrase arguments, where the preposition has the form to. The number of lexical types changes as the grammar grows, and is currently just over 1000. Dridan (2009) and Fares (2013) experimented with other tag types, but both found lexical types to be the optimal balance between predictability and efficiency. Both used a multi-tagging approach dubbed selective tagging to integrate the supertags into the parser.
This involved only applying the supertag filter when the tag probability is above a configurable threshold, and not pruning otherwise.

For multi-token entries, both Blunsom (2007) and Dridan (2009) assigned separate tags to each token, with Blunsom (2007) assigning a special ditto tag to all but the initial token of a multi-token entry, while Dridan (2009) just assigned the same tag to each token (leading to example in the expression for example receiving p_np_i_le, a preposition-type category).

adverb     adverb     adverb
adverb     ditto      ditto
1 adverb   2 adverb   3 adverb
all        in         all
Figure 2: Options for tagging parts of the multi-token adverb all in all separately.

Both of these solutions (demonstrated in Figure 2), as well as that of Prins and van Noord (2003), in some ways defeat one of the purposes of treating these expressions as fixed units. The grammarian, by assigning the same category to, for example, all of a sudden and suddenly, is declaring that these two expressions have the same distributional properties, the properties that a sequential classifier is trying to exploit. Separating the tokens loses that information, and introduces extra noise into the sequence model.

Ytrestøl (2012) and Fares (2013) treat the multi-token entries as single expressions for tagging, but with no ambiguity. Ytrestøl (2012) manages this by using gold standard tokenisation, which is, as he states, the standard practice for statistical parsing, but is an artificially simplified setup. Fares (2013) is the only work we know about that has tried to predict the final segmentation that the ERG produces. We compare segmentation accuracy between our joint model and his stand-alone tokeniser in Section 6.

Looking at other instances of joint segmentation and tagging leads to work in non-whitespace separated languages such as Chinese (Zhang and Clark, 2010) and Japanese (Kudo et al., 2004).
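The three per-token workarounds for multi-token entries discussed above (repeating the tag, using a ditto tag, and using position-indexed constructed categories) can be written out as a small sketch. The scheme names "same", "ditto" and "positional" are labels invented for this illustration, and the string formats are only an approximation of how the cited systems spell their tags.

```python
def per_token_tags(tokens, tag, scheme):
    """Spread one multi-token category over the individual tokens.

    scheme is a hypothetical label:
      "same"       - repeat the tag on every token
      "ditto"      - real tag on the first token, 'ditto' afterwards
      "positional" - prefix the tag with the 1-based token position
    """
    n = len(tokens)
    if scheme == "same":
        return [tag] * n
    if scheme == "ditto":
        return [tag] + ["ditto"] * (n - 1)
    if scheme == "positional":
        return [f"{i} {tag}" for i in range(1, n + 1)]
    raise ValueError(f"unknown scheme: {scheme}")


tokens = ["all", "in", "all"]
for scheme in ("same", "ditto", "positional"):
    print(scheme, list(zip(tokens, per_token_tags(tokens, "adverb", scheme))))
```

In every scheme the sequence model sees three observations where the grammar sees one adverb, which is the loss of distributional information the text describes.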
While at a high level, this work is solving the same problem, the shape of the problem is quite different from a data point of view. Regular joint morphological analysis and segmentation has much greater ambiguity in terms of possible segmentations but, in most cases, less ambiguity in terms of labelling than our situation. This also holds for other lemmatisation and morphological research, such as Toutanova and Cherry (2009). While we drew inspiration from this related area, as well as from the speech recognition field, differences in the relative frequency of observations and labels, as well as in segmentation ambiguity, mean that conclusions found in these areas did not always hold true in our problem space.

Figure 3: A selection from the 70 lexitems instantiated for Foreign lending increased as well.

3 The Parser

The parsing environment we work with is the PET parser (Callmeier, 2000), a unification-based chart parser that has been engineered for efficiency with precision grammars, and incorporates subsumption-based ambiguity packing (Oepen and Carroll, 2000) and statistical model driven selective unpacking (Zhang et al., 2007). Parsing in PET is divided in two stages. The first stage, lexical parsing, covers everything from tokenising the raw input string to populating the base of the parse chart with the appropriate lexical items, ready for the second syntactic parsing stage. In this work, we embed our ubertagging model between the two stages. By this point, the input has been segmented into what we call internal tokens, which broadly means splitting at whitespace and hyphens, and making ’s a separate token.
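As a rough illustration of the internal tokens just described, the following sketch splits at whitespace and hyphens and separates a trailing ’s. It is not PET's tokeniser: keeping the hyphen attached to the left-hand piece and leaving other punctuation attached to the word are assumptions made for this example, and `internal_tokens` is a hypothetical name.

```python
import re


def internal_tokens(text):
    """Approximate 'internal tokens': split at whitespace and hyphens,
    and make a trailing 's (straight or curly apostrophe) its own token."""
    tokens = []
    for chunk in text.split():
        clitic = None
        for suffix in ("'s", "\u2019s"):
            if chunk.endswith(suffix) and len(chunk) > len(suffix):
                chunk, clitic = chunk[:-len(suffix)], suffix
                break
        # Split after each hyphen, keeping the hyphen on the left piece
        # (an assumption; the source only says "splitting at ... hyphens").
        tokens.extend(re.findall(r"[^-]+-?|-", chunk))
        if clitic is not None:
            tokens.append(clitic)
    return tokens


print(internal_tokens("Foreign lending increased as well."))
print(internal_tokens("the parser's open-source code"))
```

Sentence-final punctuation stays attached to its word here, which matches the way the running example treats well. as a single form handled by punctuation affixation rules.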
These tokens are subject to a morphological analysis component which proposes possible inflectional and derivational rules based on word form, and then are used in retrieving possible lexical entries from the lexicon. The results of applying the appropriate lexical rules, plus affixation rules triggered by punctuation, to the lexical entries form a lexical item object, that for this work we dub a lexitem.

Figure 3 shows some examples of lexitems instantiated after the lexical parsing stage when analysing Foreign lending increased as well. The pre-terminal labels on these subtrees are the lexical types that have previously been used as supertags for the ERG. For uninflected words, with no punctuation affixed, the lexical type is the only element in the lexitem, other than the word form (e.g. Foreign, as). In this example, we also see lexitems with inflectional rules (v_prp_olr, v_pst_olr), derivational rules (v_nger-tr_dlr) and punctuation affixation rules (w_period_plr). These lexitems are put into a chart, forming a lexical lattice, and it is over this lattice that we apply our ubertagging model, removing unlikely lexitems before they are seen by the syntactic parsing stage.

4 The Data

The primary data sets we use in these experiments are from the 1.0 version of DeepBank (Flickinger et al., 2012), an HPSG annotation of the Wall Street Journal text used for the Penn Treebank (PTB; Marcus et al. (1993)). The current version has gold standard annotations for approximately 85% of the first 22 sections. We follow the recommendations of the DeepBank developers in using Sections 00–19 for training, Section 20 (WSJ20) for development and Section 21 (WSJ21) as test data.
In addition, we use two further sources of training data: the training portions of the LinGO Redwoods Treebank (Oepen et al., 2004), a steadily growing collection of gold standard HPSG annotations in a variety of domains; and the Wall Street Journal section of the North American News Corpus (NANC), which has been parsed, but not manually annotated. This builds on observations by Prins and van Noord (2003), Dridan (2009) and Ytrestøl (2012) that even uncorrected parser output makes very good training data for a supertagger, since the constraints in the parser lead to viable, if not entirely correct sequences. This allows us to use much larger training sets than would be possible if we required manually annotated data.

In final testing, we also include two further data sets to observe how domain affects the contribution of the ubertagging. These are both from the test portion of the Redwoods Treebank: CatB, an essay about open-source software;¹ and WeScience13, text from Wikipedia articles about Natural Language Processing from the WeScience project (Ytrestøl et al., 2009). Table 1 summarises the vital statistics of the data we use.

¹ http://catb.org/esr/writings/cathedral-bazaar/ by Eric S. Raymond

Data Set      Source               Use    Gold?  Trees    Lexitems (All)  Lexitems (M-T)
WSJ00:19      DeepBank 1.0 §00–19  train  yes    33783    661451          6309
Redwoods      Redwoods Treebank    train  yes    39478    432873          6568
NANC          LDC2008T15           train  no     2185323  42376523        399936
WSJ20         DeepBank 1.0 §20     dev    yes    1721     34063           312
WSJ21         DeepBank 1.0 §21     test   yes    1414     27515           253
WeScience13   Redwoods Treebank    test   yes    802      11844           153
CatB          Redwoods Treebank    test   yes    608      11653           115
Table 1: Test, development and training data used in these experiments. The final two columns show the total number of lexitems used for training (All), as well as how many of those were multi-token lexitems (M-T).

With the focus on multi-token lexitems, it is instructive to see just how frequent they are. In terms of type frequency, almost 10% of the approximately 38500 lexical entries in the current ERG lexicon have more than one token in their canonical form.² However, while this is a significant percentage of the lexicon, they do not account for the same percentage of tokens during parsing. An analysis of WSJ00:19 shows that approximately one third of the sentences had at least one multi-token lexitem in the unpruned lexical lattice, and in just under half of those, the gold standard analysis included a multi-word entry. That gives the multi-token lexitems the awkward property of being rare enough to be difficult for a statistical classifier to accurately detect (just under 1% of the leaves of gold parse trees contain multiple tokens), but too frequent to ignore. In addition, since these multi-token expressions have often been distinguished because they are non-compositional, failing to detect the multi-word usage can lead to a disproportionately adverse effect on the semantic analysis of the text.

² While the parser has mechanisms for handling words unknown to the lexicon, with the current grammar these mechanisms will never propose a multi-token lexitem, and so only the multi-token entries explicitly in the lexicon will be recognised as such.

5 Ubertagging Model

Our ubertagging model is very similar to a standard trigram Hidden Markov Model (HMM), except that the states are not all of the same length. Our states are based on the lexitems in the lexical lattice produced by the lexical parsing stage of PET, and as such, can be partially overlapping. We formalise this by defining each state by its start position, end position, and tag. This turns out to make our model equivalent to a type of Hidden semi-Markov Model called a segmental HMM in Murphy (2002). In a segmental HMM, the states are segments with a tag (t) and a length in frames (l). In our setup, the frames are the ERG internal tokens and the segments are the lexitems, which are the potential candidates to become leaves of the parse tree. As indicated above, the majority of segments (over 99%) will be one frame long, but segments of up to four frames are regularly seen in the training data.

A standard trigram HMM has a transition probability matrix A, where the elements $A_{ijk}$ represent the probability $P(k \mid ij)$, and an emission probability matrix B whose elements $B_{jo}$ record the probabilities $P(o \mid j)$. Given these matrices and a vector of observed frames, O, the posterior probabilities of each state at frame v are calculated as:³

\[
P(q_v = q_y \mid O) = \frac{\alpha_v(q_y)\,\beta_v(q_y)}{P(O)} \tag{1}
\]

where $\alpha_v(q_y)$ is the forward probability at frame v, given a current state $q_y$ (i.e. the probability of the observation up to v, given the state):

\begin{align*}
\alpha_v(q_y) &\equiv P(O_{0:v} \mid q_v = q_y) \tag{2} \\
              &= \sum_{q_x} \alpha_v(q_x q_y) \tag{3} \\
\alpha_v(q_x q_y) &= B_{q_y o_v} \sum_{q_w} \alpha_{v-1}(q_w q_x)\, A_{q_w q_x q_y} \tag{4}
\end{align*}

$\beta_v(q_y)$ is the backwards probability at frame v, given a current state $q_y$ (the probability of the observation from v, given the state):

\begin{align*}
\beta_v(q_y) &\equiv P(O_{v+1:V} \mid q_v = q_y) \tag{5} \\
             &= \sum_{q_x} \beta_v(q_x q_y) \tag{6} \\
\beta_v(q_x q_y) &= \sum_{q_z} \beta_{v+1}(q_y q_z)\, A_{q_x q_y q_z}\, B_{q_z o_{v+1}} \tag{7}
\end{align*}

and the probability of the full observation sequence is equal to the forward probability at the end of the sequence, or the backwards probability at the start of the sequence:

\[
P(O) = \alpha_V(\langle E \rangle) = \beta_0(\langle S \rangle) \tag{8}
\]

³ Since we will require per-state probabilities for integration into the parser, we focus on the calculation of posterior probabilities, rather than determining the single best path.

In implementation, our model varies only in what we consider the previous or next states. While v still indexes frames, $q_v$ now indicates a state that ends with frame v, and we look forwards and backwards to adjacent states, not frames, formally designated in terms of l, the length of the state. Hence, we modify equation (4):

\[
\alpha_v(q_x q_y) = B_{q_y O_{v-l+1:v}} \sum_{q_w} \alpha_{v-l}(q_w q_x)\, A_{q_w q_x q_y} \tag{9}
\]

where $v-l$ indexes the frame before the current state starts, and hence we are summing over all states that lead directly to our current state. An equivalent modification to equation (7) gives:

\[
\beta_v(q_x q_y) = \sum_{q_z \in Q_n} \beta_{v+l(q_z)}(q_y q_z)\, A_{q_x q_y q_z}\, B_{q_z O_{v+1:v+l(q_z)}} \tag{10}
\]

where $Q_n$ is the set of states that start at $v+1$ (i.e., the states immediately following the current state), and $l(q_z)$ is the length of state $q_z$.

Type    Example                                  #Tags
LTYPE   v_np-pp*_to_le                           1028
INFL    v_np-pp*_to_le:v_pas_odlr                3626
FULL    v_np-pp*_to_le:v_pas_odlr:w_period_plr   21866
Figure 4: Possible tag types and their tag set size, with examples derived from the lexitem on the right (the form recommended. under w_period_plr, v_pas_odlr and v_np-pp*_to_le).

We construct the transition and emission probability matrices using relative frequencies directly observed from the training data, where we make the simplifying assumption that $P(q_k \mid q_i q_j) \equiv P(t(q_k) \mid t(q_i)\,t(q_j))$. Which is to say, while lexitems with the same tag but different lengths will trigger distinct states with distinct emission probabilities, they will have the same transition probabilities, given the same preceding tags.⁴ Even with our large training set, some tag trigrams are rare or unseen.
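The recursions in equations (9) and (10) can be sketched in code. To stay short, the sketch below makes a deliberate simplification that is not the model in the text: it conditions on one preceding tag (first-order transitions) rather than two, and it takes a ready-made emission probability for each lexitem. The function name, data layout, `<S>`/`<E>` markers and all numbers are illustrative assumptions.

```python
from collections import defaultdict


def lattice_posteriors(lexitems, emit, trans, n_frames):
    """Forward-backward over a lexical lattice (first-order simplification).

    lexitems: list of (start, end, tag), frames numbered from 0, end exclusive.
    emit[i]:  assumed P(observed tokens | lexitem i).
    trans:    dict mapping (previous_tag, tag) -> probability.
    Returns the posterior probability of each lexitem.
    """
    starting_at, ending_at = defaultdict(list), defaultdict(list)
    for i, (start, end, _) in enumerate(lexitems):
        starting_at[start].append(i)
        ending_at[end].append(i)

    # Forward pass: sum over lexitems that end exactly where this one starts.
    alpha = [0.0] * len(lexitems)
    for i in sorted(range(len(lexitems)), key=lambda i: lexitems[i][1]):
        start, _, tag = lexitems[i]
        if start == 0:
            incoming = trans.get(("<S>", tag), 0.0)
        else:
            incoming = sum(alpha[j] * trans.get((lexitems[j][2], tag), 0.0)
                           for j in ending_at[start])
        alpha[i] = emit[i] * incoming

    # Backward pass: sum over lexitems that start exactly where this one ends.
    beta = [0.0] * len(lexitems)
    for i in sorted(range(len(lexitems)), key=lambda i: -lexitems[i][0]):
        _, end, tag = lexitems[i]
        if end == n_frames:
            beta[i] = trans.get((tag, "<E>"), 0.0)
        else:
            beta[i] = sum(trans.get((tag, lexitems[k][2]), 0.0) * emit[k] * beta[k]
                          for k in starting_at[end])

    total = sum(alpha[i] * beta[i] for i in ending_at[n_frames])  # P(O)
    return [a * b / total for a, b in zip(alpha, beta)]


# Toy lattice for the two frames "as" "well.": two readings of "as",
# one of "well.", and one multi-token lexitem spanning both frames.
lexitems = [(0, 1, "p"), (0, 1, "av"), (1, 2, "av"), (0, 2, "av")]
emit = [0.5, 0.2, 0.4, 0.1]
trans = {("<S>", "p"): 0.5, ("<S>", "av"): 0.5, ("p", "av"): 0.3,
         ("av", "av"): 0.2, ("av", "<E>"): 0.6, ("p", "<E>"): 0.1}
for item, p in zip(lexitems, lattice_posteriors(lexitems, emit, trans, 2)):
    print(item, round(p, 4))
```

Because every complete path through the lattice covers each frame exactly once, the posteriors of the lexitems covering any one frame sum to one, which gives a convenient sanity check on an implementation.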
To smooth these probabilities, we use deleted interpolation to calculate a weighted sum of the trigram, bigram and unigram probabilities, since it has been successfully used in effective PoS taggers like the TnT tagger (Brants, 2000). Future work will look more closely at the effects of different smoothing methods.

⁴ Since the multi-token lexical entries are defined because they have the same properties as the single token variants, there is no reason to think the length of a state should influence the tag sequence probability.

6 Intrinsic Ubertag Evaluation

In order to develop and tune the ubertagging model, we first looked at segmentation and tagging performance in isolation over the development set. We looked at three tag granularities: lexical types (LTYPE), which have previously been shown to be the optimal granularity for supertagging with the ERG; inflected types (INFL), which encompass inflectional and derivational rules applied to the lexical type; and the full lexical item (FULL), which also includes affixation rules used for punctuation handling. Examples of each tag type are shown in Figure 4, along with the number of tags of each type seen in the training data.

Tag Type   Segmentation F1   Segmentation Sent.   Tagging F1   Tagging Sent.
FULL       99.55             94.48                93.92        42.13
INFL       99.45             93.55                93.74        41.49
LTYPE      99.40             93.03                93.27        38.12
Table 2: Segmentation and tagging performance of the best path found for each model, measured per segment in terms of F1, and also as complete sentence accuracy.

Single sequence results. Table 2 shows the results when considering the best path through the lattice. In terms of segmentation, our sentence accuracy is comparable to that of the stand-alone segmentation performance reported by Fares et al. (2013) over similar data.⁵ In that work, the authors used a binary CRF classifier to label points between objects they called micro-tokens as either SPLIT or NOSPLIT.
The CRF classifier used a less informed input (since it was external to the parser), but a much more complex model, to produce a best single path sentence accuracy of 94.06%. Encouragingly, this level of segmentation performance was shown in later work to produce a viable parser input (Fares, 2013).

⁵ Fares et al. (2013) used a different section of an earlier version of DeepBank, but with the same style of annotation.

Switching to the tagging results, we see that the F1 numbers are quite good for tag sets of this size.⁶ The best tag accuracy seen for ERG LTYPE-style tags was 95.55 in Ytrestøl (2012), using gold standard segmentation on a different data set. Dridan (2009) experimented with a tag granularity similar to our INFL (letype+morph) and saw a tag accuracy of 91.51, but with much less training data. From other formalisms, Kummerfeld et al. (2010) report a single tag accuracy of 95.91, with the smaller CCG supertag set. Despite the promising tag F1 numbers, however, the sentence level accuracy still indicates a performance level unacceptable for parser input.

⁶ We need to measure F1 rather than tag accuracy here, since the number of tokens tagged will vary according to the segmentation.

Comparing between tag types, we see that, possibly surprisingly, the more fine-grained tags are more accurately assigned, although the differences are small. While instinctively a larger tag set should present a more difficult problem, we find that this is mitigated both by the sparse lexical lattice provided by the parser, and by the extra constraints provided by the more informative tags.

Multi-tagging results. The multi-tagging methods from previous supertagging work become more complicated when dealing with ambiguous tokenisation. Where, in other setups, one can compare tag probabilities for all tags for a particular token, that no longer holds directly when tokens can partially overlap.
Since ultimately, the parser uses lexitems which encompass segmentation and tagging information, we decided to use a simple integration method, where we remove any lexitem to which our model assigns a probability below a certain threshold (ρ). The effect of the different tag granularities is now mediated by the relationship between the states in the ubertagging lattice and the lexitems in the parser’s lattice: for the FULL model, this is a one-to-one relationship, but states from the models that use coarser-grained tags may affect multiple lexitems. To illustrate this point, Figure 5 shows some lexitems for the token forecast,, where there are multiple possible analyses for the comma. A FULL tag of v_cp_le:v_pst_olr:w_comma_plr will select only lexitem (b), whereas an INFL tag v_cp_le:v_pst_olr will select (b) and (c) and the LTYPE tag v_cp_le picks out (a), (b) and (c). On the other hand, where there is no ambiguity in inflection or affixation, an LTYPE tag of n_-_mc_le may relate to only a single lexitem ((f) in this case).

Figure 5: Some of the lexitems, labelled (a) to (f), triggered by forecast, in Despite the gloomy forecast, profits were up.

Since we are using an absolute, rather than relative, threshold, the number needs to be tuned for each model⁷ and comparisons between models can only be made based on the effects (accuracy or pruning power) of the threshold. Table 3 shows how a selection of threshold values affect the accuracy and pruning impact of our different disambiguation models, where the accuracy is measured in terms of percentage of gold lexitems retained. The pruning effect is given both as percentage of lexitems retained after pruning, and average number of lexitems per initial token.⁸ Comparison between the different models can be more easily made by examining Figure 6. Here we see clearly that the LTYPE model provides much less pruning for any given level of lexitem accuracy, while the performance of the other models is almost indistinguishable.

⁷ A tag set size of 1028 will lead to higher probabilities in general than a tag set size of 21866.

Tag Type   ρ         Acc.    Lexitems Kept (%)   Lexitems Ave.
FULL       0.00001   99.71   41.6                3.34
FULL       0.0001    99.44   33.1                2.66
FULL       0.001     98.92   25.5                2.05
FULL       0.01      97.75   19.4                1.56
INFL       0.0001    99.67   37.9                3.04
INFL       0.001     99.25   29.0                2.33
INFL       0.01      98.21   21.6                1.73
INFL       0.02      97.68   19.7                1.58
LTYPE      0.0002    99.75   66.3                5.33
LTYPE      0.002     99.43   55.0                4.42
LTYPE      0.02      98.41   43.5                3.50
LTYPE      0.05      97.54   39.4                3.17
Table 3: Accuracy and ambiguity after pruning lexitems in WSJ20, at a selection of thresholds ρ for each model. Accuracy is measured as the percentage of gold lexitems remaining after pruning, while ambiguity is presented both as a percentage of lexitems kept, and the average number of lexitems per initial token still remaining.

Figure 6: Accuracy over gold lexitems versus average lexitems per initial token over the development set, for each of the different ubertagging models. (Plot title: Tag accuracy versus ambiguity; x-axis: average lexitems per initial token.)

Analysis. The current state-of-the-art POS tagging accuracy (using the 45 tags in the PTB) is approximately 97.5%. The most restrictive ρ value we report for each model was selected to demonstrate that level of accuracy, which we can see would lead to pruning over 80% of lexitems when using FULL tags, an average of 1.56 tags per token. While this level of accuracy has been sufficient for statistical treebank parsing, previous work (Dridan, 2009) has shown that tag accuracy cannot directly predict parser performance, since errors of different types can have very different effects.
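The threshold-based integration and the three quantities reported in Table 3 can be sketched as follows. This is a minimal illustration for the one-to-one (FULL) case only; with coarser tags one state would map to several lexitems. The function names, the tuple layout of a lexitem and the probabilities are hypothetical.

```python
def prune(posteriors, rho):
    """Keep lexitems whose model probability is not below the threshold rho."""
    return {item for item, p in posteriors.items() if p >= rho}


def pruning_stats(posteriors, gold, rho, n_tokens):
    """Return (accuracy, percent kept, average kept per initial token).

    accuracy: percentage of gold lexitems that survive pruning.
    """
    kept = prune(posteriors, rho)
    accuracy = 100.0 * len(kept & gold) / len(gold)
    pct_kept = 100.0 * len(kept) / len(posteriors)
    avg_per_token = len(kept) / n_tokens
    return accuracy, pct_kept, avg_per_token


# Illustrative posteriors for a two-token lattice; values are made up.
posteriors = {
    ("as", 0, 1, "p"): 0.34,
    ("as", 0, 1, "av"): 0.09,
    ("well.", 1, 2, "av"): 0.43,
    ("as well.", 0, 2, "av"): 0.57,
}
gold = {("as well.", 0, 2, "av")}
print(pruning_stats(posteriors, gold, rho=0.1, n_tokens=2))
```

Because ρ is an absolute cut-off on probabilities, a value that suits one tag set will prune very differently under another, which is why the thresholds in Table 3 differ per model.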
This is hard to quantify without parsing, but we made a qualitative analysis of the lexitems that were incorrectly being pruned. For all models, the most difficult lexitems to get correct were proper nouns, particularly those that are also used as common nouns (e.g. Bank, Airline, Report). While capitalisation provides a clue here, it is not always deterministic, particularly since the treebank incorporates detailed decisions regarding the distinction between a name and a capitalised common noun that require real world knowledge, and are not necessarily always consistent. Almost two thirds of the errors made by the FULL and INFL models are related to these decisions, but only about 40% for the LTYPE model. The other errors are predominantly over noun and verb type lexitems, as the open classes, with the only difference between models being that the FULL model seems marginally better at classifying verbs. The next section describes the end-to-end setup and results when parsing the development set.

⁸ The average number of lexitems per token for the unrestricted parser is 8.03, although the actual assignment is far from uniform, with up to 70 lexitems per token seen for the very ambiguous tokens.

7 Parsing

With encouraging ubertagging results, we now take the next step and evaluate the effect on end-to-end parsing. Apart from the issue of different error types having unpredictable effects, there are two other factors that make the isolated ubertagging results only an approximate indication of parsing performance.

The first confounding factor is the statistical parsing disambiguation model. To show the effect of ubertagging in a realistic configuration, we only evaluate the first analysis that the parser returns. That means that when the unrestricted parser does not rank the gold analysis first, errors made by our model may not be visible, because we would never see the gold analysis in any case.
On the other hand, it is possible to improve parser accuracy by pruning incorrect lexitems that were in a top ranked, non-gold analysis.

The second new factor that parser integration brings to the picture is the effect of resource limitations. For reasons of tractability, PET is run with per sentence time and memory limits. For treebank creation, these limits are quite high (up to four minutes), but for these experiments, we set the timeout to a more practical 60 seconds and the memory limit to 2048 MB. Without lexical pruning, this leads to approximately 3% of sentences not receiving an analysis. Since the main aim of ubertagging is to increase efficiency, we would expect to regain at least some of these unanalysed sentences, even when a lexitem needed for the gold analysis has been removed.

Tag Type     ρ         Lexitem F1   Bracket F1   Time
No Pruning             94.06        88.58        6.58
FULL         0.00001   95.62        89.84        3.99
FULL         0.0001    95.95        90.09        2.69
FULL         0.001     95.81        89.88        1.34
FULL         0.01      94.19        88.29        0.64
INFL         0.0001    96.10        90.37        3.45
INFL         0.001     96.14        90.33        1.78
INFL         0.01      95.07        89.27        0.84
INFL         0.02      94.32        88.49        0.64
LTYPE        0.0002    95.37        89.63        4.73
LTYPE        0.002     96.03        90.20        2.89
LTYPE        0.02      95.04        89.04        1.23
LTYPE        0.05      93.36        87.26        0.88
Table 4: Lexitem and bracket F1 over WSJ20, with average per sentence parsing time in seconds.

Table 4 shows the parsing results at the same threshold values used in Table 3. Accuracy is calculated in terms of F1 both over lexitems, and PARSEVAL-style labelled brackets (Black et al., 1991), while efficiency is represented by average parsing time per sentence. We can see here that an ubertagging F1 of below 98 (cf. Table 3) leads to a drop in parser accuracy, but that an ubertagging performance of between 98 and 99 can improve parser F1 while also achieving speed increases up to 8-fold. From the table we confirm that, contrary to earlier pipeline supertagging configurations, tags of a finer granularity than LTYPE can deliver better performance, both in terms of accuracy and efficiency.
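Both accuracy measures above are F1 scores over labelled spans, whether the spans are lexitems or brackets. A minimal sketch of that computation is given below; it treats each analysis as a multiset of (start, end, label) triples and ignores the finer conventions of the PARSEVAL procedure, so it should be read as an approximation. The function name and the example spans are hypothetical.

```python
from collections import Counter


def span_f1(gold, predicted):
    """F1 over labelled spans given as (start, end, label) tuples.

    Duplicate spans are matched as multisets, so each gold span can be
    credited at most once.
    """
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# The gold analysis has three single-token leaves; the prediction merges
# the last two tokens into one multi-token lexitem with a different label.
gold = [(0, 1, "a"), (1, 2, "b"), (2, 3, "c")]
predicted = [(0, 1, "a"), (1, 3, "d")]
print(span_f1(gold, predicted))
```

Scoring spans rather than per-token tags keeps the measure well defined when gold and predicted analyses segment the input differently.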
Again, comparing graphically in Figure 7 gives a clearer picture. Here we have graphed labelled bracket F1 against parsing time for the full range of threshold values explored, with the unpruned parsing results indicated by a cross. From this figure, we see that the INFL model, despite being marginally less accurate when measured in isolation, leads to slightly more accurate parse results than the FULL model at all levels of efficiency.

Figure 7: Labelled bracket F1 versus parsing time per sentence over the development set, for each of the different ubertagging models. The cross indicates unpruned performance, while the circle pinpoints the configuration we chose for the final test runs. (Plot title: Parser accuracy versus efficiency; x-axis: time per sentence.)

Looking at the same graph for different samples of the development set (not shown) shows some variance in which threshold value gives the best F1, but the relative differences and basic curve shape remain the same. From these different views, using the guideline of maximum efficiency without harming accuracy, we selected our final configuration: the INFL model with a threshold value of 0.001 (marked with a circle in Figure 7). On the development set, this configuration leads to a 1.75 point improvement in F1 in 27% of the parsing time.

8 Final Results

Table 5 shows the results obtained when parsing using the configuration selected on the development set, over our three test sets. The first, WSJ21, is from the same domain as the development set. Here we see that the effect over the WSJ21 set fairly closely mirrored that of the development set, with an F1 increase of 1.81 in 29% of the parsing time.

The Wikipedia domain of our WeScience13 test set, while very different to the newswire domain of the development set, could still be considered in domain for the parsing and ubertagging models, since there is Wikipedia data in the training sets.
With an average sentence length of 15.18 (compared to 18.86 in WSJ21), the baseline parsing time is faster than for WSJ21, and the speedup is not quite as large but still welcome, at 36% of the baseline time. The increase in accuracy is likewise smaller (due to fewer issues with resource exhaustion in the baseline), but as our primary goal is to not harm accuracy, the results are pleasing.

Data Set      Baseline F1   Baseline Time   Pruned F1   Pruned Time
WSJ21         88.12         6.06            89.93       1.77
WeScience13   86.25         4.09            87.14       1.48
CatB          86.31         5.00            87.11       1.78
Table 5: Parsing accuracy in terms of labelled bracket F1 and average time per sentence when parsing the test sets, without pruning, and then with lexical pruning using the INFL model with a threshold of 0.001.

The CatB test set is the standard out-of-domain test for the parser, and is also out of domain for the ubertagging model. The average sentence length is not much below that of WSJ21, at 18.61, but the baseline parsing speed is still noticeably faster, which appears to be a reflection of greater structural ambiguity in the newswire text. We still achieve a reduction in parsing time to 35% of the baseline, again with a small improvement in accuracy.

The across-the-board performance improvement on all our test sets suggests that, while tuning the pruning threshold could help, it is a robust parameter that can provide good performance across a variety of domains. This means that we finally have a robust supertagging setup for use with the ERG that doesn’t require heuristic shortcuts and can be reliably applied in general parsing.

9 Conclusions and Outlook

In this work we have demonstrated a lexical disambiguation process dubbed ubertagging that can assign fine-grained supertags over an ambiguous token lattice, a setup previously ignored for English.
It is the first completely integrated supertagging setup for use with the English Resource Grammar, which avoids the previously necessary heuristics for dealing with ambiguous tokenisation, and can be robustly configured for improved performance without loss of accuracy. Indeed, by learning a joint segmentation and supertagging model, we have been able to achieve usefully high tagging accuracies for very fine-grained tags, which leads to potential parser speedups of between 4 and 8 fold.

Analysis of the tagging errors still being made has suggested some possibly avoidable inconsistencies in the grammar and treebank, which have been fed back to the developers, hopefully leading to even better results in the future.

In future work, we will investigate more advanced smoothing methods to try and boost the ubertagging accuracy. We also intend to more fully explore the domain adaptation potentials of the lexical model that have been seen in other parsing setups (see Rimell and Clark (2008) for example), as well as examine the limits on the effects of more training data. Finally, we would like to explore just how much the statistical properties of our data dictate the success of the model by looking at related problems like morphological analysis of unsegmented languages such as Japanese.

Acknowledgements

I am grateful to my colleagues from the Oslo Language Technology Group and the DELPH-IN consortium for many discussions on the issues involved in this work, and particularly to Stephan Oepen who inspired the initial lattice tagging idea. Thanks also to three anonymous reviewers for their very constructive feedback which improved the final version. Large-scale experimentation and engineering is made possible through access to the TITAN high-performance computing facilities at the University of Oslo, and I am grateful to the Scientific Computing staff at UiO, as well as to the Norwegian Metacenter for Computational Science and the Norwegian tax payer.
References

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2):237–265.
Srinivas Bangalore and Aravind Joshi, editors. 2010. Supertagging: Using Complex Lexical Descriptions in Natural Language Processing. The MIT Press, Cambridge, US.
Ezra Black, Steve Abney, Dan Flickinger, Claudia Gdaniec, Ralph Grishman, Phil Harrison, Don Hindle, Robert Ingria, Fred Jelinek, Judith Klavans, Mark Liberman, Mitch Marcus, S. Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the Workshop on Speech and Natural Language, pages 306–311, Pacific Grove, USA.
Philip Blunsom. 2007. Structured Classification for Multilingual Natural Language Processing. Ph.D. thesis, Department of Computer Science and Software Engineering, University of Melbourne.
Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing ANLP-2000, pages 224–231, Seattle, USA.
Ulrich Callmeier. 2000. PET. A platform for experimentation with efficient HPSG processing techniques. Natural Language Engineering, 6(1):99–108, March.
Stephen Clark and James R. Curran. 2007. Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of the 45th Meeting of the Association for Computational Linguistics, pages 248–255, Prague, Czech Republic.
Rebecca Dridan and Stephan Oepen. 2012. Tokenization. Returning to a long solved problem. A survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Meeting of the Association for Computational Linguistics, pages 378–382, Jeju, Republic of Korea, July.
Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised grammars. Pages 613–621.
Rebecca Dridan. 2009. Using lexical statistics to improve HPSG parsing. Ph.D. thesis, Department of Computational Linguistics, Saarland University.
Murhaf Fares, Stephan Oepen, and Yi Zhang. 2013. Machine learning for high-quality tokenization. Replicating variable tokenization schemes. In Computational Linguistics and Intelligent Text Processing, pages 231–244. Springer.
Murhaf Fares. 2013. ERG tokenization and lexical categorization: a sequence labeling approach. Master’s thesis, Department of Informatics, University of Oslo.
Dan Flickinger, Yi Zhang, and Valia Kordoni. 2012. DeepBank. A dynamically annotated treebank of the Wall Street Journal. In Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pages 85–96, Lisbon, Portugal. Edições Colibri.
Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237.
Jonathan K. Kummerfeld, Jessika Roesner, Tim Dawborn, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adaptation. In Proceedings of the 48th Meeting of the Association for Computational Linguistics, pages 345–355, Uppsala, Sweden.
Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2007. Efficient HPSG parsing with supertagging and CFG-filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 1671–1676, Hyderabad, India.
Kevin P. Murphy. 2002. Hidden semi-Markov models (HSMMs).
Stephan Oepen and John Carroll. 2000. Ambiguity packing in constraint-based parsing. Practical results. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 162–169, Seattle, WA, USA.
Stephan Oepen, Daniel Flickinger, Kristina Toutanova, and Christopher D. Manning. 2004. LinGO Redwoods. A rich and dynamic treebank for HPSG. Research on Language and Computation, 2(4):575–596.
Robbert Prins and Gertjan van Noord. 2003. Reinforcing parser preferences through tagging. Traitement Automatique des Langues, 44(3):121–139.
Laura Rimell and Stephen Clark. 2008. Adapting a lexicalized-grammar parser to contrasting domains. Pages 475–484.
Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th Meeting of the Association for Computational Linguistics, pages 486–494, Singapore.
Gisle Ytrestøl. 2012. Transition-based Parsing for Large-scale Head-Driven Phrase Structure Grammars. Ph.D. thesis, Department of Informatics, University of Oslo.
Gisle Ytrestøl, Stephan Oepen, and Dan Flickinger. 2009. Extracting and annotating Wikipedia subdomains. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories, pages 185–197, Groningen, The Netherlands.
Yue Zhang and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 843–852, Cambridge, MA, USA.
Yi Zhang, Stephan Oepen, and John Carroll. 2007. Efficiency in unification-based n-best parsing. In Proceedings of the 10th International Conference on Parsing Technologies, pages 48–59, Prague, Czech Republic, July.

5 0.49021214 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterparts, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

6 0.4756622 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

7 0.45773256 129 emnlp-2013-Measuring Ideological Proportions in Political Speeches

8 0.40288049 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

9 0.33740681 186 emnlp-2013-Translating into Morphologically Rich Languages with Synthetic Phrases

10 0.3162429 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries

11 0.31279641 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

12 0.30845821 61 emnlp-2013-Detecting Promotional Content in Wikipedia

13 0.29925904 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

14 0.29459003 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

15 0.29271436 121 emnlp-2013-Learning Topics and Positions from Debatepedia

16 0.28940982 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions

17 0.28744563 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features

18 0.2852295 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

19 0.27701056 95 emnlp-2013-Identifying Multiple Userids of the Same Author

20 0.27534381 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.04), (18, 0.021), (22, 0.032), (30, 0.058), (45, 0.023), (47, 0.012), (50, 0.016), (51, 0.189), (53, 0.318), (66, 0.057), (71, 0.044), (75, 0.036), (77, 0.021), (90, 0.011), (96, 0.022)]
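
The topic weights above form a sparse vector of (topicId, topicWeight) pairs. As a rough illustration of how two such vectors could be compared, here is a minimal Python sketch that assumes cosine similarity. The page does not state which measure produces the simValue column, so that choice is an assumption, and `other_topics` is a hypothetical second paper invented for the example.

```python
import math

# Sparse LDA topic distribution listed above for this paper: (topicId, topicWeight).
paper_topics = [(3, 0.04), (18, 0.021), (22, 0.032), (30, 0.058), (45, 0.023),
                (47, 0.012), (50, 0.016), (51, 0.189), (53, 0.318), (66, 0.057),
                (71, 0.044), (75, 0.036), (77, 0.021), (90, 0.011), (96, 0.022)]

# Hypothetical second paper, invented purely for illustration.
other_topics = [(12, 0.05), (51, 0.25), (53, 0.30), (66, 0.10)]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors of (id, weight) pairs.

    Assumption: the source does not say that cosine similarity is the
    measure behind its simValue scores; it is used here as a common choice.
    """
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db[t] for t, w in da.items() if t in db)
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dominant_topics(topics, k=2):
    """Return the k (topicId, weight) pairs with the largest weights."""
    return sorted(topics, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    print(dominant_topics(paper_topics))
    print(cosine_similarity(paper_topics, other_topics))
```

Storing only the non-zero topics keeps the comparison proportional to the number of listed pairs rather than to the total number of topics in the model.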

similar papers list:

simIndex simValue paperId paperTitle

1 0.85105205 54 emnlp-2013-Decipherment with a Million Random Restarts

Author: Taylor Berg-Kirkpatrick ; Dan Klein

Abstract: This paper investigates the utility and effect of running numerous random restarts when using EM to attack decipherment problems. We find that simple decipherment models are able to crack homophonic substitution ciphers with high accuracy if a large number of random restarts are used but almost completely fail with only a few random restarts. For particularly difficult homophonic ciphers, we find that big gains in accuracy are to be had by running upwards of 100K random restarts, which we accomplish efficiently using a GPU-based parallel implementation. We run a series of experiments using millions of random restarts in order to investigate other empirical properties of decipherment problems, including the famously uncracked Zodiac 340.

2 0.82442641 129 emnlp-2013-Measuring Ideological Proportions in Political Speeches

Author: Yanchuan Sim ; Brice D. L. Acree ; Justin H. Gross ; Noah A. Smith

Abstract: We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. Presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert-authored hypotheses.

3 0.81409156 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation

Author: Rui Fang ; Changsong Liu ; Lanbo She ; Joyce Y. Chai

Abstract: In situated dialogue, humans and agents have mismatched capabilities of perceiving the shared environment. Their representations of the shared world are misaligned. Thus referring expression generation (REG) will need to take this discrepancy into consideration. To address this issue, we developed a hypergraph-based approach to account for group-based spatial relations and uncertainties in perceiving the environment. Our empirical results have shown that this approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This big performance gap calls for new solutions to REG that can mediate a shared perceptual basis in situated dialogue.

same-paper 4 0.77149057 26 emnlp-2013-Assembling the Kazakh Language Corpus

Author: Olzhas Makhambetov ; Aibek Makazhanov ; Zhandos Yessenbayev ; Bakhyt Matkarimov ; Islam Sabyrgaliyev ; Anuar Sharafudinov

Abstract: This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part KLC comprises such parts as: (i) annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; (ii) as well as a sub-corpus with the annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open for contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials.

5 0.56622124 121 emnlp-2013-Learning Topics and Positions from Debatepedia

Author: Swapna Gottipati ; Minghui Qiu ; Yanchuan Sim ; Jing Jiang ; Noah A. Smith

Abstract: We explore Debatepedia, a community-authored encyclopedia of sociopolitical debates, as evidence for inferring a low-dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation’s usefulness in attaching opinionated documents to arguments and its consistency with human judgments about positions.

6 0.55512315 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

7 0.55317456 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

8 0.55214596 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

9 0.550843 143 emnlp-2013-Open Domain Targeted Sentiment

10 0.55031735 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

11 0.54934239 152 emnlp-2013-Predicting the Presence of Discourse Connectives

12 0.54866987 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

13 0.54669094 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

14 0.54637921 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

15 0.54592085 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

16 0.54567677 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

17 0.54553401 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

18 0.54524946 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

19 0.54520434 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

20 0.5449416 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types