acl acl2012 acl2012-194 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii
Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. [sent-6, score-0.403]
2 The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. [sent-7, score-0.189]
3 The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. [sent-8, score-0.435]
4 1 Introduction For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages. [sent-10, score-0.279]
5 We can often find such texts in linguistic resources collected from the World Wide Web for many non-major languages, which tend to also contain portions of text in a major language. [sent-11, score-0.318]
6 In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state-of-the-art NLP applications are built by learning a gold standard for one specific language. [sent-12, score-0.304]
7 Moreover, segmentation is useful for other objectives such as collecting linguistic resources for non-major languages and automatically removing portions written in major languages, as noted above. [sent-13, score-0.365]
8 The problem addressed in this article is thus to segment a multilingual text by language and identify the language of each segment. [sent-15, score-0.44]
9 In addition, for our objective, the set of target languages consists of not only major languages but also many non-major languages: more than 200 languages in total. [sent-16, score-0.429]
10 First, (Teahan, 2000) attempted to segment multilingual texts by using text segmentation methods used for non-segmented languages. [sent-23, score-0.628]
11 For this purpose, he used a gold standard of multilingual texts annotated by borders and languages. [sent-24, score-0.587]
12 This segmentation approach is similar to that of word segmentation for non-segmented texts, and he tested it on six different European languages. [sent-25, score-0.228]
13 Although the problem setting is similar to ours, the formulation and solution are different, particularly in that our method uses only a monolingual gold standard, not a multilingual one as in Teahan’s study. [sent-26, score-0.299]
14 Here again, the problem setting is similar to ours but not exactly the same, since the embedded text portions were assumed to be words. [sent-31, score-0.165]
15 In contrast, our work considers more than 200 languages, and the portions of embedded text are larger: up to the paragraph level to accommodate the reality of multilingual texts. [sent-33, score-0.33]
16 A more common setting in the NLP context is segmentation into semantically coherent text portions, of which a representative method is text tiling as reported by (Hearst, 1997). [sent-38, score-0.27]
17 This article concerns that problem together with segmentation but has another particularity in aiming at classification into a substantial number of categories, i. [sent-45, score-0.193]
18 This article presents one way to formulate the segmentation and identification problem as a combinatorial optimization problem; specifically, to find the set of segments and their languages that minimizes the description length of a given multilingual text. [sent-51, score-0.896]
19 2 Problem Formulation In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e. [sent-53, score-0.178]
20 First, we assume that for all multilingual text, every text portion is written in one of the given languages; there is no input text of an unknown language without learning data. [sent-57, score-0.279]
21 Hence, our formulation is based on the minimum description length (MDL), which works with relatively small amounts of learning data. [sent-63, score-0.252]
22 A multilingual text to be segmented is denoted as X = x1, . [sent-65, score-0.263]
23 , x|X| , where xi denotes the i-th character of X and |X | denotes the text’s length. [sent-68, score-0.193]
24 Text segmentation by language refers here to the process of segmenting X by a set of borders B = [B1, . [sent-69, score-0.423]
25 , B|B|] , where |B| denotes the number of borders, and each Bi indicates the location of a language border as an offset number of characters from the beginning. [sent-72, score-0.311]
26 The language of each segment Xi is denoted as Li, where Li ∈ L, the set of languages. [sent-81, score-0.18]
27 , L|B|] denotes the sequence of languages corresponding to each segment Xi. [sent-85, score-0.282]
28 We formulate the problem of segmenting a multilingual text by language as follows. [sent-87, score-0.262]
29 Given a multilingual text X, the segments X for a list of borders B are obtained with the corresponding languages L. [sent-88, score-0.733]
30 Then, the total description length is obtained by calculating each description length of a segment Xi for the language Li: $(\hat{X}, \hat{L}) = \arg\min_{X, L} \sum_{i=0}^{|B|} dl_{L_i}(X_i)$. [sent-89, score-0.525]
31 (1) The function dlLi (Xi) calculates the description length of a text segment Xi through the use of a language model for Li. [sent-90, score-0.389]
32 Since this term is a common constant for all possible segmentations and the minimization of formula (1) is not affected by this term, we will ignore it. [sent-93, score-0.233]
33 The model defined by (1) is additive for Xi, so the following formula can be applied to search for language Li given a segment Xi: $\hat{L}_i = \arg\min_{L_i \in L} dl_{L_i}(X_i)$, (2) under the constraint that $L_i \neq L_{i-1}$ for i ∈ {1, . [sent-94, score-0.366]
34 to give the description length in an information-theoretic manner: $dl_{L_i}(X_i) = -\log_2 P_{L_i}(X_i) + \log_2 |X| + \log_2 |L| + \gamma$. [sent-99, score-0.193]
35 (3) Here, the first term corresponds to the code length of the text chunk Xi given a language model for Li, which in fact corresponds to the cross-entropy of Xi for Li multiplied by |Xi |. [sent-100, score-0.225]
36 parameters used to describe the length of the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language Li. [sent-102, score-0.307]
37 This fourth term will differ according to the language model type; moreover, its value can be further minimized through formula (2). [sent-103, score-0.233]
38 Nevertheless, since we use a uniform amount of training data for every language, and since varying γ would prevent us from improving the efficiency of dynamic programming, as explained in §4, in this article we set γ to a constant obtained empirically. [sent-104, score-0.171]
39 Under this formulation, therefore, when detecting the language of a segment as in formula (2), the terms of formula (3) other than the first term will be constant: what counts is only the first term, similarly to much of the previous work explained in the following section. [sent-105, score-0.594]
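The decomposition in formula (3) and the language choice in formula (2) can be illustrated with a small sketch. It is not the authors' implementation: an add-one smoothed character unigram model stands in for the MMS and PPM cross-entropy estimators, and the names `train_unigram`, `description_length`, `identify_language`, the `alphabet_size` default, and the toy training samples are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet_size=65536):
    """Return a function giving -log2 P(ch) under add-one smoothing.

    Stand-in language model (assumption); the paper uses MMS or PPM instead.
    """
    counts = Counter(sample)
    total = len(sample) + alphabet_size

    def bits(ch):
        return -math.log2((counts.get(ch, 0) + 1) / total)

    return bits


def description_length(bits, segment, text_len, n_langs, gamma):
    """Formula (3): code length + border location + language label + gamma."""
    code_length = sum(bits(ch) for ch in segment)
    return code_length + math.log2(text_len) + math.log2(n_langs) + gamma


def identify_language(models, segment):
    """Formula (2) without the adjacency constraint.

    The last three terms of formula (3) are the same for every language,
    so comparing the first term (the code length) is enough.
    """
    return min(models, key=lambda lang: sum(models[lang](ch) for ch in segment))


# Toy monolingual samples (hypothetical, far smaller than real training data).
models = {
    "en": train_unigram("the quick brown fox jumps over the lazy dog"),
    "ja": train_unigram("いろはにほへと ちりぬるを わかよたれそ つねならむ"),
}
print(identify_language(models, "hello world"))
print(identify_language(models, "いろは"))
```

Because the border, language, and γ terms are constant across languages, `identify_language` never needs to call `description_length`; the full function matters only when segmentations with different numbers of segments are compared.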
40 Next, after briefly introducing methods to calculate the first term of formula (3), we explain the solution to optimize the combinatorial problem of formula (1). [sent-108, score-0.409]
41 x|X| , let Leni (Y ) be the match length starting from xi of X for Y2. [sent-142, score-0.225]
42 Since formula (1) of §2 is based on adding the description length, it is important that the whole value be additive to enable efficient optimization (as will be explained in §4). [sent-145, score-0.355]
43 In this article, we use as a function to obtain the cross-entropy and for multiplication by |X| in formula (3). [sent-149, score-0.176]
44 4 Segmentation by Dynamic Programming By applying the above methods, we propose a solution to formula (1) through dynamic programming. [sent-163, score-0.176]
45 Considering the additive characteristic of the description length formulated previously as formula (1), we denote the minimized description length for a given text X simply as DP(X), which can be decomposed recursively as follows6: DP(X) =t∈{0,. [sent-167, score-0.67]
46 x|X| ), gives the description length of the remaining characters under the language model for L. [sent-187, score-0.277]
47 To fill a cell of this table, formula (4) suggests referring to t × |L| cells and calculating the description length of the rest of the text for O(|X| − t) cells for each language. [sent-189, score-0.25]
48 First, the description length can be calculated from the previous result, decreasing O(|X| − t) to O(1) (to obtain the code length of an additional character). [sent-191, score-0.304]
49 |X| : for MMS, U can be proven to be O(log |Y|), where |Y| is the maximum length among the learning corpora; and for PPM, U corresponds to the maximum length of an n-gram. [sent-193, score-0.222]
50 The formula can also be used to generate segments for which some adjacent languages coincide and then further to generate L through post-processing by concatenating segments of the same language. [sent-198, score-0.517]
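The post-processing variant described here (segment first, then concatenate adjacent segments of the same language) can be sketched as a prefix dynamic program. The recursion of formula (4) is truncated in this excerpt, so the code is one plausible reading rather than the authors' algorithm: it uses a quadratic scan over candidate borders without the speed-ups discussed above, an add-one unigram model as a stand-in for MMS or PPM, and hypothetical names (`train_unigram`, `segment_by_language`) and toy samples.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet_size=65536):
    """Stand-in model (assumption): -log2 P(ch) with add-one smoothing."""
    counts = Counter(sample)
    total = len(sample) + alphabet_size
    return lambda ch: -math.log2((counts.get(ch, 0) + 1) / total)


def segment_by_language(text, models, gamma=1.0):
    """Minimize the summed description length over all segmentations."""
    n = len(text)
    if n == 0:
        return []
    langs = list(models)
    # Per-segment constant of formula (3): border, language label, gamma.
    overhead = math.log2(n) + math.log2(len(langs)) + gamma
    # prefix[lang][j] = bits needed to encode text[:j] under lang,
    # so any segment's code length is a difference of two entries.
    prefix = {}
    for lang in langs:
        acc = [0.0]
        for ch in text:
            acc.append(acc[-1] + models[lang](ch))
        prefix[lang] = acc
    best = [0.0] + [math.inf] * n   # best[j]: minimum cost of text[:j]
    back = [None] * (n + 1)         # back[j]: (start, lang) of last segment
    for j in range(1, n + 1):
        for i in range(j):
            for lang in langs:
                cost = best[i] + prefix[lang][j] - prefix[lang][i] + overhead
                if cost < best[j]:
                    best[j], back[j] = cost, (i, lang)
    # Recover the segments, then merge adjacent ones sharing a language.
    segments, j = [], n
    while j > 0:
        i, lang = back[j]
        segments.append((i, j, lang))
        j = i
    segments.reverse()
    merged = []
    for start, end, lang in segments:
        if merged and merged[-1][2] == lang:
            merged[-1] = (merged[-1][0], end, lang)
        else:
            merged.append((start, end, lang))
    return merged


models = {
    "en": train_unigram("the quick brown fox jumps over the lazy dog"),
    "ja": train_unigram("いろはにほへと ちりぬるを わかよたれそ つねならむ"),
}
print(segment_by_language("hello world" + "いろはにほへと", models))
```

The prefix sums play the role of the additivity noted earlier: extending a segment by one character costs one lookup instead of a rescan. The per-segment overhead is what discourages spurious borders, so raising `gamma` yields fewer, longer segments.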
51 7This number means the two best scores for different languages, which is required to obtain L directly: in addition to the best score, if the language of the best coincides with L in formula (4), then the second best is also needed. [sent-199, score-0.176]
52 Table 1: Number of languages for each writing system
character kinds   UDHR   Wiki
Latin              260    158
Cyrillic            12     20
Devanagari           0      8
Arabic               1      6
Other                4     30
5 Experimental 5. [sent-201, score-0.222]
53 1 Setting Monolingual Texts (Training / Test Data) In this work, monolingual texts were used both for training the cross-entropy computation and as test data for cross-validation: the training data does not contain any test data at all. [sent-202, score-0.228]
54 Monolingual texts were also used to build multilingual texts, as explained in the following subsection. [sent-203, score-0.364]
55 We consider UDHR the most suitable text source for our purpose, since the content of every monolingual text in the declaration is unique. [sent-206, score-0.256]
56 The numbers of languages for various representative writing systems are listed in Table 1 for both UDHR and Wiki, while the Ap- 8http://www. [sent-219, score-0.185]
57 Note that in this article, a character means a Unicode character throughout, which differs from a character rendered in block form for some writing systems. [sent-225, score-0.237]
58 To evaluate language identification for monolingual texts, as will be reported in §6. [sent-226, score-0.178]
59 This differs from the most similar previous study (Teahan, 2000), which required multilingual learning data. [sent-232, score-0.165]
60 The multilingual texts were generated artificially, since multilingual texts taken directly from the web have other issues besides segmentation. [sent-233, score-0.636]
61 First, proper nouns in multilingual texts complicate the final judgment of language and segment borders. [sent-234, score-0.457]
62 In practical application, therefore, texts for segmentation must be preprocessed by named entity recognition, which is beyond the scope of this work. [sent-235, score-0.267]
63 Second, the sizes of text portions in multilingual web texts differ greatly, which would make it difficult to evaluate the overall performance of the proposed method in a uniform manner. [sent-236, score-0.483]
64 The first is a set of multilingual texts, denoted as Test1, such that each text is the conjunction of two portions in different languages. [sent-238, score-0.371]
65 Here, the experiment is focused on segment border detection, which must segment the text into two parts, provided that there are two languages. [sent-239, score-0.519]
66 The second kind of test set is a set of multilingual texts, denoted as Test2, each consisting of k segments in different languages. [sent-243, score-0.305]
67 Choose k languages randomly from L, where some of the k languages can overlap. [sent-251, score-0.286]
68 Choose a text length randomly from {40,80,120,160}, and randomly select this many characters from the test data. [sent-254, score-0.168]
69 Shuffle the k languages and concatenate the text portions in the resultant order. [sent-256, score-0.308]
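The generation steps for Test2 can be sketched as follows. This is an illustration under assumptions, not the authors' script: `make_multilingual_text` and its parameters are hypothetical names, overlap among the k languages is modeled as sampling with replacement, and each portion is taken as a random contiguous substring of that language's test data.

```python
import random


def make_multilingual_text(test_data, k, rng=None, lengths=(40, 80, 120, 160)):
    """Build one artificial multilingual text from monolingual test data.

    test_data maps a language name to its monolingual test text.
    Returns (text, borders, langs); borders are character offsets.
    """
    rng = rng or random.Random()
    # Step 1: choose k languages; sampling with replacement lets them overlap.
    chosen = rng.choices(sorted(test_data), k=k)
    # Step 2: draw a portion length and cut that many characters.
    portions = []
    for lang in chosen:
        source = test_data[lang]
        size = min(rng.choice(lengths), len(source))
        start = rng.randrange(len(source) - size + 1)
        portions.append((lang, source[start:start + size]))
    # Step 3: shuffle and concatenate in the resulting order.
    rng.shuffle(portions)
    text = "".join(portion for _, portion in portions)
    borders, offset = [], 0
    for _, portion in portions[:-1]:
        offset += len(portion)
        borders.append(offset)
    return text, borders, [lang for lang, _ in portions]


# Toy data (hypothetical): two "languages" with 200 characters each.
toy = {"aa": "a" * 200, "bb": "b" * 200}
text, borders, langs = make_multilingual_text(toy, k=3, rng=random.Random(0))
print(len(text), borders, langs)
```

When the same language is drawn for two adjacent portions, the offset between them is not a true language border, so an evaluation built on this sketch would need to merge such neighbors first.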
70 6 By default, the possibility of segmentation is considered at every character offset in a text, which provides a lower bound for the proposed method. [sent-259, score-0.236]
71 , |X |}, and, in step 3 above, text portions are generated so as to end at these bordering locations. [sent-271, score-0.261]
72 Given a multilingual text, we evaluate the outputs B and L through the following scores: PB/RB: Precision/recall of the borders detected (i. [sent-272, score-0.486]
73 , the correct borders detected, divided by the detected/correct border). [sent-274, score-0.269]
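The border scores PB and RB can be computed as below. This is a minimal sketch assuming that a detected border counts as correct only when its character offset matches a gold border exactly; the function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def border_precision_recall(detected, correct):
    """Return (PB, RB) for detected and gold border offsets."""
    detected, correct = set(detected), set(correct)
    hits = len(detected & correct)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Two of three detected borders are correct; two of four gold borders are found.
print(border_precision_recall([10, 25, 40], [10, 30, 40, 55]))
```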
74 Ps and Rs are obtained by changing the parameter γ given in formula (3), which ranges over 1,2,4,. [sent-278, score-0.176]
75 Although there are web pages consisting of texts in more than 2 languages, we rarely see a web page containing 5 languages at the same time. [sent-285, score-0.296]
76 Therefore, Test1 reflects the most important case of 2 languages only, whereas Test2 reflects the case of multiple languages to demonstrate the general potential of the proposed approach. [sent-286, score-0.286]
77 Since our motivation has been to eliminate a portion in a major language from the text, there could be a formulation specific to the problem. [Figure 1: Accuracy of language identification for monolingual texts; horizontal axis: input length (characters)] [sent-288, score-0.501]
78 1 Language Identification Performance We first show the performance of language identification using formula (2), which is used as the component of the text segmentation by language. [sent-292, score-0.45]
79 Figure 1 shows the results for language identification of monolingual texts with the UDHR and Wiki test data. [sent-293, score-0.331]
80 The horizontal axis indicates the size of the input text in characters, the vertical axis indicates the accuracy AL, and the graph contains four plots10 for MMS and PPM for each set of data. [sent-294, score-0.444]
81 Overall, all plots rise quickly despite the severe conditions of a large number of languages (over 200), a small amount of input data, and a small amount of learning data. [sent-295, score-0.235]
82 [Figure 2: Cumulative distribution of segment borders; horizontal axis: relative position (characters)] slightly better performance than did MMS. [sent-302, score-0.408]
83 Overall, the language identification performance seems sufficient to justify its application to our main problem of text segmentation by language. [sent-316, score-0.274]
84 Figure 2 shows the cumulative distribution obtained for segment border detection. [sent-319, score-0.323]
85 The horizontal axis indicates the relative location by character with respect to the correct border at zero, and the vertical axis indicates the cumulative proportion of texts whose border is detected at that relative point. [sent-320, score-0.997]
86 Note that segment borders are judged by characters and not by bordering locations, as explained in §5. [sent-322, score-0.634]
87 [Figure 3: PL/RL (language, upper graph) and PB/RB (border, lower graph) results, where borders were taken from any character offset; axis label: recall] Since the plots rise sharply at the middle of the horizontal axis, the borders were detected at or very near the correct place in many cases. [sent-324, score-0.783]
88 Figure 3 shows the two precision/recall graphs for language identification (upper graph) and segment border detection (lower graph), where borders were taken from any character offset. [sent-326, score-0.774]
89 In each graph, the horizontal axis indicates precision and the vertical axis indicates recall. [sent-327, score-0.345]
90 For segment border performance (lower graph), however, the results were limited. [sent-334, score-0.323]
91 The main reason for this is that both MMS and PPM tend to detect a border one character earlier than the correct location, as was seen in Figure 2. [sent-335, score-0.263]
92 Therefore, we repeated the experiment with Test2 under the constraint that a segment border could occur only at a bordering location, as explained in §5. [sent-337, score-0.465]
93 We could also observe how PPM performed better at detecting borders in this case. [sent-342, score-0.269]
94 This shows how each recursion of formula (4) works almost independently, having segmentation and language identification functions that are both robust. [sent-346, score-0.393]
95 Figure 5 shows the speed for Test2 processing, with the horizontal axis indicating the input length and the vertical axis indicating the processing time. [sent-351, score-0.456]
96 7 Conclusion This article has presented a method for segmenting a multilingual text into segments, each in a different language. [sent-357, score-0.341]
97 This task could serve for preprocessing of multilingual texts before applying languagespecific analysis to each text. [sent-358, score-0.318]
98 Moreover, the proposed method could be used to generate corpora in a variety of languages, since many texts in minor languages tend to contain chunks in a major language. [sent-359, score-0.296]
99 The segmentation task was modeled as an optimization problem of finding the best segment and language sequences to minimize the description length of a given text. [sent-360, score-0.446]
100 Overall, when segmenting a text with up to five random portions of different languages, where each portion consisted of 40 to 120 characters, the best F-scores for language identification and segmentation were 0. [sent-364, score-0.422]
wordName wordTfidf (topN-words)
[('borders', 0.269), ('ppm', 0.269), ('quechua', 0.23), ('border', 0.184), ('formula', 0.176), ('udhr', 0.167), ('multilingual', 0.165), ('texts', 0.153), ('languages', 0.143), ('segment', 0.139), ('teahan', 0.134), ('mms', 0.115), ('xi', 0.114), ('segmentation', 0.114), ('length', 0.111), ('axis', 0.11), ('portions', 0.108), ('identification', 0.103), ('segments', 0.099), ('bordering', 0.096), ('norwegian', 0.096), ('tonga', 0.096), ('northern', 0.084), ('characters', 0.084), ('description', 0.082), ('article', 0.079), ('character', 0.079), ('py', 0.076), ('monolingual', 0.075), ('horizontal', 0.071), ('juola', 0.067), ('declaration', 0.067), ('universal', 0.066), ('latin', 0.059), ('coding', 0.059), ('formulation', 0.059), ('albanian', 0.058), ('belarusian', 0.058), ('bosnian', 0.058), ('cleary', 0.058), ('creole', 0.058), ('cyrillic', 0.058), ('dlli', 0.058), ('farach', 0.058), ('haitian', 0.058), ('kurdish', 0.058), ('leni', 0.058), ('ninka', 0.058), ('nynorsk', 0.058), ('occitan', 0.058), ('sorbian', 0.058), ('walloon', 0.058), ('term', 0.057), ('text', 0.057), ('german', 0.057), ('arabic', 0.056), ('len', 0.054), ('vertical', 0.054), ('detected', 0.052), ('additive', 0.051), ('lithuanian', 0.05), ('miao', 0.05), ('chin', 0.05), ('ash', 0.05), ('indonesian', 0.05), ('malay', 0.05), ('serbian', 0.046), ('amount', 0.046), ('explained', 0.046), ('offset', 0.043), ('wiki', 0.043), ('graph', 0.042), ('representative', 0.042), ('denoted', 0.041), ('locations', 0.04), ('segmenting', 0.04), ('rights', 0.038), ('dutch', 0.038), ('abkhazian', 0.038), ('afrikaans', 0.038), ('akan', 0.038), ('ancash', 0.038), ('arpitan', 0.038), ('aymara', 0.038), ('bambara', 0.038), ('benedetto', 0.038), ('bikol', 0.038), ('bislama', 0.038), ('bokm', 0.038), ('breton', 0.038), ('cebuano', 0.038), ('cilibrasi', 0.038), ('corsican', 0.038), ('devanagari', 0.038), ('dll', 0.038), ('esperanto', 0.038), ('ewe', 0.038), ('faroese', 0.038), ('frisian', 0.038), ('friulian', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 1.000002 194 acl-2012-Text Segmentation by Language Using Minimum Description Length
2 0.096727744 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu
Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.
3 0.095823698 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
Author: Tahira Naseem ; Regina Barzilay ; Amir Globerson
Abstract: We present a novel algorithm for multilingual dependency parsing that uses annotations from a diverse set of source languages to parse a new unannotated language. Our motivation is to broaden the advantages of multilingual learning to languages that exhibit significant differences from existing resource-rich languages. The algorithm learns which aspects of the source languages are relevant for the target language and ties model parameters accordingly. The model factorizes the process of generating a dependency tree into two steps: selection of syntactic dependents and their ordering. Being largely language-universal, the selection component is learned in a supervised fashion from all the training languages. In contrast, the ordering decisions are only influenced by languages with similar properties. We systematically model this cross-lingual sharing using typological features. In our experiments, the model consistently outperforms a state-of-the-art multilingual parser. The largest improvement is achieved on the non-Indo-European languages, yielding a gain of 14.4%.
4 0.086897038 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
Author: Wei Lu ; Dan Roth
Abstract: This paper presents a novel sequence labeling model based on the latent-variable semi-Markov conditional random fields for jointly extracting argument roles of events from texts. The model takes in coarse mention and type information and predicts argument roles for a given event template. This paper addresses the event extraction problem in a primarily unsupervised setting, where no labeled training instances are available. Our key contribution is a novel learning framework called structured preference modeling (PM), that allows arbitrary preference to be assigned to certain structures during the learning procedure. We establish and discuss connections between this framework and other existing works. We show empirically that the structured preferences are crucial to the success of our task. Our model, trained without annotated data and with a small number of structured preferences, yields performance competitive to some baseline supervised approaches.
5 0.085955471 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
Author: Spence Green ; John DeNero
Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.
6 0.085414588 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
7 0.08107385 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
9 0.075273298 140 acl-2012-Machine Translation without Words through Substring Alignment
10 0.070740037 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
11 0.068907052 134 acl-2012-Learning to Find Translations and Transliterations on the Web
12 0.068645537 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment
13 0.06846381 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
14 0.067574583 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
15 0.067361861 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
16 0.066038549 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
17 0.066029064 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters
18 0.065512888 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
19 0.065412238 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API
20 0.064458586 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery
topicId topicWeight
[(0, -0.183), (1, 0.009), (2, -0.021), (3, 0.006), (4, 0.034), (5, 0.141), (6, 0.045), (7, -0.062), (8, -0.016), (9, -0.02), (10, -0.055), (11, -0.012), (12, 0.019), (13, -0.036), (14, -0.03), (15, -0.032), (16, -0.002), (17, -0.017), (18, -0.002), (19, 0.161), (20, -0.055), (21, -0.007), (22, 0.034), (23, -0.065), (24, -0.095), (25, 0.08), (26, -0.117), (27, 0.068), (28, -0.017), (29, -0.061), (30, 0.005), (31, 0.065), (32, 0.043), (33, -0.033), (34, 0.006), (35, -0.06), (36, -0.078), (37, 0.046), (38, -0.039), (39, -0.162), (40, 0.02), (41, 0.17), (42, -0.122), (43, 0.076), (44, 0.118), (45, -0.187), (46, 0.188), (47, 0.107), (48, 0.032), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.95457184 194 acl-2012-Text Segmentation by Language Using Minimum Description Length
2 0.65361398 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
Author: Marco Lui ; Timothy Baldwin
Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.
3 0.6494348 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
Author: Pierre Magistry ; Benoit Sagot
Abstract: In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized system available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007)
4 0.57810938 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts
Author: Stephen Tyndall
Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.
5 0.49923643 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
Author: Tahira Naseem ; Regina Barzilay ; Amir Globerson
Abstract: We present a novel algorithm for multilingual dependency parsing that uses annotations from a diverse set of source languages to parse a new unannotated language. Our motivation is to broaden the advantages of multilingual learning to languages that exhibit significant differences from existing resource-rich languages. The algorithm learns which aspects of the source languages are relevant for the target language and ties model parameters accordingly. The model factorizes the process of generating a dependency tree into two steps: selection of syntactic dependents and their ordering. Being largely language-universal, the selection component is learned in a supervised fashion from all the training languages. In contrast, the ordering decisions are only influenced by languages with similar properties. We systematically model this cross-lingual sharing using typological features. In our experiments, the model consistently outperforms a state-of-the-art multilingual parser. The largest improvement is achieved on the non-Indo-European languages, yielding a gain of 14.4%.
6 0.44235739 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters
7 0.44137928 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
8 0.43994984 163 acl-2012-Prediction of Learning Curves in Machine Translation
9 0.42214417 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis
10 0.39798599 137 acl-2012-Lemmatisation as a Tagging Task
11 0.39121407 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
12 0.38443479 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
14 0.3732971 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
15 0.36965004 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API
16 0.3611801 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
17 0.35246289 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum
18 0.34265718 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
19 0.34171721 139 acl-2012-MIX Is Not a Tree-Adjoining Language
20 0.34025982 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
topicId topicWeight
[(13, 0.398), (25, 0.029), (26, 0.032), (28, 0.053), (30, 0.014), (37, 0.043), (39, 0.041), (57, 0.012), (74, 0.021), (82, 0.027), (84, 0.033), (85, 0.032), (90, 0.101), (92, 0.03), (94, 0.02), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.81163651 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
Author: Andreas Maletti ; Joost Engelfriet
Abstract: Recently, it was shown (KUHLMANN, SATTA: Tree-adjoining grammars are not closed under strong lexicalization. Comput. Linguist., 2012) that finitely ambiguous tree adjoining grammars cannot be transformed into a normal form (preserving the generated tree language), in which each production contains a lexical symbol. A more powerful model, the simple context-free tree grammar, admits such a normal form. It can be effectively constructed and the maximal rank of the nonterminals only increases by 1. Thus, simple context-free tree grammars strongly lexicalize tree adjoining grammars and themselves.
same-paper 2 0.7140047 194 acl-2012-Text Segmentation by Language Using Minimum Description Length
Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii
Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.
3 0.69593835 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries
Author: Eduard Dragut ; Hong Wang ; Clement Yu ; Prasad Sistla ; Weiyi Meng
Abstract: Polarity classification of words is important for applications such as Opinion Mining and Sentiment Analysis. A number of sentiment word/sense dictionaries have been manually or (semi)automatically constructed. The dictionaries have substantial inaccuracies. Besides obvious instances, where the same word appears with different polarities in different dictionaries, the dictionaries exhibit complex cases, which cannot be detected by mere manual inspection. We introduce the concept of polarity consistency of words/senses in sentiment dictionaries in this paper. We show that the consistency problem is NP-complete. We reduce the polarity consistency problem to the satisfiability problem and utilize a fast SAT solver to detect inconsistencies in a sentiment dictionary. We perform experiments on four sentiment dictionaries and WordNet.
4 0.35303542 139 acl-2012-MIX Is Not a Tree-Adjoining Language
Author: Makoto Kanazawa ; Sylvain Salvati
Abstract: The language MIX consists of all strings over the three-letter alphabet {a, b, c} that contain an equal number of occurrences of each letter. We prove Joshi’s (1985) conjecture that MIX is not a tree-adjoining language.
5 0.34689698 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposed Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
6 0.34291837 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
7 0.34043962 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
8 0.33970892 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
9 0.33936396 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
10 0.33824334 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification
11 0.33731613 140 acl-2012-Machine Translation without Words through Substring Alignment
12 0.33719948 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
14 0.33630684 191 acl-2012-Temporally Anchored Relation Extraction
15 0.33421248 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
16 0.33416241 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
17 0.33382779 187 acl-2012-Subgroup Detection in Ideological Discussions
18 0.33377081 136 acl-2012-Learning to Translate with Multiple Objectives
19 0.33303511 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction
20 0.33300361 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
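The abstract of this page's own paper (acl-2012-194) formulates language segmentation as minimizing the description length of a text via dynamic programming. The following is only a rough sketch of that general idea, not the authors' method: it assumes smoothed character-unigram models as the per-language coders and a fixed per-segment overhead in bits. The names `train_unigram`, `segment_by_language`, and `switch_cost` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet, alpha=1.0):
    """Return per-character code lengths in bits (add-alpha smoothed unigram).

    Assumption: a unigram model stands in for whatever compressor or
    language model a real system would use.
    """
    counts = Counter(sample)
    total = len(sample) + alpha * len(alphabet)
    return {ch: -math.log2((counts[ch] + alpha) / total) for ch in alphabet}


def segment_by_language(text, models, switch_cost=8.0):
    """Split text into (segment, language) pairs with minimum total bits.

    Total bits = per-character code lengths under the chosen language
    + switch_cost for every segment opened (an assumed fixed overhead).
    Every character of text must appear in each model's alphabet.
    """
    if not text:
        return []
    langs = list(models)
    # best[l]: cheapest encoding of the prefix whose last character uses l
    best = {l: switch_cost + models[l][text[0]] for l in langs}
    back = []  # back[k][l]: language of character k given character k+1 uses l
    for ch in text[1:]:
        cheapest = min(best, key=best.get)
        new_best, pointers = {}, {}
        for l in langs:
            stay = best[l]
            switch = best[cheapest] + switch_cost
            if stay <= switch:
                new_best[l], pointers[l] = stay + models[l][ch], l
            else:
                new_best[l], pointers[l] = switch + models[l][ch], cheapest
        best = new_best
        back.append(pointers)
    # Backtrack from the cheapest final state to label every character.
    lang = min(best, key=best.get)
    labels = [lang]
    for pointers in reversed(back):
        lang = pointers[lang]
        labels.append(lang)
    labels.reverse()
    # Collapse runs of identical labels into segments.
    segments, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or labels[i] != labels[start]:
            segments.append((text[start:i], labels[start]))
            start = i
    return segments


# Toy usage with two artificial "languages" over the alphabet {a, b}.
toy_models = {
    "A": train_unigram("aaaaaaaaab", "ab"),
    "B": train_unigram("bbbbbbbbba", "ab"),
}
print(segment_by_language("aaaaaaaabbbbbbbb", toy_models))
```

Because the assumed code lengths are additive per character, the search reduces to a Viterbi-style pass over (position, language) states. A larger `switch_cost` discourages short segments, which loosely mirrors the paper's restriction to segments longer than a clause.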