acl2013-128: knowledge-graph by maker-knowledge-mining

128 acl-2013-Does Korean defeat phonotactic word segmentation?


Source: pdf

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Computational models of infant word segmentation have not been tested on a wide range of languages. [sent-4, score-0.268]

2 This paper applies a phonotactic segmentation model to Korean. [sent-5, score-0.616]

3 In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. [sent-6, score-0.31]

4 Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items. [sent-7, score-0.596]

5 1 Introduction The process by which infants learn to parse the acoustic signal into word-sized units—word segmentation—is an active area of research in developmental psychology (Polka and Sundara 2012; Saffran et al. [sent-8, score-0.29]

6 Word segmentation is a classic bootstrapping problem: to learn words, infants must segment the input, because around 90% of the novel word types they hear are never uttered in isolation (Aslin et al. [sent-11, score-0.529]

7 However, in order to segment, infants must know some words, or generalizations about the properties of words. [sent-13, score-0.362]

8 How can infants form generalizations about words before learning words themselves? [sent-14, score-0.332]

9 1 DiBS Two approaches in the literature might be termed lexical and phonotactic. [sent-16, score-0.047]

10 Under the lexical approach, exemplified by GGJ09, infants are assumed to exploit the Zipfian distribution of language, identifying frequently recurring and mutually predictive sequences as words. [sent-17, score-0.29] [sent-18, score-0.125]

12 In the phonotactic approach, infants are assumed to leverage universal and/or language-specific knowledge about the phonological content of sequences to infer the optimal segmentation. [sent-19, score-0.953]

13 The present study focuses on the phonotactic approach outlined in DP11, termed DiBS. [sent-20, score-0.471]

14 A (Di)phone-(B)ased (S)egmentation model consists of an inventory of segment-segment sequences, with an estimated probability that a word boundary falls between the two segments. [sent-23, score-0.223]

15 For example, when [pd] occurs in English, the probability of an intervening word boundary is very high: Pr(# | [pd]) ≈ 1. [sent-24, score-0.194]

16 For assessment purposes, these probabilities are converted to hard decisions. [sent-27, score-0.032]

17 DP11 describe an unsupervised learning algorithm for DiBS that exploits a positional independence assumption, treating phrase edges as a proxy for word edges (phrasal model). [sent-28, score-0.3]
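
To make this concrete, the following is a minimal Python sketch of a baseline (supervised) DiBS-style segmenter, not the authors' implementation: the toy corpus, the 0.5 decision threshold, and all function names are illustrative assumptions. The phrasal variant would instead estimate the boundary distribution from phrase-initial and phrase-final segment counts rather than from observed word boundaries.

```python
# Minimal sketch of a baseline (supervised) DiBS-style segmenter.
# Assumes a training corpus of space-delimited phone strings; the 0.5
# threshold and all names are illustrative, not the authors' code.
from collections import Counter

def train_dibs(utterances):
    """Estimate Pr(word boundary | diphone) from segmented utterances."""
    spans, totals = Counter(), Counter()
    for utt in utterances:
        words = utt.split()
        phones = [p for w in words for p in w]
        # Positions j+1 such that a word boundary falls between
        # phones[j] and phones[j+1].
        boundaries, i = set(), 0
        for w in words[:-1]:
            i += len(w)
            boundaries.add(i)
        for j in range(len(phones) - 1):
            diphone = (phones[j], phones[j + 1])
            totals[diphone] += 1
            if j + 1 in boundaries:
                spans[diphone] += 1
    return {d: spans[d] / totals[d] for d in totals}

def segment(model, phones, threshold=0.5):
    """Hard decision: posit a boundary where Pr(# | diphone) > threshold."""
    out = [phones[0]]
    for j in range(len(phones) - 1):
        if model.get((phones[j], phones[j + 1]), 0.0) > threshold:
            out.append(" ")
        out.append(phones[j + 1])
    return "".join(out)

model = train_dibs(["the kitty", "pat the kitty"])
print(segment(model, list("thekitty")))  # -> "the kitty"
```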

18 This learning model's performance on English is on par with state-of-the-art lexical models (GGJ09), reflecting the high positional informativeness of diphones in English. [sent-29, score-0.386]

19 We apply the baseline and phrasal models to Korean. [sent-30, score-0.064]

20 2 Linguistic properties of Korean Korean is unrelated to languages previously modeled (English, Dutch, French, Spanish, Arabic, Greek, Russian), and it is an interesting test case for both phonotactic and lexical approaches. [sent-32, score-0.03] [sent-34, score-0.424]

22 Most noun phrases are marked with a limited set of case suffixes, and clauses generally end in a verb, inflected with suffixes ending in a limited set of sounds ([a,ʌ,i,jo]). [sent-36, score-0.147]

23 Thus, the phrase-final distribution may not reflect the overall word-final distribution—problematic for some phonotactic approaches. [sent-37, score-0.424]

24 Similarly, the high frequency and positional predictability of affixes could lead a lexical model to treat them as words. [sent-38, score-0.096]

25 A range of phonological processes apply in Korean, even across word boundaries (Sohn 1999), yielding extensive allomorphy. [sent-39, score-0.303]

26 Korean consonantal phonology gives diphones several informative properties, including: various consonant clusters (obstruent-lenis, lenis-nasal, et al.) are possible only if they span a word boundary; various consonants cannot precede a word boundary; and [ŋ] cannot follow a word boundary. [sent-41, score-0.402]

27 Conversely, unlike in previously studied languages, vowel-vowel sequences are common word-internally. [sent-42, score-0.691]

28 This is likely to be problematic for phonotactic models, but not for lexical ones. [sent-43, score-0.483]

29 2 Methods We obtained a phonetic corpus representing Korean speech by applying a grapheme-to-phonetic converter to a text corpus. [sent-44, score-0.151]

30 First, we conducted an analysis of this phonetic corpus, with results in Table 1. [sent-45, score-0.122]

31 Next, for comparability with previous studies, two 750,000-word samples (representing approximately one month of child input each) were randomly drawn from the phonetic corpus—the training and test corpora. [sent-46, score-0.212]

32 The phrasal and baseline DiBS models described above were trained and tested on these corpora; results are reported in Table 2. [sent-47, score-0.064]

33 Finally, we inspected one ‘day’ worth of segmentations, and offer a qualitative assessment of errors. [sent-48, score-0.032]

34 1 Corpus and phonetic conversion The Korea Advanced Institute of Science and Technology Raw Corpus, available from the Semantic Web Research Center, semanticweb. [sent-50, score-0.18]

35 It includes morphosyntactic processing, phrase-break detection, and a dictionary of phonetic exceptions. [sent-58, score-0.122]

36 It applies regular and lexically-conditioned phonological rules, but not optional rules. [sent-59, score-0.163]

37 An example of original text and the phonetic conversion is given below, with phonological changes in bold. [Korean orthographic and phonetic example strings garbled in extraction; omitted.] [sent-64, score-0.343]

38 We relied on spaces in the corpus to indicate word boundaries, although, as in all languages, there can be inconsistencies in written Korean. [sent-68, score-0.047]

39 2 Error analysis An under-researched issue is the nature of the errors that segmentation algorithms make. [sent-70, score-0.234]

40 For a given input word in the test corpus, we defined the output projection as the minimal sequence of segmented words containing the entire input word. [sent-71, score-0.475]

41 For example, if the#kitty were segmented as thekitty, then thekitty would be the output projection for both the and kitty. [sent-72, score-0.403]

42 Similarly, for a posited word in the segmentation/output of the test corpus, we defined the input projection. [sent-73, score-0.149]

43 For example, if the#kitty were segmented as theki#tty, then the#kitty would be the input projection of both theki and tty. [sent-74, score-0.417]
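
Assuming each segmentation is a list of words tiling the same phone string, both projections reduce to one offset computation, with gold and posited words swapping roles; a hedged sketch (names are illustrative assumptions):

```python
# Hedged sketch of the projection definitions above. Both segmentations
# are lists of words over the same unsegmented string.
def spans(words):
    """Pair each word with its (start, end) offsets in the flat string."""
    out, i = [], 0
    for w in words:
        out.append((w, i, i + len(w)))
        i += len(w)
    return out

def projection(source_words, target_words):
    """For each source word, the minimal run of target words covering it."""
    target = spans(target_words)
    result = {}
    for w, s, e in spans(source_words):
        covering = [tw for tw, ts, te in target if ts < e and te > s]
        result[(w, s)] = "#".join(covering)
    return result

# Output projection: gold words against the model's segmentation.
print(projection(["the", "kitty"], ["thekitty"]))
# -> {('the', 0): 'thekitty', ('kitty', 3): 'thekitty'}
# Input projection is the same call with posited words as the source and
# gold words as the target.
```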

44 Are highly frequent items segmented frequently enough that the child is likely to be able to learn them? [sent-77, score-0.419]

45 Is it the case that all or most items which are segmented frequently are themselves words? [sent-78, score-0.285]

46 Are there predicted errors which seem especially serious or difficult to overcome? [sent-79, score-0.042]

47 3 Results and discussion The 1350 distinct diphones found in the phonetic corpus were grouped into phonological classes. [sent-80, score-0.575]

48 Table 1 indicates the probabilities (percentage) that a word boundary falls inside the diphone; when the class contains 3 or more diphones, the median and range are shown. [sent-81, score-0.223]

49 Because of various phonological processes, some sequences cannot exist (blank cells), some can occur only word-internally (marked int), and some can occur only across word boundaries (marked span). [sent-82, score-0.413]

50 For example, the velar nasal [ŋ] cannot begin a word, so diphones of the form Xŋ must be word-internal. [sent-83, score-0.348]

51 Conversely, a lenis-/h/ sequence indicates a word boundary, because within a word a lenis stop merges with following /h/ to become an aspirated stop. [sent-84, score-0.094]

52 If all diphones in a cell have a spanning rate above 90%, the cell says span*, and if below 10%, int*. [sent-85, score-0.395]

53 This means that all the diphones in that class are highly informative; other classes contain a mix of more and less informative diphones. [sent-86, score-0.33]
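
The class-level summary in Table 1 follows mechanically from per-diphone spanning rates; here is a hedged sketch under assumed inputs (a rate per diphone and a class label per diphone), not the authors' code:

```python
# Sketch of the Table 1 summary: per phonological class, the median and
# range of diphone boundary rates, plus span*/int* flags when every
# diphone in the class is above 90% or below 10%. Inputs are assumptions.
from statistics import median

def summarize(rates, classes):
    """rates: {diphone: Pr(boundary inside)}; classes: {diphone: class}."""
    by_class = {}
    for diphone, rate in rates.items():
        by_class.setdefault(classes[diphone], []).append(rate)
    table = {}
    for cls, rs in by_class.items():
        cell = {"median": median(rs), "range": (min(rs), max(rs))}
        if all(r > 0.9 for r in rs):
            cell["flag"] = "span*"
        elif all(r < 0.1 for r in rs):
            cell["flag"] = "int*"
        table[cls] = cell
    return table
```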

54 An undersegmentation error is a true word boundary which the segmentation algorithm fails to find (miss), while an oversegmentation error is a falsely posited boundary (false alarm). [sent-88, score-0.902]

55 The under- and over-segmentation error rates are defined as the number of such errors per word (percent). [sent-89, score-0.132]

56 We also report the precision, recall, and F scores for boundary detection, word token segmentation, and type segmentation (for details see DP11, GGJ09). [sent-90, score-0.386]
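
These boundary-level scores follow directly from the sets of true and posited boundary positions; a hedged sketch (the word token and type scores would additionally compare whole words; the corpus format is an assumption):

```python
# Illustrative sketch of the boundary-level scores: misses per word
# (undersegmentation), false alarms per word (oversegmentation), and
# boundary precision/recall/F.
def boundary_set(utt):
    """Offsets at which a word boundary falls in a space-delimited string."""
    out, i = set(), 0
    for w in utt.split()[:-1]:
        i += len(w)
        out.add(i)
    return out

def score(gold_utts, pred_utts):
    hits = misses = false_alarms = n_words = 0
    for g, p in zip(gold_utts, pred_utts):
        gb, pb = boundary_set(g), boundary_set(p)
        hits += len(gb & pb)
        misses += len(gb - pb)
        false_alarms += len(pb - gb)
        n_words += len(g.split())
    precision = hits / max(hits + false_alarms, 1)
    recall = hits / max(hits + misses, 1)
    f = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"under_per_word": misses / n_words,
            "over_per_word": false_alarms / n_words,
            "P": precision, "R": recall, "F": f}

print(score(["the kitty is here"], ["the kittyis here"]))
# -> one miss, no false alarms: P=1.0, R=0.667, F=0.8
```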

57 On the basis of the fact that the oversegmentation error rate in English and Russian was consistently below 10% (<1 error/10 wds), DP11 conjectured that phonotactic segmenters will, cross-linguistically, avoid significant oversegmentation. [sent-91, score-0.669]

58 The results in Table 2 provide a counterexample: oversegmentation is distinctly higher than in English and Russian. [sent-92, score-0.247]

59 Indeed, Korean is a more challenging language for purely phonotactic segmentation. [sent-93, score-0.424]

60 1 Phonotactic cues to word segmentation Because phonological processes are more likely to apply word-internally, word-internal sequences are more predictable (Aslin et al. [sent-95, score-0.436]

61 The phonology of Korean is a potentially rich source of information for word segmentation: obstruent-initial diphones are generally informative as to the presence/absence of word boundaries. [sent-98, score-0.467]

62 However, as we suspected, vowelvowel sequences are problematic, since they occur freely both within words and across word boundaries. [sent-99, score-0.157]

63 Korean differs from English in that most English diphones occur nearly exclusively within words, or nearly exclusively across word boundaries (DP11), while in Korean most sonorant-obstruent sequences occur both within and across words. [sent-100, score-0.596]

64 2 Errors and word-learning It seems reasonable to assume that word-learning is best facilitated by seeing multiple occurrences of a word. [sent-102, score-0.029]

65 A segmentation that is produced only once might be ignored; thus we defined an input or output projection as frequent if it occurred more than once in the test sample. [sent-103, score-0.479]

66 A word learner relying on a phonotactic model could expect to successfully identify many frequent words. [sent-104, score-0.56]

67 For 73 of the 100 most frequent input words, the only frequent output projection in the baseline model was the input word itself, meaning that the word was segmented correctly in most contexts. [sent-105, score-0.7]

68 For 20 there was no frequent output projection, meaning that the word was not segmented consistently across contexts, which we assume is noise to the learner. [sent-106, score-0.352]

69 In the phrasal model, for 16 items the most frequent output projection was the input word itself and for 64 there was no frequent output projection. [sent-107, score-0.569]

70 Conversely, of the 100 most frequent potential words identified by the baseline model, in 26 cases the most frequent input projection was the output word itself: a real word was correctly identified. [sent-108, score-0.47]

71 In 26 cases there was no frequent input projection, and in 48 another input projection was at least as frequent as the output word. [sent-109, score-0.421]
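
A small sketch of this tallying procedure, assuming a list of (word, projection) pairs collected over the test sample and the "frequent = occurs more than once" criterion defined above; names and the exact three-way split are illustrative:

```python
# Hedged sketch of the frequent-projection tally: for each of the top-N
# words, is its only frequent projection the word itself?
from collections import Counter

def tally(pairs, top_n=100):
    """Classify each of the top_n most frequent words by its projections."""
    word_freq = Counter(w for w, _ in pairs)
    proj_freq = {}
    for w, proj in pairs:
        proj_freq.setdefault(w, Counter())[proj] += 1
    labels = Counter()
    for w, _ in word_freq.most_common(top_n):
        frequent = {p for p, c in proj_freq[w].items() if c > 1}
        if not frequent:
            labels["no frequent projection"] += 1
        elif frequent == {w}:
            labels["only frequent projection is the word itself"] += 1
        else:
            labels["another frequent projection exists"] += 1
    return labels

# Usage (illustrative): tally(pairs) yields counts analogous to the
# 73/20 and 26/26/48 splits reported above.
```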

72 One such example is [mjʌn] ‘cotton’, frequently segmented out when it was a bound morpheme (‘if’ or ‘how many’). [sent-110, score-0.234]

73 The most frequently segmented item was [ke], which can be a freestanding word (‘there/thing’), but was often segmented out from words suffixed with [-ke] ‘-ly/to’ and [-eke] ‘to’. [sent-111, score-0.495]

74 What do these results mean for a child using a phonotactic strategy? [sent-112, score-0.469]

75 First, many of the types segmented in a day would be experienced only once (and presumably ignored). [sent-113, score-0.214]

76 Second, infants would not go far astray if they learned frequently-segmented items as words. [sent-114, score-0.341]

77 3 Phrase edges and independence We suspected the reason that the phrasal DiBS model performed so much worse than baseline was its assumption that phrase-edge distributions approximate word-edge distributions. [sent-116, score-0.232]

78 Phrase beginnings were a good proxy for word beginnings, but there were mismatches phrase-finally. [sent-117, score-0.148]

79 For example, [a] is much more frequent phrase-finally than word-finally (because of common verb suffixes ending in [a]), while [n] is much more frequent word-finally (because of non-sentence-final suffixes ending in [n]). [sent-118, score-0.416]

80 4 Conclusion This paper extends previous studies by applying a computational learning model of phonotactic word segmentation to Korean. [sent-120, score-0.663]

81 Various properties of Korean led us to believe it would challenge both unsupervised phonotactic and lexical approaches. [sent-121, score-0.454]

82 Phonological and morphological analysis of errors yielded novel insights. [sent-122, score-0.042]

83 For example, the generally greater error rate in Korean is partly caused by a high tolerance for vowel-vowel sequences within words. [sent-123, score-0.16]

84 Interactions between morphology and word order result in violations of a key positional independence assumption. [sent-124, score-0.202]

85 Phonotactic segmentation was distinctly worse than in previous languages (English, Russian), particularly for oversegmentation errors. [sent-125, score-0.467]

86 This implies the segmentation of simplistic diphone models is not cross-linguistically stable, a finding that aligns with other cross-linguistic comparisons of segmentation algorithms. [sent-126, score-0.441]

87 In general, distinctly worse performance is found for languages other than English (Sesotho: Blanchard et al. [sent-127, score-0.114]

88 These facts suggest that the successful segmentation model must incorporate richer phonotactics, or integrate some lexical processing. [sent-129, score-0.192]

89 On the bright side, we found that frequently segmented items were mostly words, so a high segmentation error rate does not necessarily translate to a high error rate for word-learning. [sent-130, score-0.645]

90 Models of word segmentation in fluent maternal speech to infants. [sent-140, score-0.239]

91 Modeling the contribution of phonotactic cues to the problem of word segmentation. [sent-153, score-0.471]

92 A Bayesian framework for word segmentation: Exploring the effects of context. [sent-170, score-0.047]

93 Morpheme-based grapheme to phoneme conversion using phonetic patterns and morphophonemic connectivity information. [sent-177, score-0.209]

94 Word segmentation in monolingual infants acquiring Canadian-English and Canadian-French: Native language, cross-language and cross-dialect comparisons. [sent-186, score-0.482]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('phonotactic', 0.424), ('korean', 0.297), ('diphones', 0.29), ('infants', 0.29), ('segmentation', 0.192), ('segmented', 0.185), ('pd', 0.184), ('phonological', 0.163), ('dibs', 0.161), ('oversegmentation', 0.161), ('boundary', 0.147), ('aslin', 0.129), ('phonetic', 0.122), ('projection', 0.122), ('daland', 0.097), ('fleck', 0.097), ('saffran', 0.097), ('positional', 0.096), ('frequent', 0.089), ('blanchard', 0.086), ('distinctly', 0.086), ('kitty', 0.086), ('weijer', 0.086), ('sohn', 0.079), ('sequences', 0.076), ('suffixes', 0.07), ('russian', 0.065), ('fr', 0.065), ('beginnings', 0.065), ('phonotactics', 0.065), ('pierrehumbert', 0.065), ('polka', 0.065), ('sundara', 0.065), ('theki', 0.065), ('thekitty', 0.065), ('undersegmentation', 0.065), ('phrasal', 0.064), ('problematic', 0.059), ('boundaries', 0.059), ('independence', 0.059), ('conversion', 0.058), ('diphone', 0.057), ('posited', 0.057), ('angeles', 0.053), ('items', 0.051), ('los', 0.05), ('suspected', 0.05), ('ending', 0.049), ('frequently', 0.049), ('termed', 0.047), ('word', 0.047), ('ic', 0.046), ('conversely', 0.046), ('input', 0.045), ('child', 0.045), ('phonology', 0.043), ('campbell', 0.043), ('error', 0.043), ('generalizations', 0.042), ('errors', 0.042), ('rate', 0.041), ('informative', 0.04), ('proxy', 0.036), ('occur', 0.034), ('processes', 0.034), ('span', 0.033), ('kim', 0.032), ('goldwater', 0.032), ('cell', 0.032), ('assessment', 0.032), ('int', 0.032), ('output', 0.031), ('edges', 0.031), ('box', 0.03), ('ignored', 0.03), ('properties', 0.03), ('day', 0.029), ('falls', 0.029), ('separated', 0.029), ('ju', 0.029), ('morphophonemic', 0.029), ('infant', 0.029), ('infancy', 0.029), ('generously', 0.029), ('converter', 0.029), ('ased', 0.029), ('attend', 0.029), ('bic', 0.029), ('consonantal', 0.029), ('facilitated', 0.029), ('mahwah', 0.029), ('nasal', 0.029), ('newport', 0.029), ('suffixed', 0.029), ('velar', 0.029), ('zipfian', 0.029), ('worse', 0.028), ('exclusively', 0.028), ('marked', 0.028), ('van', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 128 acl-2013-Does Korean defeat phonotactic word segmentation?

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

2 0.15445901 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance

Author: Chris Fournier

Abstract: This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012). Existing segmentation metrics such as Pk, WindowDiff, and Segmentation Similarity (S) are all able to award partial credit for near misses between boundaries, but are biased towards segmentations containing few or tightly clustered boundaries. Despite S’s improvements, its normalization also produces cosmetically high values that overestimate agreement & performance, leading this work to propose a solution.

3 0.11881854 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

Author: Xiaodong Zeng ; Derek F. Wong ; Lidia S. Chao ; Isabel Trancoso

Abstract: This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data. The proposed approach trains a character-based and word-based model on labeled data, respectively, as the initial models. Then, the two models are constantly updated using unlabeled examples, where the learning objective is maximizing their segmentation agreements. The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data. The segmentation for an input sentence is decoded by using a joint scoring function combining the two induced models. The evaluation on the Chinese tree bank reveals that our model results in better gains over the state-of-the-art semi-supervised models reported in the literature.

4 0.1163591 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu

Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves sig- nificant improvement on a series of testing sets from different domains, even with a single classifier and local features.

5 0.10337123 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

Author: Rohan Ramanath ; Monojit Choudhury ; Kalika Bali ; Rishiraj Saha Roy

Abstract: Query segmentation, like text chunking, is the first step towards query understanding. In this study, we explore the effectiveness of crowdsourcing for this task. Through carefully designed control experiments and Inter Annotator Agreement metrics for analysis of experimental data, we show that crowdsourcing may not be a suitable approach for query segmentation because the crowd seems to have a very strong bias towards dividing the query into roughly equal (often only two) parts. Similarly, in the case of hierarchical or nested segmentation, turkers have a strong preference towards balanced binary trees.

6 0.09851636 97 acl-2013-Cross-lingual Projections between Languages from Different Families

7 0.096151695 80 acl-2013-Chinese Parsing Exploiting Characters

8 0.090235248 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

9 0.087685183 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

10 0.084916472 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

11 0.079608351 136 acl-2013-Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text

12 0.073770098 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

13 0.069317125 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

14 0.068614252 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

15 0.066830307 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

16 0.066791549 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

17 0.065882914 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

18 0.054495241 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

19 0.049740389 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

20 0.049134895 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.125), (1, -0.037), (2, -0.09), (3, 0.015), (4, 0.09), (5, -0.086), (6, -0.066), (7, 0.013), (8, -0.0), (9, 0.047), (10, -0.009), (11, -0.026), (12, 0.014), (13, 0.002), (14, -0.122), (15, -0.056), (16, 0.065), (17, -0.011), (18, 0.055), (19, 0.023), (20, -0.021), (21, 0.038), (22, 0.027), (23, -0.041), (24, 0.021), (25, 0.001), (26, 0.013), (27, 0.077), (28, -0.064), (29, 0.017), (30, 0.011), (31, -0.003), (32, -0.005), (33, -0.054), (34, 0.051), (35, -0.045), (36, 0.066), (37, 0.058), (38, -0.106), (39, 0.067), (40, 0.056), (41, 0.039), (42, -0.003), (43, 0.082), (44, -0.062), (45, -0.032), (46, -0.053), (47, 0.077), (48, -0.153), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91583329 128 acl-2013-Does Korean defeat phonotactic word segmentation?

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

2 0.73600799 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance

Author: Chris Fournier

Abstract: This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012). Existing segmentation metrics such as Pk, WindowDiff, and Segmentation Similarity (S) are all able to award partial credit for near misses between boundaries, but are biased towards segmentations containing few or tightly clustered boundaries. Despite S’s improvements, its normalization also produces cosmetically high values that overestimate agreement & performance, leading this work to propose a solution.

3 0.63831973 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

Author: Rohan Ramanath ; Monojit Choudhury ; Kalika Bali ; Rishiraj Saha Roy

Abstract: Query segmentation, like text chunking, is the first step towards query understanding. In this study, we explore the effectiveness of crowdsourcing for this task. Through carefully designed control experiments and Inter Annotator Agreement metrics for analysis of experimental data, we show that crowdsourcing may not be a suitable approach for query segmentation because the crowd seems to have a very strong bias towards dividing the query into roughly equal (often only two) parts. Similarly, in the case of hierarchical or nested segmentation, turkers have a strong preference towards balanced binary trees.

4 0.62967092 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

Author: Ruey-Cheng Chen

Abstract: We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. Our analysis shows that its objective function can be efficiently approximated using the negative empirical pointwise mutual information. The proposed extension improves the baseline performance in both efficiency and accuracy on a standard benchmark.

5 0.56308806 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu

Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves sig- nificant improvement on a series of testing sets from different domains, even with a single classifier and local features.

6 0.50483376 97 acl-2013-Cross-lingual Projections between Languages from Different Families

7 0.49645755 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

8 0.48896697 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

9 0.45230713 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

10 0.44868481 381 acl-2013-Variable Bit Quantisation for LSH

11 0.44776785 136 acl-2013-Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text

12 0.43551299 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic

13 0.43202049 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

14 0.42890579 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

15 0.42427471 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

16 0.4234384 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

17 0.42295641 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

18 0.42196861 80 acl-2013-Chinese Parsing Exploiting Characters

19 0.41997752 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

20 0.39571327 89 acl-2013-Computerized Analysis of a Verbal Fluency Test


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (6, 0.016), (11, 0.075), (24, 0.517), (26, 0.037), (28, 0.012), (35, 0.052), (42, 0.02), (48, 0.033), (70, 0.029), (88, 0.027), (90, 0.025), (95, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97733682 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. [Full paper body spilled into this entry by extraction; omitted.]

2 0.95173043 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics and better adaptability and stability performance.

3 0.94988996 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources all within a single interface.

4 0.93233013 184 acl-2013-Identification of Speakers in Novels

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

Abstract: Speaker identification is the task of attributing utterances to characters in a literary narrative. It is challenging to automate because the speakers of the majority of utterances are not explicitly identified in novels. In this paper, we present a supervised machine learning approach for the task that incorporates several novel features. The experimental results show that our method is more accurate and general than previous approaches to the problem.

same-paper 5 0.92390436 128 acl-2013-Does Korean defeat phonotactic word segmentation?

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

6 0.92009145 229 acl-2013-Leveraging Synthetic Discourse Data via Multi-task Learning for Implicit Discourse Relation Recognition

7 0.86456054 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization

8 0.81728268 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework

9 0.68473464 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

10 0.62493718 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

11 0.60937667 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

12 0.59498841 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance

13 0.57177377 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems

14 0.57097244 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

15 0.56793582 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

16 0.5647279 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

17 0.56092948 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

18 0.56009823 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

19 0.55483663 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis

20 0.55147046 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction