acl acl2011 acl2011-140 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Hewlett ; Paul Cohen
Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
Reference: text
sentIndex sentText sentNum sentScore
1 edu , Abstract Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. [sent-3, score-1.119]
2 However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. [sent-4, score-0.368]
3 Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. [sent-5, score-0.351]
4 We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL. [sent-6, score-0.473]
5 1 Introduction The goal of unsupervised word segmentation is to discover correct word boundaries in natural language corpora where explicit boundaries are absent. [sent-7, score-0.669]
6 Often, unsupervised word segmentation algorithms rely heavily on parameterization to produce the correct segmentation for a given language. [sent-8, score-0.886]
7 The goal of fully unsupervised word segmentation, then, is to recover the correct boundaries for arbitrary natural language corpora without explicit human parameterization. [sent-9, score-0.269]
8 This means that a fully unsupervised algorithm would have to set its own parameters based only on the corpus provided to it. [sent-10, score-0.215]
9 In principle, this goal can be achieved by creating a function that measures the quality of a segmentation in a language-independent way, and applying this function to all possible segmentations of 540 the corpora to select the best one. [sent-11, score-0.653]
10 Evidence from the word segmentation literature suggests that description length provides a good approximation to this segmentation quality function. [sent-12, score-0.982]
11 We discuss the Minimum Description Length (MDL) principle in more detail in the next section. [sent-13, score-0.063]
12 Unfortunately, evaluating all possible segmentations is intractable, since a cor- pus of length n has 2n−1 possible segmentations. [sent-14, score-0.324]
13 As a result, MDL methods have to rely on an efficient algorithm to generate a relatively small number of candidate segmentations to choose between. [sent-15, score-0.395]
14 It is an empirical question which algorithm will generate the most effective set of candidate segmentations. [sent-16, score-0.153]
15 In this work, we compare a variety of unsupervised word segmentation algorithms operating in conjunction with MDL for fully unsupervised segmentation, and find that the Bootstrapped Voting Experts (BVE) algorithm generally achieves the best performance. [sent-17, score-0.755]
16 2 Minimum Description Length At a formal level, a segmentation algorithm is a function SEGMENT(c, θ) that maps a corpus c and a vector of parameters θ ∈ Θ to one of the possible segmentations of th∈at corpus. [sent-18, score-0.662]
17 The goal of fully unsupervised segmentation is to reduce SEGMENT(c, θ) to SEGMENT(c) by removing the need for a human to specify a particular θ. [sent-19, score-0.479]
18 One way to achieve this goal is to generate a set of candidate segmentations by evaluating the algorithm for multiple values of θ, and then choose the segmentation that minimizes some cost function. [sent-20, score-0.776]
19 i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 540–545, Now, selecting the best segmentation is treated as a model selection problem, where each segmentation provides a different model of the corpus. [sent-23, score-0.716]
20 Intuitively, a general approach is to choose the simplest model that explains the data, a principle known as Occam’s Razor. [sent-24, score-0.063]
21 In information theory, this intuitive principle of simplicity or parsimony has been formalized as the Minimum Description Length (MDL) principle, which states that the most likely model of the data is the one that requires the fewest bits to encode (Rissanen, 1983). [sent-25, score-0.119]
22 The number of bits required to represent a model is called its description length. [sent-26, score-0.24]
23 Previous work applying the MDL principle to segmentation (Yu, 2000; Argamon et al. [sent-27, score-0.405]
24 , 2010) is motivated by the observation that every segmentation of a corpus implicitly defines a lexicon, or set of words. [sent-29, score-0.376]
25 More formally, the segmented corpus S is a list of words s1s2 . [sent-30, score-0.034]
26 L(S), the lexicon implicitly defined by S, is simply the set of unique words in S. [sent-34, score-0.103]
27 The description length of S can then be broken into two components, the description length of the lexicon and the description length of the corpus given the lexicon. [sent-35, score-0.935]
28 If we consider S as being generated by sampling words from a probability distribution over words in the lexicon, the number of bits required to represent each word si in S is simply its surprisal, log P(si). [sent-36, score-0.123]
29 The information cost of the corpus agl,iv −enl otgheP l(exsicon is then computed by summing the surprisal of each word si in the corpus: − CODE(S|L(S)) = −XiN=1logP(si) (2) To properly compute the description length of the segmentation, we must also include the cost of the lexicon. [sent-37, score-0.548]
30 Adding in the description length of the lexicon forces a trade-off between the lexicon size and the size of the compressed corpus. [sent-38, score-0.51]
31 For purposes of the description length calculation, the lexicon is simply treated as a separate corpus consisting of characters rather than words. [sent-39, score-0.471]
32 The description length can then be computed in the usual manner, by summing the surprisal of each character in each word in the lexicon: CODE(L(S)) = −Xw∈L(S)Xk∈wlogP(k) (3) where k ∈ w refers to the characters in word w inw thheer ele kxic ∈on w. [sent-40, score-0.466]
33 , 2010), an additional term is needed for the information required to encode the parameters of the lexicon model. [sent-43, score-0.103]
34 This quantity is normally estimated 541 by (k/2) log n, where k is the degrees of freedom in the model and n is the length of the data (Rissanen, 1983). [sent-44, score-0.142]
35 Substituting the appropriate values for the lexicon model yields: |L(S)2| − 1∗ logN (4) The full description length calculation is simply the sum of three terms shown in 2, 3, and 4. [sent-45, score-0.395]
36 From this definition, it follows that a low description length will be achieved by a segmentation that defines a small lexicon, which nonetheless reduces the corpus to a short series of mostly high-frequency words. [sent-46, score-0.677]
37 3 Generating Candidate Segmentations Recent unsupervised MDL algorithms rely on heuristic methods to generate candidate segmentations. [sent-47, score-0.311]
38 (2010) present an algorithm called EntropyMDL that generates a candidate segmentation based on branching entropy, and then iteratively refines the segmentation in an attempt to greedily minimize description length. [sent-50, score-1.163]
39 We selected three entropy-based algorithms for generating candidate segmentations, because such algorithms do not depend on the details of any particular language. [sent-51, score-0.292]
40 By “unsupervised,” we mean operating on a single unbroken sequence of characters without any boundary information; Excluded from consideration are a class of algorithms that are semisupervised because they require sentence boundaries to be provided. [sent-52, score-0.368]
41 Such algorithms include MBDP-1 (Brent, 1999), HDP (Goldwater et al. [sent-53, score-0.105]
42 Harris noticed that if one proceeds incrementally through a sequence of phonemes and asks speakers of the language to count the letters that could appear next in the sequence (today called the successor count), the points where the number increases often correspond to morpheme boundaries. [sent-57, score-0.106]
43 Tanaka-Ishii and Jin correctly recognized that this idea was an early version of branching entropy, given by HB (seq) = −Pc∈S P(c|seq) log P(c|seq), where S is the set o−f Psuccessors stoe seq. [sent-58, score-0.198]
44 They designed ethreei Sr P istM th algoritPhm based on branching entropy in both directions, and it was able to achieve scores near the state of the art on word segmentation in phonetically-encoded English and Chinese. [sent-59, score-0.576]
45 PtM posits a boundary whenever the increase in the branching entropy exceeds a threshold. [sent-60, score-0.337]
46 This threshold provides an adjustable parameter for PtM, which we exploit to generate 41 candidate segmentations by trying every threshold in the range [0. [sent-61, score-0.469]
47 2 Voting Experts The Voting Experts (VE) algorithm (Cohen and Adams, 2001) is based on the premise that words may be identified by an information theoretic signature: Entropy within a word is relatively low, entropy at word boundaries is relatively high. [sent-66, score-0.207]
48 The name Voting Experts refers to the “experts” that vote on possible boundary locations. [sent-67, score-0.105]
49 VE has two experts: One votes to place boundaries after sequences that have low internal entropy (surprisal), given by HI(seq) = −log P(seq), the other votes after sequences t h=at −haloveg high branching entropy. [sent-68, score-0.502]
50 fAtellr sequences are evaluated locally, within a sliding window, so the algorithm is very efficient. [sent-69, score-0.075]
51 A boundary is generated whenever the vote total at a given location exceeds a threshold, and in some cases only if the vote total is a local maximum. [sent-70, score-0.229]
52 We generated a set of 104 segmentations by trying every viable threshold and local max setting for each window size between 2 and 9. [sent-73, score-0.324]
53 3 Bootstrapped Voting Experts The Bootstrapped Voting Experts (BVE) algorithm (Hewlett and Cohen, 2009) is an extension to VE. [sent-75, score-0.044]
54 BVE works by segmenting the corpus repeatedly, with each new segmentation incorporating knowledge gained from previous segmentations. [sent-76, score-0.43]
55 As with many bootstrapping methods, three essential components are required: some initial seed knowledge, a way to represent knowledge, and a way to lever542 age that knowledge to improve future performance. [sent-77, score-0.036]
56 For BVE, the seed knowledge consists of a highprecision segmentation generated by VE. [sent-78, score-0.378]
57 After this seed segmentation, BVE segments the corpus repeatedly, lowering the vote threshold with each iteration. [sent-79, score-0.176]
58 Knowledge gained from prior segmentations is represented in a data structure called the knowledge trie. [sent-80, score-0.27]
59 During voting, this knowledge trie provides statistics for a third expert that places votes in contexts where boundaries were most frequently observed during the previous iteration. [sent-81, score-0.289]
60 Each iteration of BVE provides a candidate segmentation, and executing BVE for window sizes 2-8 and both local max settings generated a total of 126 segmentations. [sent-82, score-0.153]
61 4 Experiments There are two ways to evaluate the quality of a seg- mentation algorithm in the MDL framework. [sent-83, score-0.044]
62 The first is to directly measure the quantity of the segmentation chosen by MDL. [sent-84, score-0.373]
63 The second is to compare the minimal description length among the candidates to the true description length of the corpus. [sent-86, score-0.532]
64 1 Results We chose a diverse set of natural language corpora, including some widely-used corpora to facilitate comparison. [sent-88, score-0.034]
65 For each corpus, we generated a set of candidate segmentations with PtM, VE, and BVE, as described in the previous section. [sent-89, score-0.324]
66 From each set of candidates, results for the segmentation with minimal description length are presented in the tables below. [sent-90, score-0.608]
67 Where possible, results for other algorithms are presented in italics, with semi-supervised algorithms set apart. [sent-91, score-0.21]
68 Source code for all algorithms evaluated here, as well as data files for all corpora, are available online1 . [sent-92, score-0.162]
69 One of the most commonly-used benchmark cor- pora for unsupervised word segmentation is the BR87 corpus. [sent-93, score-0.439]
70 This corpus is a phonemic encoding of the Bernstein Ratner corpus (Bernstein Ratner, 1987) from the CHILDES database of childdirected speech (MacWhinney, 2000). [sent-94, score-0.068]
71 com/p /vot ing-expert s mance of the algorithms on BR87 is shown in Table 1 below. [sent-97, score-0.105]
72 As with all experiments in this work, the input was presented as one continuous sequence of characters with no word or sentence boundaries. [sent-98, score-0.068]
73 Published results for two unsupervised algorithms, the MDL-based algorithm of Yu (2000) and the EntropyMDL (EMDL) algorithm of Zhikov et al. [sent-99, score-0.185]
74 (2010), on this widely-used benchmark corpus are shown in italics. [sent-100, score-0.034]
75 These algorithms operate on a version of the corpus that includes sentence boundaries. [sent-102, score-0.139]
76 Results for one corpus, the first 50,000 characters of George Orwell’s 1984, have been reported in nearly every VE-related paper. [sent-116, score-0.068]
77 Table 2 shows the results for candidate algorithms as well as the two other VE-derived algorithms, HVE-3E and ME. [sent-118, score-0.187]
78 N 548R509143 Table 2: Results for the first 50,000 characters of 1984. [sent-125, score-0.068]
79 Because of this, these languages provide an excellent real-world challenge for unsupervised segmentation. [sent-127, score-0.125]
80 The word boundaries specified in the Chinese Gigaword Corpus were used as a gold standard. [sent-129, score-0.098]
81 Table 4 shows results for a roughly 100,000 word subset of a corpus of Thai novels written in the Thai script, taken from a recent Thai word segmentation competition, InterBEST 2009. [sent-130, score-0.376]
82 6 4864 38 Table 3: Results for a corpus of orthographic Chinese. [sent-141, score-0.065]
83 8670370 1 Table 4: Results for a corpus of orthographic Thai. [sent-148, score-0.065]
84 2 Description Length Table 6 shows the best description length achieved by each algorithm for each of the test corpora. [sent-167, score-0.345]
85 In most cases, BVE compressed the corpus more than VE, which in turn achieved better compression than PtM. [sent-168, score-0.136]
86 In Chinese, the two VE-algorithms were able to compress the corpus beyond the gold standard size, which may mean that these algorithms are sometimes finding repeated units larger than words, such as phrases. [sent-169, score-0.139]
87 ,1231 e 6 Table 6: Best description length achieved by each algorithm compared to the actual description length of the corpus. [sent-175, score-0.611]
88 5 Related Work The algorithms described in Section 3 are all rela- tively recent algorithms based on entropy. [sent-176, score-0.21]
89 Many algorithms for computational morphology make use of concepts similar to branching entropy, such as successor count. [sent-177, score-0.316]
90 The HubMorph algorithm (Johnson and Martin, 2003) adds all known words to a trie and then performs DFA minimization (Hopcroft and Ullman, 1979) to convert the trie to a finite state machine. [sent-178, score-0.204]
91 In this DFA, it searches for sequences of states (stretched hubs) with low branching factor internally and high branching factor at the boundaries, which is analogous to the chunk signature that drives VE and BVE, as well as the role of branching entropy in PtM. [sent-179, score-0.682]
92 MDL is analogous to Bayesian inference, where the information cost of the model CODE(M) acts as the prior distribution over models P(M), and CODE(D|M), the information cost of the data given the model, Mac)t,s t as ti nhefo lrikmealti hoonod co fsutn ocfti thoen P(D|M). [sent-180, score-0.122]
93 Thus, Bayesian sw thoerd l segmentation mioneth Po(dDs may be considered related as well. [sent-181, score-0.342]
94 Recently, Bayesian methods with more sophisticated language models have been developed, including one that models language generation as a hierarchical Dirichlet process (HDP), in order to incorporate the effects of syntax into word segmentation (Goldwater et al. [sent-184, score-0.37]
95 Another recent algorithm, WordEnds, generalizes information about the distribution of characters near 544 word boundaries to improve segmentation (Fleck, 2008), which is analogous to the role of the knowledge trie in BVE. [sent-186, score-0.632]
96 6 Discussion For the five corpora tested above, BVE achieved the best performance in conjunction with MDL, and also achieved the lowest description length. [sent-187, score-0.288]
97 We have shown that the combination of BVE and MDL provides an effective approach to unsupervised word segmentation, and that it can equal or surpass semisupervised algorithms such as MBDP-1, HDP, and WordEnds in some cases. [sent-188, score-0.259]
98 One area for future work is a full investigation of the performance of these algorithms in polysynthetic languages such as Inuktitut, where each word contains many morphemes. [sent-190, score-0.133]
99 It is likely that in such languages, the algorithms will find morphs rather than words. [sent-191, score-0.105]
100 An algorithm for segmenting categorical time series into meaningful episodes. [sent-213, score-0.07]
wordName wordTfidf (topN-words)
[('mdl', 0.373), ('segmentation', 0.342), ('bve', 0.329), ('segmentations', 0.242), ('description', 0.184), ('experts', 0.176), ('voting', 0.173), ('branching', 0.169), ('algorithmbpbrbfwpwrwf', 0.144), ('zhikov', 0.144), ('seq', 0.117), ('hewlett', 0.115), ('ptm', 0.115), ('wordends', 0.115), ('thai', 0.11), ('algorithms', 0.105), ('lexicon', 0.103), ('surprisal', 0.101), ('boundaries', 0.098), ('unsupervised', 0.097), ('hdp', 0.093), ('ve', 0.092), ('fleck', 0.086), ('candidate', 0.082), ('length', 0.082), ('trie', 0.08), ('phoneme', 0.08), ('cohen', 0.079), ('bernstein', 0.076), ('ratner', 0.076), ('segment', 0.074), ('bootstrapped', 0.07), ('characters', 0.068), ('brent', 0.066), ('entropy', 0.065), ('morpheme', 0.064), ('principle', 0.063), ('vote', 0.063), ('goldwater', 0.059), ('entropymdl', 0.057), ('pbvtevme', 0.057), ('rissanen', 0.057), ('code', 0.057), ('bits', 0.056), ('votes', 0.054), ('bayesian', 0.053), ('childes', 0.051), ('hopcroft', 0.051), ('chinese', 0.048), ('godfrey', 0.047), ('dfa', 0.047), ('algorithm', 0.044), ('bf', 0.044), ('analogous', 0.044), ('threshold', 0.043), ('boundary', 0.042), ('successor', 0.042), ('erlbaum', 0.042), ('minimum', 0.041), ('fully', 0.04), ('switchboard', 0.04), ('harris', 0.04), ('cost', 0.039), ('window', 0.039), ('argamon', 0.038), ('compressed', 0.038), ('wf', 0.038), ('si', 0.038), ('yu', 0.037), ('seed', 0.036), ('achieved', 0.035), ('exceeds', 0.035), ('signature', 0.035), ('jin', 0.035), ('dl', 0.035), ('gigaword', 0.035), ('corpus', 0.034), ('corpora', 0.034), ('cheng', 0.033), ('provides', 0.032), ('summing', 0.031), ('orthographic', 0.031), ('quantity', 0.031), ('sequences', 0.031), ('operating', 0.03), ('sharon', 0.029), ('compression', 0.029), ('miller', 0.029), ('log', 0.029), ('repeatedly', 0.028), ('paul', 0.028), ('hierarchical', 0.028), ('gained', 0.028), ('languages', 0.028), ('generate', 0.027), ('segmenting', 0.026), ('whenever', 0.026), ('calculation', 0.026), ('semisupervised', 0.025), ('expert', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
Author: Daniel Hewlett ; Paul Cohen
Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
2 0.20297216 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao
Abstract: In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations. 1
3 0.19004212 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
Author: Jason Naradowsky ; Kristina Toutanova
Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.
4 0.15580805 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Author: Weiwei Sun
Abstract: The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based segmenter, one character-based segmenter and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is effi- cient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.
5 0.1230177 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow
Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.
6 0.12143121 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
7 0.10748096 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
8 0.09447585 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
9 0.086557664 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
10 0.077600472 182 acl-2011-Joint Annotation of Search Queries
11 0.072429046 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
12 0.070467919 238 acl-2011-P11-2093 k2opt.pdf
13 0.06650953 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
14 0.064844258 66 acl-2011-Chinese sentence segmentation as comma classification
15 0.06442561 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
16 0.061059382 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
17 0.060573086 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
18 0.059832543 135 acl-2011-Faster and Smaller N-Gram Language Models
19 0.059287321 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results
topicId topicWeight
[(0, 0.153), (1, -0.028), (2, -0.009), (3, 0.012), (4, -0.058), (5, -0.006), (6, 0.068), (7, 0.003), (8, 0.028), (9, 0.111), (10, 0.022), (11, 0.09), (12, -0.071), (13, 0.067), (14, -0.02), (15, -0.02), (16, 0.013), (17, -0.082), (18, 0.174), (19, 0.215), (20, 0.141), (21, 0.004), (22, -0.062), (23, -0.013), (24, 0.048), (25, 0.011), (26, 0.031), (27, 0.105), (28, 0.002), (29, 0.033), (30, 0.01), (31, 0.027), (32, 0.022), (33, 0.094), (34, -0.055), (35, 0.002), (36, 0.032), (37, 0.109), (38, 0.007), (39, -0.033), (40, 0.041), (41, 0.025), (42, -0.105), (43, 0.035), (44, -0.021), (45, 0.041), (46, 0.004), (47, -0.028), (48, -0.042), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.95804757 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
Author: Daniel Hewlett ; Paul Cohen
Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
2 0.79590344 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Author: Weiwei Sun
Abstract: The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based segmenter, one character-based segmenter and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is effi- cient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.
3 0.70664644 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent 1405 opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. Without a doubt, there is complete agree- manager’ NN ,,ll JJf NNf 䉂 擌奒 Figure 1: Example of a word with internal structure. ment on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more 1 1Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s405–1414, consistent and there could be less duplication of efforts in developing the expensive annotated corpus. The second reason is applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the later finer-grained output is more preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed ofthree coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’ . Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’ . With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2). (a) NN (b) ???HHH JJ NNf ???HHH JJf JJf JJf 㦌 撥 怂 惆 VV ???HHH VV NNf ZZ VVf VVf 扩 扙 堑 Figure 2: Example of telescopic compound (a) and separable word (b). The last reason why we should care about word the context will always make it clear what is being referred to with the term “word”. 1406 structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-ofvocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
4 0.69250268 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao
Abstract: In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations. 1
5 0.6874094 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow
Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.
6 0.58005738 66 acl-2011-Chinese sentence segmentation as comma classification
7 0.57938421 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
8 0.55043048 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
9 0.54136628 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
10 0.51898688 238 acl-2011-P11-2093 k2opt.pdf
11 0.48914209 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
12 0.48860347 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
13 0.45391425 74 acl-2011-Combining Indicators of Allophony
14 0.42838904 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision
15 0.42323828 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
16 0.42221493 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
17 0.39973065 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
18 0.39547381 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
19 0.37508789 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
20 0.36794209 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
topicId topicWeight
[(5, 0.019), (17, 0.03), (26, 0.012), (31, 0.016), (37, 0.05), (39, 0.035), (41, 0.05), (55, 0.015), (59, 0.034), (72, 0.024), (91, 0.503), (96, 0.119)]
simIndex simValue paperId paperTitle
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training. 1
same-paper 2 0.86710823 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
Author: Daniel Hewlett ; Paul Cohen
Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
3 0.85721964 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping
Author: Tetsuo Kiso ; Masashi Shimbo ; Mamoru Komachi ; Yuji Matsumoto
Abstract: In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select seeds and create a stop list using the rankings of instances and patterns computed by Kleinberg’s HITS algorithm. Experimental results on a variation of the lexical sample task show the effectiveness of our method.
4 0.82561511 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
5 0.76907617 313 acl-2011-Two Easy Improvements to Lexical Weighting
Author: David Chiang ; Steve DeNeefe ; Michael Pust
Abstract: We introduce two simple improvements to the lexical weighting features of Koehn, Och, and Marcu (2003) for machine translation: one which smooths the probability of translating word f to word e by simplifying English morphology, and one which conditions it on the kind of training data that f and e co-occurred in. These new variations lead to improvements of up to +0.8 BLEU, with an average improvement of +0.6 BLEU across two language pairs, two genres, and two translation systems.
6 0.66471952 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
7 0.57248724 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
8 0.55370754 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
9 0.54965937 239 acl-2011-P11-5002 k2opt.pdf
10 0.54803908 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
11 0.50920051 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD
12 0.5038237 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
13 0.50209838 200 acl-2011-Learning Dependency-Based Compositional Semantics
14 0.50047451 177 acl-2011-Interactive Group Suggesting for Twitter
15 0.48859575 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
16 0.47225958 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
17 0.45809892 117 acl-2011-Entity Set Expansion using Topic information
18 0.45718625 74 acl-2011-Combining Indicators of Allophony
20 0.45353323 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis