acl acl2012 acl2012-210 knowledge-graph by maker-knowledge-mining

210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese


Source: pdf

Author: Pierre Magistry ; Benoit Sagot

Abstract: In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Paris 7, 175 rue du Chevaleret, 75013 Paris, France pierre. [sent-2, score-0.112]

2 Abstract In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. [sent-4, score-0.648]

3 This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011). [sent-7, score-0.066]

4 Supervized segmentation systems exist but rely on manually segmented corpora, which are often specific to a genre or a domain and use many different segmentation guidelines. [sent-11, score-0.59]

5 In order to deal with a larger variety of genres and domains, or to tackle more theoretical questions about linguistic units, unsupervized segmentation remains an important issue. [sent-12, score-0.681]

6 After a short review of the corresponding literature in Section 2, we discuss the challenging issue of evaluating unsupervized word segmentation systems in Section 3. [sent-13, score-0.648]

7 Paris 7, 175 rue du Chevaleret, 75013 Paris, France benoit. [sent-17, score-0.112]

8 2 State of the Art Unsupervized word segmentation systems tend to make use of three different types of information: the cohesion of the resulting units (e.g. [sent-19, score-0.380]

9 , Mutual Information, as in (Sproat and Shih, 1990)), the degree of separation between the resulting units (e.g. [sent-21, score-0.157]

10 , 2004)) and the probability of a segmentation given a string (Goldwater et al. [sent-24, score-0.235]

11 This method combines cohesion and separation measures in a “goodness” metric that is maximized during an iterative process. [sent-29, score-0.237]

12 This work is the current state-of-the-art in unsupervized segmentation of Mandarin Chinese data. [sent-30, score-0.648]

13 The main drawbacks of ESA are the need to iterate the process on the corpus around 10 times to reach good performance levels and the need to set a parameter that balances the impact of the cohesion measure w.r.t. [sent-31, score-0.278]

14 Empirically, a correlation is found between the parameter and the size of the corpus, but this correlation depends on the script used in the corpus (it changes depending on whether Latin letters and Arabic numbers are taken into account during preprocessing). [sent-35, score-0.173]

15 Moreover, computing this correlation and finding the best value for the parameter (i.e. [sent-36, score-0.031]

16 , what the authors call the proper exponent) requires a manually segmented training corpus. [sent-38, score-0.046]

17 Therefore, this proper exponent may not be easily available in all situations. [sent-39, score-0.051]

18 However, if we only consider their experiments using settings similar to ours, their results consistently lie around an f-score of 0. [sent-40, score-0.123]

19 An older approach, introduced by Jin and Tanaka-Ishii (2006), solely relies on a separation measure [sent-42, score-0.172]

20 that is directly inspired by a linguistic hypothesis formulated by Harris (1955). [sent-44, score-0.076]

21 Therefore the variation of the branching entropy (VBE) should be negative. [sent-46, score-0.423]

22 Following this hypothesis, Jin and Tanaka-Ishii (2006) propose a system that segments when BE is rising or when it reaches a certain maximum. [sent-48, score-0.121]
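
Branching entropy is straightforward to estimate from substring counts. The sketch below is our own illustration (not code from the paper; all names are hypothetical): it estimates the right branching entropy of substrings of a toy corpus and flags positions where BE rises, the boundary cue of entry 22.

import math
from collections import Counter, defaultdict

def right_branching_entropy(corpus, max_len=4):
    # h->(x) = entropy of the character following x, for substrings up to max_len.
    followers = defaultdict(Counter)
    for sent in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n):
                followers[sent[i:i + n]][sent[i + n]] += 1
    entropy = {}
    for x, counts in followers.items():
        total = sum(counts.values())
        entropy[x] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy

h = right_branching_entropy(["abcabdabe"] * 3)
sent = "abcabd"
for i in range(2, len(sent) + 1):
    prev, cur = sent[:i - 1], sent[:i]
    if h.get(cur, 0.0) > h.get(prev, 0.0):  # BE rises: hypothesize a boundary
        print("candidate boundary after", cur)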

23 The main drawback of Jin and Tanaka-Ishii's (2006) model is that segmentation decisions are taken very locally and do not depend on neighboring cuts. [sent-49, score-0.235]

24 In theory, we could expect a decreasing BE (BE ≥ 0) and [sent-51, score-0.081]

25 look for a less decreasing value (or on the contrary, rising at least to some extent). [sent-52, score-0.087]

26 Finally, Jin and Tanaka-Ishii do not take into account that the VBE of an n-gram may not be directly comparable to the VBE of an m-gram if m ≠ n. [sent-54, score-0.070]

27 In this paper we will show that we can correct the drawbacks of Jin and Tanaka-Ishii's (2006) model and reach performances comparable to those of Wang et al. [sent-62, score-0.176]

28 3 Evaluation In this paper, in order to be comparable with Wang et al. [sent-64, score-0.038]

29 (2011), we evaluate our system against the corpora from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005). [sent-65, score-0.031]

30 These corpora cover 4 different segmentation guidelines from various origins: Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR) and Peking University (PKU). [sent-66, score-0.358]

31 Evaluating unsupervized systems is a challenge in itself. [sent-68, score-0.413]

32 As an agreement on the exact definition of what a word is remains hard to reach, various segmentation guidelines have been proposed and followed for the annotation of different corpora. [sent-69, score-0.327]

33 The evaluation of supervized systems can be achieved on any corpus using any guidelines: when trained on data that follows particular guidelines, the resulting system will follow these guidelines as well as possible, and can be evaluated on data annotated accordingly. [sent-70, score-0.190]

34 However, for unsupervized systems, there is no reason why a system should be closer to one reference than another, or why it should not lie somewhere in between the different existing guidelines. [sent-71, score-0.454]

35 Huang and Zhao (2007) propose to use cross-training of a supervized segmentation system in order to estimate the consistency between different segmentation guidelines, and therefore an upper bound on what can be expected from an unsupervized system (Zhao and Kit, 2008). [sent-72, score-1.178]
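
Concretely, the topline protocol trains a supervized segmenter on each corpus and scores its output on every other corpus, then averages. A minimal sketch of the bookkeeping (train, segment and evaluate stand in for an off-the-shelf system such as ZPAR and a standard scorer; the names are ours, not an existing API):

import itertools

def cross_consistency(data, train, segment, evaluate):
    # data maps corpus names (e.g. "AS", "CITYU", "MSR", "PKU") to gold-segmented text.
    scores = []
    for src, tgt in itertools.permutations(data, 2):
        model = train(data[src])  # train on one guideline
        scores.append(evaluate(segment(model, data[tgt]), data[tgt]))  # score on another
    return sum(scores) / len(scores)  # average cross-corpus f-score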

36 The average consistency is found to be as low as 0. [sent-73, score-0.072]

37 Therefore this figure can be considered as a sensible topline for unsupervized systems. [sent-75, score-0.555]

38 The standard baseline, which consists in segmenting each character, leads to a score around 0. [sent-76, score-0.041]

39 35 (f-score): almost half of the tokens in a manually segmented corpus are unigrams. [sent-77, score-0.046]

40 Per word-length evaluation is also important as units of various lengths tend to have different distributions. [sent-78, score-0.099]
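
Word-level precision, recall and f-score can be computed by matching character spans, which also makes the per-length breakdown immediate. A minimal sketch (helper names are ours, not the bake-off scoring script):

def spans(words):
    # Turn a word sequence into the set of (start, end) character spans it induces.
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold, pred):
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def recall_for_length(gold, pred, k):
    # Recall restricted to gold words of length k (the per-word-length view).
    g = {s for s in spans(gold) if s[1] - s[0] == k}
    return len(g & spans(pred)) / len(g) if g else 0.0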

41 We used ZPAR (Zhang and Clark, 2010) on the four corpora from the Second Bakeoff to reproduce Huang and Zhao's (2007) experiments, but also to measure cross-corpus consistency at a per-word-length level. [sent-79, score-0.144]

42 Our overall results are comparable to what Huang and Zhao (2007) report. [sent-80, score-0.038]

43 However, the consistency falls quickly for longer words: on unigrams, f-scores range from 0. [sent-81, score-0.072]

44 In a segmented Chinese text, most of the tokens are uni- and bigrams, but most of the types are bi- and trigrams (as unigrams are often high-frequency grammatical words and trigrams the result of more or less productive affixations). [sent-89, score-0.307]

45 Therefore the results of evaluations based only on tokens do not suffer much from poor performances on trigrams, even if a large part of the lexicon may be incorrectly processed. [sent-90, score-0.104]

46 Another issue in the evaluation and comparison of unsupervized systems is remaining fair in terms of preprocessing and prior knowledge given to the systems. [sent-91, score-0.460]

47 (2011) used different levels of preprocessing (which they call “settings”). [sent-93, score-0.047]

48 (2011) try not to rely on punctuation and character encoding information (such as distinguishing Latin and Chinese characters). [sent-95, score-0.043]

49 We therefore consider that their system does take into account the level of processing which is performed on Latin characters and Arabic numbers, and therefore "knows" whether to expect such characters or not. [sent-97, score-0.147]

50 In setting 3 they add the knowledge of punctuation as clear boundaries, and in setting 4 they preprocess Arabic and Latin and obtain better, more consistent and less questionable results. [sent-98, score-0.071]

51 As we are more interested in reducing the amount of human labor needed than in achieving by all means fully unsupervized learning, we do not refrain from performing basic and straightforward preprocessing such as detection of punctuation marks, Latin characters and Arabic numbers. [sent-99, score-0.46]

52 Therefore, our experiments rely on settings similar to their settings 3 and 4, and are evaluated against the same corpora. [sent-100, score-0.125]

53 4 Normalized Variation of Branching Entropy (nVBE) Our system builds upon Harris's (1955) hypothesis and its reformulation by Kempe (1999) and Tanaka-Ishii (2005). [sent-101, score-0.127]

54 For an n-gram x0..n with a right context χ→, we define its Right Branching Entropy (RBE) as h→(x0..n) = H(χ→ | x0..n). [sent-114, score-0.032]

55 RBE (resp. LBE) is x0..n's Branching Entropy (BE) when reading from left to right (resp. from right to left). [sent-133, score-0.032]
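
Entries 54 and 55 lost their formulas to extraction; a plausible LaTeX reconstruction in the paper's notation (the exact formulation in the paper may differ in detail):

h_{\rightarrow}(x_{0..n}) = H(\chi_{\rightarrow} \mid x_{0..n})
  = -\sum_{c} P(c \mid x_{0..n}) \log P(c \mid x_{0..n})
  \quad \text{(RBE; the LBE } h_{\leftarrow} \text{ is defined symmetrically on left contexts)}

\delta h_{\rightarrow}(x_{0..n}) = h_{\rightarrow}(x_{0..n}) - h_{\rightarrow}(x_{0..n-1})
  \quad \text{(VBE: the change in BE as the string is extended by one character)}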

56 Simple regular expressions could also be considered to deal with unambiguous cases of numbers and dates in Chinese script. [sent-135, score-0.038]

57 The VBEs are not directly comparable for strings of different lengths and need to be normalized. [sent-157, score-0.081]

58 In this work, we recenter them around 0 with respect to the length of the string by subtracting the mean of the VBEs of the strings of the same length. [sent-158, score-0.041]

59 The normalized VBEs for the string x, or nVBEs, are then defined as follows (we only define ˜δh→(x) for clarity reasons; ˜δh←(x) is defined symmetrically): for each length k and each k-gram x such that len(x) = k, ˜δh→(x) = δh→(x) − µ→,k, where µ→,k is the mean of the values of δh→(x) over all k-grams x. [sent-160, score-0.038]

60 Note that we use and normalize the variation of branching entropy and not the branching entropy itself. [sent-161, score-0.772]

61 Doing so would break Harris's hypothesis, as we would not expect ˜h(x0..n) [sent-162, score-0.125]

62 Many studies use the branching entropy directly (normalized or not) and report results that are below state-of-the-art systems (Cohen et al. [sent-167, score-0.349]
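
Concretely, the normalization of entry 59 recenters the VBEs within each length class. A minimal sketch, assuming a dict vbe mapping each k-gram to its δh→ value (names are ours):

from collections import defaultdict

def normalize_vbe(vbe):
    # nVBE(x) = VBE(x) minus the mean VBE of all strings of the same length.
    by_len = defaultdict(list)
    for x, v in vbe.items():
        by_len[len(x)].append(v)
    mean = {k: sum(vs) / len(vs) for k, vs in by_len.items()}
    return {x: v - mean[len(x)] for x, v in vbe.items()}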

63 5 Decoding algorithm If we follow Harris's hypothesis and consider complex morphological word structures, we expect a large VBE at the boundaries of interesting units and more unstable variations inside “words”. [sent-169, score-0.252]

64 For different lengths of n-grams, we compared the distributions of the VBEs at different positions inside the n-gram and at its boundaries. [sent-171, score-0.043]

65 non-words, we observed that the VBE at both boundaries was the most discriminative value. [sent-173, score-0.071]

66 Therefore, we decided to take into account the VBE only at the word-candidate boundaries (left and right) and not to consider the inner values. [sent-174, score-0.103]

67 Second, the best segmentation can be computed using dynamic programming. [sent-176, score-0.235]

68 Since we consider the VBE only at word boundaries, we can define for any n-gram x its autonomy a(x) from the nVBEs at its two boundaries. The more an n-gram is autonomous, the more likely it is to be a word. [sent-177, score-0.063]

69 With this measure, we can redefine the sentence segmentation problem as the maximization of the autonomy measure of its words. [sent-178, score-0.374]

70 arg max_{W ∈ Seg(s)} ∑_{wi ∈ W} a(wi) · len(wi), where W is the segmentation corresponding to the sequence of words w0w1 . [sent-180, score-0.235]

71 wm, and len(wi) is the length of a word wi, used here to be able to compare segmentations resulting in a different number of words. [sent-183, score-0.076]

72 This best segmentation can be computed easily using dynamic programming. [sent-184, score-0.235]
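
A minimal sketch of this maximization, assuming an autonomy function a(·) defined for n-grams up to max_len (per entry 68, a(x) combines the nVBEs at the boundaries of x); the toy values below are hypothetical:

def segment(sent, autonomy, max_len=4):
    # Maximize sum(a(w) * len(w)) over all segmentations of sent.
    n = len(sent)
    best = [float("-inf")] * (n + 1)  # best[i]: best score over sent[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + autonomy(sent[j:i]) * (i - j)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:                      # recover the argmax by backtracking
        words.append(sent[back[i]:i])
        i = back[i]
    return list(reversed(words))

table = {"今天": 1.2, "天气": 1.0, "好": 0.3}  # hypothetical autonomy values
print(segment("今天天气好", lambda w: table.get(w, -1.0)))  # ['今天', '天气', '好']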

73 6 Results and discussion We tested our system against the data from the 4 corpora of the Second Bakeoff, in both settings 3 and 4, as described in Section 3. [sent-185, score-0.072]

74 , 2011), it does not require multiple iterations on the corpus and it does not rely on any parameters. [sent-189, score-0.043]

75 This shows that we can rely solely on a separation measure and get high segmentation scores. [sent-190, score-0.45]

76 When maximized over a sentence, this measure captures at least in part what can be modeled by a cohesion measure without the need for fine-tuning the balance between the two. [sent-191, score-0.218]

77 word length is consistent with the supervized cross-evaluation results of the various segmentation guidelines as performed in Section 3. [sent-195, score-0.517]

78 We can simply mention that the errors we observed are consistent with previous systems based on Harris's hypothesis (see (Magistry and Sagot, 2011) and Jin (2007) for a longer discussion). [sent-197, score-0.076]

79 Many errors are related to dates and Chinese numbers. [sent-198, score-0.038]

80 Other errors often involve frequent grammatical morphemes or productive affixes. [sent-200, score-0.089]

81 Indeed, unlike content words, grammatical morphemes belong to closed classes. [sent-202, score-0.072]

82 nVBE corresponds to our proposal, based on normalized VBE with maximization at word boundaries. [sent-215, score-0.073]

83 59 (see Section 3); therefore, introducing this linguistic knowledge into the system may be of great help without requiring too much human effort. [sent-226, score-0.033]

84 A sensible way to go in that direction would be to let an unsupervized system deal with open classes and process closed classes with a symbolic or supervized module. [sent-227, score-0.684]

85 However, PKU is more consistent in genre as it contains only articles from the People's Daily. [sent-230, score-0.031]

86 On the other hand, AS is a balanced corpus with a greater variety in many aspects. [sent-231, score-0.033]

87 The CITYU corpus is almost as small as PKU but contains articles from newspapers of various Mandarin-Chinese-speaking communities, where great variation is to be expected. [sent-232, score-0.074]

88 This suggests that the consistency of the input data is as important as the amount of data. [sent-233, score-0.072]

89 This hypothesis has to be confirmed in future studies. [sent-234, score-0.117]

90 An unsupervised algorithm for segmenting categorical timeseries into episodes. [sent-238, score-0.045]

91 In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, page 673–680. [sent-252, score-0.073]

92 Unsupervised segmentation of Chinese text by use of branching entropy. [sent-265, score-0.48]

93 In Proceedings of the COLING/ACL on Main conference poster sessions, page 428–435. [sent-266, score-0.073]

94 In Workshop of EACL in Computational Natural Language Learning, page 7–13. [sent-276, score-0.073]

95 Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. [sent-285, score-0.28]

96 In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, page 100–108. [sent-286, score-0.073]

97 A statistical method for finding word boundaries in Chinese text. [sent-290, score-0.071]

98 A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. [sent-302, score-0.235]

99 In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, page 843–852. [sent-303, score-0.073]

100 An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. [sent-306, score-0.319]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unsupervized', 0.413), ('vbe', 0.413), ('branching', 0.245), ('segmentation', 0.235), ('supervized', 0.19), ('jin', 0.159), ('harris', 0.148), ('nvbe', 0.127), ('vbes', 0.127), ('chinese', 0.116), ('entropy', 0.104), ('separation', 0.101), ('latin', 0.097), ('pku', 0.097), ('zhao', 0.097), ('magistry', 0.095), ('topline', 0.095), ('guidelines', 0.092), ('mandarin', 0.089), ('cohesion', 0.089), ('kempe', 0.083), ('lbe', 0.083), ('wang', 0.081), ('hypothesis', 0.076), ('bakeoff', 0.076), ('variation', 0.074), ('page', 0.073), ('paris', 0.073), ('trigrams', 0.073), ('consistency', 0.072), ('boundaries', 0.071), ('esa', 0.071), ('len', 0.067), ('reach', 0.066), ('alpage', 0.063), ('autonomy', 0.063), ('chevaleret', 0.063), ('rue', 0.063), ('tanakaishii', 0.063), ('zhihui', 0.063), ('arabic', 0.062), ('kit', 0.061), ('units', 0.056), ('rising', 0.055), ('beno', 0.055), ('emerson', 0.055), ('rbe', 0.055), ('kumiko', 0.055), ('accessor', 0.055), ('mochihashi', 0.051), ('productive', 0.051), ('reformulation', 0.051), ('inria', 0.051), ('exponent', 0.051), ('france', 0.05), ('expect', 0.049), ('du', 0.049), ('cityu', 0.047), ('maximized', 0.047), ('sensible', 0.047), ('preprocessing', 0.047), ('segmented', 0.046), ('huang', 0.045), ('wi', 0.045), ('unsupervised', 0.045), ('sagot', 0.045), ('lengths', 0.043), ('rely', 0.043), ('sproat', 0.042), ('msr', 0.042), ('hai', 0.042), ('settings', 0.041), ('around', 0.041), ('measure', 0.041), ('lie', 0.041), ('pierre', 0.041), ('drawbacks', 0.041), ('confirmed', 0.041), ('goodness', 0.039), ('normalized', 0.038), ('comparable', 0.038), ('dates', 0.038), ('morphemes', 0.038), ('maximization', 0.035), ('closed', 0.034), ('therefore', 0.033), ('goldwater', 0.033), ('variety', 0.033), ('bigrams', 0.032), ('decreasing', 0.032), ('left', 0.032), ('account', 0.032), ('script', 0.032), ('unigrams', 0.032), ('corpora', 0.031), ('correlation', 0.031), ('genre', 0.031), ('segmentations', 0.031), ('performances', 0.031), ('solely', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

Author: Pierre Magistry ; Benoit Sagot

Abstract: In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

2 0.11896107 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

Author: Jun Hatori ; Takuya Matsuzaki ; Yusuke Miyao ; Jun'ichi Tsujii

Abstract: We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese. Based on an extension of the incremental joint model for POS tagging and dependency parsing (Hatori et al., 2011), we propose an efficient character-based decoding method that can combine features from state-of-the-art segmentation, POS tagging, and dependency parsing models. We also describe our method to align comparable states in the beam, and how we can combine features of different characteristics in our incremental framework. In experiments using the Chinese Treebank (CTB), we show that the accuracies of the three tasks can be improved significantly over the baseline models, particularly by 0.6% for POS tagging and 2.4% for dependency parsing. We also perform comparison experiments with the partially joint models.

3 0.10983062 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

Author: Xu Sun ; Houfeng Wang ; Wenjie Li

Abstract: We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling. As we know, training a word segmentation system on large-scale datasets is already costly. In our case, adding high dimensional new features will further slow down the training speed. To solve this problem, we propose a new training method, adaptive online gradient descent based on feature frequency information, for very fast online training of the parameters, even given large-scale datasets with high dimensional features. Compared with existing training methods, our training method is an order of magnitude faster in terms of training time, and can achieve equal or even higher accuracies. The proposed fast training method is a general-purpose optimization method, and it is not limited to the specific task discussed in this paper.

4 0.10194949 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

Author: Spence Green ; John DeNero

Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders. 1

5 0.090022512 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

Author: Weiwei Sun ; Xiaojun Wan

Abstract: We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. We empirically analyze the diversity between two representative corpora, i.e. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible. The analysis is further exploited to improve processing accuracy by (1) integrating systems that are respectively trained on heterogeneous annotations to reduce the approximation error, and (2) re-training models with high quality automatically converted data to reduce the estimation error. Evaluation on the CTB and PPD data shows that our novel model achieves a relative error reduction of 11% over the best reported result in the literature.

6 0.085414588 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

7 0.083582297 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

8 0.072067373 89 acl-2012-Exploring Deterministic Constraints: from a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation

9 0.069011807 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

10 0.066896901 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

11 0.066594906 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

12 0.065068975 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

13 0.064429842 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

14 0.062280238 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

15 0.061783034 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

16 0.058601491 140 acl-2012-Machine Translation without Words through Substring Alignment

17 0.056385346 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

18 0.054035872 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

19 0.053514868 134 acl-2012-Learning to Find Translations and Transliterations on the Web

20 0.052538343 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.159), (1, -0.003), (2, -0.04), (3, -0.027), (4, 0.005), (5, 0.123), (6, 0.05), (7, -0.125), (8, -0.034), (9, 0.024), (10, -0.097), (11, -0.01), (12, -0.002), (13, -0.051), (14, 0.011), (15, 0.019), (16, -0.008), (17, -0.007), (18, -0.01), (19, 0.133), (20, 0.002), (21, 0.019), (22, -0.023), (23, 0.014), (24, -0.12), (25, -0.019), (26, -0.036), (27, 0.02), (28, 0.093), (29, -0.195), (30, -0.063), (31, 0.083), (32, 0.03), (33, -0.111), (34, -0.014), (35, -0.026), (36, 0.078), (37, 0.005), (38, 0.021), (39, -0.12), (40, 0.028), (41, 0.061), (42, -0.147), (43, 0.076), (44, 0.153), (45, -0.058), (46, -0.006), (47, 0.084), (48, -0.064), (49, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94604939 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

Author: Pierre Magistry ; Benoit Sagot

Abstract: In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

2 0.70293403 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

Author: Xu Sun ; Houfeng Wang ; Wenjie Li

Abstract: We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling. As we know, training a word segmentation system on large-scale datasets is already costly. In our case, adding high dimensional new features will further slow down the training speed. To solve this problem, we propose a new training method, adaptive online gradient descent based on feature frequency information, for very fast online training of the parameters, even given large-scale datasets with high dimensional features. Compared with existing training methods, our training method is an order of magnitude faster in terms of training time, and can achieve equal or even higher accuracies. The proposed fast training method is a general-purpose optimization method, and it is not limited to the specific task discussed in this paper.

3 0.67136323 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii

Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

4 0.65540272 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters

Author: Wei-Jie Huang ; Chia-Ru Chou ; Yu-Lin Tzeng ; Chia-Ying Lee ; Chao-Lin Liu

Abstract: We demonstrate applications of psycholinguistic and sublexical information for learning Chinese characters. The knowledge about the grapheme-phoneme conversion (GPC) rules of languages has been shown to be highly correlated to the ability of reading alphabetic languages and Chinese. We build and will demo a game platform for strengthening the association of phonological components in Chinese characters with the pronunciations of the characters. Results of a preliminary evaluation of our games indicated significant improvement in learners’ response times in Chinese naming tasks. In addition, we construct a Webbased open system for teachers to prepare their own games to best meet their teaching goals. Techniques for decomposing Chinese characters and for comparing the similarity between Chinese characters were employed to recommend lists of Chinese characters for authoring the games. Evaluation of the authoring environment with 20 subjects showed that our system made the authoring of games more effective and efficient.

5 0.51690644 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

Author: Ning Xi ; Guangchao Tang ; Xinyu Dai ; Shujian Huang ; Jiajun Chen

Abstract: The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both alignment and translation rule induction steps in building a Chinese-English SMT system, which may suffer from a suboptimal problem that word segmentation better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use Chinese character as the basic unit for alignment, and then convert this alignment to conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance.

6 0.49382606 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

7 0.49151811 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

8 0.49082327 89 acl-2012-Exploring Deterministic Constraints: from a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation

9 0.48115718 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

10 0.46894628 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

11 0.4438737 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

12 0.40924501 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

13 0.38674003 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

14 0.3764599 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

15 0.37165058 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

16 0.34011999 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

17 0.32675174 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

18 0.31714138 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

19 0.30615637 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

20 0.30561551 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.03), (26, 0.059), (28, 0.049), (30, 0.031), (37, 0.027), (39, 0.044), (52, 0.011), (74, 0.038), (78, 0.313), (82, 0.021), (84, 0.048), (85, 0.024), (90, 0.103), (92, 0.047), (94, 0.023), (99, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69779742 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

Author: Pierre Magistry ; Benoit Sagot

Abstract: In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

2 0.44442439 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

3 0.44406548 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graphbased document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve a 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

4 0.44398117 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

Author: Lea Frermann ; Francis Bond

Abstract: We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. We simultaneously reduce ambiguity in multiple languages in a fully automatic way. Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking.

5 0.44366816 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

6 0.44262165 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

7 0.44057393 139 acl-2012-MIX Is Not a Tree-Adjoining Language

8 0.44031125 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

9 0.4402951 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

10 0.44016013 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

11 0.44012588 187 acl-2012-Subgroup Detection in Ideological Discussions

12 0.43981284 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

13 0.43945211 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

14 0.43934223 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

15 0.43892452 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

16 0.43889594 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

17 0.43879208 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

18 0.43822166 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

19 0.43773612 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

20 0.43692386 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling