acl acl2011 acl2011-339 knowledge-graph by maker-knowledge-mining

339 acl-2011-Word Alignment Combination over Multiple Word Segmentation


Source: pdf

Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao

Abstract: In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations.
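
The "link confidence score defined over multiple segmentations" can be illustrated with a small Python sketch. The equations survive only partially in the extracted sentences on this page, so the form used here is an assumption: a per-link confidence taken as the geometric mean of the c-to-e and e-to-c link posteriors, combined across segmentations by a weighted sum. All function names and the toy probability table are hypothetical.

```python
import math

def link_posterior(prob_table, src, tgt, candidates):
    """Posterior of linking src to tgt, normalised over all candidate targets.

    prob_table maps (src, tgt) pairs to translation probabilities
    (a hypothetical stand-in for the aligner's lexical table).
    """
    total = sum(prob_table.get((src, t), 0.0) for t in candidates)
    return prob_table.get((src, tgt), 0.0) / total if total > 0 else 0.0

def link_confidence(c2e_posterior, e2c_posterior):
    """Assumed form: geometric mean of the two directional link posteriors."""
    return math.sqrt(c2e_posterior * e2c_posterior)

def skeleton_link_confidence(link_confidences, weights):
    """Assumed form: weighted sum of the per-segmentation link confidences."""
    return sum(w * c for w, c in zip(weights, link_confidences))

# Toy numbers, invented for illustration only.
table = {("路", "road"): 0.6, ("路", "is"): 0.2}
c2e = link_posterior(table, "路", "road", ["road", "is", "slippery"])
score = skeleton_link_confidence([link_confidence(c2e, 0.75), 0.9], [0.5, 0.5])
```

A skeleton link would then be kept only if `score` exceeds a tunable threshold, which is the selection rule the summary describes.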

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. [sent-4, score-0.446]

2 Instead of combining word alignments of different models (Xiang et al. [sent-5, score-0.274]

3 , 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. [sent-6, score-0.703]

4 Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. [sent-7, score-0.658]

5 Our combination algorithm is simple, efficient, and easy to implement. [sent-8, score-0.067]

6 1 Introduction Word segmentation is the first step prior to word alignment for building statistical machine translations (SMT) on language pairs without explicit word boundaries such as Chinese-English. [sent-10, score-0.676]

7 Many works have focused on the improvement of word alignment models. [sent-11, score-0.294]

8 Most of the word alignment models take single word segmentation as input. [sent-16, score-0.61]

9 However, for languages such as Chinese, it is necessary to segment sentences into appropriate words for word alignment. [sent-17, score-0.059]

10 A large number of works have stressed the impact of word segmentation on word alignment. [sent-18, score-0.399]

11 (2009) try to learn word segmentation from bilingually motivated point of view; they use an initial alignment to learn word segmentation appropriate for SMT. [sent-23, score-1.054]

12 Some other methods try to combine multiple word segmentation at SMT decoding step (Xu et al. [sent-25, score-0.482]

13 Different segmentations are yet independently used for word alignment. [sent-31, score-0.424]

14 We introduce a tabular structure called word segmentation network (WSN for short) to encode multiple segmentations of a Chinese sentence, and define skeleton links (SL for short) between spans of WSN and words of English sentence. [sent-33, score-1.194]

15 The confidence score of a SL is defined over multiple segmentations. [sent-34, score-0.157]

16 Our combination algorithm picks up potential SLs based on their confidence scores similar to Xiang et al. [sent-35, score-0.162]

17 (2010), and then projects each selected SL to link in all segmentation respectively. [sent-36, score-0.37]

18 Our algorithm is simple, efficient, easy to implement, and can effectively improve word alignment quality on all segmentations simultaneously, and alignment errors caused [sent-37, score-0.894]

19 by inappropriate segmentations from single segmenter can be substantially reduced. (Page footer: Proceedings of the ACL-HLT 2011 Student Session, pages 1–5, Portland, OR, USA, 19-24 June 2011. © 2011 Association for Computational Linguistics.) [sent-39, score-0.454]

20 Two questions will be answered in the paper: 1) how to define the link confidence over multiple segmentations in combination algorithm? [sent-40, score-0.678]

21 (2010), the success of their word alignment combination of different models lies in the complementary information that the candidate alignments contain. [sent-42, score-0.58]

22 In our work, are multiple monolingually motivated segmentations complementary enough to improve the alignments? [sent-43, score-0.719]

23 Experiments of word alignment and SMT will be reported in section 4. [sent-46, score-0.294]

24 2 Word Segmentation Network We propose a new structure called word segmentation network (WSN) to encode multiple segmentations. [sent-47, score-0.387]

25 Due to space limitation, all definitions are presented by illustration of a running example of a sentence pair: 下雨路滑 (xia-yu-lu-hua) Road is slippery when raining We first introduce skeleton segmentation. [sent-48, score-0.503]

26 Given two segmentation S1 and S2 in Table 1, the word boundaries of their skeleton segmentation is the union of word boundaries (marked by “/”) in S1 and S2. [sent-49, score-1.048]

27 S1: 下 / 雨 / 路滑; S2: 下雨 / 路 / 滑; skeleton segmentation: 下 / 雨 / 路 / 滑. Table 1: The skeleton segmentation of two segmentations S1 and S2. [sent-50, score-0.958]

28 As is depicted, line 1 and 2 represent words in S1 and S2 respectively, line 3 represents skeleton words. [sent-52, score-0.388]

29 Each column, or span, comprises a skeleton word and words of S1 and S2 with the skeleton word as their morphemes at that position. [sent-53, score-0.821]

30 The number of columns of a WSN is equal to the number of skeleton words. [sent-54, score-0.336]

31 It should be noted that there may be words covering two or more spans, such as “路滑” 2 in S1, because the word “路滑” in S1 is split into two words “路” and “滑” in S2. [sent-55, score-0.059]

32 The skeleton word can be projected onto words in the same span in S1 and S2. [sent-58, score-0.593]

33 For clarity, words in each segmentation are indexed (1-based), for example, “路滑” in S1 is indexed by 3. [sent-59, score-0.313]

34 Figure 1 (panels a and b): An example alignment between WSN in Table 2 and English sentence “Road is slippery when raining”. [sent-63, score-0.48]

35 Each span of the WSN comprises words from different segmentations (Figure 1a), which indicates that the confidence score of a SL can be defined over words in the same span. [sent-65, score-0.555]

36 By projection function, a SL can be projected onto the link for each segmentation. [sent-66, score-0.296]

37 Therefore, the problem of combining word alignment over different segmentations can be transformed into the problem of selecting SLs for SA first, and then project the selected SLs onto links for each segmentation respectively. [sent-67, score-1.099]

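38 3 Combination Algorithm Given k alignments $a_h$ over segmentations $S_h$ respectively ($h = 1, \dots, k$), and $(WSN, E)$ is the pair of the Chinese WSN and its parallel English sentence. [sent-68, score-0.584]

39 Suppose $sl_{ij}$ is the SL between the j-th span and the i-th English word $e_i$, and $a_{h,ij'}$ is the link between the j'-th Chinese word in $S_h$ and $e_i$. Inspired by Huang (2009), we define the confidence score of each SL as follows. [sent-69, score-0.39]

40 $C(sl_{ij} \mid WSN, E) = \sum_{h=1}^{k} w_h \, c(a_{h,ij'} \mid S_h, E)$ (1), where $c(a_{h,ij'} \mid S_h, E)$ is the confidence score of the link $a_{h,ij'}$, defined as $c(a_{h,ij'} \mid S_h, E) = \sqrt{q_{c2e}(a_{h,ij'} \mid S_h, E)\, q_{e2c}(a_{h,ij'} \mid S_h, E)}$ (2), where the c-to-e link posterior probability is defined as $q_{c2e}(a_{h,ij'} \mid S_h, E) = \frac{p_h(e_i \mid c_{j'})}{\sum_{i'=1}^{I} p_h(e_{i'} \mid c_{j'})}$ [sent-70, score-0.345]

41 (3) and I is the length of E. The e-to-c link posterior probability $q_{e2c}(a_{h,ij'} \mid S_h, E)$ can be defined similarly. Our alignment combination algorithm is as follows. [sent-71, score-0.415]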

42 Compute the confidence score for each SL based on Eq. [sent-75, score-0.119]

43 A SL gets a vote from an alignment if the link it projects to appears in that alignment. The SLs getting at least one vote form the candidate set. [sent-77, score-0.091]

44 A SL is included if its confidence score is higher than a tunable threshold and one of the following is true: [sent-81, score-0.147]

45 Neither the span nor the English word is aligned so far; the span is not aligned and its left or right neighboring span is aligned to the English word so far; the English word is not aligned and its left or right neighboring word is aligned to the span so far. [sent-83, score-0.7]

46 All included SLs comprise SA. Map the SLs in SA onto each segmentation to get k new alignments respectively, i. [sent-85, score-0.216]

47 (1) and the threshold can be tuned on a hand-aligned dataset to maximize word alignment F-score on any segmentation with a hill climbing algorithm. [sent-95, score-0.407]

48 1 Data Our training set contains about 190K Chinese-English sentence pairs from LDC2003E14 corpus. [sent-101, score-0.026]

49 The Chinese portions of all the data are preprocessed by three monolingually motivated segmenters respectively. [sent-103, score-0.275]

50 These segmenters differ in either training method or specification, including ICTCLAS (I), Stanford segmenters with CTB (C) and PKU (P) specifications respectively. [sent-104, score-0.136]

51 , 2003), and generated two baseline alignments using GIZA++ enhanced by gdf heuristics (Koehn et al. [sent-106, score-0.182]

52 , 2003) and a linear discriminative word alignment model (DIWA) (Liu et al. [sent-107, score-0.294]

53 , 2010) on training set with the three segmentations respectively. [sent-108, score-0.365]

54 The decoding weights were optimized with Minimum Error Rate Training (MERT) (Och, 2003). [sent-110, score-0.042]

55 We used the hand-aligned set of 491 sentence pairs in Haghighi et al. [sent-111, score-0.115]

56 (2009), the first 250 sentence pairs were used to tune the weights in Eq. [sent-112, score-0.026]

57 [粮食署] [的] [380] [万] [美元] [救济金]: relief funds worth 3. [sent-118, score-0.026]

58 8 million us dollars from the national foodstuff department; [香港] [特别] [行政区] [行政] [长官]: chief executive in the hksar. Figure 2: Two examples (left and right respectively) of word alignment on segmentation C. [sent-119, score-0.578]

59 Baselines (DIWA) are in the top half, combined alignments are in the bottom half. [sent-120, score-0.226]

60 The solid line represents the correct link while the dashed line represents the bad link. [sent-121, score-0.165]

61 Note that we adapted the Chinese portion of this hand-aligned set to segmentation C. [sent-124, score-0.346]

62 2 Improvement of Word Alignment We first evaluate our combination approach on the hand-aligned set (on segmentation C). [sent-126, score-0.324]

63 Table 3 shows the precision, recall and F-score of baseline alignments and combined alignments. [sent-127, score-0.226]

64 As shown in Table 3, the combination alignments outperformed the baselines (setting C) in all settings in both GIZA and DIWA. [sent-128, score-0.274]

65 We notice that the higher F-score is mainly due to the higher precision in GIZA but higher recall in DIWA. [sent-129, score-0.108]

66 5% higher F-score respectively, and both of them outperformed C+P+I, we speculate it is because GIZA favors recall rather than DIWA, i. [sent-132, score-0.058]

67 GIZA may contain more bad links than DIWA, which would lead to more unstable F-score if more alignments produced by GIZA are combined, just as the poor precision (69. [sent-134, score-0.279]

68 However, DIWA favors precision over recall (this observation is consistent with Liu et al. [sent-136, score-0.054]

69 (2010)), which may explain that the more diversified segmentations lead to better results in DIWA. [sent-137, score-0.391]

70 4 Figure 2 gives baseline alignments and combined alignments on two sentence pairs in the training data. [sent-146, score-0.434]

71 As can be seen, alignment errors caused by inappropriate segmentations by single segmenter were substantially reduced. [sent-147, score-0.689]

72 For example, in the second example, the word “香港特别行政区 hksar” appears in segmentation I of the Chinese sentence, which benefits the generation of the three correct links connecting the words “香港”, “特别”, “行政区” respectively in the combined alignment. [sent-148, score-0.47]

73 3 Improvement in MT performance We then evaluate our combination approach on the SMT training data on all segmentations. [sent-150, score-0.067]

74 For efficiency, we just used the first 50k sentence pairs of the aligned training corpus with the three segmentations to build three SMT systems respectively. [sent-151, score-0.465]

75 Table 4 shows the BLEU scores of baselines and combined alignment (C+P+I, and then projected onto C, P, I respectively). [sent-152, score-0.462]

76 Our approach achieves improvement over baseline alignments on all segmentations consistently, without using any lattice decoding techniques as Dyer et al. [sent-153, score-0.616]

77 The gain of translation performance purely comes from improvements of word alignment on all segmentations by our proposed word alignment combination. [sent-155, score-0.993]

78 5 Conclusion We evaluated our word alignment combination over three monolingually motivated segmentations on Chinese-English pair. [sent-162, score-1.005]

79 We showed that the combined alignment significantly outperforms the baseline alignment with both higher F-score and higher BLEU score on all segmentations. [sent-163, score-0.594]

80 Our work also proved the effectiveness of link confidence score in combining different word alignment models (Xiang et al. [sent-164, score-0.559]

81 , 2010), and extend it to combine word alignments over different segmentations. [sent-165, score-0.29]

82 They aim to achieve better translation but not higher alignment quality of all segmentations. [sent-169, score-0.303]

83 They combine multiple segmentations at SMT decoding step, while we combine segmentation alternatives at word alignment step. [sent-170, score-0.859]

84 We believe that we can further improve the performance by combining these two kinds of works. [sent-171, score-0.033]

85 We also believe that combining word alignments over both monolingually motivated and bilingually motivated segmentations (Ma et al. [sent-172, score-1.068]

86 In the future, we will investigate combining word alignments on language pairs where both languages have no explicit word boundaries such as Chinese-Japanese. [sent-174, score-0.399]

87 Using a maximum entropy model to build segmentation lattices for mt. [sent-201, score-0.257]

88 Bilingually motivated domain-adapted word segmentation for statistical machine translation. [sent-229, score-0.388]

89 Diversify and combine: improving word alignment for machine translation on low-resource languages. [sent-233, score-0.334]

90 Do we need Chinese word segmentation for statistical machine translation? [sent-241, score-0.316]

91 Improved statistical machine translation by multiple Chinese word segmentation. [sent-249, score-0.137]
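
The skeleton segmentation and span projection described in the sentences above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the WSN is represented only implicitly through character offsets, and the S1 and S2 values are one reading of the running example 下雨路滑, whose Table 1 is damaged in this extract.

```python
from itertools import accumulate

def boundaries(words):
    """Character offsets at which each word of a segmentation ends."""
    return set(accumulate(len(w) for w in words))

def skeleton_segmentation(segmentations):
    """Cut the sentence at the union of the word boundaries of all segmentations."""
    sentence = "".join(segmentations[0])
    cuts = sorted(set().union(*(boundaries(s) for s in segmentations)))
    starts = [0] + cuts[:-1]
    return [sentence[a:b] for a, b in zip(starts, cuts)]

def project(span_index, skeleton, words):
    """Return the 1-based index of the word in `words` covering a 1-based span."""
    # Character offset at which the skeleton word of this span starts.
    offset = sum(len(w) for w in skeleton[:span_index - 1])
    end = 0
    for index, word in enumerate(words, start=1):
        end += len(word)
        if offset < end:
            return index
    raise ValueError("span lies outside the sentence")

S1 = ["下", "雨", "路滑"]   # assumed reading of segmentation S1
S2 = ["下雨", "路", "滑"]   # assumed reading of segmentation S2
skeleton = skeleton_segmentation([S1, S2])
```

Under these assumptions the skeleton has four spans, and `project(4, skeleton, S1)` maps the last span to the third word of S1, which agrees with the statement that “路滑” in S1 is indexed by 3.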


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('segmentations', 0.365), ('wsn', 0.355), ('skeleton', 0.336), ('sls', 0.325), ('segmentation', 0.257), ('alignment', 0.235), ('monolingually', 0.207), ('sl', 0.182), ('alignments', 0.182), ('diwa', 0.148), ('xiang', 0.114), ('link', 0.113), ('smt', 0.096), ('confidence', 0.095), ('handaligned', 0.089), ('raining', 0.089), ('chinese', 0.084), ('giza', 0.083), ('projected', 0.081), ('bilingually', 0.078), ('slippery', 0.078), ('onto', 0.077), ('road', 0.076), ('aligned', 0.074), ('links', 0.073), ('motivated', 0.072), ('segmenters', 0.068), ('combination', 0.067), ('xu', 0.059), ('word', 0.059), ('jia', 0.057), ('liu', 0.054), ('neighboring', 0.053), ('nanjing', 0.052), ('dyer', 0.051), ('inappropriate', 0.05), ('combine', 0.049), ('yanjun', 0.045), ('combined', 0.044), ('decoding', 0.042), ('sa', 0.04), ('boundaries', 0.04), ('span', 0.04), ('translation', 0.04), ('segmenter', 0.039), ('andy', 0.039), ('haghighi', 0.038), ('multiple', 0.038), ('respectively', 0.037), ('try', 0.037), ('complementary', 0.037), ('ma', 0.035), ('comprise', 0.034), ('chung', 0.033), ('shouxun', 0.033), ('combining', 0.033), ('network', 0.033), ('spans', 0.033), ('vote', 0.032), ('xiao', 0.032), ('comprises', 0.031), ('favors', 0.03), ('indexed', 0.028), ('bleu', 0.028), ('china', 0.028), ('higher', 0.028), ('lattice', 0.027), ('tokenization', 0.027), ('getting', 0.027), ('qun', 0.027), ('right', 0.027), ('neither', 0.027), ('pairs', 0.026), ('line', 0.026), ('pku', 0.026), ('diversified', 0.026), ('ictclas', 0.026), ('keiji', 0.026), ('yasuda', 0.026), ('stroppa', 0.026), ('shujie', 0.026), ('evgeny', 0.026), ('yonggang', 0.026), ('xinyan', 0.026), ('sne', 0.026), ('comb', 0.026), ('relief', 0.026), ('left', 0.026), ('baselines', 0.025), ('projection', 0.025), ('zens', 0.025), ('ascending', 0.024), ('ruiqiang', 0.024), ('stressed', 0.024), ('hwang', 0.024), ('climbing', 0.024), ('matusov', 0.024), ('score', 0.024), ('precision', 0.024), ('koehn', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

2 0.242347 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

Author: Jinxi Xu ; Jinying Chen

Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.

3 0.20297216 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

4 0.18500799 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

Author: Jason Naradowsky ; Kristina Toutanova

Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

5 0.16966942 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.

6 0.15449174 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

7 0.13099717 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

8 0.13055243 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition

9 0.12626989 141 acl-2011-Gappy Phrasal Alignment By Agreement

10 0.12621294 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

11 0.12288645 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features

12 0.12106571 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment

13 0.10551809 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

14 0.10370554 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices

15 0.10326721 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

16 0.10173962 217 acl-2011-Machine Translation System Combination by Confusion Forest

17 0.099159218 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

18 0.09489651 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

19 0.090194851 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

20 0.089063704 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.178), (1, -0.149), (2, 0.092), (3, 0.087), (4, 0.016), (5, 0.015), (6, 0.068), (7, 0.004), (8, 0.028), (9, 0.15), (10, 0.136), (11, 0.162), (12, -0.037), (13, 0.042), (14, -0.117), (15, 0.016), (16, 0.132), (17, -0.123), (18, 0.041), (19, 0.213), (20, 0.05), (21, -0.026), (22, -0.14), (23, 0.089), (24, 0.069), (25, 0.014), (26, 0.024), (27, 0.157), (28, 0.003), (29, 0.023), (30, 0.005), (31, 0.034), (32, -0.039), (33, 0.052), (34, -0.03), (35, 0.02), (36, 0.005), (37, 0.081), (38, -0.009), (39, -0.022), (40, 0.025), (41, 0.056), (42, -0.05), (43, 0.1), (44, -0.025), (45, 0.06), (46, 0.044), (47, -0.012), (48, -0.056), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96235883 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

2 0.74520075 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

3 0.72461629 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

Author: Jason Naradowsky ; Kristina Toutanova

Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

4 0.71193218 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

Author: Jinxi Xu ; Jinying Chen

Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.

5 0.66189921 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

Author: Kapil Thadani ; Kathleen McKeown

Abstract: The task of aligning corresponding phrases across two related sentences is an important component of approaches for natural language problems such as textual inference, paraphrase detection and text-to-text generation. In this work, we examine a state-of-the-art structured prediction model for the alignment task which uses a phrase-based representation and is forced to decode alignments using an approximate search approach. We propose instead a straightforward exact decoding technique based on integer linear programming that yields order-of-magnitude improvements in decoding speed. This ILP-based decoding strategy permits us to consider syntactically-informed constraints on alignments which significantly increase the precision of the model.

6 0.6560303 141 acl-2011-Gappy Phrasal Alignment By Agreement

7 0.64653319 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition

8 0.62612492 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment

9 0.62461537 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

10 0.58112931 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

11 0.57348454 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features

12 0.56649154 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices

13 0.56236529 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

14 0.50697082 340 acl-2011-Word Alignment via Submodular Maximization over Matroids

15 0.486866 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

16 0.48203257 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

17 0.4601478 66 acl-2011-Chinese sentence segmentation as comma classification

18 0.4520182 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

19 0.42446065 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

20 0.4183152 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.016), (17, 0.035), (32, 0.24), (37, 0.082), (39, 0.092), (41, 0.058), (55, 0.025), (59, 0.026), (72, 0.081), (91, 0.081), (96, 0.167)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86547089 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith

Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

2 0.82728112 151 acl-2011-Hindi to Punjabi Machine Translation System

Author: Vishal Goyal ; Gurpreet Singh Lehal

Abstract: Hindi-Punjabi being closely related language pair (Goyal V. and Lehal G.S., 2008), Hybrid Machine Translation approach has been used for developing Hindi to Punjabi Machine Translation System. Non-availability of lexical resources, spelling variations in the source language text, source text ambiguous words, named entity recognition and collocations are the major challenges faced while developing this system. The key activities involved during translation process are preprocessing, translation engine and post processing. Lookup algorithms, pattern matching algorithms etc. formed the basis for solving these issues. The system accuracy has been evaluated using intelligibility test, accuracy test and BLEU score. The hybrid system is found to perform better than the constituent systems. Keywords: Machine Translation, Computational Linguistics, Natural Language Processing, Hindi, Punjabi. Translate Hindi to Punjabi, Closely related languages. 1 Introduction Machine Translation system is a software designed that essentially takes a text in one language (called the source language), and translates it into another language (called the target language). There are a number of approaches for MT like Direct based, Transform based, Interlingua based, Statistical etc. But the choice of approach depends upon the available resources and the kind of languages involved. In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing. Since the present research work deals with a pair of closely related languages, i.e. Hindi-Punjabi, direct word-to-word translation approach is the obvious choice. As some rule based approach has also been used, Hybrid approach has been adopted for developing the system. 
An exhaustive survey of the machine translation systems developed so far, mentioning their accuracies and limitations, has already been given (Goyal V. and Lehal G.S., 2009).

2 System Architecture
2.1 Pre Processing Phase
The preprocessing stage is a collection of operations that are applied to the input data to make it processable by the translation engine. In this first phase of the Machine Translation system, the activities incorporated include text normalization, replacing collocations and replacing proper nouns.

2.2 Text Normalization
The variety in the alphabet, different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Such variations can sometimes be treated as errors in writing (Goyal V. and Lehal G.S., 2010).

2.3 Replacing Collocations
After passing through text normalization, the input text passes through the collocation replacement sub phase of the preprocessing phase. A collocation is two or more consecutive words with a special behavior (Choueka, 1988). For example, the collocation उत्तर प्रदेश (uttar pradēsh), if translated word to word, will be translated as ਜਵਾਬ ਰਾਜ (javāb rāj), but it must be translated as ਉੱਤਰ ਪ੍ਰਦੇਸ਼ (uttar pradēsh). Collocation extraction using the t-test is not fully accurate and returns a number of bigrams and trigrams that are not actually collocations. Such entries were therefore removed manually and the actual collocations were further extracted. The correct corresponding Punjabi translation for each extracted collocation is stored in the collocation table of the database. The collocation table of the database consists of 5000 such entries. In this sub phase, the normalized input text is analyzed.

(Proceedings of the ACL-HLT 2011 System Demonstrations, pages 1–6, Portland, Oregon, USA, 21 June 2011. © 2011 Association for Computational Linguistics)
Each collocation in the database found in the input text will be replaced with the Punjabi translation of the corresponding collocation. It is found that when tested on a corpus containing about 1,00,000 words, only 0.001 % collocations were found and replaced during the translation.

Figure 1: Overview of Hindi-Punjabi Machine Translation System

2.4 Replacing Proper Nouns
A great proportion of unseen words consists of proper nouns like personal names, days of month, days of week, country names, city names, bank names, organization names, ocean names, river names, university names etc., and if these are translated word to word, their meaning is changed. Even if the meaning is not affected, this step speeds up the translation process. Once these words are recognized and stored in the proper noun database, there is no need to decide about their translation or transliteration every time such words are present in the input text for translation. This gazetteer makes the translation accurate and fast. This list is self growing during each translation. Thus, to process this sub phase, the system requires a proper noun gazetteer that has been compiled offline. For this task, we have developed an offline module to extract proper nouns from the corpus based on some rules. Also, a Named Entity Recognition module has been developed based on the CRF approach (Sharma R. and Goyal V., 2011b).

2.5 Tokenizer
Tokenizers (also known as lexical analyzers or word segmenters) segment a stream of characters into meaningful units called tokens. The tokenizer takes the text generated by the preprocessing phase as input. Individual words or tokens are extracted and processed to generate their equivalents in the target language. This module, using a space or a punctuation mark as delimiter, extracts tokens (words) one by one from the text and passes them to the translation engine for analysis until the complete input text is read and processed.
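The collocation replacement and delimiter-based tokenization steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the one-entry `COLLOCATIONS` table, the delimiter set and the function names are assumptions made for the example.

```python
import re

# Hypothetical one-entry table; the paper reports about 5000 entries.
COLLOCATIONS = {"उत्तर प्रदेश": "ਉੱਤਰ ਪ੍ਰਦੇਸ਼"}

# Assumed delimiters: whitespace, common punctuation and the Devanagari danda.
DELIMITERS = re.compile(r"([\s,.;:!?।]+)")


def replace_collocations(text, table=COLLOCATIONS):
    """Replace every known collocation, trying longer entries first."""
    for source in sorted(table, key=len, reverse=True):
        text = text.replace(source, table[source])
    return text


def tokenize(text):
    """Return (token, delimiter) pairs so the output can be re-joined later."""
    parts = DELIMITERS.split(text)
    # With a capturing group, re.split alternates token, delimiter, token, ...
    tokens = parts[0::2]
    delims = parts[1::2] + [""]
    return [(t, d) for t, d in zip(tokens, delims) if t or d]
```

Keeping the delimiter next to each token mirrors the later step in which translated tokens are concatenated "along with the delimiter".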
2.6 Translation Engine
The translation engine is the main component of our Machine Translation system. It takes the tokens generated by the tokenizer as input and outputs the translated tokens in the target language. These translated tokens are concatenated one after another along with the delimiter. The modules included in this phase are explained below one by one.

2.6.1 Identifying Titles and Surnames
A title may be defined as a formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, or attainment, or used as a mark of respect. Thus the word next to a title and the word previous to a surname is usually a proper noun. And sometimes, a word used as the proper name of a person has its own meaning in the target language. Similarly, a surname may be defined as a name shared in common to identify the members of a family, as distinguished from each member's given name. It is also called family name or last name. When either title or surname is passed through the translation engine, it is translated by the system. This causes system failure, as these proper names should be transliterated instead of translated. For example, consider the Hindi sentence श्रीमान हर्ष जी हमारे यहाँ पधारे। (shrīmān harsh jī hamārē yahāṃ padhārē). In this sentence, हर्ष (harsh) has the meaning “joy”. The equivalent translation of हर्ष (harsh) in the target language is ਖੁਸ਼ੀ (khushī). Similarly, consider the Hindi sentence प्रकाश सिंह हमारे यहाँ पधारे। (prakāsh siṃh hamārē yahāṃ padhārē). Here, the word प्रकाश (prakāsh) is acting as a proper noun and it must be transliterated and not translated, because सिंह (siṃh) is a surname and the word previous to it is a proper noun. Thus, a small module has been developed for locating such proper nouns by means of the accompanying title or surname. There is one special character ‘॰’ in Devanagari script to mark symbols like डा॰, प्रो॰.
If this module finds this symbol to be a title or surname, the word next to or previous to this token, as the case may be for a title or surname respectively, will be transliterated, not translated. The title and surname databases consist of 14 and 654 entries respectively. These databases can be extended at any time to allow new titles and surnames to be added. This module was tested on a large Hindi corpus and showed that about 2-5 % of the input text, depending upon its domain, consists of proper nouns. Thus, this module plays an important role in translation.

2.6.2 Hindi Morphological Analyzer
This module finds the root word for the token and its morphological features. The morphological analyzer developed by IIT-H has been ported to the Windows platform to make it usable for this system (Goyal V. and Lehal G.S., 2008a).

2.6.3 Word-to-Word Translation Using Lexicon Lookup
If a token is not a title or a surname, it is looked up in the HPDictionary database containing Hindi to Punjabi direct word to word translations. If it is found, it is used for translation. If no entry is found in the HPDictionary database, it is sent to the next sub phase for processing. The HPDictionary database consists of 54,127 entries. This database can be extended at any time to allow new entries to be added to the dictionary.

2.6.4 Resolving Ambiguity
Among the number of approaches for disambiguation, the most appropriate approach for our Machine Translation system to determine the correct meaning of a Hindi word in a particular usage is to examine its context using the N-gram approach. After analyzing the past experiences of various authors, we have chosen the value of n to be 3 and 2, i.e. trigram and bigram approaches respectively, for our system. Trigrams are further categorized into three different types. The first category of trigram consists of the context of one word previous to and one word next to the ambiguous word. The second category of trigram consists of the context of the two adjacent previous words to the ambiguous word.
The third category of trigram consists of the context of the two adjacent next words to the ambiguous word. Bigrams are also categorized into two categories. The first category of bigrams consists of the context of one previous word to the ambiguous word and the second category consists of one context word next to the ambiguous word. For this purpose, a Hindi corpus consisting of about 2 million words was collected from different sources like online newspapers' daily news, blogs, Prem Chand stories, Yashwant Jain stories, articles etc. The most common list of ambiguous words was found. We have found a list of 75 ambiguous words, out of which the most frequent are से (sē) and और (aur) (Goyal V. and Lehal G.S., 2011).

2.6.5 Handling Unknown Words
2.6.5.1 Word Inflectional Analysis and Generation
In linguistics, a suffix (also sometimes called a postfix or ending) is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns or adjectives, and verb endings. Hindi is a (relatively) free word-order and highly inflectional language. Because of their common origin, both languages have very similar structure and grammar. The difference is only in words and in pronunciation, e.g. the word for boy is लड़का in Hindi and ਮੁੰਡਾ in Punjabi, and sometimes even that difference is not there, as in घर (ghar) and ਘਰ (ghar). The inflection forms of both these words in Hindi and Punjabi are also similar. In this activity, inflectional analysis without using morphology has been performed for all those tokens that are not processed by the morphological analysis module. For performing inflectional analysis, a rule based approach has been followed. When a token is passed to this sub phase for inflectional analysis, if any pattern of a regular expression (inflection rule) matches the token, that rule is applied to the token and its equivalent translation in Punjabi is generated based on the matched rule(s).
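The three trigram and two bigram context categories used for disambiguation in Section 2.6.4 can be illustrated with a short sketch. The function name, the dictionary keys and the use of `None` for missing neighbours at sentence boundaries are assumptions made for this example; the paper does not describe how boundaries are handled.

```python
def context_ngrams(tokens, i):
    """Return the five context windows around the ambiguous word tokens[i].

    Neighbours that fall outside the sentence are reported as None.
    """
    def at(j):
        return tokens[j] if 0 <= j < len(tokens) else None

    word = tokens[i]
    return {
        # Trigram, category 1: one previous word and one next word.
        "trigram_prev_next": (at(i - 1), word, at(i + 1)),
        # Trigram, category 2: two adjacent previous words.
        "trigram_prev_two": (at(i - 2), at(i - 1), word),
        # Trigram, category 3: two adjacent next words.
        "trigram_next_two": (word, at(i + 1), at(i + 2)),
        # Bigram, category 1: one previous word.
        "bigram_prev": (at(i - 1), word),
        # Bigram, category 2: one next word.
        "bigram_next": (word, at(i + 1)),
    }
```

Each window would then be looked up in statistics gathered from the corpus to pick the sense of the ambiguous word; that lookup is not shown because the paper does not specify it here.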
There is also a check on the generated word for its correctness. We use a database of correct Punjabi words for testing the correctness of the generated word.

2.6.5.2 Transliteration
This module is beneficial for handling out-of-vocabulary words. For example, the word विशाल (vishāl) is transliterated as ਵਿਸ਼ਾਲ (vishāl), whereas it is translated as ਵੱਡਾ. There must be some method in every Machine Translation system for words like technical terms and proper names of persons, places, objects etc. that cannot be found in translation resources such as the Hindi-Punjabi bilingual dictionary, surnames database, titles database etc., and transliteration is an obvious choice for such words (Goyal V. and Lehal G.S., 2009a).

2.7 Post-Processing
2.7.1 Agreement Corrections
In spite of the great similarity between Hindi and Punjabi, there are still a number of important agreement divergences in gender and number. The output generated by the translation engine phase becomes the input for the post-processing phase. This phase corrects the agreement errors based on rules implemented in the form of regular expressions (Goyal V. and Lehal G.S., 2011).

3 Evaluation and Results
The evaluation document set consisted of documents from various online newspapers' news, articles, blogs, biographies etc. This test bed consisted of 35500 words and was translated using our Machine Translation system.

3.1 Test Document
For our Machine Translation system evaluation, we have used the benchmark sampling method for selecting the set of sentences. Input sentences are selected from randomly selected news (sports, politics, world, regional, entertainment, travel etc.), articles (published by various writers, philosophers etc.), literature (stories by Prem Chand, Yashwant Jain etc.), official language for office letters (the language officially used on the files in Government offices) and blogs (posted by the general public in forums etc.). Care has been taken to ensure that sentences use a variety of constructs.
All possible constructs, including simple as well as complex ones, are incorporated in the set. The sentence set also contains all types of sentences such as declarative, interrogative, imperative and exclamatory. Sentence length is not restricted, although care has been taken that single sentences do not become too long. The following table shows the test data set:

Table 1: Test data set for the evaluation of Hindi to Punjabi Machine Translation [table body not recoverable from the extracted text]

3.2 Experiments
It is also important to choose appropriate evaluators for our experiments. Thus, depending upon the requirements of the above mentioned tests, 50 people of different professions were selected for performing the experiments. 20 persons were from villages, who knew only Punjabi and did not know Hindi, and 30 persons were from different professions having knowledge of both Hindi and Punjabi. Average ratings for the sentences of the individual translations were then summed up (separately according to intelligibility and accuracy) to get the average scores. The percentage of accurate sentences and of intelligible sentences was also calculated separately by counting the number of sentences.

3.2.1 Intelligibility Evaluation
The evaluators do not have any clue about the source language, i.e. Hindi. They judge each sentence (in the target language, i.e. Punjabi) on the basis of its comprehensibility. The target user is a layman who is interested only in the comprehensibility of translations. Intelligibility is affected by grammatical errors, mistranslations, and untranslated words.

3.2.1.1 Results
The responses of the evaluators were analysed and the following are the results:
• 70.3 % sentences got the score 3, i.e. they were perfectly clear and intelligible.
• 25.1 % sentences got the score 2, i.e. they were generally clear and intelligible.
• 3.5 % sentences got the score 1, i.e. they were hard to understand.
• 1.1 % sentences got the score 0, i.e. they were not understandable.
So we can say that about 95.40 % sentences are intelligible. These sentences are those which have a score of 2 or above. Thus, we can say that the direct approach can translate Hindi text to Punjabi text with considerably good accuracy.

3.2.2 Accuracy Evaluation / Fidelity Measure
The evaluators are provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence. It is important to check whether the meaning of the source language sentence is preserved in the translation. This property is called accuracy.

3.2.2.1 Results
Initially, the null hypothesis is assumed, i.e. the system's performance is null. The author assumes that the system is dumb and does not produce any valuable output. By the intelligibility analysis and the accuracy analysis, it has been proved wrong. The accuracy percentage for the system is found to be 87.60%. Further investigations reveal that out of 13.40%:
• 80.6 % sentences achieve a match between 50 to 99%
• 17.2 % of remaining sentences were marked with less than 50% match against the correct sentences.
• Only 2.2 % sentences are those which are found unfaithful.
A match of lower than 50% does not mean that the sentences are not usable. After some post editing, they can fit properly in the translated text (Goyal V. and Lehal G.S., 2009b).

3.2.3 BLEU Score
As no Hindi-Punjabi parallel corpus was available, for testing the system automatically we generated a Hindi-Punjabi parallel corpus of about 10K sentences. The BLEU score comes out to be 0.7801.

5 Conclusion
In this paper, a hybrid translation approach for translating text from Hindi to Punjabi has been presented. The proposed architecture has shown extremely good results and is found to be appropriate for MT systems between closely related language pairs.
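As a check on the intelligibility figures in Section 3.2.1.1, the short sketch below recomputes the share of sentences scoring 2 or above from the reported score distribution. The function name and the extra mean-score value are assumptions added for illustration; the mean is derived here and is not a figure reported in the paper.

```python
def summarize(distribution, intelligible_from=2):
    """Return (percent of sentences at or above the cutoff, mean score)."""
    total = sum(distribution.values())
    intelligible = sum(p for s, p in distribution.items() if s >= intelligible_from)
    mean_score = sum(s * p for s, p in distribution.items()) / total
    return intelligible, mean_score


# Percent of sentences per score, as reported in Section 3.2.1.1.
REPORTED = {3: 70.3, 2: 25.1, 1: 3.5, 0: 1.1}
intelligible, mean_score = summarize(REPORTED)
```

With these inputs the intelligible share comes to about 95.4 %, matching the reported value, and the derived mean is roughly 2.65 on the 0 to 3 scale.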
Copyright
The developed system has already been copyrighted with The Registrar, Punjabi University, Patiala, with the same authors as this publication.

Acknowledgement
We are thankful to Dr. Amba Kulkarni, University of Hyderabad, for her support in providing technical assistance for developing this system.

References
Bharati, Akshar, Chaitanya, Vineet, Kulkarni, Amba P., Sangal, Rajeev. 1997. Anusaaraka: Machine Translation in stages. Vivek, A Quarterly in Artificial Intelligence, Vol. 10, No. 3, NCST, Bangalore, India, pp. 22-25.
Goyal V., Lehal G.S. 2008. Comparative Study of Hindi and Punjabi Language Scripts. Nepalese Linguistics, Journal of the Linguistics Society of Nepal, Volume 23, November Issue, pp. 67-82.
Goyal V., Lehal G.S. 2008a. Hindi Morphological Analyzer and Generator. In Proc.: 1st International Conference on Emerging Trends in Engineering and Technology, G.H. Raisoni College of Engineering, Nagpur, July 16-19, 2008, pp. 1156-1159, IEEE Computer Society Press, California, USA.
Goyal V., Lehal G.S. 2009. Advances in Machine Translation Systems. Language In India, Volume 9, November Issue, pp. 138-150.
Goyal V., Lehal G.S. 2009a. A Machine Transliteration System for Machine Translation System: An Application on Hindi-Punjabi Language Pair. Atti Della Fondazione Giorgio Ronchi (Italy), Volume LXIV, No. 1, pp. 27-35.
Goyal V., Lehal G.S. 2009b. Evaluation of Hindi to Punjabi Machine Translation System. International Journal of Computer Science Issues, France, Vol. 4, No. 1, pp. 36-39.
Goyal V., Lehal G.S. 2010. Automatic Spelling Standardization for Hindi Text. In: 1st International Conference on Computer & Communication Technology, Moti Lal Nehru National Institute of Technology, Allahabad, September 17-19, 2010, pp. 764-767, IEEE Computer Society Press, California.
Goyal V., Lehal G.S. 2011. N-Grams Based Word Sense Disambiguation: A Case Study of Hindi to Punjabi Machine Translation System. International Journal of Translation. (Accepted, In Print).
Goyal V., Lehal G.S. 2011a. Hindi to Punjabi Machine Translation System. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 236-241, Springer CCIS 139, Germany.
Sharma R., Goyal V. 2011b. Named Entity Recognition Systems for Hindi using CRF Approach. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 31-35, Springer CCIS 139, Germany.

same-paper 3 0.76863098 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao

Abstract: In this paper, we present a new word alignment combination approach for language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentations. Our approach is based on a link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations.
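To make the idea of combining alignments over several segmentations concrete, here is a small sketch. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not define the link confidence score, so this example simply projects word links to character positions and keeps a link when a chosen fraction of segmentations supports it. All function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def char_spans(words):
    """Map each word index to the range of character positions it covers."""
    spans, start = [], 0
    for w in words:
        spans.append(range(start, start + len(w)))
        start += len(w)
    return spans


def combine_alignments(segmented, threshold=0.5):
    """Combine alignments produced under different segmentations.

    segmented: list of (words, links) pairs, where links are
    (source_word_index, target_index) tuples for that segmentation.
    Returns character-level links (char_pos, target_index) supported by at
    least `threshold` of the segmentations. The voting score is an assumption.
    """
    votes = Counter()
    for words, links in segmented:
        spans = char_spans(words)
        # Project every word-level link down to the characters of that word.
        char_links = {(c, t) for s, t in links for c in spans[s]}
        votes.update(char_links)
    n = len(segmented)
    return {link for link, count in votes.items() if count / n >= threshold}
```

Working at the character level gives all segmentations of the same sentence a common index space, which is what lets links from different segmentations be compared at all.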

4 0.69050074 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

Author: Ryo Nagata ; Edward Whittaker ; Vera Sheinman

Abstract: The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallow-parsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POS-tagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

5 0.68668002 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough to encourage further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and is able to parse both word and phrase structures in a unified way.

1 Why Parse Word Structures?
Research in Chinese word segmentation has progressed tremendously in recent years, with the state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003).
Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. Without a doubt, there is complete agreement on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more consistent (see Footnote 1) and there could be less duplication of efforts in developing the expensive annotated corpus.

Figure 1: Example of a word with internal structure: (NN (JJf 䉂) (NNf 擌奒)).

Footnote 1: Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, the context will always make it clear what is being referred to with the term “word”.

(Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1405–1414, Portland, Oregon, June 19-24, 2011. © 2011 Association for Computational Linguistics)

The second reason is that applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. It's considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the latter finer-grained output is preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004).

The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed of three coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’. Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’. With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2).

Figure 2: Example of telescopic compound (a) and separable word (b).

The last reason why we should care about word structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-of-vocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in the Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
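The point that each application can extract the granularity it needs from a word-structure tree can be illustrated with a small sketch. The tuple-based tree encoding, the function names and the English stand-in strings are assumptions made for this example; they are not the paper's data format or parser output.

```python
def leaves(tree):
    """Yield the leaf strings of a (label, children) tree, left to right."""
    label, children = tree
    if isinstance(children, str):
        yield children
    else:
        for child in children:
            yield from leaves(child)


def segment(tree, fine=False):
    """Coarse: the whole word as one unit. Fine: one unit per top-level child."""
    if fine and not isinstance(tree[1], str):
        return ["".join(leaves(child)) for child in tree[1]]
    return ["".join(leaves(tree))]


# Stand-in for the prefix-plus-root structure of Figure 1, using English glosses.
word = ("NN", [("JJf", "vice-"), ("NNf", "president")])
coarse_units = segment(word)
fine_units = segment(word, fine=True)
```

Here `coarse_units` holds the single unit an information extraction system might want, while `fine_units` holds the prefix and root separately, as a translation system might prefer.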

6 0.68027639 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

7 0.67964023 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

8 0.67462879 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

9 0.67288148 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

10 0.67179191 117 acl-2011-Entity Set Expansion using Topic information

11 0.67173326 261 acl-2011-Recognizing Named Entities in Tweets

12 0.67150623 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

13 0.66980869 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

14 0.66975236 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

15 0.6690433 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

16 0.66796505 28 acl-2011-A Statistical Tree Annotator and Its Applications

17 0.66732067 238 acl-2011-P11-2093 k2opt.pdf

18 0.66699719 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

19 0.66579878 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

20 0.66557848 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition