acl acl2010 acl2010-147 knowledge-graph by maker-knowledge-mining

147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation


Source: pdf

Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Sheng Li

Abstract: This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving the phrase table for phrase-based SMT. The experimental results show that our method improves the performance of both word alignment and translation quality significantly. As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving the phrase table for phrase-based SMT. [sent-8, score-1.365]

2 The experimental results show that our method improves the performance of both word alignment and translation quality significantly. [sent-9, score-0.42]

3 1 Introduction Statistical bilingual word alignment (Brown et al. [sent-13, score-0.419]

4 But as far as we know, few previous studies exploit the collocation relations of the words in a phrase. [sent-20, score-0.743]

5 We first identify potentially collocated words and estimate collocation probabilities from monolingual corpora using a Monolingual Word Alignment (MWA) method (Liu et al. [sent-26, score-1.179]

6 Then the collocation information is employed to improve Bilingual Word Alignment (BWA) for various kinds of SMT systems and to improve the phrase table for phrase-based SMT. [sent-28, score-0.916]

7 To improve BWA, we re-estimate the alignment probabilities by using the collocation probabilities of words in the same cept. [sent-29, score-1.338]

8 An alignment between a source multi-word cept and a target word is a many-to-one multi-word alignment. [sent-32, score-0.421]

9 To improve the phrase table, we calculate phrase collocation probabilities based on word collocation probabilities. [sent-33, score-1.925]

10 Then the phrase collocation probabilities are used as additional features in phrase-based SMT systems. [sent-34, score-0.997]

11 The alignment improvement results in an improvement of 2. [sent-36, score-0.316]

12 If we use phrase collocation probabilities as additional features, the phrase-based ... [sent-39, score-0.997]

13 The paper is organized as follows: In section 2, we introduce the collocation model based on the MWA method. [sent-43, score-0.762]

14 In sections 3 and 4, we show how to improve the BWA method and the phrase table using collocation models respectively. [sent-44, score-0.897]

15 A collocation is composed of two words occurring as either a consecutive word sequence or an interrupted word sequence in sentences, such as "by accident" or "take . [sent-48, score-0.932]

16 This method adapts the bilingual word alignment algorithm to the monolingual scenario to extract collocations only from monolingual corpora. [sent-54, score-0.799]

17 1 Monolingual word alignment The monolingual corpus is first replicated to generate a parallel corpus, where each sentence pair consists of two identical sentences in the same language. [sent-58, score-0.496]

18 Then the monolingual word alignment algorithm is employed to align the potentially collocated words in the monolingual sentences. [sent-59, score-0.688]
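
The replication step described in sentence 17 is mechanical; a minimal sketch in Python (the function name and data format are illustrative, not from the paper):

```python
def replicate_corpus(sentences):
    """Turn a monolingual corpus into a 'parallel' corpus of identical
    sentence pairs, the input the monolingual word aligner expects."""
    return [(sent, sent) for sent in sentences]

# Each pair is the same tokenized sentence twice; the aligner is then
# constrained so that no position may be aligned to itself (sentence 21).
pairs = replicate_corpus([["he", "took", "medicine", "by", "accident"]])
```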

19 (2009), we employ the MWA Model 3 (corresponding to IBM Model 3) to calculate the probability of the monolingual word alignment sequence, as shown in Eq. [sent-61, score-0.553]

20 $p_{\mathrm{MWA\ Model\ 3}}(S, A \mid S) = \prod_{i=1}^{l} n(\varphi_i \mid w_i) \prod_{j=1}^{l} t(w_j \mid w_{a_j})\, d(j \mid a_j, l)$ (1) where $S = w_1^l$ is a monolingual sentence and $\varphi_i$ denotes the number of words that are aligned with $w_i$. [sent-63, score-0.346]

21 Since a word never collocates with itself, the alignment set is denoted as $A = \{(i, a_i) \mid i \in [1, l]\ \&\ a_i \neq i\}$. [sent-64, score-0.308]

22 Three kinds of probabilities are involved in this model: the word collocation probability $t(w_j \mid w_{a_j})$, the position collocation probability $d(j \mid a_j, l)$, and the fertility probability $n(\varphi_i \mid w_i)$. [sent-65, score-2.001]

23 In the MWA method, an algorithm similar to that for bilingual word alignment is used to estimate the parameters of the models, except that a word cannot be aligned to itself. [sent-66, score-0.533]
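
To make Eq. (1) concrete, a sketch that scores one candidate monolingual alignment, assuming the three probability tables have already been estimated; the dictionary-based tables and names are hypothetical, not the authors' implementation:

```python
import math

def mwa_model3_log_prob(words, align, t_prob, d_prob, n_prob):
    """Log of Eq. (1): fertility, translation and distortion factors.

    words  : the sentence w_1..w_l (0-indexed list here)
    align  : dict mapping position j to its aligned position a_j, a_j != j
    t_prob : word collocation table, keyed (w_j, w_{a_j})
    d_prob : position collocation table, keyed (j, a_j, l)
    n_prob : fertility table, keyed (phi_i, w_i)
    """
    l = len(words)
    fertility = [0] * l                 # phi_i: words aligned to position i
    for j, aj in align.items():
        assert aj != j, "a word never collocates with itself"
        fertility[aj] += 1
    logp = 0.0
    for i in range(l):                  # fertility product
        logp += math.log(n_prob[(fertility[i], words[i])])
    for j, aj in align.items():         # translation and distortion products
        logp += math.log(t_prob[(words[j], words[aj])])
        logp += math.log(d_prob[(j, aj, l)])
    return logp
```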

24 2 Collocation probability Given the monolingual word-aligned corpus, we calculate the frequency of two words aligned in the corpus, denoted as $freq(w_i, w_j)$. [sent-71, score-0.431]

25 Then the probability for each aligned word pair is estimated as follows: $p(w_i \mid w_j) = \frac{freq(w_i, w_j)}{\sum_{w} freq(w, w_j)}$ (2) and $p(w_j \mid w_i) = \frac{freq(w_i, w_j)}{\sum_{w} freq(w_i, w)}$ (3). In this paper, the words of a collocation are symmetric and we do not determine which word is the head and which word is the modifier. [sent-73, score-1.013]

26 Thus, the collocation probability of two words is defined as the average of both probabilities, as in Eq. [sent-74, score-0.796]

27 r(wi,wj)p(wi|wj)2p(wj|wi) (4) If we have multiple monolingual corpora to estimate the collocation probabilities, we interpolate the probabilities as shown in Eq. [sent-76, score-1.061]

28 $r(w_i, w_j) = \sum_{k} \lambda_k\, r_k(w_i, w_j)$ (5) 3 Improving Statistical Bilingual Word Alignment We use the collocation information to improve both one-directional and bi-directional bilingual word alignments. [sent-79, score-0.927]
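
Before moving into the alignment models, a minimal sketch of the estimation in Eqs. (2)-(5) above, assuming `freq` maps aligned word pairs to their counts in the monolingual word-aligned corpus (all names are illustrative):

```python
from collections import defaultdict

def collocation_probs(freq):
    """Eqs. (2)-(4): conditional probabilities from pair counts, then the
    symmetric collocation probability as their average."""
    left = defaultdict(float)   # sum_w freq(w, wj), denominator of Eq. (2)
    right = defaultdict(float)  # sum_w freq(wi, w), denominator of Eq. (3)
    for (wi, wj), c in freq.items():
        right[wi] += c
        left[wj] += c
    r = {}
    for (wi, wj), c in freq.items():
        p_i_given_j = c / left[wj]                       # Eq. (2)
        p_j_given_i = c / right[wi]                      # Eq. (3)
        r[(wi, wj)] = (p_i_given_j + p_j_given_i) / 2.0  # Eq. (4)
    return r

def interpolate(models, weights):
    """Eq. (5): interpolate collocation models from several corpora."""
    r = defaultdict(float)
    for model, lam in zip(models, weights):
        for pair, p in model.items():
            r[pair] += lam * p
    return dict(r)
```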

29 The alignment probabilities are re-estimated by using the collocation probabilities of words in the same cept. [sent-80, score-1.158]

30 1 Improving one-directional bilingual word alignment According to the BWA method, given a bilingual sentence pair $E = e_1^l$ and $F = f_1^m$, the optimal alignment sequence $A$ between E and F can be obtained as in Eq. [sent-82, score-0.864]

31 IBM Model 1 only employs the word translation model to calculate the probabilities of alignments. [sent-85, score-0.308]

32 Although the fertility model is used to restrict the number of source words in a cept and the position distortion model is used to describe the correlation of the positions of the source words, the quality of many-to-one alignments is lower than that of one-to-one alignments. [sent-90, score-0.507]

33 Intuitively, the probability of the source words aligned to a target word is not only related to the fertility and the relative positions, but also to the lexical tokens of the words, such as common phrases or idioms. [sent-91, score-0.35]

34 In this paper, we use the collocation probability of the source words in a cept to measure their correlation strength. [sent-92, score-0.933]

35 Given the source words $\{f_j \mid a_j = i\}$ aligned to $e_i$, their collocation probability is calculated as in Eq. [sent-93, score-1.019]

36 r({fj|aji})2ki11g ki *1r((fi[i]k 1,)f[i]g) (7) Here, f[i]k and f[i]g denote the kth word and gth word in {fj | aj  i} ; r(f[i]k ,f[i]g) denotes the collocation probability of f[i]k and f[i]g , as shown in Eq. [sent-95, score-0.999]

37 Thus, the collocation probability of the alignment sequence of a sentence pair can be calculated according to Eq. [sent-97, score-1.13]

38 $r(F, A \mid E) = \prod_{i=1}^{l} r(\{f_j \mid a_j = i\})$ (8) Based on the maximum entropy framework, we combine the collocation model and the BWA model to calculate the word alignment probability of a sentence pair, as shown in Eq. [sent-99, score-1.205]
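
A sketch of Eqs. (7)-(8), assuming `r` is the pairwise collocation table from section 2 and `align` maps each target position to the list of source positions aligned to it; the one-word-cept fallback $r(F)$ is the one discussed at sentence 44 below (all names hypothetical):

```python
from itertools import combinations

def cept_collocation_prob(cept_words, r):
    """Eq. (7): average collocation probability over all unordered word
    pairs in a multi-word cept."""
    pairs = list(combinations(cept_words, 2))
    total = sum(r.get((a, b), r.get((b, a), 0.0)) for a, b in pairs)
    return total / len(pairs)

def alignment_collocation_prob(src_words, align, r, r_sentence):
    """Eq. (8): product of cept collocation probabilities over the target
    positions; one-word cepts fall back to r(F), the collocation
    probability of the whole source sentence (sentence 44)."""
    prob = 1.0
    for i, positions in align.items():
        cept = [src_words[j] for j in positions]
        if len(cept) >= 2:
            prob *= cept_collocation_prob(cept, r)
        else:
            prob *= r_sentence
    return prob
```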

39 We use two features in this paper, namely alignment probabilities and collocation probabilities. [sent-102, score-1.158]

40 We first train IBM Model 4 and the collocation model on the bilingual corpus and the monolingual corpus respectively. [sent-104, score-1.04]

41 , 1999) to search for the optimal alignment sequence of a given sentence pair, where the score of an alignment sequence is calculated as in Eq. [sent-106, score-0.655]

42 (8) only deals with many-to-one alignments, but the alignment sequence of a sentence pair also includes one-to-one alignments. [sent-109, score-0.334]

43 To calculate the collocation probability of the alignment sequence, we should also consider the collocation probabilities of such one-to-one alignments. [sent-110, score-1.997]

44 To solve this problem, we use the collocation probability of the whole source sentence, $r(F)$, as the collocation probability of a one-word cept. [sent-111, score-1.636]
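
Eqs. (9)-(10) do not survive in this summary, but sentences 38-41 describe a maximum-entropy (log-linear) combination of two features searched by hill climbing; a generic sketch under those assumptions, with hypothetical names:

```python
def combined_score(log_p_bwa, log_r_colloc, lam1, lam2):
    """Log-linear combination of the two features named in sentence 39:
    the BWA alignment probability and the collocation probability."""
    return lam1 * log_p_bwa + lam2 * log_r_colloc

def hill_climb(initial, neighbors, score):
    """Greedy search over alignment sequences: move to the best-scoring
    neighbor until no neighbor improves the current score."""
    current, best = initial, score(initial)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
    return current
```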

45 2 Improving bi-directional bilingual word alignments In word alignment models implemented in GIZA++, only one-to-one and many-to-one word alignment links can be found. [sent-113, score-0.962]

46 Bi-directional alignments are generally obtained from the source-to-target alignments $A_{s2t}$ and the target-to-source alignments $A_{t2s}$, using some heuristic rules (Koehn et al. [sent-116, score-0.579]

47 This method ignores the correlation of the words in the same alignment unit, so an alignment may include many unrelated words, which influences the performance of SMT systems. [sent-118, score-0.647]

48 In order to solve the above problem, we incorporate the collocation probabilities into the bi-directional word alignment process. [sent-123, score-1.222]

49 $(e_1^{l'}, f_1^{m'}, A)^{*} = \arg\max_{A \subseteq A_{s2t} \cup A_{t2s}} \Bigl\{ \prod_{(e_i, f_j) \in A} p(e_i, f_j)^{\lambda_1}\, r(e_i)^{\lambda_2}\, r(f_j)^{\lambda_3} \Bigr\}$ (11) Here, $r(f_j)$ and $r(e_i)$ denote the collocation probabilities of the words in the source language and target language respectively, which are calculated by using Eq. [sent-130, score-0.958]

50 $p(e_i, f_j) = \frac{\sum_{e \in e_i} \sum_{f \in f_j} \bigl(p(e \mid f) + p(f \mid e)\bigr)/2}{|e_i| \cdot |f_j|}$ (12) $p(e \mid f)$ and $p(f \mid e)$ are the source-to-target and target-to-source translation probabilities trained from the word-aligned bilingual corpus. [sent-135, score-0.429]
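
A sketch of the per-unit score inside Eqs. (11)-(12), assuming lexical tables `p_ef` for $p(e \mid f)$ and `p_fe` for $p(f \mid e)$; how the best subset of the alignment union is searched is left abstract, and all names are illustrative:

```python
def unit_translation_prob(e_words, f_words, p_ef, p_fe):
    """Eq. (12): average bidirectional lexical probability over all word
    pairs in a candidate alignment unit, normalized by the unit sizes."""
    total = 0.0
    for e in e_words:
        for f in f_words:
            total += (p_ef.get((e, f), 0.0) + p_fe.get((f, e), 0.0)) / 2.0
    return total / (len(e_words) * len(f_words))

def unit_score(e_words, f_words, p_ef, p_fe, r_e, r_f, lams):
    """One factor of the product in Eq. (11): the unit translation
    probability combined with the collocation probabilities of the unit's
    target-side and source-side words, raised to the feature weights."""
    lam1, lam2, lam3 = lams
    return (unit_translation_prob(e_words, f_words, p_ef, p_fe) ** lam1
            * r_e ** lam2 * r_f ** lam3)
```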

51 4 Improving Phrase Table A phrase-based SMT system automatically extracts bilingual phrase pairs from the word-aligned bilingual corpus. [sent-136, score-0.423]

52 In this paper, we use the collocation probability to measure how likely the words are to compose a phrase. [sent-138, score-0.796]

53 For each bilingual phrase pair automatically extracted from the word-aligned corpus, we calculate the collocation probabilities of the source phrase and the target phrase respectively, according to Eq. [sent-139, score-1.484]

54 2n1 n r(wi,wj) r(w1n)i1jn*i1(n 1) (13) Here, w1n denotes a phrase with n words; r(wi ,wj ) denotes the collocation probability of a AdBitloTnCcago ulrbampelo c1rano. [sent-141, score-0.979]

55 For a phrase including only one word, we set a fixed collocation probability, which is the average of the collocation probabilities of the sentences on a development set. [sent-146, score-1.775]
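
A sketch of Eq. (13) together with the single-word fallback just described; `r` is the word collocation table and `single_word_default` the fixed value set on the development set (names are illustrative):

```python
from itertools import combinations

def phrase_collocation_prob(phrase, r, single_word_default):
    """Eq. (13): average pairwise collocation probability over the n words
    of a phrase; single-word phrases get a fixed default value."""
    n = len(phrase)
    if n < 2:
        return single_word_default
    total = sum(r.get((a, b), r.get((b, a), 0.0))
                for a, b in combinations(phrase, 2))
    return 2.0 * total / (n * (n - 1))
```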

56 These collocation probabilities are incorporated into the phrase-based SMT system as features. [sent-147, score-0.892]

57 To train the collocation models, besides the monolingual parts of FBIS, we also employ some other larger Chinese and English monolingual corpora, namely, Chinese Gigaword (LDC2007T38), English Gigaword (LDC2007T07), UN corpus (LDC2004E12), Sinorama corpus (LDC2005T10), as shown in Table 1. [sent-150, score-1.059]

58 Using these corpora, we obtained three kinds of collocation models: CM-1: the training data is the additional monolingual corpora; CM-2: the training data is either side of the bilingual corpus; CM-3: the interpolation of CM-1 and CM-2. [sent-151, score-0.994]

59 Then word alignments in the subset were manually labeled, referring to the guideline of the Chinese-to-English alignment (LDC2006E93), but we made some modifications to the guideline. [sent-153, score-0.501]

60 There are several different evaluation metrics for word alignment (Ahrenberg et al. [sent-155, score-0.308]

61 We use precision (P), recall (R) and alignment error rate (AER), which are similar to those in Och and Ney (2000), except that we consider each alignment as a sure link. [sent-157, score-0.564]
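
With every gold link treated as sure, the three metrics reduce to the following; a sketch where alignments are represented as non-empty sets of index pairs:

```python
def alignment_metrics(predicted, gold):
    """Precision, recall and AER in the style of Och and Ney (2000); with
    all gold links treated as sure, AER equals 1 minus the F1 score."""
    correct = len(predicted & gold)
    precision = correct / len(predicted)
    recall = correct / len(gold)
    aer = 1.0 - 2.0 * correct / (len(predicted) + len(gold))
    return precision, recall, aer
```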

62 Example of the English-to-Chinese word alignments generated by the BWA method and the improved BWA method using CM-3. [sent-162, score-0.368]

63 "" denotes the alignments of our method; "" denotes the alignments of the baseline method. [sent-163, score-0.519]

64 By minimizing the AER on the development set, the interpolation coefficients of the collocation probabilities on CM-1 and CM-2 were set to 0. [sent-168, score-0.934]

65 2 Evaluation results One-directional alignment results To train a Chinese-to-English SMT system, we need to perform both Chinese-to-English and English-to-Chinese word alignment. [sent-176, score-0.326]

66 The evaluation results in Table 2 indicate that the performance of our methods on single-word alignments is close to that of the baseline method. [sent-179, score-0.297]

67 CM-3, the error rate of multi-word alignment results is further reduced. [sent-185, score-0.337]

68 Figure 2 shows an example of word alignment results generated by the baseline method and the improved method using CM-3. [sent-186, score-0.478]

69 In our collocation model, the collocation probability of "the people of the world" is much higher than that of "people world". [sent-188, score-1.56]

70 For example, in the baseline alignment "has made . [sent-190, score-0.303]

71 have 取得", "have" and "has" are unrelated to the target word, while our method only generated "made 取得"; this is because the collocation probabilities of "has/have" and "made" are much lower than that of the whole source sentence. [sent-193, score-1.022]

72 Bi-directional alignment results We build a bi-directional alignment baseline in two steps: (1) GIZA++ is used to obtain the source-to-target and target-to-source alignments; (2) the bi-directional alignments are generated by using "grow-diag-final". [sent-194, score-0.782]
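
For step (2), a compact sketch of the grow-diag-final heuristic in its commonly published form (Koehn et al., 2003); this is background for the baseline, not code from the paper:

```python
def grow_diag_final(s2t, t2s):
    """Symmetrize two one-directional word alignments (sets of (e, f)
    index pairs) with the grow-diag-final heuristic."""
    aligned = s2t & t2s                        # start from the intersection
    union = s2t | t2s
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]

    def admissible(p):
        # a union point may be added if its e-word or f-word is uncovered
        return (all(q[0] != p[0] for q in aligned)
                or all(q[1] != p[1] for q in aligned))

    changed = True
    while changed:                             # grow-diag step
        changed = False
        for e, f in sorted(aligned):
            for de, df in neighbors:
                p = (e + de, f + df)
                if p in union and p not in aligned and admissible(p):
                    aligned.add(p)
                    changed = True
    for p in sorted(union - aligned):          # final step
        if admissible(p):
            aligned.add(p)
    return aligned
```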

73 We evaluate three methods: WA-1: one-directional alignment method proposed in section 3. [sent-196, score-0.302]

74 1 and grow-diag-final; WA-2: GIZA++ and the bi-directional bilingual word alignments method proposed in section 3. [sent-197, score-0.382]

75 We can see that WA-1 achieves a lower alignment error rate as compared to the baseline method, since the performance of the improved one-directional alignment method is better than that of GIZA++. [sent-201, score-0.717]

76 This result indicates that improving one-directional word alignment leads to improved bi-directional word alignment. [sent-202, score-0.665]

77 This is because the proposed bi-directional alignment method can effectively recognize the correct alignments from the alignment union, by leveraging collocation probabilities of the words in the same cept. [sent-204, score-1.653]

78 Our method using both techniques proposed in section 3 produces the best alignment performance, achieving an 11% absolute error rate reduction. [sent-205, score-0.459]

79 2 Effect of improved word alignment on phrase-based SMT We investigate the effectiveness of the improved word alignments on the phrase-based SMT system. [sent-218, score-0.647]

80 Example of the translations generated by the baseline system and the system where the phrase collocation probabilities are added. [sent-224, score-1.036]

81 Here, we investigate three different collocation models for translation quality improvement. [sent-228, score-0.841]

82 From the results in Table 4, it can be seen that the systems using the improved bi-directional alignments achieve higher translation quality than the baseline system. [sent-230, score-0.347]

83 If the same alignment method is used, the systems using CM-3 achieved the highest BLEU scores. [sent-231, score-0.32]

84 And if the same collocation model is used, the systems using WA-3 achieved higher scores. [sent-232, score-0.762]

85 3 Effect of phrase collocation probabilities To investigate the effectiveness of the method proposed in section 4, we only use the collocation model CM-3 as described in section 5. [sent-235, score-1.799]

86 When the phrase collocation probabilities are incorporated into the SMT system, the translation quality is improved, achieving an absolute improvement of 0. [sent-238, score-1.166]

87 This result indicates that the collocation probabilities of phrases are useful in determining phrase boundaries and predicting whether phrases should be translated together, which helps to improve phrase-based SMT performance. [sent-240, score-1.01]

88 Figure 3 shows an example: T1 is generated by the system where the phrase collocation probabilities are used and T2 is generated by the baseline system. [sent-241, score-1.056]

89 In this example, since the collocation probability of "出 问题" is much higher than that of "问题 。", our method tends to split "出 问题 。" into "(出 问题) (。)" rather than "(出) (问题 。)". [sent-242, score-0.832]

90 For the phrase "才能 避免" in the source sentence, the collocation probability of the translation "in order to avoid" is higher than that of the translation "can we avoid". [sent-243, score-1.037]

91 Although the phrase "我们 必须 采取 措施" in the source sentence has the same translation "We must adopt effective measures", our method splits this phrase into two parts "我们 必须" and "采取 措施", because the two parts have higher collocation probabilities than the whole phrase. [sent-245, score-1.221]

92 We also investigate the performance of the system employing both the word alignment improvement and phrase table improvement methods. [sent-246, score-0.467]

93 The system using the improved word alignments achieves an absolute improvement of 1. [sent-263, score-0.349]

94 We first used the MWA method to identify potentially collocated words and estimate collocation probabilities only from monolingual corpora; no additional resources or linguistic preprocessing are needed. [sent-266, score-1.177]

95 Then the collocation information was employed to improve BWA for various kinds of SMT systems and to improve the phrase table for phrase-based SMT. [sent-267, score-0.916]

96 To improve BWA, we re-estimate the alignment probabilities by using the collocation probabilities of words in the same cept. [sent-268, score-1.338]

97 To improve the phrase table, we calculate phrase collocation probabilities based on word collocation probabilities. [sent-269, score-1.925]

98 Then the phrase collocation probabilities are used as additional features in phrase-based SMT systems. [sent-270, score-0.997]

99 The improved word alignment results in an improvement of 2. [sent-272, score-0.374]

100 When we also used phrase collocation probabilities as additional features, the phrase-based SMT performance was finally improved by 2. [sent-275, score-1.038]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('collocation', 0.743), ('alignment', 0.266), ('smt', 0.208), ('bwa', 0.202), ('alignments', 0.193), ('monolingual', 0.149), ('probabilities', 0.149), ('mwa', 0.128), ('bilingual', 0.111), ('fj', 0.104), ('bleu', 0.091), ('phrase', 0.087), ('wi', 0.077), ('aligned', 0.072), ('wj', 0.072), ('cept', 0.069), ('collocated', 0.064), ('ibm', 0.061), ('translation', 0.055), ('probability', 0.053), ('aj', 0.052), ('fertility', 0.052), ('fbis', 0.051), ('koehn', 0.049), ('absolute', 0.048), ('ihi', 0.048), ('denotes', 0.048), ('och', 0.046), ('collocations', 0.046), ('giza', 0.045), ('aer', 0.044), ('source', 0.044), ('liu', 0.044), ('calculate', 0.043), ('word', 0.042), ('interpolation', 0.042), ('improved', 0.041), ('rate', 0.039), ('achieving', 0.038), ('joshua', 0.037), ('ahrenberg', 0.037), ('cepts', 0.037), ('rferqe', 0.037), ('waj', 0.037), ('baseline', 0.037), ('method', 0.036), ('sg', 0.033), ('ei', 0.033), ('qw', 0.032), ('error', 0.032), ('improve', 0.031), ('unrelated', 0.03), ('interrupted', 0.029), ('sequence', 0.029), ('chris', 0.027), ('improving', 0.027), ('ki', 0.026), ('alexandra', 0.026), ('philipp', 0.026), ('improvement', 0.025), ('performances', 0.025), ('kinds', 0.024), ('franz', 0.024), ('china', 0.024), ('correlation', 0.024), ('ijcnlp', 0.024), ('xiong', 0.024), ('gs', 0.023), ('meeting', 0.023), ('annual', 0.023), ('score', 0.023), ('calculated', 0.022), ('sheng', 0.022), ('marton', 0.022), ('bidirectional', 0.022), ('distortion', 0.022), ('hermann', 0.022), ('statistical', 0.022), ('investigate', 0.022), ('weights', 0.022), ('josef', 0.021), ('people', 0.021), ('quality', 0.021), ('wf', 0.02), ('generated', 0.02), ('corpora', 0.02), ('sentence', 0.02), ('ney', 0.019), ('cherry', 0.019), ('sr', 0.019), ('pair', 0.019), ('kth', 0.019), ('brown', 0.019), ('model', 0.019), ('potentially', 0.018), ('train', 0.018), ('got', 0.018), ('additional', 0.018), ('mckeown', 0.018), ('consecutive', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Sheng Li

Abstract: This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving the phrase table for phrase-based SMT. The experimental results show that our method improves the performance of both word alignment and translation quality significantly. As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.

2 0.51920515 36 acl-2010-Automatic Collocation Suggestion in Academic Writing

Author: Jian-Cheng Wu ; Yu-Chia Chang ; Teruko Mitamura ; Jason S. Chang

Abstract: In recent years, collocation has been widely acknowledged as an essential characteristic to distinguish native speakers from non-native speakers. Research on academic writing has also shown that collocations are not only common but serve a particularly important discourse function within the academic community. In our study, we propose a machine learning approach to implementing an online collocation writing assistant. We use a data-driven classifier to provide collocation suggestions to improve word choices, based on the result of classification. The system generates and ranks suggestions to assist learners’ collocation usages in their academic writing with satisfactory results.

3 0.2715292 60 acl-2010-Collocation Extraction beyond the Independence Assumption

Author: Gerlof Bouma

Abstract: In this paper we start to explore two-part collocation extraction association measures that do not estimate expected probabilities on the basis of the independence assumption. We propose two new measures based upon the well-known measures of mutual information and pointwise mutual information. Expected probabilities are derived from automatically trained Aggregate Markov Models. On three collocation gold standards, we find the new association measures vary in their effectiveness.

4 0.25422844 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

Author: Joern Wuebker ; Arne Mauser ; Hermann Ney

Abstract: Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with overfitting. We describe a novel leaving-one-out approach to prevent overfitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task. In contrast to most previous work where phrase models were trained separately from other models used in translation, we include all components such as single word lexica and reordering models in training. Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU. As a side effect, the phrase table size is reduced by more than 80%.

5 0.24875855 133 acl-2010-Hierarchical Search for Word Alignment

Author: Jason Riesa ; Daniel Marcu

Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.

6 0.23757531 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

7 0.23645023 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

8 0.19153185 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

9 0.17949231 262 acl-2010-Word Alignment with Synonym Regularization

10 0.17704648 170 acl-2010-Letter-Phoneme Alignment: An Exploration

11 0.16472766 54 acl-2010-Boosting-Based System Combination for Machine Translation

12 0.15386452 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

13 0.14869633 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

14 0.14758372 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

15 0.12766424 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

16 0.12557429 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

17 0.12228099 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

18 0.12044434 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

19 0.11882523 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

20 0.11440309 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.23), (1, -0.336), (2, -0.078), (3, 0.022), (4, 0.147), (5, 0.105), (6, -0.226), (7, 0.041), (8, 0.116), (9, -0.048), (10, -0.006), (11, 0.055), (12, -0.085), (13, 0.143), (14, 0.233), (15, 0.029), (16, -0.04), (17, -0.24), (18, 0.134), (19, 0.37), (20, -0.158), (21, -0.013), (22, -0.26), (23, 0.098), (24, -0.009), (25, 0.06), (26, 0.032), (27, -0.063), (28, -0.012), (29, -0.101), (30, 0.037), (31, 0.041), (32, 0.009), (33, -0.014), (34, -0.03), (35, 0.037), (36, 0.01), (37, -0.043), (38, -0.013), (39, -0.028), (40, -0.002), (41, -0.019), (42, 0.025), (43, 0.011), (44, 0.024), (45, 0.013), (46, 0.004), (47, -0.007), (48, -0.031), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91578871 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Sheng Li

Abstract: This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving the phrase table for phrase-based SMT. The experimental results show that our method improves the performance of both word alignment and translation quality significantly. As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.

2 0.82079762 36 acl-2010-Automatic Collocation Suggestion in Academic Writing

Author: Jian-Cheng Wu ; Yu-Chia Chang ; Teruko Mitamura ; Jason S. Chang

Abstract: In recent years, collocation has been widely acknowledged as an essential characteristic to distinguish native speakers from non-native speakers. Research on academic writing has also shown that collocations are not only common but serve a particularly important discourse function within the academic community. In our study, we propose a machine learning approach to implementing an online collocation writing assistant. We use a data-driven classifier to provide collocation suggestions to improve word choices, based on the result of classification. The system generates and ranks suggestions to assist learners’ collocation usages in their academic writing with satisfactory results.

3 0.7602123 60 acl-2010-Collocation Extraction beyond the Independence Assumption

Author: Gerlof Bouma

Abstract: In this paper we start to explore two-part collocation extraction association measures that do not estimate expected probabilities on the basis of the independence assumption. We propose two new measures based upon the well-known measures of mutual information and pointwise mutual information. Expected probabilities are derived from automatically trained Aggregate Markov Models. On three collocation gold standards, we find the new association measures vary in their effectiveness.

4 0.50901949 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou

Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better translation performance.

5 0.48328522 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

Author: Xiangyu Duan ; Min Zhang ; Haizhou Li

Abstract: The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus. But word appears to be too fine-grained in some cases, such as non-compositional phrasal equivalences where no clear word alignments exist. Using words as inputs to the PB-SMT pipeline has an inborn deficiency. This paper proposes pseudo-word as a new start point for the PB-SMT pipeline. Pseudo-word is a kind of basic multi-word expression that characterizes a minimal sequence of consecutive words in the sense of translation. By casting the pseudo-word searching problem into a parsing framework, we search for pseudo-words in a monolingual way and a bilingual synchronous way. Experiments show that pseudo-word significantly outperforms word for the PB-SMT model in both the travel translation domain and the news translation domain.

6 0.48041767 262 acl-2010-Word Alignment with Synonym Regularization

7 0.46671686 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

8 0.45985663 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

9 0.45641127 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

10 0.44912738 133 acl-2010-Hierarchical Search for Word Alignment

11 0.42741874 170 acl-2010-Letter-Phoneme Alignment: An Exploration

12 0.41391644 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

13 0.36866286 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

14 0.36092874 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

15 0.34193999 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

16 0.33898038 54 acl-2010-Boosting-Based System Combination for Machine Translation

17 0.30939767 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

18 0.30578181 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

19 0.29973581 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

20 0.29489419 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.042), (25, 0.036), (59, 0.15), (73, 0.033), (83, 0.344), (84, 0.015), (98, 0.232)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97046363 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization

Author: Emily Pitler ; Annie Louis ; Ani Nenkova

Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization-specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.

2 0.97035122 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

Author: Jenny Rose Finkel ; Christopher D. Manning

Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.

3 0.95605761 73 acl-2010-Coreference Resolution with Reconcile

Author: Veselin Stoyanov ; Claire Cardie ; Nathan Gilbert ; Ellen Riloff ; David Buttler ; David Hysom

Abstract: Despite the existence of several noun phrase coreference resolution data sets as well as several formal evaluations on the task, it remains frustratingly difficult to compare results across different coreference resolution systems. This is due to the high cost of implementing a complete end-to-end coreference resolution system, which often forces researchers to substitute available gold-standard information in lieu of implementing a module that would compute that information. Unfortunately, this leads to inconsistent and often unrealistic evaluation scenarios. With the aim to facilitate consistent and realistic experimental evaluations in coreference resolution, we present Reconcile, an infrastructure for the development of learning-based noun phrase (NP) coreference resolution systems. Reconcile is designed to facilitate the rapid creation of coreference resolution systems, easy implementation of new feature sets and approaches to coreference resolution, and empirical evaluation of coreference resolvers across a variety of benchmark data sets and standard scoring metrics. We describe Reconcile and present experimental results showing that Reconcile can be used to create a coreference resolver that achieves performance comparable to state-of-the-art systems on six benchmark data sets.

4 0.95372933 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data

Author: Katrin Tomanek ; Udo Hahn ; Steffen Lohmann ; Jurgen Ziegler

Abstract: We report on an experiment to track complex decision points in linguistic metadata annotation where the decision behavior of annotators is observed with an eye-tracking device. As experimental conditions we investigate different forms of textual context and linguistic complexity classes relative to syntax and semantics. Our data renders evidence that annotation performance depends on the semantic and syntactic complexity of the decision points and, more interestingly, indicates that full-scale context is mostly negligible, with the exception of semantic high-complexity cases. We then induce from this observational data a cognitively grounded cost model of linguistic metadata annotations and compare it with existing non-cognitive models. Our data reveals that the cognitively founded model explains annotation costs (expressed in annotation time) more adequately than non-cognitive ones.

5 0.95305133 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information

Author: Marta Recasens ; Eduard Hovy

Abstract: This paper explores the effect that different corpus configurations have on the performance of a coreference resolution system, as measured by MUC, B3, and CEAF. By varying separately three parameters (language, annotation scheme, and preprocessing information) and applying the same coreference resolution system, the strong bonds between system and corpus are demonstrated. The experiments reveal problems in coreference resolution evaluation relating to task definition, coding schemes, and features. They also expose systematic biases in the coreference evaluation metrics. We show that system comparison is only possible when corpus parameters are in exact agreement.

same-paper 6 0.94959867 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

7 0.94937474 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

8 0.94307035 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

9 0.93087602 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

10 0.92261565 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

11 0.91836768 31 acl-2010-Annotation

12 0.91515249 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years

13 0.91166091 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

14 0.91081792 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information

15 0.91075993 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

16 0.90775192 195 acl-2010-Phylogenetic Grammar Induction

17 0.90632701 233 acl-2010-The Same-Head Heuristic for Coreference

18 0.90394485 197 acl-2010-Practical Very Large Scale CRFs

19 0.90173119 56 acl-2010-Bridging SMT and TM with Translation Recommendation

20 0.89987344 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking